ARTICLE
Communicated by Peter Dayan
Bayesian Computation in Recurrent Neural Circuits

Rajesh P. N. Rao
[email protected]
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, U.S.A.
A large number of human psychophysical results have been successfully explained in recent years using Bayesian models. However, the neural implementation of such models remains largely unclear. In this article, we show that a network architecture commonly used to model the cerebral cortex can implement Bayesian inference for an arbitrary hidden Markov model. We illustrate the approach using an orientation discrimination task and a visual motion detection task. In the case of orientation discrimination, we show that the model network can infer the posterior distribution over orientations and correctly estimate stimulus orientation in the presence of significant noise. In the case of motion detection, we show that the resulting model network exhibits direction selectivity and correctly computes the posterior probabilities over motion direction and position. When used to solve the well-known random dots motion discrimination task, the model generates responses that mimic the activities of evidence-accumulating neurons in cortical areas LIP and FEF. The framework we introduce posits a new interpretation of cortical activities in terms of log posterior probabilities of stimuli occurring in the natural world.
1 Introduction

Bayesian models of perception have proved extremely useful in explaining results from a number of human psychophysical experiments (see, e.g., Knill & Richards, 1996; Rao, Olshausen, & Lewicki, 2002). These psychophysical tasks range from inferring 3D structure from 2D images and judging depth from multiple cues to the perception of motion and color (Bloj, Kersten, & Hurlbert, 1999; Weiss, Simoncelli, & Adelson, 2002; Mamassian, Landy, & Maloney, 2002; Jacobs, 2002). The strength of the Bayesian approach lies in its ability to quantitatively model the interaction between prior knowledge and sensory evidence. Bayes' rule prescribes how prior probabilities and stimulus likelihoods should be combined, allowing the responses of subjects during psychophysical tasks to be interpreted in terms of the resulting posterior distributions.

Neural Computation 16, 1–38 (2004)
© 2003 Massachusetts Institute of Technology
Additional support for Bayesian models comes from recent neurophysiological and psychophysical studies on visual decision making. Carpenter and colleagues have shown that the reaction time distribution of human subjects making eye movements to one of two targets is well explained by a model computing the log-likelihood ratio of one target over the other (Carpenter & Williams, 1995). In another study, the saccadic response time distribution of monkeys could be predicted from the time taken by neural activity in area FEF to reach a fixed threshold (Hanes & Schall, 1996), suggesting that these neurons are involved in accumulating evidence (interpreted as log likelihoods) over time. Similar activity has also been reported in the primate area LIP (Shadlen & Newsome, 2001). A mathematical model based on log-likelihood ratios was found to be consistent with the observed neurophysiological results (Gold & Shadlen, 2001). Given the growing evidence for Bayesian models in perception and decision making, an important open question is how such models could be implemented within the recurrent networks of the primate brain. Previous models of probabilistic inference in neural circuits have been based on the concept of neural tuning curves or basis functions and have typically focused on the estimation of static quantities such as stimulus position or orientation (Anderson & Van Essen, 1994; Zemel, Dayan, & Pouget, 1998; Deneve, Latham, & Pouget, 1999; Pouget, Dayan, & Zemel, 2000). Other models have relied on mean-field approximations or various forms of Gibbs sampling for perceptual inference (Hinton & Sejnowski, 1986; Dayan, Hinton, Neal, & Zemel, 1995; Dayan & Hinton, 1996; Hinton & Ghahramani, 1997; Rao & Ballard, 1997, 1999; Rao, 1999; Hinton & Brown, 2002). In this article, we describe a new model of Bayesian computation in a recurrent neural circuit.
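The accumulation-to-threshold computation invoked by these studies can be sketched in a few lines. The following is a minimal illustration, not a model from the cited papers: the gaussian observation model, drift parameters, and threshold value are assumptions made purely for the example.

```python
import numpy as np

def accumulate_llr(samples, mu_a=1.0, mu_b=-1.0, sigma=2.0, threshold=5.0):
    """Sum log-likelihood ratios log[p(x|A)/p(x|B)] over noisy samples
    until a decision threshold is crossed (sequential evidence accumulation).
    Returns (+1 for A, -1 for B, 0 if undecided) and the number of steps taken."""
    llr = 0.0
    for t, x in enumerate(samples, start=1):
        # For equal-variance gaussians, the log-likelihood ratio of one
        # sample reduces to a term linear in x.
        llr += (mu_a - mu_b) * (x - (mu_a + mu_b) / 2.0) / sigma**2
        if llr >= threshold:
            return +1, t
        if llr <= -threshold:
            return -1, t
    return 0, len(samples)

rng = np.random.default_rng(0)
obs = rng.normal(1.0, 2.0, size=500)   # samples actually drawn from model A
choice, steps = accumulate_llr(obs)
```

Weaker effective drift (smaller separation between the two models, or larger noise) yields longer times to threshold, which is the qualitative reaction-time behavior these log-likelihood-ratio models are used to explain.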
We specify how the feedforward and recurrent connections may be selected to perform Bayesian inference for arbitrary hidden Markov models (see sections 2–4). The approach is illustrated using two tasks: discriminating the orientation of a noisy visual stimulus and detecting the direction of motion of moving stimuli. In the case of orientation discrimination, we show that the model network can infer the posterior distribution over orientations and correctly estimate stimulus orientation in the presence of considerable noise (see section 5.1). In the case of motion detection, we show that the resulting model network exhibits direction selectivity and correctly computes the posterior probabilities over motion direction and position (see section 5.2). We then introduce a decision-making framework based on computing log posterior ratios from the outputs of the motion detection network. We demonstrate that for the well-known random dots motion discrimination task (Shadlen & Newsome, 2001), the decision model produces responses that are qualitatively similar to the responses of "evidence-accumulating" neurons in primate brain areas LIP and FEF (see section 5.3). Some predictions of the proposed framework are explored in section 6. We conclude in sections 7 and 8 by discussing several possibilities for neural encoding of log probabilities and suggest a probabilistic framework for on-line learning of synaptic parameters.

2 Modeling Cortical Networks

We begin by considering a commonly used neural architecture for modeling cortical response properties: a recurrent network with firing-rate dynamics (see, e.g., Dayan & Abbott, 2001). Let I denote the vector of input firing rates to the network, and let v represent the output firing rates of the recurrently connected neurons in the network. Let W represent the feedforward synaptic weight matrix and M the recurrent weight matrix. The following equation describes the dynamics of the network:

τ dv/dt = −v + W I + M v,    (2.1)

where τ is a time constant. The equation can be written in a discrete form as follows:

v_i(t+1) = v_i(t) + ε(−v_i(t) + w_i I(t) + Σ_j M_ij v_j(t)),    (2.2)

where ε is the integration rate, v_i is the ith component of the vector v, w_i is the ith row of the matrix W, and M_ij is the element of M in the ith row and jth column. The above equation can be rewritten as

v_i(t+1) = ε w_i I(t) + Σ_j m_ij v_j(t),    (2.3)
where m_ij = ε M_ij for i ≠ j and m_ii = 1 + ε(M_ii − 1). How can Bayesian computation be achieved using the dynamics given by equation 2.3? We approach this problem by considering Bayesian inference in a simple hidden Markov model.

3 Bayesian Computation in a Cortical Network

Let θ_1, …, θ_N be the hidden states of a hidden Markov model. Let θ(t) represent the hidden state at time t, with transition probabilities denoted by P(θ(t) = θ_i | θ(t−1) = θ_j) for i, j = 1, …, N. Let I(t) be the observable output governed by the probabilities P(I(t) | θ(t)). Then the posterior probability of being in state θ_i at time t is given by Bayes' rule:

P(θ(t) = θ_i | I(t), …, I(1)) = k P(I(t) | θ(t) = θ_i) P(θ(t) = θ_i | I(t−1), …, I(1)),    (3.1)

where k is a normalization constant. The equation can be made recursive with respect to the posterior probabilities as follows:

P(θ_i^t | I(t), …, I(1)) = k P(I(t) | θ_i^t) Σ_j P(θ_i^t | θ_j^{t−1}) P(θ_j^{t−1} | I(t−1), …, I(1)),    (3.2)

where we have written "θ(t) = θ_i" as θ_i^t for notational convenience. This equation can be written in the log domain as

log P(θ_i^t | I(t), …, I(1)) = log P(I(t) | θ_i^t) + log k + log[Σ_j P(θ_i^t | θ_j^{t−1}) P(θ_j^{t−1} | I(t−1), …, I(1))]
(3.3)

A recurrent network governed by equation 2.3 can implement equation 3.3 if

v_i(t+1) = log P(θ_i^t | I(t), …, I(1))    (3.4)

ε w_i I(t) = log P(I(t) | θ_i^t)    (3.5)

Σ_j m_ij v_j(t) = log[Σ_j P(θ_i^t | θ_j^{t−1}) P(θ_j^{t−1} | I(t−1), …, I(1))],    (3.6)

with the normalization term log k being added after each update. In other words, the network activities are updated according to

v_i(t+1) = ε w_i I(t) + Σ_j m_ij v_j(t) − log Σ_j e^{u_j(t)},    (3.7)

where u_j(t) = ε w_j I(t) + Σ_k m_jk v_k(t). The negative log term, which implements divisive normalization in the log domain, can be interpreted as a form of global recurrent inhibition.¹

¹ If normalization is omitted, the network outputs can be interpreted as representing log joint probabilities, that is, v_i(t+1) = log P(θ_i^t, I(t), …, I(1)). However, we have found that the lack of normalization makes such a network prone to instabilities.

In equation 3.7, the log likelihoods can be computed using linear filters F(θ_i) (= ε w_i). For example, F(θ_i) could represent a spatially localized gaussian or an oriented Gabor filter when the inputs I(t) are images. Note that as formulated above, the updating of probabilities is directly tied to the time constant of neurons in the network, as specified by the parameter τ in
equation 2.1 (or ε in equation 2.2). However, it may be possible to achieve longer integration time constants by combining slow synapses (e.g., NMDA synapses) with relatively strong recurrent excitation (see, e.g., Seung, 1996; Wang, 2001). A particularly challenging problem is to pick recurrent weights m_ij such that equation 3.6 holds true (the alternative of learning such weights is addressed in section 8.1). For equation 3.6 to hold true, we need to approximate a log-sum with a sum-of-logs. We tackled this problem by generating a set of random probabilities x_j(t) for t = 1, …, T and finding a set of weights m_ij that satisfy

Σ_j m_ij log x_j(t) ≈ log[Σ_j P(θ_i^t | θ_j^{t−1}) x_j(t)]    (3.8)

for all i and t (as an aside, note that a similar approximation can be used to compute a set of recurrent inhibitory weights for implementing the negative log term in equation 3.7). We examine this sum-of-logs approximation in more detail in the next section.

4 Approximating Transition Probabilities Using Recurrent Weights

A set of recurrent weights m_ij can be obtained for any given set of transition probabilities P(θ_i^t | θ_j^{t−1}) by using the standard pseudoinverse method. Let A represent the matrix of recurrent weights, that is, A_ij = m_ij, and let L represent a matrix of log probabilities, the jth element in the tth column representing log x_j(t) for t = 1, …, T. Let B represent the matrix of log sums, that is, B_it = log[Σ_j P(θ_i^t | θ_j^{t−1}) x_j(t)]. To minimize the squared error in equation 3.8 with respect to the recurrent weights m_ij (= A_ij), we need to solve the equation

A L = B.    (4.1)

Multiplying both sides by the pseudoinverse of L, we obtain the following expression for the recurrent weight matrix:

A = B L^T (L L^T)^{-1}.    (4.2)
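The construction in equations 4.1 and 4.2 can be sketched numerically as follows; the network size, the number of training samples, and the dense random transition matrix are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 40, 500                       # number of neurons, training samples

# Random transition matrix P[i, j] = P(theta_i^t | theta_j^{t-1});
# each column is normalized so every predecessor state transitions somewhere.
P = rng.random((N, N))
P /= P.sum(axis=0, keepdims=True)

# Training probabilities x_j(t), kept away from zero so log x_j(t) is defined.
X = rng.random((N, T)) + 0.01
X /= X.sum(axis=0, keepdims=True)

L = np.log(X)                        # L[j, t] = log x_j(t)
B = np.log(P @ X)                    # B[i, t] = log sum_j P(theta_i | theta_j) x_j(t)

# Least-squares solution of A L = B (equation 4.2): A = B L^T (L L^T)^(-1)
A = B @ L.T @ np.linalg.inv(L @ L.T)

# Check the sum-of-logs approximation (equation 3.8) on held-out data.
X_test = rng.random((N, 100)) + 0.01
X_test /= X_test.sum(axis=0, keepdims=True)
err = np.abs(A @ np.log(X_test) - np.log(P @ X_test)).mean()
```

In practice, np.linalg.lstsq or np.linalg.pinv is a numerically safer route to the same least-squares solution than forming the inverse explicitly.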
We first tested this approximation as a function of the transition probabilities P(θ_i^t | θ_j^{t−1}) for a fixed set of uniformly random probabilities x_j(t). The matrix of transition probabilities was chosen to be a sparse random matrix. The degree of sparseness was varied by choosing different values for the density of the matrix, defined as the fraction of nonzero elements. Approximation accuracy was measured in terms of the absolute value of the error between the left and right sides of equation 3.8, averaged over both time
Figure 1: Approximation accuracy for different classes of transition probability matrices. Each data point represents the average approximation error from equation 3.8 for a particular number of neurons in the recurrent network for a fixed set of conditional probabilities x_j(t) = P(θ_j^{t−1} | I(t−1), …, I(1)). Each curve represents the errors for a particular density d of the random transition probability matrix (d is the fraction of nonzero entries in the matrix).
and neurons. As shown in Figure 1, for any fixed degree of sparseness in the transition probability matrix, the average error decreases as a function of the number of neurons in the recurrent network. Furthermore, the overall errors also systematically decrease as the density of the random transition probability matrix is increased (different curves in the graph). The approximation failed in only one case, when the density was 0.1 and the number of neurons was 25, due to the occurrence of a log 0 term in B (see discussion below). We also tested the accuracy of equation 3.8 as a function of the conditional probabilities x_j(t) = P(θ_j^{t−1} | I(t−1), …, I(1)) for a fixed set of uniformly random transition probabilities P(θ_i^t | θ_j^{t−1}). The matrix of conditional probabilities was chosen to be a sparse random matrix (rows denoting indices of neurons and columns representing time steps). The degree of sparseness was varied as before by choosing different values for the density of the matrix. It is clear that the approximation fails if any of the x_j(t) equals exactly 0, since the corresponding entry log x_j(t) in L will be undefined. We therefore added a small arbitrary positive value to all entries of the random matrix before renormalizing it. The smallest value in this normalized matrix may be regarded as the smallest probability that can be represented by neurons in the network. This ensures that x_j(t) is never zero while allowing us to test
Figure 2: Approximation accuracy for different sets of conditional probabilities P(θ_j^{t−1} | I(t−1), …, I(1)) (= x_j(t)). The data points represent the average approximation error from equation 3.8 as a function of the number of neurons in the network for a fixed set of transition probabilities. Each curve represents the errors for a particular density d of the random conditional probability matrix (see the text for details).

the stability of the approximation for small values of x_j(t). We define the density of the random matrix to be the fraction of entries that are different from the smallest value obtained after renormalization. As shown in Figure 2, for any fixed degree of sparseness of the conditional probability matrix, the average error decreases as the number of neurons is increased. The error curves also show a gradual decrease as the density of the random conditional probability matrix is increased (different curves in the graph). The approximation failed in two cases: when the number of neurons was 75 or fewer and the density was 0.1, and when the number of neurons was 25 and the density was 0.25. The approximation error was consistently below 0.1 for all densities above 0.5 and numbers of neurons greater than 100. In summary, the approximation of transition probabilities using recurrent weights, based on approximating a log sum with a sum of logs, was found to be remarkably robust to a range of choices for both the transition probabilities and the conditional probabilities P(θ_j^{t−1} | I(t−1), …, I(1)). The only cases where the approximation failed were when two conditions both held true: (1) the number of neurons used was very small (25 to 75), and (2) a majority of the transition probabilities were equal to zero or a majority of the conditional probabilities were near zero. The first of these conditions is not a major concern given the typical number of neurons in cortical circuits. For the second condition, we need only assume that the conditional probabilities can take on values between a small positive number ε and 1, where log ε can be related to the neuron's background firing rate. This seems a reasonable assumption given that cortical neurons in vivo typically have a nonzero background firing rate. Finally, it should be noted that the approximation above is based on assuming a simple linear recurrent model of a cortical network. The addition of a more realistic nonlinear recurrent function in equation 2.1 can be expected to yield more flexible and accurate ways of approximating the log sum in equation 3.6.

5 Results

5.1 Example 1: Estimating Orientation from Noisy Images. To illustrate the approach, we first present results from a simple visual discrimination task involving inference of stimulus orientation from a sequence of noisy images containing an oriented bar. The underlying hidden Markov model uses a set of discrete states θ_1, θ_2, …, θ_M that sample the space of possible orientations of an elongated bar in the external world. During each trial, the state is assumed to be fixed:

P(θ_i^t | θ_j^{t−1}) = 1 if i = j    (5.1)
                     = 0 otherwise.    (5.2)

Finally, the likelihoods are given by the product of the image vector I(t) with a set of oriented filters F(θ_i) (see Figure 3A) corresponding to the orientations represented by the states θ_1, θ_2, …, θ_M:

log P(I(t) | θ_i^t) = ε w_i I(t) = F(θ_i) I(t).    (5.3)
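Equation 5.3 reduces the likelihood computation to a dot product between each oriented filter and the current image. The following sketch assumes a particular image size, a crude bar-shaped filter, and a noise level; the actual filters of Figure 3A are not specified here in closed form.

```python
import numpy as np

def oriented_bar(angle_deg, size=15, thickness=1.0):
    """A crude oriented-bar filter: positive along a line through the center
    at the given angle, slightly negative elsewhere (assumed shape)."""
    c = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    a = np.deg2rad(angle_deg)
    # perpendicular distance of each pixel from the line through the center
    d = np.abs(-(xs - c) * np.sin(a) + (ys - c) * np.cos(a))
    f = np.where(d <= thickness, 1.0, -0.1)
    return f / np.linalg.norm(f)

angles = np.arange(0, 180, 10)
filters = np.stack([oriented_bar(a).ravel() for a in angles])  # rows = F(theta_i)

# One noisy input frame: a bar at 40 degrees plus gaussian pixel noise.
rng = np.random.default_rng(2)
image = oriented_bar(40).ravel() + rng.normal(0.0, 0.5, size=filters.shape[1])

log_lik = filters @ image              # equation 5.3: one value per orientation
estimate = angles[np.argmax(log_lik)]  # single-frame maximum-likelihood guess
```

A single frame can easily be misclassified at this noise level; it is the recurrent accumulation across frames, described next, that stabilizes the estimate over time.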
For each trial, the sequence of input images was generated by adding zero-mean gaussian white noise of fixed variance to an image containing a bright oriented bar against a dark background (see Figure 3B). The goal of the network was to compute the posterior probabilities of each orientation given an input sequence of noisy images containing a single oriented bar. Clearly, the likelihood values, given by equation 5.3, can vary significantly over time (see Figure 4, middle row), the amount of variation being determined by the variance of the additive gaussian noise. The recurrent activity of the network, however, allows a stable estimate of the underlying stimulus orientation to be computed across time (see Figure 4, bottom row). This estimate can be obtained using the maximum a posteriori (MAP) method: the estimated orientation at any time is given by the preferred orientation of the neuron with the maximum activity, that is, the orientation with the maximum posterior probability. We tested the MAP estimation ability of the network as a function of noise variance for a stimulus set containing 36 orientations (0 to 180 degrees in 10 degree steps). For these
Figure 3: Estimating stimulus orientation from noisy images. (A) Twelve of the 36 oriented filters used in the recurrent network for estimating stimulus orientation. These filters represent the feedforward connections F(θ_i). Bright pixels denote positive (excitatory) values and dark pixels denote negative (inhibitory) values. (B) Example of a noisy input image sequence containing a bright oriented bar embedded in additive zero-mean gaussian white noise with a standard deviation of 1.5.
Figure 4: Example of noisy orientation estimation. (Top) Example images at different time steps from a noisy input sequence containing an oriented bar. (Middle) Feedforward inputs representing log-likelihood values for the inputs shown above. Note the wide variation in the peak of the activities. (Bottom) Output activities shown as posterior probabilities of the 36 orientations (0 to 180 degrees in 10 degree steps) encoded by the neurons in the network. Note the stable peak in activity achieved through recurrent accumulation of evidence over time despite the wide variation in the input likelihoods.
Figure 5: Orientation estimation accuracy as a function of noise level. The data points show the classification rates obtained for a set of 36 oriented bar stimuli embedded in additive zero-mean gaussian white noise with standard deviation values as shown on the x-axis. The dotted line shows the chance classification rate (1/36 × 100) obtained by the strategy of randomly guessing an orientation.
experiments, the network was run for a fixed number of time steps (250), and the MAP estimate was computed from the final set of activities. As shown in Figure 5, the classification rate of the network using the MAP method is high (above 80%) for zero-mean gaussian noise with standard deviation less than 1 and remains above chance (1/36) for standard deviations up to 4 (i.e., noise variances up to 16).

5.2 Example 2: Detecting Visual Motion. To illustrate more clearly the dynamical properties of the model, we examine the application of the proposed approach to the problem of visual motion detection. A prominent property of visual cortical cells in areas such as V1 and MT is selectivity to the direction of visual motion. In this section, we interpret the activity of such cells as representing the log posterior probability of stimulus motion in a particular direction at a spatial location, given a series of input images. For simplicity, we focus on the case of 1D motion in an image consisting of X pixels with two possible motion directions: leftward (L) or rightward (R). Let the state θ_ij represent motion at spatial location i in direction j ∈ {L, R}. Consider a network of N neurons, each representing a particular state θ_ij (see Figure 6A). The feedforward weights are gaussians, that is, F(θ_iR) = F(θ_iL) = F(θ_i) = a gaussian centered at location i with standard deviation σ. Figure 6B depicts the feedforward weights for a network of 30 neurons: 15 encoding leftward and 15 encoding rightward motion.
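The feedforward weight matrix just described can be sketched directly; the 15-position chains follow the text, while the value of σ and the row normalization are assumptions.

```python
import numpy as np

N_POS, WIDTH, SIGMA = 15, 15, 1.0   # positions per chain, image width; sigma assumed

def gaussian_row(center, width=WIDTH, sigma=SIGMA):
    """One feedforward weight vector F(theta_i): a gaussian over pixel
    positions centered at the neuron's preferred location."""
    x = np.arange(width)
    g = np.exp(-(x - center) ** 2 / (2 * sigma ** 2))
    return g / g.sum()

# F(theta_i): one gaussian row per spatial position.
F = np.stack([gaussian_row(i) for i in range(N_POS)])

# Identical feedforward weights for the rightward and leftward chains,
# stacked into a single 30 x 15 matrix W.
W = np.vstack([F, F])
```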
Figure 6: Recurrent network for motion detection. (A) A recurrent network of neurons, shown for clarity as two chains selective for leftward and rightward motion, respectively. The feedforward synaptic weights for neuron i (in the leftward or rightward chain) are determined by F(θ_i). The recurrent weights reflect the transition probabilities P(θ_iR | θ_kR) and P(θ_iL | θ_jL). (B) Feedforward weights F(θ_i) for neurons i = 1, …, 15 (rightward chain). The feedforward weights for neurons i = 16, …, 30 (leftward chain) are identical. (C) Transition probabilities P(θ(t) = θ_i | θ(t−1) = θ_j) for i, j = 1, …, 30. Probability values are proportional to pixel brightness. (D) Recurrent weights m_ij computed from the transition probabilities in C using equation 4.2.
To detect motion, the transition probabilities P(θ_ij | θ_kl) must be selected to reflect both the direction of motion and the speed of the moving stimulus. For this study, the transition probabilities for rightward motion from the state θ_kR (i.e., P(θ_iR | θ_kR)) were set according to a gaussian centered at location k + x, where x is a parameter determined by stimulus speed. The transition probabilities for leftward motion from the state θ_kL were set to gaussian values centered at k − x. The transition probabilities from states near the
Figure 7: Approximating transition probabilities with recurrent weights. (A) A comparison of the right and left sides of equation 3.8 for random test data x_j(t) (different from the training x_j(t)) using the transition probabilities and weights shown in Figures 6C and 6D, respectively. The solid line represents the right side of equation 3.8 and the dotted line the left side. Only the results for a single neuron (i = 8) are shown, but similar results were obtained for the rest of the neurons. (B) The approximation error (the difference between the two sides of equation 3.8) over time.
two boundaries (i = 1 and i = X) were chosen to be uniformly random values. Figure 6C shows the matrix of transition probabilities. Given these transition probabilities, the recurrent weights m_ij can be computed using equation 4.2 (see Figure 6D). Figure 7 shows that for the given set of transition probabilities, the log-sum in equation 3.8 can indeed be approximated quite accurately by a sum-of-logs using the weights m_ij.

5.2.1 Bayesian Estimation of Visual Motion. Figure 8 shows the output of the network in the middle of a sequence of input images depicting a bar moving either leftward or rightward. As shown in the figure, for a leftward-moving bar at a particular location i, the highest network output is for the neuron representing location i and direction L, while for a rightward-moving bar, the neuron representing location i and direction R has the highest output.

Figure 8: Network output for a moving stimulus. (Left) The four plots depict, respectively, the log likelihoods, log posteriors, neural firing rates, and posterior probabilities observed in the network for a rightward-moving bar when it arrives at the central image location. Note that the log likelihoods are the same for the rightward- and leftward-selective neurons (the first 15 and last 15 neurons, respectively, as dictated by the feedforward weights in Figure 6B), but the outputs of these neurons correctly reflect the direction of motion as a result of recurrent interactions. (Right) The same four plots for a leftward-moving bar as it reaches the central location.

The output firing rates were computed from the log posteriors v_i using a simple linear encoding model: f_i = [c·v_i + m]⁺, where c is a positive constant (= 12 for this plot), m is the maximum firing rate of the neuron (= 100 in this example), and ⁺ denotes rectification (see section 7.1 for more details). Note that although the log likelihoods are the same for leftward- and rightward-moving inputs, the asymmetric recurrent weights (which represent the transition probabilities) allow the network to distinguish between leftward- and rightward-moving stimuli. The plot of posteriors in Figure 8 shows that the network correctly computes posterior probabilities close to 1 for the states θ_iL and θ_iR for leftward and rightward motion, respectively, at location i.

5.2.2 Encoding Multiple Stimuli. An important question that needs to be addressed is whether the network can encode uncertainties associated with multiple stimuli occurring in the same image. For example, can the motion detection network represent two or more bars moving simultaneously in the image? We begin by noting that the definition of multiple stimuli is closely related to the notion of state. Recall that our underlying probabilistic model (the hidden Markov model) assumes that the observed world can be in only one of M different states. Thus, any input containing "multiple stimuli" (such as two moving bars) will be interpreted in terms of probabilities for each of the M states (a single moving bar at each location).
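The linear rate-encoding model above can be written down directly from its definition, using the constants quoted in the text (c = 12, m = 100):

```python
import numpy as np

def firing_rate(log_posterior, c=12.0, m=100.0):
    """Linear rate encoding of log posteriors: f_i = [c * v_i + m]_+.
    Since each v_i is the log of a probability (<= 0), rates lie in [0, m]."""
    return np.maximum(c * np.asarray(log_posterior) + m, 0.0)

v = np.log([0.7, 0.2, 0.1])   # example log posterior over three states
rates = firing_rate(v)        # monotone in the posterior, capped at m
```

Because the encoding is monotone, the neuron with the highest rate is also the state with the highest posterior probability, so MAP readout from rates is straightforward.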
Figure 9: Encoding multiple motions: Same-direction motion. (A) Six images from an input sequence depicting two bright bars, both moving rightward. (B) The panels from top to bottom on the left, continued on the right, show the activities (left) and posterior probabilities (right) for the image sequence in A in a motion detection network of 60 neurons: the first 30 encoding rightward motion and the next 30 encoding leftward motion. Note the two large peaks in activity among the right-selective neurons, indicating the presence of two rightward-moving bars. The corresponding posterior probabilities remain close to 0.5 until only one moving bar remains in the image, with a corresponding single peak of probability close to 1.

We tested the ability of the motion detection network to encode multiple stimuli by examining its responses to an image sequence containing two simultaneously moving bars under two conditions: (1) both bars moving in the same direction and (2) the bars moving in opposite directions. The results are shown in Figures 9 and 10, respectively. As seen in these figures, the network is able to represent the two simultaneously moving stimuli under both conditions. Note the two peaks in activity representing the location and direction of motion of the two moving bars. As expected from the underlying probabilistic model, the network computes a posterior probability close to 0.5 for each of the two peaks, although the exact values fluctuate depending on the locations of the two bars. The network thus interprets
Figure 10: Encoding multiple motions: Motion in opposite directions. (A) Six images from an input sequence depicting two bright bars moving in opposite directions. (B) The panels show the activities in the motion detection network (left) and posterior probabilities (right) for the image sequence in A. Note the two large peaks in activity moving in opposite directions, indicating the presence of both a leftward- and a rightward-moving bar. The corresponding posterior probabilities remain close to 0.5 throughout the sequence (after an initial transient).

the two simultaneously moving stimuli as providing evidence for two of the states it can represent, each state representing motion at a particular location and direction. Such a representation could be fed to a higher-level network, whose states could represent frequently occurring combinations of lower-level states, in a manner reminiscent of the hierarchical organization of sensory and motor cortical areas (Felleman & Van Essen, 1991). We discuss possible hierarchical extensions of the model in section 8.2.

5.2.3 Encoding Multiple Velocities. The network described is designed to estimate the posterior probability of a stimulus moving at a fixed velocity, as given by the transition probability matrix (see Figure 6C). How does the network respond to velocities other than the ones encoded explicitly in the transition probability matrix?
We investigated the question of velocity encoding by using two separate networks based on two transition probability matrices. The recurrent connections in the first network encoded a speed of 1 pixel per time step (see Figure 11A), while those in the second network encoded a speed of 4 pixels per time step (see Figure 11C). Each network was exposed to velocities ranging from 0.5 to 5 pixels per time step in increments of 0.5. The resulting responses of a neuron located in the middle of each network are shown in Figures 11B and 11D, respectively. The bars depict the peak response of the neuron during stimulus motion at a particular velocity in the neuron's preferred and nonpreferred (null) directions. The response is plotted as a posterior probability for convenience (see equation 3.4). As evident from the figure, the neurons exhibit relatively broad tuning curves centered on their respective preferred velocities (1 and 4 pixels per time step, respectively). These results suggest that the model can encode multiple velocities using a tuning curve even though the recurrent connections are optimized for a single velocity. In other words, the posterior distribution over velocities can be represented by a fixed number of networks optimized for a fixed set of preferred velocities, in much the same way as a fixed number of neurons can sample the continuous space of orientation, position, or any other continuous state variable. In addition, as discussed in section 8.1, synaptic learning rules could be used to allow the networks to encode the most frequently occurring stimulus velocities, thereby allowing the state space of velocities to be tiled in an adaptive manner.

5.3 Example 3: Bayesian Decision Making in a Random Dots Task. To establish a connection to behavioral data, we consider the well-known random dots motion discrimination task (see, e.g., Shadlen & Newsome, 2001).
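The two transition matrices compared here follow the gaussian construction of section 5.2. In this sketch, σ = 1 is an assumed value, and the image boundaries are handled by simply renormalizing the truncated gaussians rather than by the uniformly random boundary values used in the text.

```python
import numpy as np

def transition_matrix(n=15, speed=1, sigma=1.0):
    """Transition probabilities for one direction-selective chain:
    column k gives P(position i at t | position k at t-1), a gaussian
    centered at k + speed; each column is renormalized to sum to 1."""
    idx = np.arange(n)
    P = np.exp(-(idx[:, None] - (idx[None, :] + speed)) ** 2 / (2 * sigma ** 2))
    return P / P.sum(axis=0, keepdims=True)

P_right = transition_matrix(speed=1)    # network tuned to 1 pixel/time step
P_left = transition_matrix(speed=-1)    # leftward chain of the same network
P_fast = transition_matrix(speed=4)     # second network tuned to 4 pixels/step
```

A stimulus moving at some other speed still partially matches these gaussian transitions, which is the origin of the broad velocity tuning seen in Figures 11B and 11D.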
The stimulus consists of an image sequence showing a group of moving dots, a fixed fraction of which are randomly selected at each frame and moved in a fixed direction (e.g., either left or right). The rest of the dots are moved in random directions. The fraction of dots moving in the same direction is called the coherence of the stimulus. Figure 12A depicts the stimulus for two different levels of coherence. The task is to decide the direction of motion of the coherently moving dots for a given input sequence. A wealth of data exists on the psychophysical performance of humans and monkeys, as well as on the neural responses in brain areas such as MT and LIP in monkeys performing the task (see Shadlen & Newsome, 2001, and references therein). Our goal is to explore the extent to which the proposed model can explain the existing data for this task and make testable predictions. The motion detection network in the previous section can be used to decide the direction of coherent motion by computing the posterior probabilities for leftward and rightward motion, given the input images. These probabilities can be computed by marginalizing the posterior distribution
Bayesian Computation in Recurrent Neural Circuits
17
Figure 11: Encoding multiple velocities. (A) Transition probability matrix that encodes a stimulus moving at a speed of 1 pixel per time step in either direction (see also Figure 6C). (B) Output of a neuron located in the middle of the network as a function of stimulus velocity. The peak outputs of the neuron during stimulus motion in both the neuron’s preferred (“Pref”) direction and the null direction are shown as posterior probabilities. Note the relatively broad tuning curve indicating the neuron’s ability to encode multiple velocities. (C) Transition probability matrix that encodes a stimulus moving at a speed of 4 pixels per time step. (D) Peak outputs of the same neuron as in B (but with recurrent weights approximating the transition probabilities in C). Once again, the neuron exhibits a tuning curve (now centered around 4 pixels per time step) when tested on multiple velocities. Note the shift in velocity sensitivity due to the change in the transition probabilities. A small number of such networks can compute a discrete posterior distribution over the space of stimulus velocities.
computed by the neurons for leftward (L) and rightward (R) motion over all spatial positions i:

$$P(L \mid I(t), \ldots, I(1)) = \sum_i P(\theta_i^L \mid I(t), \ldots, I(1)) \quad (5.4)$$

$$P(R \mid I(t), \ldots, I(1)) = \sum_i P(\theta_i^R \mid I(t), \ldots, I(1)). \quad (5.5)$$
Note that the log of these marginal probabilities can also be computed directly from the log posteriors (i.e., the outputs of the neurons) using the log-sum approximation method described in sections 3 and 4. The log posterior ratio r(t) of leftward over rightward motion can be used to decide the direction of motion:

$$r(t) = \log P(L \mid I(t), \ldots, I(1)) - \log P(R \mid I(t), \ldots, I(1)) \quad (5.6)$$

$$= \log \frac{P(L \mid I(t), \ldots, I(1))}{P(R \mid I(t), \ldots, I(1))}. \quad (5.7)$$
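These marginal and ratio computations can be sketched in a few lines, working directly with the neurons' outputs (log posteriors). The function names, array layout, and use of NumPy are illustrative assumptions, not part of the model specification:

```python
import numpy as np

def log_marginal(log_post):
    """Log of a marginal probability from per-position log posteriors
    (cf. equations 5.4-5.5), computed stably via log-sum-exp."""
    m = np.max(log_post)
    return m + np.log(np.sum(np.exp(log_post - m)))

def log_posterior_ratio(log_post_L, log_post_R):
    """Log posterior ratio r(t) of leftward over rightward motion
    (equation 5.6), computed from the two chains' log posteriors."""
    return log_marginal(log_post_L) - log_marginal(log_post_R)
```

Subtracting the maximum before exponentiating avoids underflow for very negative log posteriors, in the spirit of the log-sum approximation mentioned above.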
If r(t) > 0, the evidence seen so far favors leftward motion, and vice versa for r(t) < 0. The instantaneous ratio r(t) is susceptible to rapid fluctuations due to the noisy stimulus. We therefore use the following decision variable d_L(t) to track the running average of the log posterior ratio of L over R:

$$d_L(t+1) = d_L(t) + \alpha (r(t) - d_L(t)), \quad (5.8)$$

and likewise for d_R(t) (the parameter α is between 0 and 1). We assume that the decision variables are computed by a separate set of "decision neurons" that receive inputs from the motion detection network. These neurons are once again leaky-integrator neurons as described by equation 5.8, with the driving inputs r(t) determined by inhibition between the summed inputs from the two chains in the motion detection network (as in equation 5.6). The output of the model is L if d_L(t) > c and R if d_R(t) > c, where c is a "confidence threshold" that depends on task constraints (e.g., accuracy versus speed requirements; Reddi & Carpenter, 2000).

Figure 12: Facing page. Output of decision neurons in the model. (A) Depiction of the random dots task. Two different levels of motion coherence (50% and 100%) are shown. A 1D version of this stimulus was used in the model simulations. (B, C) Outputs d_L(t) and d_R(t) of model decision neurons for two different directions of motion. The decision threshold is shown as a horizontal line (labeled c in B). (D) Outputs of decision neurons for three different levels of motion coherence. Note the increase in the rate of evidence accumulation at higher coherencies. For a fixed decision threshold, the model predicts faster reaction times for higher coherencies (dotted arrows). (E) Activity of a neuron in area FEF for a monkey performing an eye movement task (from Schall & Thompson, 1999, with permission, from the Annual Review of Neuroscience, Volume 22, copyright 1999 by Annual Reviews, www.annualreviews.org). Faster reaction times were associated with a more rapid rise to a fixed threshold (see the three different neural activity profiles). The arrows point to the initiation of eye movements, which are depicted at the top.
(F) Averaged firing rate over time of 54 neurons in area LIP during the random dots task, plotted as a function of motion coherence (from Roitman & Shadlen, 2002, with permission, from the Journal of Neuroscience, Volume 22, copyright 2002 by The Society for Neuroscience). Solid and dashed curves represent trials in which the monkey judged motion direction toward and away from the receptive field of a given neuron, respectively. The slope of the response is affected by motion coherence (compare, e.g., responses for 51.2% and 25.6%) in a manner similar to the model responses shown in D.
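The decision stage described above (equation 5.8 plus a threshold test) can be sketched as follows; the parameter values alpha and c are illustrative, not fitted to data:

```python
def decide(r_seq, alpha=0.2, c=1.0):
    """Run the decision variables d_L, d_R (equation 5.8) on a sequence
    of log posterior ratios r(t). Returns ('L' or 'R', time step) when a
    variable crosses the confidence threshold c, or (None, len(r_seq))
    if neither crosses."""
    dL = dR = 0.0
    for t, r in enumerate(r_seq):
        dL += alpha * (r - dL)    # running average of log ratio of L over R
        dR += alpha * (-r - dR)   # likewise for R over L
        if dL > c:
            return 'L', t
        if dR > c:
            return 'R', t
    return None, len(r_seq)
```

A larger |r(t)| (i.e., higher stimulus coherence) drives the decision variables to threshold sooner, which is the faster rise and shorter reaction time seen in Figure 12D.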
Figures 12B and 12C show the responses of the two decision neurons over time for two different directions of motion and two levels of coherence. Besides correctly computing the direction of coherent motion in each case, the model also responds faster when the stimulus has higher coherence. This phenomenon can be appreciated more clearly in Figure 12D, which predicts progressively shorter reaction times for increasingly coherent stimuli (dotted arrows).

5.3.1 Comparison to Neurophysiological Data. The relationship between faster rates of evidence accumulation and shorter reaction times has received experimental support from a number of studies. Figure 12E shows the activity of a neuron in the frontal eye fields (FEF) for fast, medium, and slow responses to a visual target (Schall & Hanes, 1998; Schall & Thompson, 1999). Schall and collaborators have shown that the distribution of monkey response times can be reproduced using the time taken by neural activity in FEF to reach a fixed threshold (Hanes & Schall, 1996). A similar rise-to-threshold model by Carpenter and colleagues has received strong support in human psychophysical experiments that manipulate the prior probabilities of targets (Carpenter & Williams, 1995) and the urgency of the task (Reddi & Carpenter, 2000) (see section 6.2). In the case of the random dots task, Shadlen and collaborators have shown that in primates, one of the cortical areas involved in making the decision regarding coherent motion direction is area LIP. The activities of many neurons in this area progressively increase during the motion viewing period, with faster rates of rise for more coherent stimuli (see Figure 12F) (Roitman & Shadlen, 2002).
This behavior is similar to the responses of "decision neurons" in the model (see Figures 12B–D), suggesting that the outputs of the recorded LIP neurons could be interpreted as representing the log posterior ratio of one task alternative over another (see Carpenter & Williams, 1995; Gold & Shadlen, 2001, for related suggestions).

5.3.2 Performance of the Network. We examined the performance of the network as a function of motion coherence for different values of the decision threshold. As shown in Figure 13, the results obtained show two trends that are qualitatively similar to human and monkey psychophysical performance on this task. First, for any fixed decision threshold T, the accuracy rate (% correct responses) increases from chance levels (50% correct) to progressively higher values (up to 100% correct for higher thresholds) as the motion coherence is increased from 0 to 1. These graphs are qualitatively similar to psychometric performance curves obtained in human and monkey experiments (see, e.g., Britten, Shadlen, Newsome, & Movshon, 1992). We did not attempt to quantitatively fit the model curves to psychophysical data because the model currently operates only on 1D images. A second trend is observed when the decision threshold T is varied. As the threshold is increased, the accuracy also increases for each coherence value, revealing a set of curves that asymptote around 95% to 100% accuracy for large enough thresholds. However, as expected, larger thresholds also result in longer reaction times, suggesting a correlate in the model for the type of speed-accuracy trade-offs typically seen in psychophysical discrimination and search experiments. The dependence of accuracy on decision threshold suggests a model for handling task urgency: faster decisions can be made by lowering the decision threshold, albeit at the expense of some accuracy (as illustrated in Figure 13). Predictions of such a model for incorporating urgency in the decision-making process are explored in section 6.2.

Figure 13: Performance of the model network as a function of motion coherence. The three curves show the accuracy rate (% correct responses) of the motion network in discriminating between the two directions of motion in the random dots task for three different decision thresholds T. Two trends, which are also observed in human and monkey performance on this task, are the rapid increase in accuracy as a function of motion coherence (especially for large thresholds) and the increase in accuracy for each coherence value with increasing thresholds, with an asymptote close to 100% for large enough thresholds.

6 Model Predictions

6.1 Effect of Multiple Stimuli in a Receptive Field. A first prediction of the model arises from the log probability interpretation of firing rates and the effects of normalization in a probabilistic system. In a network of neurons where each neuron encodes the probability of a particular state, the probabilities should sum to one. Thus, if an input contains multiple stimuli consistent with multiple states, the probabilities for each state would be appropriately reduced to reflect probability normalization.
The current model predicts that one should see a concomitant logarithmic decrease in the firing rates, as illustrated in the example below. Consider the simplest case of two static stimuli in the receptive field, reflecting two different states encoded by two neurons (say, 1 and 2). Let v_1 and v_2 be the outputs (log posteriors) of neurons 1 and 2, respectively, when their preferred stimulus is shown alone in the receptive field. In the ideal noiseless case, both log posteriors would be zero. Suppose both stimuli are now shown simultaneously. The model predicts that the new outputs v′_1 and v′_2 should obey the relation e^{v′_1} + e^{v′_2} = 1. If both states are equally likely, e^{v′_1} = 0.5, that is, v′_1 = −0.69, a nonlinear decrease from the initial value of zero. To see the impact on firing rates, consider the case where the neuronal firing rate f is proportional to the log posterior, that is, f = [c · v + m]^+, where c is a positive constant and m is the maximum firing rate (see section 7.1 for a discussion). Then the new firing rate f′_1 for neuron 1 is related to the old firing rate f_1 (= m) by f′_1 = −0.69c + m = f_1 − 0.69c. Note that this is different from a simple halving of the single-stimulus firing rate, as might be predicted from a linear model. Thus, the model predicts that in the presence of multiple stimuli, the firing rate of a cell encoding one of the stimuli will be nonlinearly reduced in such a way that the sum of the probabilities encoded by the network approximates 1. Evidence for nonlinear reductions in firing rates for multiple stimuli in the receptive field comes from some experiments in V1 (DeAngelis, Robson, Ohzawa, & Freeman, 1992) and other attention-related experiments in higher cortical areas (Reynolds, Chelazzi, & Desimone, 1999). The existing data are, however, insufficient to quantitatively validate or falsify the above prediction.
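The arithmetic of this prediction can be checked directly; the gain c = 10 and maximum rate m = 100 Hz are illustrative values taken from the ranges discussed in section 7.1:

```python
import numpy as np

def firing_rate(v, c=10.0, m=100.0):
    """Rate-based encoding of a log posterior v: f = [c*v + m]^+ ."""
    return max(c * v + m, 0.0)

f_single = firing_rate(0.0)          # preferred stimulus alone: v = 0, so f = m
f_paired = firing_rate(np.log(0.5))  # two equally likely stimuli: v' = log 0.5
# The rate drops by 0.69c (about 7 Hz here), not to half the single-stimulus rate.
```
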
More sophisticated experiments involving multiunit recordings may be needed to shed light on this prediction derived from the log probability hypothesis of neural coding.

6.2 Effect of Changing Priors and Task Urgency. The strength of a Bayesian model lies in its ability to predict the effects of varying prior knowledge or various types of biases in a given task. We explored the consequences of varying two types of biases in the motion discrimination task: (1) changing the prior probability of a particular motion direction (L or R) and (2) lowering the decision threshold to favor speed over accuracy. Changing the prior probability for a particular motion direction amounts to initializing the decision neuron activities d_L and d_R to the appropriate log prior ratios:

$$d_L(0) = \log P(L) - \log P(R) \quad (6.1)$$

$$d_R(0) = \log P(R) - \log P(L), \quad (6.2)$$

where P(L) and P(R) denote the prior probabilities of leftward and rightward motion, respectively. Note that in the equiprobable case, these variables are initialized to 0, as in the simulations presented thus far. We examined the distribution of reaction times obtained as a result of assuming a slightly higher prior probability for L compared to R (0.51 versus 0.49). Figure 14A compares this distribution (shown on the right) to the distribution obtained for the unbiased (equiprobable) case (shown on the left). Both distributions shown are for trials in which the direction of motion was L. The model predicts that the entire distribution for a particular task choice should shift leftward (toward shorter reaction times) when the subject assumes a higher prior for that particular choice.

Figure 14: Reaction time distributions as a function of prior probability and urgency. (A) The left panel shows the distribution of reaction times for the motion discrimination task obtained from model simulations when leftward and rightward motion are equiprobable. The right panel shows the distribution for the case where leftward motion is more probable than rightward motion (0.51 versus 0.49). Note the leftward shift of the distribution toward shorter reaction times. Both distributions here were restricted to leftward motion trials only. (B) The right panel shows the result of halving the decision threshold, simulating an urgency constraint that favors speed over accuracy. Compared to the nonurgent case (left panel), the reaction time distribution again exhibits a leftward shift toward faster (but potentially inaccurate) responses.

Similarly, when time is at a premium, a subject may opt to lower the decision threshold to reach decisions quickly, albeit at the risk of making some incorrect decisions. Such an urgency constraint was simulated in the model by lowering the decision threshold to half the original value, keeping the stimulus coherence constant (60% in this case). As shown in Figure 14B, the model again predicts a leftward shift of the entire reaction time histogram, corresponding to faster (but sometimes inaccurate) responses. Although these predictions are yet to be tested in the context of the motion discrimination task (see Mazurek, Ditterich, Palmer, & Shadlen, 2001, for some preliminary results), psychophysical experiments by Carpenter and colleagues lend some support to the model's predictions. In these experiments (Carpenter & Williams, 1995), a human subject was required to make an eye movement to a dim target that appeared randomly to the left or right of a fixation point after a random delay of 0.5 to 1.5 seconds. The latency distributions of the eye movements were recorded as a function of different prior probabilities for one of the targets. As predicted by the model for the motion discrimination task (see Figure 14A), the response latency histogram was observed to shift leftward for increasing prior probabilities. In a second set of experiments, Reddi and Carpenter (2000) examined the effects of imposing urgency constraints on the eye movement task. They asked subjects to respond as rapidly as possible and to worry less about the accuracy of their response. Once again, the latency histogram shifted leftward, similar to the model simulation results in Figure 14B.
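Both manipulations enter the model at the decision stage: priors set the initial values of the decision variables (equations 6.1 and 6.2), and urgency scales the threshold. A minimal sketch, in which the prior probabilities and the threshold value are illustrative:

```python
import numpy as np

def init_decision_vars(pL, pR):
    """Initialize the decision neurons to the log prior ratios
    (equations 6.1-6.2)."""
    return np.log(pL) - np.log(pR), np.log(pR) - np.log(pL)

dL0, dR0 = init_decision_vars(0.51, 0.49)  # slight bias toward L
# dL0 > 0: d_L starts closer to threshold, so L decisions are reached sooner,
# shifting the reaction-time distribution leftward (Figure 14A).

c_urgent = 0.5 * 1.0  # halving a threshold of c = 1.0 models urgency (Figure 14B)
```
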
Furthermore, in a remarkable follow-up experiment, Reddi and Carpenter (2000) showed that this leftward shift in the histogram due to urgency could be countered by a corresponding rightward shift due to a lower prior probability for one of the targets, resulting in no overall shift in the latency histogram. This provides strong evidence for a rise-to-threshold model of decision making, where the threshold is chosen according to the constraints of the task at hand.

7 Discussion

In this section, we address the problem of how neural activity may encode log probability values, which are nonpositive. We also discuss the benefits of probabilistic inference in the log domain for neural systems and conclude the section by summarizing related work in probabilistic neural models and probabilistic decision making.

7.1 Neural Encoding of Log Probabilities. The model introduced above assumes that the outputs of neurons in a recurrent network reflect log posterior probabilities of a particular state, given a set of inputs. However, log probabilities take on nonpositive values between −∞ and 0, raising two
important questions: (1) How are these nonpositive values encoded using firing rates or a spike-based code? (2) How do these codes capture the precision and range of probability values that can be encoded? We address these issues in the context of rate-based and spike-based encoding models. A simple encoding model is to assume that firing rates are linearly related to the log posteriors encoded by the neurons. Thus, if v_i denotes the log posterior encoded by neuron i, then the firing rate of the neuron is given by

$$f_i = [c \cdot v_i + m]^+, \quad (7.1)$$

where c is a positive gain factor, m is the maximum firing rate of the neuron, and [·]^+ denotes rectification ([x]^+ = x if x > 0 and 0 otherwise). Note that the firing rate f_i attains the maximum value m when the log posterior v_i = 0, that is, when the posterior probability P(θ_i^t | I(t), ..., I(1)) = 1. Likewise, f_i attains its minimal value of 0 for posteriors below e^{−m/c}. Thus, the precision and range of probability values that can be encoded in the firing rate are governed by both m and c. Since e^{−x} quickly becomes negligible (e.g., e^{−10} = 0.000045), for typical values of m in the range 100 to 200 Hz, values of c in the range 10 to 20 would allow the useful range of probability values to be accurately encoded in the firing rate. These firing rates can be easily decoded when received as synaptic inputs using the inverse relationship v_i = (f_i − m)/c. A second straightforward but counterintuitive encoding model is to assume that f_i = −c · v_i, up to some maximum value m.² In this model, a high firing rate implies a low posterior probability, that is, an unexpected or novel stimulus. Such an encoding model allows neurons to be interpreted as novelty detectors. Thus, the suppression of neural responses in some cortical areas due to the addition of contextual information, such as surround effects in V1 (Knierim & Van Essen, 1992; Zipser, Lamme, & Schiller, 1996), or due to increasing familiarity with a stimulus, such as response reduction in IT due to stimulus repetition (Miller, Li, & Desimone, 1991), can be interpreted as increases in the posterior probabilities of the state encoded by the recorded neurons. Such a model of response suppression is similar in spirit to predictive coding models (Srinivasan, Laughlin, & Dubs, 1982; Rao & Ballard, 1999) but differs from these earlier models in assuming that the responses represent log posterior probabilities rather than prediction errors.
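The rate-based encoding and its inverse can be sketched as follows; the values of c and m are illustrative, chosen from the ranges above:

```python
import numpy as np

C, M = 10.0, 100.0  # gain and maximum firing rate in Hz (illustrative values)

def encode(v):
    """Firing rate from a log posterior v (equation 7.1): f = [c*v + m]^+ ."""
    return np.maximum(C * v + M, 0.0)

def decode(f):
    """Recover the log posterior at a receiving synapse: v = (f - m)/c."""
    return (f - M) / C

# Posteriors below e^{-m/c} = e^{-10} are clipped to a firing rate of zero.
```
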
A spike-based encoding model can be obtained by interpreting the leaky integrator equation (see equation 2.1) in terms of the membrane potential of a neuron rather than its firing rate. Such an interpretation is consistent with the traditional RC circuit-based model of membrane potential dynamics (see, e.g., Koch, 1999). We assume that the membrane potential V_i^m is

² A different model, also based on a negative log encoding scheme, has been suggested by Barlow (1969, 1972).
proportional to the log posterior v_i:

$$V_i^m = k \cdot v_i + T, \quad (7.2)$$

where k is a constant gain factor and T is a constant offset. The dynamics of the neuron's membrane potential is then given by

$$\frac{dV_i^m}{dt} = k \frac{dv_i}{dt}, \quad (7.3)$$

where dv_i/dt is computed as described in sections 2 and 3. Thus, in this model, changes in the membrane potential due to synaptic inputs are proportional to changes in the log posterior being represented by the neuron. More interestingly, if T is regarded as analogous to the spiking threshold, one can define the probability of neuron i spiking at time t + 1 as

$$P(\mathrm{spike}_i(t+1) \mid V_i^m(t+1)) = e^{(V_i^m(t+1) - T)/k} = e^{v_i(t+1)} = P(\theta_i^t \mid I(t), \ldots, I(1)). \quad (7.4)$$
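A stochastic sketch of this spike-generation rule (the gain k, offset T, and trial count are illustrative choices): sampling a spike with probability e^{v_i} makes the empirical spike rate converge to the posterior probability of the neuron's preferred state:

```python
import numpy as np

rng = np.random.default_rng(0)
k, T = 1.0, 0.0  # illustrative gain and threshold offset

def spike(v):
    """Emit a spike with probability e^{(V - T)/k} = e^{v} (equation 7.4)."""
    V = k * v + T                  # membrane potential (equation 7.2)
    return rng.random() < np.exp((V - T) / k)

# Over many presentations, the spike rate estimates the posterior itself.
p_hat = np.mean([spike(np.log(0.3)) for _ in range(20000)])
```
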
Such a model for spiking can be viewed as a variant of the noisy threshold model, where the cell spikes if its potential exceeds a noisy threshold. As defined above, the spiking probability of a neuron becomes exactly equal to the posterior probability of the neuron's preferred state θ_i given the inputs (see also Anastasio, Patton, & Belkacem-Boussaid, 2000; Hoyer & Hyvärinen, in press). This provides a new interpretation of traditional neuronal spiking probabilities computed from a poststimulus time histogram (PSTH). An intriguing open question, given such an encoding model, is how these spike probabilities are decoded "on-line" from input spike trains by synapses and converted to the log domain to influence the membrane potential as in equation 7.3. We hope to explore this question in future studies.

7.2 Benefits of Probabilistic Inference in the Log Domain. A valid question to ask is why a neural circuit should encode probabilities in the log domain in the first place. Indeed, it has been suggested that multiplicative interactions between inputs may occur in the dendrites of cortical cells (Mel, 1993), which could perhaps allow equation 3.2 to be directly implemented in a recurrent circuit (cf. Bridle, 1990). However, there are several reasons why representing probabilities in the log domain could be beneficial to a neural system:

• Neurons have a limited dynamic range. A logarithmic transformation allows the most useful range of probabilities to be represented by a small number of activity levels (see the previous discussion on neural encoding of log probabilities).

• Implementing probabilistic inference in the log domain allows the use of addition and subtraction instead of multiplication and division. The
former operations are readily implemented through excitation and inhibition in neural circuits, while the latter are typically much harder to implement neurally. As shown earlier in this article, a standard leaky-integrator model suffices to implement probabilistic inference for a hidden Markov model if performed in the log domain.

• There is growing evidence that the brain uses quantities such as log-likelihood ratios during decision making (see section 7.4). These quantities can be readily computed if neural circuits are already representing probabilities in the log domain.

Nevertheless, it is clear that representing probabilities in the log domain also makes certain operations, such as addition or convolution, harder to implement. Specific examples include computing the OR of two distributions or convolving a given distribution with noise. To implement such operations, neural circuits operating in the log domain would need to resort to approximations, such as the log-sum approximation for addition discussed in section 4.

7.3 Neural Encoding of Priors. An important question for Bayesian models of brain function is how prior knowledge about the world can be represented and used during neural computation. In the model discussed in this article, "priors" influence computation in two ways. First, long-term prior knowledge about the statistical regularities of the environment is stored within the recurrent and feedforward connections of the network. Such knowledge could be acquired through evolution or learning, or some combination of the two. For example, the transition probabilities stored in the recurrent connections of the motion detection network capture prior knowledge about the expected trajectory of a moving stimulus at each location. This stored knowledge could be changed by adapting the recurrent weights if the statistics of moving stimuli suddenly change in the organism's environment.
Similarly, the likelihood function stored in the feedforward connections captures properties of the physical world and can be adapted according to the statistics of the inputs. The study of such stored "priors" constitutes a major research direction in Bayesian psychophysics (e.g., Bloj et al., 1999; Weiss et al., 2002; Mamassian, Landy, & Maloney, 2002). Second, at shorter timescales, recurrent activity provides "priors" for expected inputs based on the history of past inputs. These priors are then combined with the feedforward input likelihoods to produce the posterior probability distribution (cf. equations 3.3 and 3.7). Note that the priors provided by the recurrent activity at each time step reflect the long-term prior knowledge about the environment (transition probabilities) stored in the recurrent connections.
7.4 Related Work. This article makes contributions in two related areas: neural models of probabilistic inference and models of probabilistic decision making. Below, we review previous work in these two areas with the aim of comparing and contrasting our approach with these previous approaches.

7.4.1 Neural Models of Probabilistic Inference. Perhaps the earliest neural model of probabilistic inference is the Boltzmann machine (Hinton & Sejnowski, 1983, 1986), a stochastic network of binary units. Inference in the network proceeds by randomly selecting a unit at each time step and setting its activity to 1 with a probability given by a sigmoidal function of a weighted sum of its inputs. This update procedure, also known as Gibbs sampling, allows the network to converge to a set of activities characterized by the Boltzmann distribution (see, e.g., Dayan & Abbott, 2001, for details). The current model differs from the Boltzmann machine in both the underlying generative model and the inference mechanism. Unlike the Boltzmann machine, our model uses a generative model with explicit dynamics in the state-space and can therefore model probabilistic transitions between states. For inference, the model explicitly operates on probabilities (in the log domain), thereby avoiding the need for sampling-based inference techniques, which can be notoriously slow. These features also differentiate the proposed approach from other more recent probabilistic neural network models such as the Helmholtz machine (Dayan et al., 1995; Lewicki & Sejnowski, 1997; Hinton & Ghahramani, 1997). Bridle (1990) first suggested that statistical inference for hidden Markov models could be implemented using a modification of the standard recurrent neural network. In his model, known as an Alpha-Net, the computation of the likelihood for an input sequence is performed directly using an analog of equation 3.2.
Thus, Bridle's network requires a multiplicative interaction of input likelihoods with the recurrent activity. The model we have proposed performs inference in the log domain, thereby requiring only additive interactions, which are more easily implemented in neural circuitry. In addition, experimental data from human and monkey decision-making tasks have suggested an interpretation of neural activities in terms of log probabilities, supporting a model of inference in the log domain. The log domain implementation, however, necessitates the approximation of the transition probabilities using equation 4.2, an approximation that is avoided if multiplicative interactions are allowed (as in Bridle's approach). A second set of probabilistic neural models stems from efforts to understand the encoding and decoding of information in populations of neurons. One class of approaches uses basis functions to represent probability distributions within neuronal ensembles (Anderson & Van Essen, 1994; Anderson, 1996; Deneve & Pouget, 2001; Eliasmith & Anderson, 2003). In this approach, a distribution P(x) over stimuli x is represented using a linear combination of basis functions (kernels):

$$P(x) = \sum_i r_i b_i(x), \quad (7.5)$$
where r_i is the normalized response (firing rate) and b_i the implicit basis function associated with neuron i in the population. The basis function of each neuron is assumed to be linearly related to the tuning function of the neuron as measured in physiological experiments. The basis function approach is similar to the approach proposed here in that the stimulus space is spanned by a limited number of neurons with preferred stimuli or state vectors. The two approaches differ in how probability distributions are decoded from neural responses, one using an additive method and the other using a logarithmic transformation, as discussed above. A limitation of the basis function approach is that, due to its additive nature, it cannot represent distributions that are sharper than the component distributions. A second class of models addresses this problem using a generative approach, where an encoding model is first assumed and a Bayesian decoding model is used to estimate the stimulus x (or its distribution), given a set of responses r_i (Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Pouget, Zhang, Deneve, & Latham, 1998; Zemel et al., 1998; Zemel & Dayan, 1999; Wu, Chen, Niranjan, & Amari, 2003). For example, in the distributional population coding (DPC) method (Zemel et al., 1998; Zemel & Dayan, 1999), the responses are assumed to depend on general distributions P(x), and a Bayesian method is used to decode a probability distribution over possible distributions over x. The best estimate in this method is not a single value of x but an entire distribution over x, which is assumed to be represented by the neural population. The underlying goal of representing entire distributions within neural populations is common to both the DPC approach and the model proposed in this article.
However, the approaches differ in how they achieve this goal: the DPC method assumes prespecified tuning functions for the neurons and a sophisticated, nonneural decoding operation, whereas the method introduced in this article directly instantiates a probabilistic generative model (here, a hidden Markov model) with a comparatively straightforward linear decoding operation, as described above. Several probabilistic models have also been suggested for solving specific problems in visual motion processing, such as the aperture problem (Simoncelli, 1993; Koechlin, Anton, & Burnod, 1999; Zemel & Dayan, 1999; Ascher & Grzywacz, 2000; Freeman, Haddon, & Pasztor, 2002; Weiss & Fleet, 2002; Weiss et al., 2002). These typically rely on a prespecified bank of spatiotemporal filters to generate a probability distribution over velocities, which is processed according to Bayesian principles. The model proposed in this article differs from these previous approaches in making use of an explicit generative model for capturing the statistics of time-varying inputs (i.e., a hidden Markov model). Thus, the selectivity for direction and velocity is an emergent property of the network and not a consequence of tuned spatiotemporal filters chosen a priori. Finally, Rao and Ballard have suggested a hierarchical model for neural computation based on Kalman filtering (or predictive coding) (Rao & Ballard, 1997, 1999; Rao, 1999). The Kalman filter model is based on a generative
model similar to the one used in the hidden Markov model discussed in this article, except that the states are assumed to be continuous-valued, and both the observation and transition probability densities are assumed to be gaussian. As a result, the inference problem becomes one of estimating the mean and covariance of the gaussian distribution of the state at each time step. Rao and Ballard showed how the mean could be computed in a recurrent network in which feedback from a higher layer to the input layer carries predictions of expected inputs, and the feedforward connections convey the errors in prediction, which are used to correct the estimate of the mean state. The neural estimation of the covariance was not addressed. In contrast to the Kalman filter model, the current model does not require the propagation of predictive error signals and allows the full distribution of states to be estimated, rather than just the mean. In addition, the observation and transition distributions can be arbitrary and need not be gaussian. The main drawback is the restriction to discrete states, which implies that distributions over continuous state variables are approximated using discrete samples. Fortunately, these discrete samples, which are the preferred states θ_i of the neurons in the network, can be adapted by changing the feedforward and feedback weights (W and M) in response to inputs. Thus, the distribution over a continuous state-space can be modeled to different degrees of accuracy in different parts of the state-space by a finite number of neurons that adapt their preferred states to match the statistics of the inputs. Possible procedures for adapting the feedforward and feedback weights are discussed in section 8.1.

7.4.2 Models of Bayesian Decision Making. Recent experiments using psychophysics and electrophysiology have shed light on the probabilistic basis of visual decision making in humans and monkeys.
There is converging evidence that decisions are made based on the log of the likelihood ratio of one alternative over another (Carpenter & Williams, 1995; Gold & Shadlen, 2001). Carpenter and colleagues have suggested a mathematical model called LATER (linear approach to threshold with ergodic rate) (Carpenter & Williams, 1995; Reddi & Carpenter, 2000) for explaining results from their visual decision-making task (described in section 6.2). In their model, the evidence for one alternative over another is accumulated in the form of a log-likelihood ratio at a constant rate for a given trial, resulting in a decision in favor of the first alternative if the accumulated variable exceeds a fixed threshold. This is similar to the decision-making process posited in this article, except that we do not assume a linear rise to threshold. Evidence for the LATER model is presented in Carpenter and Williams (1995) and Reddi and Carpenter (2000). A related and more detailed model has been proposed by Gold and Shadlen (2001) to explain results from the random dots task. The Gold-Shadlen model is related to "accumulator models" commonly used in psychology (Ratcliff, Zandt, & McKoon, 1999; Luce, 1986; Usher & McClelland, 2001) and treats decision making as a diffusion process that is biased by a log-likelihood ratio. Rather than assuming that entire probability distributions are represented within cortical circuits, as proposed in this article, the Gold-Shadlen model assumes that the difference between two firing rates (e.g., from two different motion-selective neurons in cortical area MT) can be interpreted as a log-likelihood ratio of one task alternative over another (e.g., one motion direction over another). We believe that there are enormous advantages to be gained if the brain could represent entire probability distributions of stimuli, because such a representation allows not just decision making when faced with a discrete number of alternatives but also a wide variety of other useful tasks related to probabilistic reasoning, such as estimation of an unknown quantity and planning in uncertain environments. A number of models of decision making, based typically on diffusion processes or races between variables, have been proposed in the psychological literature (Ratcliff et al., 1999; Luce, 1986; Usher & McClelland, 2001). Other recent models have focused on establishing a relation between the effects of rewards on decision making and concepts from decision theory and game theory (Platt & Glimcher, 1999; Glimcher, 2002). These models, like the models of Carpenter, Shadlen, and colleagues, are mathematical models aimed mainly at explaining behavioral data and are formulated at a higher level of abstraction than the neural implementation level. The model presented in this article seeks to bridge this gap by showing how rigorous mathematical models of decision making could be implemented within recurrent neural circuits. A recent model proposed by Wang (2002) shares the same goal but approaches the problem from a different viewpoint. Starting with a biophysical circuit that implements an "attractor network," Wang shows that the network exhibits many of the properties seen in cortical decision-making neurons.
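As a concrete illustration of the accumulate-to-bound principle shared by the LATER and diffusion accounts above, the sketch below integrates the log-likelihood ratio of two gaussian hypotheses over successive noisy samples until a fixed bound is crossed. This is a schematic sketch of the general principle, not the recurrent-network implementation described in this article; the function name, the gaussian observation model, and the parameter values are illustrative assumptions.

```python
import random

def accumulate_to_bound(samples, mu_a=0.5, mu_b=-0.5, sigma=1.0, bound=3.0):
    """Accumulate the log-likelihood ratio log[P(x|A)/P(x|B)] of two
    gaussian hypotheses A and B over successive samples, deciding as soon
    as a bound is crossed. Returns (choice, decision_time), with choice
    +1 for A, -1 for B, and 0 if the evidence never reaches a bound."""
    llr = 0.0
    for t, x in enumerate(samples, start=1):
        # log N(x; mu_a, sigma) - log N(x; mu_b, sigma); constants cancel.
        llr += ((x - mu_b) ** 2 - (x - mu_a) ** 2) / (2.0 * sigma ** 2)
        if llr >= bound:
            return +1, t
        if llr <= -bound:
            return -1, t
    return 0, len(samples)

# Noisy evidence drawn under hypothesis A; smaller separation between the
# two means (analogous to lower motion coherence) yields slower decisions.
rng = random.Random(0)
choice, rt = accumulate_to_bound(rng.gauss(0.5, 1.0) for _ in range(1000))
```

Reducing the separation between `mu_a` and `mu_b` lengthens decision times on average while lowering accuracy, mirroring the coherence effects that the diffusion-style models above are designed to capture.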
The inputs to Wang's decision-making model are not based on a generative model of time-varying inputs but are assumed to be two abstract inputs modeled as two Poisson processes whose rates are determined by gaussian distributions. The means of these distributions depend linearly on the motion coherence level, such that the two means approach each other as the coherence is reduced. Thus, the effects of motion coherence on the inputs to the decision-making process are built in rather than being computed from input images as in our model. Consequently, Wang's model, like the other decision-making models discussed here, does not directly address the issue of how probability distributions of stimuli may be represented and processed in neural circuits.

8 Conclusions and Future Work

We have shown that a recurrent network of leaky integrator neurons can approximate Bayesian inference for a hidden Markov model. Such networks have previously been used to model a variety of cortical response properties (Dayan & Abbott, 2001). Our results suggest a new interpretation of neural
activities in such networks as representing log posterior probabilities. Such a hypothesis is consistent with the suggestions made by Carpenter, Shadlen, and others based on studies in visual decision making. We illustrated the performance of the model using the examples of visual orientation discrimination and motion detection. In the case of motion detection, we showed how a model cortical network can exhibit direction selectivity and compute the log posterior probability of motion direction, given a sequence of input images. The model thus provides a probabilistic interpretation of direction-selective responses in cortical areas V1 and MT. We also showed how the outputs of the motion detection network could be used to decide the direction of coherent motion in the well-known random dots task. The activity of the "decision neurons" in the model during this task resembles the activity of evidence-accumulating neurons in LIP and FEF, two cortical areas known to be involved in visual decision making. The model correctly predicts reaction times as a function of stimulus coherence and produces reaction time distributions that are qualitatively similar to those obtained in eye movement experiments that manipulate prior probabilities of targets and task urgency. Although the results described thus far are encouraging, several important questions remain: (1) How can the feedforward weights W and recurrent weights M be learned directly from input data? (2) How can the approach be generalized to graphical models that are more sophisticated than hidden Markov models? (3) How can the approach be extended to incorporate the influence of rewards on Bayesian estimation, learning, and decision making? We conclude by discussing some potential strategies for addressing these questions in future studies.

8.1 Learning Synaptic Weights.
We intend to explore the question of synaptic learning using biologically plausible approximations to the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). For standard hidden Markov models, the E-step in the EM algorithm is realized by the "forward-backward" algorithm (Rabiner & Juang, 1986). The network described in this article implements the "forward" part of this algorithm and can be regarded as an approximate E-step for the case of on-line estimation. One possible likelihood function³ for the on-line case is:

\[
Q_t(W, M) = \sum_j P(\theta_j^t \mid I(t), \ldots, I(1); W, M)\, \log P(\theta_j^t, I(t) \mid I(t-1), \ldots, I(1); W, M).
\]

³ This approximates the full on-line likelihood function \(Q_t^F(W, M) = \sum_j P(\theta_j^t \mid I(t), \ldots, I(1); W, M)\, \log P(\theta_j^t, I(t), I(t-1), \ldots, I(1) \mid W, M)\).

This function can be rewritten in terms of the outputs v_j of the model network and the summed inputs u_j from section 3 for the current values of the
weights W and M:

\[
Q_t(W, M) = \sum_j e^{v_j(t+1)}\, u_j(t). \tag{8.1}
\]

For synaptic learning, we can perform an on-line M-step to increase Q_t through gradient ascent with respect to W and M (this actually implements a form of incremental or generalized EM algorithm; Neal & Hinton, 1998):

\[
\Delta w_i(t) = \alpha \frac{\partial Q_t}{\partial w_i} = \alpha\, \epsilon\, g_i(t)\, e^{v_i(t+1)}\, I(t) \tag{8.2}
\]
\[
\Delta m_{ik}(t) = \beta \frac{\partial Q_t}{\partial m_{ik}} = \beta\, g_i(t)\, e^{v_i(t+1)}\, v_k(t), \tag{8.3}
\]

where α and β are learning rates, and \(g_i(t) = (1 + u_i(t) - Q_t(W, M))\) is a gain function. The above expressions were derived by substituting the right-hand side of the equations for v_i(t+1) and u_i(t), respectively (see section 3), and approximating the gradient by retaining only the first-order terms pertaining to time step t. The learning rules above are interesting variants of the well-known Hebb rule (Hebb, 1949): the synaptic weight change is proportional to the product of the input (I(t) or v_k(t)) with an exponential function of the output v_i(t+1) (rather than simply the output), thereby realizing a soft form of winner-take-all competition. Furthermore, the weight changes are modulated by the gain function g_i(t), which is positive only when \((u_i(t) - Q_t(W, M)) > -1\), that is, when \(P(\theta_i^t, I(t) \mid I(t-1), \ldots, I(1); W, M) > (1/e)\, e^{Q_t(W, M)}\). Thus, both learning rules remain anti-Hebbian and encourage decorrelation (Földiák, 1990) unless the net feedforward and recurrent input (u_i(t)) is high enough to make the joint probability of the state i and the current input exceed the threshold of \((1/e)\, e^{Q_t(W, M)}\). In that case, the learning rules switch to a Hebbian regime, allowing the neuron to learn the appropriate weights to encode state i. An in-depth investigation of these and related learning rules is underway and will be discussed in a future article.

8.2 Generalization to Other Graphical Models. A second open question is how the suggested framework could be extended to allow neural implementation of hierarchical Bayesian inference in graphical models that are more general than hidden Markov models (cf. Dayan et al., 1995; Dayan & Hinton, 1996; Hinton & Ghahramani, 1997; Rao & Ballard, 1997, 1999). Such models would provide a better approximation to the multilayered architecture of the cortex and the hierarchical connections that exist between cortical areas (Felleman & Van Essen, 1991). In future studies, we hope to
investigate neural implementations of various types of graphical models by extending the method described in this paper for implementing equation 3.2. In particular, equation 3.2 can be regarded as a special case of the general sum-product rule for belief propagation in a Bayesian network (Jordan & Weiss, 2002). Thus, in addition to incorporating "belief messages" from the previous time step and the current input as in equation 3.2, a more general rule for Bayesian inference would also incorporate messages from other hierarchical levels and potentially from future time steps. Investigating such generalizations of equation 3.2 and their neural implementation is an important goal of our ongoing research efforts.

8.3 Incorporating Rewards. A final question of interest is how rewards associated with various choices available in a task can be made to influence the activities of decision neurons and the decision-making behavior of the model. We expect ideas from reinforcement learning and Bayesian decision theory as well as recent neurophysiological results in the monkey (Platt & Glimcher, 1999) to be helpful in guiding our research in this direction.

Acknowledgments

This research was supported by a Sloan Research Fellowship, NSF grant no. 130705, and an ONR Young Investigator Award. I am grateful to Peter Dayan, Sophie Deneve, Aaron Hertzmann, John Palmer, Michael Rudd, Michael Shadlen, Eero Simoncelli, and Josh Tenenbaum for discussions on various ideas related to this work. I also thank the two anonymous reviewers for their constructive comments and suggestions.

References

Anastasio, T. J., Patton, P. E., & Belkacem-Boussaid, K. (2000). Using Bayes' rule to model multisensory enhancement in the superior colliculus. Neural Computation, 12(5), 1165–1187.
Anderson, C. H. (1996). Unifying perspectives on neuronal codes and processing. In E. Ludena, P. Vashishta, & R. Bishop (Eds.), Proceedings of the XIX International Workshop on Condensed Matter Theories.
New York: Nova Science.
Anderson, C. H., & Van Essen, D. C. (1994). Neurobiological computational systems. In J. M. Zurada, R. J. Marks II, & C. J. Robinson (Eds.), Computational intelligence: Imitating life (pp. 213–222). New York: IEEE Press.
Ascher, D., & Grzywacz, N. M. (2000). A Bayesian model for the measurement of visual velocity. Vision Research, 40, 3427–3434.
Barlow, H. B. (1969). Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 156, 872–881.
Barlow, H. B. (1972). Single units and cognition: A neurone doctrine for perceptual psychology. Perception, 1, 371–394.
Bloj, M. G., Kersten, D., & Hurlbert, A. C. (1999). Perception of three-dimensional shape influences colour perception through mutual illumination. Nature, 402, 877–879.
Bridle, J. S. (1990). Alpha-Nets: A recurrent "neural" network architecture with a hidden Markov model interpretation. Speech Communication, 9(1), 83–92.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12, 4745–4765.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Dayan, P., & Hinton, G. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.
Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
DeAngelis, G. C., Robson, J. G., Ohzawa, I., & Freeman, R. D. (1992). Organization of suppression in receptive fields of neurons in cat visual cortex. J. Neurophysiol., 68(1), 144–163.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745.
Deneve, S., & Pouget, A. (2001). Bayesian estimation by interconnected neural networks (abstract no. 237.11). Society for Neuroscience Abstracts, 27.
Eliasmith, C., & Anderson, C. H. (2003). Neural engineering: Computation, representation, and dynamics in neurobiological systems. Cambridge, MA: MIT Press.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biol. Cybern., 64, 165–170.
Freeman, W. T., Haddon, J., & Pasztor, E. C. (2002). Learning motion analysis. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 97–115). Cambridge, MA: MIT Press.
Glimcher, P. (2002). Decisions, decisions, decisions: Choosing a biological science of choice. Neuron, 36(2), 323–332.
Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5(1), 10–16.
Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hinton, G. E., & Brown, A. D. (2002). Learning to use spike timing in a restricted Boltzmann machine. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.),
Probabilistic models of the brain: Perception and neural function (pp. 285–296). Cambridge, MA: MIT Press.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Roy. Soc. Lond. B, 352, 1177–1190.
Hinton, G., & Sejnowski, T. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). New York: IEEE.
Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
Hoyer, P. O., & Hyvärinen, A. (in press). Interpreting neural response variability as Monte Carlo sampling of the posterior. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Jacobs, R. A. (2002). Visual cue integration for depth perception. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 61–76). Cambridge, MA: MIT Press.
Jordan, M. I., & Weiss, Y. (2002). Graphical models: Probabilistic inference. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.
Knierim, J., & Van Essen, D. C. (1992). Neural responses to static texture patterns in area V1 of the alert macaque monkey. Journal of Neurophysiology, 67, 961–980.
Knill, D. C., & Richards, W. (1996). Perception as Bayesian inference. Cambridge: Cambridge University Press.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segmentation in area MT. Biological Cybernetics, 80, 25–44.
Lewicki, M. S., & Sejnowski, T. J. (1997). Bayesian unsupervised learning of higher order structure. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 529–535). Cambridge, MA: MIT Press.
Luce, R. D. (1986). Response times: Their role in inferring elementary mental organization. New York: Oxford University Press.
Mamassian, P., Landy, M., & Maloney, L. T. (2002). Bayesian modelling of visual perception. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 61–76). Cambridge, MA: MIT Press.
Mazurek, M. E., Ditterich, J., Palmer, J., & Shadlen, M. N. (2001). Effect of prior probability on behavior and LIP responses in a motion discrimination task (abstract no. 58.11). Society for Neuroscience Abstracts, 27.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiology, 70(3), 1086–1101.
Miller, E. K., Li, L., & Desimone, R. (1991). A neural mechanism for working and recognition memory in inferior temporal cortex. Science, 254, 1377–1379.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer.
Platt, M. L., & Glimcher, P. W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400, 233–238.
Pouget, A., Dayan, P., & Zemel, R. S. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1(2), 125–132.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10(2), 373–401.
Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3, 4–16.
Rao, R. P. N. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39(11), 1963–1989.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4), 721–763.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive field effects. Nature Neuroscience, 2(1), 79–87.
Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (2002). Probabilistic models of the brain: Perception and neural function. Cambridge, MA: MIT Press.
Ratcliff, R., Zandt, T. V., & McKoon, G. (1999). Connectionist and diffusion models of reaction time. Psychological Review, 106, 261–300.
Reddi, B. A., & Carpenter, R. H. (2000). The influence of urgency on decision time. Nature Neuroscience, 3(8), 827–830.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19, 1736–1753.
Roitman, J. D., & Shadlen, M. N. (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22, 9475–9489.
Schall, J. D., & Hanes, D. P. (1998). Neural mechanisms of selection and control of visually guided eye movements. Neural Networks, 11, 1241–1251.
Schall, J. D., & Thompson, K. G. (1999). Neural selection and control of visually guided eye movements. Annual Review of Neuroscience, 22, 241–259.
Seung, H. S. (1996). How the brain keeps the eyes still. Proceedings of the National Academy of Sciences, 93, 13339–13344.
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86(4), 1916–1936.
Simoncelli, E. P. (1993). Distributed representation and analysis of visual motion. Doctoral dissertation, MIT.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B, 216, 427–459.
Usher, M., & McClelland, J. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592.
Wang, X.-J. (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences, 24, 455–463.
Wang, X.-J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5), 955–968.
Weiss, Y., & Fleet, D. J. (2002). Velocity likelihoods in biological and machine vision. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 77–96). Cambridge, MA: MIT Press.
Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nature Neuroscience, 5(6), 598–604.
Wu, S., Chen, D., Niranjan, M., & Amari, S.-I. (2003). Sequential Bayesian decoding with a population of neurons. Neural Computation, 15, 993–1012.
Zemel, R. S., & Dayan, P. (1999). Distributional population codes and multiple motion models. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 174–180). Cambridge, MA: MIT Press.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2), 403–430.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: A unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79(2), 1017–1044.
Zipser, K., Lamme, V. A. F., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. Journal of Neuroscience, 16(22), 7376–7389.

Received August 2, 2002; accepted June 4, 2003.
LETTER
Communicated by Kechen Zhang
Probability of Stimulus Detection in a Model Population of Rapidly Adapting Fibers

Burak Güçlü
[email protected]
Stanley J. Bolanowski
[email protected]
Institute for Sensory Research, Syracuse, NY 13244-5290, and Department of Bioengineering and Neuroscience, Syracuse University, Syracuse, NY 13244
The goal of this study is to establish a link between somatosensory physiology and psychophysics at the probabilistic level. The model for a population of monkey rapidly adapting (RA) mechanoreceptive fibers by Güçlü and Bolanowski (2002) was used to study the probability of stimulus detection when a 40 Hz sinusoidal stimulation is applied with a constant contactor size (2 mm radius) on the terminal phalanx. In the model, the detection was assumed to be mediated by one or more active fibers. Two hypothetical receptive field organizations (uniformly random and gaussian) with varying average innervation densities were considered. At a given stimulus-contactor location, changing the stimulus amplitude generates sigmoid probability-of-detection curves for both receptive field organizations. The psychophysical results superimposed on these probability curves suggest that 5 to 10 active fibers may be required for detection. The effects of the contactor location on the probability of detection reflect the pattern of innervation in the model. However, the psychophysical data do not match the predictions from the populations with uniform or gaussian-distributed receptive field centers. This result may be due to some unknown mechanical factors along the terminal phalanx, or simply because a different receptive field organization is present. It has been reported that human observers can detect a single spike in an RA fiber. By considering the probability of stimulus detection across subjects and RA populations, this article proves that more than one active fiber is indeed required for detection.
1 Introduction

Rapidly adapting (RA) fibers are one of four classes of low-threshold mechanoreceptive nerve fibers responsible for the sense of touch in human glabrous skin (Knibestöl & Vallbo, 1970; Johansson, 1976, 1978). Fibers with similar response properties have been found in the monkey (Talbot, Darian-Smith,

Neural Computation 16, 39–58 (2004)
© 2003 Massachusetts Institute of Technology
Kornhuber, & Mountcastle, 1968), cat (Jänig, Schmidt, & Zimmermann, 1968), raccoon (Pubols & Pubols, 1976), and other mammalian species. The sensory end organs of these fibers are Meissner corpuscles (Lindblom, 1965) located in the dermis. The response of RA fibers to sinusoidal displacements has been studied by numerous researchers (Talbot et al., 1968; Johnson, 1974; Goodwin, Youl, & Zimmerman, 1981; Johansson, Landström, & Lundström, 1982; Güçlü & Bolanowski, 2001, 2003). The average firing rate is a piecewise function of the stimulus amplitude, always made up of ramps and plateaus. This fact has been used in population models for RA fibers (Johnson, 1974; Goodwin & Pierce, 1981; Güçlü & Bolanowski, 1999, 2002). Johnson (1974) statistically modeled the distribution of rate-intensity functions. Goodwin and Pierce (1981) considered the case of an "average" RA fiber. Güçlü and Bolanowski (1999, 2002) presented a computational model that consists of anatomical distributions of receptive field centers as well as statistically modeled rate-intensity functions. Here, this latter model will be used to study the probability of activity in a model RA population caused by 40 Hz sinusoidal stimulation on the terminal phalanx. This frequency of stimulation is the best frequency for this fiber class (Talbot et al., 1968). RA fibers mediate the psychophysical NP I channel, which is one of the four channels that contribute to tactile sensitivity (Bolanowski, Gescheider, Verrillo, & Checkosky, 1988; Gescheider, Bolanowski, & Hardick, 2001). Activation of RA fibers produces a flutter sensation (Ochoa & Torebjörk, 1983), and it has been reported that a single action potential in an RA fiber can give rise to sensation (e.g., Vallbo, 1981). Hensel and Boman (1960) found that the thresholds eliciting a single impulse in a single mechanoreceptive fiber were comparable to those for arousing a subjective touch sensation.
Vallbo and Johansson's experiments (1976) show that the neural and perceptive threshold curves coincide for RA units located on the distal phalanx. During percutaneous recording, a spike was almost always coincident with a "yes" response. Intraneural microstimulation studies also confirm this (Vallbo, Olsson, Westberg, & Clark, 1984). Therefore, one active fiber in an RA population may trigger a psychophysical response through the NP I channel, at least at the distal phalanx. The main goal of this study is to establish a reliable link between RA population physiology and NP I channel psychophysics. This task is accomplished by deriving a probabilistic measure, using the population model by Güçlü and Bolanowski (2002), that can be compared directly to psychophysical results (Güçlü, 2003). Specifically, the effects of stimulus amplitude, its location, and the average innervation density on the probability of detection in the model population are presented. The probability of detection is found by assuming that one or more active fibers contribute to the detection of the stimulus. Comparing the psychophysical probability of detection with the derived functions gives an estimate of how many fibers are necessary for detection.
The probability of at least one active fiber is equivalent to the probability of at least one spike in the population model and is the minimum limit for detection. Since the model does not include the temporal response properties of fibers, it is not reliable to count action potentials using the very low average firing rates at the detection level. Therefore, the probabilities of multiple spikes are not calculated. It is also important to note that the derived psychometric functions do not give probabilities of detection in a single subject but represent probabilities of activity across subjects. This is because the variation in the output of the population model is due to the variation of rate-intensity functions in the population and not to the variation of the output in a single fiber. Studying the probability of detection across subjects is unique in the literature, and it contributes further to the significance of the work presented here.

2 Methods

2.1 Theory. The responses of RA fibers to a 40 Hz sinusoidal steady-state stimulus are considered. It was demonstrated (Johnson, 1974; Güçlü & Bolanowski, 2001, 2003) that the rate-intensity functions of the fibers can be approximated by

\[
r =
\begin{cases}
0, & 0 < A < a_0 \\
\dfrac{f_s}{a_1 - a_0}\,(A - a_0), & a_0 < A < a_1 \\
f_s, & a_1 < A < a_2 \\
f_s\left(1 + \dfrac{A - a_2}{a_3 - a_2}\right), & a_2 < A < a_3 \\
2 f_s, & a_3 < A
\end{cases}
\tag{2.1}
\]

for \(a_0 < a_1 < a_2 < a_3\), where A is the effective amplitude at the fiber's receptive field center, f_s the stimulus frequency, and (a_0, a_1, a_2, a_3) are the parameters that are distributed randomly among fibers. The first parameter a_0 (≡ a) is of importance here, since the fiber will be considered to be active (r ≠ 0) if A is larger than this parameter. It is important to note that the experiments to obtain the statistics regarding equation 2.1 were done using long (1 s) stimuli. Consequently, the following analyses presume constant duration and sufficiently long stimuli. The probability of a psychophysical response is assumed to be equal to the probability of at least n active fibers in a population of N independent fibers indexed by i, j, k, .... That is, at least n fibers should have nonzero average firing rates in the model. This can also be written as the complement
of the probability of n ¡ 1 or fewer active bers: O Prfat least n active bersg PD ³ ´ Prfno active bersg C Prf1 active bergC D 1¡ Prf2 active bersg C : : : C Prfn ¡ 1 active bersg D 1¡ ¡
¡
N Y
Prfber i inactiveg
iD1
N X
0 @Prfber i activeg
iD1 N¡1 X iD1
N Y
1 Prfber j inactivegA
j6Di
0
N B X B BPrfber i activegPrfber j activeg @ jDiC1
1 N Y
0
k 6D i k 6D j
C N¡nC1 X N¡nC2 X C Prfber k inactivegC ¡ : : : ¡ ::: A iD1 jDiC1
B B B N B X B BPrfber i activegPrfber j activeg : : : Prfber y activeg B y B B @ 1 N Y z 6D i z 6D j ::: z 6D y
C C C C Prfber z inactivegC C: C C A
(2.2)
The bers in equation 2.2 are sampled from the same stochastic model. Therefore, by letting p = Prfber i inactiveg = Prfber j inactiveg = : : : , equation 2.2 can be simplied to: O Prfat least n active bersg D 1 ¡ pN ¡ N.1 ¡ p/pN¡1 PD ³ ´ ³ ´ N N ¡ .1 ¡ p/2 pN¡2 ¡ : : : ¡ .1 ¡ p/n¡1 pN¡nC1 2 n¡1
Probability of Stimulus Detection in a Model Population
43
³
´ N D Prfat least n ¡ 1 active bersg ¡ .1 ¡ p/n¡1 pN¡nC1 ; (2.3) n¡1 where the large parentheses are the binomial coefcients. If p can be derived, the probability of detection from the contribution of any n or more out of N bers can be sequentially found by substituting p into equation 2.3. For ber i to be inactive, the effective amplitude of the stimulus at that ber’s receptive eld center (xi , yi ) must be smaller than the ber’s threshold. The effective amplitude (A) as a function of location (x, y) was given (Johnson, 1974; Gu¸ ¨ clu¨ & Bolanowski, 2002) as: 8 2 2 2
where (xs, ys) are the coordinates of the stimulus contactor, rs the radius of the contactor, and As the stimulus amplitude. It is important to note that equation 2.4 is an approximation based on the experiments by Johnson (1974). The thresholds for monkey RA fibers were observed to vary statistically (Johnson, 1974). The probability density function for the threshold (a) was given (Güçlü & Bolanowski, 2002) as fitted to the data in Johnson (1974):

$$
f(a) = \frac{k_1}{a}\exp\left\{-k_2\left[\log_{10}\left(\frac{a}{k_3}\right)\right]^2\right\},
\tag{2.5}
$$

where k1 = 0.41, k2 = 2.82, and k3 = 12. Let ai be the value of the parameter a for the fiber indexed by i. The contactor radius (rs) is constant at 2 mm, because equations 2.4 and 2.5 were obtained under this condition (Johnson, 1974). Since the locations of the receptive field centers vary, the joint probability density (g) for fiber i to be at location (xi, yi) and also to be inactive should be integrated across the terminal phalanx area (S) and admissible thresholds (T). The anatomical distribution of receptive field centers and the physiological distribution of thresholds are statistically independent in the computational model. Therefore, the probability of an inactive fiber (p) becomes:

$$
\begin{aligned}
p &= \int_S \int_T g\{A(x,y) < a_i,\ (x_i = x,\ y_i = y)\}\, da_i\, dx\, dy\\
&= \int_S \int_T g_T\{A(x,y) < a_i\}\, g_S\{x_i = x,\ y_i = y\}\, da_i\, dx\, dy,
\end{aligned}
\tag{2.6}
$$

where gS(·) is the density function for the anatomical locations and gT(·) is the density function for the thresholds, which is equivalent to equation 2.5.
The probability of the effective stimulus amplitude being smaller than the threshold can be found by integrating the density function, equation 2.5. Hence,

$$
p = \int_S \left[\int_{A(x,y)}^{\infty} f(a)\, da\right] g_S\{x_i = x,\ y_i = y\}\, dx\, dy.
\tag{2.7}
$$
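For simulation, fiber thresholds distributed according to equation 2.5 can be drawn by noting that the density is, up to the fitted normalization constant, gaussian in u = log10(a/k3) with variance 1/(2·k2). A minimal Python sketch (the sample size and random seed are arbitrary choices, not from the text):

```python
import numpy as np

k2, k3 = 2.82, 12.0
rng = np.random.default_rng(0)

# Equation 2.5 is (approximately) normal in u = log10(a / k3)
# with standard deviation 1 / sqrt(2 * k2).
u = rng.normal(0.0, 1.0 / np.sqrt(2.0 * k2), size=100_000)
thresholds = k3 * 10.0 ** u  # sampled threshold amplitudes a_i

# A fiber with threshold a_i is inactive when the effective amplitude A < a_i;
# at A = k3 (the median threshold), about half the fibers should be inactive.
A = k3
print(np.mean(thresholds > A))
```

Because the fitted constants make the density integrate to unity only approximately, this sampler is exact only up to a normalization error of a fraction of a percent.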
Equation 2.7 is a generalized expression that accommodates different anatomical distributions of receptive field centers. In the next section, two hypothetical cases are studied. The randomly uniform distribution is the simplest case, in which the receptive fields are equally likely to be located anywhere on the skin. The gaussian case is a special one in which the innervation density increases distally toward the fingertip (Güçlü & Bolanowski, 2002). Although the actual statistical model of the distribution of receptive fields is not known, Johansson and Vallbo (1979) estimate the density of RA fibers in the human hand at about 0.25 fibers per mm² at the palm, 1.40 fibers per mm² at the fingertip, and an average of 0.75 fibers per mm² across the finger. The average innervation density (d) is varied as a parameter in this report. It is important to note that Meissner corpuscles (see section 1) are orderly arranged at both sides of the intermediate ridge and follow the contours of the fingerprints (Cauna, 1965; Bolanowski, Pawson, & Dishman, 2000; Bolanowski & Pawson, in press). Their density is much higher than the innervation density of RA fibers found by physiological recording, because each fiber can innervate multiple corpuscles. However, the change of density along the proximo-distal axis is similar in both anatomical and physiological data.

2.2 Experiments. Psychophysical data in Güçlü (2003) were obtained with a two-interval forced-choice paradigm using five human subjects. The stimuli were bursts of 0.5 s duration, 40 Hz sine waves superimposed on a static indentation of 0.5 mm. The stimulus bursts started and ended as cosine-squared ramps with 50 ms rise and fall times. Each interval was initiated by a high-level, 1 s duration masking stimulus (250 Hz), which preceded the test stimulus by 100 ms. This forward-masking procedure masked the P channel (see Bolanowski et al., 1988) so that the 40 Hz stimulus was detected by the NP I channel.
The stimuli were applied at five different proximo-distal locations on the terminal phalanx using a 2 mm radius contactor. For each subject and each contactor location, thresholds were tracked four times at the 75% correct detection probability level (Zwislocki & Relkin, 2001). The psychometric functions were obtained from the tracking data. Using the tracking data, the number of subjects who detected the stimulus with probability greater than 50% was found. The ratio of this number to the total number of subjects is the probability of detection across subjects,
and this probability is directly comparable to the theoretical predictions presented here. If a subject detected the stimulus exactly 50% of the time at a given stimulus amplitude, that subject was excluded from the calculations. The psychometric data were fitted with logistic distributions of the form:

$$
P = \frac{1}{1 + \exp\left(-\dfrac{A_s - \alpha}{\beta}\right)},
\tag{2.8}
$$
where As is the stimulus amplitude and P is the probability of detection across subjects. The parameters α and β determine the mean and the variance of distribution 2.8, respectively. For the psychometric functions at the different contactor locations, the R² values obtained from logistic curve fitting are 0.906, 0.837, 0.956, 0.701, and 0.890 (proximal to distal). At a given stimulus amplitude, the probability of detection as a function of contactor location was found by reading values from the psychometric functions at the corresponding contactor locations.

3 Results

When the innermost integral in equation 2.7 is evaluated, the following expression is obtained:

$$
\hat{I}(x,y) = \int_{A(x,y)}^{\infty} f(a)\, da = \frac{1}{2}\left[1 - \operatorname{erf}\left(\sqrt{k_2}\,\log_{10}\frac{A(x,y)}{k_3}\right)\right].
\tag{3.1}
$$
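This closed form can be checked against direct numerical integration of the threshold density of equation 2.5. A sketch in Python/SciPy, using the parameter values quoted above (note that the fitted constants normalize the density only approximately, so agreement is to within about 1%):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

k1, k2, k3 = 0.41, 2.82, 12.0

def f(a):
    """Threshold density, equation 2.5."""
    return (k1 / a) * np.exp(-k2 * np.log10(a / k3) ** 2)

def I_closed(A):
    """Equation 3.1: probability that a fiber's threshold exceeds A."""
    return 0.5 * (1.0 - erf(np.sqrt(k2) * np.log10(A / k3)))

A = 5.0  # an example effective amplitude in micrometers
numeric, _ = quad(f, A, np.inf)
print(numeric, I_closed(A))
```

At A = k3 the closed form gives exactly one-half, which is the statement made in the text about the normalized density.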
(The coefficient one-half arises because the density function is normalized.) By assuming a rectangular phalanx with length l = 30 mm and width w = 20 mm, equation 2.7 becomes:

$$
p = \int_0^w \int_0^l \hat{I}(x,y)\, h\{x_i = x,\ y_i = y\}\, dx\, dy.
\tag{3.2}
$$
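The double integral can be evaluated numerically (the text uses MATLAB; the sketch below uses Python/SciPy). The attenuation profile of equation 2.4 is not legible in this reproduction, so the A_eff below is a hypothetical stand-in (stimulus amplitude inside the contactor, inverse-distance decay outside); only the integration machinery mirrors the text, and the resulting value is illustrative only:

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.special import erf

k2, k3 = 2.82, 12.0
l, w = 30.0, 20.0                       # phalanx length and width in mm
xs, ys, rs, As = 5.0, 10.0, 2.0, 3.0   # hypothetical contactor placement/amplitude

def A_eff(x, y):
    """HYPOTHETICAL attenuation profile standing in for equation 2.4."""
    r = np.hypot(x - xs, y - ys)
    return As if r <= rs else As * rs / r

def I_hat(x, y):
    """Inner integral of equation 2.7 in the closed form of equation 3.1."""
    return 0.5 * (1.0 - erf(np.sqrt(k2) * np.log10(A_eff(x, y) / k3)))

# Equation 3.2 with a uniform location density h = 1/(l*w).
# dblquad integrates the FIRST argument (x, over 0..l) innermost.
p, _ = dblquad(lambda x, y: I_hat(x, y) / (l * w),
               0.0, w, 0.0, l, epsabs=1e-6, epsrel=1e-6)
print(p)  # probability that a single fiber is inactive
```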
When the necessary substitutions are made in equation 3.2, it can be solved by numerical methods. MATLAB Version 5.3 (MathWorks, Inc., Natick, MA) was used for this purpose. Finally, p is substituted in equation 2.3 to obtain the following probabilities:

$$
\begin{aligned}
\Pr\{\text{at least } 0 \text{ active fibers}\} &= 1\\
\Pr\{\text{at least } 1 \text{ active fiber}\} &= \Pr\{\text{at least } 0 \text{ active fibers}\} - p^N\\
\Pr\{\text{at least } 2 \text{ active fibers}\} &= \Pr\{\text{at least } 1 \text{ active fiber}\} - N(1-p)p^{N-1}
\end{aligned}
$$
Figure 1: Population response of an RA population with a gaussian distribution of receptive field centers on the terminal phalanx. The average innervation density is 0.75 fibers per mm². The stimulus contactor, depicted by the circular line, is located 5 mm proximal to the fingertip (0 mm). Given the stimulus amplitude of 30 μm, some units (circles) are entrained at 40 spikes per second; the remaining units have average firing rates below that value. Inactive fibers are not shown.
$$
\Pr\{\text{at least } 3 \text{ active fibers}\} = \Pr\{\text{at least } 2 \text{ active fibers}\} - \frac{N(N-1)}{2}(1-p)^2 p^{N-2},
\tag{3.3}
$$

and so on.
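The sequential scheme of equation 3.3 can be sketched as follows. As a consistency check, the recursion must agree with the binomial tail sum, since the fibers are independent with identical inactivity probability p (the values of N and p below are arbitrary illustrative choices):

```python
from math import comb

def at_least(n_max, N, p):
    """Equation 3.3: Pr{at least n active fibers} for n = 0..n_max,
    computed sequentially from Pr{at least n-1} via equation 2.3."""
    probs = [1.0]  # Pr{at least 0 active fibers} = 1
    for n in range(1, n_max + 1):
        probs.append(probs[-1]
                     - comb(N, n - 1) * (1 - p) ** (n - 1) * p ** (N - n + 1))
    return probs

N, p = 50, 0.9  # illustrative population size and single-fiber inactivity probability
probs = at_least(10, N, p)

# Cross-check against the direct binomial tail sum.
direct = [sum(comb(N, k) * (1 - p) ** k * p ** (N - k) for k in range(n, N + 1))
          for n in range(11)]
print(probs[1], direct[1])
```

The first step reproduces Pr{at least 1} = 1 − p^N, and each later step subtracts exactly the binomial term for n − 1 active fibers.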
Discharge rates for a sample population generated by the computational model (Güçlü & Bolanowski, 2002) are shown in Figure 1. Here, a suprathreshold stimulus amplitude of 30 μm was applied using a circular contactor probe 5 mm proximal to the fingertip (0 mm). There are many active fibers (inactive fibers are not shown), some of which (the circles) are entrained by the sinusoidal stimulus. Although the activity is strongest just beneath the contactor (depicted by the thin circle), there is a random spread that depends on the thresholds of the individual RA fibers and the attenuation of the stimulus. Below, these factors are treated analytically to calculate the probability of detection for a given stimulus level, position, or innervation density.

3.1 Effects of Amplitude and Location in a Randomly Uniform Distribution. For this anatomical organization, the probability density is a constant. Note that the x- and y-coordinates are independent:

$$
h\{x_i = x,\ y_i = y\} = \frac{1}{l}\cdot\frac{1}{w} \quad \text{for } 0 \le x \le l\ \cap\ 0 \le y \le w.
\tag{3.4}
$$
Equation 3.4 is substituted in equation 3.2, and the latter is integrated at a given stimulus contactor location (5 mm) as a function of stimulus amplitude. The results for three average innervation densities are shown in Figure 2a, assuming that one active fiber is adequate for detection (see Pr{at least 1 active fiber} in equation 3.3). Sigmoid curves similar to psychometric functions are obtained. Given a human sample with an average innervation density of 0.75 fibers per mm², the probability of an active fiber is less than 10% for stimuli below 1 μm amplitude. However, there is about a 90% chance of activity if the amplitude is 4 μm. If the average innervation density is increased, the probability is higher, and the probability varies monotonically as a function of density. If equation 3.2 is integrated at a given stimulus amplitude (2 μm), the change in probability with respect to contactor location can be obtained (see Figure 2b). As could be expected from a randomly uniform distribution, the location of the contactor does not matter except at the terminal ends of the phalanx. At these regions, the number of fibers around the contactor is reduced, which causes a drop in the probability. Otherwise, the probability stays constant. Note that the 5 mm points in Figure 2b correspond to the same points for 2 μm stimulation in Figure 2a.

3.2 Effects of Amplitude and Location in a Gaussian Distribution. This anatomical organization has increasing innervation density toward the distal end of the phalanx and is more realistic than a uniform distribution. The density is constant along the medio-lateral axis. The increase of density along the proximo-distal axis is modeled as a half-gaussian with an added constant (Güçlü & Bolanowski, 2002).
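The half-gaussian-plus-constant location density must integrate to unity over the phalanx, which fixes its width σ once the distal and proximal densities are chosen. A numerical sketch of this determination, using the phalanx dimensions and the density scaling given in the text (the root-finding bracket is an assumption):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

l, w = 30.0, 20.0        # phalanx length and width in mm
d = 0.75                 # average innervation density (fibers per mm^2)
d_max = d * 1.40 / 0.75  # distal density, scaled from the human estimates
d_min = d * 0.25 / 0.75  # proximal density

def density_integral(sigma):
    """Integral over the phalanx of the half-gaussian-plus-constant
    location density; it must equal 1 for a proper probability density."""
    gauss, _ = quad(lambda x: np.exp(-x ** 2 / (2 * sigma ** 2)), 0.0, l)
    return ((d_max - d_min) * gauss + d_min * l) / (d * l)

# Solve density_integral(sigma) = 1 for sigma.
sigma = brentq(lambda s: density_integral(s) - 1.0, 0.1, 100.0)
print(sigma)
```

The width term in the integrand cancels because the density is constant along the medio-lateral axis, so only the proximo-distal integral constrains σ.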
Since the x- and y-coordinates are independent, the joint probability density function is:

$$
h\{x_i = x,\ y_i = y\} = \left[\frac{d_{\max} - d_{\min}}{d\cdot l}\exp\left(-\frac{x^2}{2\sigma^2}\right) + \frac{d_{\min}}{d\cdot l}\right]\cdot\frac{1}{w} \quad \text{for } 0 \le x \le l\ \cap\ 0 \le y \le w.
\tag{3.5}
$$

In equation 3.5, the density at the distal end of the phalanx (dmax) and the density at the proximal end (dmin) are calculated from the average innervation density (d), which is input as a parameter. They are scaled according to the human density estimates: dmax = d × 1.40/0.75 and dmin = d × 0.25/0.75. With these substitutions made, σ can be uniquely determined because the integral of the density function must be unity. Equation 3.5 is substituted in equation 3.2, and the latter is integrated at a given stimulus contactor location (5 mm) as a function of stimulus amplitude for three average innervation density values. Assuming a single active fiber adequate for detection, sigmoid
Figure 2: Probability of detection for a randomly uniformly distributed RA population. At least one active fiber is assumed to be required for detection. (a) Effects of stimulus amplitude at three different average innervation densities. The stimulus contactor location is 5 mm. (b) Effects of stimulus contactor location at three different average innervation densities. The stimulus amplitude is 2 μm.
curves similar to the previous case are obtained (see Figure 3a). However, for the gaussian case, the probabilities are higher. For example, at d = 0.75 fibers per mm², the probability of at least one active fiber is greater than 10% for a 1 μm amplitude stimulus, whereas this probability is about 7% for the randomly uniform distribution (see Figure 2a). Similarly, the probability of an active fiber is greater than 95% when the stimulation amplitude is 4 μm (see Figure 3a). If equation 3.2 is integrated at a given amplitude (2 μm), the reason for the preceding results can be seen more clearly (see Figure 3b). The predicted probability of detection across subjects is much higher distally than in the uniform case and then decreases smoothly to below the uniform-case values at the proximal end. This is because, although the average innervation density is kept constant for both distributions, the gaussian case has a higher local density distally and a lower local density proximally. As in the previous case, the probability tapers off at the terminal ends of the phalanx (see Figure 3b) due to the decrease in the available fibers at those regions. For 50% detection across subjects of a stimulus applied 5 mm proximal to the fingertip, the stimulus amplitude should be 1.86 μm.

3.3 Effects of Average Innervation Density. Figures 2 and 3 were plotted with the average innervation density as a parameter. Clearly, a higher innervation density (i.e., more available fibers) yields a higher probability of active fibers, and vice versa. However, it is of interest to find the relation between the probability of at least one active fiber and the average density explicitly. For this purpose, equation 3.2 is integrated for one pair of stimulus amplitude and location values, but the number of fibers (N) in equation 3.3 is varied. The results for the two distributions when the stimulation is at a relatively distal location are shown in Figure 4. The probability increases with decreasing slope for both distributions.
That is, the probability curves have negative second derivatives. At the given stimulation configuration, detection across subjects is almost certain for the gaussian case with an average innervation density of 1.5 fibers per mm². The gaussian probability curve is also higher than the curve for the uniform distribution. However, as can be expected from the preceding arguments, the gaussian curve would be lower if the stimulation location were close to the proximal end.

3.4 Comparison of the Predictions from the Uniformly Distributed Population with Psychophysical Data. Equation 3.2 is integrated as a function of stimulus amplitude at a given contactor location, assuming a randomly uniform distribution of receptive field centers. The contactor location is the approximate midpoint of the terminal phalanx along the proximo-distal axis (17 mm). The probability of detection across subjects is found for multiple numbers (n) of fibers using equation 3.3 in a population with d = 0.75 fibers per mm². The thin lines in Figure 5a show the change in the theoretical probability of detection as more fibers are required for the detection
Figure 3: Probability of detection for a gaussian distributed RA population. At least one active fiber is assumed to be required for detection. (a) Effects of stimulus amplitude at three different average innervation densities. The stimulus contactor location is 5 mm. (b) Effects of stimulus contactor location at three different average innervation densities. The stimulus amplitude is 2 μm.
Figure 4: Effects of average innervation density. The probability of detection is plotted for both uniform and gaussian distributions. At least one active fiber is assumed to be required for detection. The stimulus contactor is located at 5 mm; its amplitude is 3 μm.
process. As can be predicted, as the number of required fibers increases, the probability of detection at a given stimulus amplitude decreases. This can be seen in Figure 5a: the psychometric functions shift to higher amplitudes as more fibers contribute to the detection process. Note, however, that there is a slight decrease in the slope of the theoretical psychometric functions as the detection rule is made more stringent. The curve for the probability of at least one active fiber (n = 1) can be considered the absolute limit for detection. The psychophysical data, shown by the circles, do not lie along this absolute limit. The thick line, which is the logistic curve through the psychophysical data points, is closest to the theoretical psychometric curve for n = 7. Considering the variation of the psychophysical data points and assuming a randomly uniform distribution of RA fibers, it can be concluded that 5 to 10 fibers are required for detection. A similar analysis regarding the number of required fibers can be made if the probabilities of detection across subjects are plotted as a function of contactor location (see Figure 5b). For this purpose, equation 3.2 is integrated as a function of contactor location at a given stimulus amplitude (7 μm), assuming a randomly uniform distribution of receptive field centers. The probability of detection across subjects is found for multiple numbers (n) of fibers using equation 3.3 in a population with d = 0.75 fibers per mm². The thin lines in Figure 5b show the change in the theoretical probability of detection as more fibers are required for the detection process. The thin curves are about constant along the terminal phalanx, which shows the
Figure 5: Probability of detection in a randomly uniformly distributed RA population with average innervation density d = 0.75 fibers per mm². Probabilities are plotted (thin lines) by assuming that at least n active fibers are required for detection. (a) Effects of stimulus amplitude as n is sequentially varied. The stimulus contactor location is 17 mm. The circles are psychophysical probabilities of detection across subjects obtained at the same contactor location. The thick line is the logistic function fitted through the psychophysical data points. (b) Effects of stimulus contactor location as n is sequentially varied. The stimulus amplitude is 7 μm. The circles are psychophysical probabilities of detection across subjects obtained at the same stimulus amplitude. The thick line is a cubic interpolation through the psychophysical data points.
uniform distribution of receptive field centers; the probability of detection decreases as more fibers are required. However, the psychophysical data, shown with circles, are not constant as a function of contactor location. This points to the possibility of a different anatomical organization in the terminal phalanx. The psychophysical data are within the range of probabilities for five to nine fibers required for detection.

3.5 Comparison of the Predictions from the Gaussian Distributed Population with Psychophysical Data. The analysis of section 3.4 is repeated for a gaussian distribution of receptive field centers. The psychometric functions for this distribution (the thin lines in Figure 6a) are more widely separated along the amplitude axis than the functions for the uniform distribution. Again, there is a decrease in the slope of the theoretical psychometric functions as the detection rule is made more stringent. The curve for the probability of at least one active fiber (n = 1) can be considered the absolute limit for detection. The psychophysical data, shown by the circles, do not lie along this absolute limit. The thick line, which is the logistic curve through the psychophysical data points, lies between the theoretical psychometric curves for n = 5 and n = 6. The spread of the data points suggests that three to nine fibers are required for detection, given a gaussian distribution of receptive field centers. The same parameters as in section 3.4 are used to find the probabilities of detection across subjects as a function of contactor location for a gaussian distribution (see Figure 6b). The theoretical probabilities of detection (thin lines) have peaks at distal locations on the terminal phalanx, reflecting the higher distal innervation density of the gaussian distribution. However, the psychophysical data, shown with circles, follow an almost opposite trend. The highest psychophysical probability is at a proximal location.
This points to the possibility of a different anatomical organization in the terminal phalanx. The psychophysical data are within the range of probabilities for 3 to 13 fibers required for detection. However, since the theoretical and the experimental curves are so different, this estimate for the required number of fibers is somewhat wide.

4 Discussion

In this article, the probability of detection in a model population of RA mechanoreceptive fibers is presented by assuming that one or more fibers contribute to the detection process. The analysis is a step toward linking the physiology of receptor and fiber populations with psychophysics. The calculations are based on a previous computational model (Güçlü & Bolanowski, 2002). Two hypothetical anatomical distributions of receptive field centers were considered: randomly uniform and gaussian. The gaussian distribution has increasing innervation density distally and is more realistic. However, for both distributions, changing the stimulus amplitude affects the
Figure 6: Probability of detection in a gaussian distributed RA population with average innervation density d = 0.75 fibers per mm². Probabilities are plotted (thin lines) by assuming that at least n active fibers are required for detection. (a) Effects of stimulus amplitude as n is sequentially varied. The stimulus contactor location is 17 mm. The circles are psychophysical probabilities of detection across subjects obtained at the same contactor location. The thick line is the logistic function fitted through the psychophysical data points. (b) Effects of stimulus contactor location as n is sequentially varied. The stimulus amplitude is 7 μm. The circles are psychophysical probabilities of detection across subjects obtained at the same stimulus amplitude. The thick line is a cubic interpolation through the psychophysical data points.
probability of detection in a sigmoidal fashion. The effects of the different distributions can be seen by shifting the stimulus contactor location. The gaussian case predicts that the probability of detection is higher distally than proximally, because the local innervation density is higher toward the fingertip. In addition, the change of probability versus average innervation density was presented, which shows a similar relationship. Recent experiments by Güçlü (2003) permit the comparison of the above predictions with the probabilities of detection obtained from the NP I channel. The psychophysical probability of detection follows a sigmoid trend as a function of stimulus amplitude, similar to the predicted results. However, the psychophysical results do not match the absolute limit for detection: one active fiber. The psychophysical probabilities are closer to the predicted probabilities for at least 5 to 10 active fibers. A closer estimate can be obtained when more subjects are used in the experiments, because the probabilities reflect the variance across subjects, as noted. However, at this point, it is safe to say that multiple fibers are required for detection, and the previous results by Vallbo and colleagues should be approached with caution. It is most probable that multiple tactile fibers were activated in their experiments (see section 1). Studying the probability of detection as a function of stimulus-contactor location can give information about the underlying innervation pattern of the RA fibers. The psychophysical probabilities were found to be much different from the predicted probabilities for the uniform and gaussian distributions of receptive field centers. In contrast to previous physiological data that showed higher innervation distally, the psychophysical results have probability peaking proximally. However, a firm conclusion about the innervation pattern cannot be reached, because the probabilities of detection also depend on the mechanics of the skin.
The impedance of the skin surface was found to vary along the proximo-distal direction (Güçlü, 2003). It may be that the empirical function (see equation 2.4) by Johnson (1974) is not applicable at every location on the terminal phalanx. The two confounding variables, skin mechanics and innervation pattern, can be disentangled only by future physiological experiments. Additionally, psychophysical experiments using more subjects may give a more consistent picture of the probability of detection as a function of contactor location. In many previous psychophysical experiments, a contactor surround was typically used to limit the spread of activity. Here, the stimulus affects the whole terminal phalanx, albeit distant regions only minimally (see equation 2.4). The mechanical properties of the skin are greatly simplified by this equation, and the results are valid only for the given contactor size (radius = 2 mm). Previous psychophysical studies done with a contactor surround show that there is no spatial summation for the NP I channel (Verrillo, 1963; Craig, 1976). This implies that changing the contactor size has no psychophysical effect when a sinusoidal vibratory stimulus is used. This phenomenon must be reexamined with psychophysical experiments similar to those presented here and in Güçlü (2003), which did not use a contactor surround. The psychophysical implications of this study can be assessed by considering the variability of the population response, which makes the term detection defined only in probabilistic terms. This is not uncommon in sensory science. Psychometric functions quantify the behavioral response of a subject in a probabilistic way. In fact, the psychometric function itself is sigmoidal, similar to the results above, and gives the probability of accomplishing a certain task with respect to a stimulus parameter. However, psychophysics incorporates both peripheral and central processing. The computational model used here regards the periphery solely, and the probabilistic nature of the neural response is represented by the stochastic models of intensity characteristics and receptive field location (Güçlü & Bolanowski, 2002). When the effective stimulus amplitude is higher than a fiber's threshold, the fiber is assumed to be active. Any time-dependent property is neglected at this time. Once the thresholds are assigned to fibers, the physiological configuration is fixed. Therefore, it is not appropriate to correlate the probability of detection calculated here with a psychometric function obtained from one subject. On the other hand, since the stochastic model gives the distribution in fiber populations, the results can be related to variation between subjects or different skin regions (phalanges). Therefore, the psychophysical probabilities of detection across subjects were presented for comparison. Characterizing the peripheral response in a more elaborate model can furnish a more useful interpretation of the theoretical analysis (e.g., within subjects) in the future. Analysis of the peripheral population response can only suggest the limits of sensory input. According to the results presented here, the limit for detection in the NP I channel cannot be a single active fiber.
Multiple fibers were found to be required for detection. It is important to note that central somatosensory neurons have complex response properties. Here, each fiber was treated independently, and nothing was said about the connectivity of these neurons. However, cortical inhibition was demonstrated long ago (Mountcastle, 1957). Intracellular recording reveals complex nonlinear interactions in cortical neurons (Kang, Herman, MacGillis, & Zarzecki, 1985). To link the peripheral response with psychophysics more reliably, these and many other issues need to be taken into account.

Acknowledgments

This work was supported by NIH DC 00098 and NS 36816. We thank Laurel H. Carney for critical reading of the manuscript and for her helpful comments.

References

Bolanowski, S. J., Gescheider, G. A., Verrillo, R. T., & Checkosky, C. M. (1988).
Four channels mediate the mechanical aspects of touch. J. Acoust. Soc. Am., 84, 1680–1694.
Bolanowski, S. J., & Pawson, L. (in press). Organization of Meissner corpuscles in the glabrous skin of monkey and cat. Somatosens. Mot. Res.
Bolanowski, S. J., Pawson, L., & Dishman, J. D. (2000). The organization of Meissners and Merkels in cat and monkey glabrous skin. Soc. Neurosci. Abstr., 26, 426.
Cauna, N. (1965). The effects of aging on the receptor organs of the human dermis. In W. Montagna (Ed.), Advances in biology of skin, Vol. 6: Aging (pp. 63–96). Oxford: Pergamon.
Craig, J. C. (1976). Attenuation of vibrotactile spatial summation. Sens. Proc., 1, 40–56.
Gescheider, G. A., Bolanowski, S. J., & Hardick, K. R. (2001). The frequency selectivity of information-processing channels in the tactile sensory system. Somatosens. Mot. Res., 18, 191–201.
Goodwin, A. W., & Pierce, M. E. (1981). Population of quickly adapting mechanoreceptive afferents innervating monkey glabrous skin: Representation of two vibrating probes. J. Neurophysiol., 45, 243–253.
Goodwin, A. W., Youl, B. D., & Zimmerman, N. P. (1981). Single quickly adapting mechanoreceptive afferents innervating monkey glabrous skin: Response to two vibrating probes. J. Neurophysiol., 45, 227–242.
Güçlü, B. (2003). Computational studies on rapidly-adapting mechanoreceptive fibers. Unpublished doctoral dissertation, Syracuse University.
Güçlü, B., & Bolanowski, S. J. (1999). Population responses of RA mechanoreceptors with hypothetical spatial distributions. Soc. Neurosci. Abstr., 25, 407.
Güçlü, B., & Bolanowski, S. J. (2001). Statistical comparison of the intensity characteristics of monkey and cat mechanoreceptive RA fibers. Soc. Neurosci. Abstr., 27, Program 50.4.
Güçlü, B., & Bolanowski, S. J. (2002). Modeling population responses of rapidly-adapting mechanoreceptive fibers. J. Comput. Neurosci., 12, 201–218.
Güçlü, B., & Bolanowski, S. J. (2003).
Distribution of the intensity-characteristic parameters of cat rapidly-adapting mechanoreceptive fibers. Somatosens. Mot. Res., 20, 149–155.
Hensel, H., & Boman, K. K. A. (1960). Afferent impulses in cutaneous sensory nerves in human subjects. J. Neurophysiol., 23, 564–578.
Jänig, W., Schmidt, R. F., & Zimmermann, M. (1968). Single unit responses and the total afferent outflow from the cat's foot pad upon mechanical stimulation. Exp. Brain Res., 6, 100–115.
Johansson, R. S. (1976). Receptive field sensitivity profile of mechanosensitive units innervating the glabrous skin of the human hand. Brain Res., 104, 330–334.
Johansson, R. S. (1978). Tactile sensibility in the human hand: Receptive field characteristics of mechanoreceptive units in the glabrous skin area. J. Physiol. London, 281, 101–123.
Johansson, R. S., Landström, U., & Lundström, R. (1982). Responses of mechanoreceptive afferent units in the glabrous skin of the human hand to sinusoidal skin displacements. Brain Res., 244, 17–25.
Johansson, R. S., & Vallbo, Å. B. (1979). Tactile sensibility in the human hand: Relative and absolute densities of four types of mechanoreceptive units in glabrous skin. J. Physiol. London, 286, 283–300.
Johnson, K. O. (1974). Reconstruction of population response to a vibratory stimulus in quickly adapting mechanoreceptive afferent fiber population innervating glabrous skin of the monkey. J. Neurophysiol., 37, 48–72.
Kang, R., Herman, D., MacGillis, M., & Zarzecki, P. (1985). Convergence of sensory inputs in somatosensory cortex: Interactions from separate afferent sources. Exp. Brain Res., 57, 271–278.
Knibestöl, M., & Vallbo, Å. B. (1970). Single unit analysis of mechanoreceptor activity from the human glabrous skin. Acta Physiol. Scand., 80, 178–195.
Lindblom, U. (1965). Properties of touch receptors in distal glabrous skin of the monkey. J. Neurophysiol., 28, 966–985.
Mountcastle, V. B. (1957). Modality and topographic properties of single neurons of cat's somatic sensory cortex. J. Neurophysiol., 20, 408–434.
Ochoa, J., & Torebjörk, E. (1983). Sensations evoked by intraneural microstimulation of single mechanoreceptor units innervating the human hand. J. Physiol. London, 342, 633–654.
Pubols, B. H., & Pubols, L. M. (1976). Coding of mechanical stimulus velocity and indentation depth by squirrel monkey and raccoon glabrous skin mechanoreceptors. J. Neurophysiol., 39, 773–787.
Talbot, W. H., Darian-Smith, I., Kornhuber, H. H., & Mountcastle, V. B. (1968). The sense of flutter-vibration: Comparison of the human capacity with response patterns of mechanoreceptive afferents from the monkey hand. J. Neurophysiol., 31, 301–334.
Vallbo, Å. B. (1981). Sensations evoked from the glabrous skin of the human hand by electrical stimulation of unitary mechanosensitive afferents. Brain Res., 215, 359–363.
Vallbo, Å. B., & Johansson, R. (1976). Skin mechanoreceptors in the human hand: Neural and psychophysical thresholds. In Y.
Zotterman (Ed.), Sensory functions of the skin in primates, with special reference to man (pp. 185–199). Oxford: Pergamon. Vallbo, ÊA. B., Olsson, K. ÊA., Westberg, K.-G., & Clark, F. J. (1984). Microstimulation of single tactile afferents from the human hand: Sensory attributes related to unit type and properties of receptive elds. Brain, 107, 727–749. Verrillo, R. T. (1963). Effect of contactor area on the vibrotactile threshold. J. Acoust. Soc. Am., 35, 1962–1966. Zwislocki, J. J., & Relkin, E. M. (2001). On a psychophysical transformed-rule up and down method converging on a 75% level of correct responses. PNAS, 98, 4811–4814. Received January 17, 2003; accepted June 25, 2003.
LETTER
Communicated by C. Lee Giles
A Machine Learning Method for Extracting Symbolic Knowledge from Recurrent Neural Networks A. Vahed [email protected]
C.W. Omlin [email protected] Department of Computer Science, University of the Western Cape, 7535 Bellville, South Africa
Neural networks do not readily provide an explanation of the knowledge stored in their weights as part of their information processing. Until recently, neural networks were considered to be black boxes, with the knowledge stored in their weights not readily accessible. Since then, research has resulted in a number of algorithms for extracting knowledge in symbolic form from trained neural networks. This article addresses the extraction of knowledge in symbolic form from recurrent neural networks trained to behave like deterministic finite-state automata (DFAs). To date, methods used to extract knowledge from such networks have relied on the hypothesis that networks' states tend to cluster and that clusters of network states correspond to DFA states. The computational complexity of such a cluster analysis has led to heuristics that either limit the number of clusters that may form during training or limit the exploration of the space of hidden recurrent state neurons. These limitations, while necessary, may lead to decreased fidelity, in which case the extracted knowledge may not model the true behavior of the trained network, perhaps not even for the training set. The method proposed here uses a polynomial-time, symbolic learning algorithm to infer DFAs solely from the observation of a trained network's input-output behavior. Thus, this method has the potential to increase the fidelity of the extracted knowledge.

1 Introduction

1.1 Motivation. One of the reasons that symbolic reasoning systems such as expert systems have found acceptance is their ability to explain how they arrived at a solution for a given problem. Neural networks do not intrinsically have such capabilities. The knowledge that neural networks have gained through training is stored in their weights. Until recently, it was a widely accepted myth that neural networks were black boxes; the knowledge distributed among their weights was not readily accessible to inspection, analysis, and verification.
Neural Computation 16, 59–71 (2004)
© 2003 Massachusetts Institute of Technology

Since then, research on this topic has resulted in a number of algorithms for extracting knowledge in symbolic form from trained neural networks. The goal of knowledge extraction is to generate a concise symbolic description of the knowledge stored in a network's weights. Of particular concern is the fidelity of the extraction process: how accurately the extracted knowledge corresponds to the knowledge stored in the network. Unfortunately, rule extraction is a computationally very hard problem. For feedforward networks, it has been shown that there do not exist polynomial-time algorithms for concise knowledge extraction (Golea, 1996). Although no corresponding results exist to our knowledge in the literature for recurrent networks, it is likely that a similar result applies. Thus, heuristics have been developed for overcoming the combinatorial complexity of the problem. The merits of rule extraction include discovery of unknown salient features and nonlinear relationships in data sets, explanation capability leading to increased user acceptance, improved generalization performance, and possibly transfer of knowledge to new yet similar learning problems. Some applications (e.g., application of neural networks to credit rating, lending policy, and critical applications such as aircraft control) may even require that neural networks undergo validation prior to being deployed. Knowledge extraction could be an important stage in that process. As we will see, improved generalization performance applies particularly to recurrent networks, whose nonlinear dynamical characteristics can easily lead to deteriorating generalization performance.

1.2 Classes of Extraction Algorithms. Extraction algorithms can be divided into three classes (Andrews, Diederich, & Tickle, 1995): decompositional, pedagogical, and eclectic. Decompositional methods infer rules from the internal network structure (individual nodes and weights).
Previous work on finite-state induction by decompositional methods analyzed the internal states of the neural network to identify the states of the extracted automaton. These include encodings based on a study of fixed points of the logistic sigmoid function (Omlin, Thornber, & Giles, 1998; Omlin & Giles, 1996a, 1996c) and algorithms that construct clusters that eventually denote the states in the network (Cleeremans, Servan-Schreiber, & McClelland, 1989; Das & Das, 1992; Omlin & Giles, 1996a). Pedagogical methods view neural networks as black boxes and use some machine learning algorithm to derive rules that explain the network behavior. The method presented here relies solely on the input-output behavior of the network and hence can be classified as pedagogical. The merits of this approach are that it is not confined to particular neural network architectures such as the one described in this article, and it does not require a priori declaration of extraction parameter values such as the quantization levels needed to define cluster grain in certain decompositional algorithms (Omlin & Giles, 1996a, 1996c).
Symbolic Knowledge from Recurrent Neural Networks
61
Algorithms that do not clearly fit into either class are referred to as eclectic; they may combine aspects of decompositional and pedagogical methods.

1.3 Overview. This letter describes a hybrid method to extract a deterministic finite-state automaton from a recurrent neural network trained on a set of labeled training strings. In section 2, we briefly introduce the second-order network architecture we used to learn DFAs. This is followed in section 3 by a summary of DFA extraction methods, all of which belong to the class of decompositional knowledge extraction algorithms. In section 4, we discuss a symbolic learning algorithm for inferring DFAs from training sets and its application to the extraction of DFAs from trained networks. A discussion of simulation results demonstrates the feasibility of extracting DFAs from observations of the input-output behavior of a trained recurrent neural network alone (i.e., without the need for an analysis of the output space of recurrent state neurons). Thus, our method belongs to the class of pedagogical knowledge extraction algorithms. A summary of the results and directions for future research in section 5 conclude this letter.

2 Recurrent Neural Networks and DFA Extraction

The relationship between finite state machines and recurrent neural networks has been explored in various ways (Elman, 1990; Watrous & Kuhn, 1992a; Pollack, 1991; Giles, Miller, Chen, Chen, et al., 1992; Zeng, Goodman, & Smyth, 1993). The outcome of these works shows that recurrent nets may indeed learn finite-state-like behavior (Cleeremans et al., 1989) and that second-order recurrent networks can be constructed such that the internal DFA state representations remain stable (Omlin & Giles, 1996a, 1996c).
However, problems still exist:

• A trained net may exhibit the correct finite-state behavior for short input strings (with lengths in the range of those used for training), but for longer input strings, the clusters interpreted as the learned states of the FSM begin to blur and eventually merge. This leads to degraded state representations and, hence, incorrect output. This instability makes it difficult for the net to generalize its learned behavior (Das & Das, 1992; Omlin, Giles, & Miller, 1992).

• Where long-term dependencies have to be learned, that is, where the output depends on inputs that were presented early during the processing of a string, gradient-descent algorithms suffer from the problem of vanishing gradients (Bengio, Simard, & Frasconi, 1994).

We use second-order recurrent neural networks with weights W_ijk to learn grammars. The network accepts a time-ordered sequence of inputs
62
A. Vahed and C. Omlin
and evolves with dynamics defined by the following equations:

    S_i^(t+1) = g(Ξ_i + b_i),     Ξ_i ≡ Σ_{j,k} W_ijk S_j^(t) I_k^(t),
where g is a sigmoid discriminant function and b_i is the bias associated with hidden recurrent state neuron S_i. The weights W_ijk are updated according to a second-order form of the real-time recurrent learning algorithm based on gradient descent (Williams & Zipser, 1989). (For more details, see Giles, Miller, Chen, Chen, et al., 1992.)

3 Knowledge Extraction

3.1 Preliminaries. Once a network is trained (or even during training), we want to extract meaningful internal representations of the network, such as rules. (For work on rule extraction from recurrent neural networks, see Hanson & Burr, 1991; Giles, Miller, Chen, Chen, et al., 1992; Watrous & Kuhn, 1992a; Zeng et al., 1993.) The conclusion of Cleeremans et al. (1989) was that the hidden unit activations represent past histories and that clusters of these activations can represent the states of the generating automaton. Giles, Miller, Chen, Chen, et al. (1992) showed that complete deterministic finite-state automata and their equivalence classes can be extracted from recurrent networks both during and after training. This was extended in Giles, Miller, Chen, Sun, et al. (1992) to include a method for extracting bounded "unknown" grammars from a trained recurrent network. Watrous and Kuhn (1992b) also proposed an alternative approach to state machine extraction. An algorithm that is able to infer large DFAs from randomly chosen strings has been proposed (Lang, 1992). It is based on Trakhtenbrot's tree contraction algorithm (Trakhenbrot & Barzdin, 1973) but allows for the absence of labels of some suffixes and for conflict resolution. The inferred DFA is not necessarily the smallest model of the unknown source grammar. That method has been the most successful: DFAs with nearly 800 states have been inferred from a sparse uniform distribution of example strings.
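As a concrete illustration, the second-order update of section 2 can be sketched as follows. This is a minimal sketch, not the authors' implementation: the network size, the one-hot input encoding, and the choice of the first state neuron as the acceptance output are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(W, b, S, I_t):
    """One second-order update: S_i(t+1) = g(sum_{j,k} W[i,j,k] S_j(t) I_k(t) + b_i)."""
    xi = np.einsum('ijk,j,k->i', W, S, I_t)
    return sigmoid(xi + b)

# Illustrative network: N state neurons, K input symbols (one-hot encoded).
rng = np.random.default_rng(0)
N, K = 3, 2
W = rng.uniform(-0.1, 0.1, size=(N, N, K))  # small random initial weights, as in section 4.3
b = np.zeros(N)

S = np.zeros(N)
S[0] = 1.0                       # assumed initial state
for sym in (0, 1, 1, 0):         # process the string "0110"
    S = step(W, b, S, np.eye(K)[sym])
accepted = S[0] > 0.5            # assume neuron 0 indicates acceptance
```

The einsum contracts the weight tensor against the current state vector and the current input vector, which is exactly the second-order (multiplicative) interaction that distinguishes this architecture from a first-order recurrent net.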
For the purpose of studying issues of knowledge representation in recurrent neural networks, we replace the inference engine that usually belongs to the class of symbolic learning algorithms with a recurrent neural network. A recurrent network is trained to classify positive and negative example strings correctly. Either the trained network can be used to classify new strings, or a regular grammar can be extracted from a trained network in the form of its corresponding DFA. We do not claim that our neural approach to regular language inference is competitive with the symbolic approach of algorithms that are tailored to the problem. 3.2 Clustering Through Partitioning. The theoretical foundation for DFA extraction based on partitioning of the state-space has been established in Casey (1996). An algorithm based on state-space partitioning was
proposed in Giles, Miller, Chen, Chen, et al. (1992). The extraction algorithm divides the output of each of the N state neurons into q intervals, or quantization levels, of equal size, producing q^N partitions in the space of the hidden state neurons. Obviously, the extracted DFA depends on the quantization level q chosen; in general, different DFAs will be extracted for different values of q. We choose the DFA M_s, which is the smallest consistent DFA. M_s can be found by extracting DFAs M_2, …, M_{s−1}, M_s in that order; the best model is the first consistent DFA M_s. We rely on the hypothesis that there always exists an s such that M_s is the smallest consistent DFA. The quality of the extracted rules can be improved through continuous network pruning and retraining (Giles & Omlin, 1994).

3.3 Explicit Cluster Modeling. The method proposed in Das and Mozer (1994) solves the problem of network instability, and thus permits simpler DFA extraction, by allowing discrete DFA representations to evolve during training. The premises of the method are that (1) an upper bound T on the number of states of the DFA to be learned is known a priori, (2) C = {c_1, c_2, …, c_T} are the activations of hidden neurons for the DFA states, (3) the formation of discrete internal DFA states is hindered by Gaussian noise in the weights, and (4) this noise decreases as training progresses. Empirical results for small toy problems demonstrated that the DFAs extracted using the self-clustering method appear to generalize better than either approaches that extract DFAs after learning or the self-stabilizing method, which is discussed in the next section. However, it is not clear whether these results are statistically significant and whether the addition of the entropy of the prior distribution of Gaussians in state-space helps or impedes learning.
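The quantization step of the partitioning method in section 3.2 can be sketched as follows: each hidden-state vector in [0, 1]^N is mapped to one of the q^N equal-size cells. The function name is hypothetical.

```python
import numpy as np

def partition_id(S, q):
    """Map a hidden-state vector S in [0,1]^N to one of q**N equal-size cells."""
    cells = np.minimum((np.asarray(S) * q).astype(int), q - 1)  # clamp S = 1.0 into the top cell
    return tuple(int(c) for c in cells)

# With q = 2, each neuron's output is split into [0, 0.5) and [0.5, 1].
assert partition_id([0.1, 0.6, 0.99], q=2) == (0, 1, 1)
```

In the partitioning method, the cells visited while the network processes strings are treated as candidate DFA states, and the extraction is repeated for q = 2, 3, … until the first consistent DFA M_s is found.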
Hanson and Negishi (2002) showed that abstract rules (i.e., grammars) can be extracted from sequences of arbitrary input symbols by inventing abstract representations that can also be generalized to new symbol sets or grammars. The network learns abstract structures of the input that are context sensitive, hierarchical, and based on the state variable of the finite state machines that the neural network has learned.

3.4 Discrete Representations. The training method proposed in Zeng et al. (1993) learns stable internal representations by using a discretized combination of network structures and a pseudo-gradient weight update algorithm. The states of networks trained with this method are isolated points in state-space instead of clusters of hidden neuron activations. The discretization biases network training toward learning DFAs that are not equivalent to the DFAs used to generate the training sets (Das & Mozer, 1994). Even with maximal stability of the internal DFA representation, the discrete networks' generalization performance can be inferior to that of networks trained without clustering and of networks trained with self-clustering.
3.5 Self-Clustering and Other Approaches. Alternatives to the dynamic state-space exploration discussed here are possible. For instance, hierarchical clustering algorithms can be used instead of state-space partitioning to find clusters of hidden neuron activations. Clustering can also be cast as a learning problem. A DFA extraction procedure based on self-organizing maps (SOMs) (Kohonen, 1995) has been discussed in Tino and Sajda (1995).

4 Machine Learning Method

The goal of any successful inference algorithm is the generation of a DFA that generalizes to unseen strings, that is, one that correctly classifies strings for which it is a priori unknown whether they are members of some regular language. An inference algorithm is guaranteed to find a DFA that explains the given example strings. However, that solution is in general not unique; there are many DFAs that could generate a given set of strings along with their classification. This ambiguity implies that any solution DFA M* may be different from the unknown source DFA M. However, there exists a condition under which a data set uniquely identifies a given unknown source DFA M. The following definition will be useful:

Definition 1. The distinguishability d(q_i, q_j) of two DFA states q_i and q_j is the length of the shortest string x such that the two states do not agree on the classification of x. The distinguishability d_max(M) of a DFA M is the maximum of d(q_i, q_j) over all pairs of states q_i and q_j in M.

Under certain conditions, DFAs are uniquely identifiable:

Theorem 1. Let depth(M) denote the maximum depth of the DFA M that accepts the language L(M). If M generates a set S of example strings that consists of all strings of length up to L_unique, where

    L_unique = d_max(M) + depth(M) + 1,

then M is uniquely identifiable.
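For illustration, both quantities in theorem 1 can be computed by breadth-first search when the DFA is given explicitly as transition and acceptance tables. The function names and the small example DFA (which accepts binary strings ending in "01") are hypothetical, not taken from the letter.

```python
from collections import deque

def dfa_depth(delta, q0):
    """depth(M): the longest shortest path from the start state (BFS)."""
    dist = {q0: 0}
    dq = deque([q0])
    while dq:
        q = dq.popleft()
        for a in delta[q]:
            r = delta[q][a]
            if r not in dist:
                dist[r] = dist[q] + 1
                dq.append(r)
    return max(dist.values())

def d_max(delta, accept):
    """d_max(M): over all state pairs, the longest shortest distinguishing string.
    BFS on pairs of states; a pair is distinguished when acceptance disagrees."""
    states = list(delta)
    best = 0
    for i, p in enumerate(states):
        for q in states[i + 1:]:
            dist = {(p, q): 0}
            dq = deque([(p, q)])
            while dq:
                u, v = dq.popleft()
                if accept[u] != accept[v]:
                    best = max(best, dist[(u, v)])
                    break
                for a in delta[u]:
                    nxt = (delta[u][a], delta[v][a])
                    if nxt not in dist:
                        dist[nxt] = dist[(u, v)] + 1
                        dq.append(nxt)
    return best

# DFA over {0,1} accepting strings that end in "01" (state 2 accepts).
delta = {0: {0: 1, 1: 0}, 1: {0: 1, 1: 2}, 2: {0: 1, 1: 0}}
accept = {0: False, 1: False, 2: True}
L_unique = d_max(delta, accept) + dfa_depth(delta, 0) + 1
```

For this DFA, states 0 and 1 first disagree after reading "1", so d_max(M) = 1; depth(M) = 2; hence L_unique = 4.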
It is this property of DFA inference, along with the capability of recurrent networks to generalize to previously unseen strings, that makes it possible to extract DFAs from trained networks without the need for a cluster analysis in the output space of recurrent state neurons. We simply rely on a trained network's generalization ability to provide the labels of strings that were missing from a training set in order to obtain "complete" sets, that is, sets of all strings up to length L_unique. Next, we briefly discuss a polynomial-time algorithm for inferring unique DFAs from such complete training sets.
4.1 Trakhenbrot-Barzdin Algorithm. The Trakhenbrot-Barzdin (TB) algorithm (Trakhenbrot & Barzdin, 1973) extracts minimal DFAs from a data set of input strings and their respective output states. We distinguish between the strings that are members of a regular language and those that are not:

Definition 2. For the regular language L, let W+ = {w | w ∈ L} be the set of all positive strings and W− = Σ* − W+ the set of all negative strings of L.

A restricting premise of the TB algorithm is that the sets W− and W+ up to the string length L_unique must be known if the desired DFA is to be extracted uniquely. For the purpose of the algorithm, these sets are represented by means of a prefix tree, as in Figure 1. The prefix tree should thus be complete at least up to the depth required for the TB algorithm to extract the target DFA unambiguously. In the next section, we discuss how the TB algorithm can be used to extract DFAs from trained recurrent neural networks.

4.2 Symbolic Knowledge Extraction. Similar to the DFA extraction algorithm based on state-space partitioning, we observe the output of trained networks for all strings up to a given string length. Algorithm 1 broadly illustrates the extraction procedure. We apply the TB algorithm to the prefix tree corresponding to the resulting DFA extraction data set until a DFA is extracted that is consistent with the original training set used to train the recurrent network, that is, until the outputs of the
Figure 1: Prefix tree. A prefix tree of depth (i.e., string length) 3. Accepting states are shown as double circles.
recurrent network and the inferred DFA agree for all strings in the original training set. It is our hypothesis that the first consistent DFA is also the best model of the unknown source DFA that generated the training set.

Algorithm 1.
  Let M be the universal DFA that rejects (or accepts) all strings, and let the string length L = 0.
  Repeat
    L ← L + 1
    Generate all strings of length up to L
    Extract M using TB(L)
  Until M is consistent with the test set

The method described here holds several advantages over the clustering methods mentioned earlier. Being a black-box approach, it obviates the need for user selection of the quantization levels required for clustering by partitioning, although it involves a network training regime of identical computational complexity. Moreover, it avoids the need to declare a priori the number of states for explicit cluster modeling. Nonlinear dynamical systems such as recurrent networks may correctly classify all strings of a training set, but the internal DFA representation can become unstable, leading to misclassification of long strings that were not part of the training set (Omlin & Giles, 1996b). Since this algorithm converges more rapidly to a solution in terms of string length, this deterioration of generalization performance poses a much less significant disadvantage. Results of performance comparisons in the next section show that the quality of this extraction technique, measured in terms of accuracy and fidelity, compares very favorably with another method based on state-space clustering.

4.3 Empirical Results. We considered 10-state DFAs for a regular language over the alphabet {0, 1} as the case study. This DFA size was selected principally because of considerations of the computational cost of training larger neural networks. Twenty-five networks of the type described in section 2 were trained on data sets consisting of the first 50, 100, 300, and 700 strings generated in lexicographic order.
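The labeling step of Algorithm 1 can be sketched as follows, with the trained network treated as a black-box oracle. The function names are hypothetical, and the stand-in oracle (a parity language) merely substitutes for one of the trained networks of the experiments.

```python
from itertools import product

def label_strings_up_to(network_accepts, L, alphabet=(0, 1)):
    """Query the trained network (as a black-box oracle) on every string of
    length 0..L, producing the 'complete' labeled set fed to the TB algorithm."""
    labeled = {}
    for n in range(L + 1):
        for s in product(alphabet, repeat=n):
            labeled[s] = network_accepts(s)
    return labeled

def consistent(dfa_accepts, labeled):
    """Stopping test of Algorithm 1: does the inferred DFA agree with the data?"""
    return all(dfa_accepts(s) == lab for s, lab in labeled.items())

# Stand-in oracle: a network that has learned "even number of 1s".
oracle = lambda s: s.count(1) % 2 == 0
labeled = label_strings_up_to(oracle, 3)   # 1 + 2 + 4 + 8 = 15 strings
```

The labeled set is exactly the prefix tree data the TB algorithm consumes; the outer loop of Algorithm 1 grows L and reruns the inference until `consistent` holds on the training strings.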
The network contained nine hidden ("state") neurons and a single output neuron indicating the final state of the network. All weights were initialized to random values in the interval [−0.1, 0.1]. The learning rate α and momentum η were both set to 0.5. Student's t confidence intervals for the mean percentage error are reported in Table 1. The table compares the performance of the TB algorithm with that of state-space clustering for three test sets: all strings of length up to 15, denoted by A (65,000 strings); 1000 randomly generated strings of length 100 (B); and 1000 randomly generated strings of length 1000 (C).
Table 1: Extracted DFAs for Networks Trained on Complete Sets.

Test Set   T    TB Extraction          Clustering             Neural Net
                μ(%)   t(24,0.05)      μ(%)   t(24,0.05)      μ(%)   t(24,0.05)
A          700  0.0    0.0–0.0         0.4    0.1–0.7         1.1    0.7–1.5
A          300  0.0    0.0–0.0         2.0    1.2–2.8         7.7    5.1–10.4
A          100  33.6   22.1–45.1       34.2   29.3–39.2       33.5   22.1–45.0
A          50   38.6   25.4–51.8       37.4   31.8–43.0       38.5   25.3–51.7
B          700  0.0    0.0–0.0         1.2    0.3–2.1         9.5    6.2–12.7
B          300  0.0    0.0–0.0         4.7    2.5–6.9         26.7   17.5–35.8
B          100  39.4   25.9–52.9       36.0   29.3–42.7       39.0   25.7–52.4
B          50   41.5   27.3–55.7       38.6   31.1–46.2       41.5   27.3–55.7
C          700  0.0    0.0–0.0         1.3    0.2–2.4         10.9   7.2–14.7
C          300  0.0    0.0–0.0         5.1    2.8–7.4         26.9   17.7–36.1
C          100  40.4   26.5–54.2       36.8   29.8–43.8       40.7   26.8–54.6
C          50   41.8   27.5–56.1       38.7   31.1–46.4       41.5   27.3–55.7

Notes: Student's t(n−1,α) confidence intervals of the mean error μ for n = 25 at the α = 5% level of significance for DFAs obtained by TB extraction and by clustering-based partitioning. A, B, and C denote all test strings of length 1 to 15, 1000 strings of length 100, and 1000 strings of length 1000, respectively. The training sets T (column 2) are lexically contiguous.
The results show that DFA extraction using the TB algorithm is at least as good as clustering by partitioning under similar conditions, that is, up to the point where the TB technique extracts the DFA that perfectly matches the one from which the training data set was generated. We surmise that this point occurs when the algorithm has sufficient information about the target DFA in terms of L_unique, as described in section 4. All three methods exhibited roughly similar performance degradation below this point. Unless the DFA can be uniquely characterized by strings up to the length L_unique, many DFAs are consistent with the partial information implicit in the training strings. Since the desired DFA is not yet uniquely defined, it is not surprising that the TB algorithm is prone to large variations in performance up to this point. In this case, the final DFA reported was merely the one inferred from the training strings up to that particular length. This is supported by theorem 1, which sets a minimum required string length L_unique for unique automaton identification. Where TB succeeded in extracting the correct target DFA, it always inferred the smallest one. When the training sets contain sufficient information, only the TB method consistently extracts the desired DFA. This phenomenon is also borne out by the results for sparse training sets shown in Table 2. The results show that our algorithm extracts DFAs of higher fidelity than the other methods, yields a DFA of minimal size, and consistently produces the desired DFA whenever sufficient information is contained in the trained network.

Table 2: Extracted DFAs for Networks Trained on Sparse Sets.

Test Set   T    (a) TB Extraction      (b) TB Extraction      Clustering
                μ(%)   t(24,0.05)      μ(%)   t(24,0.05)      μ(%)   t(24,0.05)
A          700  0.0    0.0–0.0         0.0    0.0–0.0         0.9    0.3–1.5
A          300  4.1    2.8–5.5         4.3    2.9–5.7         4.8    3.6–6.1
A          100  38.0   25.4–51.6       38.2   25.4–51.0       39.6   33.2–46.0
A          50   41.7   27.6–55.8       41.4   27.1–55.7       42.7   35.3–50.2
B          700  0.0    0.0–0.0         0.0    0.0–0.0         1.2    0.4–2.0
B          300  10.2   7.6–12.8        10.1   7.5–12.7        9.3    7.7–10.9
B          100  41.6   28.2–55.0       41.6   28.0–55.2       42.1   34.6–49.7
B          50   42.7   28.1–57.4       42.5   27.9–57.1       44.1   35.8–52.4
C          700  0.0    0.0–0.0         0.0    0.0–0.0         1.3    0.4–2.2
C          300  11.1   7.6–14.7        11.2   7.6–14.7        9.7    7.6–11.7
C          100  42.0   27.4–56.6       42.2   27.7–56.7       43.6   35.2–51.5
C          50   43.3   28.4–58.2       43.1   28.2–58.0       45.3   37.0–53.6

Notes: Student's t(n−1,α) confidence intervals of the mean error μ for n = 25 at the α = 5% level of significance for DFAs obtained by TB extraction for (a) 25 networks trained on random training sets and (b) a single network trained on 25 random training sets, and by clustering-based partitioning. The training sets in column 2 are noncontiguous.

Given sufficient computational resources, the TB algorithm will always infer the DFA that matches the training data perfectly, and it is pedagogical: the method can be applied to any mechanism that models DFA behavior. Even with an increased search depth, the inferred DFA will reflect the degraded performance of the network on long strings. The behavior of the extracted DFAs as the threshold for perfect DFA extraction is approached certainly warrants further study. The last two columns of Table 1 show that the performance of the extracted DFAs closely matches that of the neural network without an extraction algorithm, that is, up to the point where the TB algorithm obtained the correct target DFA. These raw results also exhibited the predicted deterioration in performance as the input string length increased. Simulations were also conducted for 25 networks trained on 300 and 700 strings randomly selected from the first 1000. Results (shown in Table 2) were similar to those for networks trained on contiguous sample sets, except that more strings (between 300 and 700) were now needed to extract the target DFA successfully. Networks initialized with different randomized weights (but trained on the same sample sets) yielded virtually identical results.

5 Conclusions

We have presented a novel method for extracting deterministic finite-state automata (DFAs) from recurrent neural networks based on the TB inference algorithm.
The merits of this algorithm are as follows:

• Unlike the so-called decompositional DFA extraction methods proposed in the literature, this method does not require a cluster analysis of the output space of the recurrent state neurons (such as defining quantization levels) and thus avoids any bias that may be introduced in this way.

• The algorithm is pedagogical and thus able to extract a DFA solely from observations of input-output behavior. Hence, this method is independent of the mechanism used to model DFA behavior.

• This method increases the fidelity of the extracted DFAs. That is, the extracted DFA will model more closely the knowledge stored in the network.

• Given sufficient information, this algorithm is guaranteed to extract the minimal representation of the desired DFA.

• The algorithm runs in polynomial time, and its main computational limitation is the search depth imposed on the prefix tree required for TB extraction. In practice, this is not a serious restriction.

To our knowledge, this is the first time that a pedagogical algorithm has been proposed for the extraction of symbolic knowledge from recurrent neural networks. Although this extraction algorithm runs in polynomial time, training neural networks still remains computationally costly. This constitutes the main limitation of this algorithm. The algorithm has been applied to a neural network training regime that dynamically adapts the architecture of the network during training. It has been successfully used in an exchange rate prediction application based on that in Giles, Lawrence, and Tsoi (2001) and in seismic event modeling for data reported in Van Zyl and Omlin (2001). Preliminary results compare favorably with other approaches in both cases. It remains an open problem to find a theoretical basis for the circumstances under which this extraction method outperforms methods based on cluster analysis of the output space of recurrent state neurons.
References

Andrews, R., Diederich, J., & Tickle, A. (1995). A survey and critique of techniques for extracting rules from neural networks. Knowledge-based systems (pp. 373–389). Amsterdam: Elsevier.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.

Casey, M. (1996). The dynamics of discrete-time computation, with application to finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.

Das, S., & Das, R. (1992). Induction of discrete state-machine by stabilizing a continuous recurrent network using clustering. Journal of Computer Science and Informatics, 21(2), 35–40.

Das, S., & Mozer, M. (1994). A unified gradient-descent/clustering architecture for finite state machine induction. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 19–26). San Mateo, CA: Morgan Kaufmann.

Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Giles, C. L., Lawrence, S., & Tsoi, A. (2001). Noisy time series prediction using a recurrent neural network and grammatical inference. Machine Learning, 44(1), 161–183.

Giles, C., Miller, C., Chen, D., Chen, H., Sun, G., & Lee, Y. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 380–393.

Giles, C., Miller, C., Chen, D., Sun, G., Chen, H., & Lee, Y. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 317–324). San Mateo, CA: Morgan Kaufmann.

Giles, C., & Omlin, C. (1994). Pruning recurrent neural networks for improved generalization performance. IEEE Transactions on Neural Networks, 5(5), 848–851.

Golea, M. (1996). On the complexity of rule-extraction from neural networks and network-querying (Tech. Rep.). Canberra, Australia: Department of Systems Engineering, Australian National University.

Hanson, S., & Burr, D. (1991). What connectionist models learn: Learning and representation in connectionist networks. In R. Mammone & Y. Zeevi (Eds.), Neural networks: Theory and applications. Orlando, FL: Academic Press.

Hanson, S., & Negishi, M. (2002). On the emergence of rules in neural networks.
Neural Computation, 14, 1–24.

Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.

Lang, K. (1992). Random DFA's can be approximately learned from sparse uniform examples. In Proceedings of the Fifth ACM Workshop on Computational Learning Theory. New York: ACM Press.

Omlin, C., & Giles, C. (1996a). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM, 43(6), 937–972.

Omlin, C., & Giles, C. (1996b). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41–52.

Omlin, C., & Giles, C. (1996c). Stable encoding of large finite-state automata in recurrent neural networks. Neural Computation, 8(7), 675–696.

Omlin, C., Giles, C., & Miller, C. (1992). Heuristics for the extraction of rules from discrete-time recurrent neural networks. In Proceedings International Joint Conference on Neural Networks 1992 (Vol. 1, pp. 33–38). New York: ACM Press.
Symbolic Knowledge from Recurrent Neural Networks
71
Omlin, C., Thornber, K., & Giles, C. (1998). Fuzzy nite-state automata can be deterministically encoded into recurrent neural networks. IEEE Transactions on Fuzzy Systems, 6(1), 76–89. Pollack, J. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252. Tino, P., & Sajda, J. (1995). Learning and extracting initial mealy machines with a modular neural network model. Neural Computation, 7(4), 822–844. Trakhenbrot, B., & Barzdin, Y. (1973). Finite automata: Behavior and synthesis. Amsterdam: North-Holland. Van Zyl, J., & Omlin, C. (2001).Prediction of seismic events in mines using neural networks. In International Joint Conference on Neural Networks 2001 (Vol. 2, pp. 1410–1414). New York: IEEE Press. Watrous, R., & Kuhn, G. (1992a). Induction of nite-state languages using second-order recurrent networks. Neural Computation, 4(3), 406. Watrous, R., & Kuhn, G. (1992b). Induction of nite state languages using second-order recurrent networks. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 309–316). San Mateo, CA: Morgan Kaufmann. Williams, R., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. Zeng, Z., Goodman, R., & Smyth, P. (1993). Learning nite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976–990. Received July 29, 2002; accepted June 4, 2003.
LETTER
Communicated by Danilo Mandic
Orthogonality of Decision Boundaries in Complex-Valued Neural Networks

Tohru Nitta
[email protected]
National Institute of Advanced Industrial Science and Technology, Tsukuba-shi, Ibaraki, 305-8568 Japan
This letter presents results of an analysis of the decision boundaries of complex-valued neural networks whose weights, threshold values, and input and output signals are all complex numbers. The main results may be summarized as follows. (1) A decision boundary of a single complex-valued neuron consists of two hypersurfaces that intersect orthogonally, and divides a decision region into four equal sections. The XOR problem and the detection-of-symmetry problem, which cannot be solved with two-layered real-valued neural networks, can be solved by two-layered complex-valued neural networks with the orthogonal decision boundaries, which reveals a potent computational power of complex-valued neural nets. Furthermore, the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. (2) A decision boundary of a three-layered complex-valued neural network has the orthogonal property as a basic structure, and its two hypersurfaces approach orthogonality as all the net inputs to each hidden neuron grow. In particular, most of the decision boundaries in the three-layered complex-valued neural network intersect orthogonally when the network is trained using the Complex-BP algorithm. As a result, the orthogonality of the decision boundaries improves its generalization ability. (3) The average learning speed of the Complex-BP is several times faster than that of the Real-BP, and the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP. For these reasons, the complex-valued neural network and the related algorithm appear natural for learning complex-valued patterns.
Neural Computation 16, 73–97 (2004)
© 2003 Massachusetts Institute of Technology

1 Introduction

It is expected that complex-valued neural networks, whose parameters (weights and threshold values) are all complex numbers, will have applications in fields dealing with complex numbers, such as telecommunications, speech recognition, and image processing with the Fourier transformation. When using an existing method for real numbers, we must apply the
method individually to the real and imaginary parts. On the other hand, complex-valued neural networks allow us to process the data directly. Moreover, complex-valued neural networks enable us to capture the good rotational behavior of complex numbers automatically. Indeed, applications of complex-valued neural networks to various fields, such as telecommunications and image processing, can be found in the literature (Miyauchi, Seki, Watanabe, & Miyauchi, 1992, 1993; Watanabe, Yazawa, Miyauchi, & Miyauchi, 1994; KES, 2001, 2002; ICONIP, 2002). The fading equalization technology is an application domain suitable for the complex-valued neural network, and will be described in section 4.3. The backpropagation algorithm, referred to here as Real-BP (Rumelhart, Hinton, & Williams, 1986a, 1986b), is an adaptive procedure that is widely used in training a multilayer perceptron for a number of classification applications in areas such as speech and image recognition. The Complex-BP algorithm is a complex-valued version of the Real-BP, which was proposed by several researchers independently in the early 1990s (Kim & Guest, 1990; Nitta & Furuya, 1991; Benvenuto & Piazza, 1992; Georgiou & Koutsougeras, 1992; Nitta, 1993, 1997). The Complex-BP algorithm can be applied to multilayered neural networks whose weights, threshold values, and input and output signals are all complex numbers. This algorithm enables the network to learn complex-valued patterns naturally and has an inherent ability to transform geometric figures, which may be related to the identity theorem in complex analysis (Nitta & Furuya, 1991; Nitta, 1993, 1997). It is important to clarify the characteristics of complex-valued neural networks in order to promote such applications.
This article makes clear the differences between the real-valued neural network and the complex-valued neural network by analyzing their fundamental properties from the viewpoint of network architecture, and clarifies the utility that the properties discovered here bring to the complex-valued neural network. The main results may be summarized as follows. (1) A decision boundary of a single complex-valued neuron consists of two hypersurfaces that intersect orthogonally, and divides a decision region into four equal sections. The XOR problem and the detection-of-symmetry problem, which cannot be solved with two-layered real-valued neural networks, can be solved by two-layered complex-valued neural networks with the orthogonal decision boundaries, which reveals the potent computational power of complex-valued neural nets. Furthermore, the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. (2) A decision boundary of a three-layered complex-valued neural network has the orthogonal property as a basic structure, and its two hypersurfaces approach orthogonality as all the net inputs to each hidden neuron grow. In particular, most of the decision boundaries in the three-layered complex-valued neural network intersect orthogonally when the network is trained using the Complex-BP algorithm.
As a result, the orthogonality of the decision boundaries improves its generalization ability. (3) Furthermore, the average learning speed of the Complex-BP is several times faster than that of the Real-BP, and the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP.

This article is organized as follows. Section 2 describes the complex-valued neural network and the related Complex-BP algorithm. Section 3 deals with the theoretical analyses of the decision boundaries of the complex-valued neural network model. The utility that the properties discovered in section 3 bring to the complex-valued neural network is given in section 4. Section 5 is devoted to a discussion of the results obtained. Finally, we give some conclusions.

2 The Complex-Valued Neural Network

This section describes the complex-valued neural network used in the analysis.

2.1 The Model. First, we consider the following complex-valued neuron. The input signals, weights, thresholds, and output signals are all complex numbers. The net input U_n to a complex-valued neuron n is defined as

U_n = \sum_m W_{nm} X_m + V_n,   (2.1)
where W_{nm} is the (complex-valued) weight connecting complex-valued neurons n and m, X_m is the (complex-valued) input signal from complex-valued neuron m, and V_n is the (complex-valued) threshold value of neuron n. To obtain the (complex-valued) output signal, convert the net input U_n into its real and imaginary parts as follows: U_n = x + iy = z, where i denotes \sqrt{-1}. The (complex-valued) output signal is defined as

f_C(z) = f_R(x) + i f_R(y),   (2.2)
where f_R(u) = 1/(1 + \exp(-u)), u \in R (R denotes the set of real numbers); that is, the real and imaginary parts of the output of a neuron are the sigmoid functions of the real part x and the imaginary part y of the net input z to the neuron, respectively. Note that the activation function f_C is not a regular complex-valued function because the Cauchy-Riemann equations do not hold. A complex-valued neural network consists of such complex-valued neurons. The Complex-BP learning rule (Nitta & Furuya, 1991; Nitta, 1993, 1997) has been obtained by applying a steepest-descent method to such (multilayered) complex-valued neural networks.
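As a concrete illustration of equations 2.1 and 2.2, the forward pass of a single complex-valued neuron can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the letter; the function and variable names are illustrative.

```python
import numpy as np

def f_R(u):
    """Real logistic sigmoid used componentwise in eq. 2.2."""
    return 1.0 / (1.0 + np.exp(-u))

def f_C(z):
    """Split activation of eq. 2.2: f_C(z) = f_R(Re z) + i f_R(Im z)."""
    return f_R(z.real) + 1j * f_R(z.imag)

def complex_neuron(W, X, V):
    """Eq. 2.1: U_n = sum_m W_nm X_m + V_n, followed by the output f_C(U_n)."""
    return f_C(np.dot(W, X) + V)

W = np.array([0.5 + 0.2j, -0.3 + 1.0j])   # complex weights
X = np.array([1.0 - 1.0j, 0.5 + 0.5j])    # complex inputs
V = 0.1 - 0.4j                            # complex threshold

out = complex_neuron(W, X, V)
# Because each part passes through a sigmoid, the output always lies in the
# open unit square (0, 1) x (0, 1) of the complex plane.
assert 0 < out.real < 1 and 0 < out.imag < 1
```

Note that f_C is bounded but nonregular, as discussed in section 2.2: the sigmoid is applied to the real and imaginary parts separately rather than to z as a whole.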
2.2 The Activation Function. In this section, the validity of the activation function defined in expression 2.2 is described. In the case of real-valued neural networks, an activation function of real-valued neurons is usually chosen to be a smooth (continuously differentiable) and bounded function such as a sigmoidal function. As some researchers have pointed out (Georgiou & Koutsougeras, 1992; Nitta, 1997; Kim & Adali, 2000; Kuroe, Hashimoto, & Mori, 2002), in the complex region, however, we should recall Liouville's theorem, which says that if a function G is regular at all z \in C and bounded, then G is a constant function, where C denotes the set of complex numbers. That is, we need to choose either regularity or boundedness for an activation function of complex-valued neurons. In this article, expression 2.2, which is bounded but nonregular, is adopted as the activation function of complex-valued neurons for the analysis (that is, boundedness is chosen).

There are several reasons for adopting the complex function defined in expression 2.2 as the activation function, which we now state. First, although expression 2.2 is not regular (i.e., the Cauchy-Riemann equations do not hold), there is a strong relationship between its real and imaginary parts. Consider a complex-valued neuron with n inputs, weights w_k = w_k^r + i w_k^i \in C (1 \le k \le n), and a threshold value \theta = \theta^r + i\theta^i \in C. Then for n input signals x_k + i y_k \in C (1 \le k \le n), the complex-valued neuron generates

X + iY = f_C\left( \sum_{k=1}^{n} (w_k^r + i w_k^i)(x_k + i y_k) + (\theta^r + i\theta^i) \right)
       = f_R\left( \sum_{k=1}^{n} (w_k^r x_k - w_k^i y_k) + \theta^r \right) + i\, f_R\left( \sum_{k=1}^{n} (w_k^i x_k + w_k^r y_k) + \theta^i \right)   (2.3)
as an output. Both the real and imaginary parts of the right-hand side of expression 2.3 contain the common variables {x_k, y_k, w_k^r, w_k^i} (1 \le k \le n) and influence each other via those 4n variables. Moreover, the activation function 2.2 can process complex-valued signals properly through its amplitude-phase relationship. Indeed, as described in Nitta (1997), the 1-n-1 Complex-BP network with activation function 2.2 can transform geometric figures, such as rotation, similarity transformation, and parallel displacement of straight lines and circles. This cannot be done without the ability to process signals properly through the amplitude-phase relationship in the activation function. Second, it has been proved that the complex-valued neural network with the activation function 2.2 can approximate any continuous complex-valued function, whereas one with a regular activation function (e.g., f_C(z) = 1/(1 + \exp(-z)), proposed by Kim & Guest, 1990, and f_C(z) = \tanh(z), by Kim & Adali, 2000) cannot approximate any nonregular complex-valued function (Arena, Fortuna, Re, & Xibilia, 1993; Arena, Fortuna, Muscato, & Xibilia, 1998). That is, the complex-valued neural network with the activation function 2.2 is a universal approximator, whereas one with a regular activation function is not. Thus, the ability of complex-valued neural networks to approximate complex-valued functions depends heavily on the regularity of the activation functions used. Third, the stability of learning in the complex-valued neural network with the activation function 2.2 has been confirmed through computer simulations (Nitta & Furuya, 1991; Nitta, 1993, 1997; Arena et al., 1993). Finally, the activation function 2.2 clearly satisfies the following five properties that an activation function H(z) = u(x, y) + iv(x, y), z = x + iy, should possess, which Georgiou and Koutsougeras (1992) pointed out:

1. H is nonlinear in x and y.
2. H is bounded.
3. The partial derivatives u_x, u_y, v_x, and v_y exist and are bounded.
4. H(z) is not entire; that is, there exists some z_0 \in C such that H(z_0) is not regular.
5. u_x v_y is not identically equal to v_x u_y.

We adopted the activation function 2.2 for the above reasons.

3 Orthogonality of Decision Boundaries in the Complex-Valued Neural Network

The decision boundary is the boundary by which a pattern classifier such as the Real-BP classifies patterns, and it generally consists of hypersurfaces. Decision boundaries of real-valued neural networks have been examined empirically by Lippmann (1987). This section mathematically analyzes the decision boundaries of the complex-valued neural network.

3.1 A Case of a Single Neuron. We first analyze the decision boundary of a single complex-valued neuron (i.e., the number of hidden layers is zero).
Let the weights be denoted by w = {}^t[w_1 \dots w_n] = w^r + i w^i, w^r = {}^t[w_1^r \dots w_n^r], w^i = {}^t[w_1^i \dots w_n^i], and let the threshold be denoted by \theta = \theta^r + i\theta^i. Then, for n input signals (complex numbers) z = {}^t[z_1 \dots z_n] = x + iy, x = {}^t[x_1 \dots x_n], y = {}^t[y_1 \dots y_n], the complex-valued neuron generates

X + iY = f_R\left( [{}^t w^r \;\; -{}^t w^i] \binom{x}{y} + \theta^r \right) + i\, f_R\left( [{}^t w^i \;\; {}^t w^r] \binom{x}{y} + \theta^i \right)   (3.1)
as an output. Here, for any two constants C^R, C^I \in (0, 1), let

X(x, y) = f_R\left( [{}^t w^r \;\; -{}^t w^i] \binom{x}{y} + \theta^r \right) = C^R,   (3.2)

Y(x, y) = f_R\left( [{}^t w^i \;\; {}^t w^r] \binom{x}{y} + \theta^i \right) = C^I.   (3.3)
Note here that expression 3.2 is the decision boundary for the real part of the output of the complex-valued neuron with n inputs. That is, input signals (x, y) \in R^{2n} are classified into two decision regions, {(x, y) \in R^{2n} | X(x, y) \ge C^R} and {(x, y) \in R^{2n} | X(x, y) < C^R}, by the hypersurface given by expression 3.2. Similarly, expression 3.3 is the decision boundary for the imaginary part. The normal vectors Q^R(x, y) and Q^I(x, y) of the decision boundaries 3.2 and 3.3 are given by

Q^R(x, y) = \left( \frac{\partial X}{\partial x_1} \cdots \frac{\partial X}{\partial x_n} \; \frac{\partial X}{\partial y_1} \cdots \frac{\partial X}{\partial y_n} \right) = f_R'\left( [{}^t w^r \;\; -{}^t w^i] \binom{x}{y} + \theta^r \right) \cdot [{}^t w^r \;\; -{}^t w^i],   (3.4)

Q^I(x, y) = \left( \frac{\partial Y}{\partial x_1} \cdots \frac{\partial Y}{\partial x_n} \; \frac{\partial Y}{\partial y_1} \cdots \frac{\partial Y}{\partial y_n} \right) = f_R'\left( [{}^t w^i \;\; {}^t w^r] \binom{x}{y} + \theta^i \right) \cdot [{}^t w^i \;\; {}^t w^r].   (3.5)
Noting that the inner product of expressions 3.4 and 3.5 is zero, we find that the decision boundary for the real part of the output of a complex-valued neuron and that for the imaginary part intersect orthogonally. It can easily be shown that this orthogonal property also holds for the other types of complex-valued neurons proposed in Kim and Guest (1990), Benvenuto and Piazza (1992), and Georgiou and Koutsougeras (1992). It should be noted here that there seems to be a problem with learning convergence in the formulation in Kim and Guest (1990): the complex-valued backpropagation algorithm with the activation function f_C(z) = 1/(1 + \exp(-z)), z = x + iy, never converged in our experiments.
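The orthogonality above can be checked numerically: the normal vectors 3.4 and 3.5 are scalar multiples of [{}^t w^r  −{}^t w^i] and [{}^t w^i  {}^t w^r], and the inner product of those direction vectors is w^r·w^i − w^i·w^r = 0 identically. A minimal sketch (illustrative names, random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
wr, wi = rng.normal(size=n), rng.normal(size=n)   # real and imaginary parts of w

# Direction vectors of the two boundary hypersurfaces (eqs. 3.4-3.5).
# The scalar prefactors f_R'(...) do not affect orthogonality.
qR = np.concatenate([wr, -wi])
qI = np.concatenate([wi,  wr])

# wr.wi - wi.wr = 0, so the two decision boundaries intersect orthogonally.
assert abs(np.dot(qR, qI)) < 1e-12
```

The cancellation holds for every choice of w, which is why the result needs no condition on the weights.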
Generally, a real-valued neuron classifies an input real-valued signal into two classes (0, 1), whereas a complex-valued neuron classifies an input complex-valued signal into four classes (0, 1, i, 1 + i). As described above, the decision boundary of a complex-valued neuron consists of two hypersurfaces that intersect orthogonally and divides a decision region into four equal sections. Thus, a complex-valued neuron can be considered to have a natural decision boundary for complex-valued patterns.

3.2 A Case of a Three-Layered Network. Next, we examine the decision boundary of a three-layered complex-valued neural network (i.e., one with a single hidden layer). Consider a three-layered complex-valued neural network with L input neurons, M hidden neurons, and N output neurons. We use w_{ji} = w_{ji}^r + i w_{ji}^i for the weight between input neuron i and hidden neuron j, v_{kj} = v_{kj}^r + i v_{kj}^i for the weight between hidden neuron j and output neuron k, \theta_j = \theta_j^r + i\theta_j^i for the threshold of hidden neuron j, and \gamma_k = \gamma_k^r + i\gamma_k^i for the threshold of output neuron k. Then, for L input (complex-valued) signals z = {}^t[z_1 \cdots z_L] = x + iy, x = {}^t[x_1 \cdots x_L], y = {}^t[y_1 \cdots y_L], the net input U_j to hidden neuron j is given by

U_j = U_j^r + i U_j^i = \left[ \sum_{i=1}^{L} (w_{ji}^r x_i - w_{ji}^i y_i) + \theta_j^r \right] + i \left[ \sum_{i=1}^{L} (w_{ji}^i x_i + w_{ji}^r y_i) + \theta_j^i \right].   (3.6)
Hence, the output H_j of hidden neuron j is given by

H_j = H_j^r + i H_j^i = f_R(U_j^r) + i f_R(U_j^i).   (3.7)
Also, the net input S_k to output neuron k is given by

S_k = S_k^r + i S_k^i = \left[ \sum_{j=1}^{M} (v_{kj}^r H_j^r - v_{kj}^i H_j^i) + \gamma_k^r \right] + i \left[ \sum_{j=1}^{M} (v_{kj}^i H_j^r + v_{kj}^r H_j^i) + \gamma_k^i \right].   (3.8)
Hence, the output O_k of output neuron k is given by

O_k = O_k^r + i O_k^i = f_R(S_k^r) + i f_R(S_k^i).   (3.9)
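The forward computation 3.6 through 3.9 is exactly ordinary complex matrix arithmetic followed by the split activation, since eq. 3.6 is the real/imaginary decomposition of the complex product Wz + θ. A minimal sketch (illustrative names, random parameters, not from the letter):

```python
import numpy as np

def f_C(z):
    """Split sigmoid of eq. 2.2, applied elementwise."""
    s = lambda u: 1.0 / (1.0 + np.exp(-u))
    return s(z.real) + 1j * s(z.imag)

def forward(W, theta, V, gamma, z):
    """Eqs. 3.6-3.9 for an L-M-N complex-valued network.
    W: (M, L) hidden weights, V: (N, M) output weights."""
    H = f_C(W @ z + theta)      # eqs. 3.6-3.7: hidden-layer outputs
    return f_C(V @ H + gamma)   # eqs. 3.8-3.9: output-layer outputs

rng = np.random.default_rng(1)
L, M, N = 3, 4, 2
W = rng.normal(size=(M, L)) + 1j * rng.normal(size=(M, L))
V = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))
theta = rng.normal(size=M) + 1j * rng.normal(size=M)
gamma = rng.normal(size=N) + 1j * rng.normal(size=N)
z = rng.normal(size=L) + 1j * rng.normal(size=L)

O = forward(W, theta, V, gamma, z)
# Every output lies in the open unit square of the complex plane.
assert O.shape == (N,)
assert np.all((O.real > 0) & (O.real < 1) & (O.imag > 0) & (O.imag < 1))
```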
Here, for any two constants C^R, C^I \in (0, 1), let

O_k^r(x, y) = C^R,   (3.10)

O_k^i(x, y) = C^I.   (3.11)
Expressions 3.10 and 3.11 are the decision boundaries for the real and imaginary parts, respectively, of output neuron k in the three-layered complex-valued neural network. The normal vectors Q^R(x, y) and Q^I(x, y) of these hypersurfaces, equations 3.10 and 3.11, are given by

Q^R(x, y) = \left( \frac{\partial O_k^r}{\partial x_1} \cdots \frac{\partial O_k^r}{\partial x_L} \; \frac{\partial O_k^r}{\partial y_1} \cdots \frac{\partial O_k^r}{\partial y_L} \right),   (3.12)

Q^I(x, y) = \left( \frac{\partial O_k^i}{\partial x_1} \cdots \frac{\partial O_k^i}{\partial x_L} \; \frac{\partial O_k^i}{\partial y_1} \cdots \frac{\partial O_k^i}{\partial y_L} \right),   (3.13)
and their inner product is given by

Q^R(x, y) \cdot {}^t Q^I(x, y) = \frac{\partial O_k^r}{\partial x_1} \cdot \frac{\partial O_k^i}{\partial x_1} + \cdots + \frac{\partial O_k^r}{\partial x_L} \cdot \frac{\partial O_k^i}{\partial x_L} + \frac{\partial O_k^r}{\partial y_1} \cdot \frac{\partial O_k^i}{\partial y_1} + \cdots + \frac{\partial O_k^r}{\partial y_L} \cdot \frac{\partial O_k^i}{\partial y_L}.   (3.14)
Note here that for any 1 \le i \le L,

\frac{\partial O_k^r}{\partial x_i} \cdot \frac{\partial O_k^i}{\partial x_i} + \frac{\partial O_k^r}{\partial y_i} \cdot \frac{\partial O_k^i}{\partial y_i}
= \frac{\partial f_R(S_k^r)}{\partial S_k^r} \cdot \frac{\partial f_R(S_k^i)}{\partial S_k^i} \cdot \left[ \sum_{j=1}^{M} \left( v_{kj}^r w_{ji}^r \cdot \frac{\partial f_R(U_j^r)}{\partial U_j^r} - v_{kj}^i w_{ji}^i \cdot \frac{\partial f_R(U_j^i)}{\partial U_j^i} \right) \right] \cdot \left[ \sum_{j=1}^{M} \left( v_{kj}^i w_{ji}^r \cdot \frac{\partial f_R(U_j^r)}{\partial U_j^r} + v_{kj}^r w_{ji}^i \cdot \frac{\partial f_R(U_j^i)}{\partial U_j^i} \right) \right]
\quad - \frac{\partial f_R(S_k^r)}{\partial S_k^r} \cdot \frac{\partial f_R(S_k^i)}{\partial S_k^i} \cdot \left[ \sum_{j=1}^{M} \left( v_{kj}^r w_{ji}^r \cdot \frac{\partial f_R(U_j^i)}{\partial U_j^i} - v_{kj}^i w_{ji}^i \cdot \frac{\partial f_R(U_j^r)}{\partial U_j^r} \right) \right] \cdot \left[ \sum_{j=1}^{M} \left( v_{kj}^i w_{ji}^r \cdot \frac{\partial f_R(U_j^i)}{\partial U_j^i} + v_{kj}^r w_{ji}^i \cdot \frac{\partial f_R(U_j^r)}{\partial U_j^r} \right) \right].   (3.15)
Hence, the inner product of the normal vectors is not always zero. Therefore, we cannot conclude that the decision boundaries (hypersurfaces) for the real and imaginary parts of output neuron k in the three-layered complex-valued neural network intersect orthogonally. However, paying close attention to expression 3.15, we find that if

\frac{\partial f_R(U_j^r)}{\partial U_j^r} = \frac{\partial f_R(U_j^i)}{\partial U_j^i}   (3.16)
for any 1 \le j \le M, then the inner product is zero. In general, if both |u_1| and |u_2| are sufficiently large, we can consider f_R'(u_1) to be nearly equal to f_R'(u_2). Hence, if for any 1 \le j \le M there exist sufficiently large positive real numbers K_1, K_2 such that |U_j^r| > K_1 and |U_j^i| > K_2, then the two decision boundaries, equations 3.10 and 3.11, nearly intersect orthogonally. That is, if for any 1 \le j \le M both the absolute values of the real and imaginary parts of the net input (a complex number) to hidden neuron j are sufficiently large, then the decision boundaries nearly intersect orthogonally. Therefore, the following theorem can be obtained:

Theorem. The decision boundaries for the real and imaginary parts of an output neuron in the three-layered complex-valued neural network approach orthogonality as the absolute values of the real and imaginary parts of the net inputs to all hidden neurons grow.

4 Utility of the Orthogonal Decision Boundaries

In this section, we show the utility that the properties of the decision boundary discovered in the previous section bring about. Minsky and Papert (1969) clarified the limitations of two-layered real-valued neural networks (i.e., those with no hidden layers): in a large number of interesting cases, the two-layered real-valued neural network is incapable of solving the problem. A classic example is the exclusive-or (XOR) problem, which has a long history in the study of neural networks; many other difficult problems involve the XOR as a subproblem. Another example is the detection-of-symmetry problem. Rumelhart et al. (1986a, 1986b) showed that the three-layered real-valued neural network (i.e., one with one hidden layer) can solve such problems, including the XOR problem and the detection-of-symmetry problem, and that interesting internal representations can be constructed in the weight space.
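Condition 3.16 can be illustrated numerically: if the hidden thresholds are chosen so that U_j^r = U_j^i at a given input point (so that 3.16 holds there exactly), then finite-difference estimates of the normal vectors 3.12 and 3.13 are orthogonal up to discretization error. This is a sketch under that constructed assumption, with illustrative names; it is not the letter's experiment.

```python
import numpy as np

def f_C(z):
    s = lambda u: 1.0 / (1.0 + np.exp(-u))
    return s(z.real) + 1j * s(z.imag)

def output_k(W, theta, V, gamma, x, y, k):
    """O_k of eqs. 3.6-3.9 as a function of the 2L real inputs (x, y)."""
    H = f_C(W @ (x + 1j * y) + theta)
    return f_C(V @ H + gamma)[k]

rng = np.random.default_rng(2)
L, M, N, k = 3, 4, 2, 0
W = rng.normal(size=(M, L)) + 1j * rng.normal(size=(M, L))
V = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))
gamma = rng.normal(size=N) + 1j * rng.normal(size=N)
x, y = rng.normal(size=L), rng.normal(size=L)

# Choose hidden thresholds so that U_j^r = U_j^i at (x, y): then
# condition 3.16 holds exactly and the inner product 3.14 vanishes.
theta = -(W @ (x + 1j * y)) + rng.normal(size=M) * (1 + 1j)

def normal_vector(part):
    """Central-difference gradient of Re or Im of O_k w.r.t. (x, y)."""
    h, g = 1e-5, np.zeros(2 * L)
    for d in range(2 * L):
        dx, dy = np.zeros(L), np.zeros(L)
        (dx if d < L else dy)[d % L] = h
        hi = output_k(W, theta, V, gamma, x + dx, y + dy, k)
        lo = output_k(W, theta, V, gamma, x - dx, y - dy, k)
        g[d] = (part(hi) - part(lo)) / (2 * h)
    return g

qR, qI = normal_vector(np.real), normal_vector(np.imag)
cos = qR @ qI / (np.linalg.norm(qR) * np.linalg.norm(qI))
assert abs(cos) < 1e-4   # the boundaries are (numerically) orthogonal
```

Without the threshold construction, the same cosine is generically nonzero, matching the discussion of expression 3.15.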
As described above, the XOR problem and the detection-of-symmetry problem cannot be solved with the two-layered real-valued neural network. First, contrary to expectation, it will be proved that such problems can be solved by the two-layered complex-valued neural network (i.e., one with no hidden layers) with the orthogonal decision boundaries, which reveals a potent computational power of complex-valued neural nets. In addition, it will be shown, as an application of this computational power, that the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. Rumelhart et al. (1986a, 1986b) showed that increasing the number of layers raised the computational power of neural networks. In this section, we show that extending the dimensionality of neural networks to complex numbers has a similar effect. This may be a new direction for enhancing the ability of neural networks. Second, we will present simulation results on the generalization ability of the three-layered complex-valued neural networks trained using the Complex-BP (called Complex-BP networks) (Nitta & Furuya, 1991; Nitta, 1993, 1997) and compare them with those of the three-layered real-valued neural networks trained using the Real-BP (called Real-BP networks) (Rumelhart et al., 1986a, 1986b).

4.1 The XOR Problem. In this section, it is proved that the XOR problem can be solved by a two-layered complex-valued neural network (i.e., one with no hidden layers) with the orthogonal decision boundaries. The input-output mapping in the XOR problem is shown in Table 1.

Table 1: The XOR Problem.

    Input (x1 x2)    Output (y)
    0 0              0
    0 1              1
    1 0              1
    1 1              0

In order to solve the XOR problem with complex-valued neural networks, the input-output mapping is encoded as shown in Table 2, where the outputs 1 and i are interpreted as the output 0, and the outputs 0 and 1 + i as the output 1, of the original XOR problem (Table 1).

Table 2: An Encoded XOR Problem for Complex-Valued Neural Networks.

    Input (z = x + iy)    Output (Z = X + iY)
    -1 - i                1
    -1 + i                0
    1 - i                 1 + i
    1 + i                 i

We use a 1-1 complex-valued neural network (i.e., no hidden layers) with a weight w = u + iv \in C between the input neuron and the output neuron (we assume that it has no threshold parameters). The activation function is
defined as

1_C(z) = 1_R(x) + i\, 1_R(y), \quad z = x + iy,   (4.1)
where 1_R is the real-valued step function defined on R, that is, 1_R(u) = 1 if u \ge 0, and 0 otherwise, for any u \in R. The decision boundary of the 1-1 complex-valued neural network described above consists of the following two straight lines, which intersect orthogonally:

[u \;\; -v] \cdot {}^t[x \;\; y] = 0,   (4.2)

[v \;\; u] \cdot {}^t[x \;\; y] = 0,   (4.3)
for any input signal z = x + iy \in C, where u and v are the real and imaginary parts of the weight parameter w = u + iv, respectively. Expressions 4.2 and 4.3 are the decision boundaries for the real and imaginary parts of the 1-1 complex-valued neural network, respectively. Letting u = 0 and v = 1 (i.e., w = i), we have the decision boundary shown in Figure 1, which divides the input space (the decision region) into four equal sections and has the highest generalization ability for the XOR problem. On the other hand, the decision boundary of the three-layered real-valued neural network for the XOR problem does not always have the highest generalization ability (Lippmann, 1987). In addition, the required number of learnable parameters is only 2 (i.e., only w = u + iv), whereas at least nine parameters are needed for the three-layered real-valued neural network to solve the XOR problem (Rumelhart et al., 1986a, 1986b), where a complex-valued parameter z = x + iy (where i = \sqrt{-1}) is counted as two because it consists of a real part x and an imaginary part y.

Figure 1: The decision boundary in the input space of the 1-1 complex-valued neural network that solves the XOR problem. The decision boundary for the real part is the line y = 0, and that for the imaginary part is the line x = 0. A black circle means that the output in the XOR problem is 1, and a white one 0.

4.2 The Detection of Symmetry. Another interesting task that cannot be done by two-layered real-valued neural networks is the detection-of-symmetry problem (Minsky & Papert, 1969). In this section, a solution to this problem using a two-layered complex-valued neural network (i.e., no hidden layers) with the orthogonal decision boundaries is given. The problem is to detect whether the binary activity levels of a one-dimensional array of input neurons are symmetrical about the center point. For example, the input-output mapping in the case of three inputs is shown in Table 3. We used patterns of various lengths (from two to six) and could solve all the cases with two-layered complex-valued neural networks. Only a solution to the case with six inputs is presented here, because the other cases can be handled in a similar way. We use a 6-1 complex-valued neural network (i.e., no hidden layers) with weights w_k = u_k + i v_k \in C between input neuron k and the output neuron (1 \le k \le 6) (we assume that it has no threshold parameters). In order to solve the problem with the complex-valued neural network, the input-output mapping is encoded as follows: an input x_k \in R is encoded as an input x_k + i y_k \in C to input neuron k, where y_k = 0 (1 \le k \le 6); the output 1 \in R is encoded as 1 + i \in C; and the output 0 \in R is encoded as 1 or i \in C, which is determined according to the inputs (e.g., the output corresponding to the input {}^t[0\;0\;0\;0\;1\;0] is i). The activation function is the same as in expression 4.1.
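The 1-1 network of section 4.1 with w = i can be verified exhaustively. In this sketch (illustrative names, assuming NumPy-style complex arithmetic from the standard library only), the encoding follows Table 2, and the decoding interprets the outputs 1 and i as 0, and the outputs 0 and 1 + i as 1:

```python
# Step function of eq. 4.1: 1 if u >= 0, else 0.
step = lambda u: 1.0 if u >= 0 else 0.0

def neuron_1_1(w, z):
    """1-1 complex-valued neuron (no threshold) with the step activation 4.1."""
    s = w * z
    return step(s.real) + 1j * step(s.imag)

# Encoded XOR problem of Table 2.
encode = {(0, 0): -1 - 1j, (0, 1): -1 + 1j, (1, 0): 1 - 1j, (1, 1): 1 + 1j}
# Outputs 1 and i decode to 0; outputs 0 and 1 + i decode to 1.
decode = {1 + 0j: 0, 1j: 0, 0j: 1, 1 + 1j: 1}

w = 1j   # the single learnable parameter (u = 0, v = 1) from section 4.1
for (a, b), z in encode.items():
    assert decode[neuron_1_1(w, z)] == a ^ b   # matches XOR on every input
```

With w = i, the net input iz has real part −y and imaginary part x, so the two step functions threshold exactly on the boundary lines y = 0 and x = 0 of Figure 1.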
Table 3: The Detection-of-Symmetry Problem with Three Inputs.

    Input (x1 x2 x3)    Output (y)
    0 0 0               1
    0 0 1               0
    0 1 0               1
    1 0 0               0
    0 1 1               0
    1 0 1               1
    1 1 0               0
    1 1 1               1

Note: Output 1 means that the corresponding input is symmetric, and 0 asymmetric.
The decision boundary of the 6-1 complex-valued neural network described above consists of the following two straight lines, which intersect orthogonally:

[u_1 \cdots u_6 \;\; -v_1 \cdots -v_6] \cdot {}^t[x_1 \cdots x_6 \;\; y_1 \cdots y_6] = 0,   (4.4)

[v_1 \cdots v_6 \;\; u_1 \cdots u_6] \cdot {}^t[x_1 \cdots x_6 \;\; y_1 \cdots y_6] = 0,   (4.5)
for any input signal z_k = x_k + i y_k \in C, where u_k and v_k are the real and imaginary parts of the weight parameter w_k = u_k + i v_k, respectively (1 \le k \le 6). Expressions 4.4 and 4.5 are the decision boundaries for the real and imaginary parts of the 6-1 complex-valued neural network, respectively. Letting {}^t[u_1 \cdots u_6] = {}^t[-1\;2\;-4\;4\;-2\;1] and {}^t[v_1 \cdots v_6] = {}^t[1\;-2\;4\;-4\;2\;-1] (i.e., w_1 = -1 + i, w_2 = 2 - 2i, w_3 = -4 + 4i, w_4 = 4 - 4i, w_5 = -2 + 2i, and w_6 = 1 - i), we have the orthogonal decision boundaries shown in Figure 2, which successfully detect the symmetry of the 2^6 (= 64) input patterns. In addition, the required number of learnable parameters is 12 (i.e., 6 complex-valued weights), whereas at least 17 parameters are needed for the three-layered real-valued neural network to solve the detection of symmetry (Rumelhart et al., 1986a, 1986b), where a complex-valued parameter z = x + iy (where i = \sqrt{-1}) is counted as two, as in section 4.1.
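The weights given above can be checked against all 2^6 input patterns; this is a minimal sketch with illustrative names. For these weights, the real part of the net input is nonzero exactly when the pattern is asymmetric, and its imaginary part is its negative.

```python
import numpy as np
from itertools import product

step = lambda u: 1.0 if u >= 0 else 0.0

# Weights from section 4.2: w_k = u_k + i v_k, with v = -u.
u = np.array([-1, 2, -4, 4, -2, 1], dtype=float)
w = u + 1j * (-u)

for bits in product((0.0, 1.0), repeat=6):
    x = np.array(bits)
    s = np.dot(w, x)          # net input (inputs have zero imaginary part)
    out = step(s.real) + 1j * step(s.imag)
    if bits == bits[::-1]:    # symmetric pattern: s = 0, output 1 + i
        assert out == 1 + 1j
    else:                     # asymmetric: s.real != 0, output 1 or i
        assert out in (1 + 0j, 1j)
```

For the input {}^t[0 0 0 0 1 0], the net input is −2 + 2i, and the output is i, matching the example in the text.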
Figure 2: The decision boundary in the net input space of the 6-1 complex-valued neural network that solves the detection-of-symmetry problem. Note that the plane is not the input space but the net input space, because the dimension of the input space is 6 and the input space cannot be drawn in a two-dimensional plane. A black circle means a net input for a symmetric input, and a white one an asymmetric input. There is only one black circle, at the origin. The four circled complex numbers are the output values of the 6-1 complex-valued neural network in their respective regions.
Table 4: Input-Output Mapping in the Fading Equalization Problem.

    Input     Output
    -1 - i    -1 - i
    -1 + i    -1 + i
    1 - i     1 - i
    1 + i     1 + i
4.3 The Fading Equalization Technology. In this section, it is shown that two-layered complex-valued neural networks with orthogonal decision boundaries can be successfully applied to the fading equalization technology (Lathi, 1998). Channel equalization in a digital communication system can be viewed as a pattern classication problem. The digital communication system receives a transmitted signal sequence with additive noise and tries to estimate the true transmitted sequence. A transmitted signal can take one of the p following four possible complex values: ¡1¡i, ¡1Ci, 1¡i, and 1Ci .i D ¡1/. Thus, the received signal will take value around ¡1 ¡ i, ¡1 C i, 1 ¡ i, and 1 C i, for example, ¡0:9 ¡ 1:2i, 1:1 C 0:84i, and the like, because some noises are added to them. We need to estimate the true complex values from such complex values with noise. Thus, a method with excellent generalization ability is needed for the estimate. The input-output mapping in the problem is shown in Table 4. We use the same 1-1 complex-valued neural network as in section 4.1. In order to solve the problem with the complex-valued neural network, the input-output mapping in Table 4 is encoded as shown in Table 5. Letting u D 1 and v D 0 (i.e., w D 1), we have the orthogonal decision boundary shown in Figure 3, which has the highest generalization ability for the fading equalization problem and can estimate true signals without errors. In addition, the required number of learnable parameters is only two (i.e., only w D u C iv). 4.4 Generalization Ability of Three-Layered Complex-Valued Neural Networks. We present the simulation results on the generalization abilTable 5: An Encoded Fading Equalization Problem for Complex-Valued Neural Networks. Input
Output
¡1 ¡ i ¡1 C i 1¡i 1Ci
0 i 1 1Ci
Orthogonality of Decision Boundaries
87
Figure 3: The decision boundary in the input space of the 1-1 complex-valued neural network that solves the fading equalization problem. The black circle means an input in the fading equalization problem. The four circled complex numbers refer to the output values of the 1-1 complex-valued neural network in their regions, respectively.
ity of the three-layered complex-valued neural networks trained using the Complex-BP (Nitta & Furuya, 1991; Nitta, 1993, 1997) and compare them with those of the three-layered real-valued neural networks trained using the Real-BP (Rumelhart et al., 1986a, 1986b). In the experiments, the three sets of (complex-valued) learning patterns shown in Tables 6 to 8 were used, and the learning constant " was 0.5. The initial components of the weights and the thresholds were chosen to be random real between ¡0:3 and 0:3. We judged that learning nished qPnumbers PN .p/ .p/ 2 .p/ .p/ when p kD1 jTk ¡ Ok j D 0:05 held, where Tk ; Ok 2 C denoted the desired output value, the actual output value of the output neuron k for the pattern p; N denoted the number of neurons in the output layer. We regarded presenting a set of learning patterns to the neural network as one learning cycle. We used the four kinds of three-layered Complex-BP networks: 1-3-1, 1-6-1, 1-9-1, and 1-12-1 networks. After training, by presenting the 1,681(D 41£41) points in the complex plane [¡1; 1] £ [¡1; 1] (x C iy, where x D ¡1:0; ¡0:95; : : : ; 0:95; 1:0I y D ¡1:0; ¡0:95; : : : ; 0:95; 1:0), the actual output points formed the decision boundaries. Figure 4 shows an example of the decision boundary of the Complex-BP network. In Figure 4, the number 1 denotes the region in which the real part of the output value of the neural
T. Nitta
Table 6: Learning Pattern 1.

Input Pattern      Output Pattern
−0.03 − 0.03i      1 + i
0.03 − 0.03i       i
0.03 + 0.03i       0
−0.03 + 0.03i      1

Table 7: Learning Pattern 2.

Input Pattern      Output Pattern
−0.03 − 0.03i      i
0.03 − 0.03i       0
0.03 + 0.03i       1
−0.03 + 0.03i      1 + i
network is OFF (0.0–0.5) and the imaginary part is OFF; in region 2, the real part is ON (0.5–1.0) and the imaginary part OFF; in region 3, the real part is OFF and the imaginary part ON; and in region 4, both the real and imaginary parts are ON. The decision boundary for the real part (i.e., the boundary formed by regions 1+3 and 2+4) and that for the imaginary part (i.e., the boundary formed by regions 1+2 and 3+4) intersect orthogonally. We also conducted corresponding experiments for the Real-BP networks. We chose the 2-4-2 Real-BP network as the comparison object for the 1-3-1 Complex-BP network because their numbers of parameters (weights and thresholds) were almost the same: 20 for the 1-3-1 Complex-BP network and 22 for the 2-4-2 Real-BP network, where a complex-valued parameter z = x + iy (i = √−1) is counted as two parameters because it consists of a real part x and an imaginary part y. Similarly, the 2-7-2, 2-11-2, and 2-14-2 Real-BP networks were chosen as the comparison objects for the 1-6-1, 1-9-1, and 1-12-1 Complex-BP networks, respectively. The numbers of parameters are shown in Table 9. In the Real-BP networks, the real component of a complex number was input into the first input neuron and the imaginary component into the second input neuron; the output from the first output neuron was interpreted as the real component of a complex number, and the output from the second output neuron as the imaginary component. Figure 5 shows an example of the decision boundary of the Real-BP network, where the numbers 1–4 have the same meanings as in Figure 4. We can find from Figure 5 that the decision boundary for the real part (i.e., the boundary formed by regions 1+3 and 2+4) and that for the imaginary part (i.e., the boundary formed by regions 1+2 and 3+4) do not intersect orthogonally.

Table 8: Learning Pattern 3.

Input Pattern      Output Pattern
−0.03 − 0.03i      0
0.03 − 0.03i       1
0.03 + 0.03i       1 + i
−0.03 + 0.03i      i

Figure 4: An example of the decision boundary of the 1-12-1 Complex-BP network learned with learning pattern 1. The meanings of the numerals are as follows. 1: real part OFF (0.0–0.5), imaginary part OFF; 2: real part ON (0.5–1.0), imaginary part OFF; 3: real part OFF, imaginary part ON; and 4: real part ON, imaginary part ON. The decision boundary for the real part (i.e., the boundary formed by regions 1+3 and 2+4) and that for the imaginary part (i.e., the boundary formed by regions 1+2 and 3+4) intersect orthogonally.

Table 9: Number of Parameters in the Real-BP and Complex-BP Networks.

Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
Number of parameters     20      38      56      74
Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
Number of parameters     22      37      57      72
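The parameter counts in Table 9 follow from a simple tally, sketched here (the helper function names are ours, for illustration): a w-x-y network has w·x + x·y weights and x + y thresholds, and each complex-valued parameter counts as two real ones.

```python
# A quick check of the parameter counts in Table 9 (helper names are ours):
# each complex-valued weight or threshold counts as two real parameters.

def real_bp_params(n_in, n_hid, n_out):
    weights = n_in * n_hid + n_hid * n_out
    thresholds = n_hid + n_out
    return weights + thresholds

def complex_bp_params(n_in, n_hid, n_out):
    return 2 * real_bp_params(n_in, n_hid, n_out)  # complex = 2 real numbers

print([complex_bp_params(1, h, 1) for h in (3, 6, 9, 12)])  # [20, 38, 56, 74]
print([real_bp_params(2, h, 2) for h in (4, 7, 11, 14)])    # [22, 37, 57, 72]
```

This reproduces exactly the figures quoted in the text, confirming that each comparison pair has nearly the same number of real-valued parameters.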
Figure 5: An example of the decision boundary of the 2-14-2 Real-BP network learned with learning pattern 1. The numbers 1–4 have the same meanings as in Figure 4. The decision boundary for the real part (i.e., the boundary formed by regions 1+3 and 2+4) and that for the imaginary part (i.e., the boundary formed by regions 1+2 and 3+4) do not intersect orthogonally.
First, we measured the angles between the decision boundary for the real part (i.e., the boundary formed by regions 1+3 and 2+4) and that for the imaginary part (i.e., the boundary formed by regions 1+2 and 3+4), which are the components of the decision boundary of the output neuron in the Complex-BP networks, by visual observation under the experimental conditions described above. In the visual observation, the observed angles of the decision boundaries were roughly classified into three classes: 30, 60, and 90 degrees. The result of the observation is shown in Table 10: most trials were 90 degrees, a few trials were 60 degrees, and none were 30 degrees. Then the average and the standard deviation of the angles over 100 trials, for each of the three learning patterns and each of the four network structures, were used as the evaluation criteria. Although we set a limit of 200,000 learning iterations, all trials succeeded in converging. We also measured the same quantities for the Real-BP networks for comparison. The results of the experiments are shown in Table 11. We see in Table 11 that all the average angles for the Complex-BP networks are almost 90 degrees, independent of the learning patterns and the network structures, whereas those of the Real-BP networks are around 70 to 80 degrees. In addition, the standard deviations of the angles for the Complex-BP networks are around 0 to 5 degrees, and those for the Real-BP networks are around 20 degrees. Thus, we can conclude from the experimental results that the decision boundary for the real part and that for the imaginary part, which are the components of the decision boundary of the output neuron in the three-layered Complex-BP networks, almost intersect orthogonally, whereas those for the Real-BP networks do not. This orthogonality of the decision boundaries raises the possibility of improving the generalization ability of the networks.

Table 10: Result of the Visual Observation of the Angles of the Decision Boundaries: Complex-BP Network.

                                      30 degrees   60 degrees   90 degrees
Learning pattern 2 (1-3-1 network)        0            4            96
Learning pattern 3 (1-6-1 network)        0            3            97
Learning pattern 3 (1-9-1 network)        0            1            99
Other 9 cases                             0            0           100

Notes: Angles were roughly classified into three classes: 30, 60, and 90 degrees. Each numeral gives the number of trials in which the observed angle fell into a specific class.

Table 11: Comparison of the Angles of the Decision Boundaries (in degrees).

Pattern 1
  Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Average                90      90      90      90
    Standard deviation      0       0       0       0
  Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Average                78      72      76      80
    Standard deviation     19      22      17      17
Pattern 2
  Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Average                89      90      90      90
    Standard deviation      6       0       0       0
  Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Average                85      77      77      75
    Standard deviation     16      20      18      21
Pattern 3
  Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Average                90      89      90      90
    Standard deviation      0       5       3       0
  Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Average                86      76      72      73
    Standard deviation     11      20      22      22

Next, we measured the discrimination rate of the Complex-BP network for unlearned patterns in order to clarify how the orthogonality of the decision boundary of the three-layered Complex-BP network affects its generalization ability. Specifically, in the same experiments on the angles of the decision boundaries described above, we counted the number of test patterns for which the Complex-BP network gave the correct output, over 100 trials for each of the three learning patterns and each of the four network structures. We defined correctness as follows: the output value X + iY (0 ≤ X, Y ≤ 1) of the Complex-BP network for an unlearned pattern x + iy (−1 ≤ x, y ≤ 1) was correct if |X − A| < 0.5 and |Y − B| < 0.5, provided that the closest input learning pattern to the unlearned pattern x + iy was a + ib, whose corresponding output learning pattern was A + iB (A, B = 0 or 1). For example, the output value X + iY (0 ≤ X, Y ≤ 1) for an unlearned pattern x + iy (0 ≤ x, y ≤ 1) was correct if both the real and imaginary parts of the output value took values less than 0.5, provided that the corresponding output learning pattern for the input learning pattern 0.03 + 0.03i was 0. Then the average and the standard deviation of the discrimination rate over 100 trials, for each of the three learning patterns and each of the four network structures, were used as the evaluation criterion. The results of the experiments, including the Real-BP case, appear in Table 12. The simulation results clearly suggest that the Complex-BP network has better generalization performance than the Real-BP network. Furthermore, we investigated the causality between the high generalization ability of the Complex-BP network and the orthogonal decision boundary. Table 13 shows the average of the discrimination rate for the three upper cases shown in Table 10. Table 13 clearly suggests that the average discrimination rate for the three-layered complex-valued neural network with the orthogonal decision boundary is superior to that with the nonorthogonal decision boundary. Thus, we believe that the orthogonality of the decision boundaries causes the high generalization ability of the Complex-BP network.
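The correctness rule above can be sketched as follows (the helper names are our own illustration, using learning pattern 1 from Table 6 as the reference set):

```python
import numpy as np

# A sketch of the correctness rule for the discrimination rate (our own
# helper names; learning pattern 1 from Table 6 is the reference set): an
# output X + iY for an unlearned input is correct when both parts lie
# within 0.5 of the target A + iB of the nearest input learning pattern.

LEARNING_INPUTS = np.array([-0.03 - 0.03j, 0.03 - 0.03j, 0.03 + 0.03j, -0.03 + 0.03j])
LEARNING_TARGETS = np.array([1 + 1j, 1j, 0, 1])

def is_correct(unlearned, output):
    nearest = np.argmin(np.abs(LEARNING_INPUTS - unlearned))
    target = LEARNING_TARGETS[nearest]
    return bool(abs(output.real - target.real) < 0.5 and
                abs(output.imag - target.imag) < 0.5)
```

For instance, an unlearned point in the first quadrant is closest to 0.03 + 0.03i, whose target under pattern 1 is 0, so any output with both parts below 0.5 counts as correct.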
Table 12: Comparison of the Discrimination Rate (%).

Pattern 1
  Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Average                92      95      97      98
    Standard deviation      6       5       3       2
  Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Average                88      90      93      93
    Standard deviation      8       7       4       4
Pattern 2
  Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Average                93      95      96      97
    Standard deviation      6       4       4       3
  Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Average                88      91      92      93
    Standard deviation      8       7       5       6
Pattern 3
  Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Average                92      94      97      97
    Standard deviation      7       4       3       3
  Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Average                87      90      90      92
    Standard deviation      9       7       7       6
Table 13: Average of the Discrimination Rate for the Three Upper Cases Shown in Table 10 (%).

                                      30 degrees   60 degrees   90 degrees
Learning pattern 2 (1-3-1 network)        —            88            93
Learning pattern 3 (1-6-1 network)        —            88            93
Learning pattern 3 (1-9-1 network)        —            84            96
Table 14: Comparison of the Learning Speed, by Learning Cycle.

Pattern 1
  Complex-BP network      1-3-1    1-6-1    1-9-1    1-12-1
    Average              10,770   10,178    9766     9529
    Standard deviation     438      472      210      144
  Real-BP network         2-4-2    2-7-2    2-11-2   2-14-2
    Average              31,647   29,945   28,947   28,230
    Standard deviation    1268      944      697      566
Pattern 2
  Complex-BP network      1-3-1    1-6-1    1-9-1    1-12-1
    Average              10,608    9932     9713     9539
    Standard deviation     418      167      148      110
  Real-BP network         2-4-2    2-7-2    2-11-2   2-14-2
    Average              31,781   29,842   28,902   28,267
    Standard deviation    2126      809      721      576
Pattern 3
  Complex-BP network      1-3-1    1-6-1    1-9-1    1-12-1
    Average              10,746   10,055    9713     9502
    Standard deviation     740      412      228      188
  Real-BP network         2-4-2    2-7-2    2-11-2   2-14-2
    Average              34,038   29,620   28,980   28,471
    Standard deviation    5182      806      603      584
Finally, we investigated the average and the standard deviation of the learning speed (i.e., the number of learning cycles needed to converge) over 100 trials, for each of the three learning patterns and each of the four network structures, in the experiments described above. The results are shown in Table 14. We find from these experiments that the learning speed of the Complex-BP is several times faster than that of the Real-BP, and that the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP.

5 Discussion

We proved in section 3.1 that the decision boundary for the real part of the output of a single complex-valued neuron and that for the imaginary part intersect orthogonally. Since this property is completely different from that of a usual real-valued neuron, one needs to design complex-valued neural networks for real applications, and their learning algorithms, taking the orthogonal property of the complex-valued neuron into account, whatever the type of the network (multilayered or mutually connected). Moreover, it has been constructively proved, using the orthogonal property of the decision boundary, that the XOR problem and the detection-of-symmetry problem can be solved by two-layered complex-valued neural networks (i.e., complex-valued neurons), although they cannot be solved by two-layered real-valued neural networks (i.e., real-valued neurons). These results reveal a potent computational power of complex-valued neural nets. Raising the dimensionality of neural networks (e.g., to complex numbers) may be a new direction for enhancing their ability. In addition, it has been shown as an application of the above computational power that the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. One should therefore use two-layered complex-valued neural networks rather than three-layered ones when solving the fading equalization problem with complex-valued neural networks. It is not always guaranteed that the decision boundary of the three-layered complex-valued neural network has the orthogonality, as we made clear in section 3.2. We therefore derived in section 3.2 a sufficient condition for the decision boundaries in the three-layered complex-valued neural network to almost intersect orthogonally (the Theorem): both the absolute values of the real and imaginary parts of the net inputs to all hidden neurons must be sufficiently large. This is a characterization of the structure of the decision boundaries in the three-layered complex-valued neural network. The theorem will be useful if a learning algorithm is devised under which both the absolute values of the real and imaginary parts of the net inputs to all hidden neurons become sufficiently large, because there is a possibility that the orthogonality of the decision boundaries can improve the generalization ability of three-layered complex-valued neural networks, as we have seen in section 4.4.
That is, there is a possibility that the theorem can be used to improve the generalization ability of the three-layered complex-valued neural network. However, a situation in which the theorem is directly useful for the Complex-BP network cannot be envisioned for now, because controlling the net input is difficult as long as the Complex-BP algorithm (that is, a steepest-descent method) is used. The Complex-BP is only one of the learning algorithms for complex-valued neural networks, so it should be noted that the usefulness of the theorem depends on the learning algorithm used. Although the orthogonality of the decision boundaries in the three-layered complex-valued neural network can be guaranteed only conditionally, as described above, we found from the experiments in section 4.4 that most of the decision boundaries in the three-layered Complex-BP network intersect orthogonally. Moreover, the experiments suggest that the orthogonality of the decision boundaries in the three-layered Complex-BP network may improve its generalization ability. The decision boundary of the complex-valued neural network, which consists of two orthogonal hypersurfaces, divides a decision region into four equal sections, so it is intuitively plausible that the
orthogonality of the decision boundaries improves its generalization ability. We then showed this possibility experimentally. A theoretical evaluation of the generalization ability of the complex-valued neural network, using criteria such as the Vapnik-Chervonenkis dimension (VC-dimension) (Vapnik, 1998), could clarify how the orthogonal property of the decision boundaries of the complex-valued neural network influences its generalization ability. It has already been reported that the average learning speed of the Complex-BP is several times faster than that of the Real-BP (Nitta, 1997). We could confirm this in the experiments on the orthogonality of the decision boundary and the generalization ability of the three-layered Complex-BP network (section 4.4). We also found that the standard deviation of the learning speed of the Complex-BP was smaller than that of the Real-BP, which had not been reported in Nitta (1997).

6 Conclusions

We have clarified the differences between the real-valued neural network and the complex-valued neural network through theoretical and experimental analyses of their fundamental properties, and we have clarified the utility that the properties discovered in this work bring to the complex-valued neural network. It turned out that the complex-valued neural network basically has an orthogonal decision boundary as a result of the extension to complex numbers. A decision boundary of a single complex-valued neuron consists of two hypersurfaces that intersect orthogonally and divides a decision region into four equal sections. The XOR problem and the detection-of-symmetry problem, which cannot be solved with two-layered real-valued neural networks, can be solved by two-layered complex-valued neural networks with orthogonal decision boundaries, which reveals a potent computational power of complex-valued neural nets.
Furthermore, the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. A decision boundary of a three-layered complex-valued neural network has the orthogonal property as a basic structure, and its two hypersurfaces approach orthogonality as all the net inputs to each hidden neuron grow. In particular, most of the decision boundaries in the three-layered complex-valued neural network intersect orthogonally when the network is trained using the Complex-BP algorithm. As a result, the orthogonality of the decision boundaries improves its generalization ability; its theoretical proof is a future topic. Furthermore, the average learning speed of the Complex-BP is several times faster than that of the Real-BP, and the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP. For the above reasons, the complex-valued neural network and the related Complex-BP algorithm are natural methods for learning complex-valued patterns and are expected to be used effectively in fields dealing with complex numbers.

Acknowledgments

I give special thanks to S. Akaho, Y. Akiyama, M. Asogawa, T. Furuya, and the anonymous reviewers for valuable comments.

References

Arena, P., Fortuna, L., Muscato, G., & Xibilia, M. G. (1998). Neural networks in multidimensional domains. Lecture Notes in Control and Information Science 234. Berlin: Springer-Verlag.
Arena, P., Fortuna, L., Re, R., & Xibilia, M. G. (1993). On the capability of neural networks with complex neurons in complex valued functions approximation. In Proc. IEEE Int. Conf. on Circuits and Systems (pp. 2168–2171). Washington, DC: IEEE.
Benvenuto, N., & Piazza, F. (1992). On the complex backpropagation algorithm. IEEE Trans. Signal Processing, 40(4), 967–969.
Georgiou, G. M., & Koutsougeras, C. (1992). Complex domain backpropagation. IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, 39(5), 330–334.
ICONIP. (2002). Complex-valued neural networks. In Proc. International Conference on Neural Information Processing (Vol. 3, pp. 1074–1103). Washington, DC: IEEE.
KES. (2001). Complex-valued neural networks and their applications. In N. Baba, L. C. Jain, & R. J. Howlett (Eds.), Knowledge-based intelligent information engineering systems and allied technologies (Part I, pp. 550–580). Tokyo: IOS Press.
KES. (2002). Complex-valued neural networks. In E. Damiani, R. J. Howlett, L. C. Jain, & N. Ichalkaranje (Eds.), Knowledge-based intelligent information engineering systems and allied technologies (Part I, pp. 623–647). Amsterdam: IOS Press.
Kim, M. S., & Guest, C. C. (1990). Modification of backpropagation networks for complex-valued signal processing in frequency domain. In Proc. Int. Joint Conf. on Neural Networks (Vol. 3, pp. 27–31). Washington, DC: IEEE.
Kim, T., & Adali, T. (2000). Fully complex backpropagation for constant envelope signal processing. In Proc. IEEE Workshop on Neural Networks for Signal Processing (pp. 231–240). Washington, DC: IEEE.
Kuroe, Y., Hashimoto, N., & Mori, T. (2002). On energy function for complex-valued neural networks and its applications. In Proc. Int. Conf. on Neural Information Processing (Vol. 3, pp. 1079–1083). Washington, DC: IEEE.
Lathi, B. P. (1998). Modern digital and analog communication systems (3rd ed.). New York: Oxford University Press.
Lippmann, R. P. (1987, April). An introduction to computing with neural nets. IEEE Acoustic, Speech and Signal Processing Magazine, 4–22.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Miyauchi, M., Seki, M., Watanabe, A., & Miyauchi, A. (1992). Interpretation of optical flow through neural network learning. In Proc. IAPR Workshop on Machine Vision Applications (pp. 523–528).
Miyauchi, M., Seki, M., Watanabe, A., & Miyauchi, A. (1993). Interpretation of optical flow through complex neural network. In Proc. Int. Workshop on Artificial Neural Networks: Lecture Notes in Computer Science 686 (pp. 645–650). Berlin: Springer-Verlag.
Nitta, T. (1993). A complex numbered version of the back-propagation algorithm. In Proc. World Congress on Neural Networks (Vol. 3, pp. 576–579). Mahwah, NJ: Erlbaum.
Nitta, T. (1997). An extension of the back-propagation algorithm to complex numbers. Neural Networks, 10(8), 1392–1415.
Nitta, T., & Furuya, T. (1991). A complex back-propagation learning. Transactions of Information Processing Society of Japan, 32(10), 1319–1329 (in Japanese).
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986a). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Learning representations by back-propagating errors. Nature, 323, 533–536.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Watanabe, A., Yazawa, N., Miyauchi, A., & Miyauchi, M. (1994). A method to interpret 3D motions using neural networks. IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, E77-A(8), 1363–1370.

Received July 9, 2002; accepted May 5, 2003.
LETTER
Communicated by Sumio Watanabe and Katsuyuki Hagiwara
On the Asymptotic Distribution of the Least-Squares Estimators in Unidentifiable Models

Taichi Hayasaka
[email protected]
Department of Information and Computer Engineering, Toyota National College of Technology, Toyota, Aichi 471-8525, Japan

Masashi Kitahara
[email protected]
Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi, Aichi 441-8580, Japan

Shiro Usui
[email protected]
Laboratory for Neuroinformatics, RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan
In order to analyze the stochastic properties of multilayered perceptrons or other learning machines, we deal with simpler models and derive the asymptotic distribution of the least-squares estimators of their parameters. In the case where a model is unidentifiable, we show results different from those for traditional linear models: the well-known property of asymptotic normality never holds for the estimates of redundant parameters.

Neural Computation 16, 99–114 (2004)
© 2003 Massachusetts Institute of Technology

1 Introduction

Studying the fundamental properties of multilayered perceptrons (MLP) treated as a parametric stochastic model in mathematical statistics is a challenging problem. White (1989), who did pioneering work in this area, identified one big difference between MLP and such traditional models as linear regression and ARMA: the unidentifiability of optimal parameters, due to the nonuniqueness of a combination of parameter values that gives an equivalent input-output relation. This result suggests that it is fruitless to apply some traditional methods to a number of problems in applications of MLP. For example, consider the problem of selecting a suitable network structure, or number of units, that has good generalization ability. This is formulated as a statistical model selection problem, and various criteria to evaluate the generalization ability of models have been proposed. It is known that the probability distribution of the maximum likelihood estimators (MLE) of parameters converges to the gaussian distribution as the number of given samples goes to infinity. From this property, called asymptotic normality, the well-known criterion AIC (Akaike information criterion; Akaike, 1974) is derived. In the case of MLP, however, we cannot show the asymptotic normality of estimators because of the unidentifiability of optimal parameters, so the effectiveness of AIC is not ensured (Hagiwara, Toda, & Usui, 1993; Anders & Korn, 1999). Although some other criteria, such as NIC (network information criterion; Murata, Yoshizawa, & Amari, 1994) and GPE (generalized prediction error; Moody, 1992), have been proposed for MLP, they are effective only when asymptotic normality holds. This implies that the determination of network structure is an open problem (Hagiwara, 2002). Therefore, several researchers have been working on MLP's mathematical properties related to its generalization ability (e.g., Fukumizu, 1996; Hagiwara, Hayasaka, Toda, Usui, & Kuno, 2001; Watanabe, 2001). White (1989) showed the asymptotic normality of the estimates by supposing the uniqueness of optimal parameters and the compactness of the parameter space. Murata et al. (1994) obtained a similar result. Furthermore, White asserted that the asymptotic distribution of parameter estimators in MLP becomes limiting mixed gaussian (LMG), introduced by Phillips (1989), in case the optimal values of parameters are not unique but are locally identifiable. However, White did not prove whether the local identifiability of optimal parameters holds for MLP. Kitahara, Hayasaka, Toda, and Usui (2001) suggested by numerical experiments that the compactness is not satisfied for MLP parameters. They also obtained the empirical distribution of the estimators, which cannot be represented by a family of gaussian or LMG distributions. We could say from those results that White's idea does not express the exact stochastic property of MLP.
The goal of this study is to derive such a distribution function theoretically. We first deal with a simple model under the unidentifiability condition and derive the asymptotic distribution of its parameter estimators in the context of regression estimation. This amounts to uniting the asymptotic normality for identifiable parameters (Murata et al., 1994; White, 1989) with the asymptotic distribution of the squares of unidentifiable parameters for a gaussian noise sequence (Hagiwara et al., 2001; Hayasaka, Toda, Usui, & Hagiwara, 1996). Then we discuss whether a similar result holds for models provided by a subset of the parameter space of MLP, by applying the theory of extremes in a non-independent and identically distributed (i.i.d.) gaussian sequence.

2 Problem Formulation

Let us consider regression estimation from a given set of N pairs of independent input-output samples (x, y)_N = {(x_i, y_i); i = 1, …, N}, x_i ∈ R, y_i ∈ R. Outputs of a target system y_i for predetermined inputs x_i are
generated by adding noise to a function,

    y_i = h(x_i) + \varepsilon_i^*, \quad i = 1, \ldots, N,    (2.1)
where \varepsilon_i^* is an independent random variable with a common gaussian distribution N(0, \sigma_*^2). The true function h determines an invariant input-output relation of the system. Applying an MLP with one output unit, s hidden units, and one input unit for estimating h, we describe a generating rule of the samples (x, y)_N as follows:

    y_i = \sum_{j=1}^{s} \frac{c_j}{1 + \exp(-b_{j1} x_i - b_{j0})} + \varepsilon_i, \quad i = 1, \ldots, N,    (2.2)
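For concreteness, the generating rule of equation 2.2 can be sketched as follows (the parameter values and names are our own illustrative choices, not from the article):

```python
import numpy as np

# A sketch of the generating rule in equation 2.2 (parameter values and
# names below are our own illustrative choices, not from the article).

rng = np.random.default_rng(0)

def mlp_output(x, c, b1, b0):
    # Sum of s sigmoidal hidden units: sum_j c_j / (1 + exp(-b_j1*x - b_j0))
    return sum(cj / (1.0 + np.exp(-bj1 * x - bj0))
               for cj, bj1, bj0 in zip(c, b1, b0))

c, b1, b0, sigma_s = [1.5, -0.8], [2.0, -1.0], [0.1, 0.3], 0.05  # s = 2 units
x = np.linspace(-1.0, 1.0, 100)
y = mlp_output(x, c, b1, b0) + rng.normal(0.0, sigma_s, size=100)
```

Each (x_i, y_i) pair is the deterministic network output plus an independent gaussian noise term, matching equation 2.2.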
where c_j, b_{j0}, b_{j1} ∈ R, and \varepsilon_i is an independent random variable with a common gaussian distribution N(0, \sigma_s^2). c_j corresponds to the connection weight between the output layer and the hidden layer, and b_{j1} is the weight between the hidden layer and the input layer. b_{j0} is regarded as the threshold. b_{j0} and b_{j1} are nonlinear parameters in terms of the input-output relationship.

We now define the context of regression estimation in this study. A combination of basis functions is usually used for estimating the unknown function h, namely, the regression model

    y_i = \sum_{j=1}^{s} c_j \phi(x_i; b_j) + \varepsilon_i, \quad i = 1, \ldots, N,    (2.3)

where c_j ∈ R and b_j ∈ R^t. Equation 2.3 is represented in matrix notation as

    y_N = \Phi_s^{(b_s)} c_s + \varepsilon_N,    (2.4)

where y_N = (y_1, \ldots, y_N)^T, c_s = (c_1, \ldots, c_s)^T, \varepsilon_N = (\varepsilon_1, \ldots, \varepsilon_N)^T, and \Phi_s^{(b_s)} is the N × s design matrix whose component in the ith row and jth column is \phi(x_i; b_j), provided by b_s = (b_1, \ldots, b_s). We call c_j a coefficient and b_j a basis parameter. In the case of MLP, b_j = (b_{j0}, b_{j1}) for all j. Throughout this article, we deal with the case where the squared error is employed as the loss function for parameter estimation:

    r_{emp}(c_s, b_s; (x, y)_N) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{s} c_j \phi(x_i; b_j) \right)^2.    (2.5)

The estimators based on (x, y)_N are assumed to be the parameters that minimize the empirical loss, that is, the least-squares estimators (LSE):

    (\hat{c}_s, \hat{b}_s) \equiv \operatorname{argmin}_{c_s, b_s} r_{emp}(c_s, b_s; (x, y)_N).    (2.6)
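A direct transcription of the empirical loss in equation 2.5 looks like this (a sketch with our own function names, for a generic basis function phi):

```python
import numpy as np

# A direct transcription of the empirical loss in equation 2.5 for a generic
# basis function phi(x, b); the function names are ours, for illustration.

def r_emp(c, b, x, y, phi):
    pred = sum(cj * phi(x, bj) for cj, bj in zip(c, b))
    return float(np.mean((y - pred) ** 2))
```

Minimizing this quantity jointly over the coefficients and the basis parameters yields the least-squares estimators of equation 2.6.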
The LSE \hat{c}_s = (\hat{c}_1, \ldots, \hat{c}_s) and \hat{b}_s = (\hat{b}_1, \ldots, \hat{b}_s) are equivalent to the MLE because of the assumption of gaussian noise. Suppose that the true function h is represented by a linear combination of s^* basis functions with optimal parameter values c_j^* and b_j^*, j = 1, \ldots, s^*:

    h(x_i) \equiv \sum_{j=1}^{s^*} c_j^* \phi(x_i; b_j^*).    (2.7)
When we apply the regression model with s basis functions (see equation 2.3) and s > s^*, the optimal coefficient values for the redundant basis functions, c_j^*, j > s^*, are regarded as zero. In this case, b_j^* can take any value; we say that the optimal basis parameter values are unidentifiable, because every b_j gives the same function representation (or the same value of the squared error, equation 2.5) when c_j = 0. If the basis parameters are constant, equation 2.3 is a traditional linear regression model. Therefore, \hat{c}_s has a joint gaussian distribution asymptotically, and the inverse of the Hessian of the loss function (equivalent to the information matrix in this case) corresponds to the covariance matrix of the gaussian. However, when the optimal basis parameter values are unidentifiable, the Hessian becomes singular. Since it is impossible to calculate its inverse, asymptotic normality cannot be shown for regression models with variable basis parameters like MLP (White, 1989; Hagiwara et al., 1993; Anders & Korn, 1999). In the following sections, we consider the probability distribution of \hat{c}_s in such a case.

3 Asymptotic Distribution for the Regression Model with Orthogonal Basis Functions

The parameters of regression models usually take real values or positive real values. In order to simplify the analysis, we set the basis parameter space to a finite set B ≡ {b_1, \ldots, b_N} ⊂ R^t such that the basis functions \phi(x; b_1), \ldots, \phi(x; b_N) are linearly independent. Since this model also includes unidentifiable optimal parameters, its properties may be reflected in those of MLP or other learning machines represented by a superposition of adaptive basis functions. Let \hat{c}_s^{(b_s)} be the LSE of c_s given by solving the normal equation,

    \hat{c}_s^{(b_s)} = (\Phi_s^{(b_s)T} \Phi_s^{(b_s)})^{-1} \Phi_s^{(b_s)T} y_N,    (3.1)
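Equation 3.1 corresponds to the following sketch (our own names; in practice `np.linalg.lstsq` is the numerically safer route, while `solve` mirrors the formula as written):

```python
import numpy as np

# A sketch of equation 3.1 (our names): least-squares coefficients for a
# fixed basis-parameter vector via the normal equation. In practice,
# np.linalg.lstsq is numerically safer; solve() mirrors the formula.

def lse_coefficients(Phi, y):
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # a toy design matrix
c_hat = lse_coefficients(Phi, Phi @ np.array([2.0, 3.0]))  # recovers (2, 3)
```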
provided by a basis parameter vector b_s. Then the least-squares error is obtained by

    \min_{b_s} \min_{c_s} \frac{1}{N} (y_N - \Phi_s^{(b_s)} c_s)^T (y_N - \Phi_s^{(b_s)} c_s)
      = \min_{b_s} \frac{1}{N} (y_N - \Phi_s^{(b_s)} \hat{c}_s^{(b_s)})^T (y_N - \Phi_s^{(b_s)} \hat{c}_s^{(b_s)})
      = \min_{b_s} \frac{1}{N} (y_N^T y_N - y_N^T P_s^{(b_s)} y_N),    (3.2)

where P_s^{(b_s)} = \Phi_s^{(b_s)} (\Phi_s^{(b_s)T} \Phi_s^{(b_s)})^{-1} \Phi_s^{(b_s)T}. The LSE are obtained by

    \hat{b}_s = \operatorname{argmax}_{b_s} y_N^T P_s^{(b_s)} y_N,    (3.3)

    \hat{c}_s = (\Phi_s^{(\hat{b}_s)T} \Phi_s^{(\hat{b}_s)})^{-1} \Phi_s^{(\hat{b}_s)T} y_N.    (3.4)
From equations 3.3 and 3.4, the probability distribution of ĉ_s is governed by the stochastic behavior of the maximum of the random sequence {y_N^⊤ P_s^{(b_s)} y_N} over all combinations of the elements of b_s. If φ(x; b_1), …, φ(x; b_N) satisfy an orthogonality condition under an appropriate choice of an ordered set of input values, the properties of the LSE can be analyzed in detail. We assume that

\sum_{i=1}^{N} \phi(x_i; b_j)\, \phi(x_i; b_k) = \begin{cases} \lambda & j = k \\ 0 & j \neq k, \end{cases}    (3.5)
where λ, the squared norm of the basis functions, usually diverges as N goes to infinity so that the true value of each coefficient c_j can remain constant; suppose that λ is a constant multiple of N. For example, consider trigonometric basis functions for x_i ≡ 2πi/N, i = 1, …, N,

\phi(x_i; b_j) = b_{j0} \cos(b_{j1} x_i - b_{j2}),    (3.6)

where each basis parameter b_j = (b_{j0}, b_{j1}, b_{j2}) lies in the basis parameter space

B_j = \{(1, 0, 0),\, (\sqrt{2}, 1, 0),\, (\sqrt{2}, 1, \pi/2),\, (\sqrt{2}, 2, 0),\, (\sqrt{2}, 2, \pi/2),\, \ldots,\, (\sqrt{2}, N/2 - 1, \pi/2),\, (1, N/2, 0)\}.

In this case, equation 3.5 is satisfied, and λ = N. We have

\max_{b_s} y_N^\top P_s^{(b_s)} y_N = \max_{b_s} \hat{c}_s^{(b_s)\top} \Phi_s^{(b_s)\top} \Phi_s^{(b_s)} \hat{c}_s^{(b_s)} = \lambda \max_{b_s} \sum_j (\hat{c}^{(b_j)})^2,    (3.7)

where

\hat{c}^{(b_j)} = \frac{1}{\lambda} \sum_{i=1}^{N} y_i \, \phi(x_i; b_j).    (3.8)
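The orthogonality condition 3.5 with λ = N for the trigonometric family 3.6 can be checked numerically; the following sketch (our code, with the parameter space written out as in the text) builds the Gram matrix and compares it to λ·I.

```python
import numpy as np

N = 16
x = 2 * np.pi * np.arange(1, N + 1) / N

# basis parameter space of equation 3.6:
# (1,0,0), (sqrt2,k,0), (sqrt2,k,pi/2) for k=1..N/2-1, and (1,N/2,0)
params = [(1.0, 0, 0.0)]
for k in range(1, N // 2):
    params += [(np.sqrt(2), k, 0.0), (np.sqrt(2), k, np.pi / 2)]
params.append((1.0, N // 2, 0.0))

Phi = np.stack([b0 * np.cos(b1 * x - b2) for b0, b1, b2 in params], axis=1)
G = Phi.T @ Phi            # Gram matrix; equation 3.5 says G = lambda * I, lambda = N
print(np.allclose(G, N * np.eye(N), atol=1e-8))  # True
```

Note that cos(kx − π/2) = sin(kx), so the family is exactly the discrete Fourier basis on the grid x_i = 2πi/N, which is why the Gram matrix is diagonal.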
T. Hayasaka, M. Kitahara, and S. Usui
The LSE ĉ_j and b̂_j, j = 1, …, s, are regarded as b̂_j = b_{l_j} and ĉ_j = ĉ^{(b_{l_j})}, where l_1, …, l_N rearrange the squared coefficients (ĉ^{(b_{l_j})})² in descending order, that is, (ĉ^{(b_{l_1})})² ≥ ⋯ ≥ (ĉ^{(b_{l_N})})². Then (ĉ^{(b_j)})² λ/σ∗², j = 1, …, N, are independent random variables with the noncentral χ² distribution with 1 degree of freedom and noncentrality parameter (c_j∗)² λ/σ∗². The independence of the random sequence makes it easier to study the properties of the maxima, because classical extreme value theory (e.g., Leadbetter & Rootzén, 1988) can be applied. We show the following theorem for the asymptotic distribution of the LSE of the coefficients.
Theorem 1. Assume that (c_1∗)² > (c_2∗)² > ⋯ > (c_{s∗}∗)² and λ > O(log N). Then the probability distribution of ĉ_j, j ≤ s∗, converges to the gaussian distribution,

P\{\hat{c}_j \le x\} \to \mathcal{N}\!\left(c_j^*, \frac{\sigma_*^2}{\lambda}\right),    (3.9)

as N → ∞. For j > s∗, we have

P\{\hat{c}_j \le x\} \to \frac{1}{2} + \frac{\operatorname{sgn}(x)}{2} \exp(-e^{-u_N(\tilde{x})}) \sum_{n=0}^{j-s^*-1} \frac{e^{-n\, u_N(\tilde{x})}}{n!},    (3.10)

as N → ∞, where \tilde{x} = \sqrt{x^2 \lambda}/\sigma_*, u_N(\tilde{x}) = \alpha_N(\tilde{x} - \beta_N), \alpha_N = \sqrt{2 \log(N - s^*)}, and \beta_N = \alpha_N - (\log\log(N - s^*) + \log \pi)/(2\alpha_N).

Proof. Let ξ_j ≡ (ĉ^{(b_j)})² λ/σ∗² and µ_j ≡ (c_j∗)² λ/σ∗², j = 1, …, N. Consider ξ_j = (m_j + ε_j)², where m_j is a constant such that µ_j = m_j², and ε_j is a random variable with N(0, 1). Then

\xi_j = \mu_j + 2 m_j \varepsilon_j + \varepsilon_j^2 \le \mu_j + 2|m_j|\,|\varepsilon_j| + \varepsilon_j^2 \le \mu_j + 2|m_j| \max_i |\varepsilon_i| + \max_i \varepsilon_i^2,    (3.11)

for j ≤ s∗. Deo (1972) showed that

\max_i |\varepsilon_i| \to \sqrt{2 \log N}, \quad \text{a.s.}    (3.12)

It follows that

\max_i \varepsilon_i^2 \to 2 \log N, \quad \text{a.s.}    (3.13)
Since µ_j > O(log N) for j ≤ s∗, we have ξ_j → O(µ_j), a.s. Let M_N^{(j)} be the jth largest maximum in the sequence {ξ_{s∗+1}, …, ξ_N}. From equation 3.13, M_N^{(1)} → 2 log(N − s∗), a.s., because µ_j = 0 for j = s∗ + 1, …, N. Then

\xi_1 > \xi_2 > \cdots > \xi_{s^*} > M_N^{(1)}, \quad \text{a.s.}    (3.14)
Equation 3.14 implies b̂_j → b_j, a.s., for j ≤ s∗, because b̂_j corresponds to the jth largest maximum in {ξ_1, …, ξ_N}, j ≤ s∗. Therefore, ĉ_j (≡ ĉ^{(b̂_j)}) converges to ĉ^{(b_j)} almost surely (which implies convergence in probability). Since each ĉ^{(b_j)} is a random variable with the distribution N(c_j∗, σ∗²/λ), we can prove, by applying the following lemma (Rao, 1973, p. 122), that the asymptotic distribution of ĉ_j is given by the gaussian distribution, equation 3.9, for j ≤ s∗.

Lemma 1. Let {W_N, Z_N}, N = 1, 2, …, be a sequence of pairs of random variables. Then

|W_N - Z_N| \to 0 \text{ in probability},\; Z_N \to Z \text{ in distribution} \;\Rightarrow\; W_N \to Z \text{ in distribution},    (3.15)

as N → ∞; that is, the asymptotic distribution of W_N exists and is the same as that of Z.
In the case where j > s∗, it is obvious that |ĉ_j| → \sqrt{\sigma_*^2 M_N^{(j-s^*)}/\lambda}, a.s., and \sqrt{M_N^{(j-s^*)}} converges almost surely to the (j − s∗)th largest among the absolute values of independent random variables with N(0, 1). From lemma 3 in the appendix, the asymptotic distribution of this extreme is the type 1 distribution,

P\left\{\alpha_N\left(\sqrt{M_N^{(j-s^*)}} - \beta_N\right) \le x\right\} \to \exp(-e^{-x}) \sum_{n=0}^{j-s^*-1} \frac{e^{-nx}}{n!}, \quad \text{as } N \to \infty,    (3.16)

where α_N = \sqrt{2 \log(N - s^*)} and β_N = α_N − (\log\log(N − s∗) + \log π)/(2α_N). Since the distribution of each ĉ^{(b_j)} is N(0, σ∗²/λ) and this density function is symmetric with respect to the origin, equation 3.10 holds by using lemma 1.

The asymptotic normality of ĉ_j holds only when j ≤ s∗, which is consistent with the limit theorems for identifiable parameters (Murata et al., 1994; White, 1989). Equation 3.10 is the asymptotic distribution of the LSE of coefficients for redundant basis functions, which represent noise components included in the given samples, and it can also be derived from lemma 7 in Hagiwara et al. (2001). They dealt with the case where the regression model with n-point-fitting basis functions is applied to samples provided by a gaussian noise sequence only (i.e., h(x) = 0), and analyzed its asymptotic property by adapting more general results for the regression model with orthonormal basis functions by Hayasaka et al. (1996), in which classical extreme value theory was also used. Theorem 1 shows that the same result holds with probability one even if h(x) ≠ 0.

As shown in the example of trigonometric basis functions, the assumption λ > O(log N) in theorem 1 is satisfied in general. If it does not hold, equation 3.10 applies to all ĉ_j, including j ≤ s∗, in the case where the true coefficient values are constant in terms of N.

Figure 1 illustrates the probability density function of the asymptotic distribution of ĉ_j derived from theorem 1. The symmetry with respect to the origin and the double-peaked shape of the density in Figure 1b are similar to the histogram of estimated parameters in MLP obtained by Kitahara et al. (2001). Although the regression model with orthogonal basis functions gives a very different function representation from MLP, both density shapes suggest that theorem 1 may extend to other unidentifiable models with nonorthogonal basis functions.
Figure 1: Probability density function of the asymptotic distribution of ĉ_j for the regression model with orthogonal basis functions. (a) Equation 3.9 with σ∗² = 1 and λ = 1; the asymptotic normality of ĉ_j holds for j ≤ s∗. (b) Equation 3.10 with σ∗² = 1, λ = 1, j = s∗ + 1, and N = 16; a density with a double-peaked shape is obtained for ĉ_j, j > s∗.
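The double-peaked shape of Figure 1b is easy to reproduce by simulation. The following sketch (our code, not from the article) draws pure gaussian noise (h(x) = 0), computes ĉ^{(b_j)} for an orthogonal trigonometric basis via equation 3.8, and records the signed winning coefficient of equation 4.1; its histogram avoids zero and is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 16, 2000
x = 2 * np.pi * np.arange(1, N + 1) / N
ks = np.arange(1, N // 2)                     # candidate frequencies
Phi = np.sqrt(2) * np.cos(np.outer(x, ks))    # orthogonal columns, lambda = N

winners = np.empty(trials)
for t in range(trials):
    y = rng.standard_normal(N)                # gaussian noise only, h(x) = 0
    c = Phi.T @ y / N                         # c-hat^{(b_j)}, equation 3.8
    winners[t] = c[np.argmax(c ** 2)]         # signed LSE of the coefficient

# each c-hat^{(b_j)} ~ N(0, 1/N); the winner concentrates away from 0,
# giving the symmetric double-peaked density of Figure 1b
print(np.mean(np.abs(winners) > 1.0 / np.sqrt(N)))
```

Because the maximization over squared coefficients discards small values, the density of the winner vanishes near the origin, which is exactly the non-gaussian behavior described by equation 3.10.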
4 Asymptotic Distribution for the Regression Model with Step-Type Basis Functions

In this section, we deal with the simplest case: the number of basis functions is s = 1 and c_j∗ = 0 for all j, that is, h(x) = 0 and s∗ = 0. Under these assumptions, the optimal parameter values are not identifiable, independent of the type of model. Again, we set the basis parameter space to a finite set with N elements and \sum_{i=1}^{N} \phi^2(x_i; b_j) = \lambda for all j. In the case of nonorthogonal basis functions, the LSE ĉ_1 is obtained by maximizing the squared coefficient values provided by all basis parameters, as in the case of orthogonal basis functions:

\hat{c}_1 = \hat{c}^{(\hat{b}_1)},    (4.1)

\hat{b}_1 = \arg\max_{b_j} (\hat{c}^{(b_j)})^2, \qquad \hat{c}^{(b_j)} = \frac{1}{\lambda} \sum_{i=1}^{N} y_i \, \phi(x_i; b_j).    (4.2)
The LSE of the coefficient is equivalent to the maximum in a sequence of N random variables with the χ² distribution with 1 degree of freedom, though the sequence is dependent. Since classical extreme value theory cannot be applied directly to analyze the asymptotic distribution of the LSE, the problem becomes more difficult. First, we give a sufficient condition for ĉ_1 to have the same asymptotic distribution as in the orthogonal case.

Theorem 2. Assume that y_i ∼ N(0, σ∗²) for an appropriate choice of an ordered set of input values x_i ∈ ℝ, i = 1, …, N. If the regression model, equation 2.3 with s = 1, satisfies the condition

\lim_{N \to \infty} \sum_{l=1}^{N-1} \sup_{|j_1 - j_2| \ge l} \left\{ \frac{1}{\lambda} \sum_{i=1}^{N} \phi(x_i; b_{j_1})\, \phi(x_i; b_{j_2}) \right\}^2 < \infty,    (4.3)

then the LSE of the coefficient parameter has the asymptotic distribution 3.10.

Proof. From the proof of theorem 1, we need a sufficient condition under which the asymptotic distribution of the maximum among absolute values of dependent random variables converges to that for independent ones with N(0, 1). This can be shown under certain natural conditions on the dependency of the sequence. Deo (1972, theorem 1) shows the following:

Lemma 2. Let ξ_i, i = 1, …, N, be random variables with N(0, 1), and let M̄_N be the maximum among {|ξ_1|, …, |ξ_N|}. Let δ_l = \sup_{|j_1 - j_2| \ge l} |E[\xi_{j_1} \xi_{j_2}]|. If \sum_l \delta_l^2 < \infty, then α_N(M̄_N − β_N) converges in distribution to \exp(-e^{-x}), where α_N = \sqrt{2 \log N} and β_N = α_N − (\log\log N + \log π)/(2α_N).
We can show that \{E[\xi_{j_1} \xi_{j_2}]\}^2 = \mathrm{Corr}[\xi_{j_1}^2, \xi_{j_2}^2]. Then

\delta_l^2 = \sup_{|j_1 - j_2| \ge l} \left\{ \frac{1}{\lambda} \sum_{i=1}^{N} \phi(x_i; b_{j_1})\, \phi(x_i; b_{j_2}) \right\}^2,    (4.4)

so that the conclusion of lemma 2 holds for the maximum of |ĉ^{(b_j)}| if \sum_{l=1}^{N-1} \delta_l^2 < \infty as N → ∞. This ends the proof.
For example, consider the regression model with the basis functions

\phi^{(r)}(x_i; b) = \begin{cases} 1 & (b - r)^+ < i \le b \\ 0 & \text{otherwise}, \end{cases}    (4.5)

where the basis parameter b ∈ {1, …, N} and the hyperparameter r = 1, 2, …, with r ≪ N; (b − r)^+ stands for max(0, b − r). If r = 1, equation 4.5 is equivalent to the n-point-fitting function (Hagiwara et al., 2001). When φ^{(r)}(x_i; b) is normalized so that \sum_{i=1}^{N} \{\phi^{(r)}(x_i; b)\}^2 equals a constant λ for all b, it is easy to show that the assumption of theorem 2 is satisfied even if r > 1. In fact, the basis function of such a model can be constructed from two step-type basis functions,

\phi^{(r)}(x_i; b) = \sqrt{b}\, \phi_{\mathrm{step}}(x_i; b) - \sqrt{(b - r)^+}\, \phi_{\mathrm{step}}(x_i; b - r),    (4.6)

where

\phi_{\mathrm{step}}(x_i; b) = \begin{cases} 1/\sqrt{b} & i \le b \\ 0 & \text{otherwise}. \end{cases}    (4.7)
Equation 4.7 can be regarded as a limit of a sigmoidal basis function with respect to the basis parameter values, so we may say that the above example is provided by a subset of the function representation realized by MLP. It can be expected that the regression model with the basis function, equation 4.7, alone behaves even more like MLP. Unfortunately, theorem 2 does not hold in this case. However, equation 3.10 can still represent the asymptotic distribution of the coefficient parameter, as shown by the following theorem:

Theorem 3. Assume that y_i ∼ N(0, σ∗²) for an appropriate choice of an ordered set of input values x_i ∈ ℝ, i = 1, …, N. For the regression model with one basis function represented by equation 4.7, the LSE of the coefficient parameter has the asymptotic distribution, equation 3.10, with the constants λ = 1, \alpha_N = \sqrt{2 \log\log N}, and \beta_N = \alpha_N + (\log\log\log N - \log \pi)/(2\alpha_N).
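Theorem 3 concerns the maximum of normalized partial sums, max_{1≤j≤N} |Σ_{i≤j} y_i|/√j. A small Monte Carlo sketch (illustrative code of ours, with σ∗ = 1) shows that the typical size of this statistic is on the √(2 log log N) scale, far below the √(2 log N) scale that governs N independent maxima:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 100_000, 200
stats = np.empty(trials)
for t in range(trials):
    s = np.cumsum(rng.standard_normal(N))       # partial sums S_j
    j = np.arange(1, N + 1)
    stats[t] = np.max(np.abs(s) / np.sqrt(j))   # |c1-hat| of equation 4.9

scale_de = np.sqrt(2 * np.log(np.log(N)))       # Darling-Erdos scale, theorem 3
scale_iid = np.sqrt(2 * np.log(N))              # scale for independent maxima
print(round(np.mean(stats), 2), round(scale_de, 2), round(scale_iid, 2))
```

The strong dependence among the partial sums is what slows the growth from log N to log log N; convergence to the type 1 limit is correspondingly slow.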
Proof. Let

\hat{c}^{(j)} = \sum_{i=1}^{N} y_i \, \phi(x_i; j) = \frac{1}{\sqrt{j}} \sum_{i=1}^{j} y_i.    (4.8)

Then the LSE of the coefficient ĉ_1 is given by

\hat{c}_1 = \hat{c}^{(\hat{j})}, \qquad \hat{j} = \arg\max_{1 \le j \le N} (\hat{c}^{(j)})^2 = \arg\max_{1 \le j \le N} \left| \frac{1}{\sqrt{j}} \sum_{i=1}^{j} y_i \right|.    (4.9)

Then we have |\hat{c}_1| = \max_j |\sum_{i=1}^{j} y_i| / \sqrt{j}. Applying the Darling-Erdős theorem for sums of independent gaussian variables (Darling & Erdős, 1956), the following equation holds,

P\left\{ \alpha_N \left( \frac{1}{\sigma_*} \max_{1 \le j \le N} \frac{1}{\sqrt{j}} \left| \sum_{i=1}^{j} y_i \right| - \beta_N \right) \le x \right\} \to \exp(-e^{-x}), \quad \text{as } N \to \infty,    (4.10)
where \alpha_N = \sqrt{2 \log\log N} and \beta_N = \alpha_N + (\log\log\log N - \log \pi)/(2\alpha_N). Hence, we conclude theorem 3.

Compared to the result for orthogonal basis functions, the convergence rate of the tail probability is different, but the type of the limiting distribution is the same. From the Darling-Erdős theorem (Darling & Erdős, 1956), it is possible to generalize theorem 3 to cases with other distributions of the input and output samples. Notice again that the function representation of this regression model is regarded as the limit of MLP with respect to the basis parameters.

5 Discussion and Conclusion

We showed that asymptotic distributions different from the gaussian are obtained for the LSE of parameters in the context of regression estimation. By applying the theory of extremes in a gaussian sequence, we derived the asymptotic distributions for models with orthogonal basis functions (Hayasaka et al., 1996; Hagiwara et al., 2001). In this article, we treated such a result as a template for the stochastic properties of other unidentifiable models (Figure 1 shows it graphically) and extended it to the nonorthogonal case. Although the result cannot be applied directly to models provided by a subset of the parameter space of MLP, a similar interesting property was shown. The density of the asymptotic distribution derived here has two peaks and takes its minimum at the optimal value.
Such a density cannot be constructed within the family of LMG, which White (1989) regards as the asymptotic distribution in the case of unidentifiable models. We assumed that the basis parameter space is a finite set in order to study the asymptotic properties of MLP. The function representation defined by a linear combination of basis functions with basis parameters is called a dictionary representation, and stochastic models with abilities superior to linear models (e.g., projection pursuit regression, support vector machines) can be represented in this form (Cherkassky, Shao, Mulier, & Vapnik, 1999). We therefore cannot conclude that there is no relationship between MLP and the models derived by assuming a finite parameter space, because the latter provide some measure of dictionary representations for MLP.

We analyzed the asymptotic distribution of the LSE for the regression model with step-type basis functions as an example that gives a function representation close to MLP. In the typical case of unidentifiable optimal parameters, that is, when MLP is applied to samples that contain gaussian noise only, Kitahara et al. (2001) numerically showed that the optimal values of the basis parameters in a sigmoidal function tended to infinity and its shape became discontinuous, namely, step-type. Our numerical and theoretical results suggest that theorem 3 explains the property of the LSE for the coefficients of redundant basis functions in MLP. A number of restrictions on the models (e.g., the number of basis functions, the input dimensionality) will have to be relaxed in order to probe the actual situation.

Appendix: Asymptotic Distributions of Extremes

A main issue of extreme value theory is the asymptotic property of extremes (i.e., maxima and minima) of a stochastic sequence. Its basic importance is given by the extremal types theorem, which deals with the distribution of maxima in i.i.d. random sequences, as noted in the proof of the following lemma.

Lemma 3. Let ξ_i, i = 1, …, N, be random variables with N(0, 1) and let M_N^{(k)} be the kth largest maximum among {|ξ_1|, |ξ_2|, …, |ξ_N|}. Then

P\left\{ \alpha_N \left( M_N^{(k)} - \beta_N \right) \le x \right\} \to \exp(-e^{-x}) \sum_{n=0}^{k-1} \frac{e^{-nx}}{n!}, \quad \text{as } N \to \infty,    (A.1)

where \alpha_N = \sqrt{2 \log N} and \beta_N = \alpha_N - (\log\log N + \log \pi)/(2\alpha_N).
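Before the proof, lemma 3 is easy to check by simulation (our code, not from the article): draw the kth largest absolute value of N standard normals, normalize by α_N and β_N, and compare the empirical c.d.f. at one point with the limit in equation A.1.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)
N, k, trials = 5000, 2, 2000
mk = np.empty(trials)
for t in range(trials):
    x = np.abs(rng.standard_normal(N))
    mk[t] = np.partition(x, N - k)[N - k]          # kth largest of |xi_i|

aN = np.sqrt(2 * np.log(N))
bN = aN - (np.log(np.log(N)) + np.log(np.pi)) / (2 * aN)
z = aN * (mk - bN)

def limit_cdf(x, k):
    # exp(-e^{-x}) * sum_{n=0}^{k-1} e^{-nx} / n!   (equation A.1)
    return np.exp(-np.exp(-x)) * sum(np.exp(-n * x) / factorial(n) for n in range(k))

emp = np.mean(z <= 1.0)
print(round(emp, 3), round(limit_cdf(1.0, k), 3))  # empirically close
```

The agreement is already good at N = 5000, even though convergence of gaussian extremes is only of order 1/log N.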
Proof. Let F(x) be the common distribution function of |ξ_1|, …, |ξ_N|, and let S_N be the number of |ξ_1|, …, |ξ_N| such that |ξ_i| > u_N for a level u_N. Then S_N is a random variable whose distribution is binomial:

P\{S_N \le k\} = \sum_{n=0}^{k} \binom{N}{n} F^{N-n}(u_N) (1 - F(u_N))^n, \quad k = 0, 1, 2, \ldots.    (A.2)
We have

P\{M_N^{(1)} \le u_N\} = P\{|\xi_1| \le u_N, |\xi_2| \le u_N, \ldots, |\xi_N| \le u_N\}.    (A.3)

Equation A.3 shows that the event {M_N^{(1)} ≤ u_N} is equivalent to none of |ξ_1|, …, |ξ_N| exceeding the level u_N, namely P{S_N = 0}. In the same way, the following holds:

P\{M_N^{(k)} \le u_N\} = P\{S_N < k\} = \sum_{n=0}^{k-1} \binom{N}{n} F^{N-n}(u_N) (1 - F(u_N))^n, \quad k = 1, 2, \ldots.    (A.4)
If there exists u_N such that N(1 − F(u_N)) → τ (τ a constant in terms of N) as N → ∞, the distribution of S_N converges to Poisson,

P\{S_N < k\} \to e^{-\tau} \sum_{n=0}^{k-1} \frac{\tau^n}{n!}, \quad k = 1, 2, \ldots,    (A.5)

as N → ∞. N(1 − F(u_N)) → τ as N → ∞ implies F^N(u_N) → e^{-\tau} as N → ∞. If there exists a distribution function G such that F^N(u_N) → G(x) as N → ∞, where u_N = \alpha_N^{-1} x + \beta_N for some constants α_N > 0 and β_N, we have

P\{\alpha_N (M_N^{(k)} - \beta_N) \le x\} = P\{S_N < k\} \to G(x) \sum_{n=0}^{k-1} \frac{(-\log G(x))^n}{n!}, \quad k = 1, 2, \ldots,    (A.6)
as N → ∞. It is known that G(x) has one of the following three forms (extremal types theorem, theorem 1.2.1 in Leadbetter & Rootzén, 1988):

I: G(x) = \exp(-e^{-x}),    (A.7)

II: G(x) = \exp(-x^{-a}), \quad x > 0,    (A.8)

III: G(x) = \exp(-(-x)^{a}), \quad x \le 0,    (A.9)
where a is a shape parameter. Now we seek to determine which of the three types of limit distributions applies for the random variables |ξi | with the distribution F(x). A useful condition has been obtained as follows:
Lemma 4 (de Haan, 1976, theorem 5). Suppose that the distribution F(x) of an i.i.d. random sequence is absolutely continuous with a density function f ≡ F′, and f has a derivative f′ ≡ F″ such that f′(x) < 0 for all x in (x_0, ∞). If

\lim_{t \to \infty} \frac{f'(t)(1 - F(t))}{f^2(t)} = -1,    (A.10)

the asymptotic distribution of the maximum of the sequence is type 1 in the extremal types theorem.

Let Φ(x) and φ(x) be the probability distribution function and the density function of ξ_i, respectively. Then F(x) is given by the following:

F(x) = \int_{-x}^{x} \phi(t)\, dt = \int_{-\infty}^{x} \phi(t)\, dt - \int_{-\infty}^{-x} \phi(t)\, dt = \Phi(x) - \Phi(-x) = 2\Phi(x) - 1.    (A.11)
The density function f(x) of F(x) and its derivative f′(x) are as follows:

f(x) = 2\phi(x),    (A.12)

f'(x) = -2x\phi(x).    (A.13)

Since 1 − Φ(u) ∼ φ(u)/u as u → ∞,

\frac{f'(x)(1 - F(x))}{f^2(x)} = \frac{-2x\phi(x)(2 - 2\Phi(x))}{4\phi^2(x)} = -\frac{x}{\phi(x)}(1 - \Phi(x)) \sim -\frac{x}{\phi(x)} \cdot \frac{\phi(x)}{x} = -1,    (A.14)

as x → ∞. From lemma 4, G(x) is given by equation A.7. The normalizing constants α_N and β_N can be obtained by a process similar to example 2 in Resnick (1987). Equation A.1 is valid for k ≥ 2 by substituting equation A.7 into equation A.6.

Acknowledgments

We wish to thank Katsuyuki Hagiwara at Nagoya Institute of Technology and Naohiro Toda at Aichi Prefectural University for their useful comments.
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Contr., AC-19, 716–723.
Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323.
Cherkassky, V., Shao, X., Mulier, F. M., & Vapnik, V. (1999). Model complexity control for regression using VC generalization bounds. IEEE Trans. Neural Networks, 10, 1075–1089.
Darling, D. A., & Erdős, P. (1956). A limit theorem for the maximum of normalized sums of independent random variables. Duke Math. J., 23, 143–155.
de Haan, L. (1976). Sample extremes: An elementary introduction. Statistica Neerlandica, 30, 161–172.
Deo, C. M. (1972). Some limit theorems for maxima of absolute values of gaussian sequences. Sankhyā, A, 34, 289–292.
Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9, 871–879.
Hagiwara, K. (2002). On the problem in model selection of neural network regression in an overrealizable scenario. Neural Computation, 14, 1979–2002.
Hagiwara, K., Hayasaka, T., Toda, N., Usui, S., & Kuno, K. (2001). Upper bound of the expected training error of neural network regression for a gaussian noise sequence. Neural Networks, 14, 1419–1429.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. In Proceedings of the International Joint Conference on Neural Networks III (pp. 2263–2266). New York: IEEE.
Hayasaka, T., Toda, N., Usui, S., & Hagiwara, K. (1996). On the least square error and prediction square error of function representation with discrete variable basis. In Proceedings of the 1996 IEEE Signal Processing Society Workshop (pp. 72–81). New York: IEEE.
Kitahara, M., Hayasaka, T., Toda, N., & Usui, S. (2001). On the probability distribution of estimators of regression model using 3-layered neural networks. In Proceedings of the 2001 International Symposium on Nonlinear Theory and Its Applications (Vol. 2, pp. 533–536). Zao, Miyagi, Japan.
Leadbetter, M. R., & Rootzén, H. (1988). Extremal theory for stochastic processes. Ann. Prob., 16, 431–478.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Phillips, P. C. B. (1989). Partially identified econometric models. Econometric Theory, 5, 181–240.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Resnick, S. I. (1987). Extreme values, regular variation, and point processes. Berlin: Springer-Verlag.
Watanabe, S. (2001). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13, 899–933.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.

Received November 6, 2002; accepted June 3, 2003.
Communicated by John Platt
LETTER
Asymptotic Properties of the Fisher Kernel

Koji Tsuda
[email protected]
Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany, and AIST Computational Biology Research Center, Koto-ku, Tokyo, 135-0064, Japan

Shotaro Akaho
[email protected]
AIST Neuroscience Research Institute, Tsukuba, 305-8568, Japan

Motoaki Kawanabe
nabe@first.fhg.de
Fraunhofer FIRST, 12489 Berlin, Germany

Klaus-Robert Müller
klaus@first.fhg.de
Fraunhofer FIRST, 12489 Berlin, Germany, and University of Potsdam, 14482 Potsdam, Germany
This letter analyzes the Fisher kernel from a statistical point of view. The Fisher kernel is a particularly interesting method for constructing a model of the posterior probability that makes intelligent use of unlabeled data (i.e., of the underlying data density). It is important to analyze and ultimately understand the statistical properties of the Fisher kernel. To this end, we first establish sufficient conditions that the constructed posterior model is realizable (i.e., it contains the true distribution). Realizability immediately leads to consistency results. Subsequently, we focus on an asymptotic analysis of the generalization error, which elucidates the learning curves of the Fisher kernel and how unlabeled data contribute to learning. We also point out that the squared or log loss is theoretically preferable—because both yield consistent estimators—to other losses such as the exponential loss, when a linear classifier is used together with the Fisher kernel. Therefore, this letter underlines that the Fisher kernel should be viewed not as a heuristic but as a powerful statistical tool with well-controlled statistical properties.

Neural Computation 16, 115–137 (2004)
© 2003 Massachusetts Institute of Technology

1 Introduction

Recently, the Fisher kernel (Jaakkola & Haussler, 1999) has been successfully applied as a feature extractor in supervised classification (Jaakkola & Haussler, 1999; Tsuda, Kawanabe, Rätsch, Sonnenburg, & Müller, 2002; Sonnenburg, Rätsch, Jagota, & Müller, 2002; Smith & Gales, 2002; Vinokourov & Girolami, 2002). The original intuition (Jaakkola & Haussler, 1999) for the Fisher kernel was to construct a probabilistic model of the data in order to induce a metric for a subsequent discriminative training. Two problems could be addressed simultaneously. First, it became possible to compare "apples and oranges"—the Fisher kernel approach measures distances in the space of the respective probabilistic model parameters. So, for example, a DNA sequence of length, say, 100 and another one of length 1000 can easily be compared by using a representation in the parameter space of the respective hidden Markov models (HMM). Thus, the Fisher kernel is very much in contrast to alignment methods that compare directly, essentially using dynamic programming techniques (e.g., Gotoh, 1982). A second feature of the Fisher kernel is that it allows incorporating prior knowledge about the data distribution into the classification process in a highly principled manner. In the practical use of support vector machines (SVM) (e.g., Vapnik, 1998; Cristianini & Shawe-Taylor, 2000; Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Schölkopf & Smola, 2002), where the choice of the kernel is of crucial importance, either the kernel can be engineered using all available prior knowledge (e.g., Zien et al., 2000), or it can be derived as the Fisher kernel from a probabilistic model (e.g., Jaakkola & Haussler, 1999; Tsuda et al., 2002; Sonnenburg et al., 2002; Smith & Gales, 2002).¹

In spite of its practical success, a theoretical analysis of the Fisher kernel has not been sufficiently explored so far, the exceptions being Jaakkola, Meila, and Jebara (1999), Tsuda and Kawanabe (2002), Seeger (2002), and Tsuda, Kawanabe, and Müller (in press). For example, Jaakkola et al.
(1999) showed how to determine the prior distribution of parameters to recover the Fisher kernel in the framework of maximum entropy discrimination. And Seeger (2002) pointed out that the Fisher kernel can be perceived as an approximation of the mutual information kernel.

This article presents theoretical results from a statistical point of view. In particular, we perceive the Fisher kernel as a method of constructing a model of the posterior probability of the class labels. The Fisher kernel can be derived as follows: Let X denote the domain of objects, which can be discrete or continuous. Let us assume that a probabilistic model q(x | θ), x ∈ X, θ ∈ ℝ^d, is available, with θ̂ an estimate of its parameters. Each object is mapped to the Fisher score,

f_{\hat{\theta}}(x) = \left( \frac{\partial \log q(x \mid \hat{\theta})}{\partial \theta_1}, \ldots, \frac{\partial \log q(x \mid \hat{\theta})}{\partial \theta_d} \right)^{\!\top}.    (1.1)

¹ Of course, a brute force search over all possible kernels can be pursued using cross-validation procedures or bounds from learning theory to select the "best" kernel (cf. Müller et al., 2001).
The Fisher kernel refers to the inner product in this space. When used in supervised classification, the Fisher kernel is commonly combined with a linear classifier such as an SVM (Vapnik, 1998), where a linear function is trained to discriminate two classes. Since the Fisher kernel can efficiently make use of prior knowledge about the marginal distribution p(x) (which can be estimated rather well using unlabeled samples), it is especially attractive in vision, text classification, and bioinformatics, where we can expect a lot of unlabeled samples (Zhang & Oles, 2000; Seeger, 2001).

First, we will show sufficient conditions that the obtained posterior model is realizable—that it contains the true posterior distribution—which then immediately leads to consistency. Once realizability is ensured, we can evaluate the expected generalization error in large sample situations by means of asymptotic statistics (Barndorff-Nielsen & Cox, 1989). This enables us to elucidate learning curves and how unlabeled samples contribute to reducing the generalization error. In addition, we point out that when a linear classifier is combined with the Fisher kernel, the log loss and the squared loss are theoretically preferable to other loss functions. This result recommends using a classifier based on the log loss or the squared loss.

2 Realizability Conditions

Let y ∈ {+1, −1} be the set of class labels. Denote by p(x), P(y | x), and p(x, y) the true underlying marginal, posterior, and joint distributions, respectively. Let ∂_α f = ∂f/∂α, ∇_θ f = (∂_{θ_1} f, …, ∂_{θ_d} f)^⊤, and let ∇²_θ f denote the d × d Hessian matrix whose (i, j)th element is ∂²f/(∂θ_i ∂θ_j). For statistical learning, we construct a model of the posterior probability P(y | x) out of the Fisher score, equation 1.1. The posterior probability is described by a linear function followed by an activation function h,
(2.1) > > , and
where w 2 ; b; µ / h is a linear activation function:2 h.t/ D
1 1 tC : 2 2
(2.2)
In the following, we will investigate the conditions under which Q(y | x, η) is realizable—that is, there is a parameter value η∗ such that Q(y | x, η∗) = P(y | x).

2.1 Core Model. First, a trivial example is shown to give a realizable model. Denote by q_0(x | α) a mixture model of the true class distributions,

q_0(x \mid \alpha) = \alpha\, p(x \mid y = +1) + (1 - \alpha)\, p(x \mid y = -1), \quad \alpha \in [0, 1],    (2.3)

² Compared with sigmoid functions, the linear activation function is not so common in the literature. However, it allows us to perform statistical analysis, as we show later.
which we call the core model. This model realizes the true marginal distribution p(x) when α = p(y = +1) := α∗.

Lemma 1. When the Fisher score is determined as f_{α∗}(x) = ∂_α log q_0(x | α∗), the posterior model, 2.1, is realizable.

Proof. The posterior model Q(y | x, η) is realizable if there is a parameter value η∗ such that

Q(y \mid x, \eta^*) = P(y \mid x), \quad \forall x \in \mathcal{X},\; y \in \{+1, -1\}.    (2.4)

Substituting equations 2.1 and 2.2, equation 2.4 holds if and only if

(w^*)^\top f_{\theta^*}(x) + b^* = P(y = +1 \mid x) - P(y = -1 \mid x).    (2.5)

To prove the lemma for q_0, it is sufficient to show the existence of w, b ∈ ℝ such that

w\, \partial_\alpha \log q_0(x \mid \alpha^*) + b = P(y = +1 \mid x) - P(y = -1 \mid x).    (2.6)

The Fisher score for q_0(x | α∗) can be written as

\partial_\alpha \log q_0(x \mid \alpha^*) = \frac{P(y = +1 \mid x)}{\alpha^*} - \frac{P(y = -1 \mid x)}{1 - \alpha^*}.

When w = 2α∗(1 − α∗) and b = 2α∗ − 1, equation 2.6 holds.

2.2 Deriving Realizability Conditions. Since we do not know the true class distributions p(x | y), the core model q_0(x | α) in lemma 1 is never available. In the following, the result of lemma 1 is therefore relaxed to a more general class of probability models. Denote by M the set of probability distributions M = {q_0 | q_0(x | α), α ∈ [0, 1]}. According to information geometry (Amari & Nagaoka, 2001), M is regarded as a manifold in a Riemannian space. Let Q denote the manifold of q(x | θ): Q = {q | q(x | θ), θ ∈ ℝ^d}.
Theorem 1. Assume that the true distribution p(x) is contained in Q,

p(x) = q(x \mid \theta^*) = q_0(x \mid \alpha^*), \quad x \in \mathcal{X},

where θ∗ is the true parameter. If the tangent space of Q at p(x) contains the tangent space of M at the same point (see Figure 1), then the Fisher score f derived from q(x | θ∗) gives a realizable posterior model, equation 2.1.
Figure 1: Information-geometric picture of a probabilistic model whose Fisher kernel leads to a realizable posterior model. The important point is that the tangent space of manifold M is contained in that of manifold Q. Details are explained in the text.
Proof. To prove the theorem, it is sufficient to show the existence of w ∈ ℝ^d and b ∈ ℝ such that

w^\top \nabla_\theta \log q(x \mid \theta^*) + b = P(y = +1 \mid x) - P(y = -1 \mid x).    (2.7)

When the tangent space of M is contained in that of Q around p(x), we have the following by the chain rule:

\frac{\partial \log q_0(x \mid \alpha^*)}{\partial \alpha} = \sum_{j=1}^{d} \frac{\partial \log q(x \mid \theta^*)}{\partial \theta_j} \left. \frac{\partial \theta_j}{\partial \alpha} \right|_{\alpha = \alpha^*}.    (2.8)

Let u be the d-dimensional vector whose jth element is

u_j = \left. \frac{\partial \theta_j}{\partial \alpha} \right|_{\alpha = \alpha^*}.

Then equation 2.7 holds when

w = 2\alpha^*(1 - \alpha^*)\, u, \qquad b = 2\alpha^* - 1.
This theorem indicates that realizability depends on local geometry of manifold Q around the true distribution. In order to have a good posterior model, we have to ensure realizability while keeping the number of parameters small. To this end, we should make the manifold Q as low dimensional
as possible, while capturing the tangent space of M. If Q completely contains M, the realizability condition is satisfied. One example of this case was shown by Tsuda et al. (in press), where each class distribution is a mixture of shared gaussian components.

Remark. A classifier is called Bayes optimal if it achieves the Bayes error in the limit that the number of samples goes to infinity (Devroye, Györfi, & Lugosi, 1996). Realizability is only a sufficient condition for Bayes optimality. It would be an interesting research topic to derive the conditions for Bayes optimality as well.

3 Consistency Results

Denote by z_n = {x_i, y_i}_{i=1}^n the set of n independently and identically distributed (i.i.d.) labeled samples drawn from p(x, y). Denote by x^u_m = {x^u_j}_{j=1}^m the set of m unlabeled i.i.d. samples drawn from p(x). In learning with the Fisher kernel from these samples, the learning procedure is typically separated into two steps (e.g., Jaakkola & Haussler, 1999). First, θ is obtained as
θ̂ = argmax_θ Σ_{i=1}^n log q(x_i | θ) + Σ_{j=1}^m log q(x_j^u | θ).  (3.1)
Then, in the second step, w and b are obtained as

[ŵ, b̂] = argmax_{w,b} Σ_{i=1}^n log h(y_i [w^⊤ f_θ̂(x_i) + b]).  (3.2)
The second step maximizes the conditional likelihood Q(y | x, η). Let ℓ(y, y′) denote a loss function. Then equation 3.2 is generalized as

[ŵ, b̂] = argmin_{w,b} Σ_{i=1}^n ℓ(w^⊤ f_θ̂(x_i) + b, y_i),  (3.3)

where the loss function in equation 3.2 corresponds to

ℓ(y, y′) = −log h(yy′).  (3.4)
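The two steps can be made concrete with a population-level sketch in which the integrals are replaced by grid sums, so there is no sampling noise. Everything here is an illustrative assumption rather than the paper's setup: the marginal model is a two-gaussian mixture with mixing weight θ = α, the true weight is 0.3, and step 2 uses the squared loss of equation 3.9 below, whose minimizer is available in closed form:

```python
import math

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def marginal(x, a):
    # assumed marginal model q(x | alpha): mixture of N(1,1) and N(-1,1)
    return a * gauss(x, 1.0) + (1 - a) * gauss(x, -1.0)

TRUE_ALPHA = 0.3
xs = [-8.0 + 0.01 * i for i in range(1601)]
dx = 0.01

# Step 1 (equation 3.1): maximize the expected marginal log-likelihood on a grid.
def expected_loglik(a):
    return sum(marginal(x, TRUE_ALPHA) * math.log(marginal(x, a)) for x in xs) * dx

alpha_hat = max([0.05 + 0.01 * k for k in range(91)], key=expected_loglik)

# Step 2 (equation 3.3 with the squared loss): weighted least squares of y
# on the Fisher score, solved via the 2x2 normal equations.
def score(x):
    return (gauss(x, 1.0) - gauss(x, -1.0)) / marginal(x, alpha_hat)

sff = sf = s1 = syf = sy = 0.0
for x in xs:
    for y, pxy in ((1, TRUE_ALPHA * gauss(x, 1.0)),
                   (-1, (1 - TRUE_ALPHA) * gauss(x, -1.0))):
        f = score(x)
        sff += pxy * f * f * dx; sf += pxy * f * dx; s1 += pxy * dx
        syf += y * pxy * f * dx; sy += y * pxy * dx

det = sff * s1 - sf * sf
w = (syf * s1 - sy * sf) / det
b = (sff * sy - sf * syf) / det

assert abs(alpha_hat - TRUE_ALPHA) < 0.02
# theorem 1 predicts w = 2a(1-a) and b = 2a - 1 in this realizable toy case
assert abs(w - 2 * TRUE_ALPHA * (1 - TRUE_ALPHA)) < 0.05
assert abs(b - (2 * TRUE_ALPHA - 1)) < 0.05
```

Because the toy model is realizable, the fitted linear function of the Fisher score recovers the posterior difference, consistent with lemmas 2 and 3 below.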
First, we prove that consistency is ensured for the log loss (see equation 3.4); that is, in the limit that n goes to infinity, the estimator η̂ converges to the true one, η*. Further, it will be shown that consistency can be proved for the squared loss, which has practical advantages.

Lemma 2. Assume that q(x | θ) satisfies the realizability conditions in theorem 1. The two-step estimator with the log loss (see equation 3.4) is consistent.
Proof. In the two-step scheme, θ is estimated separately by maximum likelihood 3.1, so obviously θ̂ converges to θ*. When we have infinite samples, equation 3.3 is written as

[w₊, b₊] = argmin_{w,b} Σ_{j∈{1,−1}} P(y = j) ∫ ℓ(w^⊤ f_θ*(x) + b, j) p(x | y = j) dx.
Therefore, we should prove w₊ = w* and b₊ = b*. In other words, this problem is rewritten as a constrained variational problem (Gelfand & Fomin, 1963), where we find a function g: X → ℝ that minimizes the following functional,

L(g) = Σ_{j∈{1,−1}} P(y = j) ∫ ℓ(g(x), j) p(x | y = j) dx,  (3.5)

subject to the constraint g ∈ G, where

G = {g | g(x) = w^⊤ f_θ*(x) + b, w ∈ ℝ^d, b ∈ ℝ}.  (3.6)
If the optimal solution of the variational problem without the above constraint is eventually contained in G, it is the solution of the constrained problem 3.5 as well. So let us consider the unconstrained problem first. When the log loss is substituted into equation 3.5, we have

L(g) = −Σ_{j∈{1,−1}} P(y = j) ∫ log{ (jg(x) + 1)/2 } p(x | y = j) dx.
The variation of L with respect to the small increment of g is written as

δL = −Σ_{j∈{1,−1}} P(y = j) ∫ [ j / (jg(x) + 1) ] δg · p(x | y = j) dx.
In order that g is a minimum, it is necessary that δL = 0 holds for any δg; thus, we have

1/(1 + g(x)) · p(x, y = 1) − 1/(1 − g(x)) · p(x, y = −1) = 0,  (3.7)

⟺ ĝ(x) = P(y = 1 | x) − P(y = −1 | x).  (3.8)
Since realizability is ensured by assumption, Q(y | x, η*) = P(y | x). According to equations 2.1 and 2.2, it holds that

(w*)^⊤ f_θ*(x) + b* = P(y = +1 | x) − P(y = −1 | x).
Thus, ĝ is contained in G, and the true parameters are obtained by solving the constrained problem, equation 3.5. Note that lemma 2 holds even if m = 0.

Here, ŵ and b̂ cannot be obtained in closed form for the log loss. This turns out to be possible for the squared loss,

ℓ(y, y′) = (y − y′)²,  (3.9)

and as we will see in the following, consistency is ensured in this case as well.

Lemma 3. The two-step estimator with the squared loss 3.9 is consistent.
Proof. When the squared loss is substituted into equation 3.5, we have

L(g) = Σ_{j∈{1,−1}} P(y = j) ∫ (g(x) − j)² p(x | y = j) dx.  (3.10)

In this case, the variational equation 3.7 becomes

g(x)p(x) − Σ_{j∈{1,−1}} j p(x, y = j) = 0,

which is solved as

ĝ(x) = [p(x, y = +1) − p(x, y = −1)] / p(x) = P(y = +1 | x) − P(y = −1 | x).  (3.11)
Since this solution is the same as equation 3.8, this lemma is proved by following the same procedure as lemma 2.

Interestingly, one cannot ensure consistency for general loss functions. For example, when we use the exponential loss

ℓ(y, y′) = exp(−yy′/2)

or the logistic loss

ℓ(y, y′) = log(1 + exp(−yy′)),

the unconstrained variational solution is obtained as follows (Eguchi & Copas, 2001):

ĝ(x) = log P(y = +1 | x) − log P(y = −1 | x).
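This difference between variational solutions is easy to verify pointwise. The sketch below fixes hypothetical posteriors P(y = +1 | x) = 0.8 and P(y = −1 | x) = 0.2 at a single x and minimizes the conditional expected exponential loss by ternary search; the minimizer is the log-odds, not the bounded posterior difference obtained with the log or squared loss:

```python
import math

p_plus, p_minus = 0.8, 0.2  # assumed posteriors at one fixed x

def exp_risk(g):
    # conditional expected exponential loss E[exp(-y g / 2) | x]
    return p_plus * math.exp(-0.5 * g) + p_minus * math.exp(0.5 * g)

lo, hi = -10.0, 10.0        # ternary search for the pointwise minimizer
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if exp_risk(m1) < exp_risk(m2):
        hi = m2
    else:
        lo = m1
g_exp = (lo + hi) / 2

assert abs(g_exp - math.log(p_plus / p_minus)) < 1e-6  # Eguchi & Copas solution
assert abs(g_exp - (p_plus - p_minus)) > 0.5           # not the posterior difference
```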
Since this solution may not be included in G, such losses do not necessarily achieve consistency.³ The squared loss is the appropriate choice for the Fisher kernel: from the theoretical viewpoint, it achieves consistency, and from the practical point of view, the solution can be obtained analytically in closed form.

³ As discussed in the remark in section 2, the lack of consistency does not necessarily mean that such losses are not Bayes optimal. Further analyses are needed to clarify this point.

4 Generalization Errors

In this section, the generalization error of Fisher kernel classifiers is investigated. Specifically, we will study the behavior of the generalization error when the number of training samples is sufficiently large. Such an analysis is often called learning curve analysis, where a learning curve describes the relation of the generalization error against the number of training samples (Baum & Haussler, 1989; Amari & Murata, 1993; Müller, Finke, Schulten, Murata, & Amari, 1996; Haussler, Kearns, Seung, & Tishby, 1996; Malzahn & Opper, 2002). Studies about learning curves have been playing an important role in elucidating the behavior of learning machines. Research in asymptotic statistics (e.g., Cox & Hinkley, 1974; Barndorff-Nielsen & Cox, 1989; Amari & Murata, 1993; Müller et al., 1996; van der Vaart, 1998; Amari & Nagaoka, 2001) and statistical mechanics approaches (e.g., Seung, Sompolinsky, & Tishby, 1992; Watkin, Rau, & Biehl, 1993; Haussler et al., 1996; Malzahn & Opper, 2002) has contributed to the study of generalization errors, apart from bounds derived in a statistical learning theory framework (e.g., Devroye et al., 1996; Vapnik, 1998). In this section, we will adopt asymptotic statistical techniques following Barndorff-Nielsen and Cox (1989).

The generalization error is defined as R(η) := E_{x,y}[r(x, y, η)] with a risk function r(x, y, η). Here, E_{x,y}[·] denotes the expectation with respect to p(x, y). In the following, the risk function is determined as the Kullback-Leibler divergence:

R(η) = E_{x,y}[ log p(x, y)/q(x, y | η) ] = Σ_{y∈{−1,1}} ∫ p(x, y) log [ p(x, y)/q(x, y | η) ] dx.  (4.1)

We will study the asymptotic generalization error of the two-step estimator typically used in the context of the Fisher kernel: (1) estimating the parameter of the marginal model using equation 3.1 and (2) fixing these parameters and estimating the parameters of the linear model in equation 3.2. The Cramér-Rao bound (Barndorff-Nielsen & Cox, 1989) effectively determines the theoretical limit of learning. We will especially elucidate how
the generalization error is reduced to the theoretical limit as the number of unlabeled samples increases.

4.1 Asymptotics of M-Estimators. Before getting into the details, we briefly review how to derive the generalization error of a general M-estimator (Barndorff-Nielsen & Cox, 1989). An M-estimator is calculated from an equation like

v(z^n, x_u^m, η̂) = 0,  (4.2)

where z^n is the labeled data of size n, x_u^m is the unlabeled data of size m, η is an s-dimensional parameter, and v is an s-dimensional vector-valued function (i.e., an estimating function). The function v is assumed to satisfy the unbiasedness condition,

E[v(z^n, x_u^m, η*)] = 0,  (4.3)

and other regularity conditions that guarantee the consistency of the estimator η̂. Here E[·] denotes the expectation with respect to training samples (both z^n and x_u^m). When n goes to infinity (r = m/n is fixed), we have

(1/n) ∇_η v(z^n, x_u^m, η*) → Γ, in probability,  (4.4)
(1/√n) v(z^n, x_u^m, η*) → N(0, Λ), in distribution,  (4.5)

where

Γ = lim_{n→∞} (1/n) E[∇_η v(z^n, x_u^m, η*)],  (4.6)
Λ = lim_{n→∞} (1/n) E[v(z^n, x_u^m, η*){v(z^n, x_u^m, η*)}^⊤].  (4.7)

We calculate the asymptotic distribution of the M-estimator η̂. From the estimating equation and equation 4.4, we have

0 = (1/n) v(z^n, x_u^m, η*) + (1/n) ∇_η v(z^n, x_u^m, η*)(η̂ − η*) + O_p(‖η̂ − η*‖²)
  = (1/√n) ζ + Γ(η̂ − η*) + O_p(n^{−1}),

where ζ = v(z^n, x_u^m, η*)/√n. Therefore, the estimator η̂ can be approximated as

√n (η̂ − η*) = −Γ^{−1} ζ + O_p(n^{−1/2}),  (4.8)

and it is asymptotically gaussian distributed,

√n (η̂ − η*) ∼ N(0, Γ^{−1} Λ Γ^{−⊤}),  (4.9)

where Γ^{−⊤} = (Γ^⊤)^{−1}.
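As a degenerate scalar illustration (our own toy, not from the text), take the sample mean of Bernoulli(p*) data, whose estimating function is v = Σᵢ (xᵢ − p). Both Γ and Λ are then scalars, and the sandwich covariance of equation 4.9 reduces to the familiar variance p*(1 − p*), which also equals the inverse Fisher information:

```python
p_star = 0.3                       # hypothetical true Bernoulli parameter

# v(z^n; p) = sum_i (x_i - p); its derivative with respect to p is -n
Gamma = -1.0                       # lim (1/n) E[dv/dp]
Lambda = p_star * (1 - p_star)     # lim (1/n) E[v(p*)^2] = Var(x)

sandwich = (1 / Gamma) * Lambda * (1 / Gamma)  # Gamma^{-1} Lambda Gamma^{-T}
fisher_info = 1 / (p_star * (1 - p_star))      # Bernoulli Fisher information

# the sample mean is an efficient M-estimator: sandwich variance = 1 / Fisher info
assert abs(sandwich - 1 / fisher_info) < 1e-12
```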
4.2 Asymptotic Expansion of the Generalization Errors. Next, let us consider the asymptotic expansion of generalization error 4.1. By Taylor expansion, we can calculate the expectation of R(η̂) over labeled and unlabeled samples z^n and x_u^m,

E[R(η̂)] = R(η*) + ∇_η^⊤ R(η*) E[η̂ − η*]
  + (1/2) tr{∇²_η R(η*) E[(η̂ − η*)(η̂ − η*)^⊤]} + O(n^{−3/2})
  = R(η*) + (1/2n) tr{∇²_η R(η*) Γ^{−1} Λ Γ^{−⊤}} + O(n^{−3/2}).  (4.10)

When we adopt the KL divergence, equation 4.1, it turns out that R(η*) = 0, and the Hessian is equal to the Fisher information matrix (Barndorff-Nielsen & Cox, 1989),

∇²_η R(η*) = −E_{x,y}[∇²_η log q(x, y | η*)] = G,

where

G = E_{x,y}[∇_η log q(x, y | η*) ∇_η^⊤ log q(x, y | η*)].

Therefore, the generalization error is described as

E[R(η̂)] = (1/2n) tr{G Γ^{−1} Λ Γ^{−⊤}} + O(n^{−3/2}).  (4.11)
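The two facts used here, R(η*) = 0 and ∇²_η R(η*) = G, can be checked by finite differences in a one-parameter toy model. The Bernoulli family below is our illustrative choice, not the paper's:

```python
import math

p_star = 0.3  # hypothetical true parameter of a Bernoulli(p) model

def kl_risk(p):
    # R(p) = KL( Bern(p*) || Bern(p) ), the risk of equation 4.1 for this model
    return (p_star * math.log(p_star / p)
            + (1 - p_star) * math.log((1 - p_star) / (1 - p)))

h = 1e-4
hessian = (kl_risk(p_star + h) - 2 * kl_risk(p_star) + kl_risk(p_star - h)) / h ** 2
fisher = 1 / (p_star * (1 - p_star))  # Bernoulli Fisher information

assert kl_risk(p_star) == 0.0         # the risk vanishes at the truth
assert abs(hessian - fisher) < 1e-3   # and its Hessian is the Fisher information
```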
Notice that the derivation of the generalization error 4.11 relies heavily on the regularity conditions governing the limiting properties of M-estimators (Barndorff-Nielsen & Cox, 1989). When the regularity conditions do not hold, we need different mathematical machinery to analyze the generalization error (Watanabe, 2001).

4.3 Generalization Error of the Two-Step Estimator. The two-step estimator 3.1 and 3.2 is regarded as a special case of M-estimators, where the estimating function is

v_θ(z^n, x_u^m, θ̂) = Σ_{i=1}^n ∇_θ log q(x_i | θ̂) + Σ_{j=1}^m ∇_θ log q(x_j^u | θ̂) = 0,  (4.12)
v_ξ(z^n, x_u^m, ξ̂, θ̂) = Σ_{i=1}^n ∇_ξ log h{y_i (ŵ^⊤ f_θ̂(x_i) + b̂)} = 0,  (4.13)

where we will write ξ = (w^⊤, b)^⊤ for convenience. Following the general approach presented in section 4.1, the generalization error can be derived. To this end, let us define important notations first. Let us decompose the Fisher information matrix G as

G = ( G_ξξ, G_ξθ ; G_θξ, G_θθ ),
where G_ξξ, G_ξθ, G_θθ are the matrices of size (d + 1) × (d + 1), (d + 1) × d, and d × d, respectively, and G_θξ = G_ξθ^⊤. Then its inverse is written as

G^{−1} = ( S_ξξ, S_ξθ ; S_θξ, S_θθ ),

where S_θθ = (G_θθ − G_θξ G_ξξ^{−1} G_ξθ)^{−1} (others not shown for brevity). From these submatrices, we define the effective Fisher information (Kawanabe & Amari, 1994) as

G^E_θθ := S_θθ^{−1} = G_θθ − G_θξ G_ξξ^{−1} G_ξθ,  (4.14)
which is the net information of θ after subtracting the amount shared with the other parameter ξ. We also define

U_θθ = E_x[∇_θ log q(x | θ*) ∇_θ log q(x | θ*)^⊤].

Then the generalization error is derived as follows:

Theorem 2. The generalization error of the two-step estimator is

E[R(η̂)] = (1/2n) { d + 1 + (1/(1+r)) tr(G^E_θθ U_θθ^{−1}) } + O(n^{−3/2}).  (4.15)
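Equation 4.15 is straightforward to evaluate. In the sketch below, d, n, and the value of tr(G^E_θθ U_θθ^{−1}) are made-up numbers; the point is only that the coefficient decreases monotonically in r toward the labeled-only floor (d + 1)/(2n):

```python
d = 1             # assumed dimension of theta
trace_term = 3.0  # assumed tr(G^E U^{-1}); it is at least d since G^E > U
n = 100           # number of labeled samples

def coef(r):
    # leading coefficient of the two-step generalization error, equation 4.15
    return (d + 1 + trace_term / (1 + r)) / (2 * n)

vals = [coef(r) for r in (0, 1, 3, 10)]
assert all(a > b for a, b in zip(vals, vals[1:]))  # more unlabeled data helps
assert abs(coef(1e9) - (d + 1) / (2 * n)) < 1e-8   # down to the (d+1)/(2n) floor
```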
The proof is described in appendix A.

4.4 Cramér-Rao Bound. Now that we have derived the generalization error (see equation 4.15), the next question is how it compares to other estimators. In order to answer this question, we will consider the lower bound of the generalization errors among a reasonable set of estimators. It is well known that the parameter variance of any asymptotically unbiased estimator⁴ is lower-bounded by means of the Fisher information (e.g., Barndorff-Nielsen & Cox, 1989).

Theorem 3 (Asymptotic Cramér-Rao bound). Assume that there are n samples x₁, …, x_n derived i.i.d. from p(x | η*). Also assume that an estimator η̂(x₁, …, x_n) is asymptotically unbiased, that is,

E[η̂(x₁, …, x_n)] = η* + o(n^{−1/2}).

The covariance matrix of the estimator is asymptotically lower-bounded as

lim_{n→∞} n V[η̂ − η*] ≥ J^{−1},  (4.16)

⁴ One could consider asymptotically biased estimators, but typically such estimators are too tricky to be used in practice.
where J is the Fisher information matrix,

J = E_x[∇_η log p(x | η*) ∇_η^⊤ log p(x | η*)],

and A ≥ B means that A − B is positive semidefinite.

Our problem is slightly more complicated than stated in this theorem, because we have both n labeled and m unlabeled samples. In this case, the total Fisher information is simply the sum of the Fisher information of labeled and unlabeled data (e.g., Zhang & Oles, 2000; Seeger, 2001). Therefore, fixing the ratio r = m/n, the bound 4.16 is rewritten as

lim_{n→∞} n V[η̂ − η*] ≥ (G + rU)^{−1},

where U is the Fisher information of the marginal model q(x | θ):

U = E_x[∇_η log q(x | θ*) ∇_η^⊤ log q(x | θ*)].
Once the parameter variance is bounded, we can bound the generalization error asymptotically as follows:

Theorem 4. The generalization error of any asymptotically unbiased estimator is lower-bounded as

lim_{n→∞} n E[R(η̂)] ≥ (1/2) tr(I + rG^{−1}U)^{−1}.  (4.17)
Proof. Let us abbreviate V[η̂ − η*] as V. As seen in equation 4.11, the generalization error is asymptotically expanded as

E[R(η̂)] = (1/2n) tr{GV} + O(n^{−3/2}).  (4.18)

Since lim_{n→∞} nV ≥ (G + rU)^{−1}, we derive equation 4.17 as

lim_{n→∞} n E[R(η̂)] ≥ (1/2) tr G(G + rU)^{−1} = (1/2) tr(I + rG^{−1}U)^{−1}.
4.5 Effect of Unlabeled Data. As we compare the generalization error 4.15 with the lower bound 4.17, it is obvious that the generalization error does not achieve the lower bound with equality. This means that the two-step estimator fails to exploit all the Fisher information provided by the samples. Intuitively, this is because we use only the x's in estimating θ at the first step, discarding the information of y.
However, we will show that the difference to the lower bound gets smaller as the number of unlabeled samples increases. In order to compare the generalization error and the lower bound, the lower bound is expanded as follows:

Lemma 4. When expanded with respect to r, the lower bound in equation 4.17 is described as

lim_{n→∞} n E[R(η̂)] ≥ (1/2) tr(I + rG^{−1}U)^{−1} = (1/2) { d + 1 + (1/r) tr(G^E_θθ U_θθ^{−1}) } + O(r^{−2}).  (4.19)
The proof is in appendix B. The n^{−1} coefficient of the generalization error 4.15 is described as

(1/2) { d + 1 + (1/r) tr(G^E_θθ U_θθ^{−1}) } + O(r^{−2}).  (4.20)

Thus, the difference to the lower bound is within the order of r^{−2}, which becomes very small when r is large.

In order to illustrate this result, we calculate the learning curves for a simple model. The Fisher score is derived from the core model 2.3, where the class distributions are one-dimensional unit gaussians centered on −1 and 1, respectively:

x | y = +1 ∼ N(−1, 1),  x | y = −1 ∼ N(1, 1).
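For a diagonal toy problem (assumed numbers, not the paper's model), the comparison between equation 4.15 and the bound 4.17 can be written out explicitly: with no cross-information between ξ and θ, the two-step excess over d + 1 is c/(1 + r) while the bound's is c/(c + r), where c = tr(G^E_θθ U_θθ^{−1}) ≥ 1:

```python
# assumed scalar Fisher informations for theta (d = 1, no cross terms)
g_theta = 3.0  # joint information G_theta
u_theta = 1.0  # marginal information U_theta (<= G_theta)
c = g_theta / u_theta  # = tr(G^E U^{-1}) in this diagonal case

def two_step(r):
    # 2n times the n^{-1} coefficient of equation 4.15
    return 2 + c / (1 + r)

def cr_bound(r):
    # tr(I + r G^{-1} U)^{-1} from equation 4.17 (two zero entries in U)
    return 1 + 1 + 1 / (1 + r * u_theta / g_theta)

for r in (0.0, 1.0, 3.0, 10.0, 100.0):
    assert two_step(r) >= cr_bound(r) - 1e-12  # the bound is respected
gap_10, gap_100 = (two_step(r) - cr_bound(r) for r in (10.0, 100.0))
assert gap_100 < gap_10 / 50                   # the gap shrinks like O(r^{-2})
```

The same qualitative behavior is what Figure 2 shows for the gaussian model: a visible gap at r = 0 that has all but vanished by r = 3.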
The learning curves at r = 0, 1, and 3 are shown in Figure 2. When there are no unlabeled samples (r = 0), the difference between the learning curve and the lower bound is substantially large. However, the difference gets smaller quickly as r increases, and the two curves become almost identical at r = 3. This illustrative result underlines our theoretical analysis and suggests the importance of unlabeled samples in learning with the Fisher kernel.

Figure 2: Theoretical learning curves of the Fisher kernel classifier at r = 0, 1, and 3. The horizontal axis shows the number of labeled samples n, and the vertical axis shows the generalization error E[R(η̂)]. The solid and broken curves correspond to the generalization error of the two-step estimator and the lower bound determined by the Cramér-Rao bound, respectively. As the unlabeled/labeled ratio r increases, the two curves get closer.

5 Conclusion

In this article, we have investigated several theoretical aspects of the Fisher kernel. One contribution is that we have put the Fisher kernel into the framework of statistical inference by showing the realizability conditions. This allows for subsequent analysis of discriminative classifiers, consistency, and learning curves of the generalization error (including unlabeled data). Thus, our study has put the Fisher kernel approach on a more solid statistical basis, from which new algorithmic directions can be explored (e.g., Bayes inference).

We examined in this article only one option for feature extraction from marginal models: the combination of the Fisher kernel, a linear classifier, and a linear activation function. In practice, it makes sense to consider alternative combinations. Ultimately, our goal is to construct a universal statistical theory of feature extraction from marginal models, which allows even wider practical use and a better inclusion of prior knowledge (e.g., hidden in unlabeled data or in industrial domain knowledge) into kernel-based learning methods.
Appendix A: Proof of Theorem 2

Let us decompose the matrices Γ, Λ as

Γ = ( Γ_ξξ, Γ_ξθ ; Γ_θξ, Γ_θθ ),  Λ = ( Λ_ξξ, Λ_ξθ ; Λ_θξ, Λ_θθ ).

The submatrices are computed as follows:

Γ_ξξ = lim_{n→∞} (1/n) E[ ∇²_ξ Σ_{i=1}^n log Q(y_i | x_i, η*) ]
     = E[ ∇²_ξ log q(x, y | η*) − ∇²_ξ log q(x | θ*) ] = −G_ξξ,

Γ_ξθ = lim_{n→∞} (1/n) E[ ∇_θ ∇_ξ Σ_{i=1}^n log Q(y_i | x_i, η*) ] = −G_ξθ,

Γ_θξ = lim_{n→∞} (1/n) E[ ∇_ξ { Σ_{i=1}^n ∇_θ log q(x_i | θ*) + Σ_{j=1}^m ∇_θ log q(x_j^u | θ*) } ] = 0,

Γ_θθ = lim_{n→∞} (1/n) E[ ∇_θ { Σ_{i=1}^n ∇_θ log q(x_i | θ*) + Σ_{j=1}^m ∇_θ log q(x_j^u | θ*) } ] = −(1 + r) U_θθ,

Λ_ξξ = lim_{n→∞} (1/n) E[ Σ_{i=1}^n ∇_ξ log Q(y_i | x_i, η*) Σ_{i=1}^n ∇_ξ^⊤ log Q(y_i | x_i, η*) ] = G_ξξ,

Λ_ξθ = lim_{n→∞} (1/n) E[ Σ_{i=1}^n ∇_ξ log Q(y_i | x_i, η*) { Σ_{i=1}^n ∇_θ log q(x_i | θ*) + Σ_{j=1}^m ∇_θ log q(x_j^u | θ*) }^⊤ ] = 0,

Λ_θξ = 0,

Λ_θθ = lim_{n→∞} (1/n) E[ { Σ_{i=1}^n ∇_θ log q(x_i | θ*) + Σ_{j=1}^m ∇_θ log q(x_j^u | θ*) } { Σ_{i=1}^n ∇_θ log q(x_i | θ*) + Σ_{j=1}^m ∇_θ log q(x_j^u | θ*) }^⊤ ] = (1 + r) U_θθ.

In summary, we have the following:

Γ = −( G_ξξ, G_ξθ ; 0, (1 + r) U_θθ ),  Λ = ( G_ξξ, 0 ; 0, (1 + r) U_θθ ).

The inverse matrix of Γ becomes

Γ^{−1} = ( Γ_ξξ^{−1}, −Γ_ξξ^{−1} Γ_ξθ Γ_θθ^{−1} ; 0, Γ_θθ^{−1} )
       = ( −G_ξξ^{−1}, (1/(1+r)) G_ξξ^{−1} G_ξθ U_θθ^{−1} ; 0, −(1/(1+r)) U_θθ^{−1} ).

The asymptotic covariance of η̂ is

Γ^{−1} Λ Γ^{−⊤} = ( A, B ; B^⊤, D ),

A = Γ_ξξ^{−1} Λ_ξξ Γ_ξξ^{−1} + Γ_ξξ^{−1} Γ_ξθ Γ_θθ^{−1} Λ_θθ Γ_θθ^{−1} Γ_ξθ^⊤ Γ_ξξ^{−1}
  = G_ξξ^{−1} + (1/(1+r)) G_ξξ^{−1} G_ξθ U_θθ^{−1} G_θξ G_ξξ^{−1},

B = −Γ_ξξ^{−1} Γ_ξθ Γ_θθ^{−1} Λ_θθ Γ_θθ^{−1} = −(1/(1+r)) G_ξξ^{−1} G_ξθ U_θθ^{−1},

D = Γ_θθ^{−1} Λ_θθ Γ_θθ^{−1} = (1/(1+r)) U_θθ^{−1}.

Therefore,

G Γ^{−1} Λ Γ^{−⊤} = ( G_ξξ, G_ξθ ; G_θξ, G_θθ ) ( A, B ; B^⊤, D )
  = ( I, 0 ; G_θξ G_ξξ^{−1} − (1/(1+r)) G^E_θθ U_θθ^{−1} G_θξ G_ξξ^{−1}, (1/(1+r)) G^E_θθ U_θθ^{−1} ).

By substituting it into equation 4.11, we get the asymptotic expansion of the generalization error as

E[R(η̂)] = (1/2n) { d + 1 + (1/(1+r)) tr(G^E_θθ U_θθ^{−1}) } + O(n^{−3/2}).  (A.1)
We remark that G^E_θθ > U_θθ in this case. This can be shown as follows. The conditional information matrix,

J(Y | X) = E_{x,y}[∇_η log Q(y | x, η*) ∇_η^⊤ log Q(y | x, η*)]
         = −E_{x,y}[∇_η ∇_η^⊤ log Q(y | x, η*)]
         = ( G_ξξ, G_ξθ ; G_θξ, G_θθ − U_θθ ),

is positive definite if the probabilistic model is regular. Let us transform the information matrix as

F J(Y | X) F^⊤ = ( G_ξξ, 0 ; 0, G^E_θθ − U_θθ ),

where

F = ( I, 0 ; −G_θξ G_ξξ^{−1}, I ).
Since the matrix F J(Y | X) F^⊤ is positive definite, G^E_θθ − U_θθ must be positive definite too.

Although we showed the result only in the case of log-likelihood loss, it is possible to calculate the generalization error for general loss functions. The formula becomes

E[R(η̂)] = (1/2n) { tr(G_ξξ L_ξξ^{−1} Λ_ξξ L_ξξ^{−1}) + (1/(1+r)) tr(H U_θθ^{−1}) } + O(n^{−3/2}),  (A.2)

where

H = G_θθ − G_θξ L_ξξ^{−1} L_ξθ − L_θξ L_ξξ^{−1} G_ξθ + L_θξ L_ξξ^{−1} G_ξξ L_ξξ^{−1} L_ξθ,
L_ηη = −E_{x,y}[∇_η ∇_η^⊤ ℓ(w^⊤ f_θ(x) + b, y)],
Λ_ξξ = E_{x,y}[∇_ξ ℓ(w^⊤ f_θ(x) + b, y) ∇_ξ^⊤ ℓ(w^⊤ f_θ(x) + b, y)].

Appendix B: Proof of Lemma 4
In order to prove equation 4.19, we will use the following expansion (Sugiyama, 2001):

Lemma 5. For any symmetric matrix Z and r ≠ 0, (I + rZ)^{−1} is expanded as follows:

(I + rZ)^{−1} = (I − ZZ†) − Σ_{j=1}^k (−(1/r) Z†)^j − (−(1/r) Z†)^{k+1} (I + (1/r) Z†)^{−1},  (A.1)
where † indicates the Moore-Penrose pseudoinverse (Campbell & Meyer, 1979) and k is an arbitrary positive integer.

The proof is described in appendix C. The lower bound in equation 4.17 is rewritten as

(1/2) tr(I + rG^{−1}U)^{−1} = (1/2) tr[(G^{1/2} + rG^{−1/2}U)^{−1} G^{1/2}]
  = (1/2) tr[G^{1/2} (G^{1/2} + rG^{−1/2}U)^{−1}]
  = (1/2) tr(I + rG^{−1/2} U G^{−1/2})^{−1}.

Setting Z = G^{−1/2} U G^{−1/2}, it is expanded as

(1/2) tr(I + rG^{−1}U)^{−1} = (1/2) { ξ₀ + (1/r) ξ₁ } + O(r^{−2}),  (A.2)

where the coefficients are described as

ξ₀ = tr(I − ZZ†),  ξ₁ = tr(Z†).  (A.3)
Equation 4.19 is proved because the coefficients are derived as follows:

Lemma 6. The coefficients ξ₀ and ξ₁ are described as

ξ₀ = d + 1,  (A.4)
ξ₁ = tr(G^E_θθ U_θθ^{−1}),  (A.5)

respectively.

Proof. Since q(x | θ) does not depend on w and b, U is described as

U = ( 0, 0 ; 0, U_θθ ).

Then Z is rewritten as

Z = G^{−1/2} U G^{−1/2} = BB^⊤,
where B is a (2d + 1) × d matrix:

B = G^{−1/2} ( 0 ; U_θθ^{1/2} ).

In terms of B, the pseudoinverse of Z is written as

Z† = B(B^⊤B)^{−2} B^⊤.

The coefficient ξ₀ is rewritten as

ξ₀ = tr(I − ZZ†) = (2d + 1) − tr(ZZ†),

where

tr(ZZ†) = tr(BB^⊤ B(B^⊤B)^{−2} B^⊤) = tr(B(B^⊤B)^{−1} B^⊤) = tr(B^⊤ B(B^⊤B)^{−1}) = d.

We thus have ξ₀ = d + 1. ξ₁ = tr(Z†) is rewritten as

tr(Z†) = tr[B(B^⊤B)^{−2} B^⊤] = tr(B^⊤B)^{−1}
  = tr( [0, U_θθ^{1/2}] G^{−1} [0 ; U_θθ^{1/2}] )^{−1}
  = tr[U_θθ^{−1/2} S_θθ^{−1} U_θθ^{−1/2}]
  = tr(S_θθ^{−1} U_θθ^{−1}) = tr(G^E_θθ U_θθ^{−1}).
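Lemma 6 can be sanity-checked on an assumed diagonal example (our own, with d = 1), where the pseudoinverse acts entrywise and both coefficients come out exactly:

```python
# assumed diagonal toy with eta = (w, b, theta): G = diag(2, 8, 4), U = diag(0, 0, 1)
g3, u = 4.0, 1.0  # only the theta entries matter for Z

# Z = G^{-1/2} U G^{-1/2} is diagonal; the pseudoinverse inverts nonzero entries only
z = [0.0, 0.0, u / g3]
z_pinv = [0.0 if v == 0.0 else 1.0 / v for v in z]

xi0 = sum(1.0 - zv * pv for zv, pv in zip(z, z_pinv))  # tr(I - Z Z^dagger)
xi1 = sum(z_pinv)                                      # tr(Z^dagger)

d = 1
assert xi0 == d + 1   # equation A.4
assert xi1 == g3 / u  # = tr(G^E U^{-1}) in this diagonal case, equation A.5
```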
Appendix C: Proof of Lemma 5
This expansion was originally derived in lemma 4.8 of Sugiyama (2001). In the following, we quote his proof for readers' convenience.

Let us define α = 1/r. According to theorem 4.8 in Albert (1972), the following holds for a symmetric matrix Z:

(I + α^{−1} Z)^{−1} = (I − ZZ†) + αZ† (I + αZ†)^{−1}.  (A.1)

Also, for any matrix B, we have the following equation:

(I + B)^{−1} = I(I + B)^{−1} = (I + B − B)(I + B)^{−1} = I − B(I + B)^{−1}.
Defining B = αZ†, we have

(I + αZ†)^{−1} = I − αZ† (I + αZ†)^{−1}.  (A.2)

By repeatedly applying equation A.2 to equation A.1, we have

(I + α^{−1} Z)^{−1} = (I − ZZ†) + αZ† [I − αZ† (I + αZ†)^{−1}]
  = (I − ZZ†) + αZ† − (αZ†)² (I + αZ†)^{−1}
  = (I − ZZ†) + αZ† − (αZ†)² + (αZ†)³ (I + αZ†)^{−1}
  ⋮
  = (I − ZZ†) − Σ_{j=1}^k (−αZ†)^j − (−αZ†)^{k+1} (I + αZ†)^{−1}.
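For a nonzero scalar Z = z the pseudoinverse is 1/z and lemma 5 collapses to a finite geometric-series identity, which the following sketch verifies numerically (the values of z, r, and k are arbitrary):

```python
z, r, k = 2.5, 3.0, 4
a = 1.0 / (r * z)  # the scalar alpha * Z^dagger of the proof

# right-hand side of lemma 5: Z Z^dagger = 1, so the (I - Z Z^dagger) term vanishes
rhs = -sum((-a) ** j for j in range(1, k + 1)) - (-a) ** (k + 1) / (1 + a)
assert abs(rhs - 1 / (1 + r * z)) < 1e-12

# for z = 0 the pseudoinverse terms vanish and (I - Z Z^dagger) = 1 alone is exact
assert 1.0 / (1.0 + r * 0.0) == 1.0
```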
Acknowledgments

K.R.M. acknowledges partial financial support by DFG (MU 987/1-1) and B.M.B.F. under contract FKZ 01IBB02A.

References

Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse. Orlando, FL: Academic Press.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–153.
Amari, S., & Nagaoka, H. (2001). Methods of information geometry. Providence, RI: American Mathematical Society.
Barndorff-Nielsen, O., & Cox, D. (1989). Asymptotic techniques for use in statistics. London: Chapman and Hall.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Campbell, S., & Meyer, C. (1979). Generalized inverse of linear transformations. New York: Pitman Publishing.
Cox, D., & Hinkley, D. (1974). Theoretical statistics. London: Chapman & Hall.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Eguchi, S., & Copas, J. (2001). Information geometry on discriminant analysis and recent development. Journal of the Korean Statistical Society, 27, 101–117.
Gelfand, I., & Fomin, S. (1963). Calculus of variations. Englewood Cliffs, NJ: Prentice-Hall.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705–708.
Haussler, D., Kearns, M., Seung, H., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 487–493). Cambridge, MA: MIT Press.
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination (Tech. Rep. No. AITR-1668). Cambridge, MA: MIT Press.
Kawanabe, M., & Amari, S. (1994). Estimation of network parameters in semiparametric stochastic perceptron. Neural Computation, 6, 1244–1261.
Malzahn, D., & Opper, M. (2002). A variational approach to learning curves. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 463–469). Cambridge, MA: MIT Press.
Müller, K.-R., Finke, M., Schulten, K., Murata, N., & Amari, S. (1996). A numerical study on learning curves in stochastic multi-layer feed-forward networks. Neural Computation, 8, 1085–1106.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Networks, 12, 181–201.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Seeger, M. (2001). Learning with labeled and unlabeled data (Tech. Rep.). Edinburgh: Institute for Adaptive and Neural Computation, University of Edinburgh. Available on-line: http://www.dai.ed.ac.uk/homes/seeger/papers/review.ps.gz.
Seeger, M. (2002). Covariance kernels from Bayesian generative models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 905–912). Cambridge, MA: MIT Press.
Seung, S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A, 45, 6056.
Smith, N., & Gales, M. (2002). Speech recognition using SVMs. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1197–1204). Cambridge, MA: MIT Press.
Sonnenburg, S., Rätsch, G., Jagota, A., & Müller, K.-R. (2002). New methods for splice site recognition. In J. Dorronsoro (Ed.), Artificial neural networks—ICANN 2002 (pp. 329–336). New York: Springer-Verlag.
Sugiyama, M. (2001). A theory of model selection and active learning for supervised learning. Unpublished doctoral dissertation, Tokyo Institute of Technology.
Tsuda, K., & Kawanabe, M. (2002). The leave-one-out kernel. In J. Dorronsoro (Ed.), Artificial neural networks—ICANN 2002 (pp. 727–732). New York: Springer-Verlag.
Tsuda, K., Kawanabe, M., & Müller, K.-R. (in press). Clustering with the Fisher score. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Tsuda, K., Kawanabe, M., Rätsch, G., Sonnenburg, S., & Müller, K.-R. (2002). A new discriminative kernel from probabilistic models. Neural Computation, 14, 2397–2414.
van der Vaart, A. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vinokourov, A., & Girolami, M. (2002). A probabilistic framework for the hierarchic organization and classification of document collections. Journal of Intelligent Information Systems, 18, 153–172.
Watanabe, S. (2001). Algebraic analysis for non-identifiable learning machines. Neural Computation, 13, 899–933.
Watkin, T., Rau, A., & Biehl, M. (1993). The statistical mechanics of learning a rule. Reviews of Modern Physics, 65, 499.
Zhang, T., & Oles, F. (2000). The value of unlabeled data for classification problems. In P. Langley (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1191–1198). San Mateo, CA: Morgan Kaufmann.
Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., & Müller, K.-R. (2000). Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16, 799–807.

Received February 3, 2003; accepted June 3, 2003.
LETTER
Communicated by David Stork
Adaptive Hybrid Learning for Neural Networks Rob Smithies [email protected]
Said Salhi
[email protected]
Nat Queen [email protected] School of Mathematics and Statistics, University of Birmingham, Birmingham B15 2TT, U.K.
A robust locally adaptive learning algorithm is developed via two enhancements of the Resilient Propagation (RPROP) method. Remaining drawbacks of the gradient-based approach are addressed by hybridization with gradient-independent Local Search. Finally, a global optimization method based on recursion of the hybrid is constructed, making use of tabu neighborhoods to accelerate the search for minima through diversification. Enhanced RPROP is shown to be faster and more accurate than the standard RPROP in solving classification tasks based on natural data sets taken from the UCI repository of machine learning databases. Furthermore, the use of Local Search is shown to improve Enhanced RPROP by solving the same classification tasks as part of the global optimization method.

1 Introduction

Neural networks solve problems by learning the functional relationship between paired input-output data characterizing the problem. Therefore, they see use in a diverse range of applications, including pattern recognition, classification, and optimization. How this learning is performed is key to the success of the neural network. The most popular technique is supervised learning: the neural network is exposed to training examples of the functional relationship sampled from the paired input-output data, so that connection weights affecting signals propagated through the neural network are able to be adapted. The standard architecture for supervised learning is the multilayer feedforward network (MLFF) (Rumelhart & McClelland, 1986), consisting of a number of units linked by weighted connections and organized into input, hidden, and output layers. Therefore, this study focuses on the training of MLFFs. Each training example consists of an input pattern x_p and a target output pattern t_p, such that the functional relationship between paired input-output data is represented by a
© 2003 Massachusetts Institute of Technology
set P of these training examples. The goal of supervised learning is to tune the connection weights so that when x_p is presented to the input layer and propagated through the neural network, the output pattern y_p computed in the output layer will match t_p. The training is guided by a cost function

E = Σ_{p∈P} E_p,  (1.1)

where E_p is a measure of error between y_p and t_p. The aim of supervised learning is now equivalent to searching for a global minimum of E over the vector space spanned by the weights of the neural network. Typically, gradient descent is used for this type of optimization, in which each connection weight w_ij is adapted on iteration t by a step taken in the opposite direction to the gradient, scaled by the learning rate ε:

Δw_ij(t) = −ε ∂E/∂w_ij(t),  (1.2)
w_ij(t + 1) = w_ij(t) + Δw_ij(t).

The standard method for computing the first-order derivative of equation 1.2 and thus training MLFFs is the backpropagation algorithm, devised by Rumelhart and McClelland (1986). It works by recursively propagating the training examples through the neural network in order to backpropagate the errors E_p of equation 1.1. If on-line learning is used, weights are updated for each training example p using the gradient ∂E_p/∂w_ij to minimize E_p, which works well if the data set contains a lot of redundant information. If off-line learning is used, weights are updated for all training examples using the gradient ∂E/∂w_ij to minimize E. Off-line learning is more reliable, as the shape of the entire cost function is considered, and so it is the strategy of choice in this study.

This letter tackles the conundrum of choosing an appropriate learning rate ε for training, which is often a difficult task and is problem specific, being dependent on the shape of the error function. For instance, a small learning rate leads to slow convergence in shallow regions of the error function, whereas a large learning rate can lead to oscillations in the search. A number of proposed approaches deal with this problem of gradient descent and can be broadly classified as either globally or locally adaptive techniques. Globally adaptive techniques such as the conjugate gradient method (Moller, 1993) use knowledge about the state of the entire neural network, whereas locally adaptive techniques use weight-specific information only. Although local strategies require less information, Schiffman and Joost (1993) argue that they outperform global strategies for many applications, and so this letter focuses on locally adaptive techniques.
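The plain update 1.2 that the adaptive techniques reviewed below improve upon can be sketched on a one-weight toy cost (our illustrative choice, not one of the benchmark tasks):

```python
def grad(w):
    # gradient of the toy cost E(w) = (w - 2)^2
    return 2.0 * (w - 2.0)

w, lr = 0.0, 0.1  # initial weight and learning rate epsilon
for _ in range(200):
    w += -lr * grad(w)  # delta w(t) = -epsilon * dE/dw (equation 1.2)

assert abs(w - 2.0) < 1e-6  # converged to the minimum of E
```

With ε = 0.1 the error contracts by a factor 0.8 per step; too large an ε would make the factor exceed 1 in magnitude and the search would oscillate, which is exactly the trade-off discussed above.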
Adaptive Hybrid Learning for Neural Networks
The article is organized as follows. A brief review of locally adaptive techniques is given in section 2. These form the basis of the new local and global optimization methods presented in section 3. The new methods are tested using natural data sets, and computational results are presented in section 4. A concluding summary follows in section 5.

2 A Review of Locally Adaptive Techniques

QuickProp is a locally adaptive technique due to Fahlman (1988), in which the local error function for each weight is assumed to be a parabolic basin whose gradient is unaffected by the updating of other weights. Estimates of the position of the minimum for each weight are given by the following update:
$$\Delta w_{ij}(t) = -\epsilon \frac{\partial E}{\partial w_{ij}}(t) + \frac{\frac{\partial E}{\partial w_{ij}}(t)}{\frac{\partial E}{\partial w_{ij}}(t-1) - \frac{\partial E}{\partial w_{ij}}(t)}\, \Delta w_{ij}(t-1), \qquad (2.1)$$
where the first term on the right-hand side is the standard gradient-descent step, and the second term is equivalent to a local application of Newton's approximation method (Riedmiller, 1994). Furthermore, in order to avoid excessively large update steps due to a small denominator in the second term of equation 2.1, the update step is restricted to be at most $\nu$ times as large as the previous update step. QuickProp consistently outperforms gradient descent and is one of the most popular adaptive learning techniques (Fahlman, 1988).

Delta-Bar-Delta, proposed by Jacobs (1988), uses weight-specific learning rates based on the reasoning that the error function may well differ in shape from weight to weight. Each learning rate is adapted using a local estimate of the shape of the error function given by the gradients at consecutive weight updates. If the gradients have the same sign, the learning rate is linearly increased by a small constant $\kappa$ to accelerate learning in shallow regions of the error function. Otherwise, a local minimum has been traversed, as the previous weight update was too large, and so the learning rate is exponentially decreased by multiplying it with a constant $\eta^-$:
$$\epsilon_{ij}(t) = \begin{cases} \kappa + \epsilon_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) > 0 \\ \eta^-\,\epsilon_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0 \\ \epsilon_{ij}(t-1), & \text{otherwise,} \end{cases} \qquad (2.2)$$
where $0 < \eta^- < 1$. The weight update is given by
$$\Delta w_{ij}(t) = -\epsilon_{ij}(t)\,\frac{\partial E}{\partial w_{ij}}(t) + \mu\, \Delta w_{ij}(t-1), \qquad (2.3)$$
R. Smithies, S. Salhi, and N. Queen
which is similar to the standard weight update, equation 1.2, of gradient descent except that it includes an extra momentum term with a momentum coefficient $\mu$ scaling the influence of the previous update on the current update. The Delta-Bar-Delta method also outperforms gradient descent and is more robust with respect to the choice of parameters (Jacobs, 1988).

Super SAB, described by Tollenaere (1990), is based on weight-specific learning rates like Delta-Bar-Delta, but differs slightly in that the learning rate is increased exponentially in shallow regions of the error function instead of linearly, thus providing a more reactive algorithm:
$$\epsilon_{ij}(t) = \begin{cases} \eta^+\,\epsilon_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) > 0 \\ \eta^-\,\epsilon_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0 \\ \epsilon_{ij}(t-1), & \text{otherwise,} \end{cases}$$
where $0 < \eta^- < 1 < \eta^+$. The weight update is once again performed by equation 2.3.

The adaptive methods seen thus far suffer from a counterintuitive problem. When large update steps are required to accelerate learning along shallow regions of the error function, the associated small partial derivatives give rise to small update steps. Alternatively, when small update steps are needed to converge to a solution carefully, as in steep regions of the error function in the vicinity of local minima, the associated large partial derivatives give rise to large update steps. The best of the gradient-based methods in terms of convergence speed, accuracy, and robustness with respect to its parameters is Resilient Propagation (RPROP), developed by Riedmiller and Braun (1993). Only the sign of the partial derivative $\partial E/\partial w_{ij}$ is used by RPROP to determine each weight update. Therefore, the size of each weight update is independent of the size of the partial derivative, overcoming the problem mentioned above. Indeed, RPROP updates each weight $w_{ij}$ based on a step size $\Delta_{ij}$ controlled by the following heuristic:
$$\Delta_{ij}(t) = \begin{cases} \eta^+\,\Delta_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) > 0 \\ \eta^-\,\Delta_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0 \\ \Delta_{ij}(t-1), & \text{otherwise,} \end{cases} \qquad (2.4)$$
where $0 < \eta^- < 1 < \eta^+$. If the gradient $\partial E/\partial w_{ij}$ does not change sign for consecutive weight steps, the step size is increased exponentially to accelerate learning; otherwise, the step size is decreased in an attempt to converge on the local minimum that has been traversed. The first two heuristics presented in section 3 are enhancements of equation 2.4. It should be noted that step sizes are bounded above by $\Delta_{max}$ to prevent excessively large weights. Moreover, the choice of initial step size $\Delta_{ij}(0)$ has some impact on performance, though RPROP is less sensitive to this than its rivals are to
the choice of their initial learning rate, as has been shown experimentally by Riedmiller (1994).

A common attempt at improving learning algorithms such as RPROP is to use weight backtracking (Silva & Almeida, 1990; Riedmiller & Braun, 1993), where a weight update is reversed if it impinges on the learning achieved for that weight of the neural network. If weight backtracking is used, each weight update $\Delta w_{ij}$ is given by the RPROP+ heuristic:
$$\Delta w_{ij}(t) = -\operatorname{sign}\!\left\{\frac{\partial E}{\partial w_{ij}}(t)\right\} \Delta_{ij}(t), \quad \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) \ge 0, \qquad (2.5)$$
$$\Delta w_{ij}(t) = -\Delta w_{ij}(t-1) \ \text{and} \ \frac{\partial E}{\partial w_{ij}}(t) := 0, \quad \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0, \qquad (2.6)$$
whereas if weight backtracking is omitted, it is given by the RPROP− heuristic:
$$\Delta w_{ij}(t) = -\operatorname{sign}\!\left\{\frac{\partial E}{\partial w_{ij}}(t)\right\} \Delta_{ij}(t). \qquad (2.7)$$
Enhancements of the RPROP+ and RPROP− heuristics have been proposed in the literature. An improved version of RPROP+ due to Igel and Hüsken (2000), known as iRPROP+, is based on the idea that a change in sign of the partial derivative implies a traversal of a local minimum but does not reveal whether the weight update leads to an increase or decrease in error. Therefore, reversing a weight step when the error decreases is counterproductive, and so weight backtracking should be performed only if the overall error increases. This is done by enhancing equation 2.6 to give the following heuristic:
$$\Delta w_{ij}(t) = -\Delta w_{ij}(t-1), \quad \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0 \ \text{and} \ E(t) > E(t-1), \qquad (2.8)$$
where iRPROP+ is given by combining equation 2.5 with 2.8. Close to a local minimum, the error function can be regarded as quadratic, and so if step sizes are adapted to be small enough to maintain the search in the associated basin of attraction, then each weight update traversing the minimum will always take a step closer to that minimum. Compared to RPROP+, only one additional variable needs to be stored: the previous overall error $E(t-1)$.

An improved version of RPROP− due to Igel and Hüsken (2000), known as iRPROP−, incorporates the idea of decreasing the step size if a local minimum has been traversed, while avoiding unnecessary weight modification in the next iteration. This is done via the following heuristic:
$$\frac{\partial E}{\partial w_{ij}}(t) := 0, \quad \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0, \qquad (2.9)$$
where iRPROP− is given by combining equations 2.7 and 2.9. According to Igel and Hüsken (2000), iRPROP+ and iRPROP− outperform RPROP+ and RPROP− over a variety of problems, including optimization of artificial test functions, classification tasks taken from the PROBEN1 benchmark collection of natural data sets (Prechelt, 1994), and approximation of differential equation solutions. There appears to be little difference in performance between iRPROP+ and iRPROP−, but iRPROP+ requires additional computation of the previous error $E(t-1)$, and so the enhancements of RPROP presented in this article will be tested in conjunction with the iRPROP− heuristic.

It should be noted that gradient-based methods require the existence of a gradient in order to descend toward a local minimum and thus fail in flat regions of the error function. However, they can also falter if the gradient is shallow enough to be deemed effectively flat by round-off due to a lack of computational precision. Moreover, local optimization strategies cannot guarantee a globally optimal solution to any given problem, as they are prone to becoming trapped in the basin of attraction of some local minimum. Such issues will be addressed later in the article, by hybridizing Enhanced iRPROP− with gradient-independent techniques and then constructing a global optimization method based on the hybrid.

3 Methodology

Two further enhancements of RPROP will be proposed with the aim of producing a new local optimization method, before being hybridized with a proposed form of Local Search to create a new global optimization method.

3.1 Enhancing the RPROP Algorithm.

3.1.1 The POWER Heuristic. Although RPROP is less sensitive to the choice of initial step sizes $\Delta_{ij}(0)$ than other learning algorithms, they must necessarily be small to avoid excessively large weights. Hence the exponential increases of RPROP due to equation 2.4 require a warm-up period before optimum step sizes are reached.
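The warm-up period can be made concrete with a minimal sketch of the step-size rule of equation 2.4. The values $\eta^+ = 1.2$, $\Delta_{max} = 50$, and $\Delta_{ij}(0) = 0.0125$ follow the iRPROP− settings reported in section 4, while the assumption of a persistently shallow region (the gradient never changing sign) is purely illustrative.

```python
def rprop_step_size(delta_prev, grad_prev, grad_now,
                    eta_plus=1.2, eta_minus=0.5, delta_max=50.0):
    """Step-size heuristic of RPROP (equation 2.4), bounded above by delta_max."""
    s = grad_prev * grad_now
    if s > 0:        # same sign: shallow region, accelerate exponentially
        return min(eta_plus * delta_prev, delta_max)
    if s < 0:        # sign change: a minimum was traversed, back off
        return eta_minus * delta_prev
    return delta_prev

# Warm-up: starting from a small initial step size, count how many
# consecutive same-sign iterations are needed before delta_max is reached.
delta, epochs = 0.0125, 0
while delta < 50.0:
    delta = rprop_step_size(delta, grad_prev=1.0, grad_now=1.0)
    epochs += 1
# epochs comes out at several dozen iterations of pure acceleration,
# which is the warm-up the POWER heuristic below is designed to shorten.
```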
This deficiency is overcome through a modification of equation 2.4, giving the following power-based heuristic, known as POWER:
$$\delta_{ij}(t) = \begin{cases} \delta_{ij}(t-1)^{1/\eta^+}, & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) > 0 \\ \eta^-\,\delta_{ij}(t-1), & \text{if } \frac{\partial E}{\partial w_{ij}}(t-1)\,\frac{\partial E}{\partial w_{ij}}(t) < 0 \\ \delta_{ij}(t-1), & \text{otherwise,} \end{cases} \qquad (3.1)$$
$$\Delta_{ij}(t) = \delta_{ij}(t)\,\Delta_{max},$$
where $0 < \eta^- < 1 < \eta^+$ and $0 < \delta_{ij}(0) < 1$. Any acceleration of the learning done by increasing the step size using equation 3.1 is more immediate than
Figure 1: The step size increase due to POWER is initially more rapid than the exponential increase of RPROP or the linear increase of Delta-Bar-Delta, resulting in a more responsive algorithm.
in the standard RPROP heuristic, equation 2.4, or in the Delta-Bar-Delta heuristic, equation 2.2, since the adaptation of the step sizes is much more rapid, as illustrated in Figure 1. Therefore, RPROP enhanced by POWER is more robust than the standard RPROP, being even less sensitive to the choice of initial step sizes. Note that $\delta_{ij}(0) < 1$ in order for the exponent of equation 3.1 to increase the step size. In fact, making use of the rapid adaptation, one can effectively eliminate $\delta_{ij}(0)$ as a heuristic parameter, since any positive value close to zero will suffice. A further benefit of using equation 3.1 is that over sustained periods, the step size will increase smoothly to a maximum given by $\Delta_{max}$, unlike the sharp cut-offs that characterize the standard RPROP heuristic, equation 2.4.

3.1.2 The TRIG Heuristic. Using the fact that close to a local minimum the error function is assumed to be quadratic, the heuristic parameter $\eta^-$ of equation 2.4 or 3.1 can be replaced by a parameter $\eta^-_{ij}(t)$ adapted at each iteration of RPROP by the following trigonometric-based heuristic, referred to as TRIG:
$$\eta^-_{ij}(t) = \begin{cases} \dfrac{\cos\theta(t-1)}{\cos\theta(t-1) + \cos\theta(t)}, & \text{if weight backtracking is used} \\[2ex] \dfrac{\cos\theta(t)}{\cos\theta(t-1) + \cos\theta(t)}, & \text{otherwise,} \end{cases} \qquad (3.2)$$
$$\theta_{ij}(t) = \tan^{-1}\!\left\{\frac{\partial E}{\partial w_{ij}}(t)\right\}. \qquad (3.3)$$
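A minimal numeric sketch of equations 3.2 and 3.3 follows; the gradient values are invented for illustration, and the comparison simply shows that the adaptive factor always lies in $(0, 1)$ and cuts the step size harder when the post-step gradient is steeper.

```python
import math

def trig_eta_minus(grad_prev, grad_now, backtracking=False):
    """Adaptive decrease factor of the TRIG heuristic (equations 3.2-3.3).

    theta(t) = atan(dE/dw(t)); the resulting factor lies in (0, 1) and
    replaces the fixed parameter eta_minus of RPROP.
    """
    c_prev = math.cos(math.atan(grad_prev))
    c_now = math.cos(math.atan(grad_now))
    if backtracking:
        return c_prev / (c_prev + c_now)
    return c_now / (c_prev + c_now)

# A sign change with a steeper gradient after the step (|g| growing from
# 0.5 to 2.0) yields a stronger cut than the reverse, shallower case.
cut_steeper = trig_eta_minus(-0.5, 2.0)    # without backtracking
cut_shallower = trig_eta_minus(-2.0, 0.5)
```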
If weight update $\Delta w_{ij}$ traverses a local minimum, then the assumption of a quadratic shape of the error function close to the local minimum can be used to detect any increase or decrease in the error for weight $w_{ij}$, provided the step size is small enough to maintain the search in the associated basin of attraction. This idea is depicted in Figure 2 for the case when weight backtracking is used. An increase in the error occurs if $\frac{\partial E}{\partial w_{ij}}(t-1) < \frac{\partial E}{\partial w_{ij}}(t)$ or, equivalently, if $\theta(t) > \theta(t-1)$ by equation 3.3, as illustrated in the top diagram of Figure 2. A decrease in the error occurs if $\frac{\partial E}{\partial w_{ij}}(t-1) > \frac{\partial E}{\partial w_{ij}}(t)$ or, equivalently, if $\theta(t) < \theta(t-1)$ by equation 3.3, as indicated in the bottom diagram of Figure 2. This angular difference allows step-size decrements to be tailored to the local state of the error function by equation 3.2, depending on whether weight backtracking is used, and is a more flexible way of achieving effective convergence to the local minimum than using the fixed parameter $\eta^-$.

3.2 Hybridizing Enhanced RPROP with Local Search. If Enhanced RPROP (that is, RPROP with TRIG & POWER) fails to converge to a local minimum for a given problem, it is likely to be either oscillating about the minimum along several of the weight-specific gradients or stagnating in an effectively flat region of the error function. Provided such situations can be detected, hybridizing Enhanced RPROP with a gradient-independent Local Search (Aarts & Lenstra, 1997) can reinvigorate exploration of the weight space. A neural network can be trained by Local Search using the following weight update for each iteration $t$:
$$W(t+1) = \begin{cases} W(t) + \Delta W(t), & \text{if } E(W(t) + \Delta W(t)) < E(W(t)) \\ W(t), & \text{otherwise,} \end{cases} \qquad (3.4)$$
$$\Delta W(t) \in N(W(t)), \qquad (3.5)$$
where $N(W(t))$ is a region of weight space centered on $W(t)$, known as the neighborhood of $W(t)$. Note that since weight space is continuous, $N(W(t))$ must also be continuous. Key to the use of Local Search after a run of Enhanced RPROP is the choice of neighborhood. If Enhanced RPROP is oscillating, the neighborhood should be small enough in the relevant dimensions of weight space to guide the search toward a minimum in those dimensions. Alternatively, if Enhanced RPROP is stagnating, the neighborhood should be large enough in the relevant dimensions of weight space to give sufficient scope to the search. Both ideals can be achieved by using the step sizes $\Delta^F_{ij}$ given by the final iteration of Enhanced RPROP to construct a neighborhood,
$$N(W(t)) = \{W(t) + \Delta W(t) : |\Delta w_{ij}| < \Delta^F_{ij} \text{ for all } i \text{ and } j\},$$
Figure 2: The assumption of a quadratic shape to the error function in the vicinity of a local minimum is used by TRIG to detect increases (top) or decreases (bottom) in error, resulting in faster convergence.
where $\Delta w_{ij}$ is the $(i,j)$th entry of the weight update matrix $\Delta W(t)$. This provides an estimated basin of attraction of the local minimum.

The switch from Enhanced RPROP to the new version of Local Search, and the subsequent termination of the local optimization method, are controlled by criteria defined by Prechelt (1994). First, there is the relative increase of the validation error over the current minimum validation error, known as the generalization loss at iteration $t$:
$$GL(t) = \frac{E_{va}(t)}{\min_{t' \le t} E_{va}(t')} - 1 = \frac{E_{va}(t)}{E_{opt}(t)} - 1,$$
where $E_{tr}(t)$ is the training error, $E_{va}(t)$ is the validation error, and $E_{opt}(t)$ is the lowest validation error up to iteration $t$. A high generalization loss is one reason to switch to Local Search or to terminate training, which occurs if
$$GL_j(t) > (GL_j)_{threshold} \qquad (3.6)$$
for $j$ successive iterations $t-j+1, \ldots, t$. Second, there is a measure of how much larger the average training error over $k$ successive epochs is compared to the minimum training error for that period, known as the training progress at iteration $t$:
$$TP_k(t) = \frac{\sum_{t' \in \{t-k+1, \ldots, t\}} E_{tr}(t')}{k \min_{t' \in \{t-k+1, \ldots, t\}} E_{tr}(t')} - 1.$$
This measure is high for unstable periods of training but is guaranteed to approach zero unless the whole training phase oscillates. Therefore, the switch to Local Search or termination of training occurs if
$$TP_k(t) < (TP_k)_{threshold}. \qquad (3.7)$$
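The two criteria can be sketched directly from their definitions; the error histories below are invented for illustration, and the thresholds match the values later reported in section 4.2.

```python
def generalization_loss(val_errors):
    """GL(t) = E_va(t) / min_{t' <= t} E_va(t') - 1."""
    return val_errors[-1] / min(val_errors) - 1.0

def training_progress(train_errors, k=5):
    """TP_k(t): average training error over the last k epochs relative to
    the minimum training error in that period, minus one."""
    window = train_errors[-k:]
    return sum(window) / (k * min(window)) - 1.0

# Invented histories: validation error creeping upward, training error flat.
val_hist = [0.30, 0.25, 0.24, 0.26, 0.27]
train_hist = [0.200, 0.1999, 0.2001, 0.2000, 0.1999]
gl = generalization_loss(val_hist)   # 0.27 / 0.24 - 1 = 0.125
tp = training_progress(train_hist, k=5)
switch = gl > 0.05 or tp < 1e-5      # thresholds used in section 4.2
```

Here the rising validation error triggers the generalization-loss criterion even though training progress is still above its threshold.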
Combining these criteria with Enhanced RPROP & Local Search gives a hybrid method for local optimization that draws on the relative strengths of both gradient-based and gradient-independent techniques.

3.3 Tabu Neighborhoods for Global Optimization. Existing global optimization strategies like the branch-and-bound algorithm are mostly confined to problems associated with discrete solution spaces, such as combinatorial optimization problems. Exceptions include hill climbing algorithms such as simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983). The global optimization strategy presented here involves the recursive application of the Enhanced RPROP & Local Search hybrid method, where the final solution of each recursion can be used to improve the search of subsequent recursions. This is achieved via an application of Tabu Search (Salhi & Queen, in
press), an extension of Local Search in which previous candidate updates are prevented from reselection by surrounding tabu regions that collectively form tabu space:
$$T := \begin{cases} T \cup N(\Delta W(t)), & \text{if } E(W(t) + \Delta W(t)) \ge E(W(t)) \\ T, & \text{otherwise,} \end{cases}$$
and hence equation 3.5 is replaced by the following rule:
$$\Delta W(t) \in N(W(t)) \setminus T.$$
The weight update is the same as that of Local Search, namely equation 3.4. The first recursion of the hybrid method begins with the uniformly random selection of an initial solution from initial weight space. Each time a final solution $W^F$ generated by a recursion of the hybrid method has some chosen error measure $E(W^F)$ worse than that of the current best final solution $W^F_{best}$, the corresponding initial solution $W^I$ is surrounded by a tabu neighborhood $N(W^I)$ subsequently incorporated into tabu space:
$$T := \begin{cases} T \cup N(W^I), & \text{if } E(W^F) \ge E(W^F_{best}) \\ T, & \text{otherwise.} \end{cases}$$
Otherwise, $W^F$ becomes the new best final solution:
$$W^F_{best} := \begin{cases} W^F, & \text{if } E(W^F) < E(W^F_{best}) \\ W^F_{best}, & \text{otherwise.} \end{cases}$$
Initially, $T = \emptyset$ and $E(W^F_{best}) \gg 1$, and subsequent initial solutions are chosen to satisfy
$$W^I \notin T. \qquad (3.8)$$
The idea of using Tabu Search in this context is to encourage exploration of new regions of weight space. Therefore, the tabu neighborhood $N(W^I)$ is defined in such a way that a minimum of $m$ neighborhoods provides a covering of the initial weight space. Suppose the initial weight space has volume $V$, and each tabu neighborhood $N(W^I)$ is an $n$-dimensional hypercube subspace of the initial weight space, of side $2r$ and hence volume $(2r)^n$:
$$N(W^I) = \{W^I + \Delta W^I : \max_{i,j} |\Delta w^I_{ij}| < r\}, \qquad (3.9)$$
where $\Delta w^I_{ij}$ is the $(i,j)$th entry of the matrix $\Delta W^I$. Then it is evident that $V \ge m(2r)^n$, which can be rearranged to give a recommended minimum for the parameter $r$ in equation 3.9:
$$r = \frac{1}{2}\sqrt[n]{\frac{V}{m}}. \qquad (3.10)$$
Figure 3: Main steps of the global optimization method.
Finally, if the initial weight space itself is defined as an $n$-dimensional hypercube of side $2R$, then $V = (2R)^n$, and substituting this into equation 3.10 gives
$$r = \frac{R}{\sqrt[n]{m}}. \qquad (3.11)$$
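Equations 3.11, 3.9, and 3.8 can be sketched together as follows. The dimensions and trial count mirror the cancer1 settings of section 4.2; the rejection-sampling loop is an assumption about how condition 3.8 might be implemented, not a detail given in the article.

```python
import random

def tabu_radius(R, n, m):
    """r = R / m**(1/n) (equation 3.11): half-side of each tabu hypercube,
    chosen so that at least m such neighborhoods cover [-R, R]^n."""
    return R / m ** (1.0 / n)

def draw_initial_weights(n, R, tabu_centers, r, seed=0):
    """Uniformly sample an initial solution W_I from [-R, R]^n until it
    lies outside every tabu hypercube (condition 3.8), measuring the
    distance to each stored center in the max norm, as in equation 3.9."""
    rng = random.Random(seed)
    while True:
        w = [rng.uniform(-R, R) for _ in range(n)]
        if all(max(abs(wi - ci) for wi, ci in zip(w, c)) >= r
               for c in tabu_centers):
            return w

# cancer1 settings from section 4.2: n = 109 dimensions, R = 0.5, m = 1000.
r = tabu_radius(R=0.5, n=109, m=1000)   # roughly 0.47
w_init = draw_initial_weights(n=109, R=0.5, tabu_centers=[], r=r)
```

Note how weakly $r$ shrinks with $m$ in high dimension: with $n = 109$, even 1000 tabu hypercubes each retain nearly the full half-range of 0.5.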
The methodology of the proposed global optimization method is summarized in Figure 3.

4 Computational Results

All algorithms were coded in the Java programming language and tested on two documented classification tasks involving the cancer1 and diabetes1 natural data sets taken from the UCI repository of machine learning databases (Prechelt, 1994). The results that are specifically commented on in this section have been found to be statistically significant by the nonparametric Mann-Whitney U test for unpaired data taken at the 5% significance level.

4.1 Local Optimization. The impact of the new heuristics POWER and TRIG on the behavior of RPROP is analyzed by using the best-performing
version of RPROP taken from Igel and Hüsken (2000): iRPROP−. Performances of QuickProp, Delta-Bar-Delta, Super SAB, iRPROP−, iRPROP− with POWER, and iRPROP− with TRIG & POWER are compared when solving the cancer1 and diabetes1 classification tasks. The parameter settings of the local optimization methods are:

- QuickProp: $\epsilon = 0.001$, $\nu = 1.75$
- Delta-Bar-Delta: $\epsilon_{ij}(0) = 0.01$, $\mu = 0.5$, $\kappa = 0.01$, $\eta^- = 0.5$, $\Delta_{max} = 50$
- Super SAB: $\epsilon_{ij}(0) = 0.01$, $\mu = 0.5$, $\eta^+ = 1.2$, $\eta^- = 0.5$, $\Delta_{max} = 50$
- iRPROP−: $\Delta_{ij}(0) = 0.0125$, $\eta^+ = 1.2$, $\eta^- = 0.5$, $\Delta_{max} = 50$
- iRPROP− with POWER: $\delta_{ij}(0) = 10^{-10}$, $\eta^+ = 1.05$, $\eta^- = 0.5$, $\Delta_{max} = 1$
- iRPROP− with TRIG & POWER: $\delta_{ij}(0) = 10^{-10}$, $\eta^+ = 1.05$, $\Delta_{max} = 1$
where those for iRPROP− are taken from Igel and Hüsken (2000), while the remainder have been found empirically. The experimental setup follows that of Prechelt (1994). The architectures used for each task incorporate linear output units and hidden units with sigmoidal outputs of the following form:
$$y_i = \frac{net_i}{1 + |net_i|}, \qquad (4.1)$$
whose output is in the range $[-0.5, 0.5]$. The cost function, equation 1.1, is a normalized form of sum-of-squares error where $E_p$ is defined as
$$E_p = \frac{y_{max} - y_{min}}{PN} \sum_{n \in N} (t^p_n - y^p_n)^2, \qquad (4.2)$$
where $N$ is the set of output units and the targets $t^p_n$ of outputs $y^p_n$ are from the range $[y_{min}, y_{max}]$.

Each local optimization method is subject to 100 trials, attempting to correctly classify as many of the examples as possible over the 3000 epochs of each trial. The speed of convergence in each trial is measured by recording the training epoch resulting in the lowest validation error. This is a standard early stopping criterion to ensure generalization of the learning and, hence, good neural network performance for examples other than those used for training. The accuracy of each trial is gauged by the percentage of examples correctly classified according to a 40%-20%-40% criterion due to Fahlman (1988). That is, an example is correctly classified if each target output of $-1$ corresponds to an actual output in the lower 40% of the output range and each target output of $1$ corresponds to an actual output in the upper 40% of the output range. This criterion is used rather than some reciprocal of the normalized sum-of-squares error, equation 4.2, as it provides a better measure of how close a learning algorithm is to achieving perfect classification.
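The 40%-20%-40% criterion can be sketched as a small predicate. The default output range $[-1, 1]$ below is an illustrative assumption matching the $\pm 1$ targets; the article's networks use the range $[y_{min}, y_{max}]$, which the function takes as parameters.

```python
def correctly_classified(outputs, targets, y_min=-1.0, y_max=1.0):
    """Fahlman's 40%-20%-40% criterion: an example counts as correct only
    if every output lands in the 40% band of the range nearest its target,
    leaving the middle 20% of the range as a rejection band."""
    span = y_max - y_min
    lower_cut = y_min + 0.4 * span   # top of the lower 40% band
    upper_cut = y_max - 0.4 * span   # bottom of the upper 40% band
    for y, t in zip(outputs, targets):
        if t == -1 and not y <= lower_cut:
            return False
        if t == 1 and not y >= upper_cut:
            return False
    return True

# Target (1, -1): both outputs must clear their respective outer bands.
ok = correctly_classified([0.8, -0.7], [1, -1])        # both in outer bands
ambiguous = correctly_classified([0.1, -0.7], [1, -1])  # first in middle 20%
```

The middle band is what makes this stricter than a simple sign test: an output of 0.1 has the right sign for a target of 1 but is rejected as ambiguous.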
4.1.1 Breast Cancer Classification Task. This task, labeled as cancer1 in Prechelt (1994) and used by Igel and Hüsken (2000) to compare iRPROP+ and iRPROP− with RPROP+ and RPROP−, is to decide whether a tumor is benign or malignant on the basis of cell data collated by microscopic examination (e.g., clump thickness, uniformity of cell size, cell shape, amount of marginal adhesion, frequency of bare nuclei). The associated data set, courtesy of William H. Wolberg at the University of Wisconsin Hospitals, consists of nine continuous input attributes, two output attributes (with a target of $\{1, -1\}$ for each malignant tumor and $\{-1, 1\}$ for each benign tumor), and 699 patients (65.5% of whom have benign tumors). As suggested by Prechelt (1994), the data set is sequentially split in a 50%-25%-25% ratio, giving 349 training, 175 validation, and 175 testing examples, respectively. Moreover, the 9-4-2-2 neural network topology (including all shortcut connections) recommended by Prechelt (1994) is used. Finally, the initial weights $w_{ij}(0)$ are chosen from a weight space that is a 109-dimensional hypercube, the range of each dimension being $[-0.5, 0.5]$.

The results are summarized in Figures 4 and 5. Figure 4 shows the mean percentage of correctly classified examples accompanied by error bars indicating standard deviation. It is evident that both QuickProp and Delta-Bar-Delta are less accurate than the other learning algorithms, while Super SAB compares favorably with iRPROP−. However, the addition of the POWER heuristic improves iRPROP−, enabling perfect training (since all training examples are correctly classified) without compromising the ability of the neural network to generalize (since there is no significant drop in the percentage of validation examples correctly classified). Furthermore, the inclusion of the TRIG heuristic does not severely impinge on the accuracy achieved by iRPROP− with POWER.
Figure 5 shows the mean number of training epochs required to minimize the validation error. It is clear that QuickProp, Delta-Bar-Delta, and Super SAB have slower convergence than the other learning algorithms. A more crucial point is that there is no significant effect on the convergence speed of iRPROP− when POWER and TRIG are added.

4.1.2 Diabetes Classification Task. This task, labeled as diabetes1 in Prechelt (1994) and again used by Igel and Hüsken (2000), is to decide whether a Pima Indian individual is diabetic given personal data (e.g., age, number of pregnancies) and medical data (e.g., blood pressure, body mass index, glucose tolerance). The data set consists of eight continuous input attributes, two output attributes (with a target of $\{1, -1\}$ for each diabetic individual and $\{-1, 1\}$ otherwise), and 768 individuals (34.9% of whom are diabetic). Once more, the data set is split in a 50%-25%-25% ratio to give 384 training, 192 validation, and 192 testing examples, respectively. Also, the 8-2-2-2 neural network topology (including all shortcut connections) proposed by Prechelt (1994) is used. Finally, the initial weights are chosen from a weight
Figure 4: Accuracy of local optimization algorithms solving the cancer1 task.
Figure 5: Speed of local optimization algorithms solving the cancer1 task.
Figure 6: Accuracy of local optimization algorithms solving the diabetes1 task.
space that is a 74-dimensional hypercube, the range of each dimension once again being $[-0.5, 0.5]$.

The results are collated in Figures 6 and 7. Figure 6 shows once again that both QuickProp and Delta-Bar-Delta are less accurate than the other learning algorithms and that Super SAB compares well with iRPROP−. Again, the training accuracy of iRPROP− is improved by the inclusion of POWER and TRIG. Figure 7 indicates that the inclusion of POWER and TRIG substantially improves the convergence speed of iRPROP−.

Figure 7: Speed of local optimization algorithms solving the diabetes1 task.

4.2 Global Optimization. The impact of using Local Search in the global optimization method is analyzed by comparing the performances of Enhanced iRPROP− (that is, iRPROP− with TRIG & POWER) with Enhanced iRPROP− & Local Search when solving problems based on the cancer1 and diabetes1 data sets. For both tasks, the parameter settings for Enhanced iRPROP− are $\delta_{ij}(0) = 10^{-10}$, $\eta^+ = 1.05$, and $\Delta_{max} = 1$. The data sets are again divided using a 50%-25%-25% ratio, and the same architectures are used as in the local optimization experiments: a 9-4-2-2 architecture for the cancer1 task and an 8-2-2-2 architecture for the diabetes1 task. Both Enhanced iRPROP− and Enhanced iRPROP− & Local Search are subject to 1000 trials. For each trial of the hybrid, control switches to Local Search when the stagnation criteria are satisfied. Furthermore, the current trial finishes if the termination criteria are satisfied. For both tasks, the stagnation and termination criteria are given by Prechelt (1994), and both are satisfied if either the generalization loss (see equation 3.6) exceeds $(GL_j)_{threshold} = 0.05$ where $j = 8$, or the training progress (see equation 3.7) falls below $(TP_k)_{threshold} = 10^{-5}$ where $k = 5$. For each trial, the percentages of correctly classified examples are recorded, along with the epoch at which termination occurs. In addition, a log is kept of the percentages of correctly classified examples for the best of the 1000 trials (the trial with lowest validation error). Considering the number of weights and biases in the respective architectures, the initial weight space for the cancer1 task is a 109-dimensional hypercube and for the diabetes1 task a 74-dimensional hypercube. Since each dimension is in the range $[-0.5, 0.5]$, it is clear that $R = 0.5$, and letting $m = 1000$ (the number of trials) in equation 3.11 allows the size of tabu neighborhood hypercubes to be defined by $r = 0.5/\sqrt[109]{1000}$ for the cancer1 task and $r = 0.5/\sqrt[74]{1000}$ for the diabetes1 task. The results are summarized in Table 1.

Table 1 indicates that for both the cancer1 and diabetes1 tasks, the mean percentages of correctly classified examples are noticeably higher when Local Search is used. Therefore, the addition of Local Search is seen to improve the accuracy of each trial, without causing a significant increase in the mean number of epochs per trial. The relatively low percentages for the diabetes1 task are due to initial solutions in flat regions of the error function, causing
Table 1: Accuracy and Speed of Global Optimization Algorithms.

                                    Breast Cancer Task                     Diabetes Task
Performance Measures          Enhanced      Enhanced iRPROP-        Enhanced      Enhanced iRPROP-
                              iRPROP-       & Local Search          iRPROP-       & Local Search

Correctly classified examples (% to 4 significant figures)
Training     Mean             72.85         76.10                   34.84         45.08
             SD               34.49         29.83                   19.92         19.84
             Best             91.98         97.71                   60.68         67.71
Validation   Mean             72.32         80.52                   35.91         50.26
             SD               36.69         27.60                   21.18         22.69
             Best             97.14         97.71                   54.17         71.35
Testing      Mean             73.88         82.99                   34.01         44.10
             SD               38.54         28.12                   19.54         39.34
             Best             98.86         98.29                   54.69         64.06
All          Mean             72.97         78.93                   34.90         46.13
             SD               35.94         28.57                   19.97         20.38
             Best             94.99         97.85                   57.56         69.52

Number of epochs to convergence (to nearest epoch)
Per trial    Mean             93            104                     175           180
             SD               27            31                      360           395

Note: The figures in boldface type are the best results found by the hybrids.
early stagnation of Enhanced iRPROP−, though this is tempered by use of Local Search. Finally, the best percentages of correctly classified examples for both tasks are mostly higher when Local Search is used.
5 Conclusion

Two proposed enhancements of the standard RPROP algorithm have been incorporated into iRPROP− and tested using classification tasks based on the cancer1 and diabetes1 natural data sets taken from the UCI repository of machine learning databases. The POWER heuristic improves the accuracy and the convergence speed of iRPROP− in solving both classification tasks without much increase in algorithm complexity, and the TRIG heuristic eliminates a fixed parameter of iRPROP− without impinging on performance. Moreover, hybridizing Enhanced iRPROP− with Local Search leads to further advancements in accuracy when incorporated as part of the proposed global optimization method. Together, the new local and global optimization methods provide an efficient supervised learning approach for training neural networks to solve real-world problems.
Acknowledgments

We thank both referees for their constructive suggestions, which improved the content and the presentation of this article, and also EPSRC for sponsoring R.S.

References

Aarts, E., & Lenstra, J. (1997). Local search in combinatorial optimization. New York: Wiley.
Fahlman, S. (1988). An empirical study of learning speed in back-propagation networks (Tech. Rep. No. CMU-CS-88-162). Pittsburgh, PA: Carnegie Mellon University.
Igel, C., & Hüsken, M. (2000). Improving the Rprop learning algorithm. In Proceedings of the Second International Symposium on Neural Computation, NC2000 (pp. 115–121). Berlin: ICSC Academic Press.
Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.
Kirkpatrick, S., Gelatt, C., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Moller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
Prechelt, L. (1994). PROBEN1—A set of benchmarks and benchmarking rules for neural network training algorithms (Tech. Rep. No. 21/94). Karlsruhe: Fakultät für Informatik, Universität Karlsruhe.
Riedmiller, M. (1994). Advanced supervised learning in multi-layer perceptrons—from backpropagation to adaptive learning algorithms. Computer Standards and Interfaces, 16, 265–278.
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks (ICNN) (Vol. 16, pp. 586–591). Piscataway, NJ: IEEE.
Rumelhart, D., & McClelland, J. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Salhi, S., & Queen, N. (in press). Determining local and global minima for functions with multiple minima. European Journal of Operational Research.
Schiffman, W., Joost, M., & Werner, R. (1993). Optimization of the backpropagation algorithm for training multilayer perceptrons (Tech. Rep.). Koblenz: University of Koblenz, Institute of Physics.
Silva, F., & Almeida, L. (1990). Advanced neural computers. Amsterdam: North-Holland.
Tollenaere, T. (1990). SuperSAB: Fast adaptive backpropagation with good scaling properties. Neural Networks, 3, 561–573.

Received January 2, 2003; accepted June 5, 2003.
Communicated by Kenji Fukumizu
LETTER
Divergence Function, Duality, and Convex Analysis Jun Zhang [email protected] Department of Psychology, University of Michigan, Ann Arbor, MI 48109, U.S.A.
From a smooth, strictly convex function Φ: ℝⁿ → ℝ, a parametric family of divergence functions D_Φ^(α) may be introduced:

\[
D_\Phi^{(\alpha)}(x, y) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right)
\]

for x, y ∈ int dom(Φ) ⊂ ℝⁿ and for α ∈ ℝ, with D_Φ^(±1) defined through taking the limit of α. Each member is shown to induce an α-independent Riemannian metric, as well as a pair of dual ±α-connections, which are generally nonflat, except for α = ±1. In the latter case, D_Φ^(±1) reduces to the (nonparametric) Bregman divergence, which is representable using Φ and its convex conjugate Φ* and becomes the canonical divergence for dually flat spaces (Amari, 1982, 1985; Amari & Nagaoka, 2000). This formulation based on convex analysis naturally extends the information-geometric interpretation of divergence functions (Eguchi, 1983) to allow the distinction between two different kinds of duality: referential duality (α ↔ −α) and representational duality (Φ ↔ Φ*). When applied to (not necessarily normalized) probability densities, the concept of conjugated representations of densities is introduced, so that ±α-connections defined on probability densities embody both referential and representational duality and are hence themselves bidual. When restricted to a finite-dimensional affine submanifold, the natural parameters of a certain representation of densities and the expectation parameters under its conjugate representation form biorthogonal coordinates. The alpha representation (indexed by β now, β ∈ [−1, 1]) is shown to be the only measure-invariant representation. The resulting two-parameter family of divergence functionals D^(α,β), (α, β) ∈ [−1, 1] × [−1, 1], induces identical Fisher information but bidual alpha-connection pairs; it reduces in form to Amari's alpha-divergence family when α = ±1 or when β = 1, but to the family of Jensen differences (Rao, 1987) when β = −1.
Neural Computation 16, 159–195 (2004)
© 2003 Massachusetts Institute of Technology
1 Introduction

Divergence functions play an important role in many areas of neural computation, such as learning, optimization, estimation, and inference. Roughly, they measure the directed (asymmetric) difference of two probability density functions in an infinite-dimensional functional space, or of two points in a finite-dimensional vector space that defines the parameters of a statistical model. An example is the Kullback-Leibler (KL) divergence (cross-entropy) between two probability densities p and q, here expressed in its extended form (i.e., without requiring p, q to be normalized),

\[
K(p, q) = \int \left( q - p - p \log\frac{q}{p} \right) d\mu = K^*(q, p), \tag{1.1}
\]
which reaches its unique global minimum value of zero on the diagonal of the product manifold (i.e., p = q). Many learning algorithms and/or proofs of their convergence rely on properties of the KL divergence; common examples are the Boltzmann machine (Ackley, Hinton, & Sejnowski, 1985; Amari, 1991; Amari, Kurata, & Nagaoka, 1992), the em algorithm and its comparison with the EM algorithm (Amari, 1995), and the wake-sleep algorithm of the Helmholtz machine (Ikeda, Amari, & Nakahara, 1999). Another class of divergence functions, widely used in the optimization and convex programming literature, is the so-called Bregman divergence (see below). It plays an essential role in unifying the class of projection and alternating minimization algorithms (Lafferty, Della Pietra, & Della Pietra, 1997; Della Pietra, Della Pietra, & Lafferty, 2002; Bauschke & Combettes, 2002; Bauschke, Borwein, & Combettes, 2002). Parametric families of Bregman divergence have been used in blind source separation (Mihoko & Eguchi, 2002) and for boosting machines (Lebanon & Lafferty, 2002; Eguchi, 2002). The divergence function or functional¹ is an essential subject in information geometry, the differential geometric study of the manifold of (parametric or nonparametric) probability distributions (Amari, 1982, 1985; Eguchi, 1983, 1992; Amari & Nagaoka, 2000).

¹ Strictly speaking, when the underlying space is a finite-dimensional vector space, for example, that of parameters for a statistical or neural network model, then the term function is appropriate. However, if the underlying space is an infinite-dimensional function space, for example, that of nonparametric probability densities, then the term functional ought to be used. The latter implicitly defines a divergence function (through pullback) if the probability densities are embedded into a finite-dimensional space as a statistical model.

As first demonstrated by Eguchi (1983), a well-defined divergence function (also called a contrast function) with vanishing first-order term (in the vicinity of p = q) will induce a Riemannian metric g by its second-order properties and a pair of dual (also called conjugate) connections (Γ, Γ*) by its third-order properties, where the dual connections jointly preserve the metric under parallel transport. A manifold
endowed with {g, Γ, Γ*} is known as a statistical manifold; conditions for its affine realization through its embedding into a higher-dimensional space have been clarified (Kurose, 1994; Matsuzoe, 1998, 1999; Uohashi, Ohara, & Fujii, 2000).

1.1 Alpha-, Bregman, and Csiszár's f-Divergence and Their Relations. Amari (1982, 1985) introduced and investigated an important parametric family of divergence functionals, called the α-divergence:²

\[
A^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\int\left(\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q - p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\right) d\mu, \qquad \alpha \in \mathbb{R}. \tag{1.2}
\]
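As a numerical aside (an illustrative sketch, not part of the original text), equation 1.2 can be checked directly for discrete positive measures, with a sum standing in for ∫ dμ; the function names below are assumptions:

```python
import numpy as np

def alpha_div(p, q, alpha):
    """Extended alpha-divergence A^(alpha)(p, q) of equation 1.2 for
    discrete positive (not necessarily normalized) measures; alpha != +-1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return float(4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])

# Extended KL divergence K(p, q) of equation 1.1:
kl = float(np.sum(q - p - p * np.log(q / p)))

# A^(alpha) -> K(p, q) as alpha -> -1, and A^(alpha)(q, p) -> K(p, q) as alpha -> +1:
assert abs(alpha_div(p, q, -1 + 1e-6) - kl) < 1e-4
assert abs(alpha_div(q, p, 1 - 1e-6) - kl) < 1e-4
assert alpha_div(p, q, 0.5) > 0
```

The two limit checks mirror the specialization of A^(α) to K and K* described in the text.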
The α-divergence, which specializes to K(p, q) for α = −1 and to K*(p, q) for α = 1 (by taking the limit of α), induces on the statistical manifold the family of α-connections (Chentsov, 1982; Amari, 1982). The duality between α ↔ −α is reflected in the facts that the ±α-connections form a pair of dual connections that jointly preserve the metric and that an α-connection is flat if and only if the (−α)-connection is flat (Amari, 1985; Lauritzen, 1987). As a special case, the exponential family (α = 1) and the mixture family (α = −1) of densities generate dually flat statistical manifolds. The alpha divergence is a special case of the measure-invariant f-divergence introduced by Csiszár (1967), which is associated with any convex function f_c: ℝ⁺ → ℝ̄⁺ satisfying f_c(1) = 0, f_c′(1) = 0:

\[
F_{f_c}(p, q) = \int p\, f_c\!\left(\frac{q}{p}\right) d\mu, \tag{1.3}
\]

where ℝ̄⁺ ≡ ℝ⁺ ∪ {0}. This can be seen by taking for f_c the following family of convex functions,³ indexed by a parameter α:

\[
f^{(\alpha)}(t) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\,t - t^{\frac{1+\alpha}{2}}\right), \qquad \alpha \in \mathbb{R}. \tag{1.4}
\]
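As a quick numerical sanity check (an illustrative sketch, not from the article), plugging f^(α) of equation 1.4 into equation 1.3 reproduces the extended α-divergence of equation 1.2:

```python
import numpy as np

def f_alpha(t, alpha):
    """Convex integrand f^(alpha)(t) of equation 1.4 (alpha != +-1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * (a + b * t - t**b)

def csiszar_f_div(p, q, alpha):
    """F_f(p, q) = sum_i p_i f(q_i / p_i) with f = f^(alpha) (equation 1.3)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f_alpha(q / p, alpha)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
alpha = 0.3
a, b = (1 - alpha) / 2, (1 + alpha) / 2

# Direct evaluation of the alpha-divergence, equation 1.2:
a_div = float(4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b))

assert abs(f_alpha(1.0, alpha)) < 1e-12        # the condition f^(alpha)(1) = 0
assert abs(csiszar_f_div(p, q, alpha) - a_div) < 1e-12
```

The second assertion uses the identity p · (q/p)^b = p^{(1−α)/2} q^{(1+α)/2}, which is exactly how the f-divergence form absorbs into equation 1.2.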
² This form of the α-divergence first appeared in Zhu and Rohwer (1995, 1997), where it was called the δ-divergence, with δ = (1 − α)/2. The term (1−α)/2 p + (1+α)/2 q in equation 1.2, after integration, is simply 1 for normalized densities; this was how Amari (1982, 1985) introduced the α-divergence as a specific family of Csiszár's f-divergence. See note 3.
³ Note that this form differs slightly from the function given in Amari (1985) and Amari and Nagaoka (2000) by the additional term (1−α)/2 + (1+α)/2 t. This term is needed for the form of the α-divergence expressed in equation 1.2, which is "extended" from the original definition given in Amari (1982, 1985) to allow denormalized densities, in much the same way that the extended Kullback-Leibler divergence (see equation 1.1) differs from its original form (without the p − q or q − p term). This additional term in f^(α) allows the condition f^(α)(1) = 0 to be satisfied. It also provides a unified treatment of the α = ±1 case, since lim_{α→1} f^(α)(t) = 1 − t + t log t and lim_{α→−1} f^(α)(t) = t − 1 − log t.
Eguchi (1983) showed that any f-divergence induces a statistical manifold with a metric proportional to the Fisher information, with constant of proportionality f_c″(1), and an equivalent α-connection with

\[
\alpha = 3 + 2\,\frac{f_c'''(1)}{f_c''(1)}. \tag{1.5}
\]
We note in passing that for a general, smooth, and strictly convex function f: ℝ → ℝ, we can always induce a measure-invariant divergence by using f_c(t) = g(t) in equation 1.3, where

\[
g(t) \equiv f(t) - f(1) - f'(1)(t - 1). \tag{1.6}
\]
That the right-hand side of the above is nonnegative can easily be proved by showing that t = 1 is a global minimum, with g(1) = g′(1) = 0. Another kind of divergence function, called the Bregman divergence, is defined for any two points x = [x¹, …, xⁿ], y = [y¹, …, yⁿ] in an n-dimensional vector space ℝⁿ (Bregman, 1967). It is induced by a smooth and strictly convex function Φ: ℝⁿ → ℝ:

\[
B_\Phi(x, y) = \Phi(y) - \Phi(x) - \langle y - x, \nabla\Phi(x) \rangle, \tag{1.7}
\]
where ∇ is the gradient operator (or, more strictly, the subdifferential ∂Φ if the differentiability condition is removed) and ⟨·, ·⟩ denotes the standard inner product of two vectors. It is also called (actually, is proportional to) the geometric divergence (Kurose, 1994), proposed in the context of affine realization of a statistical manifold. The Bregman divergence B_Φ(x, y) specializes to the KL divergence upon setting Φ(x) = Σᵢ e^{xᵢ}, introducing the new variables xᵢ = log pᵢ, yᵢ = log qᵢ, and changing ∫ dμ into Σᵢ. More generally, as observed by Eguchi (2002), Csiszár's f-divergence (see equation 1.3) is naturally related (but not identical) to the Bregman divergence (see equation 1.7), having taken Φ(x) = Σᵢ f(xᵢ) with yᵢ = qᵢ/pᵢ and xᵢ = 1. In this case (with a slight abuse of notation),

\[
F_f(p, q) = \sum_i p_i\, B_f\!\left(\frac{q_i}{p_i},\, 1\right).
\]

It is now known (Kass & Vos, 1997) that the Bregman divergence is essentially the canonical divergence (Amari & Nagaoka, 2000) on a dually flat manifold equipped with a pair of biorthogonal coordinates induced from a pair of "potential functions" under the Legendre transform (Amari, 1982, 1985). It is a distance-like measure on a finite-dimensional Riemannian manifold that is essentially flat, and it is very different from the α-divergence (see equation 1.2), which is defined over the space of positively valued, infinite-dimensional functions on sample space (i.e., positive measures) and is generally nonflat. However, if the positive measures p and q can be affinely embedded
into some finite-dimensional submanifold, the Legendre potentials for the α-divergence could exist. Technically, this corresponds to the so-called α-affine manifold, where the embedded α-representation of the densities (α ∈ ℝ),

\[
l^{(\alpha)}(p) = \begin{cases} \log p & \alpha = 1 \\ \dfrac{2}{1-\alpha}\, p^{(1-\alpha)/2} & \text{else}, \end{cases} \tag{1.8}
\]

can be expressed as a linear combination of a countable set of basis functions of the infinite-dimensional functional space (the definition of α-affinity). If and only if such an embedding is possible for a certain value of α, a potential function (and its dual) can be found so that equation 1.2 becomes 1.7. In general, the Bregman divergence and the α-divergence are very different in terms of both the dimensionality and the flatness of the underlying manifold on which they are defined, though both may induce dual connections. Given the fundamental importance of α-connections in information geometry, it is natural to ask whether the parameter α may arise other than from the context of α-embedding of density functions. How is the α ↔ −α duality related to the pair of biorthogonal coordinates and the canonical divergence they define? Does there exist an even more general expression of divergence functions that would include the α-divergence, the Bregman divergence, and the f-divergence as special cases, yet would still induce the dual ±α-connections? The existence of a divergence function on a statistical manifold, given the Riemannian metric and a pair of dual, torsion-free connections, was answered positively by Matumoto (1993). Here, the interest is to find explicit forms for such general divergence functions, in particular, measure-invariant ones. The goal of this article is to introduce a unifying perspective for the α-, Bregman, and Csiszár's f-divergence as arising from certain fundamental convex inequalities and to clarify two different senses of duality embodied by divergence functions and functionals and the statistical manifolds they define: referential duality and representational duality.
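The Bregman divergence of equation 1.7 and its KL specialization noted above are straightforward to spot-check in code (an illustrative sketch; phi and the helper names are assumptions, not from the text):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_Phi(x, y) = Phi(y) - Phi(x) - <y - x, grad Phi(x)>  (equation 1.7)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(phi(y) - phi(x) - (y - x) @ grad_phi(x))

# Phi(x) = sum_i exp(x_i) with x_i = log p_i, y_i = log q_i turns B_Phi
# into the extended KL divergence K(p, q) of equation 1.1:
phi = lambda v: float(np.sum(np.exp(v)))
grad_phi = lambda v: np.exp(v)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
kl = float(np.sum(q - p - p * np.log(q / p)))

assert abs(bregman(phi, grad_phi, np.log(p), np.log(q)) - kl) < 1e-12
assert bregman(phi, grad_phi, np.log(p), np.log(p)) == 0.0
```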
Starting from the definition of a smooth, strictly convex function Φ: ℝⁿ → ℝ, a parametric family of divergences D_Φ^(α)(x, y), α ∈ ℝ, over points x, y ∈ S = int dom(Φ) can be introduced that will be shown to induce a single Riemannian metric with a parametric family of affine connections indexed by α, the convex mixture parameter. These α-connections are nonflat unless α = ±1, when D_Φ^(±1) turns out to be the Bregman divergence, which can be cast into the canonical form using a pair of convex conjugate functions Φ and Φ* (Amari's potential functions) that obey Legendre-Fenchel duality. The biorthogonal coordinates x and u, originally introduced for a dually flat manifold, can now be extended to define the divergence function on any nonflat manifold (α ≠ ±1) as well. A distinction is drawn between two kinds of duality ("biduality") of statistical manifolds: in the sense of mutual references x ↔ y, as reflected by the α ↔ −α duality, and in the sense of conjugate representations u = ∇Φ(x) ↔ x = (∇Φ*)(u) = (∇Φ)⁻¹(u),
as reflected by the Φ ↔ Φ* duality. In the infinite-dimensional case, representational duality is achieved through conjugate representations of any (not necessarily normalized) density function; here, conjugacy is with respect to a strictly convex function defined on the real line, f: ℝ → ℝ. Our notion of conjugate representations of density functions, which generalizes the notion of the alpha representation (see equation 1.8), proves to be useful in characterizing the affine embedding of a density function into a finite-dimensional submanifold; the natural and expectation parameters become the pair of biorthogonal coordinates, and this case completely reduces to the one discussed earlier. Of particular practical importance is that our analysis provides a two-parameter family of measure-invariant divergence function(al)s D^(α,β)(p, q) under the alpha representation l^(β)(p) of densities (indexed by β now, β ∈ [−1, 1], with α reserved to index the convex mixture), which induce an identical Fisher information metric and a family of alpha connections where the product αβ serves as the "alpha" parameter. The two indices themselves, (α, β) ∈ [−1, 1] × [−1, 1], precisely express referential duality and representational duality, respectively. Interestingly, at the level of divergence functionals, D^(α,β) turns out to be the family of alpha divergence for α = ±1 or for β = 1, and the family of "Jensen differences" (Rao, 1987) for β = −1. Thus, the Kullback-Leibler divergence, the one-parameter families of α-divergence and of Jensen difference, and the two-parameter family of (α, β)-divergence form nested families of measure-invariant divergence function(al)s associated with the same statistical manifold studied in classical information geometry.

2 Divergence on Finite-Dimensional Parameter Space

In this section, we consider the finite-dimensional vector space ℝⁿ, or a convex subset thereof, that defines the parameter of a probabilistic (e.g., neural network) model.
The goal is to introduce, with the help of an arbitrary strictly convex function Φ: ℝⁿ → ℝ, a family of asymmetric, nonnegative measures between two points in such a space, called divergence functions (see proposition 1), and, through them, to induce a well-defined statistical manifold with a Riemannian metric and a pair of dual connections (see proposition 2). The procedure used for linking a divergence function(al) to the underlying statistical manifold is due to Eguchi (1983); our notion of referential duality is reflected in the construction of dual connections. The notion of representational duality is introduced through equation 2.15 and proposition 5, based on the convex conjugacy operation (see remark 2.3.1). Biduality is thus shown to be the fundamental property of a statistical manifold induced by the family of divergence functions based on a convex function Φ.

2.1 Convexity and Divergence Functions. Consider the n-dimensional vector space (e.g., the parameter space in the case of parametric probability
density functions or neural network models). A set S ⊆ ℝⁿ is called convex if for any two points x = [x¹, …, xⁿ] ∈ S, y = [y¹, …, yⁿ] ∈ S, and any real number α ∈ [−1, 1], the convex mixture

\[
\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y \in S;
\]

that is, the line segment connecting any two points x and y belongs to the set S. A strictly convex function of several variables Φ(x) is a function defined on a nonempty convex set S = int dom(Φ) ⊆ ℝⁿ such that for any two points x ∈ S, y ∈ S and any real number α ∈ (−1, 1), the following,

\[
\Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right) \le \frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y), \tag{2.1}
\]
is valid, with equality holding only when x = y. Equation 2.1 will sometimes be referred to below as the fundamental convex inequality. Intuitively, the difference between the left-hand side and the right-hand side of equation 2.1 depends on some kind of proximity between the two points x and y in question, as well as on the degree of convexity (loosely speaking) of the function Φ. For convenience, Φ is assumed to be differentiable up to third order, though this condition can be slightly relaxed to the class of so-called essentially smooth and essentially strictly convex functions, or convex functions of the Legendre type (Rockafellar, 1970), without affecting much of the subsequent analysis. Note that for α = ±1, the equality in equation 2.1 holds uniformly for all x, y; for α ≠ ±1, the equality will not hold uniformly unless Φ(x) is the linear function ⟨a, x⟩ + b, with a a constant vector and b a scalar.

Proposition 1. For any smooth, strictly convex function Φ: ℝⁿ → ℝ, x ↦ Φ(x), and any α ∈ ℝ, the function

\[
D_\Phi^{(\alpha)}(x, y) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right) \tag{2.2}
\]

with

\[
D_\Phi^{(\pm 1)}(x, y) = \lim_{\alpha \to \pm 1} D_\Phi^{(\alpha)}(x, y) \tag{2.3}
\]

is a parametric family of nonnegative functions of x, y that equal zero if and only if x = y. Here, the points x, y, and z = (1−α)/2 x + (1+α)/2 y are all assumed to belong to S = int dom(Φ).
Proof. Clearly, for any α ∈ (−1, 1), 1 − α² > 0, so from equation 2.1 the functions satisfy D_Φ^(α)(x, y) ≥ 0 for all x, y ∈ S, with equality holding if and only if x = y. When α > 1, we rewrite y = 2/(α+1) z + (α−1)/(α+1) x as a convex mixture of z and x (i.e., 2/(α+1) = (1−α′)/2 and (α−1)/(α+1) = (1+α′)/2 with α′ ∈ (−1, 1)). Strict convexity of Φ guarantees

\[
\frac{2}{\alpha+1}\,\Phi(z) + \frac{\alpha-1}{\alpha+1}\,\Phi(x) \ge \Phi(y),
\]

or explicitly,

\[
\frac{2}{1+\alpha}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right) \le 0.
\]

This, along with 1 − α² < 0, proves the nonnegativity D_Φ^(α)(x, y) ≥ 0 for α > 1, with equality holding if and only if z = x, that is, x = y. The case of α < −1 is proved similarly by applying equation 2.1 to the three points y, z, and their convex mixture x = 2/(1−α) z + (−1−α)/(1−α) y. Finally, continuity of D_Φ^(α)(x, y) with respect to α guarantees that the above claim is also valid in the case of α = ±1. ∎

Remark 2.1.1. These functions are asymmetric, D_Φ^(α)(x, y) ≠ D_Φ^(α)(y, x) in general, but satisfy the dual relation

\[
D_\Phi^{(\alpha)}(x, y) = D_\Phi^{(-\alpha)}(y, x). \tag{2.4}
\]

Therefore, the D_Φ^(α), as parameterized by α ∈ ℝ for a fixed Φ, properly form a family of divergence functions (also known as deviations or contrast functions) in the sense of Eguchi (1983, 1992), Amari (1982, 1985), Kass & Vos (1997), and Amari & Nagaoka (2000).
Example 2.1.2. Take the negative of the entropy function, Φ(x) = Σᵢ xⁱ log xⁱ, which is easily seen to be convex. Then equation 2.2 becomes

\[
\frac{4}{1-\alpha^2}\sum_i \left( \frac{1-\alpha}{2}\,x^i \log\frac{x^i}{\frac{1-\alpha}{2}x^i + \frac{1+\alpha}{2}y^i} + \frac{1+\alpha}{2}\,y^i \log\frac{y^i}{\frac{1-\alpha}{2}x^i + \frac{1+\alpha}{2}y^i} \right) \equiv E^{(\alpha)}(x, y), \tag{2.5}
\]

a family of divergence measures called the Jensen difference (Rao, 1987), apart from the 4/(1−α²) factor. The Kullback-Leibler divergence, equation 1.1, is recovered by letting α → ±1 in E^(α)(x, y).
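The identification of E^(α) with D_Φ^(α) for Φ(x) = Σᵢ xⁱ log xⁱ, and its KL limit, can be confirmed numerically (an illustrative sketch, not from the text):

```python
import numpy as np

def jensen_diff(x, y, alpha):
    """E^(alpha)(x, y) of equation 2.5, including the 4/(1-alpha^2) factor."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    z = a * x + b * y
    return float(4 / (1 - alpha**2)
                 * np.sum(a * x * np.log(x / z) + b * y * np.log(y / z)))

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.3, 0.4])
alpha = 0.4
a, b = (1 - alpha) / 2, (1 + alpha) / 2

# Same value as D_Phi^(alpha) built directly from Phi(v) = sum v log v:
phi = lambda v: float(np.sum(v * np.log(v)))
direct = 4 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))
assert abs(jensen_diff(x, y, alpha) - direct) < 1e-12

# alpha -> -1 recovers the extended KL divergence K(y, x) of equation 1.1:
kl = float(np.sum(x - y - y * np.log(x / y)))
assert abs(jensen_diff(x, y, -1 + 1e-6) - kl) < 1e-4
```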
Example 2.1.3. Take Φ(x) = Σᵢ e^{xᵢ} while denoting xᵢ = log pᵢ, yᵢ = log qᵢ. Then D_Φ^(α)(x, y) becomes the α-divergence A^(α)(p, q) in its discrete version.

2.2 Statistical Manifold Induced by D_Φ^(α). The divergence function D_Φ^(α)(x, y) provides a quantitative measure of the asymmetric (directed) distance of a comparison point y as measured from a fixed reference point x. Although this function is globally defined for x, y at large, information geometry provides a standard technique, due to Eguchi (1983), to investigate the differential geometric structure induced on S from any divergence function, through taking lim_{x→x₀, y→x₀} D_Φ^(α)(x, y). The most important geometric objects on a differential manifold are the Riemannian metric g and the affine connection Γ. The metric tensor fixes the inner product operation on the manifold, whereas the affine connection establishes the affine correspondence among neighboring tangent spaces and defines covariant differentiation.
Proposition 2. The statistical manifold {S, g, Γ^(α), Γ^{*(α)}} associated with D_Φ^(α) is given (in component form) by

\[
g_{ij} = \Phi_{ij} \tag{2.6}
\]

and

\[
\Gamma^{(\alpha)}_{ij,k} = \frac{1-\alpha}{2}\,\Phi_{ijk}, \qquad \Gamma^{*(\alpha)}_{ij,k} = \frac{1+\alpha}{2}\,\Phi_{ijk}. \tag{2.7}
\]

Here, Φ_{ij} and Φ_{ijk} denote, respectively, the second and third partial derivatives of Φ(x):

\[
\Phi_{ij} = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j}, \qquad \Phi_{ijk} = \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}.
\]
Proof. Assuming Fréchet differentiability of Φ, we calculate the Taylor expansion of D_Φ^(α)(x, y) around the reference point x₀, in the direction ξ for the first variable (i.e., x = x₀ + ξ) and in the direction η for the second variable (i.e., y = x₀ + η), while renaming x₀ as x for clarity:⁴

\[
D_\Phi^{(\alpha)}(x+\xi, x+\eta) \simeq \frac{1}{2}\sum_{i,j}\Phi_{ij}\,(\xi^i - \eta^i)(\xi^j - \eta^j) + \frac{1}{6}\sum_{i,j,k}\Phi_{ijk}\left(\frac{3-\alpha}{2}\,\xi^i\xi^j\xi^k + \frac{3+\alpha}{2}\,\eta^i\eta^j\eta^k - \frac{3+3\alpha}{2}\,\eta^i\eta^j\xi^k - \frac{3-3\alpha}{2}\,\xi^i\xi^j\eta^k\right) + o(\xi^m\eta^l),
\]

⁴ We try to follow the conventions of tensor algebra for upper and lower indices, but do not invoke the Einstein summation convention, since many of the equalities are not in coordinate-invariant or tensorial form.
where the higher orders in the expansion (i.e., m + l ≥ 4) have been collected into o(·). Following Eguchi (1983), the metric tensor of the Riemannian geometry induced by D_Φ^(α) is

\[
g_{ij}(x) = -\left.\frac{\partial^2}{\partial \xi^i \partial \eta^j}\, D_\Phi^{(\alpha)}(x+\xi, x+\eta)\right|_{\xi=\eta=0}, \tag{2.8}
\]

whereas the pair of dual affine connections Γ and Γ* is

\[
\Gamma_{ij,k}(x) = -\left.\frac{\partial^3}{\partial \xi^i \partial \xi^j \partial \eta^k}\, D_\Phi^{(\alpha)}(x+\xi, x+\eta)\right|_{\xi=\eta=0}, \tag{2.9}
\]

\[
\Gamma^*_{ij,k}(x) = -\left.\frac{\partial^3}{\partial \eta^i \partial \eta^j \partial \xi^k}\, D_\Phi^{(\alpha)}(x+\xi, x+\eta)\right|_{\xi=\eta=0}. \tag{2.10}
\]
Carrying out the differentiation yields equations 2.6 and 2.7. ∎

Remark 2.2.1. Clearly, the metric tensor g_{ij}, which is symmetric and positive semidefinite due to the strict convexity of Φ, is actually independent of α, whereas the α-dependent affine connections are torsion free (since Γ^(α)_{ij,k} = Γ^(α)_{ji,k}) and satisfy the duality

\[
\Gamma^{*(\alpha)}_{ij,k} = \Gamma^{(-\alpha)}_{ij,k},
\]

in accordance with equation 2.4, the duality in the selection of the reference versus the comparison point in D_Φ^(α). Dual α-connections in the form of equation 2.7 were formally introduced and investigated in Lauritzen (1987). Here, the family of D_Φ^(α)-divergences is shown to induce these α-connections. Clearly, when α = 0, the connection Γ^(0) = Γ^{*(0)} ≡ Γ^{LC} is the self-dual, metric (Levi-Civita) connection, as direct verification shows that it satisfies

\[
\Gamma^{LC}_{ij,k} = \frac{1}{2}\left( \frac{\partial g_{ik}(x)}{\partial x^j} + \frac{\partial g_{kj}(x)}{\partial x^i} - \frac{\partial g_{ij}(x)}{\partial x^k} \right).
\]
The Levi-Civita connection and the other members of the α-connection family are related through

\[
\Gamma^{LC}_{ij,k} = \frac{1}{2}\left( \Gamma^{(\alpha)}_{ij,k} + \Gamma^{*(\alpha)}_{ij,k} \right).
\]
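The Eguchi construction used in the proof of proposition 2 can be verified by finite differences for a concrete Φ (a numerical sketch with illustrative names; a four-point central-difference stencil approximates the mixed partial in equation 2.8):

```python
import numpy as np

def D(phi, x, y, alpha):
    """D_Phi^(alpha) of equation 2.2."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))

phi = lambda v: float(np.sum(np.exp(v)))   # Phi(x) = sum_i exp(x_i)
x = np.array([0.1, -0.4])
alpha, h = 0.7, 1e-4

def metric(i, j):
    """-d^2 D(x+xi, x+eta)/(d xi^i d eta^j) at xi = eta = 0 (equation 2.8)."""
    ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
    return -(D(phi, x + ei, x + ej, alpha) - D(phi, x + ei, x - ej, alpha)
             - D(phi, x - ei, x + ej, alpha) + D(phi, x - ei, x - ej, alpha)) / (4 * h * h)

# Equation 2.6: the metric is the Hessian Phi_ij = diag(exp(x)), independent of alpha:
hessian = np.diag(np.exp(x))
for i in range(2):
    for j in range(2):
        assert abs(metric(i, j) - hessian[i, j]) < 1e-5
```

Repeating the loop with a different α leaves the estimated metric unchanged, which is exactly the α-independence claimed by equation 2.6.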
Note that the covariant form of the affine connection, Γ_{ij,k}, is related to its contravariant form Γ^k_{ij} through g_{ij}:

\[
\sum_k g_{kl}\,\Gamma^k_{ij} = \Gamma_{ij,l}
\]
(actually, Γ^k_{ij} is the more primitive quantity, since it does not involve the metric). The curvature or flatness of a connection is described by the Riemann-Christoffel curvature tensor,

\[
R^i_{j\mu\nu}(x) = \frac{\partial \Gamma^i_{\nu j}(x)}{\partial x^\mu} - \frac{\partial \Gamma^i_{\mu j}(x)}{\partial x^\nu} + \sum_k \Gamma^i_{\mu k}(x)\,\Gamma^k_{\nu j}(x) - \sum_k \Gamma^i_{\nu k}(x)\,\Gamma^k_{\mu j}(x),
\]
or equivalently by

\[
R_{ij\mu\nu} = \sum_l g_{il}\, R^l_{j\mu\nu}.
\]

It is antisymmetric under i ↔ j or under μ ↔ ν, and symmetric under (i, j) ↔ (μ, ν). Since the curvature R*_{ijμν} of the dual connection Γ* equals R_{ijμν} (Lauritzen, 1987),

\[
R^{(\alpha)}_{ij\mu\nu} = R^{*(\alpha)}_{ij\mu\nu} = R^{(-\alpha)}_{ij\mu\nu}.
\]
Proposition 3. The Riemann-Christoffel curvature tensor for the α-connection Γ^(α)_{ij,k} induced by D_Φ^(α) is

\[
R^{(\alpha)}_{ij\mu\nu} = \frac{1-\alpha^2}{4} \sum_{l,k} \left( \Phi_{il\nu}\,\Phi_{jk\mu} - \Phi_{il\mu}\,\Phi_{jk\nu} \right) \Phi^{lk}, \tag{2.11}
\]

where Φ^{ij} = g^{ij} is the matrix inverse of Φ_{ij}.

Proof. First, from its definition, R_{ijμν} can be recast into

\[
R_{ij\mu\nu} = \frac{\partial \Gamma_{\nu j,i}}{\partial x^\mu} - \frac{\partial \Gamma_{\mu j,i}}{\partial x^\nu} + \sum_k \left( \left( \Gamma_{\mu k,i} - \frac{\partial g_{ik}}{\partial x^\mu} \right)\Gamma^k_{\nu j} - \left( \Gamma_{\nu k,i} - \frac{\partial g_{ik}}{\partial x^\nu} \right)\Gamma^k_{\mu j} \right). \tag{2.12}
\]

Substituting in the expression for the α-connections (equation 2.7), the first two terms cancel, and the terms under Σ_k give rise to equation 2.11. ∎
Remark 2.2.2. The metric (see equation 2.6), the dual α-connections (see equation 2.7), and the curvature (see equation 2.11) in such forms first appeared in Amari (1985), where Φ(x) is the cumulant generating function of an exponential family. Here, the statistical manifold {S, g, Γ^(α), Γ^{*(α)}} is induced by a divergence function via the Eguchi relation, and Φ(x) can be any (smooth and strictly) convex function. The purpose of this proposition is to remind readers that for any convex Φ in general, the curvature is determined by both an α-dependent factor (1−α²)/4 and a Φ-related component, the latter depending on Φ_{ij} plus its derivatives and inverse. This leads to the following conformal property:

Corollary 1. If two smooth, strictly convex functions Φ(x) and Φ̂(x) are conformally related, that is, if there exists some positive function σ(x) > 0 such that Φ̂_{ij} = σΦ_{ij}, then the curvatures of their respective α-connections satisfy

\[
\hat{R}^{(\alpha)}_{ij\mu\nu} = \sigma\, R^{(\alpha)}_{ij\mu\nu}. \tag{2.13}
\]

Proof. Observe that

\[
\hat{\Phi}_{il\nu} = \sigma\,\Phi_{il\nu} + \sigma_i\,\Phi_{l\nu},
\]

where σ_i denotes ∂σ/∂x^i. Permuting the index set (i, l, ν) to (i, l, μ), to (j, k, μ), and to (j, k, ν) yields three other similar identities. Noting Φ̂^{ij} = σ⁻¹Φ^{ij}, direct substitution of these relations into the expression for R^(α)_{ijμν} in equation 2.11 yields equation 2.13. ∎

2.3 Dually Flat Statistical Manifold (α = ±1). When α = ±1, all components of the curvature tensor vanish, that is, R^(±1)_{ijμν} = 0. In this case, there exists a coordinate system under which either Γ^(1)_{ij,k} = 0 or Γ^(−1)_{ij,k} = 0. This is the well-studied dually flat statistical manifold (Amari, 1982, 1985; Amari & Nagaoka, 2000), under which a pair of global biorthogonal coordinates, related to each other through the Legendre transform with respect to Φ, can be found to cast the divergence function into its canonical form. The Riemannian manifold with metric tensor given by equation 2.6, along with the dually flat pair Γ^(1) and Γ^(−1), is known as the Hessian manifold (Shima, 1978; Shima & Yagi, 1997).
Proposition 4. When α → ±1, D_Φ^(α) reduces to the Bregman divergence (see equation 1.7):

\[
D_\Phi^{(-1)}(x, y) = D_\Phi^{(1)}(y, x) = B_\Phi(x, y),
\]
\[
D_\Phi^{(1)}(x, y) = D_\Phi^{(-1)}(y, x) = B_\Phi(y, x).
\]
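Proposition 4 can be spot-checked numerically (an illustrative sketch; the chosen Φ is an assumption, not from the text):

```python
import numpy as np

def D(phi, x, y, alpha):
    """D_Phi^(alpha) of equation 2.2 (alpha != +-1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))

def bregman(phi, grad, x, y):
    """B_Phi(x, y) of equation 1.7."""
    return phi(y) - phi(x) - (y - x) @ grad(x)

phi  = lambda v: float(np.sum(np.exp(v)))
grad = lambda v: np.exp(v)
x = np.array([0.2, -0.3])
y = np.array([-0.1, 0.5])

# D^(alpha) -> B_Phi(x, y) as alpha -> -1, and B_Phi(y, x) as alpha -> +1:
assert abs(D(phi, x, y, -1 + 1e-6) - bregman(phi, grad, x, y)) < 1e-5
assert abs(D(phi, x, y, +1 - 1e-6) - bregman(phi, grad, y, x)) < 1e-5
```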
Proof. Assume that the Gâteaux derivative of Φ,

\[
\lim_{\lambda \to 0} \frac{\Phi(x + \lambda\xi) - \Phi(x)}{\lambda},
\]

exists and equals ⟨ξ, ∇Φ(x)⟩, where ∇ is the gradient (subdifferential) operator and ⟨·, ·⟩ denotes the standard inner product. Similarly,

\[
\lim_{\lambda \to 1} \frac{\Phi(y + (1-\lambda)\eta) - \Phi(y)}{1-\lambda} = \langle \eta, \nabla\Phi(y) \rangle.
\]
Taking ξ = y − x, η = x − y, and λ = (1±α)/2, and substituting these into equation 2.3, immediately yields the results. ∎

Remark 2.3.1. Introducing the convex conjugate of Φ through the Legendre-Fenchel transform (see, e.g., Rockafellar, 1970),

\[
\Phi^*(u) = \langle u, (\nabla\Phi)^{-1}(u) \rangle - \Phi\big((\nabla\Phi)^{-1}(u)\big), \tag{2.14}
\]

it can be shown that the function Φ* is also convex (on a convex region S′ ∋ u, where u = ∇Φ(x)) and has Φ as its conjugate, (Φ*)* = Φ. Since ∇Φ and ∇Φ* are inverse functions of each other, as verified by differentiating equation 2.14, it is convenient to denote this one-to-one correspondence between x ∈ S and u ∈ S′ by

\[
x = \nabla\Phi^*(u) = (\nabla\Phi)^{-1}(u), \qquad u = \nabla\Phi(x) = (\nabla\Phi^*)^{-1}(x). \tag{2.15}
\]
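For Φ(x) = Σᵢ e^{xᵢ}, the conjugate of equation 2.14 has the closed form Φ*(u) = Σᵢ (uᵢ log uᵢ − uᵢ), and the inverse-gradient correspondence 2.15 can be confirmed directly (an illustrative sketch):

```python
import numpy as np

phi           = lambda x: float(np.sum(np.exp(x)))
phi_star      = lambda u: float(np.sum(u * np.log(u) - u))  # conjugate of sum(exp)
grad_phi      = lambda x: np.exp(x)
grad_phi_star = lambda u: np.log(u)

x = np.array([0.3, -1.0, 0.8])
u = grad_phi(x)

# Equation 2.14: Phi*(u) = <u, (grad Phi)^{-1}(u)> - Phi((grad Phi)^{-1}(u)):
assert abs(phi_star(u) - (u @ x - phi(x))) < 1e-12
# Equation 2.15: grad Phi and grad Phi* are mutual inverses:
assert np.allclose(grad_phi_star(grad_phi(x)), x)
assert np.allclose(grad_phi(grad_phi_star(u)), u)
```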
With these, it can be shown that the Bregman divergence D_Φ^(±1) is actually the canonical divergence (Amari & Nagaoka, 2000) in disguise.

Corollary 2. The D_Φ^(±1)-divergence is the canonical divergence of a dually flat statistical manifold:

\[
D_\Phi^{(1)}(x, (\nabla\Phi)^{-1}(v)) = A_\Phi(x, v) \equiv \Phi(x) + \Phi^*(v) - \langle x, v \rangle,
\]
\[
D_\Phi^{(-1)}((\nabla\Phi)^{-1}(u), y) = A_{\Phi^*}(u, y) \equiv \Phi(y) + \Phi^*(u) - \langle u, y \rangle. \tag{2.16}
\]

Proof. Using the convex conjugate Φ*, we have

\[
D_\Phi^{(1)}(x, y) = \Phi(x) - \langle x, \nabla\Phi(y) \rangle + \Phi^*(\nabla\Phi(y)),
\]
\[
D_\Phi^{(-1)}(x, y) = \Phi(y) - \langle y, \nabla\Phi(x) \rangle + \Phi^*(\nabla\Phi(x)).
\]
Substituting u = ∇Φ(x) and v = ∇Φ(y) yields equation 2.16. So D_Φ^(1)(x, (∇Φ)⁻¹(v)), when viewed as a function of x and v, is the canonical divergence. So is D_Φ^(−1). ∎
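Corollary 2 can likewise be checked numerically for Φ(x) = Σᵢ e^{xᵢ} (an illustrative sketch; α = 1 is approached through the limit of equation 2.3):

```python
import numpy as np

def D(phi, x, y, alpha):
    """D_Phi^(alpha) of equation 2.2 (alpha != +-1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))

phi      = lambda v: float(np.sum(np.exp(v)))
phi_star = lambda u: float(np.sum(u * np.log(u) - u))  # conjugate of sum(exp)

x = np.array([0.2, -0.5])
v = np.array([0.8, 1.7])     # a point in the gradient (dual) space S'
y = np.log(v)                # y = (grad Phi)^{-1}(v)

# Canonical divergence A_Phi(x, v) of equation 2.16:
canonical = phi(x) + phi_star(v) - x @ v
assert canonical > 0                                  # Fenchel-Young inequality
assert abs(D(phi, x, y, 1 - 1e-6) - canonical) < 1e-4
```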
Remark 2.3.2. The canonical divergence A_Φ(x, v), based on the Legendre-Fenchel inequality, was introduced by Amari (1982, 1985), where the functions Φ and Φ*, the cumulant generating functions of an exponential family, were referred to as dual potential functions. The form in equation 2.16 is "canonical" because it is uniquely specified in a dually flat space, where the pair of canonical coordinates (see equation 2.15) induced by the dual potential functions is biorthogonal,

\[
\frac{\partial u_i}{\partial x^j} = g_{ij}(x), \qquad \frac{\partial x^i}{\partial u_j} = \tilde{g}^{ij}(u), \tag{2.17}
\]

where g̃^{ij}(u(x)) = g^{ij}(x) is the matrix inverse of g_{ij}(x) given by equation 2.6.

Remark 2.3.3. We point out that there are two kinds of duality associated with the divergence (directed distance) defined on a dually flat statistical manifold: one between D_Φ^(−1) ↔ D_Φ^(1) and D_{Φ*}^(−1) ↔ D_{Φ*}^(1), the other between D_Φ^(−1) ↔ D_{Φ*}^(−1) and D_Φ^(1) ↔ D_{Φ*}^(1). The first kind is related to the duality in the choice of the reference versus the comparison status for the two points (x versus y) in computing the value of the divergence, and hence is called referential duality. The second kind is related to the duality in the choice of the representation of a point as a vector in the parameter versus the gradient space (x versus u) in the expression of the divergence function, and hence is called representational duality. More concretely,

\[
D_\Phi^{(-1)}(x, y) = D_{\Phi^*}^{(-1)}(\nabla\Phi(y), \nabla\Phi(x)) = D_{\Phi^*}^{(1)}(\nabla\Phi(x), \nabla\Phi(y)) = D_\Phi^{(1)}(y, x).
\]
The biduality is compactly reflected in the canonical divergence as A_Φ(x, v) = A_{Φ*}(v, x).

2.4 Biduality of Statistical Manifold for General α. A natural question to ask is whether biduality is a general property of the divergence D_Φ^(α), and hence a property of any statistical manifold admitting a metric and a pair of dual (but not necessarily flat) connections. Proposition 5 provides a positive answer to this question after considering the geometry generated by the pair of conjugate divergence functions, D_Φ^(α) and D_{Φ*}^(α), for each α ∈ ℝ.
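The biduality relations of remark 2.3.3 can be spot-checked near α = ±1 (an illustrative numerical sketch for Φ(x) = Σᵢ e^{xᵢ}; all four limits agree up to the O(ε) error of the limiting approximation):

```python
import numpy as np

def D(phi, x, y, alpha):
    """D_Phi^(alpha) of equation 2.2 (alpha != +-1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))

phi      = lambda v: float(np.sum(np.exp(v)))
phi_star = lambda u: float(np.sum(u * np.log(u) - u))  # conjugate of sum(exp)

x = np.array([0.2, -0.5])
y = np.array([-0.3, 0.4])
u, v = np.exp(x), np.exp(y)   # u = grad Phi(x), v = grad Phi(y)
eps = 1e-6

lhs = D(phi, x, y, -1 + eps)                          # ~ D_Phi^(-1)(x, y)
assert abs(lhs - D(phi_star, v, u, -1 + eps)) < 1e-4  # representational duality
assert abs(lhs - D(phi_star, u, v,  1 - eps)) < 1e-4  # both dualities combined
assert abs(lhs - D(phi,      y, x,  1 - eps)) < 1e-4  # referential duality
```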
Proposition 5. For the statistical manifold {S′, g̃, Γ̃^(α), Γ̃^{*(α)}} induced by D_{Φ*}^(α)(u, v), defined on u, v ∈ S′, denote the Riemannian metric as g̃^{mn}(u), the pair of dual connections as Γ̃^{(α)mn,l}(u) and Γ̃^{*(α)mn,l}(u), and the Riemann-Christoffel curvature tensor as R̃^{(α)klmn}(u). They are related to those (in lower scripts and without the tilde) induced by D_Φ^(α)(x, y) via

\[
\sum_l g_{il}(x)\,\tilde{g}^{ln}(u) = \delta_i^n,
\]

and

\[
\tilde{\Gamma}^{(\alpha)mn,l}(u) = -\sum_{i,j,k} \tilde{g}^{im}(u)\,\tilde{g}^{jn}(u)\,\tilde{g}^{kl}(u)\,\Gamma^{(\alpha)}_{ij,k}(x),
\]
\[
\tilde{\Gamma}^{*(\alpha)mn,l}(u) = -\sum_{i,j,k} \tilde{g}^{im}(u)\,\tilde{g}^{jn}(u)\,\tilde{g}^{kl}(u)\,\Gamma^{(-\alpha)}_{ij,k}(x),
\]
\[
\tilde{R}^{(\alpha)klmn}(u) = \sum_{i,j,\mu,\nu} \tilde{g}^{ik}(u)\,\tilde{g}^{jl}(u)\,\tilde{g}^{\mu m}(u)\,\tilde{g}^{\nu n}(u)\,R^{(\alpha)}_{ij\mu\nu}(x),
\]
where the dual correspondence (see equation 2.15) is invoked.

Proof. Following the proof of proposition 2 (i.e., using the Eguchi relation), the metric and dual connections induced on $S'$ are simply $\tilde g^{mn} = (\Phi^*)^{mn}$ and
$$\tilde\Gamma^{(\alpha)mn,l} = \frac{1-\alpha}{2}\,(\Phi^*)^{mnl}, \qquad \tilde\Gamma^{*(\alpha)mn,l} = \frac{1+\alpha}{2}\,(\Phi^*)^{mnl},$$
with the corresponding Riemann-Christoffel curvature of $\tilde\Gamma^{(\alpha)mn,l}$ given as
$$\tilde R^{(\alpha)klmn} = \frac{1-\alpha^2}{4}\sum_{\rho,\tau}\bigl((\Phi^*)^{k\rho n}(\Phi^*)^{l\tau m} - (\Phi^*)^{k\rho m}(\Phi^*)^{l\tau n}\bigr)(\Phi^*)_{\rho\tau},$$
where
$$(\Phi^*)^{mn} = \frac{\partial^2\Phi^*(u)}{\partial u_m\,\partial u_n}, \qquad (\Phi^*)^{mnl} = \frac{\partial^3\Phi^*(u)}{\partial u_m\,\partial u_n\,\partial u_l},$$
and $(\Phi^*)_{\rho\tau}$ is the matrix inverse of $(\Phi^*)^{mn}$. That $\sum_l g_{il}(x)\,\tilde g^{ln}(u(x)) = \sum_l g_{il}(x(u))\,\tilde g^{ln}(u) = \delta_i^n$ is due to equations 2.15 and 2.17. Differentiating this
J. Zhang
identity with respect to $x^k$ yields
$$\sum_m \frac{\partial g_{im}(x)}{\partial x^k}\,\tilde g^{mn}(u) = -\sum_m g_{im}(x)\,\frac{\partial\tilde g^{mn}(u)}{\partial x^k} = -\sum_m g_{im}(x)\left(\sum_l \frac{\partial(\Phi^*)^{mn}(u)}{\partial u_l}\,\frac{\partial u_l}{\partial x^k}\right)$$
or
$$\sum_m \Phi_{imk}(x)\,\tilde g^{mn}(u) = -\sum_{m,l} g_{im}(x)\,g_{kl}(x)\,(\Phi^*)^{mnl}(u),$$
which immediately gives rise to the desired relations between the $\alpha$-connections. Simple substitution yields the relation between $\tilde R^{(\alpha)klmn}$ and $R^{(\alpha)}_{ij\mu\nu}$. ∎

Remark 2.4.1. The biorthogonal coordinates $x$ and $u$ were originally defined on the manifold $S$ and its dual $S'$, respectively. Because of the bijectivity of the mapping between $x$ and $u$, we may identify points in $S$ with points in $S'$ and simply view $x \leftrightarrow u$ as coordinate transformations on the same underlying manifold. The relations between the metric $g$, dual connections $\Gamma^{(\pm\alpha)}$, or the curvature $R^{(\alpha)}$ written in superscripts with tilde and those written in subscripts without tilde are merely expressions of the same geometric entities using different coordinate systems. The dualistic geometric structure $\Gamma^{(\alpha)} \leftrightarrow \Gamma^{*(\alpha)}$ under $g$, which reflects referential duality, is preserved under the mapping $x \leftrightarrow u$, which reflects representational duality. When the manifold is dually flat ($\alpha = \pm 1$), $x$ and $u$ enjoy the additional property of being geodesic coordinates.

Remark 2.4.2. Matumoto (1993) proved that a divergence function always exists for a statistical manifold equipped with an arbitrary metric tensor and a pair of dual connections. Given a convex function $\Phi$, along with its unique conjugate $\Phi^*$, are there other families of divergence functions $\mathcal{D}_\Phi^{(\alpha)}(x, y)$ and $\mathcal{D}_{\Phi^*}^{(\alpha)}(u, v)$ that induce the same bidualistic statistical manifolds $\{S, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$? The answer is positive. Consider the family of divergence functions,
$$\mathcal{D}_\Phi^{(\alpha)}(x, y) = \frac{1-\alpha}{2}\,D_\Phi^{(-1)}(x, y) + \frac{1+\alpha}{2}\,D_\Phi^{(1)}(x, y),$$
and their conjugate (replacing $\Phi$ with $\Phi^*$). Recall from proposition 2 that the metric tensor induced by $D_\Phi^{(-1)}(x, y)$ and $D_\Phi^{(1)}(x, y)$ is the same $g_{ij}$, while the induced connections satisfy $\Gamma^{(-1)}_{ij,k} = \Gamma^{*(1)}_{ij,k} = \Phi_{ijk}$, $\Gamma^{(1)}_{ij,k} = \Gamma^{*(-1)}_{ij,k} = 0$.
Since the Eguchi relations, equations 2.8 to 2.10, are linear with respect to the inducing functions, the family of divergence functions $\mathcal{D}_\Phi^{(\alpha)}(x, y)$, as a convex mixture of $D_\Phi^{(-1)}(x, y)$ and $D_\Phi^{(1)}(x, y)$, will necessarily induce the metric
$$\frac{1-\alpha}{2}\,g_{ij} + \frac{1+\alpha}{2}\,g_{ij} = g_{ij},$$
and dual connections
$$\frac{1-\alpha}{2}\,\Gamma^{(-1)}_{ij,k} + \frac{1+\alpha}{2}\,\Gamma^{(1)}_{ij,k} = \Gamma^{(\alpha)}_{ij,k}, \qquad \frac{1-\alpha}{2}\,\Gamma^{*(-1)}_{ij,k} + \frac{1+\alpha}{2}\,\Gamma^{*(1)}_{ij,k} = \Gamma^{*(\alpha)}_{ij,k}.$$
Similar arguments apply to $\mathcal{D}_{\Phi^*}^{(\alpha)}(u, v)$. That is, $\mathcal{D}_\Phi^{(\alpha)}(x, y)$ and $\mathcal{D}_{\Phi^*}^{(\alpha)}(u, v)$ form another pair of families of divergence functions that induce the same statistical manifold $\{S, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$. The two pairs, the $(D_\Phi^{(\alpha)}(x, y), D_{\Phi^*}^{(\alpha)}(u, v))$ pair and the $(\mathcal{D}_\Phi^{(\alpha)}(x, y), \mathcal{D}_{\Phi^*}^{(\alpha)}(u, v))$ pair, agree at $\alpha = \pm 1$, the dually flat case when there is a single form of canonical divergence. They differ for any other value of $\alpha$, including the self-dual element ($\alpha = 0$). The two pairs $(D_\Phi^{(\alpha)}, D_{\Phi^*}^{(\alpha)})$ versus $(\mathcal{D}_\Phi^{(\alpha)}, \mathcal{D}_{\Phi^*}^{(\alpha)})$ coincide with each other up to the third order when Taylor expanded in $(1 \pm \alpha)(y - x)$; that is why they induce an identical statistical manifold. They differ in the fourth-order terms of their expansions.
3 Divergence on Probability and Positive Measures

The previous sections have discussed divergence functions defined between points in $\mathbf{R}^n$ or in its dual space, or both. In particular, they apply to probability measures of finite support, or the finite-dimensional parameter space, which defines parametric probability models. Traditionally, divergence functionals are also defined with respect to infinite-dimensional probability densities (or positive measures in general if the normalization constraint is removed). To the extent that a probability density function can be embedded into a finite-dimensional parameter space, a divergence measure on density functions will implicitly induce a divergence on the parameter space (technically, via pullback). In fact, this is the original setup in Amari (1985), where each $\alpha$-divergence ($\alpha \in \mathbf{R}$) is seen as the canonical divergence arising from the $\alpha$-embedding of the probability density function into an affine submanifold (the condition of $\alpha$-affinity). The approach outlined below avoids such a flatness assumption while still achieving conjugate representations (embeddings) of probability densities, and therefore extends the notion of biduality to the infinite-dimensional case. It will be proved (in proposition 9) that if the embedding manifold is flat, then the induced
$\alpha$-connections reduce to the ones introduced in the previous section, with the natural and expectation parameters arising out of these conjugate representations forming biorthogonal coordinates just like the ones induced by dual potential functions in the finite-dimensional case.

To follow the procedures of section 2.1 and construct divergence functionals, a smooth and strictly convex function defined on the real line, $f: \mathbf{R} \to \mathbf{R}$, is introduced. Recall that any such $f$ can be written as an integral of a strictly monotone-increasing function $g$ and vice versa: $f(\gamma) = \int_c^\gamma g(t)\,dt + f(c)$, with $g'(t) > 0$. The convex conjugate $f^*: \mathbf{R} \to \mathbf{R}$ is given by $f^*(\delta) = \int_{g(c)}^\delta g^{-1}(t)\,dt + f^*(g(c))$, with $g^{-1}$ also strictly monotonic and $\gamma, \delta \in \mathbf{R}$. Here, the monotonicity condition replaces the requirement of a positive semidefinite Hessian in the case of a convex function of several variables. The Legendre-Fenchel inequality $f(\gamma) + f^*(\delta) \ge \gamma\delta$ can be rewritten as Young's inequality,
$$\int_c^\gamma g(t)\,dt + \int_{g(c)}^\delta g^{-1}(t)\,dt + c\,g(c) \ge \gamma\delta,$$
with equality holding if and only if $\delta = g(\gamma)$. The use of a pair of strictly monotonic functions $f' = g$ and $(f^*)' = g^{-1}$, which we call $\rho$- and $\tau$-representations below, provides a means to define conjugate embeddings (representations) of density functions and therefore a method to extend the analysis in the previous sections to the infinite-dimensional manifold of positive measures (after integrating over the sample space).

3.1 Divergence Functional Based on Convex Function on the Real Line. Recall that the fundamental inequality (see equation 2.1) of a strictly convex function $\Phi$, now for $f: \mathbf{R} \to \mathbf{R}$, can be used to define a nonnegative quantity (for any $\alpha \in \mathbf{R}$),
$$\frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\,f(\gamma) + \frac{1+\alpha}{2}\,f(\delta) - f\left(\frac{1-\alpha}{2}\,\gamma + \frac{1+\alpha}{2}\,\delta\right)\right).$$
Note that here, $\gamma$ and $\delta$ are numbers instead of finite-dimensional vectors. In particular, they can be the values of two probability density functions, $\gamma = p(\zeta)$ and $\delta = q(\zeta)$, where $\zeta \in \mathcal{X}$ is a point in the sample space $\mathcal{X}$. The nonnegativity of the above expression for each value of $\zeta$ allows us to define a global divergence measure, called a divergence functional, over the space of (denormalized) probability density functions after integrating over the sample space (with appropriate measure $\mu(d\zeta) = d\mu$),
$$d_f^{(\alpha)}(p, q) = \int d_f^{(\alpha)}(p(\zeta), q(\zeta))\,d\mu = \frac{4}{1-\alpha^2}\left\{\frac{1-\alpha}{2}\int f(p)\,d\mu + \frac{1+\alpha}{2}\int f(q)\,d\mu - \int f\left(\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q\right)d\mu\right\},$$
with
$$d_f^{(-1)}(p, q) = d_f^{(1)}(q, p) = \int \bigl(f(q) - f(p) - (q - p)\,f'(p)\bigr)\,d\mu \qquad (3.1)$$
$$= \int \bigl(f(q) + f^*(f'(p)) - q\,f'(p)\bigr)\,d\mu \equiv A_f(q, f'(p)), \qquad (3.2)$$
where $f^*: \mathbf{R} \to \mathbf{R}$, defined by
$$f^*(t) = t\,(f')^{-1}(t) - f\bigl((f')^{-1}(t)\bigr),$$
is the convex conjugate to $f$, with $(f^*)^* = f$ and $(f^*)' = (f')^{-1}$.

In this way, $d_f^{(\alpha)}(p, q)$ over the infinite-dimensional functional space is defined in much the same way as $D_\Phi^{(\alpha)}(x, y)$ defined on the finite-dimensional vector space. The integration $\int f(p)\,d\mu = \int f(p(\zeta))\,d\mu$, which is a nonlinear functional of $p$, assumes the role of $\Phi$ of the finite-dimensional case discussed in section 2; this is most transparent if we consider, for the latter, the special class of "separable" convex functions $\Phi(x) = \sum_{i=1}^n f(x^i)$, $x \in \mathbf{R}^n$, such that $\nabla\Phi(x)$ is simply $[f'(x^1), \ldots, f'(x^n)]$. With $\sum_i \leftrightarrow \int d\mu$, the expressions of the divergence function and the divergence functional look similar. However, one should not conclude that divergence functions are special cases of divergence functionals or vice versa. There is a subtle but important difference: in the former, the inducing function $\Phi(x)$ is strictly convex in $x$; in the latter, $f(p)$ is strictly convex in $p$, but its pullback on $\mathcal{X}$, $f(p(\zeta))$, is not assumed to be convex at all. So even when the sample space may be finite, $(f \circ p)(\zeta)$ is generally not a convex function of $\zeta$.
Example 3.1.1. Take $f(t) = t\log t - t$ ($t > 0$), with conjugate function $f^*(u) = e^u$. The divergence
$$A_f(p, u) = \int \bigl((p\log p - p) + e^u - p\,u\bigr)\,d\mu = \int \left(p\log\frac{p}{e^u} - p + e^u\right)d\mu$$
is the Kullback-Leibler divergence $K(p, e^u)$ between $p(\zeta)$ and $q(\zeta) = e^{u(\zeta)}$.

Example 3.1.2. Take $f(t) = t^r/r$ ($t > 0$) with the conjugate function $f^*(t) = t^{r'}/r'$, where the pair of conjugated real exponents $r > 1$, $r' > 1$ satisfies
$$\frac{1}{r} + \frac{1}{r'} = 1.$$
The divergence $A_f$ is a nonnegative expression based on Hölder's inequality for two functions $u(\zeta), v(\zeta)$,
$$A_f(u, v) = \int\left(\frac{u^r}{r} + \frac{v^{r'}}{r'} - u\,v\right)d\mu \ge 0,$$
with equality holding if and only if $u^r(\zeta) = v^{r'}(\zeta)$ for all $\zeta \in \mathcal{X}$. Denote $r = \frac{2}{1-\alpha}$ and $r' = \frac{2}{1+\alpha}$, with $\alpha \in (-1, 1)$. The above divergence is just $A^{(\alpha)}(p, q)$ between $p(\zeta) = (u(\zeta))^{2/(1-\alpha)}$ and $q(\zeta) = (v(\zeta))^{2/(1+\alpha)}$, apart from a factor $\frac{4}{1-\alpha^2}$.
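The relations of equations 3.1 and 3.2 and Example 3.1.1 can be verified numerically on a finite sample space. The sketch below (my own, with a uniform counting measure standing in for $d\mu$) checks that the Bregman form equals the canonical form $A_f(q, f'(p))$, that for $f(t) = t\log t - t$ the canonical divergence is the extended Kullback-Leibler divergence, and that $d_f^{(\alpha)} \to d_f^{(-1)}$ as $\alpha \to -1$:

```python
import numpy as np

# Discrete sample space; p, q are positive (not necessarily normalized) densities.
rng = np.random.default_rng(0)
p = rng.uniform(0.2, 2.0, size=50)
q = rng.uniform(0.2, 2.0, size=50)

f = lambda t: t * np.log(t) - t          # Example 3.1.1: f(t) = t log t - t
fp = lambda t: np.log(t)                 # f'(t)
fstar = lambda u: np.exp(u)              # conjugate f*(u) = e^u

# d_f^{(-1)}(p, q): Bregman-type form of equation 3.1
d_minus1 = np.sum(f(q) - f(p) - (q - p) * fp(p))

# canonical form A_f(q, f'(p)) of equation 3.2
A = np.sum(f(q) + fstar(fp(p)) - q * fp(p))

# with u chosen so that e^u = q, A_f(p, u) is the extended KL divergence K(p, q)
u = fp(q)
A_pu = np.sum(f(p) + fstar(u) - p * u)
KL = np.sum(p * np.log(p / q) - p + q)

# the alpha-divergence functional of section 3.1 (alpha != +-1)
def d_alpha(a):
    mix = 0.5 * (1 - a) * p + 0.5 * (1 + a) * q
    return 4.0 / (1 - a**2) * np.sum(
        0.5 * (1 - a) * f(p) + 0.5 * (1 + a) * f(q) - f(mix))
```

The limit check `d_alpha(-0.999)` is only approximate, since $\alpha = -1$ itself is excluded from the closed form.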
3.2 Conjugate Representations and Induced Statistical Manifold. We introduce the notion of $\rho$-representation of a (not necessarily normalized) probability density by defining a mapping $\rho: \mathbf{R}^+ \to \mathbf{R}$, $p \mapsto \rho(p)$, where $\rho$ is a strictly monotone increasing function. This is a generalization of the $\alpha$-representation (Amari, 1985; Amari & Nagaoka, 2000), where $\rho(p) = l^{(\alpha)}(p)$, as given by equation 1.8. For a smooth and strictly convex function $f: \mathbf{R} \to \mathbf{R}$, the $\tau$-representation of the density function, $p \mapsto \tau(p)$, is said to be conjugate to the $\rho$-representation with respect to $f$ if
$$\tau(p) = f'(\rho(p)) = ((f^*)')^{-1}(\rho(p)) \iff \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p)). \qquad (3.3)$$
Just like the construction in section 3.1 of divergence functionals for two densities $p, q$, one may construct divergence functionals for two densities under $\rho$-representation, $\mathcal{D}_{f,\rho}^{(\alpha)}(p, q) \equiv d_f^{(\alpha)}(\rho(p), \rho(q))$, or under $\tau$-representation, $\mathcal{D}_{f^*,\tau}^{(\alpha)}(p, q) \equiv d_{f^*}^{(\alpha)}(\tau(p), \tau(q))$.

Proposition 6. For $\alpha \in \mathbf{R}$, $\mathcal{D}_{f,\rho}^{(\alpha)}(p, q)$, $\mathcal{D}_{f,\tau}^{(\alpha)}(p, q)$, $\mathcal{D}_{f^*,\rho}^{(\alpha)}(p, q)$, and $\mathcal{D}_{f^*,\tau}^{(\alpha)}(p, q)$ each forms a one-parameter family of divergence functionals, with the $(\pm 1)$-divergence functionals
$$\mathcal{D}_{f,\rho}^{(1)}(p, q) = \mathcal{D}_{f,\rho}^{(-1)}(q, p) = \mathcal{D}_{f^*,\tau}^{(1)}(q, p) = \mathcal{D}_{f^*,\tau}^{(-1)}(p, q) = A_f(\rho(p), \tau(q)),$$
$$\mathcal{D}_{f^*,\rho}^{(1)}(p, q) = \mathcal{D}_{f^*,\rho}^{(-1)}(q, p) = \mathcal{D}_{f,\tau}^{(1)}(q, p) = \mathcal{D}_{f,\tau}^{(-1)}(p, q) = A_{f^*}(\rho(p), \tau(q)),$$
taking the following canonical form:
$$A_f(\rho(p), \tau(q)) \equiv \int \bigl(f(\rho(p)) + f^*(\tau(q)) - \rho(p)\,\tau(q)\bigr)\,d\mu = A_{f^*}(\tau(q), \rho(p)). \qquad (3.4)$$
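For Amari's $\alpha$-embedding, $\rho(p) = l^{(\alpha)}(p)$, the canonical form of equation 3.4 can be checked numerically; the sketch below assumes $l^{(\alpha)}(p) = \frac{2}{1-\alpha}\,p^{(1-\alpha)/2}$ (the $\alpha$-representation of equation 1.8, which is not reproduced in this excerpt) and the form of Amari's $\alpha$-divergence, and uses the convex function $f$ of Example 3.2.1 below:

```python
import numpy as np

alpha = 0.5
# alpha-representation l^{(a)}(p) = (2/(1-a)) p^{(1-a)/2}  (equation 1.8, assumed)
l = lambda a, p: 2.0 / (1 - a) * p ** ((1 - a) / 2)

# convex function realizing the alpha-embedding, and its conjugate (example 3.2.1)
f = lambda t: 2.0 / (1 + alpha) * ((1 - alpha) / 2 * t) ** (2 / (1 - alpha))
fstar = lambda t: 2.0 / (1 - alpha) * ((1 + alpha) / 2 * t) ** (2 / (1 + alpha))
fprime = lambda t, h=1e-6: (f(t + h) - f(t - h)) / (2 * h)  # numerical f'

rng = np.random.default_rng(4)
p = rng.uniform(0.3, 2.0, size=25)
q = rng.uniform(0.3, 2.0, size=25)

# equation 3.3: tau(p) = f'(rho(p)) should recover the (-alpha)-representation
tau = fprime(l(alpha, p))

# canonical form A_f(rho(p), tau(q)) of equation 3.4 (uniform base measure)
A = np.sum(f(l(alpha, p)) + fstar(l(-alpha, q)) - l(alpha, p) * l(-alpha, q))

# Amari's alpha-divergence (the form of equation 1.2, assumed)
w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
A_amari = 4 / (1 - alpha**2) * np.sum(w1 * p + w2 * q - p**w1 * q**w2)
```

Both the conjugacy $\tau = l^{(-\alpha)}$ and the identity $A_f(\rho(p), \tau(q)) = A^{(\alpha)}(p, q)$ hold exactly for this choice of $f$.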
Proof. The proof for nonnegativity of these functionals for all $\alpha \in \mathbf{R}$ follows that in proposition 2. Taking $\lim_{\alpha\to\pm 1} \mathcal{D}_{f,\rho}^{(\alpha)}(p, q)$ and noting equation 3.3 immediately leads to the expressions of the $(\pm 1)$-divergence functionals. ∎

Example 3.2.1. Amari's $\alpha$-embedding, where $\rho(p) = l^{(\alpha)}(p)$, $\tau(p) = l^{(-\alpha)}(p)$, corresponds to (assuming $\alpha \ne \pm 1$)
$$f(t) = \frac{2}{1+\alpha}\left(\frac{1-\alpha}{2}\,t\right)^{\frac{2}{1-\alpha}}, \qquad f^*(t) = \frac{2}{1-\alpha}\left(\frac{1+\alpha}{2}\,t\right)^{\frac{2}{1+\alpha}}.$$
Writing out $A_f(\rho(p), \tau(q))$ explicitly yields the $\alpha$-divergence in the form of equation 1.2. For $\alpha = \pm 1$, see example 3.1.1.

Now we restrict attention to a finite-dimensional submanifold of probability densities whose $\rho$-representations are parameterized using $\theta = [\theta^1, \ldots, \theta^n] \in M_\theta$. Under such a statistical model, the divergence functional of any two densities $p$ and $q$, assumed to be specified by $\theta_p$ and $\theta_q$, respectively, becomes an implicit function of $\theta_p, \theta_q \in \mathbf{R}^n$. In other words, through introducing parametric models (i.e., a finite-dimensional submanifold) of the infinite-dimensional manifold of probability densities, we again arrive at divergence functions over the vector space. We denote the $\rho$-representation of a parameterized probability density as $\rho(p(\zeta; \theta))$, or sometimes simply $\rho(\theta_p)$, while suppressing the sample space variable $\zeta$, and denote the corresponding divergence function by
$$\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q) = \frac{4}{1-\alpha^2}\,E_\mu\left\{\frac{1-\alpha}{2}\,f(\rho(\theta_p)) + \frac{1+\alpha}{2}\,f(\rho(\theta_q)) - f\left(\frac{1-\alpha}{2}\,\rho(\theta_p) + \frac{1+\alpha}{2}\,\rho(\theta_q)\right)\right\}, \qquad (3.5)$$
where $E_\mu\{\cdot\}$ denotes $\int\{\cdot\}\,d\mu$. We will also use $E_p\{\cdot\}$ to denote $\int\{\cdot\}\,p\,d\mu$ later. Similarly, the parametrically embedded probability density under $\tau$-representation is denoted $\tau(p(\zeta; \theta))$ or simply $\tau(\theta_p)$.

Proposition 7.
The family of divergence functions $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ induces a dually affine Riemannian manifold $\{M_\theta, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$ for each $\alpha \in \mathbf{R}$, with the metric tensor
$$g_{ij}(\theta) = E_\mu\left\{f''(\rho(\theta))\,\frac{\partial\rho}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\right\} \qquad (3.6)$$
and the dual affine connections
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{\frac{1-\alpha}{2}\,f'''(\rho)\,A_{ijk} + f''(\rho)\,B_{ijk}\right\}, \qquad (3.7)$$
$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = E_\mu\left\{\frac{1+\alpha}{2}\,f'''(\rho)\,A_{ijk} + f''(\rho)\,B_{ijk}\right\}. \qquad (3.8)$$
Here, $\rho$ and all its partial derivatives (with respect to $\theta$) are functions of $\theta$ and $\zeta$, while $A_{ijk}$, $B_{ijk}$ denote
$$A_{ijk}(\zeta; \theta) = \frac{\partial\rho}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\frac{\partial\rho}{\partial\theta^k}, \qquad B_{ijk}(\zeta; \theta) = \frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k}.$$
Proof. We follow the same technique of section 2.2 to expand the value of the divergence measure $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ around $\theta_p = \theta + \xi$, $\theta_q = \theta + \eta$ for small $\xi, \eta \in \mathbf{R}^n$. Considering the order of expansion $o(\xi^m\eta^l)$ with nonnegative integers $m, l$, the terms with $m + l \le 1$ vanish uniformly. The terms with $m + l = 2$ are
$$E_\mu\left\{\frac{1}{2}\sum_{i,j} f''(\rho)\,\frac{\partial\rho}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\,(\xi^i - \eta^i)(\xi^j - \eta^j)\right\},$$
which is $\alpha$-independent. The terms with $m + l = 3$ are (after lengthy calculation)
$$E_\mu\left\{\frac{1}{6}\sum_{i,j,k} f'''(\rho)\,A_{ijk}\left(\frac{3-\alpha}{2}\,\xi^i\xi^j\xi^k + \frac{3+\alpha}{2}\,\eta^i\eta^j\eta^k - \frac{3-3\alpha}{2}\,\xi^i\xi^j\eta^k - \frac{3+3\alpha}{2}\,\eta^i\eta^j\xi^k\right) + \frac{1}{2}\sum_{i,j,k} f''(\rho)\,B_{ijk}\,(\xi^i\xi^j - \eta^i\eta^j)(\xi^k - \eta^k)\right\}.$$
Applying Eguchi relations 2.8 to 2.10 and carrying out differentiation yields the desired results. ∎

Remark 3.2.2. Strict convexity of $f$ requires that $f'' > 0$; thereby, the positive semidefiniteness of $g_{ij}$ is guaranteed. Clearly, the $\alpha$-connections form dual pairs $\Gamma^{*(\alpha)}_{ijk} = \Gamma^{(-\alpha)}_{ijk}$. The induced statistical manifold will be shown to demonstrate biduality just as in section 2. It is important to realize that while $f$ is strictly convex in $p$, $f(p(\theta))$ is not at all convex in $\theta$. Therefore, propositions 2 and 7 do not imply one another necessarily!

Example 3.2.3. For $f(t) = e^t$ and $\rho(p) = \log p$, that is, $\tau(p) = p$, the identity function, the above expressions reduce to the Fisher information and $\alpha$-connections of the exponential family. When $\rho(p) = l^{(\beta)}(p)$ is the parametric alpha representation, then $g_{ij}(\theta)$ reduces to $\int (\partial l^{(\beta)}/\partial\theta^i)(\partial l^{(-\beta)}/\partial\theta^j)\,d\mu$ as given in Amari (1985) and Amari and Nagaoka (2000). The formula for $\alpha$-connections, however, differs from theirs, since with $\rho = l^{(\beta)}$ parametrically, we will get a two-parameter family of connections with $\alpha$ and $\beta$ as parameters. (See section 3.5.)
3.3 Biduality of Statistical Manifold.

Proposition 8. Under conjugate $\rho$- and $\tau$-representations, the metric tensor induced by $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ can be expressed as
$$g_{ij}(\theta) = E_\mu\left\{\frac{\partial\rho}{\partial\theta^i}\frac{\partial\tau}{\partial\theta^j}\right\} = E_\mu\left\{\frac{\partial\tau}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\right\}, \qquad (3.9)$$
while the induced dual connections are
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{\frac{1-\alpha}{2}\,\frac{\partial^2\tau}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k} + \frac{1+\alpha}{2}\,\frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial\tau}{\partial\theta^k}\right\}, \qquad (3.10)$$
$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = E_\mu\left\{\frac{1+\alpha}{2}\,\frac{\partial^2\tau}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k} + \frac{1-\alpha}{2}\,\frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial\tau}{\partial\theta^k}\right\}. \qquad (3.11)$$

Proof. First,
$$E_\mu\left\{f''(\rho)\,\frac{\partial\rho}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\right\} = E_\mu\left\{\frac{\partial f'(\rho)}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\right\}.$$
Realizing $f'(\rho(p)) = \tau(p)$ proves equation 3.9. Second, differentiating the above with respect to $\theta^k$ yields
$$E_\mu\left\{f'''(\rho)\,\frac{\partial\rho}{\partial\theta^i}\frac{\partial\rho}{\partial\theta^j}\frac{\partial\rho}{\partial\theta^k} + f''(\rho)\,\frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^k}\frac{\partial\rho}{\partial\theta^j}\right\} = E_\mu\left\{\frac{\partial^2 f'(\rho)}{\partial\theta^i\,\partial\theta^k}\frac{\partial\rho}{\partial\theta^j}\right\}.$$
Rearranging, $E_\mu\{f'''(\rho)\,A_{ijk}\} = T_{ikj}$, where
$$T_{ijk} \equiv E_\mu\left\{\frac{\partial^2\tau}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k} - \frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial\tau}{\partial\theta^k}\right\} = E_\mu\left\{-\left(\frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^k}\frac{\partial\tau}{\partial\theta^j} - \frac{\partial^2\tau}{\partial\theta^i\,\partial\theta^k}\frac{\partial\rho}{\partial\theta^j}\right)\right\} = T_{ikj} \qquad (3.12)$$
is a totally symmetric tensor. Substituting into the expression for $\Gamma^{(\alpha)}_{ij,k}$,
$$\Gamma^{(\alpha)}_{ij,k} = E_\mu\left\{\frac{1-\alpha}{2}\left(\frac{\partial^2 f'(\rho)}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k} - \frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial f'(\rho)}{\partial\theta^k}\right) + \frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial f'(\rho)}{\partial\theta^k}\right\} = E_\mu\left\{\frac{1-\alpha}{2}\,\frac{\partial^2 f'(\rho)}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k} + \frac{1+\alpha}{2}\,\frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial f'(\rho)}{\partial\theta^k}\right\}.$$
This proves equation 3.10, the expression of $\Gamma^{(\alpha)}_{ij,k}$ under conjugate $\rho$- and $\tau$-representations. The proof concerning $\Gamma^{*(\alpha)}_{ij,k}$ is similar. ∎

Remark 3.3.1. Amari's alpha representation (indexed by $\beta$ here to avoid confusion) corresponds to the $\rho$-representation ($\rho(p) = l^{(\beta)}(p)$, $\tau(p) = l^{(-\beta)}(p)$) with $\alpha = 1$, or the $\tau$-representation ($\rho(p) = l^{(-\beta)}(p)$, $\tau(p) = l^{(\beta)}(p)$) with $\alpha = -1$. Note also that $\alpha = 0$ corresponds to the metric-compatible Levi-Civita connection $\Gamma^{(0)}_{ij,k}$ and that
$$\Gamma^{(\alpha)}_{ij,k} - \Gamma^{(0)}_{ij,k} = -\frac{\alpha}{2}\,T_{ijk}$$
conforms to the formal definition of $\alpha$-connection (Lauritzen, 1987).

Corollary 3. The metric $\tilde g_{ij}$ and the dual connections $\tilde\Gamma^{(\alpha)}_{ij,k}, \tilde\Gamma^{*(\alpha)}_{ij,k}$ of the statistical manifold induced by the conjugate divergence function $\mathcal{D}_{f^*,\tau}^{(\alpha)}(\theta_p, \theta_q)$ are related to those induced by $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ via
$$\tilde g_{ij}(\theta) = g_{ij}(\theta),$$
with
$$\tilde\Gamma^{(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta), \qquad \tilde\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(\alpha)}_{ij,k}(\theta).$$

Proof. Observe that $\rho = (f^*)'(\tau)$; from proposition 8,
$$g_{ij}(\theta) = E_\mu\left\{\frac{\partial\tau}{\partial\theta^i}\frac{\partial (f^*)'(\tau)}{\partial\theta^j}\right\} = E_\mu\left\{(f^*)''(\tau)\,\frac{\partial\tau}{\partial\theta^i}\frac{\partial\tau}{\partial\theta^j}\right\},$$
which is just $\tilde g_{ij}$. To prove the second part, we follow the steps in the proof of proposition 8 to show
$$E_\mu\left\{\frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\frac{\partial f'(\rho)}{\partial\theta^k} - \frac{\partial^2 f'(\rho)}{\partial\theta^i\,\partial\theta^j}\frac{\partial\rho}{\partial\theta^k}\right\} = E_\mu\left\{(f^*)'''(\tau)\,\frac{\partial\tau}{\partial\theta^i}\frac{\partial\tau}{\partial\theta^j}\frac{\partial\tau}{\partial\theta^k}\right\},$$
which is analogous to equation 3.12, and then show
$$\Gamma^{(\alpha)}_{ij,k} = E_\mu\left\{\frac{1+\alpha}{2}\,(f^*)'''(\tau)\,\frac{\partial\tau}{\partial\theta^i}\frac{\partial\tau}{\partial\theta^j}\frac{\partial\tau}{\partial\theta^k} + (f^*)''(\tau)\,\frac{\partial^2\tau}{\partial\theta^i\,\partial\theta^j}\frac{\partial\tau}{\partial\theta^k}\right\} \equiv \tilde\Gamma^{(-\alpha)}_{ij,k},$$
which is, by definition, the connection induced by $\mathcal{D}_{f^*,\tau}^{(-\alpha)}(\theta_p, \theta_q)$. ∎

Remark 3.3.2. Note that the statistical manifold associated with $\mathcal{D}_{f^*,\tau}^{(\alpha)}$ is the same finite-dimensional $\theta$-manifold $M_\theta$ as induced by $\mathcal{D}_{f,\rho}^{(\alpha)}$. The difference between $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ and $\mathcal{D}_{f^*,\tau}^{(\alpha)}(\theta_p, \theta_q)$ is due to the difference in the $\rho$- and $\tau$-representations of densities, which are conjugate to each other. Corollary 3 says that the duality $\Gamma \leftrightarrow \Gamma^*$, in addition to reflecting $\theta_p \leftrightarrow \theta_q$, reflects the conjugacy between $\rho(p)$ and $\tau(p)$, that is, the dual representations of the probability density function, so that
$$\Gamma^{*(\alpha)}_{ij,k} = \tilde\Gamma^{(\alpha)}_{ij,k}.$$

3.4 Natural and Expectation Parameters of Statistical Manifold. Assume now that $\theta$, as appears in $\rho(\theta)$ and $\tau(\theta)$, is the natural parameter of a parametric statistical model. For an exponential family, it is well known that one might as well parameterize density functions by the expectation parameter $\eta$, which is dual to the natural parameter. We generalize this duality (biorthogonality) between the natural parameter and the expectation parameter to arbitrary $\rho$- and $\tau$-embeddings by considering the pullback of $\mathcal{D}_{f,\rho}^{(\alpha)}$ and $\mathcal{D}_{f^*,\tau}^{(\alpha)}$ in both $M_\theta$ and $M_\eta$.

We now introduce the notion of $\rho$-affinity (a generalization of $\alpha$-affinity). A family of (denormalized) probability densities is said to be $\rho$-affine if the $\rho$-representation of these densities can be embedded into a finite-dimensional affine space, that is, if there exists a set of linearly independent functions $\lambda_i(\zeta)$ over the sample space $\mathcal{X} \ni \zeta$ such that
$$\rho(p(\zeta)) = \sum_i \theta^i\lambda_i(\zeta). \qquad (3.13)$$
Here, $\theta = [\theta^1, \ldots, \theta^n] \in M_\theta$ is called the natural parameter of this $\rho$-affine family. For any density $p(\zeta)$, the projection of its $\tau$-representation onto the functions $\lambda_i(\zeta)$,
$$\eta_i = \int \tau(p(\zeta))\,\lambda_i(\zeta)\,d\mu, \qquad (3.14)$$
forms a vector $\eta = [\eta_1, \ldots, \eta_n] \in M_\eta$; $\eta$ is called the expectation parameter of $p(\zeta)$.
Proposition 9. For a $\rho$-affine family of densities:

i. The functions
$$\Phi(\theta) = \int f(\rho(\theta))\,d\mu, \qquad \Phi^*(\eta) = \int f^*(\tau(\eta))\,d\mu$$
are a pair of conjugate, strictly convex functions.

ii. The natural parameter $\theta$ and the expectation parameter $\eta$ form biorthogonal coordinates,
$$\frac{\partial\Phi}{\partial\theta^i} = \eta_i, \qquad \frac{\partial\Phi^*}{\partial\eta_i} = \theta^i,$$
with
$$\frac{\partial\eta_i}{\partial\theta^j} = g_{ij}(\theta), \qquad \frac{\partial\theta^i}{\partial\eta_j} = \tilde g^{ij}(\eta),$$
where the metric $g_{ij}(\theta)$ is positive-definite with $\tilde g^{ij}(\eta)$ as its matrix inverse.

iii. The dual affine connections become
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2}\,\frac{\partial^3\Phi}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k}, \qquad \Gamma^{*(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\,\frac{\partial^3\Phi}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k}.$$

iv. The divergence functional $\mathcal{D}_{f,\rho}^{(\alpha)}(p, q)$ becomes the divergence function $D_\Phi^{(\alpha)}(\theta_p, \theta_q)$.

v. The canonical divergence functional
$$A_f(\rho(p), \tau(q)) = \Phi(\theta_p) + \Phi^*(\eta_q) - \sum_i \theta_p^i\,\eta_{q,i} \equiv A_\Phi(\theta_p, \eta_q), \qquad (3.15)$$
where $A_\Phi(\theta, \eta)$ is the canonical divergence function as given by corollary 2.
Proof. The assumption 3.13 implies that $\partial\rho/\partial\theta^i = \lambda_i(\zeta)$, so from equation 3.6,
$$g_{ij} = \int f''(\rho)\,\lambda_i(\zeta)\,\lambda_j(\zeta)\,d\mu.$$
That $g_{ij}$ is positive definite is seen by observing
$$\sum_{ij} g_{ij}\,\xi^i\xi^j = \int f''(\rho)\left(\sum_i \lambda_i(\zeta)\,\xi^i\right)^2 d\mu > 0$$
for any $\xi = [\xi^1, \ldots, \xi^n] \in \mathbf{R}^n$, due to the linear independence of the $\lambda_i$'s and the strict convexity of $f$. Now
$$\frac{\partial\Phi}{\partial\theta^i} = \int \frac{\partial f(\rho)}{\partial\theta^i}\,d\mu = \int f'(\rho)\,\frac{\partial\rho}{\partial\theta^i}\,d\mu = \int \tau(p(\zeta))\,\lambda_i(\zeta)\,d\mu = \eta_i$$
by definition 3.14. We can verify straightforwardly that
$$\frac{\partial^2\Phi}{\partial\theta^i\,\partial\theta^j} = \frac{\partial\eta_i}{\partial\theta^j} = \int f''(\rho(\zeta))\,\frac{\partial\rho}{\partial\theta^j}\,\lambda_i(\zeta)\,d\mu = g_{ij}(\theta)$$
is positive definite, so $\Phi(\theta)$ must be a strictly convex function. Parts iv and then iii follow proposition 2 once strict convexity of $\Phi(\theta)$ is established. Differentiating both sides of equation 3.14 with respect to $\eta_j$ yields
$$\delta_i^j = \int \frac{\partial\tau}{\partial\eta_j}\,\lambda_i(\zeta)\,d\mu.$$
Thus,
$$\frac{\partial\Phi^*}{\partial\eta_i} = \int \frac{\partial f^*(\tau)}{\partial\eta_i}\,d\mu = \int (f^*)'(\tau)\,\frac{\partial\tau}{\partial\eta_i}\,d\mu = \int \rho(p(\zeta))\,\frac{\partial\tau}{\partial\eta_i}\,d\mu = \int\left(\sum_j \theta^j\lambda_j(\zeta)\right)\frac{\partial\tau}{\partial\eta_i}\,d\mu = \sum_j \theta^j \int \lambda_j(\zeta)\,\frac{\partial\tau}{\partial\eta_i}\,d\mu = \sum_j \theta^j\,\delta_i^j = \theta^i.$$
Part ii, namely, the biorthogonality of $\theta$ and $\eta$, is thus established. Evaluating
$$\sum_i \theta^i\,\frac{\partial\Phi}{\partial\theta^i} - \Phi(\theta) = \int \tau(p(\zeta; \theta))\left(\sum_i \theta^i\lambda_i(\zeta)\right)d\mu - \int f(\rho(p(\zeta; \theta)))\,d\mu = \int \tau(p(\zeta; \theta))\,\rho(p(\zeta; \theta))\,d\mu - \int f(\rho(p(\zeta; \theta)))\,d\mu = \int f^*(\tau(p(\zeta; \eta)))\,d\mu = \Phi^*(\eta)$$
establishes the conjugacy between $\Phi$ and $\Phi^*$, and hence the strict convexity of $\Phi^*$, as claimed in part i. Finally, substituting these expressions into equation 3.4 establishes part v. Therefore, we have proved all the relations stated in this proposition. ∎
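Parts i and ii of proposition 9 can be illustrated numerically for the simplest $\rho$-affine family, the (denormalized) exponential family with $f(t) = e^t$, $\rho = \log$, $\tau = $ identity. The sketch below (mine, with an arbitrary random basis $\lambda_i$ and a uniform base measure on a finite sample space) checks $\partial\Phi/\partial\theta^i = \eta_i$ and $\partial^2\Phi/\partial\theta^i\partial\theta^j = g_{ij}$ against finite differences:

```python
import numpy as np

# rho-affine family with f(t) = e^t, rho = log, tau = identity:
# rho(p) = sum_i theta^i lambda_i(zeta), i.e. p = exp(theta . lambda).
rng = np.random.default_rng(1)
lam = rng.normal(size=(3, 40))          # basis functions lambda_i(zeta)
mu = np.full(40, 1.0 / 40)              # base measure on a 40-point sample space
theta = np.array([0.3, -0.2, 0.5])

Phi = lambda th: np.sum(np.exp(th @ lam) * mu)   # Phi(theta) = int f(rho) dmu
p = np.exp(theta @ lam)
eta = (p * lam * mu).sum(axis=1)                 # eta_i = int tau(p) lambda_i dmu (3.14)
g = np.einsum('z,iz,jz,z->ij', p, lam, lam, mu)  # g_ij = int f''(rho) lam_i lam_j dmu

# finite-difference gradient and Hessian of Phi
h = 1e-5
grad = np.zeros(3)
hess = np.zeros((3, 3))
for i in range(3):
    ei = np.eye(3)[i] * h
    grad[i] = (Phi(theta + ei) - Phi(theta - ei)) / (2 * h)
    for j in range(3):
        ej = np.eye(3)[j] * h
        hess[i, j] = (Phi(theta + ei + ej) - Phi(theta + ei - ej)
                      - Phi(theta - ei + ej) + Phi(theta - ei - ej)) / (4 * h * h)
```

The positive-definiteness of $g$ reflects the linear independence of the (randomly drawn) basis functions, as in the proof above.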
Remark 3.4.1. This is a generalization of the results about $\alpha$-affine manifolds (Amari, 1985; Amari & Nagaoka, 2000), where the $\rho$- and $\tau$-representations are just the $\alpha$- and $(-\alpha)$-representations, respectively. Proposition 9 says that when the $\lambda_i(\zeta)$'s are used as the basis functions of the sample space, $\theta$ is the natural (contravariant) coordinate to express $\rho(p)$, while $\eta$ is the expectation (covariant) coordinate to express $\tau(p)$. They are biorthogonal,
$$\int \frac{\partial\rho}{\partial\theta^i}\,\frac{\partial\tau}{\partial\eta_j}\,d\mu = \delta_i^j,$$
when the $\rho$- (or $\tau$-) representation of the density function is embeddable into the finite-dimensional affine space. The natural and expectation parameters are related to the $\tau$- and to the $\rho$-representation, respectively, via
$$\tau(p(\zeta)) = f'\left(\sum_i \theta^i\lambda_i(\zeta)\right), \qquad \eta_i = \int f'(\rho(p))\,\lambda_i(\zeta)\,d\mu.$$
With the expectation parameter $\eta$, one may express the divergence functional $\mathcal{D}_{f,\rho}^{(\alpha)}(\eta_p, \eta_q)$ and obtain the corresponding metric and dual connection pair. The properties of the statistical manifold on $M_\eta$ are shown by the next proposition.

Proposition 10. The metric tensor $\hat g^{ij}$ and the dual connections $\hat\Gamma^{(\alpha)ij,k}$, $\hat\Gamma^{*(\alpha)ij,k}$ induced by $\mathcal{D}_{f,\rho}^{(\alpha)}(\eta_p, \eta_q)$ are related to those (expressed in lower indices) induced by $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ via
$$\sum_l g_{il}(\theta)\,\hat g^{lm}(\eta) = \delta_i^m, \qquad (3.16)$$
and
$$\hat\Gamma^{(\alpha)ij,k}(\eta) = -\sum_{l,m,n} \hat g^{im}(\eta)\,\hat g^{jn}(\eta)\,\hat g^{kl}(\eta)\,\Gamma^{(-\alpha)}_{ml,n}(\theta), \qquad (3.17)$$
$$\hat\Gamma^{*(\alpha)ij,k}(\eta) = -\sum_{l,m,n} \hat g^{im}(\eta)\,\hat g^{jn}(\eta)\,\hat g^{kl}(\eta)\,\Gamma^{(\alpha)}_{ml,n}(\theta), \qquad (3.18)$$
where $\eta$ and $\theta$ are biorthogonal.

Proof. The relation 3.16 follows proposition 9. To prove equation 3.17, we write out $\hat\Gamma^{(\alpha)ij,k}$ following proposition 8 (note that the upper- and lower-case indices here are pro forma):
$$\hat\Gamma^{(\alpha)ij,k} = E_\mu\left\{\frac{1-\alpha}{2}\,\frac{\partial^2\tau}{\partial\eta_i\,\partial\eta_j}\frac{\partial\rho}{\partial\eta_k} + \frac{1+\alpha}{2}\,\frac{\partial^2\rho}{\partial\eta_i\,\partial\eta_j}\frac{\partial\tau}{\partial\eta_k}\right\}$$
$$= E_\mu\left\{\frac{1-\alpha}{2}\left(\sum_l \frac{\partial\rho}{\partial\theta^l}\frac{\partial\theta^l}{\partial\eta_k}\right)\left(\sum_m \frac{\partial\theta^m}{\partial\eta_i}\,\frac{\partial}{\partial\theta^m}\frac{\partial\tau}{\partial\eta_j}\right) + \frac{1+\alpha}{2}\left(\sum_l \frac{\partial\tau}{\partial\theta^l}\frac{\partial\theta^l}{\partial\eta_k}\right)\left(\sum_m \frac{\partial\theta^m}{\partial\eta_i}\,\frac{\partial}{\partial\theta^m}\frac{\partial\rho}{\partial\eta_j}\right)\right\}$$
$$= E_\mu\left\{\sum_{l,m} \frac{\partial\theta^l}{\partial\eta_k}\frac{\partial\theta^m}{\partial\eta_i}\left[\frac{1-\alpha}{2}\,\frac{\partial\rho}{\partial\theta^l}\,\frac{\partial}{\partial\theta^m}\left(\sum_n \frac{\partial\tau}{\partial\theta^n}\frac{\partial\theta^n}{\partial\eta_j}\right) + \frac{1+\alpha}{2}\,\frac{\partial\tau}{\partial\theta^l}\,\frac{\partial}{\partial\theta^m}\left(\sum_n \frac{\partial\rho}{\partial\theta^n}\frac{\partial\theta^n}{\partial\eta_j}\right)\right]\right\}$$
$$= \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k}\frac{\partial\theta^m}{\partial\eta_i}\frac{\partial\theta^n}{\partial\eta_j}\,E_\mu\left\{\frac{1-\alpha}{2}\,\frac{\partial\rho}{\partial\theta^l}\frac{\partial^2\tau}{\partial\theta^m\,\partial\theta^n} + \frac{1+\alpha}{2}\,\frac{\partial\tau}{\partial\theta^l}\frac{\partial^2\rho}{\partial\theta^m\,\partial\theta^n}\right\} + \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k}\frac{\partial\theta^m}{\partial\eta_i}\,\frac{\partial}{\partial\theta^m}\left(\frac{\partial\theta^n}{\partial\eta_j}\right)E_\mu\left\{\frac{1-\alpha}{2}\,\frac{\partial\rho}{\partial\theta^l}\frac{\partial\tau}{\partial\theta^n} + \frac{1+\alpha}{2}\,\frac{\partial\tau}{\partial\theta^l}\frac{\partial\rho}{\partial\theta^n}\right\}$$
$$= \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k}\frac{\partial\theta^m}{\partial\eta_i}\frac{\partial\theta^n}{\partial\eta_j}\,\Gamma^{(\alpha)}_{mn,l} + \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k}\frac{\partial\theta^m}{\partial\eta_i}\,\frac{\partial}{\partial\theta^m}\left(\frac{\partial\theta^n}{\partial\eta_j}\right)g_{nl}.$$
Since
$$\sum_n \frac{\partial}{\partial\theta^m}\left(\frac{\partial\theta^n}{\partial\eta_j}\right)g_{nl} = \sum_n \frac{\partial\hat g^{jn}}{\partial\theta^m}\,g_{nl} = -\sum_n \hat g^{jn}\,\frac{\partial g_{nl}}{\partial\theta^m} = -\sum_n \hat g^{jn}\left(\Gamma^{(\alpha)}_{mn,l} + \Gamma^{(-\alpha)}_{ml,n}\right),$$
where the last step is from
$$\frac{\partial g_{nl}}{\partial\theta^m} = \Gamma^{(\alpha)}_{mn,l} + \Gamma^{*(\alpha)}_{ml,n} = \Gamma^{(\alpha)}_{mn,l} + \Gamma^{(-\alpha)}_{ml,n},$$
assertion 3.17 is proved after direct substitution. Observing the duality $\hat\Gamma^{*(\alpha)ij,k} = \hat\Gamma^{(-\alpha)ij,k}$ leads to equation 3.18. ∎

Remark 3.4.2. The relation between $g$ and $\Gamma$ in their subscript and superscript forms is analogous to that stated by proposition 5. However, note the
conjugacy of $\alpha$ in the correspondence $\hat\Gamma^{(\alpha)ij,k} \leftrightarrow \Gamma^{(-\alpha)}_{ml,n}$, due to the change between $\theta$- and $\eta$-coordinates, both under the $\rho$-representation. On the other hand, similar to corollary 3, the metric $\bar g^{ij}$ and the dual affine connections $\bar\Gamma^{(\alpha)ij,k}, \bar\Gamma^{*(\alpha)ij,k}$ of the statistical manifold (denoted using bar) induced by the conjugate divergence functions $\mathcal{D}_{f^*,\tau}^{(\alpha)}(\eta_p, \eta_q)$ are related to those (denoted using hat) induced by $\mathcal{D}_{f,\rho}^{(\alpha)}(\eta_p, \eta_q)$ via
$$\bar g^{ij}(\eta) = \hat g^{ij}(\eta),$$
with
$$\bar\Gamma^{(\alpha)ij,k}(\eta) = \hat\Gamma^{(-\alpha)ij,k}(\eta), \qquad \bar\Gamma^{*(\alpha)ij,k}(\eta) = \hat\Gamma^{(\alpha)ij,k}(\eta).$$
3.5 Divergence Functional from Generalized Mean. When $f$ is, in addition to being strictly convex, strictly monotone increasing, we may set $\rho = f^{-1}$, so that the divergence functional becomes
$$\mathcal{D}_\rho^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\int\left(\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q - \rho^{-1}\left(\frac{1-\alpha}{2}\,\rho(p) + \frac{1+\alpha}{2}\,\rho(q)\right)\right)d\mu. \qquad (3.19)$$
Note that for $\alpha \in [-1, 1]$,
$$M_\rho^{(\alpha)}(p, q) \equiv \rho^{-1}\left(\frac{1-\alpha}{2}\,\rho(p) + \frac{1+\alpha}{2}\,\rho(q)\right)$$
defines a generalized mean ("quasi-linear mean" in Hardy, Littlewood, & Pólya, 1952) associated with a concave and monotone function $\rho: \mathbf{R}^+ \to \mathbf{R}$. Viewed in this way, the divergence is related to the departure of the linear (arithmetic) mean from the quasi-linear mean induced by a nonlinear function with nonzero concavity/convexity.

Example 3.5.1. Take $\rho(p) = \log p$; then $M_\rho^{(\alpha)}(p, q) = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}$, and $\mathcal{D}_\rho^{(\alpha)}(p, q)$ is the $\alpha$-divergence (see equation 1.2). For a general concave $\rho$,
$$\mathcal{D}_\rho^{(1)}(p, q) = \int\bigl(p - q - (\rho^{-1})'(\rho(q))\,(\rho(p) - \rho(q))\bigr)\,d\mu = \mathcal{D}_\rho^{(-1)}(q, p)$$
is an immediate generalization of the extended Kullback-Leibler divergence in equation 1.1.

To further explore the divergence functionals associated with the quasi-linear mean operator, we impose a homogeneity requirement, such that the
divergence is invariant after scaling ($\kappa \in \mathbf{R}^+$):
$$\mathcal{D}_\rho^{(\alpha)}(\kappa p, \kappa q) = \kappa\,\mathcal{D}_\rho^{(\alpha)}(p, q).$$

Proposition 11. The only measure-invariant divergence functional associated with the quasi-linear mean operator $M_\rho^{(\alpha)}$ is the two-parameter family
$$\mathcal{D}^{(\alpha,\beta)}(p, q) \equiv \frac{4}{1-\alpha^2}\,\frac{2}{1+\beta}\int\left(\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q - \left(\frac{1-\alpha}{2}\,p^{\frac{1-\beta}{2}} + \frac{1+\alpha}{2}\,q^{\frac{1-\beta}{2}}\right)^{\frac{2}{1-\beta}}\right)d\mu, \qquad (3.20)$$
which results from the alpha-representation (indexed by $\beta$ here), $\rho(p) = l^{(\beta)}(p)$, as given by equation 1.8. Here $(\alpha, \beta) \in [-1, 1] \times [-1, 1]$, and the factor $2/(1+\beta)$ is introduced to make $\mathcal{D}^{(\alpha,\beta)}(p, q)$ well defined for $\beta = -1$.

Proof. The homogeneity requirement implies that
$$\rho^{-1}\left(\frac{1-\alpha}{2}\,\rho(\kappa p) + \frac{1+\alpha}{2}\,\rho(\kappa q)\right) = \kappa\,\rho^{-1}\left(\frac{1-\alpha}{2}\,\rho(p) + \frac{1+\alpha}{2}\,\rho(q)\right).$$
By a lemma in Hardy et al. (1952, p. 68), the general solution to the above functional equation is
$$\rho(t) = \begin{cases} a\,t^s + b & s \ne 0 \\ a\log t + b & s = 0, \end{cases}$$
with corresponding
$$M_s^{(\alpha)}(p, q) = \left(\frac{1-\alpha}{2}\,p^s + \frac{1+\alpha}{2}\,q^s\right)^{\frac{1}{s}}, \qquad M_0^{(\alpha)}(p, q) = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}.$$
Here $a, b, s$ are all constants. Strict concavity of $\rho$ requires $0 \le s \le 1$ and $a > 0$. Since it is easily verified that $\mathcal{D}_\rho^{(\alpha)} = \mathcal{D}_{a\rho+b}^{(\alpha)}$, without loss of generality we have $\rho(p) = l^{(\beta)}(p)$, $\beta \in [-1, 1]$, where $s = \frac{1-\beta}{2}$. This gives rise to equation 3.20. ∎
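The defining properties of the two-parameter family in equation 3.20 can be checked numerically: it vanishes only at $p = q$, it is homogeneous of degree one (measure invariance), and as $\beta \to 1$ the quasi-linear mean tends to the weighted geometric mean, recovering the $\alpha$-divergence of Example 3.5.1. A small sketch (mine, on a finite sample space with uniform base measure):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(0.2, 2.0, size=30)
q = rng.uniform(0.2, 2.0, size=30)

def D(a, b, p, q):
    """Measure-invariant divergence D^{(alpha,beta)} of equation 3.20."""
    w1, w2 = (1 - a) / 2, (1 + a) / 2
    s = (1 - b) / 2
    quasi_mean = (w1 * p**s + w2 * q**s) ** (1 / s)   # M_s^{(alpha)}(p, q)
    return 4 / (1 - a**2) * 2 / (1 + b) * np.sum(w1 * p + w2 * q - quasi_mean)

def D_alpha(a, p, q):
    """Equation 3.19 with rho = log: the alpha-divergence (geometric quasi-mean)."""
    w1, w2 = (1 - a) / 2, (1 + a) / 2
    return 4 / (1 - a**2) * np.sum(w1 * p + w2 * q - p**w1 * q**w2)

a, b = 0.3, 0.4
k = 2.5  # scaling factor for the homogeneity check
```

The $\beta \to 1$ comparison is made at $\beta = 0.9999$, so it holds only up to a small discretization of the limit.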
Proposition 12 (corollary to proposition 7). The two-parameter family of divergence functions $\mathcal{D}^{(\alpha,\beta)}(\theta_p, \theta_q)$ induces a statistical manifold with the Fisher information as its metric and the generic alpha-connections as its dual connection pair:
$$g_{ij} = E_p\left\{\frac{\partial\log p}{\partial\theta^i}\frac{\partial\log p}{\partial\theta^j}\right\},$$
$$\Gamma^{(\alpha,\beta)}_{ij,k} = E_p\left\{\frac{\partial^2\log p}{\partial\theta^i\,\partial\theta^j}\frac{\partial\log p}{\partial\theta^k} + \frac{1-\alpha\beta}{2}\,\frac{\partial\log p}{\partial\theta^i}\frac{\partial\log p}{\partial\theta^j}\frac{\partial\log p}{\partial\theta^k}\right\},$$
$$\Gamma^{*(\alpha,\beta)}_{ij,k} = E_p\left\{\frac{\partial^2\log p}{\partial\theta^i\,\partial\theta^j}\frac{\partial\log p}{\partial\theta^k} + \frac{1+\alpha\beta}{2}\,\frac{\partial\log p}{\partial\theta^i}\frac{\partial\log p}{\partial\theta^j}\frac{\partial\log p}{\partial\theta^k}\right\}.$$

Proof. Applying formulas 3.6 to 3.8 to the measure-invariant divergence functional $\mathcal{D}_\rho^{(\alpha)}(p, q)$ with $\rho(p) = l^{(\beta)}(p)$ and $f = \rho^{-1}$ gives rise to the desired result. ∎
Remark 3.5.2. This two-parameter family of affine connections $\Gamma^{(\alpha,\beta)}_{ij,k}$, indexed now by the numerical product $\alpha\beta \in [-1, 1]$, is actually in the generic form of an alpha-connection,
$$\Gamma^{(\alpha,\beta)}_{ij,k} = \Gamma^{(-\alpha,-\beta)}_{ij,k},$$
with biduality compactly expressed as
$$\Gamma^{*(\alpha,\beta)}_{ij,k} = \Gamma^{(-\alpha,\beta)}_{ij,k} = \Gamma^{(\alpha,-\beta)}_{ij,k}. \qquad (3.21)$$
The parameters $\alpha \in [-1, 1]$ and $\beta \in [-1, 1]$ reflect referential duality and representational duality, respectively. Among this two-parameter family, the Levi-Civita connection results when either $\alpha$ or $\beta$ equals 0. When $\alpha = \pm 1$ or $\beta = \pm 1$, each case reduces to the one-parameter version of the generic alpha-connection. The family $\mathcal{D}^{(\alpha,\beta)}$ is then a generalization of Amari's alpha-divergence, equation 1.2, with
$$\lim_{\alpha\to -1}\mathcal{D}^{(\alpha,\beta)}(p, q) = A^{(-\beta)}(p, q), \qquad \lim_{\alpha\to 1}\mathcal{D}^{(\alpha,\beta)}(p, q) = A^{(\beta)}(p, q), \qquad \lim_{\beta\to 1}\mathcal{D}^{(\alpha,\beta)}(p, q) = A^{(\alpha)}(p, q),$$
where the last equation is due to $\lim_{\beta\to 1} M_s^{(\alpha)} = M_0^{(\alpha)} = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}$. On the other hand, when $\beta \to -1$, we have the interesting asymptotic relation,
$$\lim_{\beta\to -1}\mathcal{D}^{(\alpha,\beta)}(p, q) = E^{(\alpha)}(p, q),$$
where $E^{(\alpha)}$ is the Jensen difference, equation 2.5, discussed by Rao (1987).

3.6 Parametric Family of Csiszár's $f$-Divergence. The fact (see proposition 12) that our two-parameter family of divergence functions $\mathcal{D}^{(\alpha,\beta)}$ actually induces a one-dimensional family of alpha-connections is by no means surprising. This is because $\mathcal{D}^{(\alpha,\beta)}$ obviously falls within Csiszár's
$f$-divergence (see equation 1.3), the generic form for measure-invariant divergence, where
$$f^{(\alpha,\beta)}(t) = \frac{8}{(1-\alpha^2)(1+\beta)}\left(\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\,t - \left(\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\,t^{\frac{1-\beta}{2}}\right)^{\frac{2}{1-\beta}}\right) \qquad (3.22)$$
is now a two-parameter family with $f^{(\alpha,\beta)}(1) = 0$, $(f^{(\alpha,\beta)})'(1) = 0$, $(f^{(\alpha,\beta)})''(1) = 1$. That the alpha index is given by the product $\alpha\beta$ in this case follows explicitly from calculating $(f^{(\alpha,\beta)})'''(1)$ using equation 1.5. What is interesting in this regard is the distinct roles played by $\alpha$ (for referential duality) and by $\beta$ (for representational duality). The parameters $(\alpha, \beta) \in [-1, 1] \times [-1, 1]$ form an interesting topological structure of a Moebius band in the space of divergence functions, all with identical Fisher information and the family of alpha-connections.

We may generalize Csiszár's $f$-divergence to construct a family of measure-invariant divergence functionals in the following way. Given a smooth, strictly convex function $f(t)$, construct the family (for $\gamma \in \mathbf{R}$)
$$G_f^{(\gamma)}(t) = \frac{4}{1-\gamma^2}\left(\frac{1-\gamma}{2}\,f(1) + \frac{1+\gamma}{2}\,f(t) - f\left(\frac{1-\gamma}{2} + \frac{1+\gamma}{2}\,t\right)\right),$$
with $G_f^{(-1)}(t) = g(t)$ as given in equation 1.6. It is easy to verify that for an arbitrary $\gamma$, $G_f^{(\gamma)}$ is a proper Csiszár function with $G_f^{(\gamma)}(1) = 0$, $(G_f^{(\gamma)})'(1) = 0$, and that
$$(G_f^{(\gamma)})''(1) = f''(1), \qquad (G_f^{(\gamma)})'''(1) = \frac{\gamma+3}{2}\,f'''(1),$$
so the statistical manifold generated by $G_f^{(\gamma)}$ has the same metric as that generated by $f$ but a family of parameterized alpha-connections. If we take $f(t) = f^{(\alpha)}(t)$ as in equation 1.4, then $G^{(\gamma,\alpha)}$ will generate a two-parameter family of alpha-connections with the effective alpha value $3 + (\alpha - 3)(\gamma + 3)/2$. We note in passing that repeating this process, by having $G^{(\gamma,\alpha)}$ now take the role of $f$, may lead to nested (e.g., two-, three-parameter) families of alpha-connections.
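The derivative relations for $G_f^{(\gamma)}$ at $t = 1$ can be confirmed by finite differences. The sketch below (mine) uses the Csiszár-type convex function $f(t) = t\log t$ as an illustrative choice, so $f''(1) = 1$ and $f'''(1) = -1$; the third derivative is only estimated numerically, hence the loose tolerance:

```python
import numpy as np

gamma = 0.5
f = lambda t: t * np.log(t)   # illustrative smooth, strictly convex f

def G(t):
    """The family G_f^{(gamma)} constructed from f as in section 3.6."""
    return 4 / (1 - gamma**2) * (
        (1 - gamma) / 2 * f(1.0) + (1 + gamma) / 2 * f(t)
        - f((1 - gamma) / 2 + (1 + gamma) / 2 * t))

# central finite-difference derivatives
def d1(F, x, h=1e-5):
    return (F(x + h) - F(x - h)) / (2 * h)

def d2(F, x, h=1e-4):
    return (F(x + h) - 2 * F(x) + F(x - h)) / h**2

def d3(F, x, h=1e-2):
    return (F(x + 2*h) - 2*F(x + h) + 2*F(x - h) - F(x - 2*h)) / (2 * h**3)
```

The checks confirm $G_f^{(\gamma)}(1) = 0$, $(G_f^{(\gamma)})'(1) = 0$, $(G_f^{(\gamma)})''(1) = f''(1)$, and $(G_f^{(\gamma)})'''(1) = \frac{\gamma+3}{2} f'''(1)$.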
4 General Discussion

This article introduced several families of divergence functions and functionals, all based on the fundamental inequality of an arbitrary smooth and strictly convex function. In the finite-dimensional case, the convex mixture parameter, $\alpha$, which reflects referential duality, turns out to correspond to
192
J. Zhang
the \alpha parameter in the one-parameter family of \alpha-connections in the sense of Lauritzen (1987), which includes the flat connections (\alpha = \pm 1) induced by Bregman divergence. The biorthogonal coordinates related to the inducing convex function and its conjugate (Amari's dual potentials) reflect representational duality. In the infinite-dimensional case, with the notion of conjugate (i.e., \rho- and \tau-) embeddings of density functions, the form of the constructed divergence functionals generalizes the familiar ones (\alpha-divergence and f-divergence). The resulting \alpha-connections, equation 3.7, or equivalently, equation 3.10, have the most general yet explicit form found in the literature. When densities are \rho-affine, they specialize to \alpha-connections in the finite-dimensional vector space mentioned above. When measure invariance is imposed, they specialize to the family of alpha-connections proper, but now with two parameters, one reflecting reference duality and the other representational duality. These findings will enrich the theory of information geometry and make it applicable to finite-dimensional vector spaces (not necessarily of parameters of probability densities) as well as to infinite-dimensional function spaces (not necessarily of normalized density functions).

In terms of neural computation, to the extent that alpha-divergence and alpha-connections generate deep analytic insights (e.g., Amari, Ikeda, & Shimokawa, 2001; Takeuchi & Amari, submitted), these theoretical results may help facilitate those analyses by clarifying the meaning of duality in projection-based algorithms. Previously, alpha-divergence, in its extended form (see equation 1.2), was shown (Amari & Nagaoka, 2000) to be the canonical divergence for the \alpha-affine family of densities (densities that, under the \alpha-representation l^{(\alpha)}, are spanned by an affine subspace). Therefore, for a given \alpha value, there is only one such family that induces the flat \alpha-connection with all components zero when expressed in suitable coordinates (as special cases, \Gamma^{(1)} = 0 for the exponential family and \Gamma^{(-1)} = 0 for the mixture family). This is slightly different from the view of Zhu and Rohwer (1995, 1997), who, in their Bayesian inference framework, simply treated \alpha as a parameter of the entire class of (\alpha-)divergences (between any two densities), which yields, through the Eguchi relation, flat connections only when \alpha = \pm 1. These apparently disparate interpretations, subtle as the difference is, have now been reconciled. The current framework points out two related but different senses of duality in information geometry: representational duality and reference duality. Further, it has been clarified how the same one-parameter family of dual alpha-connections may actually embody both kinds of duality. Future research will illuminate how this notion of biduality in characterizing the asymmetric difference between two density functions or two parameters may capture the very essence of computational algorithms of inference, optimization, and adaptation.
Divergence Function, Duality, and Convex Analysis
193
Acknowledgments

I thank Huaiyu Zhu for introducing me to the topic of divergence functions and information geometry. Part of this work was presented at the International Conference on Information Geometry and Its Applications, Pescara, 2002, where I benefited from direct feedback and extensive discussions with many conference participants, including S. Amari and S. Eguchi in particular. Discussions with Matt Jones at the University of Michigan have helped improve the presentation.
References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Amari, S. (1982). Differential geometry of curved exponential families: Curvatures and information loss. Annals of Statistics, 10, 357–385.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer-Verlag.
Amari, S. (1991). Dualistic geometry of the manifold of higher-order neurons. Neural Networks, 4, 443–451.
Amari, S. (1995). Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8, 1379–1408.
Amari, S., Ikeda, S., & Shimokawa, H. (2001). Information geometry and mean field approximation: The \alpha-projection approach. In M. Opper & D. Saad (Eds.), Advanced mean field methods: Theory and practice (pp. 241–257). Cambridge, MA: MIT Press.
Amari, S., Kurata, K., & Nagaoka, H. (1992). Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 3, 260–271.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Bauschke, H. H., Borwein, J. M., & Combettes, P. L. (2002). Bregman monotone optimization algorithms (CECM Preprint 02:184). Available online: http://www.cecm.sfu.ca/preprints/2002pp.html.
Bauschke, H. H., & Combettes, P. L. (2002). Iterating Bregman retractions (CECM Preprint 02:186). Available online: http://www.cecm.sfu.ca/preprints/2002pp.html.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Chentsov, N. N. (1982). Statistical decision rules and optimal inference. Providence, RI: American Mathematical Society.
Csiszár, I. (1967). On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica, 2, 329–339.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (2002). Duality and auxiliary functions for Bregman distances (Tech. Rep. No. CMU-CS-01-109). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Annals of Statistics, 11, 793–803.
Eguchi, S. (1992). Geometry of minimum contrast. Hiroshima Mathematical Journal, 22, 631–647.
Eguchi, S. (2002, June). U-boosting method for classification and information geometry. Paper presented at the SRCCS International Statistical Workshop, Seoul National University.
Hardy, G., Littlewood, J. E., & Pólya, G. (1952). Inequalities. Cambridge: Cambridge University Press.
Ikeda, S., Amari, S., & Nakahara, H. (1999). Convergence of the wake-sleep algorithm. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 239–245). Cambridge, MA: MIT Press.
Kass, R. E., & Vos, P. W. (1997). Geometrical foundations of asymptotic inference. New York: Wiley.
Kurose, T. (1994). On the divergences of 1-conformally flat statistical manifolds. Tôhoku Mathematical Journal, 46, 427–433.
Lafferty, J., Della Pietra, S., & Della Pietra, V. (1997). Statistical learning algorithms based on Bregman distances. In Proceedings of the 1997 Canadian Workshop on Information Theory (pp. 77–80). Toronto, Canada: Fields Institute.
Lauritzen, S. (1987). Statistical manifolds. In S. Amari, O. Barndorff-Nielsen, R. Kass, S. Lauritzen, & C. R. Rao (Eds.), Differential geometry in statistical inference (pp. 163–216). Hayward, CA: Institute of Mathematical Statistics.
Lebanon, G., & Lafferty, J. (2002). Boosting and maximum likelihood for exponential models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 447–454). Cambridge, MA: MIT Press.
Matsuzoe, H. (1998). On realization of conformally-projectively flat statistical manifolds and the divergences. Hokkaido Mathematical Journal, 27, 409–421.
Matsuzoe, H. (1999). Geometry of contrast functions and conformal geometry. Hiroshima Mathematical Journal, 29, 175–191.
Matumoto, T. (1993). Any statistical manifold has a contrast function: On the C³-functions taking the minimum at the diagonal of the product manifold. Hiroshima Mathematical Journal, 23, 327–332.
Mihoko, M., & Eguchi, S. (2002). Robust blind source separation by beta divergence. Neural Computation, 14, 1859–1886.
Rao, C. R. (1987). Differential metrics in probability spaces. In S. Amari, O. Barndorff-Nielsen, R. Kass, S. Lauritzen, & C. R. Rao (Eds.), Differential geometry in statistical inference (pp. 217–240). Hayward, CA: Institute of Mathematical Statistics.
Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
Shima, H. (1978). Compact locally Hessian manifolds. Osaka Journal of Mathematics, 15, 509–513.
Shima, H., & Yagi, K. (1997). Geometry of Hessian manifolds. Differential Geometry and Its Applications, 7, 277–290.
Takeuchi, J., & Amari, S. (submitted). \alpha-parallel prior and its properties. IEEE Transactions on Information Theory.
Uohashi, K., Ohara, A., & Fujii, T. (2000). 1-conformally flat statistical submanifolds. Osaka Journal of Mathematics, 37, 501–507.
Zhu, H. Y., & Rohwer, R. (1995). Bayesian invariant measurements of generalization. Neural Processing Letters, 2, 28–31.
Zhu, H. Y., & Rohwer, R. (1997). Measurements of generalisation based on information geometry. In S. W. Ellacott, J. C. Mason, & I. J. Anderson (Eds.), Mathematics of neural networks: Models, algorithms and applications (pp. 394–398). Norwell, MA: Kluwer.

Received October 29, 2002; accepted June 26, 2003.
LETTER
Communicated by Brendan Frey
Linear Response Algorithms for Approximate Inference in Graphical Models

Max Welling
[email protected]
Department of Computer Science, University of Toronto, Toronto M5S 3G4, Canada
Yee Whye Teh
[email protected]
Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, U.S.A.
Belief propagation (BP) on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. However, it does not prescribe a way to compute joint distributions over pairs of distant nodes in the graph. In this article, we propose two new algorithms for approximating these pairwise probabilities, based on the linear response theorem. The first is a propagation algorithm that is shown to converge if BP converges to a stable fixed point. The second algorithm is based on matrix inversion. Applying these ideas to gaussian random fields, we derive a propagation algorithm for computing the inverse of a matrix.

1 Introduction

Like Markov chain Monte Carlo sampling and variational methods, belief propagation (BP) has become an important tool for approximate inference on graphs with cycles. Especially in the field of error correction decoding, it has brought performance very close to the Shannon limit (Frey & MacKay, 1997). A number of studies of BP have gradually increased our understanding of the convergence properties and accuracy of the algorithm (Weiss & Freeman, 2001; Weiss, 2000). In particular, recent developments show that the stable fixed points are local minima of the Bethe free energy (Yedidia, Freeman, & Weiss, 2000; Heskes, in press). This insight paved the way for more sophisticated generalized BP algorithms (Yedidia, Freeman, & Weiss, 2002) and convergent alternatives to BP (Yuille, 2002; Teh & Welling, 2001). Other developments include the expectation propagation algorithm designed to propagate sufficient statistics of members of the exponential family (Minka, 2001).

Despite its success, BP does not provide a prescription to compute joint probabilities over pairs of nonneighboring nodes in the graph. When the

Neural Computation 16, 197–221 (2004)
© 2003 Massachusetts Institute of Technology
198
M. Welling and Y. W. Teh
graph is a tree, there is a single chain connecting any two nodes, and dynamic programming can be used to integrate out the internal variables efficiently. However, when cycles exist, it is not clear what approximate procedure is appropriate. It is precisely this problem that we address in this article. We show that the required estimates can be obtained by computing the sensitivity of the node marginals to small changes in the node potentials. Based on this idea, we present two algorithms to estimate the joint probabilities of arbitrary pairs of nodes.

These results are interesting in the inference domain and may have future applications to learning graphical models from data. For instance, information about dependencies between random variables is relevant for learning the structure of a graph and the parameters encoding the interactions. Another possible application area is active learning. Since the node potentials encode the external evidence flowing into the network and since we compute the sensitivity of the marginal distributions to changing this external evidence, this information can be used to search for good nodes to collect additional data. For instance, nodes that have a big impact on the system seem to be good candidates.

The letter is organized as follows. Factor graphs are introduced in section 2. Section 3 reviews the Gibbs free energy and two popular approximations: the mean field and Bethe approximations. In section 4, we explain the ideas behind the linear response estimates of pairwise probabilities and prove a number of useful properties that they satisfy. We derive an algorithm to compute the linear response estimates by propagating "supermessages" around the graph in section 5; section 6 describes an alternative method based on inverting a matrix. Section 7 describes an application of linear response theory to gaussian networks that gives a novel algorithm to invert matrices.
In experiments (section 8), we compare the accuracy of the new estimates against other methods. We conclude with a discussion of our work in section 9.

2 Factor Graphs

Let V index a collection of random variables \{X_i\}_{i \in V}. Let x_i denote values of X_i. For a subset of nodes \alpha \subset V, let X_\alpha = \{X_i\}_{i \in \alpha} be the variable associated with that subset and x_\alpha be values of X_\alpha. Let A be a family of such subsets of V. The probability distribution over X \doteq X_V is assumed to have the following form,

P_X(X = x) = \frac{1}{Z} \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i(x_i),   (2.1)

where \psi_\alpha, \psi_i are positive potential functions defined on subsets and single nodes, respectively. Z is the normalization constant (or partition function),
Linear Response Algorithms
199
[Figure 1: Example of a factor graph, with variable nodes 1 through 4, factor nodes \alpha, \beta, \gamma, and P \propto \psi_\alpha(1,2)\, \psi_\beta(2,3)\, \psi_\gamma(1,3,4)\, \psi_1(1)\, \psi_2(2)\, \psi_3(3)\, \psi_4(4).]
given by

Z = \sum_x \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i(x_i),   (2.2)
where the sum runs over all possible states x of X. In the following, we will write P(x) \doteq P_X(X = x) for notational simplicity.

The decomposition of equation 2.1 is consistent with a factor graph with function nodes over X_\alpha and variable nodes X_i. Figure 1 shows an example. Neighbors in a factor graph are defined as nodes that are connected by an edge (e.g., subset \alpha and variable 2 are neighbors in Figure 1). For each i \in V, denote its neighbors by N_i = \{\alpha \in A : \alpha \ni i\}, and for each subset \alpha, its neighbors are simply N_\alpha = \{i \in V : i \in \alpha\}.

Factor graphs are a convenient representation for structured probabilistic models and subsume undirected graphical models and acyclic directed graphical models (Kschischang, Frey, & Loeliger, 2001). Further, there is a simple message-passing algorithm for approximate inference that generalizes the BP algorithms on both undirected and acyclic directed graphical models. For that reason, we will state the results of this article in the language of factor graphs.

3 The Gibbs Free Energy

Let B(x) be a variational probability distribution, and let b_\alpha, b_i be its marginal distributions over \alpha \in A and i \in V, respectively. Consider minimizing the following objective, called the Gibbs free energy,

G(B) = -\sum_\alpha \sum_{x_\alpha} b_\alpha(x_\alpha) \log \psi_\alpha(x_\alpha) - \sum_i \sum_{x_i} b_i(x_i) \log \psi_i(x_i) - H(B),   (3.1)
where H(B) is the entropy of B(x),

H(B) = -\sum_x B(x) \log B(x).   (3.2)
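As a concrete illustration of equations 2.1, 2.2, 3.1, and 3.2, the following sketch (a toy model of my own, not the graph of Figure 1) enumerates all states of a small factor graph, computes Z and the exact distribution P, and evaluates the Gibbs free energy at B = P, where it attains its minimum value of -log Z:

```python
import itertools
import math

V = [0, 1, 2]                                      # three binary variables
factors = {                                        # psi_alpha on subsets of V
    (0, 1): lambda x0, x1: 3.0 if x0 == x1 else 1.0,
    (1, 2): lambda x1, x2: 1.5 if x1 != x2 else 1.0,
}
node_pot = {i: (lambda x: 1.5 if x == 1 else 1.0) for i in V}   # psi_i

def unnorm(x):
    """Product of all potentials, i.e., Z * P(x) in equation 2.1."""
    p = 1.0
    for sub, psi in factors.items():
        p *= psi(*[x[i] for i in sub])
    for i in V:
        p *= node_pot[i](x[i])
    return p

states = list(itertools.product([0, 1], repeat=len(V)))
Z = sum(unnorm(x) for x in states)                 # partition function (2.2)
P = {x: unnorm(x) / Z for x in states}             # equation 2.1

# Gibbs free energy (3.1) at B = P: since unnorm(x) is the product of all
# potentials, the energy terms plus -H(B) (3.2) combine into
# sum_x P(x) log(P(x)/unnorm(x)) = -log Z
G = sum(p * (math.log(p) - math.log(unnorm(x))) for x, p in P.items())
print(G, -math.log(Z))    # equal: the Gibbs free energy is minimized at B = P
```

Replacing P by any other distribution B in the last sum adds KL(B || P) >= 0, which is one way to see the minimization claim made in the text.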
It is easy to show that the Gibbs free energy is precisely minimized at B(x) = P(x). In the following, we will use this variational formulation to describe two types of approximations: the mean field and the Bethe approximations.

3.1 The Mean Field Approximation. The mean field approximation uses a restricted set of variational distributions: those that assume independence between all variables x_i: B^{MF}(x) \doteq \prod_i b_i^{MF}(x_i). Plugging this into the Gibbs free energy, we get

G^{MF}(\{b_i^{MF}\}) = -\sum_\alpha \sum_{x_\alpha} \left( \prod_{i \in \alpha} b_i^{MF}(x_i) \right) \log \psi_\alpha(x_\alpha) - \sum_i \sum_{x_i} b_i^{MF}(x_i) \log \psi_i(x_i) - H^{MF}(\{b_i^{MF}\}),   (3.3)

where H^{MF} is the mean field entropy:

H^{MF}(\{b_i^{MF}\}) = -\sum_i \sum_{x_i} b_i^{MF}(x_i) \log b_i^{MF}(x_i).   (3.4)
Minimizing this with respect to b_i^{MF}(x_i) (holding the remaining marginal distributions fixed), we derive the following update equation,

b_i^{MF}(x_i) \leftarrow \frac{1}{\gamma_i}\, \psi_i(x_i) \exp\left( \sum_{\alpha \in N_i} \sum_{x_{\alpha \setminus i}} \log \psi_\alpha(x_\alpha) \prod_{j \in \alpha \setminus i} b_j^{MF}(x_j) \right),   (3.5)

where \gamma_i is a normalization constant. Sequential updates that replace each b_i^{MF}(x_i) by the right-hand side of equation 3.5 are a form of coordinate descent on the mean field Gibbs free energy, which implies that they are guaranteed to converge to a local minimum.

3.2 The Bethe Approximation: Belief Propagation. The mean field approximation ignores all dependencies between the random variables and therefore overestimates the entropy of the model. To obtain a more accurate approximation, we sum the entropies of the subsets \alpha \in A and the nodes i \in V. However, this overcounts the entropies on the overlaps of the subsets \alpha \in A, which we therefore subtract off as follows,

H^{BP}(\{b_\alpha^{BP}, b_i^{BP}\}) = -\sum_\alpha \sum_{x_\alpha} b_\alpha^{BP}(x_\alpha) \log b_\alpha^{BP}(x_\alpha) - \sum_i \sum_{x_i} c_i\, b_i^{BP}(x_i) \log b_i^{BP}(x_i),   (3.6)
where the overcounting numbers are c_i = 1 - |N_i|. The resulting Gibbs free energy is thus given by (Yedidia et al., 2000)

G^{BP}(\{b_i^{BP}, b_\alpha^{BP}\}) = -\sum_\alpha \sum_{x_\alpha} b_\alpha^{BP}(x_\alpha) \log \psi_\alpha(x_\alpha) - \sum_i \sum_{x_i} b_i^{BP}(x_i) \log \psi_i(x_i) - H^{BP}(\{b_i^{BP}, b_\alpha^{BP}\}),   (3.7)

where the following local constraints need to be imposed,¹

\sum_{x_{\alpha \setminus i}} b_\alpha^{BP}(x_\alpha) = b_i^{BP}(x_i) \qquad \forall \alpha \in A,\ i \in \alpha,\ x_i,   (3.8)
in addition to the constraints that all marginal distributions should be normalized.

Yedidia et al. (2000) showed that this constrained minimization problem may be solved by propagating messages over the links of the graph. Since the graph is bipartite, we need only to introduce messages from factor nodes to variable nodes m_{\alpha i}(x_i) and messages from variable nodes to factor nodes n_{i\alpha}(x_i). The following fixed-point equations can now be derived that solve for a local minimum of the BP-Gibbs free energy,

n_{i\alpha}(x_i) \leftarrow \psi_i(x_i) \prod_{\beta \in N_i \setminus \alpha} m_{\beta i}(x_i)   (3.9)

m_{\alpha i}(x_i) \leftarrow \sum_{x_{\alpha \setminus i}} \psi_\alpha(x_\alpha) \prod_{j \in N_\alpha \setminus i} n_{j\alpha}(x_j).   (3.10)
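The message updates 3.9 and 3.10 can be exercised on a small example. The sketch below (my own toy chain of three binary variables, not from the article) iterates both message types and then forms the node marginals via equation 3.12, checking them against brute-force marginalization; on a tree-structured graph, BP is exact:

```python
import itertools
import numpy as np

# factor name -> (ordered variable subset, table indexed in that order)
factors = {'a': ((0, 1), np.array([[2.0, 1.0], [1.0, 3.0]])),
           'b': ((1, 2), np.array([[1.0, 2.0], [2.0, 1.0]]))}
node = [np.array([1.0, 1.5]), np.array([1.0, 1.0]), np.array([2.0, 1.0])]
neigh = {i: [f for f, (sub, _) in factors.items() if i in sub]
         for i in range(3)}

n = {(i, f): np.ones(2) for f, (sub, _) in factors.items() for i in sub}
m = {(f, i): np.ones(2) for f, (sub, _) in factors.items() for i in sub}

for _ in range(10):                      # a few sweeps suffice on this tree
    for (i, f) in n:                     # equation 3.9: variable -> factor
        msg = node[i].copy()
        for g in neigh[i]:
            if g != f:
                msg = msg * m[(g, i)]
        n[(i, f)] = msg / msg.sum()
    for (f, i) in m:                     # equation 3.10: factor -> variable
        sub, tab = factors[f]
        j = sub[1] if sub[0] == i else sub[0]
        t = tab if sub[0] == i else tab.T    # orient axis 0 along x_i
        msg = t @ n[(j, f)]
        m[(f, i)] = msg / msg.sum()

b = []                                   # node beliefs, equation 3.12
for i in range(3):
    bi = node[i].copy()
    for f in neigh[i]:
        bi = bi * m[(f, i)]
    b.append(bi / bi.sum())

# brute-force marginal of variable 0 for comparison
states = list(itertools.product([0, 1], repeat=3))
w = np.array([np.prod([node[i][x[i]] for i in range(3)])
              * np.prod([tab[tuple(x[i] for i in sub)]
                         for sub, tab in factors.values()])
              for x in states])
w = w / w.sum()
exact0 = np.array([w[[x[0] == s for x in states]].sum() for s in (0, 1)])
print(b[0], exact0)                      # identical on a tree
```

Messages are normalized here purely for numerical convenience; as the text notes below, only the marginal estimates need to be normalized in principle.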
Finally, marginal distributions over factor nodes and variable nodes are expressed in terms of the messages as follows,

b_\alpha(x_\alpha) = \frac{1}{\gamma_\alpha}\, \psi_\alpha(x_\alpha) \prod_{i \in N_\alpha} n_{i\alpha}(x_i)   (3.11)

b_i(x_i) = \frac{1}{\gamma_i}\, \psi_i(x_i) \prod_{\alpha \in N_i} m_{\alpha i}(x_i),   (3.12)

where \gamma_i, \gamma_\alpha are normalization constants. On tree-structured factor graphs, there exists a scheduling such that each message needs to be updated only once in order to compute the exact marginal distributions on the factors and the nodes. On factor graphs with loops, iterating the messages does not always converge, but if they converge, they often give accurate approximations to the exact marginals (Murphy,

¹ Note that although the beliefs \{b_i, b_\alpha\} satisfy local consistency constraints, they need not actually be globally consistent in that they do not necessarily correspond to the marginal distributions of a single probability distribution B(x).
Weiss, & Jordan, 1999). Further, the stable fixed points of the iterations can only be local minima of the BP-Gibbs free energy (Heskes, in press). We note that theoretically, there is no need to normalize the messages themselves (as long as one normalizes the estimates of the marginals), but that it is desirable computationally to avoid numerical overflow or underflow.

4 Linear Response

The mean field and BP algorithms described above provide estimates for single node marginals (both mean field and BP) and factor node marginals (BP only), but not for joint marginal distributions of distant nodes. The linear response (LR) theory can be used to estimate joint marginal distributions over an arbitrary pair of nodes. For pairs of nodes inside a single factor, this procedure even improves on the estimates that can be obtained from BP by marginalization of factor node marginals.

The idea here is to study changes in the system when we perturb the single node potentials,

\log \psi_i(x_i) = \log \psi_i^0(x_i) + \theta_i(x_i).   (4.1)

The superscript 0 in equation 4.1 indicates unperturbed quantities in the following. Let \theta = \{\theta_i\}, and define the free energy

F(\theta) = -\log \sum_x \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i^0(x_i)\, e^{\theta_i(x_i)}.   (4.2)

-F(\theta) is the cumulant generating function for P(X), up to irrelevant constants. Differentiating F(\theta) with respect to \theta gives

-\frac{\partial F(\theta)}{\partial \theta_j(x_j)} \bigg|_{\theta=0} = p_j(x_j)   (4.3)

-\frac{\partial^2 F(\theta)}{\partial \theta_i(x_i)\, \partial \theta_j(x_j)} \bigg|_{\theta=0} = \frac{\partial p_j(x_j)}{\partial \theta_i(x_i)} \bigg|_{\theta=0} = \begin{cases} p_{ij}(x_i, x_j) - p_i(x_i)\, p_j(x_j) & \text{if } i \neq j \\ p_i(x_i)\, \delta_{x_i, x_j} - p_i(x_i)\, p_j(x_j) & \text{if } i = j, \end{cases}   (4.4)
where p_i, p_{ij} are single and pairwise marginals of P(x). Hence, second-order perturbations in the system (equation 4.4) give the covariances between any two nodes in the graph. The desired joint marginal distributions are then obtained by adding back the p_i(x_i)\, p_j(x_j) term. Expressions for higher-order cumulants can be derived by taking further derivatives of -F(\theta).

4.1 Approximate Linear Response. Notice from equation 4.4 that the covariance estimates are obtained by studying the perturbations in p_j(x_j) as
we vary \theta_i(x_i). This is not practical in general since calculating p_j(x_j) itself is intractable. Instead, we consider perturbations of approximate marginal distributions \{b_j\}. In the following, we will assume that b_j(x_j; \theta) are the beliefs at a local minimum of the approximate Gibbs free energy under consideration (possibly subject to constraints).

In analogy to equation 4.4, let

C_{ij}(x_i, x_j) = \frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)} \bigg|_{\theta=0}

be the linear response estimated covariance, and define the linear response estimated joint pairwise marginal as

b_{ij}^{LR}(x_i, x_j) = C_{ij}(x_i, x_j) + b_i^0(x_i)\, b_j^0(x_j),   (4.5)

where b_i^0(x_i) \doteq b_i(x_i; \theta = 0). We will show that b_{ij}^{LR} and C_{ij} satisfy a number of important properties of joint marginals and covariances.

First, we show that C_{ij}(x_i, x_j) can be interpreted as the Hessian of a well-behaved convex function. We focus here on the Bethe approximation (the mean field case is simpler). First, let C be the set of beliefs that satisfy the constraints 3.8 and the normalization constraints. The approximate marginals \{b_i^0\} along with the joint marginals \{b_\alpha^0\} form a local minimum of the Bethe-Gibbs free energy (subject to b^0 \doteq \{b_i^0, b_\alpha^0\} \in C). Assume that b^0 is a strict local minimum of G^{BP}.² That is, there is an open domain D containing b^0 such that G^{BP}(b^0) < G^{BP}(b) for each b \in D \cap C \setminus b^0. Now we can define

G^*(\theta) = \inf_{b \in D \cap C} G^{BP}(b) - \sum_{i, x_i} b_i(x_i)\, \theta_i(x_i).   (4.6)

G^*(\theta) is a concave function since it is the infimum of a set of linear functions in \theta. Further, G^*(0) = G(b^0), and since b^0 is a strict local minimum when \theta = 0, small perturbations in \theta will result in small perturbations in b^0, so that G^* is well behaved on an open neighborhood around \theta = 0. Differentiating G^*(\theta), we get \partial G^*(\theta) / \partial \theta_j(x_j) = -b_j(x_j; \theta), so that we now have

C_{ij}(x_i, x_j) = \frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)} \bigg|_{\theta=0} = -\frac{\partial^2 G^*(\theta)}{\partial \theta_i(x_i)\, \partial \theta_j(x_j)} \bigg|_{\theta=0}.   (4.7)

In essence, we can interpret G^*(\theta) as a local convex dual of G^{BP}(b) (by restricting attention to D). Since G^{BP} is an approximation to the exact Gibbs free energy (Welling & Teh, 2003), which is in turn dual to F(\theta) (Georges & Yedidia, 1991), G^*(\theta) can be seen as an approximation to F(\theta) for small values of \theta. For that reason, we can take its second derivatives C_{ij}(x_i, x_j) as approximations to the exact covariances (which are second derivatives of -F(\theta)). These relationships are shown pictorially in Figure 2.

² The strict local minimality is in fact attained if we use loopy BP (Heskes, in press).
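The exact relations 4.3 and 4.4, which G^*(\theta) approximates, can be verified directly on a small model (a two-node toy example of my own) by differentiating F(\theta) = -log Z(\theta) with finite differences and comparing against the covariance computed by enumeration:

```python
import itertools
import math

psi = [[1.0, 2.0], [1.5, 1.0]]           # psi_i^0(x_i), two binary nodes
coupling = [[2.0, 1.0], [1.0, 2.0]]      # single pairwise factor psi_a

def F(theta):
    """F(theta) = -log Z(theta), equation 4.2, by brute-force enumeration."""
    Z = 0.0
    for x0, x1 in itertools.product([0, 1], repeat=2):
        Z += (coupling[x0][x1]
              * psi[0][x0] * math.exp(theta[0][x0])
              * psi[1][x1] * math.exp(theta[1][x1]))
    return -math.log(Z)

# exact joint and marginals at theta = 0
Z0 = math.exp(-F([[0.0, 0.0], [0.0, 0.0]]))
p = {(x0, x1): coupling[x0][x1] * psi[0][x0] * psi[1][x1] / Z0
     for x0, x1 in itertools.product([0, 1], repeat=2)}
p0 = p[(1, 0)] + p[(1, 1)]               # p_0(x_0 = 1)
p1 = p[(0, 1)] + p[(1, 1)]               # p_1(x_1 = 1)

# -d^2 F / d theta_0(1) d theta_1(1) via a central-difference stencil
h = 1e-4
def Fp(a, b):
    return F([[0.0, a], [0.0, b]])
mixed = -(Fp(h, h) - Fp(h, -h) - Fp(-h, h) + Fp(-h, -h)) / (4 * h * h)
print(mixed, p[(1, 1)] - p0 * p1)        # equation 4.4; both are 9/121 here
```

The same stencil applied to the first derivative recovers equation 4.3, and higher-order stencils recover higher cumulants, as noted in the text.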
[Figure 2: Diagrammatic representation of the different objective functions discussed in the article. The free energy F is the cumulant generating function (up to a constant), G is the Gibbs free energy, G^{BP} is the Bethe-Gibbs free energy, which is an approximation to the true Gibbs free energy, and G^* is the approximate cumulant generating function. The actual approximation is performed in the dual space, and the dashed arrow indicates that the overall process gives G^* as an approximation to F.]
We now proceed to prove a number of important properties of the covariance C.

Theorem 1. The approximate covariance satisfies the following symmetry:

C_{ij}(x_i, x_j) = C_{ji}(x_j, x_i).   (4.8)

Proof. The covariances are second derivatives of -G^*(\theta) at \theta = 0, and we can interchange the order of the derivatives since G^*(\theta) is well behaved on a neighborhood around \theta = 0.

Theorem 2. The approximate covariance satisfies the following "marginalization" conditions for each x_i, x_j:

\sum_{x_i'} C_{ij}(x_i', x_j) = \sum_{x_j'} C_{ij}(x_i, x_j') = 0.   (4.9)

As a result, the approximate joint marginals satisfy local marginalization constraints:

\sum_{x_i'} b_{ij}^{LR}(x_i', x_j) = b_j^0(x_j), \qquad \sum_{x_j'} b_{ij}^{LR}(x_i, x_j') = b_i^0(x_i).   (4.10)
Proof. Using the definition of C_{ij}(x_i, x_j) and the marginalization constraints for b_j^0,

\sum_{x_j'} C_{ij}(x_i, x_j') = \sum_{x_j'} \frac{\partial b_j(x_j'; \theta)}{\partial \theta_i(x_i)} \bigg|_{\theta=0} = \frac{\partial}{\partial \theta_i(x_i)} \sum_{x_j'} b_j(x_j'; \theta) \bigg|_{\theta=0} = \frac{\partial}{\partial \theta_i(x_i)}\, 1 \bigg|_{\theta=0} = 0.   (4.11)

The constraint \sum_{x_i'} C_{ij}(x_i', x_j) = 0 follows from the symmetry 4.8, while the corresponding marginalization, equation 4.10, follows from equation 4.9 and the definition of b_{ij}^{LR}.
Since -F(\theta) is convex, its Hessian matrix with entries given in equation 4.4 is positive semidefinite. Similarly, since the approximate covariances C_{ij}(x_i, x_j) are second derivatives of a convex function -G^*(\theta), we have:

Theorem 3. The matrix formed from the approximate covariances C_{ij}(x_i, x_j) by varying i and x_i over the rows and varying j and x_j over the columns is positive semidefinite.

Using the above results, we can reinterpret the linear response correction as a "projection" of the (only locally consistent) beliefs \{b_i^0, b_\alpha^0\} onto a set of beliefs \{b_i^0, b_{ij}^{LR}\} that is both locally consistent (see theorem 2) and satisfies the global constraint of being positive semidefinite (see theorem 3). This is depicted in Figure 3. Indeed, the idea to include global constraints such as positive semidefiniteness in approximate inference algorithms was proposed in Wainwright and Jordan (2003). It is surprising that a simple post hoc projection can achieve the same result.

5 Propagation Algorithms for Linear Response

Although we have derived an expression for the covariance in the linear response approximation (see equation 4.7), we have not yet explained how to compute it efficiently. In this section, we derive a propagation algorithm to that end and prove some convergence results; in the next section, we present an algorithm based on a matrix inverse.
[Figure 3: Constraint sets discussed in the article. The inner set is the set of all marginal distributions that are consistent with some global distribution B(x), the outer set is the constraint set of all locally consistent marginal distributions, and the middle set consists of locally consistent marginal distributions with positive semidefinite covariance. The linear response algorithm performs a correction on the joint pairwise marginals such that the covariance matrix is symmetric and positive semidefinite, while all local consistency relations are still respected.]
Recall from equation 4.5 that we need the first derivative of b_i(x_i; \theta) with respect to \theta_j(x_j) at \theta = 0. This does not automatically imply that we need an analytic expression for b_i(x_i; \theta) in terms of \theta. Instead, we need only to keep track of first-order dependencies by expanding all quantities and equations up to first order in \theta. For the beliefs we write,³

b_i(x_i; \theta) = b_i^0(x_i) \left( 1 + \sum_{j, y_j} R_{ij}(x_i, y_j)\, \theta_j(y_j) \right).   (5.1)

The "response matrix" R_{ij}(x_i, y_j) measures the sensitivity of \log b_i(x_i; \theta) at node i to a change in the log node potentials \log \psi_j(y_j) at node j. Combining equation 5.1 with equation 4.7, we find that

C_{ij}(x_i, x_j) = b_i^0(x_i)\, R_{ij}(x_i, x_j).   (5.2)

³ The unconventional form of this expansion will make subsequent derivations more transparent.
The constraints, equation 4.9 (which follow from the normalization of b_i(x_i; \theta) and b_i^0(x_i)), translate into

\sum_{x_i} b_i^0(x_i)\, R_{ij}(x_i, y_j) = 0,   (5.3)

and it is not hard to verify that the following shift can be applied to accomplish this,⁴

R_{ij}(x_i, y_j) \leftarrow R_{ij}(x_i, y_j) - \sum_{x_i} b_i^0(x_i)\, R_{ij}(x_i, y_j).   (5.4)
5.1 The Mean Field Approximation. Let us assume that we have found a local minimum of the mean field Gibbs free energy by iterating equation 3.5 until convergence. By inserting the expansions, equations 4.1 and 5.1, into equation 3.5 and equating terms linear in \theta, we derive the following update equation for the response matrix in the mean field approximation,

R_{ik}(x_i, y_k) \leftarrow \delta_{ik}\, \delta_{x_i y_k} + \sum_{\alpha \in N_i} \sum_{x_{\alpha \setminus i}} \log \psi_\alpha(x_\alpha) \left( \prod_{j \in \alpha \setminus i} b_j^{MF,0}(x_j) \right) \left( \sum_{j \in \alpha \setminus i} R_{jk}(x_j, y_k) \right).   (5.5)
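A compact numerical sketch of this recipe on a two-node toy model of my own: iterate the mean field updates (equation 3.5) to convergence, then iterate equation 5.5 together with the shift 5.4 for the response matrix, and form C via equation 5.2. The result exhibits the symmetry of theorem 1 and the zero-sum constraint 5.3:

```python
import numpy as np

psi_n = [np.array([1.0, 2.0]), np.array([1.5, 1.0])]   # psi_i, binary nodes
logpsi = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))    # log psi_a[x_0, x_1]

# mean field fixed point, equation 3.5 (sequential coordinate descent)
b = [np.ones(2) / 2, np.ones(2) / 2]
for _ in range(200):
    t = psi_n[0] * np.exp(logpsi @ b[1]);   b[0] = t / t.sum()
    t = psi_n[1] * np.exp(logpsi.T @ b[0]); b[1] = t / t.sum()

# response propagation, equation 5.5 followed by the shift 5.4;
# R[i][k] holds the 2x2 block R_{ik}(x_i, y_k)
R = [[np.zeros((2, 2)) for _ in range(2)] for _ in range(2)]
for _ in range(200):
    newR = [[None] * 2 for _ in range(2)]
    for i in range(2):
        j = 1 - i
        L = logpsi if i == 0 else logpsi.T   # oriented so axis 0 is x_i
        for k in range(2):
            blk = np.eye(2) if i == k else np.zeros((2, 2))
            blk = blk + L @ (b[j][:, None] * R[j][k])
            blk = blk - b[i] @ blk           # shift 5.4: sum_x b^0 R = 0
            newR[i][k] = blk
    R = newR

C = [[b[i][:, None] * R[i][k] for k in range(2)] for i in range(2)]  # eq 5.2
print(np.abs(C[0][1] - C[1][0].T).max())     # theorem 1 symmetry: ~0
print(np.abs(C[0][1].sum(axis=0)).max())     # constraint 5.3: ~0
```

On this weakly coupled model, the iteration contracts quickly, consistent with the convergence guarantee of theorem 4.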
This update is followed by the shift 5.4 in order to satisfy the constraint 5.3, and the process is initialized with R_{ik}(x_i, y_k) = 0. After convergence, we compute the approximate covariance according to equation 5.2.

Theorem 4. The propagation algorithm for computing the linear response estimates of pairwise probabilities in the mean field approximation is guaranteed to converge to a unique fixed point using any scheduling of the updates.

For a full proof, we refer to the proof of theorem 6, which is very similar. However, it is easy to see that for sequential updates, convergence is guaranteed because equation 5.5 is the first-order term of the mean field equation, 3.5, which converges for arbitrary \theta.

5.2 The Bethe Approximation. In the Bethe approximation, we follow a similar strategy as in the previous section for the mean field approximation. First, we assume that belief propagation has converged to a stable fixed
point, which by Heskes (in press) is guaranteed to be a local minimum of the Bethe-Gibbs free energy. Next, we expand the messages n_{i\alpha}(x_i) and m_{\alpha i}(x_i) up to first order in \theta around the stable fixed point,

n_{i\alpha}(x_i) = n_{i\alpha}^0(x_i) \left( 1 + \sum_{k, y_k} N_{i\alpha,k}(x_i, y_k)\, \theta_k(y_k) \right)   (5.6)

m_{\alpha i}(x_i) = m_{\alpha i}^0(x_i) \left( 1 + \sum_{k, y_k} M_{\alpha i,k}(x_i, y_k)\, \theta_k(y_k) \right).   (5.7)
Inserting these expansions and the expansion 4.1 into the BP equations, 3.9 and 3.10, and matching first-order terms, we arrive at the following update equations for the "supermessages" M_{\alpha i,k}(x_i, y_k) and N_{i\alpha,k}(x_i, y_k),

N_{i\alpha,k}(x_i, y_k) \leftarrow \delta_{ik}\, \delta_{x_i y_k} + \sum_{\beta \in N_i \setminus \alpha} M_{\beta i,k}(x_i, y_k)   (5.8)

M_{\alpha i,k}(x_i, y_k) \leftarrow \sum_{x_{\alpha \setminus i}} \frac{\psi_\alpha(x_\alpha)}{m_{\alpha i}^0(x_i)} \left( \prod_{l \in N_\alpha \setminus i} n_{l\alpha}^0(x_l) \right) \sum_{j \in N_\alpha \setminus i} N_{j\alpha,k}(x_j, y_k).   (5.9)

The supermessages are initialized at M_{\alpha i,k} = N_{i\alpha,k} = 0 and "normalized" as follows:⁵

N_{i\alpha,k}(x_i, y_k) \leftarrow N_{i\alpha,k}(x_i, y_k) - \sum_{x_i} N_{i\alpha,k}(x_i, y_k)   (5.10)

M_{\alpha i,k}(x_i, y_k) \leftarrow M_{\alpha i,k}(x_i, y_k) - \sum_{x_i} M_{\alpha i,k}(x_i, y_k).   (5.11)
After the above fixed-point equations have converged, we compute the response matrix R_{ij}(x_i, x_j) by inserting the expansions 5.1, 4.1, and 5.7 into equation 3.12 and matching first-order terms:

R_{ij}(x_i, x_j) = \delta_{ij}\, \delta_{x_i x_j} + \sum_{\alpha \in N_i} M_{\alpha i,j}(x_i, x_j).   (5.12)

We then normalize the response matrix as in equation 5.4 and compute the approximate covariances as in equation 5.2. We now prove a number of useful results concerning the iterative algorithm proposed above.

Theorem 5. If the factor graph has no loops, then the linear response estimates defined in equation 5.2 are exact. Moreover, there exists a scheduling of the supermessages such that the algorithm converges after just one iteration (i.e., every message is updated just once).

⁵ The derivation is along similar lines as explained in the previous section for the mean field case. Note also that unlike the mean field case, normalization is desirable only for reasons of numerical stability.
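The exactness claim of theorem 5 can be checked on the smallest possible example (a single pairwise factor over two binary variables, a toy model of my own). With messages kept unnormalized, as the text permits, equations 5.8, 5.9, 5.12, 5.4, and 5.2 reproduce the exact covariance computed by enumeration:

```python
import numpy as np

psi_n = [np.array([1.0, 2.0]), np.array([1.5, 1.0])]   # psi_i
psi_a = np.array([[2.0, 1.0], [1.0, 2.0]])             # psi_a[x_0, x_1]

# BP fixed point (3.9)-(3.10), unnormalized; each variable touches only
# factor a, so n_{ia} = psi_i and a single pass suffices on this tree
n0 = [psi_n[0], psi_n[1]]
m0 = [psi_a @ n0[1], psi_a.T @ n0[0]]

# beliefs, equation 3.12
b = [psi_n[i] * m0[i] for i in range(2)]
b = [bi / bi.sum() for bi in b]

# supermessages: (5.8) gives N_{ia,k}(x_i, y_k) = delta_{ik} delta_{x_i y_k},
# so (5.9) collapses to
#   M_{ai,j}(x_i, y_j) = psi_a(x_i, y_j) n_{ja}(y_j) / m_{ai}(x_i)
M01 = (psi_a * n0[1][None, :]) / m0[0][:, None]
M10 = (psi_a.T * n0[0][None, :]) / m0[1][:, None]

def cov(Mij, bi):
    R = Mij - bi @ Mij            # response (5.12) plus the shift (5.4)
    return bi[:, None] * R        # covariance, equation 5.2

C01, C10 = cov(M01, b[0]), cov(M10, b[1])

# exact covariance by enumeration
joint = psi_a * psi_n[0][:, None] * psi_n[1][None, :]
joint = joint / joint.sum()
exact = joint - np.outer(joint.sum(1), joint.sum(0))
print(np.abs(C01 - exact).max())      # ~0: exact on a tree (theorem 5)
print(np.abs(C10 - C01.T).max())      # ~0: symmetry (theorem 1)
```

For this pair of nodes inside a single factor, the linear response estimate coincides with the true joint once b_i^0 b_j^0 is added back, in line with equation 4.5.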
Proof. Both results follow from the fact that BP on tree-structured factor graphs computes the exact single-node marginals for arbitrary $\theta$. Since the supermessages are the first-order terms of the BP updates with arbitrary $\theta$, we can invoke the exact linear response theorem given by equations 4.3 and 4.4 to claim that the algorithm converges to the exact joint pairwise marginal distributions. Moreover, the number of iterations that BP needs to converge is independent of $\theta$, and there exists a scheduling that updates each message exactly once (inward-outward scheduling). Since the supermessages are the first-order terms of the BP updates, they inherit these properties.

For graphs with cycles, BP is not guaranteed to converge. We can, however, still prove the following strong result:

Theorem 6. If the messages $\{m^0_{\alpha i}(x_i), n^0_{i\alpha}(x_i)\}$ have converged to a stable fixed point, then the update equations for the supermessages (see equations 5.8, 5.9, and 5.11) will also converge to a unique stable fixed point, using any scheduling of the supermessages.

Sketch of proof. As a first step, we combine the BP message updates (see equations 3.9 and 3.10) into one set of fixed-point equations by inserting equation 3.9 into equation 3.10. Next, we linearize the fixed-point equations for the BP messages around the stable fixed point. We introduce a small perturbation in the logarithm of the messages, $\delta \log m_{\alpha i}(x_i) \doteq \tilde{M}_{\alpha i}(x_i) = \tilde{M}_a$, where we have collected the message index $\alpha i$ and the state index $x_i$ into one "flattened" index $a$. The linearized equation takes the general form

$$\log m_a + \tilde{M}_a \;\leftarrow\; \log m_a + \sum_b L_{ab}\,\tilde{M}_b, \qquad (5.13)$$

where the matrix $L$ is given by the first-order term of the Taylor expansion of the fixed-point equation. Since we know that the fixed point is stable, we infer that the absolute values of the eigenvalues of $L$ are all smaller than 1, so that $\tilde{M}_a \to 0$ as we iterate the fixed-point equations.

Similarly for the supermessages, we insert equation 5.8 into equation 5.9 and include the normalization, equation 5.11, explicitly, so that equations 5.8, 5.9, and 5.11 collapse into one linear equation. We now observe that the collapsed update equations for the supermessages are linear and of the form

$$M_{a\mu} \leftarrow A_{a\mu} + \sum_b L_{ab}\,M_{b\mu}, \qquad (5.14)$$
where we introduced new flattened indices $\mu = (k, x_k)$ and where $L$ is identical to the $L$ in equation 5.13. The constant term $A_{a\mu}$ comes from the
fact that we also expanded the node potential $\psi_k$ as in equation 4.1. Next, we recall that for the linear dynamics 5.14, there can be only one fixed point, at

$$M_{a\mu} = \sum_b \left[(I - L)^{-1}\right]_{ab} A_{b\mu}, \qquad (5.15)$$
which exists only if $\det(I - L) \neq 0$. Finally, since the eigenvalues of $L$ are less than 1, we conclude that $\det(I - L) \neq 0$, so the fixed point exists, the fixed point is stable, and the (parallel) fixed-point equations, 5.14, will converge to that fixed point.

The above proves the result for parallel updates of the supermessages. However, for linear systems, the Stein-Rosenberg theorem now guarantees that any scheduling will converge to the same fixed point and, moreover, that sequential updates will do so faster.

6 Noniterative Algorithms for Linear Response

In section 5, we described propagation algorithms to compute the approximate covariances $\partial b_i(x_i)/\partial \theta_k(x_k)$ directly. In this section, we describe an alternative method that first computes $\partial \theta_i(x_i)/\partial b_k(x_k)$ and then inverts the matrix formed by $\partial \theta_i(x_i)/\partial b_k(x_k)$, where we have flattened $\{i, x_i\}$ into a row index and $\{k, x_k\}$ into a column index. This method is a direct extension of Kappen and Rodriguez (1998). The intuition is that while perturbations in a single $\theta_i(x_i)$ affect the whole system, perturbations in a single $b_i(x_i)$ (while keeping the others fixed) affect each subsystem $\alpha \in A$ independently (see also Welling & Teh, 2003). This makes it easier to compute $\partial \theta_i(x_i)/\partial b_k(x_k)$ than to compute $\partial b_i(x_i)/\partial \theta_k(x_k)$.

First, we propose minimal representations for $b_i$ and $\theta_k$. Notice that the current representations of $b_i$ and $\theta_k$ are redundant: we always have $\sum_{x_i} b_i(x_i) = 1$ for all $i$, while for each $k$, adding a constant to all $\theta_k(x_k)$ does not change the beliefs. This means that the matrix is actually not invertible: it has eigenvalues of 0. To deal with this noninvertibility, we propose a minimal representation for $b_i$ and $\theta_i$. In particular, we assume that for each $i$ there is a distinguished value $x_i = 0$, and we set $\theta_i(0) = 0$ while functionally defining $b_i(0) = 1 - \sum_{x_i \neq 0} b_i(x_i)$. Now the matrix formed by $\partial \theta_i(x_i)/\partial b_k(x_k)$ for each $i, k$ and $x_i, x_k \neq 0$ is invertible; its inverse gives us the desired covariances for $x_i, x_k \neq 0$. Values for $x_i = 0$ or $x_k = 0$ can then be computed using equation 4.9.

6.1 The Mean Field Approximation. Taking the log of the mean field fixed-point equation, 3.5, and differentiating with respect to $b_k(x_k)$, we get,
after some manipulation, for each $i, k$ and $x_i, x_k \neq 0$,

$$\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)} = \delta_{ik}\Bigl(\frac{\delta_{x_i x_k}}{b_i(x_i)} + \frac{1}{b_i(0)}\Bigr) - (1 - \delta_{ik}) \sum_{\alpha \in \mathcal{N}_i \cap \mathcal{N}_k}\; \sum_{x_{\alpha \setminus i,k}} \log \psi_\alpha(x_\alpha) \prod_{j \in \alpha \setminus i,k} b_j(x_j). \qquad (6.1)$$
Inverting this matrix thus results in the desired estimates of the covariances (see also Kappen & Rodriguez, 1998, for the binary case).

6.2 The Bethe Approximation. In addition to using the minimal representations for $b_i$ and $\theta_i$, we will also need minimal representations for the messages. This can be achieved by defining new quantities $\lambda_{i\alpha}(x_i) = \log \frac{n_{i\alpha}(x_i)}{n_{i\alpha}(0)}$ for all $i$ and $x_i$. The $\lambda_{i\alpha}$'s can be interpreted as Lagrange multipliers that enforce the consistency constraints, equation 3.8 (Yedidia et al., 2000). We will use these multipliers instead of the messages in this section. Reexpressing the fixed-point equations, 3.9 through 3.12, in terms of the $b_i$'s and $\lambda_{i\alpha}$'s only, and introducing the perturbations $\theta_i$, we get

$$\frac{b_i(x_i)}{b_i(0)} = \Bigl(\frac{\psi_i(x_i)}{\psi_i(0)}\, e^{\theta_i(x_i)}\Bigr)^{c_i} \prod_{\alpha \in \mathcal{N}_i} e^{-\lambda_{i\alpha}(x_i)} \quad \text{for all } i,\ x_i \neq 0 \qquad (6.2)$$

$$b_i(x_i) = \frac{\sum_{x_{\alpha \setminus i}} \psi_\alpha(x_\alpha) \prod_{j \in \mathcal{N}_\alpha} e^{\lambda_{j\alpha}(x_j)}}{\sum_{x_\alpha} \psi_\alpha(x_\alpha) \prod_{j \in \mathcal{N}_\alpha} e^{\lambda_{j\alpha}(x_j)}} \quad \text{for all } i,\ \alpha \in \mathcal{N}_i,\ x_i \neq 0. \qquad (6.3)$$
The division by the values at 0 in equation 6.2 is to get rid of the proportionality constant. The above forms a minimal set of fixed-point equations that the single-node beliefs $b_i$ and Lagrange multipliers $\lambda_{i\alpha}$ need to satisfy at any local minimum of the Bethe free energy.

Differentiating the logarithm of equation 6.2 with respect to $b_k(x_k)$, we get

$$\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)} = c_i\,\delta_{ik}\Bigl(\frac{\delta_{x_i x_k}}{b_i(x_i)} + \frac{1}{b_i(0)}\Bigr) + \sum_{\alpha \in \mathcal{N}_i} \frac{\partial \lambda_{i\alpha}(x_i)}{\partial b_k(x_k)}, \qquad (6.4)$$

remembering that $b_i(0)$ is a function of the $b_i(x_i)$, $x_i \neq 0$. Notice that we need values for $\partial \lambda_{i\alpha}(x_i)/\partial b_k(x_k)$ in order to solve for $\partial \theta_i(x_i)/\partial b_k(x_k)$. Since perturbations in $b_k(x_k)$ (while keeping the other $b_j$'s fixed) do not affect nodes not directly connected to $k$, we have $\partial \lambda_{i\alpha}(x_i)/\partial b_k(x_k) = 0$ for $k \notin \alpha$. When $k \in \alpha$, these can in turn be obtained by solving, for each $\alpha$, a matrix inverse. Differentiating equation 6.3 by $b_k(x_k)$,
we obtain

$$\delta_{ik}\,\delta_{x_i x_k} = \sum_{j \in \alpha}\; \sum_{x_j \neq 0} C_{\alpha ij}(x_i, x_j)\, \frac{\partial \lambda_{j\alpha}(x_j)}{\partial b_k(x_k)} \qquad (6.5)$$

$$C_{\alpha ij}(x_i, x_j) = \begin{cases} b_\alpha(x_i, x_j) - b_i(x_i)\, b_j(x_j) & \text{if } i \neq j \\ b_i(x_i)\,\delta_{x_i x_j} - b_i(x_i)\, b_j(x_j) & \text{if } i = j \end{cases} \qquad (6.6)$$

for each $i, k \in \mathcal{N}_\alpha$ and $x_i, x_k \neq 0$. Flattening the indices in equation 6.5 (varying $i, x_i$ over rows and $k, x_k$ over columns), the left-hand side becomes the identity matrix, while the right-hand side is a product of two matrices. The first is a covariance matrix $C_\alpha$ whose $ij$th block is $C_{\alpha ij}(x_i, x_j)$, while the second matrix consists of all the desired derivatives $\partial \lambda_{j\alpha}(x_j)/\partial b_k(x_k)$. Hence, the derivatives are given as elements of the inverse covariance matrix $C_\alpha^{-1}$. Finally, plugging the values of $\partial \lambda_{j\alpha}(x_j)/\partial b_k(x_k)$ into equation 6.4 gives $\partial \theta_i(x_i)/\partial b_k(x_k)$, and inverting that matrix gives us the desired approximate covariances over the whole graph. Interestingly, the method requires access to the beliefs only at the local minimum, not to the potentials or Lagrange multipliers.

7 A Propagation Algorithm for Matrix Inversion

Up to this point, all considerations have been in the discrete domain. A natural question is whether linear response can also be applied in the continuous domain. In this section, we use linear response to derive a propagation algorithm to compute the exact covariance matrix of a gaussian Markov random field. A gaussian random field is a real-valued Markov random field with pairwise interactions. Its energy is

$$E = \frac{1}{2}\sum_{ij} W_{ij}\, x_i x_j + \sum_i \alpha_i x_i, \qquad (7.1)$$

where the $W_{ij}$ are the interactions and the $\alpha_i$ are the biases. Since gaussian distributions are completely described by their first- and second-order statistics, inference in this model reduces to the computation of the mean and covariance,

$$\mu \doteq \langle x \rangle = -W^{-1}\alpha, \qquad \Sigma \doteq \langle xx^T \rangle - \mu\mu^T = W^{-1}. \qquad (7.2)$$
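As a quick numerical check of equation 7.2 (a toy sketch of our own; the particular $W$ and $\alpha$ are arbitrary choices), the moments of the density $p(x) \propto e^{-E(x)}$ can be computed by brute-force integration on a grid and compared with $-W^{-1}\alpha$ and $W^{-1}$:

```python
import numpy as np

# Hypothetical two-dimensional example: density p(x) ∝ exp(-E(x)), E from eq. 7.1
W = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # interactions (positive definite)
alpha = np.array([0.5, -1.0])              # biases
g = np.linspace(-6.0, 6.0, 801)
X1, X2 = np.meshgrid(g, g, indexing="ij")
E = 0.5 * (W[0, 0] * X1**2 + 2 * W[0, 1] * X1 * X2 + W[1, 1] * X2**2) \
    + alpha[0] * X1 + alpha[1] * X2
p = np.exp(-E)
p /= p.sum()                               # normalize on the grid
mu_num = np.array([(p * X1).sum(), (p * X2).sum()])
cov_num = np.array([[(p * (Xa - ma) * (Xb - mb)).sum()
                     for Xb, mb in [(X1, mu_num[0]), (X2, mu_num[1])]]
                    for Xa, ma in [(X1, mu_num[0]), (X2, mu_num[1])]])
print(np.allclose(mu_num, -np.linalg.solve(W, alpha), atol=1e-3))   # mu = -W^{-1} alpha
print(np.allclose(cov_num, np.linalg.inv(W), atol=1e-3))            # Sigma = W^{-1}
```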
Weiss and Freeman (1999) showed that BP (when it converges) will compute the exact means $\mu_i$ but approximate variances $\Sigma_{ii}$ and covariances $\Sigma_{ij}$ between neighboring nodes. We will now show how to compute the exact covariance matrix using linear response, which through equation 7.2 translates into a perhaps unexpected algorithm to invert the matrix $W$.
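The convergence arguments invoked throughout (for theorem 6 above, and again for theorem 7 below) boil down to iterating a linear map whose spectral radius lies below 1. The following toy sketch (names and sizes are our own choices; we take $L$ elementwise nonnegative, the setting in which the Stein-Rosenberg theorem applies) shows parallel and sequential schedules reaching the same unique fixed point $(I - L)^{-1}A$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
L = rng.uniform(size=(n, n))               # elementwise nonnegative iteration matrix
L *= 0.9 / max(abs(np.linalg.eigvals(L)))  # rescale so the spectral radius is 0.9 < 1
A = rng.normal(size=n)

M_par = np.zeros(n)
for _ in range(600):                       # parallel updates: M <- A + L M
    M_par = A + L @ M_par

M_seq = np.zeros(n)
for _ in range(600):                       # sequential (Gauss-Seidel) updates
    for a in range(n):
        M_seq[a] = A[a] + L[a] @ M_seq

exact = np.linalg.solve(np.eye(n) - L, A)  # the unique fixed point (I - L)^{-1} A
print(np.allclose(M_par, exact), np.allclose(M_seq, exact))
```

Both schedules converge to the same point; the sequential schedule does so in fewer sweeps, as the Stein-Rosenberg theorem predicts.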
First, we introduce a small perturbation of the biases, $\alpha \to \alpha + \nu$, and note that

$$\Sigma_{ij} = -\left.\frac{\partial^2 F(\nu)}{\partial \nu_i\, \partial \nu_j}\right|_{\nu=0} = -\left.\frac{\partial \mu_i}{\partial \nu_j}\right|_{\nu=0}. \qquad (7.3)$$
Our strategy will thus be to compute $\mu(\nu) \approx \mu^0 - \Sigma\nu$ up to first order in $\nu$. This can again be achieved by expanding the propagation updates to first order in $\nu$. It will be convenient to collapse the two sets of message updates, equations 3.9 and 3.10, into one set of messages by inserting equation 3.9 into equation 3.10. Because the subsets $\alpha$ correspond to pairs of variables in the gaussian random field model, we change notation for the messages from $\alpha \to j$ with $\alpha = \{i, j\}$ to $i \to j$. Using the following definitions for the messages and potentials,

$$m_{ij}(x_j) \propto e^{-\frac{1}{2} a_{ij} x_j^2 - b_{ij} x_j} \qquad (7.4)$$

$$\psi_{ij}(x_i, x_j) = e^{-W_{ij} x_i x_j}, \qquad \psi_i(x_i) = e^{-\frac{1}{2} W_{ii} x_i^2 - \alpha_i x_i}, \qquad (7.5)$$

we derive the update equations,6

$$a_{ij} \leftarrow \frac{-W_{ij}^2}{W_{ii} + \sum_{k \in \mathcal{N}_i \setminus j} a_{ki}}, \qquad b_{ij} \leftarrow \frac{a_{ij}\bigl(\alpha_i + \sum_{k \in \mathcal{N}_i \setminus j} b_{ki}\bigr)}{W_{ij}}, \qquad (7.6)$$

$$\tau_i = W_{ii} + \sum_{k \in \mathcal{N}_i} a_{ki}, \qquad \mu_i = -\frac{\alpha_i + \sum_{k \in \mathcal{N}_i} b_{ki}}{\tau_i}, \qquad (7.7)$$

where the means $\mu_i$ are exact at convergence, but the precisions $\tau_i$ are approximate (Weiss & Freeman, 1999).

We note that the $a_{ij}$ messages do not depend on $\alpha$, so that the perturbation $\alpha \to \alpha + \nu$ will have no effect on them. Perturbing the $b_{ij}$ messages as $b_{ij} = b^0_{ij} + \sum_k B_{ij,k}\,\nu_k$, we derive the following update equations for the "supermessages" $B_{ij,l}$:

$$B_{ij,l} \leftarrow \frac{a_{ij}}{W_{ij}}\Bigl(\delta_{il} + \sum_{k \in \mathcal{N}_i \setminus j} B_{ki,l}\Bigr). \qquad (7.8)$$
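To make the procedure concrete, the updates 7.6 and 7.8 can be implemented in a few lines. The sketch below (function and variable names are ours; we use plain synchronous updates for simplicity) combines them with the covariance formula of theorem 7 below to invert a small diagonally dominant matrix:

```python
import numpy as np

def gaussian_bp_inverse(W, iters=200):
    """Invert a diagonally dominant symmetric W via gaussian BP messages
    (eq. 7.6), supermessages (eq. 7.8), and the covariance formula (eq. 7.9)."""
    N = W.shape[0]
    nbrs = [[j for j in range(N) if j != i and W[i, j] != 0.0] for i in range(N)]
    a = np.zeros((N, N))                   # a[i, j]: precision message i -> j
    for _ in range(iters):                 # eq. 7.6 (only the a-messages are needed)
        a_new = np.zeros_like(a)
        for i in range(N):
            for j in nbrs[i]:
                a_new[i, j] = -W[i, j] ** 2 / (
                    W[i, i] + sum(a[k, i] for k in nbrs[i] if k != j))
        a = a_new
    B = np.zeros((N, N, N))                # B[i, j, l]: supermessage B_{ij,l}
    for _ in range(iters):                 # eq. 7.8
        B_new = np.zeros_like(B)
        for i in range(N):
            for j in nbrs[i]:
                for l in range(N):
                    B_new[i, j, l] = (a[i, j] / W[i, j]) * (
                        float(i == l) + sum(B[k, i, l] for k in nbrs[i] if k != j))
        B = B_new
    tau = np.array([W[i, i] + sum(a[k, i] for k in nbrs[i]) for i in range(N)])  # eq. 7.7
    Sigma = np.zeros((N, N))               # eq. 7.9: Sigma_il = (delta_il + sum_k B_ki,l) / tau_i
    for i in range(N):
        for l in range(N):
            Sigma[i, l] = (float(i == l) + sum(B[k, i, l] for k in nbrs[i])) / tau[i]
    return Sigma

W = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])            # diagonally dominant, so BP converges
print(np.allclose(gaussian_bp_inverse(W), np.linalg.inv(W)))
```

On this (tree-structured) example the messages converge after a handful of sweeps and the recovered matrix equals $W^{-1}$ to machine precision.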
Note that, given a solution to equation 7.8, it is no longer necessary to run the updates for $b_{ij}$ (see equation 7.6), since $b_{ij}$ can be computed as $b_{ij} = \sum_l B_{ij,l}\,\alpha_l$.

Theorem 7. If BP has converged to a stable fixed point (i.e., the message updates 7.6 have converged to a stable fixed point), then the message updates 7.8 will converge
6 Here we used the following identity: $\int dx\, e^{-\frac{1}{2} a x^2 - b x} = e^{b^2/2a}\sqrt{2\pi/a}$.
to a unique stable fixed point. Moreover, the exact covariance matrix $\Sigma = W^{-1}$ is given by the following expression,

$$\Sigma_{il} = \frac{1}{\tau_i}\Bigl(\delta_{il} + \sum_{k \in \mathcal{N}_i} B_{ki,l}\Bigr), \qquad (7.9)$$
with $\tau_i$ given by equation 7.7.

Sketch of proof. The convergence proof is similar to the proof of theorem 6 and is based on the observation that equation 7.8 is a linearization of the fixed-point equation for $b_{ij}$, equation 7.6, and so has the same convergence properties. The exactness proof is similar to the proof of theorem 5 and uses the fact that BP computes the means exactly, so equation 7.3 computes the exact covariance, which is what we compute with equation 7.9.

Weiss and Freeman (1999) further showed that for diagonally dominant weight matrices ($|W_{ii}| > \sum_{j \neq i} |W_{ij}|\ \forall i$), convergence of BP (i.e., of the message updates, equation 7.6) is guaranteed. Combined with the above theorem, this ensures that the proposed iterative algorithm to invert $W$ will converge for diagonally dominant $W$. Whether the class of problems that can be solved using this method can be enlarged, possibly at the expense of an approximation, is an open question.

From equation 7.2, we observe that the exact covariance matrix may also be computed by running BP $N$ times with $\alpha_i = -e_i$, $i = 1, \ldots, N$, where $e_i$ is the unit vector in direction $i$. The exact means $\{\mu_i\}$ computed using BP then form the columns of the matrix $W^{-1}$. This idea was exploited in the proof of claim 2 in Weiss and Freeman (1999). The complexity of the above algorithm is $O(N \times E)$ per iteration, where $N$ is the number of nodes and $E$ the number of edges in the graph. Consequently, it will improve on a straight matrix inversion only if the graph is sparse (i.e., the matrix to invert has many zeros).

8 Experiments

In the following experiments, we compare five methods for computing approximate estimates of the covariance matrix $C_{ij}(x_i, x_j) = p_{ij}(x_i, x_j) - p_i(x_i)\,p_j(x_j)$:

MF: Since mean field assumes independence, we have $C = 0$. This will act as a baseline.
BP: Estimates computed directly from equation 3.11 by integrating out variables that are not considered (in fact, in the experiments below, the factors $\alpha$ consist of pairs of nodes, so no integration is necessary). Note that nontrivial estimates exist only if there is a factor node that contains
Figure 4: (a) Square grid used in the experiments. The rows are collected into supernodes, which then form a chain. (b) Spanning tree of the nodes on the square grid used in the experiments.
both nodes. The BP messages were uniformly initialized at $m_{\alpha i}(x_i) = n_{i\alpha}(x_i) = 1$ and run until convergence. No damping was used.
MF+LR: Estimates computed from the linear response correction to the mean field approximation (see section 5.1). The MF beliefs were first uniformly initialized at $b_i(x_i) = 1/D$ and run until convergence, while the response matrix $R_{ij}(x_i, x_j)$ was initialized at 0.

BP+LR: Estimates computed from the linear response correction to the Bethe approximation (see section 5.2). The supermessages $\{M_{\alpha i,j}(x_i, x_j), N_{i\alpha,j}(x_i, x_j)\}$ were all initialized at 0.

COND: Estimates computed using the following conditioning procedure. Clamp a certain node $j$ to a specific state $x_j = a$. Run BP to compute the conditional distributions $b^{BP}(x_i \mid x_j = a)$. Do this for all nodes and all states to obtain all conditional distributions $b^{BP}(x_i \mid x_j)$. The joint distribution is then computed as $b^{COND}_{ij}(x_i, x_j) = b^{BP}(x_i \mid x_j)\, b^{BP}(x_j)$. Finally, the covariance is computed as

$$C^{COND}_{ij}(x_i, x_j) = b^{COND}_{ij}(x_i, x_j) - \sum_{x_j} b^{COND}_{ij}(x_i, x_j) \sum_{x_i} b^{COND}_{ij}(x_i, x_j). \qquad (8.1)$$

Note that $C^{COND}$ is not symmetric and that the marginal $\sum_{x_j} b^{COND}_{ij}(x_i, x_j)$ is not consistent with $b^{BP}(x_i)$.
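As a toy illustration of the arithmetic in equation 8.1 (the two-node joint table below is our own invention; on a two-node tree BP is exact, so we use the true conditionals in place of BP beliefs):

```python
import numpy as np

# Hypothetical exact joint over two nodes with three states each
# (rows index x_i, columns index x_j).
p = np.array([[0.10, 0.05, 0.15],
              [0.20, 0.05, 0.05],
              [0.10, 0.20, 0.10]])
p_j = p.sum(axis=0)                       # marginal b(x_j)
b_cond = (p / p_j) * p_j                  # b^COND(x_i, x_j) = b(x_i | x_j) b(x_j)
# eq. 8.1: subtract the product of the marginals of b^COND
C = b_cond - np.outer(b_cond.sum(axis=1), b_cond.sum(axis=0))
print(np.allclose(C, p - np.outer(p.sum(axis=1), p_j)))  # matches the true covariance
```

With exact conditionals the construction reproduces the true covariance table; the asymmetry and marginal inconsistency noted above arise only once approximate BP beliefs are substituted.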
The methods were halted if the maximum change in absolute value of all beliefs (MF) or messages (BP) was smaller than $10^{-8}$. The graphical model in the first two experiments has nodes placed on a square 6 × 6 grid (i.e., $N = 36$) with only nearest neighbors connected (see Figure 4a). Each node is associated with a random variable, which can be in one of three states ($D = 3$). The factors were chosen to be all pairs of neighboring nodes in the graph.
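The random grid models used in these experiments can be sketched as follows (a minimal construction of our own, anticipating the potential-drawing scheme described further below; the function and parameter names are ours):

```python
import numpy as np

def random_grid_model(n=6, n_states=3, sigma_node=0.0, sigma_edge=1.0, seed=0):
    """Draw a random pairwise model on an n x n grid: the logarithms of the
    node and edge potentials are sampled from zero-mean gaussians."""
    rng = np.random.default_rng(seed)
    nodes = [(r, c) for r in range(n) for c in range(n)]
    node_pot = {v: np.exp(sigma_node * rng.normal(size=n_states)) for v in nodes}
    edges = [((r, c), (r + dr, c + dc)) for r, c in nodes
             for dr, dc in [(0, 1), (1, 0)] if r + dr < n and c + dc < n]
    edge_pot = {e: np.exp(sigma_edge * rng.normal(size=(n_states, n_states)))
                for e in edges}
    return node_pot, edge_pot

node_pot, edge_pot = random_grid_model()
print(len(node_pot), len(edge_pot))       # 36 nodes, 60 nearest-neighbor edges
```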
Figure 5: Absolute error for the estimated covariances for the 6 × 6 square grid, as a function of σ_edge. (Panels show neighboring nodes, next-to-nearest neighbors, and the remaining node pairs, for σ_node = 0 and σ_node = 2, comparing C = 0, BP, MF+LR, BP+LR, and COND.)
Figure 6: Absolute error for the spanning tree in Figure 4b that is smoothly changed into the 6 × 6 square grid. (Panels show neighboring, next-to-nearest-neighbor, and remaining node pairs at σ_node = 0, comparing C = 0, BP, MF+LR, BP+LR, and COND as a function of σ_edge.)
Figure 7: Absolute error computed on a fully connected network with 10 nodes. (Panels show σ_node = 0 and σ_node = 2, comparing C = 0, BP, MF+LR, BP+LR, and COND as a function of σ_edge.)
By clustering the nodes in each row into supernodes, exact inference is still feasible using the forward-backward algorithm. Pairwise probabilities between nodes in nonconsecutive layers were computed by integrating out the intermediate supernodes. The error in the estimated covariances was computed as the absolute difference between the estimated and the true values, averaged over pairs of nodes and their possible states, and averaged over 15 random draws of the network, as described below. An instantiation of a network was generated by randomly drawing the logarithm of the node and edge potentials from a gaussian with zero mean and standard deviations $\sigma_{node}$ and $\sigma_{edge}$, respectively.

In the first experiment, we generated networks randomly with the scale $\sigma_{edge}$ varying over the range $[0, 2]$ and two settings of the scale $\sigma_{node}$, namely, $\{0, 2\}$. The results in Figure 5 were separately plotted for neighboring nodes, next-to-nearest neighboring nodes, and the remaining nodes in order to show the decay of dependencies with distance. Estimates for BP are absent in Figures 5b and 5e and in 5c and 5f because BP does not provide nontrivial estimates for nonneighbors.

In the next experiment, we generated a single network with $\sigma_{edge} = 1$ and $\{\psi_i\} = 1$ on the 6 × 6 square grid used in the previous experiment. The edge strengths of a subset of the edges forming a spanning tree of the graph were held fixed (see Figure 4b), while the remaining edge strengths were multiplied by a factor increasing from 0 to 2 on the x-axis. The results are shown in Figure 6. Note that BP+LR and COND are exact on the tree.

Finally, we generated fully connected graphs with 10 nodes and 3 states per node (i.e., all nodes are neighbors). We used varying edge strengths ($\sigma_{edge}$ ranging over $[0, 1]$) and two values of $\sigma_{node}$: $\{0, 2\}$. The results are shown in Figure 7. If we further increase the edge strengths in this fully connected network, we find that BP often fails to converge.
We could probably improve this situation somewhat by damping the BP updates, but because of the many tight loops, BP is doomed to fail for relatively large $\sigma_{edge}$.

All experiments confirm that the LR estimates of the covariances in the Bethe approximation improve significantly on the LR estimates in the MF approximation. It is well known that the MF approximation usually improves for large, densely connected networks. This is probably the reason MF+LR performed better on the fully connected graph, but it never performed as well as BP+LR or COND. The COND method performed surprisingly well, at either the same level of accuracy as BP+LR or a little better. It was, however, checked numerically that the symmetrized estimate of the covariance matrix was not positive semidefinite and that the various marginals computed from the joint distributions $b^{COND}_{ij}(x_i, x_j)$ were inconsistent with each other. In the next section, we further discuss the differences between BP+LR and COND. Finally, as expected, the BP+LR and COND estimates are exact on
a tree (the error is of the order of machine precision) but deteriorate with increasing edge strengths when the graph contains cycles.

9 Discussion

Loosely speaking, the "philosophy" of this article for computing estimates of covariances is as follows (see Figure 2). First, we observe that the log partition function is the cumulant generating function. Next, we define its conjugate dual, the Gibbs free energy, and approximate it (e.g., with the mean field or the Bethe approximation). Finally, we transform back to obtain a convex approximation to the log partition function, from which we estimate the covariances.

We have presented linear response algorithms on factor graphs. In the discrete case, we have discussed the mean field and the Bethe approximations, while for gaussian random fields, we have shown how the proposed linear response algorithm translates into a surprising propagation algorithm for computing a matrix inverse. The computational complexity of the iterative linear response algorithm scales as $O(N \times E \times D^3)$ per iteration, where $N$ is the number of nodes, $E$ the number of edges, and $D$ the number of states per node. The noniterative algorithm scales slightly worse, $O(N^3 \times D^3)$, but is based on a matrix inverse, for which very efficient implementations exist.

A question that remains open is whether we can improve the efficiency of the iterative algorithm when we are interested only in the joint distributions of neighboring nodes. On tree-structured graphs, we know that belief propagation computes those estimates exactly in $O(E \times D^2)$, but the linear response algorithm still seems to scale as $O(N \times E \times D^3)$, which indicates that some useful information remains unused. Another hint pointing in that direction comes from the fact that in the gaussian case, an efficient algorithm was proposed in Wainwright, Sudderth, and Willsky (2000) for the computation of variances and neighboring covariances on a loopy graph.

There are still a number of generalizations worth exploring.
First, instead of the MF or Bethe approximations, we can use the more accurate Kikuchi approximation, defined over larger clusters of nodes and their intersections (see also Tanaka, 2003). Another candidate is the convexified Bethe free energy (Wainwright, Jaakkola, & Willsky, 2002). Second, in the case of the Bethe approximation, belief propagation is not guaranteed to converge. However, convergent alternatives have been developed in the literature (Teh & Welling, 2001; Yuille, 2002), and the noniterative linear response algorithm can still be applied to compute joint pairwise distributions. For reasons of computational efficiency, it may be desirable to develop iterative algorithms for this case as well. Third, the presented method generalizes easily to the computation of higher-order cumulants; it is straightforward (but cumbersome) to develop iterative linear response algorithms for these too. Finally, we
are investigating whether linear response algorithms may also be applied to fixed points of the expectation propagation algorithm.

The most important distinguishing feature between the proposed LR algorithm and the conditioning procedure described in section 8 is that the LR covariance estimate is automatically positive semidefinite. The idea of including global constraints such as positive semidefiniteness in approximate inference algorithms was proposed in Wainwright and Jordan (2003); LR may be considered a post hoc projection onto this constraint set (see section 4.1 and Figure 3). Another difference is the lack of a convergence proof for conditioned BP runs, given that BP has converged without conditioning (convergence for BP+LR was proven in section 5.2). Even if the various runs of conditioned BP do converge, different runs might converge to different local minima of the Bethe free energy, making the obtained estimates inconsistent and less accurate (although in the regime we worked with in the experiments, we did not observe this behavior). Finally, the noniterative algorithm is applicable to all local minima of the Bethe-Gibbs free energy, even those that correspond to unstable fixed points of BP. These minima can still be identified using convergent alternatives (Yuille, 2002; Teh & Welling, 2001).

Acknowledgments

We thank Martin Wainwright for discussion and the referees for valuable feedback. M.W. thanks Geoffrey Hinton for support. Y.W.T. thanks Mike Jordan for support.

References

Frey, B., & MacKay, D. (1997). A revolution: Belief propagation in graphs with cycles. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 479–485). Cambridge, MA: MIT Press.

Georges, A., & Yedidia, J. (1991). How to expand around mean-field theory using high-temperature expansions. J. Phys. A: Math. Gen., 24, 2173–2192.

Heskes, T. (in press). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Kappen, H., & Rodriguez, F. (1998). Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10, 1137–1156.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 362–369). San Francisco: Morgan Kaufmann.
Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Tanaka, K. (2003). Probabilistic inference by means of cluster variation method and linear response theory. IEICE Transactions on Information and Systems, E86-D(7), 1–15.

Teh, Y., & Welling, M. (2001). The unified propagation and scaling algorithm. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 536–543). San Francisco: Morgan Kaufmann.

Wainwright, M., & Jordan, M. (2003). Semidefinite relaxations for approximate inference on graphs with cycles (Tech. Rep. No. UCB/CSD-3-1226). Berkeley: CS Division, University of California, Berkeley.

Wainwright, M., Sudderth, E., & Willsky, A. (2000). Tree-based modeling and estimation of gaussian processes on graphs with cycles. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 661–667). Cambridge, MA: MIT Press.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12, 1–41.

Weiss, Y., & Freeman, W. (1999). Correctness of belief propagation in gaussian graphical models of arbitrary topology. In S. A. Solla, T. K. Leen, & K. R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 673–679). Cambridge, MA: MIT Press.

Weiss, Y., & Freeman, W. (2001). On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47, 723–735.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines.
Artificial Intelligence, 143, 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2000). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yedidia, J., Freeman, W., & Weiss, Y. (2002). Constructing free energy approximations and generalized belief propagation algorithms (Tech. Rep. TR-2002-35). Cambridge, MA: Mitsubishi Electric Research Laboratories.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14(7), 1691–1722.

Received October 11, 2002; accepted July 8, 2003.
ARTICLE
Communicated by Pamela Reinagel
Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions Tatyana Sharpee [email protected] Sloan–Swartz Center for Theoretical Neurobiology and Department of Physiology, University of California at San Francisco, San Francisco, CA 94143, U.S.A.
Nicole C. Rust [email protected] Center for Neural Science, New York University, New York, NY 10003, U.S.A.
William Bialek [email protected] Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A., and Sloan–Swartz Center for Theoretical Neurobiology and Department of Physiology, University of California at San Francisco, San Francisco, CA 94143, U.S.A.
We propose a method that allows for a rigorous statistical analysis of neural responses to natural stimuli that are nongaussian and exhibit strong correlations. We have in mind a model in which neurons are selective for a small number of stimulus dimensions out of a high-dimensional stimulus space, but within this subspace the responses can be arbitrarily nonlinear. Existing analysis methods are based on correlation functions between stimuli and responses, but these methods are guaranteed to work only in the case of gaussian stimulus ensembles. As an alternative to correlation functions, we maximize the mutual information between the neural responses and projections of the stimulus onto low-dimensional subspaces. The procedure can be done iteratively by increasing the dimensionality of this subspace. Those dimensions that allow the recovery of all of the information between spikes and the full unprojected stimuli describe the relevant subspace. If the dimensionality of the relevant subspace indeed is small, it becomes feasible to map the neuron’s input-output function even under fully natural stimulus conditions. These ideas are illustrated in simulations on model visual and auditory neurons responding to natural scenes and sounds, respectively.
Neural Computation 16, 223–250 (2004)
© 2003 Massachusetts Institute of Technology
1 Introduction From olfaction to vision and audition, a growing number of experiments are examining the responses of sensory neurons to natural stimuli (Creutzfeldt & Northdurft, 1978; Rieke, Bodnar, & Bialek, 1995; Baddeley et al., 1997; Stanley, Li, & Dan, 1999; Theunissen, Sen, & Doupe, 2000; Vinje & Gallant, 2000, 2002; Lewen, Bialek, & de Ruyter van Steveninck, 2001; Sen, Theunissen, & Doupe, 2001; Vickers, Christensen, Baker, & Hildebrand, 2001; Ringach, Hawken, & Shapley, 2002; Weliky, Fiser, Hunt, & Wagner, 2003; Rolls, Aggelopoulos, & Zheng, 2003; Smyth, Willmore, Baker, Thompson, & Tolhurst, 2003). Observing the full dynamic range of neural responses may require using stimulus ensembles that approximate those occurring in nature (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Simoncelli & Olshausen, 2001), and it is an attractive hypothesis that the neural representation of these natural signals may be optimized in some way (Barlow, 1961, 2001; von der Twer & Macleod, 2001; Bialek, 2002). Many neurons exhibit strongly nonlinear and adaptive responses that are unlikely to be predicted from a combination of responses to simple stimuli; for example, neurons have been shown to adapt to the distribution of sensory inputs, so that any characterization of these responses will depend on context (Smirnakis, Berry, Warland, Bialek, & Meister, 1996; Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001). Finally, the variability of a neuron’s responses decreases substantially when complex dynamical, rather than static, stimuli are used (Mainen & Sejnowski, 1995; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Kara, Reinagel, & Reid, 2000; de Ruyter van Steveninck, Borst, & Bialek, 2000). All of these arguments point to the need for general tools to analyze the neural responses to complex, naturalistic inputs. 
The stimuli analyzed by sensory neurons are intrinsically high-dimensional, with dimensions $D \sim 10^2$–$10^3$. For example, in the case of visual neurons, the input is commonly specified as light intensity on a grid of at least $10 \times 10$ pixels. Each of the presented stimuli can be described as a vector $\mathbf{s}$ in this high-dimensional stimulus space (see Figure 1). The dimensionality becomes even larger if stimulus history has to be considered as well. For example, if we are interested in how the past $N$ frames of the movie affect the probability of a spike, then the stimulus $\mathbf{s}$, being a concatenation of the past $N$ samples, will have dimensionality $N$ times that of a single frame. We also assume that the probability distribution $P(\mathbf{s})$ is sampled during an experiment ergodically, so that we can exchange averages over time with averages over the true distribution as needed. Although direct exploration of a $D \sim 10^2$–$10^3$-dimensional stimulus space is beyond the constraints of experimental data collection, progress can be made provided we make certain assumptions about how the response has been generated. In the simplest model, the probability of response can be described by one receptive field (RF) or linear filter (Rieke et al., 1997).
Figure 1: Schematic illustration of a model with a one-dimensional relevant subspace: $\hat e_1$ is the relevant dimension, and $\hat e_2$ and $\hat e_3$ are irrelevant ones. Shown are three example stimuli, $\mathbf{s}$, $\mathbf{s}'$, and $\mathbf{s}''$, the receptive field of a model neuron (the relevant dimension $\hat e_1$), and our guess $v$ for the relevant dimension. Probabilities of a spike $P(\mathrm{spike}|\mathbf{s}\cdot\hat e_1)$ and $P(\mathrm{spike}|\mathbf{s}\cdot v)$ are calculated by first projecting all of the stimuli $\mathbf{s}$ onto each of the two vectors $\hat e_1$ and $v$, respectively, and then applying equations 2.3, 1.2, and 1.1 sequentially. Our guess $v$ for the relevant dimension is adjusted during the progress of the algorithm in such a way as to maximize $I(v)$ of equation 2.5, which makes the vector $v$ approach the true relevant dimension $\hat e_1$.
The RF can be thought of as a template or special direction $\hat e_1$ in the stimulus space¹ such that the neuron's response depends on only a projection of a given stimulus $\mathbf{s}$ onto $\hat e_1$, although the dependence of the response on this

Footnote 1: The notation $\hat e$ denotes a unit vector, since we are interested only in the direction the vector specifies and not in its length.
projection can be strongly nonlinear (cf. Figure 1). In this simple model, the reverse correlation method (de Boer & Kuyper, 1968; Rieke et al., 1997; Chichilnisky, 2001) can be used to recover the vector $\hat e_1$ by analyzing the neuron's responses to gaussian white noise. In a more general case, the probability of the response depends on projections $s_i = \hat e_i \cdot \mathbf{s}$ of the stimulus $\mathbf{s}$ on a set of $K$ vectors $\{\hat e_1, \hat e_2, \ldots, \hat e_K\}$:

$$ P(\mathrm{spike}|\mathbf{s}) = P(\mathrm{spike})\, f(s_1, s_2, \ldots, s_K), \qquad (1.1) $$
where $P(\mathrm{spike}|\mathbf{s})$ is the probability of a spike given a stimulus $\mathbf{s}$ and $P(\mathrm{spike})$ is the average firing rate. In what follows, we will call the subspace spanned by the set of vectors $\{\hat e_1, \hat e_2, \ldots, \hat e_K\}$ the relevant subspace (RS).² We reiterate that the vectors $\hat e_i$, $1 \le i \le K$, may also describe how the time dependence of the stimulus $\mathbf{s}$ affects the probability of a spike. An example of such a relevant dimension would be a spatiotemporal RF of a visual neuron. Although the ideas developed below can be used to analyze input-output functions $f$ with respect to different neural responses, such as patterns of spikes in time (de Ruyter van Steveninck & Bialek, 1988; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Reinagel & Reid, 2000), for illustration purposes we choose a single spike as the response of interest.³

Equation 1.1 in itself is not yet a simplification if the dimensionality $K$ of the RS is equal to the dimensionality $D$ of the stimulus space. In this article, we will assume that the neuron's firing is sensitive only to a small number of stimulus features ($K \ll D$). While the general idea of searching for low-dimensional structure in high-dimensional data is very old, our motivation here comes from work on the fly visual system, where it was shown explicitly that patterns of action potentials in identified motion-sensitive neurons are correlated with low-dimensional projections of the high-dimensional visual input (de Ruyter van Steveninck & Bialek, 1988; Brenner, Bialek, et al., 2000; Bialek & de Ruyter van Steveninck, 2003). The input-output function $f$ in equation 1.1 can be strongly nonlinear, but it is presumed to depend on only a small number of projections. This assumption appears to be less stringent than that of approximate linearity, which one makes when characterizing a neuron's response in terms of Wiener kernels (see, e.g., the discussion in section 2.1.3 of Rieke et al., 1997). The most difficult part in reconstructing the input-output function is to find the RS. Note that for $K > 1$, a description in terms of any linear combination of vectors $\{\hat e_1, \hat e_2, \ldots, \hat e_K\}$ is just as valid, since we did not make any assumptions as to a particular form of the nonlinear function $f$. Once the relevant subspace is known, the probability $P(\mathrm{spike}|\mathbf{s})$ becomes a function of only a few parameters, and it becomes feasible to map this function experimentally, inverting the probability distributions according to Bayes' rule:

$$ f(s_1, s_2, \ldots, s_K) = \frac{P(s_1, s_2, \ldots, s_K|\mathrm{spike})}{P(s_1, s_2, \ldots, s_K)}. \qquad (1.2) $$

Footnote 2: Since the analysis does not depend on a particular choice of basis within the full $D$-dimensional stimulus space, for clarity we choose the basis in which the first $K$ basis vectors span the relevant subspace and the remaining $D - K$ vectors span the irrelevant subspace.

Footnote 3: We emphasize that our focus here on single spikes is not equivalent to assuming that the spike train is a Poisson process modulated by the stimulus. No matter what the statistical structure of the spike train is, we can always ask what features of the stimulus are relevant for setting the probability of generating a single spike at one moment in time. From an information-theoretic point of view, asking for stimulus features that capture the mutual information between the stimulus and the arrival times of single spikes is a well-posed question even if successive spikes do not carry independent information. Note also that spikes' carrying independent information is not the same as spikes' being generated as a Poisson process. On the other hand, if (for example) different temporal patterns of spikes carry information about different stimulus features, then analysis of single spikes will result in a relevant subspace of artifactually high dimensionality. Thus, it is important that the approach discussed here carries over without modification to the analysis of relevant dimensions for the generation of any discrete event, such as a pattern of spikes across time in one cell or synchronous spikes across a population of cells. For a related discussion of relevant dimensions and spike patterns using covariance matrix methods, see de Ruyter van Steveninck and Bialek (1988) and Agüera y Arcas, Fairhall, and Bialek (2003).
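Once a direction is fixed, the histogram ratio of equation 1.2 can be sampled directly. A minimal sketch follows, with an invented one-dimensional model cell; the filter, the sigmoidal rate, and all parameter values are stand-ins, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model cell: spikes depend on the stimulus s only through
# the projection x = s . e1, via an invented sigmoidal rate.
D = 20
e1 = rng.standard_normal(D)
e1 /= np.linalg.norm(e1)
S = rng.standard_normal((50_000, D))            # one stimulus per row
x = S @ e1                                      # projections s1
p_spike = 0.1 / (1.0 + np.exp(-3.0 * (x - 1.0)))
spikes = rng.random(len(x)) < p_spike

# Equation 1.2: recover f (up to the factor P(spike)) as the ratio of the
# spike-conditional and a priori histograms of projection values.
bins = np.linspace(-4.0, 4.0, 33)
P_x, _ = np.histogram(x, bins=bins, density=True)
P_x_spike, _ = np.histogram(x[spikes], bins=bins, density=True)
f_est = np.divide(P_x_spike, P_x, out=np.zeros_like(P_x), where=P_x > 0)
```

Multiplying `f_est` by the overall spike probability gives $P(\mathrm{spike}|x)$; the same construction is applied along the estimated direction $\hat v_{\max}$ in section 4.1.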
If stimuli are chosen from a correlated gaussian noise ensemble, then the neural response can be characterized by the spike-triggered covariance method (de Ruyter van Steveninck & Bialek, 1988; Brenner, Bialek, et al., 2000; Schwartz, Chichilnisky, & Simoncelli, 2002; Touryan, Lau, & Dan, 2002; Bialek & de Ruyter van Steveninck, 2003). It can be shown that the dimensionality of the RS is equal to the number of nonzero eigenvalues of a matrix given by the difference between the covariance matrices of all presented stimuli and of stimuli conditional on a spike. Moreover, the RS is spanned by the eigenvectors associated with the nonzero eigenvalues, multiplied by the inverse of the a priori covariance matrix. Compared to the reverse correlation method, we are no longer limited to finding only one of the relevant dimensions $\hat e_i$, $1 \le i \le K$. Both the reverse correlation and the spike-triggered covariance methods, however, give rigorously interpretable results only for gaussian distributions of inputs. In this article, we investigate whether it is possible to lift the requirement for stimuli to be gaussian. When using natural stimuli, which certainly are nongaussian, the RS cannot be found by the spike-triggered covariance method. Similarly, the reverse correlation method does not give the correct RF, even in the simplest case where the input-output function in equation 1.1 depends on only one projection (see appendix A for a discussion of this point). However, the vectors that span the RS are clearly special directions in the stimulus space, independent of assumptions about $P(\mathbf{s})$. This notion can be quantified by the Shannon information. We note that the stimuli $\mathbf{s}$ do not have to lie on a low-dimensional manifold within the overall $D$-dimensional
space.⁴ However, since we assume that the neuron's input-output function depends on a small number of relevant dimensions, the ensemble of stimuli conditional on a spike may exhibit clear clustering. This makes the proposed method of looking for the RS complementary to the clustering of stimuli conditional on a spike done in the information bottleneck method (Tishby, Pereira, & Bialek, 1999; see also Dimitrov & Miller, 2001). Non-information-based measures of similarity between the probability distributions $P(\mathbf{s})$ and $P(\mathbf{s}|\mathrm{spike})$ have also been proposed to find the RS (Paninski, 2003a). To summarize our assumptions:

• The sampling of the probability distribution of stimuli $P(\mathbf{s})$ is ergodic and stationary across repetitions. The probability distribution is not assumed to be gaussian. The ensemble of stimuli described by $P(\mathbf{s})$ does not have to lie on a low-dimensional manifold embedded in the overall $D$-dimensional space.

• We choose a single spike as the response of interest (for illustration purposes only). An identical scheme can be applied, for example, to particular interspike intervals or to synchronous spikes from a pair of neurons.

• The subspace relevant for generating a spike is low dimensional and Euclidean (cf. equation 1.1).

• The input-output function, which is defined within the low-dimensional RS, can be arbitrarily nonlinear. It is obtained experimentally by sampling the probability distributions $P(\mathbf{s})$ and $P(\mathbf{s}|\mathrm{spike})$ within the RS.

The article is organized as follows. In section 2 we discuss how an optimization problem can be formulated to find the RS. A particular algorithm used to implement the optimization scheme is described in section 3. In section 4 we illustrate how the optimization scheme works with natural stimuli for model orientation-sensitive cells with one and two relevant dimensions, much like the simple and complex cells found in primary visual cortex, as well as for a model auditory neuron responding to natural sounds.
We also discuss the convergence of our estimates of the RS as a function of data set size. We emphasize that our optimization scheme does not rely on any specific statistical properties of the stimulus ensemble and thus can be used with natural stimuli.

Footnote 4: If one suspects that neurons are sensitive to low-dimensional features of their input, one might be tempted to analyze neural responses to stimuli that explore only the (putative) relevant subspace, perhaps along the lines of the subspace reverse correlation method (Ringach et al., 1997). Our approach (like the spike-triggered covariance approach) is different because it allows the analysis of responses to stimuli that live in the full space, and instead we let the neuron "tell us" which low-dimensional subspace is relevant.
2 Information as an Objective Function

When analyzing neural responses, we compare the a priori probability distribution of all presented stimuli with the probability distribution of stimuli that lead to a spike (de Ruyter van Steveninck & Bialek, 1988). For gaussian signals, the probability distribution can be characterized by its second moment, the covariance matrix. However, an ensemble of natural stimuli is not gaussian, so that in a general case, neither the second nor any finite number of moments is sufficient to describe the probability distribution. In this situation, Shannon information provides a rigorous way of comparing two probability distributions. The average information carried by the arrival time of one spike is given by (Brenner, Strong, et al., 2000)

$$ I_{\mathrm{spike}} = \int d\mathbf{s}\, P(\mathbf{s}|\mathrm{spike}) \log_2\!\left[\frac{P(\mathbf{s}|\mathrm{spike})}{P(\mathbf{s})}\right], \qquad (2.1) $$
where $d\mathbf{s}$ denotes integration over the full $D$-dimensional stimulus space. The information per spike as written in equation 2.1 is difficult to estimate experimentally, since it requires either sampling of the high-dimensional probability distribution $P(\mathbf{s}|\mathrm{spike})$ or a model of how spikes were generated, that is, knowledge of the low-dimensional RS. However, it is possible to calculate $I_{\mathrm{spike}}$ in a model-independent way if stimuli are presented multiple times to estimate the probability distribution $P(\mathrm{spike}|\mathbf{s})$. Then,

$$ I_{\mathrm{spike}} = \left\langle \frac{P(\mathrm{spike}|\mathbf{s})}{P(\mathrm{spike})} \log_2\!\left[\frac{P(\mathrm{spike}|\mathbf{s})}{P(\mathrm{spike})}\right] \right\rangle_{\mathbf{s}}, \qquad (2.2) $$
where the average is taken over all presented stimuli. This can be useful in practice (Brenner, Strong, et al., 2000), because we can replace the ensemble average $\langle\cdot\rangle_{\mathbf{s}}$ with a time average and $P(\mathrm{spike}|\mathbf{s})$ with the time-dependent spike rate $r(t)$. Note that for a finite data set of $N$ repetitions, the obtained value $I_{\mathrm{spike}}(N)$ will on average be larger than $I_{\mathrm{spike}}(\infty)$. The true value $I_{\mathrm{spike}}$ can be found either by subtracting an expected bias value, which is of the order of $\sim 1/(2\,P(\mathrm{spike})\,N \ln 2)$ (Treves & Panzeri, 1995; Panzeri & Treves, 1996; Pola, Schultz, Petersen, & Panzeri, 2002; Paninski, 2003b), or by extrapolating to $N \to \infty$ (Brenner, Strong, et al., 2000; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). Measurement of $I_{\mathrm{spike}}$ in this way provides a model-independent benchmark against which we can compare any description of the neuron's input-output relation. Our assumption is that spikes are generated according to a projection onto a low-dimensional subspace. Therefore, to characterize the relevance of a particular direction $v$ in the stimulus space, we project all of the presented stimuli onto $v$ and form the probability distributions $P_v(x)$ and $P_v(x|\mathrm{spike})$ of projection values $x$ for the a priori stimulus ensemble and that conditional
on a spike, respectively:

$$ P_v(x) = \langle \delta(x - \mathbf{s}\cdot v) \rangle_{\mathbf{s}}, \qquad (2.3) $$

$$ P_v(x|\mathrm{spike}) = \langle \delta(x - \mathbf{s}\cdot v)\,|\,\mathrm{spike} \rangle_{\mathbf{s}}, \qquad (2.4) $$
where $\delta(x)$ is a delta function. In practice, both the average $\langle\cdots\rangle_{\mathbf{s}} \equiv \int d\mathbf{s} \cdots P(\mathbf{s})$ over the a priori stimulus ensemble and the average $\langle\cdots|\mathrm{spike}\rangle_{\mathbf{s}} \equiv \int d\mathbf{s} \cdots P(\mathbf{s}|\mathrm{spike})$ over the ensemble conditional on a spike are calculated by binning the range of projection values $x$. The probability distributions are then obtained as histograms, normalized in such a way that the sum over all bins gives 1. The mutual information between spike arrival times and the projection $x$, by analogy with equation 2.1, is

$$ I(v) = \int dx\, P_v(x|\mathrm{spike}) \log_2\!\left[\frac{P_v(x|\mathrm{spike})}{P_v(x)}\right], \qquad (2.5) $$
which is also the Kullback-Leibler divergence $D[P_v(x|\mathrm{spike})\,\|\,P_v(x)]$. Notice that this information is a function of the direction $v$. The information $I(v)$ provides an invariant measure of how much the occurrence of a spike is determined by the projection on the direction $v$. It is a function only of direction in the stimulus space and does not change when the vector $v$ is multiplied by a constant. This can be seen by noting that for any probability distribution and any constant $c$, $P_{cv}(x) = c^{-1} P_v(x/c)$ (see also theorem 9.6.4 of Cover & Thomas, 1991). When evaluated along any vector $v$, $I(v) \le I_{\mathrm{spike}}$. The total information $I_{\mathrm{spike}}$ can be recovered along one particular direction only if $v = \hat e_1$ and only if the RS is one-dimensional. By analogy with equation 2.5, one could also calculate the information $I(v_1, \ldots, v_n)$ along a set of several directions $\{v_1, \ldots, v_n\}$ based on the multipoint probability distributions of projection values $x_1, x_2, \ldots, x_n$ along the vectors $v_1, v_2, \ldots, v_n$ of interest:

$$ P_{v_1,\ldots,v_n}(\{x_i\}|\mathrm{spike}) = \left\langle \prod_{i=1}^{n} \delta(x_i - \mathbf{s}\cdot v_i)\,\Big|\,\mathrm{spike} \right\rangle_{\mathbf{s}}, \qquad (2.6) $$

$$ P_{v_1,\ldots,v_n}(\{x_i\}) = \left\langle \prod_{i=1}^{n} \delta(x_i - \mathbf{s}\cdot v_i) \right\rangle_{\mathbf{s}}. \qquad (2.7) $$
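The objective $I(v)$ of equation 2.5 can be computed directly from binned projections. A minimal sketch follows; the function name, binning choices, and regularization by dropping empty bins are ours, not the article's.

```python
import numpy as np

def info_along(v, S, spikes, n_bins=32):
    """I(v) of equation 2.5: the Kullback-Leibler divergence between the
    binned distributions P_v(x|spike) and P_v(x) of projections x = s . v.
    Histograms are normalized so their bins sum to 1, as in the text."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)         # I(v) is invariant to the length of v
    x = S @ v
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    P, _ = np.histogram(x, bins=edges)
    P_sp, _ = np.histogram(x[spikes], bins=edges)
    P = P / P.sum()
    P_sp = P_sp / P_sp.sum()
    good = (P_sp > 0) & (P > 0)       # empty bins contribute nothing
    return float(np.sum(P_sp[good] * np.log2(P_sp[good] / P[good])))
```

For a model cell whose spiking depends only on $\mathbf{s}\cdot\hat e_1$ and uncorrelated stimuli, $I(\hat e_1)$ approaches $I_{\mathrm{spike}}$ while $I(v)$ along a random direction stays near zero; maximizing this quantity over $v$ is the core of the method.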
If we are successful in finding all of the directions $\hat e_i$, $1 \le i \le K$, contributing to the input-output relation, equation 1.1, then the information evaluated in this subspace will be equal to the total information $I_{\mathrm{spike}}$. When we calculate information along a set of $K$ vectors that are slightly off from the RS, the answer, of course, is smaller than $I_{\mathrm{spike}}$ and is initially quadratic in the small deviations $\delta v_i$. One can therefore hope to find the RS by maximizing information with respect to $K$ vectors simultaneously. The information does
not increase if more vectors outside the RS are included. For uncorrelated stimuli, any vector or set of vectors that maximizes $I(v)$ belongs to the RS. On the other hand, as discussed in appendix B, the result of optimization with respect to a number of vectors $k < K$ may deviate from the RS if stimuli are correlated. To find the RS, we first maximize $I(v)$ and compare this maximum with $I_{\mathrm{spike}}$, which is estimated according to equation 2.2. If the difference exceeds that expected from finite sampling corrections, we increment the number of directions with respect to which information is simultaneously maximized.

3 Optimization Algorithm

In this section, we describe a particular algorithm we used to look for the most informative dimensions in order to find the relevant subspace. We make no claim that our choice of algorithm is the most efficient. However, it does give reproducible results for different starting points and for spike trains with differences taken to simulate neural noise. Overall, the choices for an algorithm are broad because the information $I(v)$ as defined by equation 2.5 is a continuous function, whose gradient can be computed. We find (see appendix C for a derivation)

$$ \nabla_v I = \int dx\, P_v(x) \left[ \langle \mathbf{s}|x, \mathrm{spike}\rangle - \langle \mathbf{s}|x\rangle \right] \cdot \frac{d}{dx}\!\left[\frac{P_v(x|\mathrm{spike})}{P_v(x)}\right], \qquad (3.1) $$

where

$$ \langle \mathbf{s}|x, \mathrm{spike}\rangle = \frac{1}{P(x|\mathrm{spike})} \int d\mathbf{s}\, \mathbf{s}\, \delta(x - \mathbf{s}\cdot v)\, P(\mathbf{s}|\mathrm{spike}), \qquad (3.2) $$
and similarly for $\langle \mathbf{s}|x\rangle$. Since the information does not change with the length of the vector, we have $v \cdot \nabla_v I = 0$, as can also be seen directly from equation 3.1. As an optimization algorithm, we have used a combination of gradient ascent and simulated annealing: successive line maximizations were done along the direction of the gradient (Press, Teukolsky, Vetterling, & Flannery, 1992). During line maximizations, a point with a smaller value of information was accepted according to Boltzmann statistics, with probability $\propto \exp[(I(v_{i+1}) - I(v_i))/T]$. The effective temperature $T$ is reduced by a factor of $1 - \epsilon_T$ upon completion of each line maximization. The parameters of the simulated annealing algorithm to be adjusted are the starting temperature $T_0$ and the cooling rate $\epsilon_T$, $\Delta T = -\epsilon_T T$. When maximizing with respect to one vector, we used the values $T_0 = 1$ and $\epsilon_T = 0.05$. When maximizing with respect to two vectors, we either used the cooling schedule with $\epsilon_T = 0.005$ and repeated it several times (four times in our case) or allowed the effective temperature $T$ to increase by a factor of 10 upon convergence to a local maximum (keeping $T \le T_0$ always), while limiting the total number of line maximizations.
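The annealed ascent described above can be sketched as follows. A toy objective stands in for $I(v)$, and random perturbations stand in for the gradient-directed line maximizations, so only the Boltzmann acceptance rule and the cooling schedule ($T_0 = 1$, $\epsilon_T = 0.05$) are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy objective standing in for I(v): largest when v aligns with a hidden
# direction e1, and, like I(v), invariant to the length of v.
D = 10
e1 = rng.standard_normal(D)
e1 /= np.linalg.norm(e1)

def objective(v):
    v = v / np.linalg.norm(v)
    return (v @ e1) ** 2

T, eps_T = 1.0, 0.05
v = rng.standard_normal(D)
I_cur = objective(v)
for _ in range(2000):
    # The article steps along the gradient of I; a random perturbation
    # stands in for that line maximization here.
    v_new = v + 0.1 * rng.standard_normal(D)
    I_new = objective(v_new)
    # Boltzmann acceptance: always take improvements, occasionally accept
    # a worse point with probability exp((I_new - I_cur) / T).
    if I_new >= I_cur or rng.random() < np.exp((I_new - I_cur) / T):
        v, I_cur = v_new, I_new
    T *= 1.0 - eps_T                  # cooling: Delta T = -eps_T * T
```

The high-temperature phase lets the search escape poor starting points; as $T$ falls, the acceptance rule becomes effectively greedy and $v$ settles near the maximum.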
Figure 2: The probability distribution of information values in units of the total information per spike in the case of (a) uncorrelated binary noise stimuli, (b) correlated gaussian noise with the power spectrum of natural scenes, and (c) stimuli derived from natural scenes (patches of photos). The distribution was obtained by calculating information along $10^5$ random vectors for a model cell with one relevant dimension. Note the different scales in the two panels.
The problem of maximizing a function often is related to the problem of making a good initial guess. It turns out, however, that the choice of a starting point is much less crucial in cases where the stimuli are correlated. To illustrate this point, we plot in Figure 2 the probability distribution of information along random directions $v$ for both white noise and naturalistic stimuli in a model with one relevant dimension. For uncorrelated stimuli, not only is the information equal to zero for a vector that is perpendicular to the relevant subspace, but in addition, the derivative is equal to zero. Since a randomly chosen vector has on average a small projection on the relevant subspace (compared to its length), $v_r/|v| \sim \sqrt{n/d}$, the corresponding information can be found by expanding in $v_r/|v|$:

$$ I \approx \frac{v_r^2}{2|v|^2} \int dx\, P_{\hat e_{ir}}(x) \left( \frac{P'_{\hat e_{ir}}(x)}{P_{\hat e_{ir}}(x)} \right)^2 \left[ \langle s_{\hat e_r}|\mathrm{spike}\rangle - \langle s_{\hat e_r}\rangle \right]^2, \qquad (3.3) $$
where the vector $v = v_r \hat e_r + v_{ir} \hat e_{ir}$ is decomposed into its components inside and outside the RS, respectively. The average information for a random vector is therefore $\sim \langle v_r^2\rangle/|v|^2 = K/D$. In cases where stimuli are drawn from a gaussian ensemble with correlations, the expression for the information values has a structure similar to equation 3.3. To see this, we transform to Fourier space and normalize each Fourier component by the square root of the power spectrum $S(k)$.
In this new basis, both the vectors $\{\hat e_i\}$, $1 \le i \le K$, forming the RS and the randomly chosen vector $v$ along which information is being evaluated are to be multiplied by $\sqrt{S(k)}$. Thus, if we now substitute for the dot product $v_r^2$ the convolution weighted by the power spectrum, $\sum_i^K (v * \hat e_i)^2$, where

$$ v * \hat e_i = \frac{\sum_k v(k)\,\hat e_i(k)\,S(k)}{\sqrt{\sum_k v^2(k)\,S(k)}\,\sqrt{\sum_k \hat e_i^2(k)\,S(k)}}, \qquad (3.4) $$
then equation 3.3 will describe the information values along randomly chosen vectors $v$ for correlated gaussian stimuli with the power spectrum $S(k)$. Although both $v_r$ and $v(k)$ are gaussian variables with variance $\sim 1/D$, the weighted convolution not only has a much larger variance but is also strongly nongaussian (the nongaussian character is due to the normalizing factor $\sum_k v^2(k) S(k)$ in the denominator of equation 3.4). As for the variance, it can be estimated as $\langle (v * \hat e_1)^2 \rangle = 4\pi/\ln^2 D$ in cases where stimuli are taken as patches of correlated gaussian noise with the two-dimensional power spectrum $S(k) = A/k^2$. The large values of the weighted dot product $v * \hat e_i$, $1 \le i \le K$, result not only in significant information values along a randomly chosen vector, but also in large magnitudes of the derivative $\nabla I$, which is no longer dominated by noise, contrary to the case of uncorrelated stimuli. In the end, we find that randomly choosing one of the presented frames as a starting guess is sufficient.

4 Results

We tested the scheme of looking for the most informative dimensions on model neurons that respond to stimuli derived from natural scenes and sounds. As visual stimuli, we used scans across natural scenes, which were taken as black and white photos digitized to 8 bits, with no corrections made for the camera's light-intensity transformation function. Some statistical properties of the stimulus set are shown in Figure 3. Qualitatively, they reproduce the known results on the statistics of natural scenes (Ruderman & Bialek, 1994; Ruderman, 1994; Dong & Atick, 1995; Simoncelli & Olshausen, 2001). The most important properties for this study are the strong spatial correlations, as evident from the power spectrum $S(k)$ plotted in Figure 3b, and the deviations of the probability distribution from a gaussian one.
The nongaussian character can be seen in Figure 3c, where the probability distribution of intensities is shown, and in Figure 3d, which shows the distribution of projections on a Gabor filter (in what follows, the units of projections, such as $s_1$, will be given in units of the corresponding standard deviations). Our goal is to demonstrate that although the correlations present in the ensemble are nongaussian, they can be removed successfully from the estimate of the vectors defining the RS.
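The power-spectrum-weighted overlap of equation 3.4 can be evaluated in Fourier space. In the sketch below, a $1/k^2$ spectrum stands in for the natural-scene power spectrum of Figure 3b, and the two random vectors are invented stand-ins for $v$ and $\hat e_1$.

```python
import numpy as np

rng = np.random.default_rng(6)

# One-dimensional toy setting: D-sample signals, spectrum S(k) = A / k^2.
D = 256
k = np.fft.rfftfreq(D, d=1.0)[1:]           # drop the k = 0 component
S_k = 1.0 / k**2

def weighted_overlap(v, e, S_k):
    """Normalized, S(k)-weighted overlap v * e of equation 3.4."""
    vk = np.fft.rfft(v)[1:]
    ek = np.fft.rfft(e)[1:]
    num = np.sum((vk * np.conj(ek)).real * S_k)
    den = np.sqrt(np.sum(np.abs(vk) ** 2 * S_k) * np.sum(np.abs(ek) ** 2 * S_k))
    return num / den

e1 = rng.standard_normal(D)
v = rng.standard_normal(D)
```

By the Cauchy-Schwarz inequality the overlap lies in $[-1, 1]$ and equals 1 for $v = \hat e_1$; with a steep spectrum its typical magnitude for a random $v$ is much larger than that of the unweighted dot product, which is the point made in the text.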
Figure 3: Statistical properties of the visual stimulus ensemble. (a) One of the photos. Stimuli would be $30 \times 30$ patches taken from the overall photograph. (b) The power spectrum, in units of the light intensity variance $\sigma^2(I)$, averaged over orientation as a function of the dimensionless wave vector $ka$, where $a$ is the pixel size. (c) The probability distribution of light intensity in units of $\sigma(I)$. (d) The probability distribution of projections between stimuli and a Gabor filter, also in units of the corresponding standard deviation $\sigma(s_1)$.
4.1 A Model Simple Cell. Our first example is based on the properties of simple cells found in the primary visual cortex. A model phase- and orientation-sensitive cell has a single relevant dimension $\hat e_1$, shown in Figure 4a. A given stimulus $\mathbf{s}$ leads to a spike if the projection $s_1 = \mathbf{s}\cdot\hat e_1$ reaches a threshold value $\theta$ in the presence of noise:

$$ \frac{P(\mathrm{spike}|\mathbf{s})}{P(\mathrm{spike})} \equiv f(s_1) = \langle H(s_1 - \theta + \xi) \rangle, \qquad (4.1) $$

where a gaussian random variable $\xi$ of variance $\sigma^2$ models additive noise, and the function $H(x) = 1$ for $x > 0$, and zero otherwise. Together with the RF $\hat e_1$, the parameters $\theta$ for the threshold and the noise variance $\sigma^2$ determine the input-output function.
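This spike-generation rule can be simulated directly. In the sketch below, a random filter stands in for the Gabor-like RF of Figure 4a; $\theta$ and $\sigma$ take the values quoted in the Figure 4 caption, in units of the standard deviation of $s_1$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Equation 4.1: a spike occurs when s1 = s . e1 exceeds the threshold
# theta in additive gaussian noise xi of standard deviation sigma.
D = 400                                     # e.g., 20 x 20 pixel patches
e1 = rng.standard_normal(D)                 # stand-in for the RF of Figure 4a
e1 /= np.linalg.norm(e1)
theta, sigma = 1.84, 0.31                   # values from the Figure 4 caption

S = rng.standard_normal((20_000, D))
s1 = S @ e1
s1 /= s1.std()                              # projections in units of std(s1)
xi = sigma * rng.standard_normal(len(s1))
spikes = (s1 - theta + xi) > 0              # H(s1 - theta + xi)
```

Averaging over the noise $\xi$ smooths the hard threshold $H$ into the sigmoidal $f(s_1)$ plotted as the solid line in Figure 4f.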
Figure 4: Analysis of a model simple cell with the RF shown in (a). (b) The "exact" spike-triggered average $v_{\mathrm{sta}}$. (c) An attempt to remove correlations according to the reverse correlation method, $C^{-1}_{a\,priori} v_{\mathrm{sta}}$. (d) The normalized vector $\hat v_{\max}$ found by maximizing information. (e) Convergence of the algorithm according to the information $I(v)$ and the projection $\hat v \cdot \hat e_1$ between normalized vectors as a function of the inverse effective temperature $T^{-1}$. (f) The probability of a spike $P(\mathrm{spike}|\mathbf{s}\cdot\hat v_{\max})$ (crosses) is compared to $P(\mathrm{spike}|s_1)$ used in generating spikes (solid line). Parameters of the model are $\sigma = 0.31$ and $\theta = 1.84$, both given in units of the standard deviation of $s_1$, which is also the unit for the x-axis in (f).
When the spike-triggered average (STA), or reverse correlation function, is computed from the responses to correlated stimuli, the resulting vector will be broadened due to the spatial correlations present in the stimuli (see Figure 4b). For stimuli that are drawn from a gaussian probability distribution, the effects of correlations could be removed by multiplying $v_{\mathrm{sta}}$ by the inverse of the a priori covariance matrix, according to the reverse correlation method, $\hat v_{\mathrm{Gaussian\ est}} \propto C^{-1}_{a\,priori} v_{\mathrm{sta}}$, equation A.2. However, this procedure tends to amplify noise. To separate errors due to neural noise from those
due to the nongaussian character of the correlations, note that in a model, the effect of neural noise on our estimate of the STA can be eliminated by averaging the presented stimuli weighted with the exact firing rate, as opposed to using a histogram of responses to estimate $P(\mathrm{spike}|\mathbf{s})$ from a finite set of trials. We have used this "exact" STA,

$$ v_{\mathrm{sta}} = \int d\mathbf{s}\, \mathbf{s}\, P(\mathbf{s}|\mathrm{spike}) = \frac{1}{P(\mathrm{spike})} \int d\mathbf{s}\, P(\mathbf{s})\, \mathbf{s}\, P(\mathrm{spike}|\mathbf{s}), \qquad (4.2) $$
in the calculations presented in Figures 4b and 4c. Even with this noiseless STA (the equivalent of collecting an infinite data set), the standard decorrelation procedure is not valid for nongaussian stimuli and nonlinear input-output functions, as discussed in detail in appendix A. The result of such a decorrelation in our example is shown in Figure 4c. It clearly is missing some of the structure in the model filter, with projection $\hat e_1 \cdot \hat v_{\mathrm{Gaussian\ est}} \approx 0.14$. The discrepancy is not due to neural noise or finite sampling, since the "exact" STA was decorrelated; the absence of noise in the exact STA also means that there would be no justification for smoothing the results of the decorrelation. The discrepancy between the true RF and the decorrelated STA increases with the strength of the nonlinearity in the input-output function. In contrast, it is possible to obtain a good estimate of the relevant dimension $\hat e_1$ by maximizing information directly (see Figure 4d). A typical progress of the simulated annealing algorithm with decreasing temperature $T$ is shown in Figure 4e. There we plot both the information along the vector and its projection on $\hat e_1$. We note that while the information $I$ remains almost constant, the value of the projection continues to improve. Qualitatively, this is because the probability distributions depend exponentially on information. The final value of the projection depends on the size of the data set, as discussed below. In the example shown in Figure 4, there were $\approx 50{,}000$ spikes with an average probability of spike $\approx 0.05$ per frame, and the reconstructed vector has a projection $\hat v_{\max} \cdot \hat e_1 = 0.920 \pm 0.006$. Having estimated the RF, one can proceed to sample the nonlinear input-output function. This is done by constructing histograms for $P(\mathbf{s}\cdot\hat v_{\max})$ and $P(\mathbf{s}\cdot\hat v_{\max}|\mathrm{spike})$ of projections onto the vector $\hat v_{\max}$ found by maximizing information and taking their ratio, as in equation 1.2.
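The "exact" STA of equation 4.2 and its decorrelation can be sketched numerically. The AR(1) correlated gaussian ensemble and the sigmoidal rate below are invented for illustration; the point of the sketch is that for gaussian (though correlated) stimuli, multiplying by the inverse a priori covariance does recover the filter direction, whereas the text shows that this step fails for natural, nongaussian stimuli.

```python
import numpy as np

rng = np.random.default_rng(4)

D = 50
e1 = rng.standard_normal(D)                 # hypothetical model filter
e1 /= np.linalg.norm(e1)

# Correlated gaussian stimuli: an AR(1) process across pixels.
rho = 0.7
Z = rng.standard_normal((100_000, D))
S = np.empty_like(Z)
S[:, 0] = Z[:, 0]
for i in range(1, D):
    S[:, i] = rho * S[:, i - 1] + np.sqrt(1.0 - rho**2) * Z[:, i]

# Equation 4.2: weight the stimuli by the exact rate P(spike|s) (here an
# invented sigmoid), eliminating neural noise from the STA estimate.
rate = 1.0 / (1.0 + np.exp(-(S @ e1)))
v_sta = (rate[:, None] * S).mean(0) / rate.mean()
v_dec = np.linalg.solve(np.cov(S, rowvar=False), v_sta)   # decorrelated STA
```

Here `v_sta` itself is broadened along the stimulus correlations (it is approximately $C\hat e_1$ up to a scalar), while `v_dec` points along $\hat e_1$ for this gaussian ensemble.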
In Figure 4f, we compare $P(\mathrm{spike}|\mathbf{s}\cdot\hat v_{\max})$ (crosses) with the probability $P(\mathrm{spike}|s_1)$ used in the model (solid line).

4.2 Estimated Deviation from the Optimal Dimension. When information is calculated from a finite data set, the (normalized) vector $\hat v$ which maximizes $I$ will deviate from the true RF $\hat e_1$. The deviation $\delta v = \hat v - \hat e_1$ arises because the probability distributions are estimated from experimental histograms and differ from the distributions found in the limit of infinite data size. For a simple cell, the quality of reconstruction can be characterized by the projection $\hat v \cdot \hat e_1 = 1 - \frac{1}{2}\delta v^2$, where both $\hat v$ and $\hat e_1$ are normalized and $\delta v$ is by definition orthogonal to $\hat e_1$. The deviation $\delta v \sim A^{-1}\nabla I$, where $A$ is
the Hessian of the information. Its structure is similar to that of a covariance matrix:

$$ A_{ij} = \frac{1}{\ln 2} \int dx\, P(x|\mathrm{spike}) \left( \frac{d}{dx} \ln \frac{P(x|\mathrm{spike})}{P(x)} \right)^2 \left( \langle s_i s_j|x\rangle - \langle s_i|x\rangle\langle s_j|x\rangle \right). \qquad (4.3) $$

When averaged over the possible outcomes of $N$ trials, the gradient of the information is zero for the optimal direction. Here, in order to evaluate $\langle \delta v^2 \rangle = \mathrm{Tr}[A^{-1}\langle \nabla I\, \nabla I^T\rangle A^{-1}]$, we need to know the variance of the gradient of $I$. Assuming that the probability of generating a spike is independent for different bins, we can estimate $\langle \nabla I_i\, \nabla I_j \rangle \sim A_{ij}/(N_{\mathrm{spike}} \ln 2)$. Therefore, the expected error in the reconstruction of the optimal filter is inversely proportional to the number of spikes. The corresponding expected value of the projection between the reconstructed vector and the relevant direction $\hat e_1$ is given by

$$ \hat v \cdot \hat e_1 \approx 1 - \frac{1}{2}\langle \delta v^2\rangle = 1 - \frac{\mathrm{Tr}'[A^{-1}]}{2 N_{\mathrm{spike}} \ln 2}, \qquad (4.4) $$
where $\mathrm{Tr}'$ means that the trace is taken in the subspace orthogonal to the model filter.⁵ The estimate, equation 4.4, can be calculated without knowledge of the underlying model; it is $\sim D/(2 N_{\mathrm{spike}})$. This behavior should also hold in cases where the stimulus dimensions are expanded to include time. The errors are expected to increase in proportion to the increased dimensionality. In the case of a complex cell with two relevant dimensions, the error is expected to be twice that for a cell with a single relevant dimension, as also discussed in section 4.3. We emphasize that the error estimate according to equation 4.4 is of the same order as the errors of the reverse correlation method when it is applied to gaussian ensembles. The latter are given by $(\mathrm{Tr}[C^{-1}] - C^{-1}_{11})/[2 N_{\mathrm{spike}} \langle f'^2(s_1)\rangle]$. Of course, if the reverse correlation method were to be applied to the nongaussian ensemble, the errors would be larger. In Figure 5, we show the results of simulations for various numbers of trials, and therefore of $N_{\mathrm{spike}}$. The average projection of the normalized reconstructed vector $\hat v$ on the RF $\hat e_1$ behaves initially as $1/N_{\mathrm{spike}}$ (dashed line). For smaller data sets, in this case $N_{\mathrm{spike}} \lesssim 30{,}000$, corrections $\sim N_{\mathrm{spike}}^{-2}$ become important for estimating the expected errors of the algorithm. Happily, these corrections have a sign such that smaller data sets are more effective than one might have expected from the asymptotic calculation. This can be verified from the expansion $\hat v \cdot \hat e_1 = [1 + \delta v^2]^{-1/2} \approx 1 - \frac{1}{2}\langle \delta v^2\rangle + \frac{3}{8}\langle \delta v^4\rangle$, where only the first two terms were taken into account in equation 4.4.

Footnote 5: By definition, $\delta v_1 = \delta v \cdot \hat e_1 = 0$, and therefore $\langle \delta v_1^2\rangle \propto A^{-1}_{11}$ is to be subtracted from $\mathrm{Tr}[A^{-1}]$. Because $\hat e_1$ is an eigenvector of $A$ with zero eigenvalue, $A^{-1}_{11}$ is infinite. Therefore, the proper treatment is to take the trace in the subspace orthogonal to $\hat e_1$.
T. Sharpee, N. Rust, and W. Bialek
Figure 5: Projection of the vector $\hat v_{\max}$ that maximizes information on the RF $\hat e_1$, plotted as a function of the number of spikes. The solid line is a quadratic fit in $1/N_{\mathrm{spike}}$, and the dashed line is the leading linear term in $1/N_{\mathrm{spike}}$. This set of simulations was carried out for a model visual neuron with one relevant dimension from Figure 4a and the input-output function (see equation 4.1), with parameter values $\sigma \approx 0.61\sigma(s_1)$, $\theta \approx 0.61\sigma(s_1)$. For this model neuron, the linear approximation for the expected error is applicable for $N_{\mathrm{spike}} \gtrsim 30{,}000$.
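As a rough numerical illustration (ours, not from the paper's simulations), the leading-order scaling of equation 4.4 with the simplification $\mathrm{Tr}'[A^{-1}]/\ln 2 \sim D$ can be sketched as follows; the `n_dims` factor encodes the text's claim that the error roughly doubles for a cell with two relevant dimensions:

```python
def expected_projection(d_stim, n_spike, n_dims=1):
    """Leading-order estimate of the projection between the reconstructed
    vector and the true filter, v.e1 ~ 1 - D/(2*N_spike), per equation 4.4.

    This is an illustrative approximation only: the true prefactor depends
    on the Hessian A of the underlying model.
    """
    return 1.0 - n_dims * d_stim / (2.0 * n_spike)

# For D ~ 500 and N_spike ~ 30,000 the expected projection is above 0.99,
# and the expected error doubles for a two-dimensional complex cell.
simple_cell = expected_projection(500, 30_000)
complex_cell = expected_projection(500, 30_000, n_dims=2)
```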
4.3 A Model Complex Cell. A sequence of spikes from a model cell with two relevant dimensions was simulated by projecting each of the stimuli on vectors that differ by $\pi/2$ in their spatial phase, taken to mimic properties of complex cells, as in Figure 6. A particular frame leads to a spike according to a logical OR, that is, if either $|s_1|$ or $|s_2|$ exceeds a threshold value $\theta$ in the presence of noise, where $s_1 = s \cdot \hat e_1$ and $s_2 = s \cdot \hat e_2$. Similarly to equation 4.1,

\frac{P(\mathrm{spike}|s)}{P(\mathrm{spike})} = f(s_1, s_2) = \bigl\langle H(|s_1| - \theta - \xi_1) \vee H(|s_2| - \theta - \xi_2) \bigr\rangle, \qquad (4.5)

where $\xi_1$ and $\xi_2$ are independent gaussian variables. The sampling of this input-output function by our particular set of natural stimuli is shown in Figure 6c. As is well known, reverse correlation fails in this case because the spike-triggered average stimulus is zero, although with gaussian stimuli, the spike-triggered covariance method would recover the relevant dimensions (Touryan et al., 2002). Here we show that searching for maximally informative dimensions allows us to recover the relevant subspace even under more natural stimulus conditions.
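A minimal sketch of this spike-generation rule, using synthetic gaussian stimuli as a stand-in for the natural ensemble and hypothetical parameter values of our choosing, might look like:

```python
import numpy as np

def or_cell_spikes(stimuli, e1, e2, theta, noise_sd, rng):
    """Spikes from the two-filter OR model of equation 4.5: a frame spikes
    if |s.e1| or |s.e2| exceeds the threshold theta, each comparison
    corrupted by an independent gaussian variable xi."""
    s1 = stimuli @ e1
    s2 = stimuli @ e2
    xi1 = rng.normal(0.0, noise_sd, size=len(stimuli))
    xi2 = rng.normal(0.0, noise_sd, size=len(stimuli))
    return (np.abs(s1) > theta + xi1) | (np.abs(s2) > theta + xi2)

rng = np.random.default_rng(0)
stimuli = rng.normal(size=(20_000, 16))   # synthetic stand-in stimuli
e1 = np.eye(16)[0]                        # two orthogonal model filters
e2 = np.eye(16)[1]
spikes = or_cell_spikes(stimuli, e1, e2, theta=1.0, noise_sd=0.3, rng=rng)

# The spike-triggered average vanishes by symmetry of |s1| and |s2|,
# which is why plain reverse correlation fails for this model.
sta = stimuli[spikes].mean(axis=0)
```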
Figure 6: Analysis of a model complex cell with relevant dimensions $\hat e_1$ and $\hat e_2$ shown in (a) and (b), respectively. Spikes are generated according to an "OR" input-output function $f(s_1, s_2)$ with threshold $\theta \approx 0.61\sigma(s_1)$ and noise standard deviation $\sigma = 0.31\sigma(s_1)$. (c, d) Vectors $v_1$ and $v_2$ found by maximizing information $I(v_1, v_2)$.
We start by maximizing information with respect to one vector. Contrary to the result in Figure 4e for a simple cell, one optimal dimension recovers only about 60% of the total information per spike (see equation 2.2). Perhaps surprisingly, because of the strong correlations in natural scenes, even a projection onto a random vector in the $D \sim 10^3$-dimensional stimulus space has a high probability of explaining 60% of total information per spike, as can be seen in Figure 2. We therefore go on to maximize information with respect to two vectors. As a result of maximization, we obtain two vectors $v_1$ and $v_2$, shown in Figure 6. The information along them is $I(v_1, v_2) \approx 0.90$, which is within the range of information values obtained along different linear combinations of the two model vectors, $I(\hat e_1, \hat e_2)/I_{\mathrm{spike}} = 0.89 \pm 0.11$. Therefore, the description of the neuron's firing in terms of vectors $v_1$ and $v_2$ is complete up to the noise level, and we do not have to look for extra relevant dimensions. Practically, the number of relevant dimensions can be determined by comparing $I(v_1, v_2)$ to either $I_{\mathrm{spike}}$ or $I(v_1, v_2, v_3)$, the latter being the result of maximization with respect to three vectors simultaneously. As mentioned in section 1, information along a set of vectors does not increase when extra dimensions are added to the relevant subspace. Therefore, if $I(v_1, v_2) = I(v_1, v_2, v_3)$ (again, up to the noise level), this means that
there are only two relevant dimensions. Using $I_{\mathrm{spike}}$ for comparison with $I(v_1, v_2)$ has the advantage of not having to look for an extra dimension, which can be computationally intensive. However, $I_{\mathrm{spike}}$ might be subject to larger systematic bias errors than $I(v_1, v_2, v_3)$. Vectors $v_1$ and $v_2$ obtained by maximizing $I(v_1, v_2)$ are not exactly orthogonal and are also rotated with respect to $\hat e_1$ and $\hat e_2$. However, the quality of reconstruction, as well as the value of information $I(v_1, v_2)$, is independent of the particular choice of basis within the relevant subspace. The appropriate measure of similarity between the two planes is the dot product of their normals. In the example of Figure 6, $\hat n_{(\hat e_1, \hat e_2)} \cdot \hat n_{(v_1, v_2)} = 0.82 \pm 0.07$, where $\hat n_{(\hat e_1, \hat e_2)}$ is a normal to the plane passing through vectors $\hat e_1$ and $\hat e_2$. Maximizing information with respect to two dimensions requires a significantly slower cooling rate and, consequently, longer computational times. However, the expected error in the reconstruction, $1 - \hat n_{(\hat e_1, \hat e_2)} \cdot \hat n_{(v_1, v_2)}$, scales as $1/N_{\mathrm{spike}}$, similar to equation 4.4, and is roughly twice that for a simple cell given the same number of spikes. We make vectors $v_1$ and $v_2$ orthogonal to each other upon completion of the algorithm.

4.4 A Model Auditory Neuron with One Relevant Dimension. Because stimuli $s$ are treated as vectors in an abstract space, the method of looking for the most informative dimensions can be applied equally well to auditory and visual neurons. Here we illustrate the method by considering a model auditory neuron with one relevant dimension, which is shown in Figure 7c and is taken to mimic the properties of cochlear neurons. The model neuron is probed by two ensembles of naturalistic stimuli: one is a recording of a native Russian speaker reading a piece of Russian prose, and the other is a recording of a piece of English prose read by a native English speaker.
Both ensembles are nongaussian and exhibit amplitude distributions with long, nearly exponential tails (see Figure 7a), which are qualitatively similar to those of light intensities in natural scenes (Voss & Clarke, 1975; Ruderman, 1994). However, the power spectrum differs between the two cases, as can be seen in Figure 7b. The differences in the correlation structure in particular lead to different STAs across the two ensembles (cf. Figure 7d). Both of the STAs also deviate from the model filter shown in Figure 7c. Despite the differences in the probability distributions $P(s)$, it is possible to recover the relevant dimension of the model neuron by maximizing information. In Figure 7e we show the two most informative vectors found by running the algorithm for the two ensembles and replot the model filter from Figure 7c to show that the three vectors overlap almost perfectly. Thus, different nongaussian correlations can be successfully removed to obtain an estimate of the relevant dimension. If the most informative vector changes with the stimulus ensemble, this can be interpreted as adaptation to the probability distribution.
Figure 7: A model auditory neuron is probed by two natural ensembles of stimuli: a piece of English prose (1) and a piece of Russian prose (2). The size of the stimulus ensemble was the same in both cases, and the sampling rate was 44.1 kHz. (a) The probability distribution of the sound pressure amplitude in units of standard deviation for both ensembles is strongly nongaussian. (b) The power spectra for the two ensembles. (c) The relevant vector of the model neuron, of dimensionality $D = 500$. (d) The STA is broadened in both cases but differs between the two cases due to differences in the power spectra of the two ensembles. (e) Vectors that maximize information for either of the ensembles overlap almost perfectly with each other and with the model filter, which is also replotted here from c. (f) The probability of a spike $P(\mathrm{spike}|s \cdot \hat v_{\max})$ (crosses) is compared to $P(\mathrm{spike}|s_1)$ used in generating spikes (solid line). The input-output function had parameter values $\sigma \approx 0.9\sigma(s_1)$ and $\theta \approx 1.8\sigma(s_1)$.
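The nonparametric input-output estimate compared in panel f — the ratio of the spike-conditional histogram of projections to the prior histogram — can be sketched as follows (the function and parameter names are ours, not the authors', and a hard-threshold toy model stands in for the cochlear filter):

```python
import numpy as np

def io_function_estimate(stimuli, spikes, v, n_bins=20):
    """Estimate P(spike | s.v) as the fraction of frames in each
    projection bin that elicited a spike, i.e. the ratio of the
    spike-triggered histogram of s.v to its prior histogram."""
    x = stimuli @ v
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    counts, _ = np.histogram(x, bins=edges)
    spike_counts, _ = np.histogram(x[spikes], bins=edges)
    p = np.divide(spike_counts, counts,
                  out=np.zeros(n_bins), where=counts > 0)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, p

# Toy check: a deterministic threshold cell should yield an estimated
# input-output function near 0 below threshold and near 1 above it.
rng = np.random.default_rng(1)
stimuli = rng.normal(size=(50_000, 1))
v = np.array([1.0])
spikes = (stimuli @ v) > 1.0
centers, p = io_function_estimate(stimuli, spikes, v)
```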
5 Summary

Features of the stimulus that are most relevant for generating the response of a neuron can be found by maximizing information between the sequence of responses and the projection of stimuli on trial vectors within the stimulus space. Calculated in this manner, information becomes a function of
direction in stimulus space. Those vectors that maximize the information and account for the total information per response of interest span the relevant subspace. The method allows multiple dimensions to be found. The reconstruction of the relevant subspace is done without assuming a particular form of the input-output function. It can be strongly nonlinear within the relevant subspace and is estimated from experimental histograms for each trial direction independently. Most important, this method can be used with any stimulus ensemble, even one that is strongly nongaussian, as in the case of natural signals. We have illustrated the method on model neurons responding to natural scenes and sounds. We expect the current implementation of the method to be most useful when several most informative vectors ($\leq 10$, depending on their dimensionality) are to be analyzed for neurons probed by natural scenes. This technique could be particularly useful in describing sensory processing in poorly understood regions of higher-level sensory cortex (such as visual areas V2, V4, and IT and auditory cortex beyond A1), where white noise stimulation is known to be less effective than naturalistic stimuli.

Appendix A: Limitations of the Reverse Correlation Method

Here we examine what sort of deviations one can expect when applying the reverse correlation method to natural stimuli, even in the model with just one relevant dimension. Two factors, when combined, invalidate the reverse correlation method: the nongaussian character of correlations and the nonlinearity of the input-output function (Ringach, Sapiro, & Shapley, 1997). In its original formulation (de Boer & Kuyper, 1968), the neuron is probed by white noise, and the relevant dimension $\hat e_1$ is given by the STA, $\hat e_1 \propto \langle s\, r(s)\rangle$. If the signals are not white, that is, the covariance matrix $C_{ij} = \langle s_i s_j\rangle$ is not a unit matrix, then the STA is a broadened version of the original filter $\hat e_1$.
This can be seen by noting that for any function $F(s)$ of gaussian variables $\{s_i\}$, the following identity holds:

\langle s_i F(s)\rangle = \langle s_i s_j\rangle \langle \partial_j F(s)\rangle, \qquad \partial_j \equiv \partial_{s_j}. \qquad (A.1)

When property A.1 is applied to the vector components of the STA, $\langle s_i r(s)\rangle = C_{ij}\langle\partial_j r(s)\rangle$. Since we work within the assumption that the firing rate is a (nonlinear) function of the projection onto one filter $\hat e_1$, $r(s) = r(s_1)$, the latter average is proportional to the model filter itself, $\langle\partial_j r\rangle = \hat e_{1j}\langle r'(s_1)\rangle$. Therefore, we arrive at the prescription of the reverse correlation method,

\hat e_{1i} \propto [C^{-1}]_{ij} \langle s_j r(s)\rangle. \qquad (A.2)
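A small numerical check of equation A.2 (our own synthetic example: correlated gaussian stimuli with an arbitrary known covariance and a threshold nonlinearity) shows that multiplying the broadened STA by the inverse covariance recovers the model filter, as the gaussian identity guarantees:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8
A = rng.normal(size=(D, D))
C = A @ A.T / D + np.eye(D)                  # a known stimulus covariance
L = np.linalg.cholesky(C)
stimuli = rng.normal(size=(100_000, D)) @ L.T  # correlated gaussian stimuli

e1 = np.zeros(D)
e1[0] = 1.0                                  # model filter
rate = (stimuli @ e1 > 1.0).astype(float)    # nonlinear input-output function

sta = (stimuli * rate[:, None]).mean(axis=0)  # broadened STA, C e1 <r'>
v = np.linalg.solve(C, sta)                   # decorrelate, equation A.2
v /= np.linalg.norm(v)
# For gaussian stimuli, v aligns with e1 despite the nonlinearity.
```

The same decorrelation step applied to a nongaussian ensemble would leave the residual term of equation A.6 in place, which is the failure mode illustrated in Figure 8c.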
The gaussian property is necessary in order to represent the STA as a convolution of the covariance matrix $C_{ij}$ of the stimulus ensemble and the model
filter. To understand how the reconstructed vector obtained according to equation A.2 deviates from the relevant one, we consider weakly nongaussian stimuli, with the probability distribution

P_{nG}(s) = \frac{1}{Z} P_0(s)\, e^{\epsilon H_1(s)}, \qquad (A.3)
where $P_0(s)$ is the gaussian probability distribution with covariance matrix $C$, and the normalization factor is $Z = \langle e^{\epsilon H_1(s)}\rangle$. The function $H_1$ describes deviations of the probability distribution from gaussian, and therefore we will set $\langle s_i H_1\rangle = 0$ and $\langle s_i s_j H_1\rangle = 0$, since these averages can be accounted for in the gaussian ensemble. In what follows, we will keep only the first-order terms in the perturbation parameter $\epsilon$. Using property A.1, we find the STA to be given by

\langle s_i r\rangle_{nG} = \langle s_i s_j\rangle \bigl[ \langle\partial_j r\rangle + \epsilon \langle r\, \partial_j(H_1)\rangle \bigr], \qquad (A.4)
where averages are taken with respect to the gaussian distribution. Similarly, the covariance matrix $C_{ij}$ evaluated with respect to the nongaussian ensemble is given by

C_{ij} = \frac{1}{Z}\langle s_i s_j e^{\epsilon H_1}\rangle = \langle s_i s_j\rangle + \epsilon \langle s_i s_k\rangle \langle s_j \partial_k(H_1)\rangle, \qquad (A.5)
so that to first order in $\epsilon$, $\langle s_i s_j\rangle = C_{ij} - \epsilon C_{ik}\langle s_j \partial_k(H_1)\rangle$. Combining this with equation A.4, we get

\langle s_i r\rangle_{nG} = \mathrm{const} \times C_{ij}\hat e_{1j} + \epsilon C_{ij}\bigl\langle \bigl(r - s_1\langle r'\rangle\bigr)\, \partial_j(H_1)\bigr\rangle. \qquad (A.6)
The second term in equation A.6 prevents the application of the reverse correlation method to nongaussian signals. Indeed, if we multiply the STA, equation A.6, by the inverse of the a priori covariance matrix $C_{ij}$, according to the reverse correlation method, equation A.2, we no longer obtain the RF $\hat e_1$. The deviation of the obtained answer from the true RF increases with $\epsilon$, which measures the deviation of the probability distribution from gaussian. Since natural stimuli are known to be strongly nongaussian, this makes the use of reverse correlation problematic when analyzing neural responses to natural stimuli. The difference between applying reverse correlation to stimuli drawn from a correlated gaussian ensemble and applying it to a nongaussian one is illustrated in Figures 8b and 8c. In the first case, shown in Figure 8b, stimuli are drawn from a correlated gaussian ensemble with the covariance matrix equal to that of natural images. In the second case, shown in Figure 8c, patches of photos are taken as stimuli. The STA is broadened in both cases. Although the two-point correlations are just as strong in the case of gaussian stimuli
Figure 8: The nongaussian character of correlations present in natural scenes invalidates the reverse correlation method for neurons with a nonlinear input-output function. (a) A model visual neuron has one relevant dimension $\hat e_1$ and the nonlinear input-output function. The "exact" STA is used (see equation 4.2) to separate effects of neural noise from alterations introduced by the method. The decorrelated "exact" STA is obtained by multiplying the "exact" STA by the inverse of the covariance matrix, according to equation A.2. (b) Stimuli are taken from a correlated gaussian noise ensemble. The effect of correlations in the STA can be removed according to equation A.2. (c) When patches of photos are taken as stimuli for the same model neuron as in b, the decorrelation procedure gives an altered version of the model filter. The two stimulus ensembles have the same covariance matrix.
as they are in the natural stimuli ensemble, gaussian correlations can be successfully removed from the STA according to equation A.2 to obtain the model filter. On the contrary, an attempt to use reverse correlation with natural stimuli results in an altered version of the model filter. We reiterate that for this example, the apparent noise in the decorrelated vector is not
due to neural noise or finite data sets, since the "exact" STA has been used (see equation 4.2) in all calculations presented in Figures 8 and 9. The reverse correlation method gives the correct answer for any distribution of signals if the probability of generating a spike is a linear function of $s_i$, since then the second term in equation A.6 is zero. In particular, a linear input-output relation could arise due to neural noise whose variance is much larger than the variance of the signal itself. This point is illustrated in Figures 9a, 9b, and 9c, where the reverse correlation method is applied to a threshold input-output function at low, moderate, and high signal-to-noise ratios. For small signal-to-noise ratios, where the noise standard deviation is similar to that of the projections $s_1$, the threshold nonlinearity in the input-output function is masked by noise and is effectively linear. In this limit, reverse correlation can be applied with the exact STA. However, for an experimentally calculated STA at low signal-to-noise ratios, the decorrelation procedure results in strong noise amplification. At higher signal-to-noise ratios, decorrelation fails due to the nonlinearity of the input-output function, in accordance with equation A.6.

Appendix B: Maxima of I(v): What Do They Mean?

The relevant subspace of dimensionality $K$ can be found by maximizing information simultaneously with respect to $K$ vectors. Maximization with respect to a number of vectors smaller than the true dimensionality of the relevant subspace may produce vectors that have components in the irrelevant subspace. This happens only in the presence of correlations in stimuli. As an illustration, we consider the situation where the dimensionality of the relevant subspace is $K = 2$, and vector $\hat e_1$ describes the most informative direction within the relevant subspace.
We show here that although the gradient of information is perpendicular to both $\hat e_1$ and $\hat e_2$, it may have components outside the relevant subspace. Therefore, the vector $v_{\max}$ that corresponds to the maximum of $I(v)$ will lie outside the relevant subspace. We recall from equation 3.1 that

\nabla I(\hat e_1) = \int ds_1\, P(s_1)\, \frac{d}{ds_1}\!\left[\frac{P(s_1|\mathrm{spike})}{P(s_1)}\right] \bigl(\langle s|s_1, \mathrm{spike}\rangle - \langle s|s_1\rangle\bigr). \qquad (B.1)

We can rewrite the conditional averages as $\langle s|s_1\rangle = \int ds_2\, P(s_1, s_2)\langle s|s_1, s_2\rangle / P(s_1)$ and $\langle s|s_1, \mathrm{spike}\rangle = \int ds_2\, f(s_1, s_2) P(s_1, s_2)\langle s|s_1, s_2\rangle / P(s_1|\mathrm{spike})$, so that

\nabla I(\hat e_1) = \int ds_1 ds_2\, P(s_1, s_2)\, \langle s|s_1, s_2\rangle\, \frac{P(\mathrm{spike}|s_1, s_2) - P(\mathrm{spike}|s_1)}{P(\mathrm{spike})}\, \frac{d}{ds_1} \ln\!\left[\frac{P(s_1|\mathrm{spike})}{P(s_1)}\right]. \qquad (B.2)

Because we assume that the vector $\hat e_1$ is the most informative within the relevant subspace, $\hat e_1 \cdot \nabla I = \hat e_2 \cdot \nabla I = 0$, so that the integral in equation B.2
Figure 9: Application of the reverse correlation method to a model visual neuron with one relevant dimension $\hat e_1$ and a threshold input-output function, for decreasing values of the noise variance, $\sigma/\sigma(s_1) \approx 6.1,\ 0.61,\ 0.06$ in a, b, and c, respectively. The model $P(\mathrm{spike}|s_1)$ becomes effectively linear when the signal-to-noise ratio is small. Reverse correlation can be used together with natural stimuli if the input-output function is linear. Otherwise, the deviations between the decorrelated STA and the model filter increase with the nonlinearity of $P(\mathrm{spike}|s_1)$.
is zero for those directions in which the component of the vector $\langle s|s_1, s_2\rangle$ changes linearly with $s_1$ and $s_2$. For uncorrelated stimuli, this is true for all directions, so that the most informative vector within the relevant subspace is also the most informative in the overall stimulus space. In the presence of correlations, the gradient may have nonzero components along some irrelevant directions if the projection of the vector $\langle s|s_1, s_2\rangle$ on those directions is not a linear function of $s_1$ and $s_2$. By looking for a maximum of information, we will therefore be driven outside the relevant subspace. The deviation of $v_{\max}$ from the relevant subspace is also proportional to the
strength of the dependence on the second parameter $s_2$, because of the factor $[P(s_1, s_2|\mathrm{spike})/P(s_1, s_2) - P(s_1|\mathrm{spike})/P(s_1)]$ in the integrand.

Appendix C: The Gradient of Information

According to expression 2.5, the information $I(v)$ depends on the vector $v$ only through the probability distributions $P_v(x)$ and $P_v(x|\mathrm{spike})$. Therefore, we can express the gradient of information in terms of gradients of those probability distributions:

\nabla_v I = \frac{1}{\ln 2} \int dx \left[ \ln\frac{P_v(x|\mathrm{spike})}{P_v(x)}\, \nabla_v P_v(x|\mathrm{spike}) - \frac{P_v(x|\mathrm{spike})}{P_v(x)}\, \nabla_v P_v(x) \right], \qquad (C.1)
where we took into account that $\int dx\, P_v(x|\mathrm{spike}) = 1$ does not change with $v$. To find the gradients of the probability distributions, we note that

\nabla_v P_v(x) = \nabla_v\!\left[\int ds\, P(s)\,\delta(x - s \cdot v)\right] = -\int ds\, P(s)\, s\, \delta'(x - s \cdot v) = -\frac{d}{dx}\bigl[P_v(x)\langle s|x\rangle\bigr], \qquad (C.2)
and analogously for $P_v(x|\mathrm{spike})$:

\nabla_v P_v(x|\mathrm{spike}) = -\frac{d}{dx}\bigl[P_v(x|\mathrm{spike})\langle s|x, \mathrm{spike}\rangle\bigr]. \qquad (C.3)
Substituting expressions C.2 and C.3 into equation C.1 and integrating once by parts, we obtain

\nabla_v I = \int dx\, P_v(x)\, \bigl[\langle s|x, \mathrm{spike}\rangle - \langle s|x\rangle\bigr]\, \frac{d}{dx}\!\left[\frac{P_v(x|\mathrm{spike})}{P_v(x)}\right],

which is expression 3.1 of the main text.

Acknowledgments

We thank K. D. Miller for many helpful discussions. Work at UCSF was supported in part by the Sloan and Swartz foundations and by a training grant from the NIH. Our collaboration began at the Marine Biological Laboratory in a course supported by grants from NIMH and the Howard Hughes Medical Institute.
References

Agüera y Arcas, B., Fairhall, A. L., & Bialek, W. (2003). Computation in a single neuron: Hodgkin and Huxley revisited. Neural Comp., 15, 1715–1749.

Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783.

Barlow, H. (1961). Possible principles underlying the transformations of sensory images. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. (2001). Redundancy reduction revisited. Network: Comput. Neural Syst., 12, 241–253.

Bialek, W. (2002). Thinking about the brain. In H. Flyvbjerg, F. Jülicher, P. Ormos, & F. David (Eds.), Physics of biomolecules and cells (pp. 485–577). Berlin: Springer-Verlag. See also physics/0205030.^6

Bialek, W., & de Ruyter van Steveninck, R. R. (2003). Features and dimensions: Motion estimation in fly vision. Unpublished manuscript.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.

Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552. See also physics/9902067.

Chichilnisky, E. J. (2001). A simple white noise analysis of neuronal light responses. Network: Comput. Neural Syst., 12, 199–213.

Creutzfeldt, O. D., & Nothdurft, H. C. (1978). Representation of complex visual stimuli in the brain. Naturwissenschaften, 65, 307–318.

Cover, T. M., & Thomas, J. A. (1991). Information theory. New York: Wiley.

de Boer, E., & Kuyper, P. (1968). Triggered correlation. IEEE Trans. Biomed. Eng., 15, 169–179.

de Ruyter van Steveninck, R. R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. Lond. B, 265, 259–265.

de Ruyter van Steveninck, R. R., Borst, A., & Bialek, W. (2000). Real time encoding of motion: Answerable questions and questionable answers from the fly's visual system. In J. M. Zanker & J. Zeil (Eds.), Motion vision: Computational, neural and ecological constraints (pp. 279–306). New York: Springer-Verlag. See also physics/0004060.

de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808. See also cond-mat/9603127.

^6 Where available, we give references to the physics e-print archive, which may be found on-line at http://arxiv.org/abs/*/*; thus, Bialek (2002) is available on-line at http://arxiv.org/abs/physics/0205030. Published papers may differ from the versions posted to the archive.
Dimitrov, A. G., & Miller, J. P. (2001). Neural coding and decoding: Communication channels and quantization. Network: Comput. Neural Syst., 12, 441–472.

Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network: Comput. Neural Syst., 6, 345–358.

Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.

Kara, P., Reinagel, P., & Reid, R. C. (2000). Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 27, 635–646.

Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network: Comput. Neural Syst., 12, 317–329. See also physics/0103088.

Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.

Paninski, L. (2003a). Convergence properties of three spike-triggered analysis techniques. Network: Comput. Neural Syst., 14, 437–464.

Paninski, L. (2003b). Estimation of entropy and mutual information. Neural Computation, 15, 1191–1253.

Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network: Comput. Neural Syst., 7, 87–107.

Pola, G., Schultz, S. R., Petersen, R., & Panzeri, S. (2002). A practical guide to information analysis of spike trains. In R. Kotter (Ed.), Neuroscience databases: A practical guide (pp. 137–152). Norwell, MA: Kluwer.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Rieke, F., Bodnar, D. A., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proc. R. Soc. Lond. B, 262, 259–265.

Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Ringach, D. L., Hawken, M. J., & Shapley, R. (2002). Receptive field structure of neurons in monkey visual cortex revealed by stimulation with natural image sequences. Journal of Vision, 2, 12–24.

Ringach, D. L., Sapiro, G., & Shapley, R. (1997). A subspace reverse-correlation technique for the study of visual neurons. Vision Res., 37, 2455–2464.

Rolls, E. T., Aggelopoulos, N. C., & Zheng, F. (2003). The receptive fields of inferior temporal cortex neurons in natural scenes. J. Neurosci., 23, 339–348.

Ruderman, D. L. (1994). The statistics of natural images. Network: Comput. Neural Syst., 5, 517–548.

Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett., 73, 814–817.

Schwartz, O., Chichilnisky, E. J., & Simoncelli, E. (2002). Characterizing neural gain control using spike-triggered covariance. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (pp. 269–276). Cambridge, MA: MIT Press.
Sen, K., Theunissen, F. E., & Doupe, A. J. (2001). Feature analysis of natural sounds in the songbird auditory forebrain. J. Neurophysiol., 86, 1445–1458.

Simoncelli, E., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annu. Rev. Neurosci., 24, 1193–1216.

Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1996). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Smyth, D., Willmore, B., Baker, G. E., Thompson, I. D., & Tolhurst, D. J. (2003). The receptive-field organization of simple cells in primary visual cortex of ferrets under natural scene stimulation. J. Neurosci., 23, 4746–4759.

Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. J. Neurosci., 19, 8036–8042.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Theunissen, F. E., Sen, K., & Doupe, A. J. (2000). Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20, 2315–2331.

Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. J. Neurosci., 22, 10811–10818.

Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.

von der Twer, T., & Macleod, D. I. A. (2001). Optimal nonlinear codes for the perception of natural colours. Network: Comput. Neural Syst., 12, 395–407.

Vickers, N. J., Christensen, T. A., Baker, T., & Hildebrand, J. G. (2001). Odour-plume dynamics influence the brain's olfactory code. Nature, 410, 466–470.

Vinje, W. E., & Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, 1273–1276.

Vinje, W. E., & Gallant, J. L. (2002). Natural stimulation of the nonclassical receptive field increases information transmission efficiency in V1. J. Neurosci., 22, 2904–2915.

Voss, R. F., & Clarke, J. (1975). "1/f noise" in music and speech. Nature, 317–318.

Weliky, M., Fiser, J., Hunt, R., & Wagner, D. N. (2003). Coding of natural scenes in primary visual cortex. Neuron, 37, 703–718.

Received January 21, 2003; accepted August 6, 2003.
LETTER
Communicated by Laurence Abbott
Rapid Temporal Modulation of Synchrony by Competition in Cortical Interneuron Networks

P.H.E. Tiesinga
[email protected]
Department of Physics and Astronomy, University of North Carolina, Chapel Hill, NC 27599, U.S.A.

T.J. Sejnowski
[email protected]
Sloan-Swartz Center for Theoretical Neurobiology, Salk Institute, La Jolla, CA 92037, Computational Neurobiology Lab, Salk Institute, La Jolla, CA 92037, Howard Hughes Medical Institute, Salk Institute, La Jolla, CA 92037, and Department of Biology, University of California–San Diego, La Jolla, CA 92093, U.S.A.
The synchrony of neurons in extrastriate visual cortex is modulated by selective attention even when there are only small changes in firing rate (Fries, Reynolds, Rorie, & Desimone, 2001). We used Hodgkin-Huxley type models of cortical neurons to investigate the mechanism by which the degree of synchrony can be modulated independently of changes in firing rates. The synchrony of local networks of model cortical interneurons interacting through GABAA synapses was modulated on a fast timescale by selectively activating a fraction of the interneurons. The activated interneurons became rapidly synchronized and suppressed the activity of the other neurons in the network, but only if the network was in a restricted range of balanced synaptic background activity. During stronger background activity, the network did not synchronize, and for weaker background activity, the network synchronized but did not subsequently return to an asynchronous state. The inhibitory output of the network blocked the activity of pyramidal neurons during asynchronous network activity, and during synchronous network activity, it enhanced the impact of the stimulus-related activity of pyramidal cells on receiving cortical areas (Salinas & Sejnowski, 2001). Synchrony by competition provides a mechanism for controlling synchrony with minor alterations in rate, which could be useful for information processing. Because traditional methods such as cross-correlation and the spike field coherence require several hundred milliseconds of recordings and cannot measure rapid changes in the degree of synchrony, we introduced a new method to detect rapid changes in the degree of coincidence and precision of spike timing. Neural Computation 16, 251–275 (2004)
© 2003 Massachusetts Institute of Technology
1 Introduction

Selective attention greatly enhances the ability of the visual system to detect visual stimuli and to store and recall these stimuli. For instance, in change blindness, differences between two images shown after a brief delay are not visible unless the subject pays attention to the particular spatial location of the change (Simons, 2000; Rensink, 2000). The neural correlate of selective attention has recently been studied in macaque monkeys (Connor, Gallant, Preddie, & Van Essen, 1996; Connor, Preddie, Gallant, & Van Essen, 1997; Luck, Chelazzi, Hillyard, & Desimone, 1997; McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999; Reynolds, Pasternak, & Desimone, 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Moore & Armstrong, 2003). A key finding is that attention modulates both the mean firing rate of a neuron in response to a stimulus (McAdams & Maunsell, 1999; Reynolds et al., 2000) and the coherence of spiking with other neurons responsive to the stimulus (Fries et al., 2001). Increased synchrony can boost the impact of spikes on subsequent cortical areas to which these neurons project (Salinas & Sejnowski, 2000, 2001, 2002). These results suggest that mean activity and the degree of coherence may represent independent signals that can be independently controlled. It is unclear biophysically what mechanisms are responsible. For example, when excitatory networks are depolarized, there can be an increase in synchrony, but it is always associated with a large increase in the mean output rate (Aoyagi, Takekawa, & Fukai, 2003). Here we explore the hypothesis that interneuron networks are involved in modulating synchrony. The issue of how the interneuron network affects the pyramidal cells is a complex one and is not the subject of this article (but see section 4). We found that interneuron networks can dynamically modulate their synchrony on timescales of 100 ms with only moderate changes in the mean firing rate.
Synchrony arose by competition between two groups of interneurons and required a specific range of synaptic background activity. To analyze the temporal dynamics of synchrony modulation, we introduced time-resolved measures for the degree of coincidence and precision. The sensitivity of these measures is compared to the spike field coherence (Fries et al., 2001).

2 Methods

2.1 Network Model. The model consisted of a network of N interneurons connected all-to-all. Each neuron had a sodium current I_Na, a potassium current I_K, and a leak current I_L and was driven by a white noise current η with mean I_i and variance 2D. The equation for the membrane potential V reads:
$$C_m \frac{dV}{dt} = -I_{\mathrm{Na}} - I_{\mathrm{K}} - I_{\mathrm{L}} + \eta(t) + I_{\mathrm{syn}}. \tag{2.1}$$
Rapid Modulation of Synchrony
Here, C_m = 1 µF/cm² is the membrane capacitance (normalized by area), and I_syn is the synaptic input from other neurons in the network. For the first N_A neurons, the current was I_i = I_A + δI_i (i = 1, …, N_A), and for the remainder, it was I_i = I_B + δI_i (i = N_A + 1, …, N). Heterogeneity in neural properties was represented by δI_i, which was drawn from a gaussian distribution with standard deviation σ_I. I_A was increased from I to I + ΔI during a 500 ms interval between t = 1000 and 1500 ms; I_B was equal to I for the duration of the simulation. The neurons were connected by GABA_A synapses with unitary strength g_syn/N and time constant τ_syn = 10 ms. The model equations and implementation are exactly as described by Tiesinga and José (2000a) and are not repeated here. The model used here was adapted from that introduced by Wang and Buzsáki (1996). The standard set of parameters was N = 1000, N_A = 250, N_B = N − N_A = 750, I = 0.7 µA/cm², ΔI = 0.3 µA/cm², g_syn = 0.4 mS/cm², σ_I = 0.02 µA/cm², and D = 0.02 mV²/ms.

2.2 Statistical Analysis. Spike times were recorded when the membrane potential (V in equation 2.1) crossed 0 mV from below. t_ni is the ith spike time of the nth neuron; likewise, t_mj is the jth spike time of the mth neuron. N is the total number of neurons in the network, and N_s is the number of spikes (by all neurons) generated during a given simulation run. For some calculations, it is convenient to pool the spike times of all neurons into one set, {t_1, …, t_ν, …, t_{N_s}}, ordered from low to high values. The ordered set is indexed by ν, where ν = 1 is the earliest spike and ν = N_s is the latest. Most of the statistical quantities determined from the simulations were spike based. That is, for each measurement y_ν there is an associated spike time t_ν. For instance, y_ν could be an interspike interval, and t_ν could be the first spike time of the interval. The specific choices for y are given below.
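The stochastic drive in equation 2.1 can be integrated with a simple Euler–Maruyama scheme. In the sketch below, the Hodgkin–Huxley-type I_Na and I_K currents of the actual model are replaced by a leaky integrate-and-fire reset so that the example stays self-contained; the leak parameters, the −55 mV threshold, and the stronger default drive are illustrative choices of ours, not the paper's.

```python
import numpy as np

def simulate_neuron(I=1.2, D=0.02, T=1000.0, dt=0.01, seed=0):
    """Euler-Maruyama integration of a caricature of equation 2.1.

    The HH-type INa and IK currents are replaced here by a
    leaky integrate-and-fire reset (a simplification of ours); the leak
    parameters, the -55 mV threshold, and the default drive I are
    illustrative, not taken from the paper.  The noise increment is scaled
    so that eta has variance 2D, with D in mV^2/ms as in the text.
    """
    rng = np.random.default_rng(seed)
    Cm, gL, EL = 1.0, 0.1, -65.0          # muF/cm^2, mS/cm^2, mV
    V_th, V_reset = -55.0, -70.0
    n_steps = int(T / dt)
    noise = np.sqrt(2.0 * D * dt) * rng.standard_normal(n_steps)
    V = EL
    spikes = []
    for k in range(n_steps):
        # deterministic drift plus white-noise increment
        V += dt / Cm * (-gL * (V - EL) + I) + noise[k]
        if V >= V_th:                      # record spike time and reset
            spikes.append(k * dt)
            V = V_reset
    return np.array(spikes)
```

With these illustrative parameters, the neuron fires tonically at a few tens of hertz, which is the regime studied in section 3.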
The simulation runs were divided into three intervals: (I) the baseline state, 0 < t < 1000 ms; (II) the activated state, 1000 ms < t < 1500 ms, during which N_A of the neurons received a current step; and (III) the return to baseline, 1500 ms < t < T_max. Here T_max is the duration of the simulated trial. Each interval is divided into subintervals A and B: A is the transient during which the network adapts from the old state to the new state, and B is the steady state reached after the transient. For instance, IIA is the onset of the synchronous state, and IIB is the synchronous state itself. Note that the subdivision into A and B depends on network parameters such as I and D. Averages of y were determined for each of these intervals, using all the y_ν associated with a t_ν in the interval under consideration. Since the t_ν are ordered, this is a contiguous set, ν_b ≤ ν ≤ ν_e (b stands for begin and e stands for end):
$$\langle y \rangle = \frac{1}{\nu_e + 1 - \nu_b} \sum_{\nu=\nu_b}^{\nu_e} y_\nu, \qquad \langle t \rangle = \frac{1}{\nu_e + 1 - \nu_b} \sum_{\nu=\nu_b}^{\nu_e} t_\nu. \tag{2.2}$$
We also determined time-resolved averages, ⟨y⟩_k and ⟨t⟩_k, using a sliding window of length T_av that was translated along the time axis in increments of T_incr. The position of the sliding window is indicated by the index k, and the sum runs over all ν values with (k − 1)T_incr ≤ t_ν < (k − 1)T_incr + T_av. For convenience, the set of points (⟨t⟩_k, ⟨y⟩_k) connected by lines is denoted by y(t). T_av and T_incr are expressed in terms of n_av and n_incr via T_av = T_max/n_av and T_incr = (T_max − T_av)/n_incr. We used n_av = 30, n_incr = 90, and T_max = 3000 ms.

2.2.1 Standard Spike Train Statistics. The ith interspike interval (ISI) of the nth neuron is τ_ni = t_{n,i+1} − t_{n,i}. The mean ISI for neuron n is denoted by τ_n, and its average across all neurons is τ̄. The coefficient of variation (CV) of the ISIs is the standard deviation of the τ_ni across all neurons divided by τ̄. The interval τ_ni can be associated with three times: the starting spike time of the interval, t_ni; the end spike time, t_{n,i+1}; and the mean, (t_ni + t_{n,i+1})/2. For the sliding window average, we used t_ν = (t_ni + t_{n,i+1})/2. Each individual ISI was plotted in a scatter plot as a point with coordinates (t_ni, τ_ni). The mean firing rate f is defined as the number of spikes divided by the duration of the measurement interval. The mean firing rate was determined for each time interval (I–III) for all N network neurons together, as well as separately for the N_A activated neurons and the N_B nonactivated neurons.

2.2.2 Binned Spike Train Statistics. A binned representation X_n(t) of the spike train of the nth neuron is obtained by setting X_n(t) = 1 when there is a spike time between t − Δt/2 and t + Δt/2, and X_n(t) = 0 otherwise. We took Δt = 1 ms. The spike time histogram, STH = (1000/(N Δt)) X(t), is proportional to the sum of X_n over all neurons, X = Σ_{n=1}^{N} X_n (since Δt is expressed in ms, the factor 1000 is necessary for the STH to be in Hz).
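The interval averages of equation 2.2 and the sliding-window averages defined above can be sketched as follows (a minimal NumPy sketch; the function name is ours, and skipping empty windows is our choice, since the text does not specify their handling):

```python
import numpy as np

def sliding_window_average(t, y, T_max=3000.0, n_av=30, n_incr=90):
    """Time-resolved averages <y>_k and <t>_k (section 2.2): a window of
    length T_av = T_max/n_av is slid in steps of
    T_incr = (T_max - T_av)/n_incr, and the spike-based measurements
    falling inside each window are averaged."""
    t, y = np.asarray(t), np.asarray(y)
    T_av = T_max / n_av
    T_incr = (T_max - T_av) / n_incr
    t_avg, y_avg = [], []
    for k in range(1, n_incr + 2):  # windows k = 1, ..., n_incr + 1
        lo = (k - 1) * T_incr
        mask = (t >= lo) & (t < lo + T_av)
        if mask.any():
            t_avg.append(t[mask].mean())
            y_avg.append(y[mask].mean())
    return np.array(t_avg), np.array(y_avg)
```

For instance, passing ISI midpoints as t and ISI lengths as y yields the time-resolved mean ISI of Figure 2Bb.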
The local field potential (LFP) is estimated by filtering X with an exponential filter, exp(−t/τ)/τ, with timescale τ = 5 ms. The sliding window average, σ_LFP(t), of the standard deviation of the LFP was used as a synchrony measure. We also determined the normalized autocorrelation, AC(t) = ⟨(X(s) − X̄)(X(s − t) − X̄)⟩_s / ⟨(X(s) − X̄)²⟩_s, where ⟨·⟩_s denotes the time average and X̄ = ⟨X⟩_s.

2.2.3 Spectral Measures for Neural Synchrony. The coherence of spike trains with the underlying network oscillations was determined using the spike-triggered average (STA) (see Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Dayan & Abbott, 2001, for details). The STA was calculated by taking the LFP from the full network centered around the combined spike times of four neurons, two from the activated group and two from the nonactivated group. The STA power spectrum was calculated using the MATLAB routine psd with standard windowing and a 2,048-point fast Fourier transform. The spike field coherence (SFC) is the STA power spectrum divided by the power spectrum of the LFP (Fries et al., 2001). The SFC for the standard simulations was calculated based on 300 ms long LFP segments (IB: 700 ms ≤ t ≤ 1000 ms and IIB: 1200 ms ≤ t ≤ 1500 ms). This was necessary since synchrony was modulated on a fast timescale. Simulations were also performed for the situation where the baseline state and the activated state both lasted for 2000 ms. The SFC in that case was calculated based on 1500 ms long segments.

2.2.4 Event-Based Measures for Neural Synchrony. During synchronized network activity, the histogram X(t) contains peaks of high activity (events) during which many neurons spike at approximately the same time. Spike times t_ν were clustered into events using MATLAB's fuzzy k-means clustering algorithm, fcm (Ripley, 1996). For each cluster k, the algorithm returns the indices ν_k, ν_k + 1, …, ν_{k+1} − 1 of the spike times that are part of the cluster. The number of spike times N_k = ν_{k+1} − ν_k in event k and their standard deviation σ_k were determined. For low synchrony, peaks in the STH are broad and overlap; as a result, the clustering procedure could not reliably group the spikes into events.

2.2.5 Interspike Distance Synchrony Measures. The CV_P measure is based on the idea that during synchronous states, the minimum distance between spikes of different neurons is reduced compared with asynchronous states. The interspike interval of the combined set of network spikes is τ_ν = t_{ν+1} − t_ν. Note that these interspike intervals are between different neurons. The CV is

$$\mathrm{CV}_P = \frac{\sqrt{\langle \tau_\nu^2 \rangle_\nu - \langle \tau_\nu \rangle_\nu^2}}{\langle \tau_\nu \rangle_\nu},$$

where P stands for population and ⟨·⟩_ν denotes the average over all intervals. As with regular ISIs, the interval τ_ν can be identified with three times: t_ν, t_{ν+1}, and the mean (t_ν + t_{ν+1})/2. For the sliding window average, CV_P(t), the last was used. In an asynchronous network, each neuron fires independently at a constant rate.
The sum of the spike trains is then, to a good approximation, a Poisson spike train, with a coefficient of variation CV_P equal to one. For a fully synchronous network with N active neurons oscillating with period T, the set τ_ν consists of N_s/N − 1 intervals equal to T and N_s − N_s/N intervals equal to Δ → 0. Hence, in the limit Δ → 0,

$$\mathrm{CV}_P = \sqrt{\alpha - 1},$$

with α = (N_s − 1)/(N_s/N − 1); the fraction of vanishing intervals is β = (N_s/N)(N − 1)/(N_s − 1). For large N_s, this reduces to CV_P ≈ √N. Thus, (CV_P − 1)/√N is a measure of synchrony normalized between 0 and 1. CV_P is sensitive to the precision of the network discharge as well as to the degree of coincidence.
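A sketch of CV_P (the function name is ours): pool the spike times of all neurons, sort them, and take the CV of the resulting intervals. The two limiting cases quoted above can be checked on synthetic data:

```python
import numpy as np

def cv_pop(spike_trains):
    """CV_P (section 2.2.5): coefficient of variation of the interspike
    intervals of the pooled, time-ordered spike train of all neurons."""
    pooled = np.sort(np.concatenate([np.asarray(st) for st in spike_trains]))
    isi = np.diff(pooled)
    return isi.std() / isi.mean()

rng = np.random.default_rng(1)
# asynchronous case: 20 independent Poisson trains (exponential ISIs),
# truncated so that all trains cover the same observation interval
trains = [np.cumsum(rng.exponential(10.0, 600)) for _ in range(20)]
poisson = [t[t < 4000.0] for t in trains]
# fully synchronous case: N = 50 neurons firing together every T = 20 ms
sync = [np.arange(0.0, 2000.0, 20.0)] * 50
```

On this data, cv_pop(poisson) is close to 1 and cv_pop(sync) is close to √50 ≈ 7.07, matching the Poisson and fully synchronized limits derived above.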
The nearest-neighbor distance of spike t_ni from the spikes of neuron m is

$$\Delta_{mni} = \min_j |t_{ni} - t_{mj}|.$$
The minimum is taken over all spike times t_mj of neuron m. There are N_p = (N − 1)N_s different pairs. Note that each pair is counted twice, Δ_mni = Δ_nmj, because it is associated with t_ni as well as with t_mj. The coincidence factor κ is defined as the number of Δ_mni < P, divided by the maximum number of coincidences N_c. Here, P is a preset precision, typically equal to 2 ms. The number of Δ_mni < P cannot exceed N_p; hence, taking N_c = N_p = (N − 1)N_s would yield a number between 0 and 1 (note that N_s is now the total number of spikes in the time interval under consideration). However, when the minimum single-neuron interspike interval is larger than P, the number of coincidences in a pair of neurons (n, m) cannot be larger than the minimum of the numbers of spikes produced by neurons n and m, formally equal to min(Σ_t X_n(t), Σ_t X_m(t)). Hence, the normalization should be

$$N_c = \sum_{n \neq m} \min\Big(\sum_t X_n(t), \sum_t X_m(t)\Big). \tag{2.3}$$
During intervals I and III, when the network dynamics is homogeneous, N_p ≈ N_c (the difference is 10% or less). However, when there is a bimodal distribution of firing rates, this approximation is not appropriate. The sliding window average κ(t) was calculated based on Δ_mni, as described above. The density of coincidences was also determined at a finer temporal resolution: K(t) was the number of Δ_mni < P with t − Δt/2 ≤ t_ni ≤ t + Δt/2, normalized by N_p (N_c could not be estimated reliably for these small intervals). The κ used here is related to the quantity described in Wang and Buzsáki (1996), denoted here by κ_WB:

$$\kappa_{\mathrm{WB}} = \frac{1}{N(N-1)} \sum_{n \neq m} \frac{\sum_t X_n(t)\, X_m(t)}{\sqrt{\sum_t X_n(t) \sum_t X_m(t)}}, \tag{2.4}$$

where Σ_t denotes the sum over the discrete bins (of width Δt). The measure κ used here differs from κ_WB in three aspects:
1. κ_WB is based on discrete bins. Hence, it is possible that two spike times separated by less than Δt fall in different bins and are considered noncoincident (White, Chow, Ritt, Soto-Treviño, & Kopell, 1998).

2. The normalization of κ_WB does not work if the mean firing rates of the neurons are significantly different. In that case, κ_WB would be smaller than 1 even if all the spikes were coincident. This is because the maximum number of coincident spikes is min(Σ_t X_n(t), Σ_t X_m(t)) rather than the √(Σ_t X_n(t) Σ_t X_m(t)) used in equation 2.4. The normalization by N_c of κ(t) accounts for these effects.
3. The calculation of κ(t) is better suited to track fast changes in synchrony.

An estimate of the dispersion of spike times in an event can be obtained from Δ_mni. For a given spike time t_ni, we take the mean of the N_1 lowest values of Δ_mni. This value is averaged across all t_ni in an averaging time interval of length T_av to obtain Δ(t). When N_1 is somewhat smaller than the number of neurons firing on each cycle, Δ(t) is proportional to the dispersion in spike times. Typically it is slightly less than the dispersion obtained directly from clustering. The relationship between the value of Δ and the dispersion of the spike times on a cycle was studied using a set of N = 300 deviates of a gaussian probability distribution with a variance of 1. Δ was equal to 0.53 ± 0.02 for N_1 = 150, asymptoted to 1.13 ± 0.05 for N_1 = 300, and reached 1 for N_1 = 278 (errors are the standard deviation across 100 different realizations).

3 Results

3.1 Competition Leads to Dynamical Modulation of Synchrony. We study the dynamics of an all-to-all connected interneuron network. Key parameters are the mean level of depolarizing current I, the variance 2D of the white noise current, and neuronal heterogeneity, modeled as a dispersion σ_I in the driving current across neurons. These three parameters can account for changes occurring in vivo in the level of background synaptic activity and the effects of neurotransmitters and neuromodulators. Previous work shows that synchrony is stable only up to σ_I ≈ 0.10 µA/cm² (Wang & Buzsáki, 1996) and D ≈ 0.10 mV²/ms (Tiesinga & José, 2000a); for higher values, only the asynchronous state is stable. For moderate values of σ_I and D (below these critical values), synchrony is stable only for a large enough driving current I. Based on these results, it is expected that applying a current pulse to a network in a low-synchrony state will lead to increased synchrony. Representative simulation results are shown in Figure 1.
There were N = 1000 neurons in the network; the other parameters were D = 0.02 mV²/ms and σ_I = 0.02 µA/cm². Initially, the driving current was I = 0.7 µA/cm². In this baseline state, the mean firing rate per neuron was f = 8.18 Hz, with CV = 0.38. During the time interval between 1000 ms and 1500 ms, the drive to all neurons was increased to I = 1.0 µA/cm². The firing rate increased by approximately 50% to f = 12.25 Hz; the CV was 0.25 (see Figure 1A). There was little change in synchrony. In contrast, when the depolarizing current was increased for only 250 of the 1000 interneurons, a robust increase in synchrony was observed (see Figure 1B). During that period, the mean firing rate increased only weakly, from f = 8.18 Hz to f = 9.48 Hz (CV: 0.40 → 1.55). It is also possible to induce synchrony by increasing the driving current to all neurons in the network. However, the necessary increase in current is
Figure 1: Competition dynamically modulates synchrony. (A) Driving current I = 0.7 µA/cm² was increased to I = 1.0 µA/cm² during the time interval between 1000 ms and 1500 ms (indicated by the bar in B). (B) Driving current to N_A = 250 out of N = 1000 neurons was increased from I = 0.7 µA/cm² to I = 1.0 µA/cm²; the other neurons remained at baseline. Roman numerals in B are explained in the text. Parameters for these simulations were σ_I = 0.02 µA/cm², D = 0.02 mV²/ms, g_syn = 0.4 mS/cm², and τ_syn = 10 ms.
much higher than that used in Figure 1A, and the resulting firing rate increase is also much larger than with the competitive mechanism. In the remainder, we investigate the properties of the synchrony-by-competition mechanism.

3.2 Synchrony Modulation by Competition. We have labeled four distinct time intervals in the histogram shown in Figure 1B. The rastergram for all neurons and the histograms for the activated and nonactivated neurons are shown separately in Figure 2 for the time interval between 500 ms and 2000 ms. (IA) 0 < t < 500 ms: The membrane potential at t = 0 was drawn for each neuron independently from a uniform distribution between −70 mV and −50 mV. Initially, the network was partially synchronized due to the particular realization of the initial condition. The amount of synchrony decreased over the course of 500 ms. (IB) 500 < t < 1000 ms: During the baseline state, there was only weak synchrony. (II) 1000 < t < 1500 ms: A subset of N_A = 250 neurons was activated. Their firing rate increased, and they became synchronized within
200 ms. Initially, the N_B = 750 nonactivated neurons were completely suppressed by the low-synchrony activated neurons, but as the synchrony of the activated neurons increased, the nonactivated neurons resumed firing, albeit at a much lower rate. (IIIA) 1500 < t < 1700 ms: The additional drive to the activated neurons was reduced back to baseline. Their firing rate decreased, which released the nonactivated neurons from inhibition. Because their membrane potentials were similar, the network was transiently synchronized. Over the course of 200 ms, the network returned to the asynchronous state. During the baseline state, there were brief episodes of spontaneously enhanced synchrony (see below). The changes in network dynamics are mirrored in the distribution of interspike intervals (see Figure 2B). The interspike intervals (ISIs) are shown as dots in a scatter plot (see Figure 2Ba). The sliding window averages of the mean ISI and its coefficient of variation are shown in Figures 2Bb and 2Bc, respectively. In the baseline state, the ISI distribution was homogeneous, but it had a large dispersion (see Figure 2Ba). During the synchronized period, however, the distribution became bimodal: a set of short intervals with a small dispersion, corresponding to the activated neurons, and a set of long intervals with a large dispersion due to cycle skipping, associated with the nonactivated neurons (see Figure 2Ba). The sliding window average of the mean ISI was sensitive to the onset and offset of the bimodality associated with synchrony (see Figure 2Bb). Similarly, the CV of the intervals increased during the synchronized period because of the onset of bimodality in the ISI distribution (see Figure 2Bc). However, these time-resolved statistics are not a good measure of synchrony, since synchrony may increase without introducing bimodality in the ISIs.

3.3 Spectral Analysis of Synchrony.
In the experimental analysis of attentional modulation of synchrony, there was an increase in the coherence of multiunit spike trains with the local field potential (LFP) in response to attended stimuli (Fries et al., 2001). Here we used a filtered version of the spike time histogram (STH) as an estimate of the LFP, and we used the combined spike trains of four neurons (two activated, two nonactivated) as the multiunit spike train (see Figure 3A). The autocorrelation function AC(t) of the spike time histogram during a 300 ms long synchronized period showed strong periodic peaks, whereas during an equally long asynchronous period, the AC showed periodic modulations of much lower amplitude (see Figure 3B). Hence, the AC was sensitive to synchrony. The spike-triggered average (STA) of the LFP triggered on the multiunit spike train was also determined for the same period. The spike field coherence (SFC) is the power spectrum of the STA divided by that of the LFP. For the 300 ms segments used before, there was no clear difference in the SFC between the synchronized and asynchronous segments within a single trial (results not shown). The analysis was also performed on 1500 ms long segments from a longer simulation of the same network (see
Figure 2: Mechanism for synchrony modulation by competition. (A) Firing rate dynamics. (a) Rastergram showing the spike trains of 200 out of the 1000 neurons; the bottom 50 neurons were activated by the current pulse, and the top 150 neurons received baseline current throughout the simulation. Spike time histogram of (b) the 250 activated neurons and (c) the 750 nonactivated neurons. (B) Dynamics of the interspike interval (ISI) statistics. (a) Scatter plot: the y-coordinate is the length of the interval, and the x-coordinate is the starting spike time of the interval. Sliding window average of (b) the mean τ̄ and (c) the coefficient of variation CV of the interspike intervals. Data are from the simulation results shown in Figure 1B.
Figure 3: Spike field coherence (SFC) is increased during the synchronous episode. (A, top) The simulated local field potential (LFP) was calculated by convolving the spike time histogram with an exponential filter (time constant 5 ms). (A, bottom) Simulated multiunit recording consisting of the combined spike trains of two activated and two nonactivated neurons. (B) Autocorrelation of the spike time histogram during the synchronous period (solid line; IIB, 1200–1500 ms) and the asynchronous period (dotted line; IB, 700–1000 ms). (C, D) Spike-triggered average (STA) of the LFP triggered on the simulated multiunit recording and the resulting SFC during a 1500 ms long synchronous period (solid line) and an asynchronous period of equal length (dashed line). Data for A and B are from the simulation results shown in Figure 1B. Data in C and D are from a 6000 ms long simulation with the same parameters, except that the current pulse was applied during the time interval between 2000 and 4000 ms. The synchronous period used for analysis was 2500 to 4000 ms, and the asynchronous period was 4500 to 6000 ms.
Figures 3C and 3D). The SFC in the gamma frequency band (30–60 Hz) was two orders of magnitude larger for the synchronized segment than for the asynchronous segment. For our simulation data, the SFC is not a robust measure of changes in synchrony on a 100 ms timescale, but it does work well when longer time intervals are available for averaging.
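The LFP estimate and the SFC computation used in this section can be sketched as follows (plain NumPy periodograms stand in for MATLAB's psd; the segment lengths and edge handling are our simplifications):

```python
import numpy as np

def lfp_estimate(spike_trains, t_max, dt=1.0, tau=5.0):
    """LFP proxy (section 2.2.2): population spike histogram X(t) filtered
    with an exponential kernel exp(-t/tau)/tau, tau = 5 ms."""
    edges = np.arange(0.0, t_max + dt, dt)
    X = np.zeros(len(edges) - 1)
    for st in spike_trains:
        X += np.histogram(st, bins=edges)[0]
    t = np.arange(0.0, 10.0 * tau, dt)
    kernel = np.exp(-t / tau) / tau
    return np.convolve(X, kernel)[:len(X)] * dt

def spike_field_coherence(lfp, spikes, dt=1.0, half=128):
    """SFC sketch: power spectrum of the spike-triggered LFP average
    divided by the LFP power spectrum.  Simple periodograms replace the
    windowed psd of the text; segment choices are illustrative."""
    idx = (np.asarray(spikes) / dt).astype(int)
    segs = [lfp[i - half:i + half] for i in idx
            if half <= i < len(lfp) - half]
    sta = np.mean(segs, axis=0)
    p_sta = np.abs(np.fft.rfft(sta - sta.mean())) ** 2
    seg = lfp[:2 * half]
    p_lfp = np.abs(np.fft.rfft(seg - seg.mean())) ** 2
    freqs = np.fft.rfftfreq(2 * half, d=dt / 1000.0)  # in Hz
    return freqs, p_sta / np.maximum(p_lfp, 1e-12)
```

As in the text, the SFC of a strongly synchronized population shows a pronounced peak at the network oscillation frequency, while an asynchronous population yields a flat, low SFC.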
3.4 Spike-Distance-Based Synchrony Measures. During the synchronous state, many neurons fire at similar times; hence, the distance between spikes of different neurons is reduced compared with the asynchronous state. We calculate two synchrony measures using this fact (see section 2). The coefficient of variation CV_P is obtained by pooling the spike times of all neurons, ordering them from low to high values, and then calculating the coefficient of variation of the resulting interspike intervals (see section 2). For an asynchronous population, the combined spike train forms a Poisson process with CV_P = 1. For a synchronized network, there will be many small intervals between spikes within a peak and a few long intervals between the first spike of the next peak and the last spike of the preceding peak, yielding CV_P ∼ √N (here, N is the number of neurons that fire on a given cycle). Note that CV_P is different from the CV shown in Figure 2Bc: the latter is based on the interspike intervals between spikes of the same neuron. The sliding window average of CV_P could resolve fast changes in neural synchrony and was sensitive to the transient synchrony at the start of the simulation and to an episode of spontaneous synchrony at t ≈ 2800 ms (the circle in Figure 4A). The oscillations in the LFP also increased during synchrony. Hence, the sliding window average σ_LFP of the standard deviation of the LFP was also sensitive to changes in synchrony (see Figure 4B). There are two different aspects of synchrony: the number of coincidences (the number of neurons firing at approximately the same time) and the temporal dispersion of the spike times. CV_P and σ_LFP confound these aspects. For CV_P, only the distance between a given spike and the closest other spike time is taken into account. A new time-resolved coincidence measure that extends CV_P takes into account the nearest spike time of each of the other neurons separately.
Hence, for each spike t_ni (the ith spike of the nth neuron), there are N − 1 nearest distances Δ_mni (m = 1, …, n − 1, n + 1, …, N; see section 2). We determined the set of Δ_mni smaller than P = 2 ms. The density K(t) of coincident pairs and the sliding window average κ(t) are shown in Figures 4C and 4D, respectively. κ(t) was normalized by the number of expected coincidences, as explained in section 2. Note that when κ(t) was normalized by the total number of pairs (N_p; see section 2), it significantly underestimated the number of coincidences during interval II (filled circles versus solid line in Figure 4D). An estimate of the jitter Δ(t) (the inverse of the precision) was obtained by calculating the mean distance of the N_1 = 100 nearest spike times to a given spike time and averaging over all spike times (see Figure 4E). For comparison, the number of coincidences and the dispersion of spike times were also calculated using a clustering algorithm. During the synchronous state, peaks of elevated firing rate occur in the STH (see Figure 5A). The spikes that are part of a peak were determined using a fuzzy k-means clustering algorithm. The number of peaks during a particular time interval was determined by visual inspection and supplied as a parameter to the algorithm. The number N_k of neurons firing on a particular cycle and
Figure 4: Spike distance measures can track synchrony modulation on fast timescales. (A) Sliding window average of the coefficient of variation (CV_P) of the interspike intervals between all the network neurons. The arrow points to the decay of transient synchrony due to the initial condition, and the circle highlights an episode of spontaneously elevated synchrony. (B) Sliding window average σ_LFP of the standard deviation of the local field potential (LFP). The number of spike pairs with a distance Δ_mni smaller than 2 ms, plotted as (C) a histogram K(t) (bin width 1 ms) and (D) a sliding window average κ(t). The small circles in D show κ normalized by N_c, and the solid line shows κ normalized by N_p. (E) The sliding window average Δ(t) of the mean distance of the N_1 = 100 closest spike times to a given spike time. The analysis in A and B used all the spike data, whereas the analysis in C and E was based on the spike trains of 125 activated neurons and 375 nonactivated neurons. Data are from the simulation results shown in Figure 1B.
their temporal dispersion σ_k were calculated and plotted as a function of the centroid of the peak (see Figure 5B). After the onset of the current pulse, the jitter σ_k decreased over time, whereas the number of neurons N_k firing per cycle remained approximately constant. Likewise, when the current pulse was turned off, the jitter increased, and it did so at a much faster rate. Changes in synchrony can thus be resolved on a cycle-by-cycle basis using the jitter. The temporal modulation of Δ(t) (see Figure 4E) closely mimics that of σ_k, shown in Figure 5B.
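The event clustering used here can be sketched with a plain one-dimensional k-means in place of MATLAB's fuzzy c-means routine fcm (the substitution, the function name, and the evenly spaced initialization are ours):

```python
import numpy as np

def cluster_events(spike_times, n_events, n_iter=50):
    """Group pooled spike times into oscillation-cycle events and return
    the event centroids, spike counts N_k, and dispersions sigma_k.
    Plain 1-D k-means (Lloyd's algorithm) stands in for the fuzzy
    c-means used in the text; n_events is supplied by the user, as the
    number of peaks is in the paper."""
    t = np.sort(np.asarray(spike_times, dtype=float))
    centers = np.linspace(t[0], t[-1], n_events)
    for _ in range(n_iter):
        # assign each spike to the nearest event centroid, then update
        labels = np.argmin(np.abs(t[:, None] - centers[None, :]), axis=1)
        for k in range(n_events):
            if np.any(labels == k):
                centers[k] = t[labels == k].mean()
    Nk = np.array([np.sum(labels == k) for k in range(n_events)])
    sigma_k = np.array([t[labels == k].std() if Nk[k] else 0.0
                        for k in range(n_events)])
    return centers, Nk, sigma_k
```

Applied to the network spike times, Nk plays the role of the per-cycle spike count and sigma_k the per-cycle jitter plotted in Figure 5B.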
Figure 5: Clustering of spike times during the onset of the synchronized period. Each event corresponds to a cycle of the network oscillation. (A) Spike time histogram with the odd-numbered clusters shown in black and the even-numbered clusters in white. (B, left-hand scale) The standard deviation σ_k of the spike times and (B, right-hand scale) the number N_k of neurons that spiked during a cycle, plotted versus the mean spike time during the cycle. The oscillation period between 1200 ms and 1500 ms was 27.5 ms. Data are from the simulation results shown in Figure 1B.
3.5 Synaptic Background Activity Is Necessary for Dynamic Changes in Synchrony. High synchrony occurs when the network discharge has a jitter (σ_k or Δ) of less than 5 ms; in low synchrony, the network discharges with strong temporal modulation of the firing rate but with a jitter of more than 5 ms. An asynchronous network discharge has no significant firing rate modulation. The neurons in the network received incoherent background activity, which was studied by systematically varying the variance D of a balanced white noise current. Increasing the variability of the input current without increasing its mean corresponds to a larger value of D (Tiesinga, José, & Sejnowski, 2000; Chance, Abbott, & Reyes, 2002). For each value of D, a current pulse of ΔI = 0.3 µA/cm² was applied to 250 out of 1000 neurons during the time interval between 1000 ms and 1500 ms. As before, the simulation was run for 3000 ms and divided into
three intervals, I–III. During the first (I) and third (III) intervals, the network was in its baseline state, and during the second (II) interval, it was in its activated state. The spike time histograms are shown in Figure 6A. For D = 0 mV²/ms, the network synchronized during the first interval and remained synchronized throughout the second and third intervals. During the second interval, the network split into two populations with different firing rates (f = 38 Hz for the activated neurons, whereas the nonactivated population spiked on only 1 in 13 cycles of the network oscillation). Because there was neural heterogeneity, σ_I = 0.02 µA/cm², the jitter σ_k was nonzero, but CV_P was close to its fully synchronized value, √N. For D = 0.0004 mV²/ms, network synchrony was stable in the baseline state: starting from a synchronized state, the network remained synchronized, and starting from an asynchronous state, it became synchronized over time. During the first interval, the network synchronized; synchrony increased further during the second interval, and in the third interval the degree of synchrony continued to increase. At the start of the third interval, the nonactivated neurons all had a similar voltage. Once these neurons were released from the inhibition projected by the activated neurons, they immediately synchronized into a two-cluster state (see below). The mean firing rate was doubled compared with that during the first interval. For D = 0.0084 mV²/ms, the network converged to a low-synchrony state during the first interval, switched to a high-synchrony state during the second interval, and returned slowly to the low-synchrony state in the third interval. The dynamics for D = 0.0164 mV²/ms was similar, except that the return to the low-synchrony state during the third interval was faster. The same holds for D = 0.0284 mV²/ms, except that the degree of synchrony reached during the second interval was reduced.
For D = 0.0364 mV²/ms, the network was in an asynchronous state during the first and third intervals, and the current pulse during the second interval was unable to drive the network to a synchronous state. The network dynamics for different values of D is also reflected in the time-resolved coincidence measure κ(t) and jitter Δ(t). The state with dynamic modulation of synchrony is characterized by low values of κ(t) and high values of Δ(t) during the first and third intervals and high values of κ(t) and low values of Δ(t) during the second interval. This is the case only for the curves with D = 0.0084 mV²/ms and D = 0.0204 mV²/ms (see Figure 6B). In Figure 7, the mean firing rate and the synchrony measures CV_P, κ, and Δ are plotted versus D for the intervals IB, II, and IIIB. For D < 0.005 mV²/ms, there was multistability. Depending on the distribution of initial voltages across the neurons, the network converged to states with two, three, or more clusters present (these clusters should not be confused with those obtained
Figure 6: Synaptic background activity is required for fast synchrony modulation. (A) Five spike time histograms with D increasing from top to bottom. The value of D is indicated in the top left corner of each graph. (B) Sliding window average of (a) the number κ(t) of coincidences and (b) the estimate Δ(t) of spike time dispersion for D = (1) 0.0004, (2) 0.0044, (3) 0.0084, (4) 0.0204, and (5) 0.0364 (D values are in mV²/ms). A current pulse ΔI = 0.3 µA/cm² was applied to 250 of the 1000 neurons during the time interval between t = 1000 ms and 1500 ms (indicated by the bar in each graph). Network parameters are the same as in Figure 1B except that D is varied.
Rapid Modulation of Synchrony
Figure 7: Optimal level of synaptic background activity required for synchrony modulation. (A) The mean firing rate f, (B) CV_P, (C) κ, and (D) Δ versus noise strength D for three time intervals: (IB, circles) 500–1000 ms, (II, squares) 1000–1500 ms, and (IIIB, diamonds) 2000–3000 ms. The inset in B is a close-up. Network parameters are the same as in Figure 1B except that D is varied.
in the preceding section using a clustering procedure). For instance, in a two-cluster state, neurons fired once every two cycles, yielding a firing rate of approximately 20 Hz. Each neuron was in the cluster that was active on either even cycles or odd cycles. The dynamics of states with a higher number of clusters is similar (Kopell & LeMasson, 1994). The number of clusters in IB depended on the initial condition. f, CV_P, κ, and Δ had different values depending on the number of clusters. The graphs of these quantities versus D would have a ragged appearance due to the sensitivity to initial conditions. Hence, we did not plot them in Figure 7 for small D values in intervals I and II. However, after the application of the current pulse during the second interval, the network settled into the most synchronous state with the highest firing rate. As a result, the network state reached in IIIB did not depend on initial conditions. For higher D values, these synchronized states became unstable, and the firing rate as well as the values of CV_P and κ in the third interval dropped precipitously (see Figure 7, arrow). In summary, synchrony can be modulated in time only for a finite region of background activity consistent with D values around D = 0.02 mV²/ms. For weaker background activity, it is possible to increase synchrony, but it cannot be shut off. Furthermore, the firing rate during synchrony is much higher compared with that for lower synchrony. For stronger background activity, synchrony cannot be established.

4 Discussion

4.1 Measuring Synchrony. Physiological correlates of attention have been observed in the response of cortical neurons in macaque monkeys to either oriented gratings (area V4) or random dot patterns moving in different directions (area MT) (McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999). When attention is directed into the receptive field of the recorded neuron, the gain of the firing rate response may increase (McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999), as well as the coherence of the neuron's discharge with the local field potential (Fries et al., 2001). The coherence was quantified using the SFC, the power spectrum of the spike-triggered average divided by that of the local field potential. The experimental recordings were 500 ms to 5000 ms in duration—long enough to get statistical differences with attention focused either into the receptive field or outside the receptive field.
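The SFC computation described above can be sketched in a few lines; a minimal version, assuming an evenly sampled LFP and spike times given as sample indices (the function name, window half-length, and sampling rate are illustrative choices, not the exact settings used in the experimental studies):

```python
import numpy as np

def spike_field_coherence(lfp, spike_idx, half=64, fs=1000.0):
    """SFC sketch: power spectrum of the spike-triggered average (STA)
    divided by the mean power spectrum of the spike-centered LFP segments."""
    segs = np.array([lfp[i - half:i + half] for i in spike_idx
                     if half <= i < len(lfp) - half])
    sta = segs.mean(axis=0)                                   # spike-triggered average
    p_sta = np.abs(np.fft.rfft(sta)) ** 2                     # power spectrum of the STA
    p_lfp = (np.abs(np.fft.rfft(segs, axis=1)) ** 2).mean(axis=0)
    freqs = np.fft.rfftfreq(2 * half, d=1.0 / fs)
    return freqs, p_sta / np.maximum(p_lfp, 1e-20)            # guard against 0/0
```

For spikes perfectly phase-locked to an oscillatory LFP, the ratio approaches 1 at the oscillation frequency, while unlocked spiking averages the STA toward zero there.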
Here we study how synchrony can be modulated dynamically, as it would be during normal behavior. The speed with which attention can be covertly shifted is quite rapid—in the range of a few hundred ms (Shi & Sperling, 2002). Attentional shifts caused by sudden changes in stimulus properties, such as brightness and motion, can occur in as little as 100 ms. For our simulations, the SFC across a single trial could not resolve rapid changes in synchrony that were obvious from visual inspection of the spike time histogram. For oscillatory synchrony in the gamma frequency range, the fastest timescale on which synchrony can change is the cycle length. Event-based methods could resolve changes in synchrony on these timescales (see Figure 5). We introduced two nonevent-based measures that could track the temporal dynamics of synchrony on timescales of about 100 ms: the sliding window average of the standard deviation of the LFP and CV_P. These measures were able to resolve the changes in synchrony that were visible in
the spike time histogram. These measures confound two different aspects of synchrony: the number of coincident spikes per event and their temporal dispersion, corresponding to N_k and σ_k, respectively, in the event-based analysis. Either increasing the number of spikes or decreasing their temporal dispersion will lead to increases in the value of σ_LFP or CV_P. However, the distribution of the pairwise distances between spikes used to define κ and Δ distinguishes between changes in the number of coincident neurons and the temporal dispersion. There were two parameter values—P, the precision of the coincidences, and N_Δ, the number of pairwise distances—in the average for Δ. For κ, all N_p = (N − 1)N_s ∼ N² f T_max pairwise distances need to be determined. The computational load increases quadratically with network size N and linearly with firing rate f and the length T_max of the measurement interval; hence, it is less efficient than CV_P. κ(t) is a time-resolved version of the quantity used by Wang and Buzsáki (1996). We adapted κ so that it was correctly normalized for networks with a bimodal distribution of firing rates. In summary, CV_P is a fast and easy statistic to quantify synchrony modulations, but κ(t) and Δ(t) delineate changes in temporal dispersion from changes in the number of coincident spikes.

4.2 Modulating Synchrony.
The synchronization properties of model networks of interneurons have recently been intensively investigated (Traub, Whittington, Colling, Buzsáki, & Jeffreys, 1996; Wang & Buzsáki, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Chow, 1998; Ermentrout & Kopell, 1998; Brunel & Hakim, 1999; Kopell, Ermentrout, Whittington, & Traub, 2000; White, Banks, Pearce, & Kopell, 2000; Tiesinga & José, 2000a; Hansel & Golomb, 2000; Bressloff & Coombes, 2000; Gerstner, 2000; Tiesinga, Fellous, José, & Sejnowski, 2001; Aradi, Santhakumar, Chen, & Soltesz, 2002; Bartos et al., 2002; Borgers & Kopell, 2003; Olufsen, Whittington, Camperi, & Kopell, 2003; Hansel & Mato, 2003; Fransen, 2003). The degree of synchrony, and its robustness against noise (D) and heterogeneity (σ_I), depends primarily on the synaptic coupling strength g_syn and degree of depolarization I (Wang & Buzsáki, 1996; White et al., 1998; Tiesinga & José, 2000a). For I = 0.7 µA/cm² and g_syn = 0.4 mS/cm², the network studied here is asynchronous (with D = 0.02 mV²/ms and σ_I = 0.02 µA/cm²). During the application of a current pulse to the N_A network neurons, the other N_B neurons were almost completely suppressed. Hence, the full network behaves approximately as a network of N_A neurons with effective synaptic coupling g_eff = (N_A/N)g_syn = 0.1 mS/cm² and driving current I = 1.0 µA/cm². For these parameters, the network is synchronous (Wang & Buzsáki, 1996; Tiesinga & José, 2000a). Synchrony by competition is effective when the baseline state is asynchronous and reached quickly from a synchronized network state (corresponding to an initial condition with all neurons having a similar membrane potential). The activated state should be synchronous and should be established quickly from an asynchronous state (corresponding to an initial condition with a wide dispersion of membrane potential across different neurons). In a previous study (Tiesinga & José, 2000a), the focus was on the degree of synchronization in the asymptotic state of the network. Here, we study a network with competition between otherwise identical neurons, and the focus is on the speed with which the network can switch between synchronous and asynchronous states. The strength of the white noise current D, corresponding to the level of balanced synaptic background activity, was critical. In the noiseless network (D = 0), there are multiple stable cluster states. For the baseline state, we found states with two, three, or four approximately equal-sized clusters that were reached from the set of random initial conditions used in the simulations (Golomb & Rinzel, 1993, 1994; Kopell & LeMasson, 1994; van Vreeswijk, 1996; Tiesinga & José, 2000b). The cluster states were stable against weak noise. The transition to the activated state could switch the baseline state from one cluster state to another, but it would not affect the degree of synchrony. The firing rate did, however, vary strongly. For strong noise, neither the baseline nor the activated state was synchronized. Only for intermediate noise strengths could the degree of synchrony be modulated. In this regime, the degree of synchrony reached during the activated state decreased with increasing D, but the transition from the synchronized state to the asynchronous state during interval IIIA was sped up. For what parameter values can synchrony by competition be obtained? We varied the value of the amplitude of the current pulse, ΔI; the mean of the common driving current, I; the number of activated neurons, N_A; the total number of neurons in the network, N; and the strength of synaptic coupling, g_syn (data not shown).
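The effective-parameter arithmetic quoted above, together with the constant-mean-activity condition f_1/f_2 ≈ N_A/N discussed in the text, can be checked numerically (all values are taken from the text except the illustrative choice of f_2):

```python
# Values from the text: N = 1000 neurons, N_A = 250 receive the pulse,
# g_syn = 0.4 mS/cm^2; the suppressed network behaves like N_A neurons
# with effective coupling g_eff = (N_A / N) * g_syn.
N, N_A, g_syn = 1000, 250, 0.4
g_eff = (N_A / N) * g_syn          # -> 0.1 mS/cm^2, as quoted in the text

# Constant mean activity requires f1 / f2 = N_A / N; with an activated
# rate f2 near the 40 Hz gamma cycle (illustrative), the baseline rate is
f2 = 40.0                          # Hz
f1 = f2 * N_A / N                  # -> 10 Hz
```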
For large networks, N > 100, with all-to-all coupling, the network state did not depend on N (Tiesinga & José, 2000a). Let us denote the firing rate of the baseline network by f_1 and that of the N_A neurons in the activated network by f_2. f_2 should be below but within 10% of the oscillation frequency (40 Hz for the gamma frequency range). Mean activity was constant when f_1/f_2 ≈ N_A/N. Networks with N_A > 100 robustly synchronize (Wang & Buzsáki, 1996), and we found that N/N_A > 2 works best. The values of I, g_syn, and ΔI are less critical: I and g_syn should lead to a firing rate f_1 in the appropriate range, whereas I + ΔI and (N_A/N)g_syn should lead to a synchronous state with a firing rate equal to f_2. This is possible since the higher firing rate state is usually more synchronous (Tiesinga & José, 2000a). The specific value of ΔI determines how fast synchrony is attained in interval II. These parameter values are specific to the model used in our investigation. This raises the issue whether and to what extent the results reported here generalize to other models. When the maximum conductance of the sodium and potassium currents in the model were varied by less than 10%,
we obtained similar results to those reported here. Hence, there is robustness against small changes in model neuron parameters. The key requirements are that the model network has a synchronous state in the gamma frequency range and an asynchronous state and that the degree of synchrony depends on the coupling strength and driving current. These properties hold across different Hodgkin-Huxley-type models (Wang & Buzsáki, 1996; White et al., 1998) and for leaky integrate-and-fire model neurons (Brunel & Hakim, 1999; Hansel & Golomb, 2000). This suggests that synchrony modulation by competition will also be present in these models. The firing rate of neurons recorded in vitro adapts in response to a sustained tonic depolarizing current: at the onset of current injection, neurons fire at a high rate, but the rate decreases and saturates at a lower value (McCormick, Connors, Lighthall, & Prince, 1985; Shepherd, 1998). The role of adaptation in synchronous oscillations has been studied in models (van Vreeswijk & Hansel, 2001; Fuhrman, Markram, & Tsodyks, 2002). We expect that synchrony modulation by competition also works in the presence of adaptation, but that the temporal dynamics of the transition between the activated and nonactivated states may be different. Further investigation of this issue is needed.

4.3 Functional Consequences of Temporal Modulation of Synchrony. The driving hypothesis of this investigation is that attention modulates the synchrony of cortical interneuron networks. We have shown that inhibitory networks can modulate their synchrony on timescales of the order of 100 ms without increasing their mean activity. The question is, How do these synchrony modulations relate to the effects of attention on putative pyramidal neurons recorded in vivo (Connor et al., 1996, 1997; Luck et al., 1997; McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999; Reynolds et al., 2000; Fries et al., 2001)?
Attentional modulation can be studied in models using the following conceptual framework. The neuron is thought to receive two sets of inputs: predominantly excitatory, stimulus-related inputs from upstream cortical areas and modulatory synaptic inputs (Chance et al., 2002). Some of the modulatory inputs are from local interneuron networks that project to the principal output neurons in cortical layer 2/3 (Galarreta & Hestrin, 2001). The response to stimulus-related inputs can be characterized using the firing rate versus input (f-I) curve. The f-I curve is determined by measuring the firing rate while the amplitude I of the tonic depolarizing current is systematically varied (sometimes the firing rate of presynaptic excitatory afferents is varied instead). Modulatory inputs alter the f-I curve, leading to a different firing rate in response to the same input. The modulatory changes fall in two categories: leftward shifts of the f-I curve, leading to higher sensitivity as the neuron can respond to weaker inputs, and gain changes, where the firing rate response to any input is multiplied by a constant factor. The attentional modulation of firing rate tuning curves observed
in McAdams and Maunsell (1999) and Treue and Martinez Trujillo (1999) can be explained as a multiplicative gain change of the f-I curve. Recent investigations reveal that the gain of the f-I curve is modulated by the amount of tonic inhibition that a neuron receives (Prescott & De Koninck, 2003; Mitchell & Silver, 2003). However, measurements of the LFP (Fries et al., 2001) indicate that neurons receive temporally patterned inputs in the gamma frequency range. The spike time coherence in the gamma-frequency range increased with attention (Fries et al., 2001). Our hypothesis is that the modulations in gamma-frequency coherence are due to synchrony modulation of local interneuron networks. We studied the f-I curves and the coherence with the LFP of model neurons receiving synchronous inhibition in the gamma-frequency range (José, Tiesinga, Fellous, Salinas, & Sejnowski, 2002). The results are summarized here; full details will be reported elsewhere. We found that changing the degree of synchrony, quantified as the jitter (σ_k in Figure 5), led to gain changes as well as shifts. The mean number of inputs per oscillation cycle (N_k in Figure 5) determined whether the change in gain dominated the shift or vice versa. For N_k > 50, the modulation of the f-I curve was shift dominated. In that case, the firing rate of output neurons could saturate for strong inputs, and an increase in synchrony led to enhanced coherence with the LFP but not to an increase in firing rate. For weak inputs, the input synchrony acted as a gate. For low-synchrony inhibitory inputs, the receiving cortical neuron was shut down, whereas for high synchrony, it could transmit stimulus-related information in its output. For N_k ≈ 10, there were large changes in the gain of the f-I curve that were consistent with the gain modulation of orientation tuning curves reported in McAdams and Maunsell (1999).
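The two categories of f-I modulation, a leftward shift versus a multiplicative gain change, can be made concrete with a toy threshold-linear curve (all parameter values are illustrative, not fitted to the models discussed here):

```python
import numpy as np

def f_I(I, threshold=0.3, gain=1.0, slope=50.0):
    """Toy threshold-linear f-I curve (Hz); parameters are illustrative."""
    return gain * slope * np.maximum(I - threshold, 0.0)

I = np.linspace(0.0, 1.0, 101)
baseline = f_I(I)
shifted = f_I(I, threshold=0.2)   # leftward shift: responds to weaker input
gained = f_I(I, gain=1.5)         # multiplicative gain change
```

The shift raises sensitivity (nonzero output for inputs below the original threshold), whereas the gain change scales the response to every suprathreshold input by the same factor.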
The modulation of firing rate with input synchrony depended on the oscillation frequency of the interneuron network. We found that the strongest modulation was obtained with oscillations in the gamma-frequency range. To summarize, synchrony by competition in the gamma-frequency range can selectively gate cortical information flow at approximately constant metabolic expense in the interneuron networks. Further investigation is necessary to determine what type of regulation can be performed using this mechanism.

References

Aoyagi, T., Takekawa, T., & Fukai, T. (2003). Gamma rhythmic bursts: Coherence control in networks of cortical pyramidal neurons. Neural Computation, 15, 1035–1062.
Aradi, I., Santhakumar, V., Chen, K., & Soltesz, I. (2002). Postsynaptic effects of GABAergic synaptic diversity: Regulation of neuronal excitability by changes in IPSC variance. Neuropharmacology, 43, 511–522.
Bartos, M., Vida, I., Frotscher, M., Meyer, A., Monyer, H., Geiger, J., & Jonas, P. (2002). Fast synaptic inhibition promotes synchronized gamma oscillations in hippocampal interneuron networks. Proc. Natl. Acad. Sci., 99, 13222–13227.
Borgers, C., & Kopell, N. (2003). Synchronization in networks of excitatory and inhibitory neurons with sparse random connectivity. Neural Computation, 15, 509–538.
Bressloff, P., & Coombes, S. (2000). Dynamics of strongly-coupled spiking neurons. Neural Computation, 12, 91–129.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Chow, C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Connor, C., Gallant, J., Preddie, D., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. J. Neurophysiol., 75, 1306–1308.
Connor, C., Preddie, D., Gallant, J., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. J. Neurosci., 17, 3201–3214.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Ermentrout, G., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci., 95, 1259–1264.
Fransen, E. (2003). Coexistence of synchronized oscillatory and desynchronized rate activity in cortical networks. Neurocomputing, 52–54, 763–769.
Fries, P., Reynolds, J., Rorie, A., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Fuhrman, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophys., 88, 761–770.
Galarreta, M., & Hestrin, S. (2001). Electrical synapses between GABA-releasing interneurons. Nat. Rev. Neurosci., 2, 425–433.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Hansel, D., & Golomb, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Computation, 12, 1095–1139.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Computation, 15, 1–56.
José, J., Tiesinga, P., Fellous, J., Salinas, E., & Sejnowski, T. (2002). Is attentional gain modulation optimal at gamma frequencies? Society of Neuroscience Abstract, 28, 55.6.
Kopell, N., Ermentrout, G., Whittington, M., & Traub, R. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci., 97, 1867–1872.
Kopell, N., & LeMasson, G. (1994). Rhythmogenesis, amplitude modulation, and multiplexing in a cortical architecture. Proc. Natl. Acad. Sci., 91, 10586–10590.
Luck, S., Chelazzi, L., Hillyard, S., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophys., 77, 24–42.
McAdams, C., & Maunsell, J. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. J. Neurosci., 19, 431–441.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophys., 54, 782–806.
Mitchell, S., & Silver, R. (2003). Shunting inhibition modulates neuronal gain during synaptic excitation. Neuron, 38, 433–445.
Moore, T., & Armstrong, K. (2003). Selective gating of visual signals by microstimulation of frontal cortex. Nature, 421, 370–373.
Olufsen, M., Whittington, M., Camperi, M., & Kopell, N. (2003). New roles for the gamma rhythm: Population tuning and preprocessing for the beta rhythm. J. Comput. Neurosci., 14, 33–54.
Prescott, S., & De Koninck, Y. (2003). Gain control of firing rate by shunting inhibition: Roles of synaptic noise and dendritic saturation. Proc. Natl. Acad. Sci., 100, 2071–2081.
Rensink, R. (2000). The dynamic representation of scenes. Visual Cognition, 7, 17–42.
Reynolds, J., Pasternak, T., & Desimone, R. (2000). Attention increases sensitivity of V4 neurons. Neuron, 26, 703–714.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Salinas, E., & Sejnowski, T. (2000). Impact of correlated synaptic input on output variability in simple neuronal models. J. Neurosci., 20, 6193–6209.
Salinas, E., & Sejnowski, T. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Salinas, E., & Sejnowski, T. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14, 2111–2155.
Shepherd, G. (1998). Synaptic organization of the brain (4th ed.). New York: Oxford University Press.
Shi, S., & Sperling, G. (2002). Measuring and modeling the trajectory of visual spatial attention. Psychological Review, 109, 260–305.
Simons, D. (2000). Current approaches to change blindness. Visual Cognition, 7, 1–15.
Tiesinga, P., Fellous, J.-M., José, J., & Sejnowski, T. (2001). Computational model of carbachol-induced delta, theta and gamma oscillations in the hippocampus. Hippocampus, 11, 251–274.
Tiesinga, P., & José, J. (2000a). Robust gamma oscillations in networks of inhibitory hippocampal interneurons. Network, 11, 1–23.
Tiesinga, P., & José, J. (2000b). Synchronous clusters in a noisy inhibitory network. J. Comp. Neurosci., 9, 49–65.
Tiesinga, P., José, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Traub, R., Whittington, M., Colling, S., Buzsáki, G., & Jeffreys, J. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Treue, S., & Martinez Trujillo, J. (1999). Feature-based attention influences motion processing gain in macaque visual cortex. Nature, 399, 575–579.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Computation, 13, 959–992.
Wang, X., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J., Banks, M., Pearce, R., & Kopell, N. (2000). Networks of interneurons with fast and slow γ-aminobutyric acid type A (GABA_A) kinetics provide substrate for mixed gamma-theta rhythm. Proc. Natl. Acad. Sci., 97, 8128–8133.
White, J., Chow, C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.

Received June 24, 2003; accepted August 6, 2003.
LETTER
Communicated by Jonathan Victor
Dynamic Analyses of Information Encoding in Neural Ensembles

Riccardo Barbieri [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital/Harvard Medical School, Boston, MA 02114, U.S.A.
Loren M. Frank [email protected]
David P. Nguyen [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital; Division of Health Sciences and Technology, Harvard Medical School/MIT, Boston, MA 02114, U.S.A.
Michael C. Quirk [email protected] Picower Center for Learning and Memory, Riken-MIT Neuroscience Research Center, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Victor Solo [email protected] School of Electrical Engineering and Telecommunications, University of New South Wales, 2052, Sydney, Australia; Martinos Center for Biomedical Imaging, Massachusetts General Hospital/Harvard Medical School, Boston, MA 02114, U.S.A.
Matthew A. Wilson [email protected] Picower Center for Learning and Memory, Riken-MIT Neuroscience Research Center, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Emery N. Brown [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital; Division of Health Sciences and Technology, Harvard Medical School/MIT, Boston, Massachusetts 02114-2698, U.S.A.
Neural Computation 16, 277–307 (2004)
© 2003 Massachusetts Institute of Technology
R. Barbieri et al.
Neural spike train decoding algorithms and techniques to compute Shannon mutual information are important methods for analyzing how neural systems represent biological signals. Decoding algorithms are also one of several strategies being used to design controls for brain-machine interfaces. Developing optimal strategies to design decoding algorithms and compute mutual information are therefore important problems in computational neuroscience. We present a general recursive filter decoding algorithm based on a point process model of individual neuron spiking activity and a linear stochastic state-space model of the biological signal. We derive from the algorithm new instantaneous estimates of the entropy, entropy rate, and the mutual information between the signal and the ensemble spiking activity. We assess the accuracy of the algorithm by computing, along with the decoding error, the true coverage probability of the approximate 0.95 confidence regions for the individual signal estimates. We illustrate the new algorithm by reanalyzing the position and ensemble neural spiking activity of CA1 hippocampal neurons from two rats foraging in an open circular environment. We compare the performance of this algorithm with a linear filter constructed by the widely used reverse correlation method. The median decoding error for Animal 1 (2) during 10 minutes of open foraging was 5.9 (5.5) cm, the median entropy was 6.9 (7.0) bits, the median information was 9.4 (9.4) bits, and the true coverage probability for 0.95 confidence regions was 0.67 (0.75) using 34 (32) neurons. These findings improve significantly on our previous results and suggest an integrated approach to dynamically reading neural codes, measuring their properties, and quantifying the accuracy with which encoded information is extracted.
1 Introduction

Neural spike train decoding algorithms are commonly used methods for analyzing how neural systems represent biological signals (Georgopoulos, Kettner, & Schwartz, 1986; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Wilson & McNaughton, 1993; Warland, Reinagel, & Meister, 1997; Brown, Frank, Tang, Quirk, & Wilson, 1998; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Stanley, Li, & Dan, 1999; Pouget, Dayan, & Zemel, 2000). More recently, these algorithms are also one of several strategies being used to design controls for neural prosthetic devices (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Wessberg et al., 2000; Shoham, 2001; Serruya, Hatsopoulos, Paninski, Fellows, & Donoghue, 2002; Taylor, Tillery, & Schwartz, 2002) and other brain-machine interfaces (Donoghue, 2002; Wickelgren, 2003). Developing optimal strategies for constructing and testing decoding algorithms is therefore an important question in computational neuroscience.

The rat hippocampus has specialized neurons known as place cells with spatial receptive fields that carry a representation of the animal's environment (O'Keefe & Dostrovsky, 1971). The spatial information carried by the
spiking activity of the hippocampal place cells is believed to be an important component of the rat's spatial navigation system (O'Keefe & Nadel, 1978). We previously derived the Bayes' filter algorithm to decode the position of a rat foraging in an open environment from the ensemble spiking activity of pyramidal neurons in the CA1 region of the hippocampus (Brown et al., 1998). We represented the spatial receptive field of each neuron as a two-dimensional gaussian surface and assumed that the position of the animal in the receptive field through time parameterized the rate function of an inhomogeneous Poisson process. We modeled the animal's path as a random walk. The Bayes' filter was a causal, recursive decoding algorithm developed by computing gaussian approximations to the well-known Bayes' rule and the Chapman-Kolmogorov system of equations for state-space modeling (Mendel, 1995; Kitagawa & Gersh, 1996). The algorithm computed position estimates and their confidence regions at 33 msec intervals. Because the algorithm depended critically on the form of the spatial gaussian model, it could not be used with other models of the place fields. While representing the place fields as gaussian surfaces and the path as a random walk was reasonable, these models gave only an approximate description of these data (Brown et al., 1998). Methods to compute the Shannon mutual information (Skaggs, McNaughton, Gothard, & Marcus, 1993; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Dan, Alonso, Usrey, & Reid, 1998; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998; Reinagel & Reid, 2000; Reich, Melcher, & Victor, 2001; Johnson, Gruner, Baggerly, & Shehagiri, 2001; Nirenberg, Carcieri, Jacobs, & Latham, 2001; Victor, 2002), rather than decoding algorithms, are perhaps the most widely used techniques for analyzing how neural systems encode biological signals.
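Because such filters carry a gaussian approximation to the posterior, an instantaneous uncertainty measure has a closed form; the sketch below uses the standard entropy of a d-dimensional gaussian (in bits) and is offered as an illustration, not the estimator derived in this article:

```python
import numpy as np

def gaussian_entropy_bits(W):
    """Differential entropy (bits) of a d-dimensional gaussian with
    covariance W: 0.5 * log2((2*pi*e)^d * det W). A standard identity."""
    d = W.shape[0]
    return 0.5 * np.log2((2.0 * np.pi * np.e) ** d * np.linalg.det(W))
```

For a bivariate position posterior, doubling each standard deviation (quadrupling each variance) adds exactly 2 bits of entropy.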
None of these methods uses a parametric statistical model of the relation between spiking activity and the biological signal, and none computes mutual information estimates instantaneously (i.e., on the millisecond timescale of the decoded signal updates). Moreover, current analyses do not fully exploit the link between mutual information and decoding. We derive a general recursive lter decoding algorithm by using a point process to represent the spiking activity of individual neurons and a linear stochastic state-space model (Smith & Brown, 2003) to represent the animal’s path. From the algorithm, we derive new instantaneous estimates of the mutual information between position and the ensemble spiking activity, the entropy, and the entropy rate. We assess the accuracy of the algorithm by computing, along with the decoding error, the true coverage probability of the approximate 0.95 condence regions for the individual position estimates. We illustrate the new algorithm by reanalyzing the position and ensemble neural spiking activity of CA1 hippocampal neurons from two rats foraging in an open circular environment (Brown et al., 1998). For the point process, we use an inhomogeneous Poisson model with spatial receptive elds modeled as both gaussian surfaces and Zernike polynomial
R. Barbieri et al.
expansions (Born & Wolf, 1989), and for the linear state-space model of the path, we use a bivariate first-order autoregressive process. We compare the performance of these algorithms with a linear filter algorithm constructed by the widely used reverse correlation method (Warland, Reinagel, & Meister, 1997; Stanley et al., 1999; Wessberg et al., 2000; Serruya et al., 2002).

2 Theory

In this section, we develop the theoretical framework for our approach. First, in the encoding analysis, we define the relation between position and spiking activity for individual place cell neurons with two inhomogeneous Poisson models. In the first model, the dependence of the rate function on space is defined as a gaussian surface, whereas in the second model, the spatial dependence is represented as an expansion in Zernike polynomials. We describe how the two models are fit to the experimental data by maximum likelihood. Second, in the decoding analysis, we describe the autoregressive model of the path and the recursive filter algorithm we will use to decode position from the ensemble place cell spiking activity. Third, we describe several properties of the recursive filter algorithm. Fourth, we present the reverse correlation algorithm, and finally, we present our algorithms for computing instantaneous entropy, entropy rate, and mutual information.

2.1 Encoding Analysis: Place Cell Model. To define the point process observation model, we consider two inhomogeneous Poisson models for representing the place cell spiking activity.
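As a concrete preview of the first model, the gaussian-surface conditional intensity of equation 2.1 below can be evaluated as in the following sketch; the function name and all parameter values are illustrative, not estimates from these data:

```python
import numpy as np

def gaussian_intensity(x, alpha, mu, sigma1, sigma2):
    """Equation 2.1: firing rate (spikes/sec) of one place cell at x = (x1, x2)."""
    q_inv = np.diag([1.0 / sigma1 ** 2, 1.0 / sigma2 ** 2])  # Q is diagonal
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return np.exp(alpha - 0.5 * d @ q_inv @ d)

# The rate is maximal, exp(alpha), at the field center and decays away from it.
rate_center = gaussian_intensity([10.0, -5.0], alpha=2.0, mu=[10.0, -5.0],
                                 sigma1=8.0, sigma2=6.0)
rate_offset = gaussian_intensity([18.0, -5.0], alpha=2.0, mu=[10.0, -5.0],
                                 sigma1=8.0, sigma2=6.0)
```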
In the first, we model the rate or conditional intensity function $\lambda_G^c(t \mid x(t), \theta_G^c)$ as a two-dimensional gaussian surface defined as

\lambda_G^c(t \mid x(t), \theta_G^c) = \exp\left\{ \alpha^c - \tfrac{1}{2}\,(x(t) - \mu^c)'(Q^c)^{-1}(x(t) - \mu^c) \right\},   (2.1)

where $x(t) = (x_1(t), x_2(t))$ are the coordinates of the animal's position at time $t$, $\mu^c = (\mu_1^c, \mu_2^c)$ are the coordinates of the place field's center,

Q^c = \begin{pmatrix} (\sigma_1^c)^2 & 0 \\ 0 & (\sigma_2^c)^2 \end{pmatrix}   (2.2)

is the scale matrix, $\alpha^c$ is the log maximum firing rate (Brown et al., 1998), and $\theta_G^c = (\alpha^c, \mu^c, Q^c)'$, where $c$ indexes a neuron, for $c = 1, \ldots, C$. As in Brown et al. (1998), we have taken the off-diagonal elements of $Q^c$ to be zero because for each neuron, these parameter estimates were all close to zero. In the second, we model the conditional intensity function $\lambda_Z^c(t \mid x(t), \theta_Z^c)$ as an exponentiated linear combination of Zernike polynomials given as

\lambda_Z^c(t \mid x(t), \theta_Z^c) = \exp\left\{ \sum_{\ell=0}^{L} \sum_{m=-\ell}^{\ell} \theta_{\ell,m}^c\, z_\ell^m(\rho(t), \phi(t)) \right\},   (2.3)
Dynamic Analyses of Neural Information Encoding
where $z_\ell^m$ is the $m$th component of the $\ell$th-order Zernike polynomial, $\theta_{\ell,m}^c$ is the associated coefficient, $\rho(t) = r^{-1}[(x_1(t) - \eta_1)^2 + (x_2(t) - \eta_2)^2]^{1/2}$, $\phi(t) = \tan^{-1}[(x_2(t) - \eta_2)(x_1(t) - \eta_1)^{-1}]$, $(\eta_1, \eta_2)$ are the coordinates of the center of the circular environment, $r$ is the radius of the circular environment, and $\eta_1 = \eta_2 = r = 35$ cm. We have $\theta_Z^c = \{\{\theta_{\ell,m}^c\}_{m=-\ell}^{\ell}\}_{\ell=0}^{L}$, the coefficients in equation 2.3. The Zernike polynomials form an orthogonal basis whose support is restricted to the circular environment (Born & Wolf, 1989) and are defined as
Z_\ell^m(\rho(t), \phi(t)) = \begin{cases} R_\ell^m(\rho(t)) \sin(m\phi(t)) & m > 0 \\ R_\ell^m(\rho(t)) \cos(m\phi(t)) & m \le 0 \end{cases}   (2.4)

R_\ell^m(\rho(t)) = \begin{cases} \displaystyle\sum_{j=0}^{(\ell - |m|)/2} \frac{(-1)^j\,(\ell - j)!}{j!\left(\frac{\ell + m}{2} - j\right)!\left(\frac{\ell - m}{2} - j\right)!}\;\rho(t)^{\ell - 2j} & (\ell - m) \text{ even} \\[1ex] 0 & (\ell - m) \text{ odd}, \end{cases}   (2.5)
where $0 \le \rho(t) \le 1$, $0 \le \phi(t) \le 2\pi$, and $0 \le |m| \le \ell$. To balance the tradeoff between model flexibility and computational complexity, we chose $L = 3$. Based on equation 2.4, there are 10 nonzero coefficients for $L = 3$ in equation 2.3. We assume that during the experiment, the neural spiking activity of an ensemble of $C$ neurons is recorded on an interval $(0, T]$. We take the subinterval $(0, T^e]$ to be the encoding interval and the subinterval $(T^e, T]$ to be the decoding interval. The spiking activity recorded during the encoding interval will be used to estimate the model parameters, whereas the spiking activity recorded during the decoding interval will be used to estimate, or decode, the animal's position from the ensemble neural spiking activity through time. We define the encoding interval to be the first 15 minutes of the experiment for Animal 1 and the first 13 minutes for Animal 2. Let $0 < u_1^c < u_2^c < \cdots < u_{J_c}^c \le T$ be the set of $J_c$ spike times from neuron $c$ for $c = 1, \ldots, C$. For $t \in (0, T]$, let $N_{0:t}^c$ be the sample path of the spike times from neuron $c$ in $(0, t]$. It is defined as the event $N_{0:t}^c = \{0 < u_1^c < u_2^c < \cdots < u_j^c \le t \cap N^c(t) = j\}$, where $N^c(t)$ is the number of spikes in $(0, t]$ from neuron $c$ and $j \le J_c$. To estimate the parameters, $\theta_G^c$ for the gaussian model and $\theta_Z^c$ for the Zernike model, by maximum likelihood, we have to specify the likelihood function for these two models given the spiking activity in the encoding interval (Pawitan, 2001). It follows from the theory of point processes (Barbieri, Quirk, Frank, Wilson, & Brown, 2001; Brown, Barbieri, Eden, & Frank, 2003; Daley & Vere-Jones, 2003) that given the specification for neuron $c$ of the conditional intensity function for the gaussian model in equation 2.1 and for the Zernike model in equation 2.3, the likelihood
function for the two models on the encoding interval can be written as

L(N_{0:T^e}^c \mid \theta_j^c) = \exp\left\{ \int_0^{T^e} \log \lambda(u \mid \theta_j^c)\, dN^c(u) - \int_0^{T^e} \lambda(u \mid \theta_j^c)\, du \right\},   (2.6)
where $j = G$ and $j = Z$ index, respectively, the gaussian and the Zernike model parameters. The parameters of both models are estimated by maximum likelihood from the first 15 (13) minutes for Animal 1 (2) (Barbieri, Frank, Quirk, Wilson, & Brown, 2002). We used the Bayesian information criterion (BIC; Box, Jenkins, & Reinsel, 1994) to compare model goodness of fit.

2.2 Decoding Analysis: Path Model. The decoding interval contains the spiking activity and path data recorded in $(T^e, T]$. To simplify notation for the decoding analysis, we reexpress the decoding interval as the interval $(0, T^d]$, where $0 = T^e - T^e$ and $T^d = T - T^e$. That is, we denote the interval $(T^e, T]$ as the interval $(0, T^d]$. We define the updating lattice by choosing $K$ large and by dividing $(0, T^d]$ into $K$ intervals of equal width $\Delta = T^d / K$. To define the stochastic state-space model, we assume that the position of the rat during foraging obeys a bivariate $p$th-order linear gaussian autoregressive AR($p$) model defined as

x_k = \mu_x + F x_{k-1} + R^{1/2} \varepsilon_k,   (2.7)
where $x_k$ is the animal's position at time $k\Delta$, $F$ is a $2p \times 2p$ matrix of system parameters, $\mu_x$ is a $2p \times 1$ vector of mean parameters, $\varepsilon_k$ is a $2p \times 1$ gaussian random variable with mean zero and $2p \times 2p$ covariance matrix $W_\varepsilon$, and $R$ is the learning rate scale factor (LRSF), a tuning parameter for the decoding algorithm defined below. We estimate $\mu_x$, $F$, and $W_\varepsilon$ by maximum likelihood from the first 15 (13) minutes for Animal 1 (2) with $R = 1$. The state-space model used in the decoding algorithm has $p = 1$. We use the AR(1) model because, like the random walk model used in our previous analysis, it is an elementary model that enforces stochastic continuity on the decoding. In addition, the AR(1) model is stationary and therefore can be used in both the decoding and mutual information computations. The random walk model cannot be used in the mutual information estimation because it is a nonstationary process (Priestley, 1981).

2.3 Decoding Analysis: Recursive Filter Algorithm. In the encoding analysis, for a given animal, we estimate $\theta^c$ for each of its place cells, for both the Zernike and gaussian models, and we estimate the parameters of its path, $F$ and $W_\varepsilon$. In the decoding analysis, we treat these estimated parameters as known and use them to recursively estimate the animal's position from the ensemble spiking activity. We derive our decoding algorithm from the well-known recursion relation between two coupled probability densities:
the posterior probability density and the Chapman-Kolmogorov (one-step prediction) probability density, defined as (Mendel, 1995; Kitagawa & Gersh, 1996)

p(x_k \mid N_{0:k}) = \frac{p(x_k \mid N_{0:k-1})\, p(N_{k-1:k} \mid x_k, N_{0:k-1})}{p(N_{k-1:k} \mid N_{0:k-1})},   (2.8)

p(x_k \mid N_{0:k-1}) = \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid N_{0:k-1})\, dx_{k-1},   (2.9)
where $N_{0:k}$ is the ensemble neural spiking activity from $(0, k\Delta]$ for $k = 1, \ldots, K$. These equations are a recursive system for computing the probability density of the position at $k\Delta$ given the ensemble spiking activity in $(0, k\Delta]$. Under the assumption that the neurons are conditionally independent, the solutions to these equations using a gaussian approximation to equation 2.8 for the point process linear state-space model we study yield the following recursive filter algorithm:

(One-step prediction)

x_{k|k-1} = \mu_x + \hat{F} x_{k-1|k-1},   (2.10)

(One-step prediction variance or learning rate)

W_{k|k-1} = \hat{F} W_{k-1|k-1} \hat{F}' + R \hat{W}_\varepsilon,   (2.11)

(Posterior mode)

x_{k|k} = x_{k|k-1} + W_{k|k-1} \sum_{c=1}^{C} \nabla \log \lambda^c(x_{k|k} \mid \hat{\theta}_j^c) \left[ N_{k-1:k}^c - \lambda^c(x_{k|k} \mid \hat{\theta}_j^c)\Delta \right],   (2.12)

(Posterior variance)

W_{k|k}^{-1} = W_{k|k-1}^{-1} - \sum_{c=1}^{C} \left[ \nabla^2 \log \lambda^c(x_{k|k} \mid \hat{\theta}_j^c) \left[ N_{k-1:k}^c - \lambda^c(x_{k|k} \mid \hat{\theta}_j^c)\Delta \right] - \nabla \log \lambda^c(x_{k|k} \mid \hat{\theta}_j^c) \left[ \nabla \lambda^c(x_{k|k} \mid \hat{\theta}_j^c)\Delta \right]' \right],   (2.13)

for $k = 1, \ldots, K$, where $\lambda^c(x_{k|k} \mid \hat{\theta}_j^c)$ is the conditional intensity (rate) function for either the spatial gaussian or Zernike model for neuron $c$, $\hat{\theta}_j^c$ is the associated model parameter for either the spatial gaussian ($j = G$) or Zernike ($j = Z$) model estimated from the encoding analysis, $\hat{F}$ and $\hat{W}_\varepsilon$ are, respectively, the estimates of the transition matrix and the white noise covariance matrix for the AR(1) model from the encoding analysis, $\nabla$ ($\nabla^2$) denotes the gradient (Hessian) of the indicated function with respect to $x_k$, and the
notation $j \mid k$ denotes the estimate at time $j\Delta$ given the spiking activity in $(0, k\Delta]$. The derivation of the algorithm is given in the appendix. Equations 2.10 to 2.13 provide a recursive system of gaussian approximations to equations 2.8 and 2.9. Equations 2.10 and 2.11 give, respectively, the mean and covariance matrix for the gaussian approximation to $p(x_k \mid N_{0:k-1})$ in equation 2.9, whereas equations 2.12 and 2.13 give, respectively, the mean and covariance matrix for the gaussian approximation to $p(x_k \mid N_{0:k})$ in equation 2.8. Because we are computing gaussian approximations to $p(x_k \mid N_{0:k})$ and $p(x_k \mid N_{0:k-1})$, it suffices to recursively compute only their respective means and covariance matrices. We use a gaussian approximation because this is a standard first approach for approximating probability densities (Tanner, 1996; Pawitan, 2001). What is important about our analysis is that this gaussian approximation is made to the posterior probability density, yet the spiking activity enters into the computations in a very nongaussian way through the point process model (see equation A.7 in the appendix). This is very different from approximating the spiking activity as a gaussian process, an assumption that could not be justified over small time intervals for hippocampal neurons, whose average spike rate is 0.75 to 2 Hz. We mention alternatives to the gaussian approximation in section 4. The derivation of the algorithm in the appendix follows the arguments used in the maximum a posteriori derivation of the Kalman filter (Mendel, 1995; Brown et al., 1998; Fahrmeir & Tutz, 2001). The essential idea in the derivation is that the logarithm of equation 2.8 is approximated as a quadratic function (Tanner, 1996) at each step $k$ of the algorithm. The mode $x_{k|k}$ in equation 2.12 is the point where the quadratic approximation is carried out.
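Under the spatial gaussian model of equation 2.1, the derivatives needed by equations 2.12 and 2.13 are available in closed form ($\nabla \log \lambda = -Q^{-1}(x - \mu)$ and $\nabla^2 \log \lambda = -Q^{-1}$), so one predict-update cycle of the recursive filter can be sketched as below. The function name, the two-neuron ensemble, and all parameter values are illustrative; for simplicity, this sketch starts Newton's iterations at the one-step prediction rather than at $x_{k-1|k-1}$:

```python
import numpy as np

def filter_step(x_prev, W_prev, spikes, mu_x, F, W_eps, R, fields, delta,
                newton_iters=10):
    """One cycle of equations 2.10-2.13 for gaussian place fields.

    spikes[c] is the spike count of neuron c in ((k-1)*delta, k*delta];
    fields[c] = (alpha, mu, Q_inv) parameterizes equation 2.1.
    """
    # Equations 2.10-2.11: one-step prediction and its variance (learning rate).
    x_pred = mu_x + F @ x_prev
    W_pred = F @ W_prev @ F.T + R * W_eps
    W_pred_inv = np.linalg.inv(W_pred)

    # Equation 2.12 defines x_{k|k} implicitly; solve with Newton iterations.
    x = x_pred.copy()
    for _ in range(newton_iters):
        grad = -W_pred_inv @ (x - x_pred)       # gradient of the log posterior
        hess = -W_pred_inv
        for (alpha, mu, Q_inv), n in zip(fields, spikes):
            lam = np.exp(alpha - 0.5 * (x - mu) @ Q_inv @ (x - mu))
            g = -Q_inv @ (x - mu)               # grad log lambda
            grad = grad + g * (n - lam * delta)  # innovation term of eq. 2.12
            hess = hess - Q_inv * (n - lam * delta) \
                        - np.outer(g, g) * lam * delta
        x = x - np.linalg.solve(hess, grad)

    # Equation 2.13: posterior covariance from the negative Hessian at the
    # (approximate) mode.
    W_post = np.linalg.inv(-hess)
    return x, W_post

# Illustrative two-neuron ensemble: one spiking cell, one silent cell.
fields = [(2.0, np.array([5.0, 0.0]), np.eye(2) / 64.0),
          (2.0, np.array([-5.0, 0.0]), np.eye(2) / 64.0)]
x_post, W_post = filter_step(
    x_prev=np.zeros(2), W_prev=np.eye(2), spikes=[1, 0],
    mu_x=np.zeros(2), F=0.99 * np.eye(2), W_eps=0.1 * np.eye(2),
    R=1.0, fields=fields, delta=0.033)
```

A spike from the cell centered at (5, 0), together with silence from the cell at (-5, 0), pulls the posterior mode toward positive $x_1$, exactly the innovation weighting described above.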
It is computed by differentiating the log of the posterior probability density (see equation A.7) and setting the derivative equal to zero. If the posterior probability density is approximately symmetric, then $x_{k|k}$ also approximates its mean and median. In this case, the recursive filter algorithm in equations 2.10 to 2.13 is an approximate optimal filter in both the mean square and mean absolute error sense. Equations 2.12 and 2.13 are nonlinear functions of $x_{k|k}$ that are solved iteratively using Newton's procedure. The starting value for the Newton's procedure at step $k$ is the previous estimate $x_{k-1|k-1}$. The position update $x_{k|k}$ combines the one-step prediction $x_{k|k-1}$, based on the ensemble spiking activity from $(0, (k-1)\Delta]$, with the weighted sum of $[N_{k-1:k}^c - \lambda^c(x_{k|k} \mid \hat{\theta}_j^c)\Delta]$, the innovation or error signal from neuron $c$ for $c = 1, \ldots, C$, multiplied by the learning rate (one-step prediction variance) (Brown et al., 1998; Brown, Nguyen, Frank, Wilson, & Solo, 2001). The innovations are the new information from the ensemble spiking activity. As is true for a Poisson process, $\lambda^c(x_{k|k} \mid \hat{\theta}_j^c)\Delta$ defines, for a general point process model (see equation A.1), the probability of a spike in $((k-1)\Delta, k\Delta]$ (Daley & Vere-Jones, 2003). Each innovation compares the probability of a spike, $\lambda^c(x_{k|k} \mid \hat{\theta}_j^c)\Delta$, in $((k-1)\Delta, k\Delta]$ with $N_{k-1:k}^c$, which, for small $\Delta$, is 1 if a
spike is observed from neuron $c$ in $((k-1)\Delta, k\Delta]$ and 0 otherwise. Thus, for small $\Delta$, the innovation gives a weight in the interval $(-1, 1)$. A large positive weight arises if a spike occurs when a neuron has a low probability of a spike in $((k-1)\Delta, k\Delta]$, whereas a large negative weight arises if no spike occurs in $((k-1)\Delta, k\Delta]$ when the probability of a spike is high. By equation 2.12, for a given value of $R$, how much the new information is weighted depends on the learning rate. If the one-step prediction variance is large, then the uncertainty about the position estimate at $k\Delta$ is large given the spiking activity in $(0, (k-1)\Delta]$, and more weight is given to the innovations in determining the new position. If the one-step prediction variance is small, then the uncertainty about the position estimate at $k\Delta$ is small given the spiking activity in $(0, (k-1)\Delta]$, and less weight is given to the innovations in determining the new position. While the learning rate is dynamic (i.e., it changes at each step), increasing $R$ increases the weight given to the innovations. Therefore, for each animal, with its place field parameters fixed at the values estimated in the encoding analysis, we optimize the decoding algorithm by studying its performance as a function of $R$. The initial position, $x_0$, and the initial covariance matrix, $W_0$, are computed by local maximum likelihood (Brown et al., 1998) using the 1 sec of ensemble spiking activity prior to the start of the decoding period. As we show in the appendix, the recursive filter algorithm in equations 2.10 to 2.13 provides a straightforward generalization of our previous Bayes' filter algorithm by using a linear gaussian state-space model and by using the conditional intensity function, the rate function of a point process, to describe the spiking activity of each neuron.
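The AR(1) path model of equation 2.7 that drives the one-step prediction can be simulated directly. The parameter values below are illustrative (the diagonal of $F$ is set near the 0.99 reported in section 3.3; the noise covariance is invented):

```python
import numpy as np

rng = np.random.default_rng(1)

F = np.diag([0.99, 0.99])        # near the fitted diagonal of section 3.3
mu_x = np.zeros(2)
W_eps = np.diag([0.04, 0.04])    # white noise covariance (cm^2), invented
R = 1.0                          # learning rate scale factor
L_chol = np.linalg.cholesky(R * W_eps)

def simulate_path(x0, n_steps):
    """Simulate x_k = mu_x + F x_{k-1} + R^{1/2} eps_k (equation 2.7)."""
    x = np.asarray(x0, dtype=float)
    path = [x]
    for _ in range(n_steps):
        x = mu_x + F @ x + L_chol @ rng.standard_normal(2)
        path.append(x)
    return np.array(path)

# Unlike a random walk, this AR(1) process is stationary: the simulated path
# relaxes toward the stationary mean and fluctuates with bounded variance.
path = simulate_path([10.0, -10.0], 2000)
```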
Use of the conditional intensity function makes it possible to use point process models more general than the Poisson to represent the spiking activity of the ensemble neurons.

2.4 Reverse Correlation Decoding Analysis. We computed the reverse correlation filter by matrix regression of position on spiking activity from the first 15 (13) minutes of spike train and position data for Animal 1 (2), assuming a filter length of 1 sec and bin widths of 33 msec. The reverse correlation updates were computed at 33 msec intervals during the 10 minutes of decoding.

2.5 Confidence Regions and Coverage Probabilities. At $k\Delta$, an approximate 0.95 confidence region for the true position $x_k$ may be constructed as

(x_k - x_{k|k})'\, W_{k|k}^{-1}\, (x_k - x_{k|k}) \le 6,   (2.14)

for $k = 1, \ldots, K$, where 6 is the 0.95 quantile of the $\chi^2$ distribution with 2 degrees of freedom. If the decoding algorithm performs well, then approximately 0.95 of the confidence regions should cover the true position. We compute the dynamic estimate of the coverage probability by tracking the fraction $j_k / k$, where $j_k$ is the number of times the true position is within
the confidence regions in $(0, k\Delta]$ for $k = 1, \ldots, K$. The estimate of the true coverage probability for the entire decoding period is $j_K / K$.

2.6 Instantaneous Entropy and Entropy Rate. From equations 2.8, 2.12, and 2.13, it follows that the posterior probability density of position $x_k$ given the ensemble spiking activity in $(0, k\Delta]$ can be approximated as the gaussian probability density $p(x_k \mid N_{0:k})$ with mean $x_{k|k}$ (see equation 2.12) and covariance matrix $W_{k|k}$ (see equation 2.13). Another way to characterize the uncertainty in this probability density is to compute its entropy. We can express the entropy of this conditional probability density as (Cover & Thomas, 1991)

H(x_{k|k}) = -\int \log_2 [p(x_k \mid N_{0:k})]\, p(x_k \mid N_{0:k})\, dx_k = 2^{-1} \log_2 [(2\pi e)^2 |W_{k|k}|],   (2.15)

where $x_{k|k}$ denotes the random variable $x_k$ given $N_{0:k}$, $|W_{k|k}|$ is the determinant of $W_{k|k}$, and $\log_2$ is log base 2. To examine how the instantaneous entropy changes from one update to the next, we compute the entropy rate, which we define as

\Delta H(x_{k|k}) = H(x_{k|k}) - H(x_{k-1|k-1}) = 2^{-1} \log_2 \left( |W_{k|k}| / |W_{k-1|k-1}| \right).   (2.16)
The final line in equation 2.16 follows from the gaussian approximation to $p(x_k \mid N_{0:k})$. Because $H(x_{k|k})$ defines, in terms of number of bits, the uncertainty in the animal's position at time $k\Delta$ given the ensemble spiking activity in $(0, k\Delta]$, $\Delta H(x_{k|k})$ is the number of bits the ensemble spiking activity provided in $((k-1)\Delta, k\Delta]$; hence, $\Delta H(x_{k|k})$ is the entropy rate. The entropy rate in equation 2.16 has an intuitive interpretation in terms of the confidence region areas at each step of the decoding because, under the gaussian approximation to $p(x_k \mid N_{0:k})$, the area of the confidence region (see equation 2.14) at step $k$ is proportional to the determinant of $W_{k|k}$. The entropy rate is negative if $|W_{k-1|k-1}| > |W_{k|k}|$ and is positive if $|W_{k-1|k-1}| < |W_{k|k}|$. If there is less uncertainty about the animal's position at step $k$ than at step $k-1$, that is, $|W_{k-1|k-1}| > |W_{k|k}|$, then the entropy decreases at step $k$, whereas if there is more uncertainty about the animal's position at step $k$ than at step $k-1$, that is, $|W_{k|k}| > |W_{k-1|k-1}|$, then the entropy at step $k$ increases. We can use the relation between the covariance matrices and the entropy rate to analyze whether the value of the entropy rate in $((k-1)\Delta, k\Delta]$ is due to either the AR(1) model of the path or the spiking activity of the neural ensemble. By equation 2.11, $W_{k|k-1} = F W_{k-1|k-1} F' + R W_\varepsilon$. Therefore, if $|W_{k|k-1}| > |W_{k-1|k-1}|$, then the AR(1) path model tends to increase the uncertainty, and hence the entropy, about the animal's location at any step $k$ of
the algorithm. That is, the evolution of the path as described by the AR(1) model makes the entropy rate positive in $((k-1)\Delta, k\Delta]$. For the entropy rate to be negative or, equivalently, for $|W_{k|k}|$ to decrease, the ensemble spiking activity must cause the sum on the right-hand side of equation 2.13 to be negative. The activity of the neural ensemble can further increase the entropy rate beyond the increase due to the path model if the spiking activity makes the sum on the right-hand side of equation 2.13 positive. However, because the path model makes the entropy rate positive, a decrease in the entropy, or a negative entropy rate, in $((k-1)\Delta, k\Delta]$ can be due only to the spiking activity of the neural ensemble in that interval. Because this conclusion is predicated on $|W_{k|k-1}| > |W_{k-1|k-1}|$, we check this condition at each decoding step. As mentioned above, although we use a gaussian approximation to the probability densities, the ensemble spiking activity enters the computation of $W_{k|k}$ in equations 2.12 and 2.13 in a nongaussian manner, through the point process likelihood in equations A.2 and A.6.

2.7 Mutual Information. Because the animal's position as a function of time is modeled as a bivariate gaussian AR(1) process, its marginal probability density $p(x_k)$ is the gaussian probability density with mean $\mu = [I - F]^{-1}\mu_x$ and stationary covariance matrix $W_x$ satisfying $W_x = F W_x F' + W_\varepsilon$. Using the definition of the Shannon mutual information and the gaussian approximation to $p(x_k \mid N_{0:k})$, the mutual information between position at $k\Delta$ and the ensemble spiking activity in $(0, k\Delta]$ is (Cover & Thomas, 1991; Twum-Danso, 1997)

I(x_k; N_{0:k}) = -\int p(x_k) \log_2 [p(x_k)]\, dx_k - \left[ -\int\!\!\int p(x_k \mid N_{0:k}) \log_2 [p(x_k \mid N_{0:k})]\, dx_k\, p(N_{0:k})\, dN_{0:k} \right] = 2^{-1} \int \log_2 \left( |W_x| / |W_{k|k}| \right) p(N_{0:k})\, dN_{0:k},   (2.17)

where $|W_x|$ is the determinant of $W_x$ and $p(N_{0:k})$ is the marginal probability mass function of the neural ensemble in $(0, k\Delta]$.
Equation 2.17 can be evaluated at each step $k$ of the algorithm and defines, on average, the number of bits the random variable $N_{0:k}$, that is, the ensemble neural spiking activity in $(0, k\Delta]$, provides about the random variable $x_k$, the animal's position at $k\Delta$. This is in contrast to the conditional entropy at $k\Delta$, which describes the uncertainty about the animal's position at $k\Delta$ given the ensemble spiking activity in $(0, k\Delta]$. The mutual information describes a property of the system, whereas the instantaneous entropy provides a measure of uncertainty at a given time in a given realization of the ensemble spiking activity (Twum-Danso, 1997). Given the inhomogeneous Poisson model for the spike trains and the AR(1) model for the path, we can easily compute the integral in the last line of equation 2.17 by Monte Carlo.
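Under the gaussian approximation, equations 2.15 to 2.17 reduce to determinant computations, and the Monte Carlo average in equation 2.17 becomes a mean over realizations of the posterior covariance. The following sketch uses invented covariance matrices in place of actual filter outputs:

```python
import numpy as np

def entropy_bits(W):
    """Equation 2.15: H = (1/2) log2[(2 pi e)^2 |W|] for a bivariate gaussian."""
    return 0.5 * np.log2((2.0 * np.pi * np.e) ** 2 * np.linalg.det(W))

def entropy_rate_bits(W_curr, W_prev):
    """Equation 2.16: Delta H = (1/2) log2(|W_curr| / |W_prev|)."""
    return 0.5 * np.log2(np.linalg.det(W_curr) / np.linalg.det(W_prev))

def mutual_information_bits(W_x, W_post_samples):
    """Equation 2.17: I ~ (1/2) E[log2(|W_x| / |W_{k|k}|)], the expectation
    taken over Monte Carlo realizations of the ensemble spiking activity."""
    det_x = np.linalg.det(W_x)
    dets = np.array([np.linalg.det(W) for W in W_post_samples])
    return 0.5 * np.mean(np.log2(det_x / dets))

W_x = np.diag([2.0, 2.0])   # invented stand-in for the stationary path covariance
posterior_covs = [np.diag([0.5, 0.5]), np.diag([0.25, 0.5])]   # invented W_{k|k}
info = mutual_information_bits(W_x, posterior_covs)
```

The $(2\pi e)^2$ factors cancel between the marginal and conditional entropies, which is why only the determinant ratio survives in the last line of equation 2.17.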
3 Application

As in our previous work (Brown et al., 1998), we divide the experimental data into two parts and the analysis into two stages: encoding analysis and decoding analysis. In the encoding analysis, we use the first part of the experimental data to estimate the relation between spiking activity and position for each neuron for both animals and to estimate for each animal the parameters of the AR(1) path model. In the decoding analysis, we apply the path and place cell models with their parameters estimated in the encoding analysis to the second part of the experimental data to decode the position of each animal from the ensemble spiking activity of its respective hippocampal neurons. We perform the decoding analysis by comparing the gaussian and Zernike decoding models across a range of learning rates and two update intervals. For the optimal model, we analyze the mutual information between the spiking activity and the path. We begin with a brief review of the experimental protocol.

3.1 Experimental Protocol. We applied our paradigm to place cell spike train and position data recorded from two Long-Evans rats freely foraging in a circular environment 70 cm in diameter with walls 30 cm high and a fixed visual cue. A multiunit electrode array was implanted into the CA1 region of the hippocampus of each animal. The simultaneous activity of 34 (32) place cells was recorded from the electrode array while the two animals foraged in the environment for 25 (23) minutes. Simultaneous with the recording of the place cell activity, the position of each animal was measured at 30 Hz by a camera tracking the location of two infrared diodes mounted on the animal's headstage (Brown et al., 1998).

3.2 Encoding Analysis. The maximum likelihood estimates of the spatial gaussian and Zernike inhomogeneous Poisson rate functions computed from the first 15 (13) minutes for Animal 1 (2) are shown in Figure 1. Neuron 34 for Animal 1 and neurons 29 to 32 for Animal 2 had split fields.
Therefore, for the analysis with the gaussian model, the spiking activity of each of these neurons was fit with two gaussian surfaces to capture this feature. The spiking activity of each neuron was also fit with both a single Zernike model and two Zernike models for comparison. The spatial model fits in Figure 1 are displayed along with smoothed spatial rate maps. The empirical rate maps are displayed as an approximation to the raw data. In general, both the gaussian and Zernike model estimates of the place fields agree, with several noticeable exceptions. The Zernike place field estimates are concentrated in a smaller area and have a wider range of asymmetric shapes. The gaussian model attempts to capture the asymmetry by placing the center of the field outside the circle. The Zernike models can fit split fields with a single polynomial expansion, whereas two gaussian models are required to fit split fields. In a few cases when a place field is concentrated on one border, the Zernike models produce an artifact of nonzero spiking activity on the opposite side of the environment. By the BIC, 29 (30) of the 34 (32) cells of Animal 1 (2) were best fit by the Zernike spatial model, and the remaining 5 (2) were best fit by the spatial gaussian model, suggesting that the former is an important improvement over the latter.

Figure 1: Spatial gaussian and Zernike place fields. Smoothed spatial spike histogram (row 1) and pseudocolor maps of place field estimates from the spatial gaussian (row 2) and Zernike (row 3) models for the 34 place cells for Animal 1 (A) and the 32 place cells for Animal 2 (B).

3.3 Decoding Analysis. We estimated the $F$ and $W_\varepsilon$ matrices for both Animal 1 and Animal 2 by maximum likelihood from the first 15 (13) minutes of the respective path data in the encoding stage. The diagonal elements of the $F$ matrices were 0.99 for both animals, and the off-diagonal elements were less than 0.01. This means that the decoding estimates from the AR(1) model will be similar to those from the random walk model and that the determinants of the $F$ matrices are close to 1. We decoded the position from the ensemble spiking activity of the 34 (32) neurons from the last 10 minutes of foraging for Animal 1 (2) using 20 different combinations of parameters and models for our algorithm (see equations 2.10-2.13). These were two rate function models, spatial gaussian or Zernike; two updating intervals, $\Delta = 3.3$ or 33.3 msec; and five values of the LRSF, $R = 1, 2, 5, 10$, or 20 (see equation 2.11). For both animals, the median decoding error was a minimum when the LRSF was 5 (see Figure 2), suggesting that below 5, the new spiking activity should be weighted more, whereas above 5, it should be weighted less. For the five LRSFs, the Zernike models with both 3.3 and 33 msec updating had smaller median decoding errors than the corresponding spatial gaussian models.

Figure 2: Summary of decoding algorithm performance. (A, B) Median decoding error computed as the difference between the true and estimated path at each decoding update. (C, D) Median square root of the 0.95 confidence region area. (E, F) Median coverage probability as a function of the learning rate scale factor for the spatial gaussian model with a 33 msec update (dark blue), the spatial gaussian model with a 3.3 msec update (light blue), the Zernike model with a 33 msec update (light red), and the Zernike model with a 3.3 msec update (dark red) for Animal 1 (left column) and Animal 2 (right column). The black circles in A and B indicate the minimum median error for the Zernike model with 3.3 msec updating and LRSF = 5. The red circles in A and B indicate the median error of the spatial gaussian model with 33 msec updating and LRSF = 1. Although the spatial gaussian model also used the AR(1) path model, the results are identical to the decoding results reported for this model in Brown et al. (1998) using the random walk path model. The black circles in E and F indicate the median coverage probability of the Zernike model with 3.3 msec updating and LRSF = 5. The red circles indicate the median coverage probability of the spatial gaussian model with 33 msec updating and LRSF = 1. Because the Zernike model with 3.3 msec updating and LRSF = 5 achieved for both animals nearly the highest median coverage probability with the smallest median error, we termed it the optimal Zernike model.

For Animal 2, the 3.3 msec updating gave a smaller median error for both the Zernike and spatial gaussian models at all LRSFs, whereas for Animal 1, there was no difference in median error between the two sets of models for the 3.3 versus the 33 msec updating until the LRSF was greater than 5. The Zernike model with the 3.3 msec updating and an
LRSF of 5 gave the smallest median decoding error: 5.9 (5.5) cm for Animal 1 (2). This is an improvement of 2.0 (2.2) cm over the 7.9 (7.7) cm obtained using the spatial gaussian model and either the random walk (Brown et al., 1998) or AR(1) path model (the red circles in Figure 2). Of this improvement, 1.2 (1.3) cm for Animal 1 (2) was due to optimizing the learning rate, whereas 0.7 (0.7) cm was due to the switch from the spatial gaussian to the Zernike model. Because increasing the LRSF increases the variance of the position estimate (see equations 2.11 and 2.13), the areas of the confidence regions necessarily grow with increasing LRSF. Beyond an LRSF of 2, this increase is faster for all algorithms for both animals (see Figure 2). For all the algorithms, the confidence region areas are smaller for the Zernike models than for their corresponding spatial gaussian models. At an LRSF of 5, the median confidence region area for the Zernike model was 8.6² (9.3²) cm² compared with 9.3² (10.5²) cm² for Animal 1 (2). The true coverage probability for the 0.95 confidence regions (see equation 2.14) also increases with increasing LRSF. However, at an LRSF of 5, it reaches an approximate plateau for both animals for all algorithms. Up to an LRSF of 5, the coverage probability for both animals is greater for any Zernike model compared with its corresponding spatial gaussian model. The coverage probabilities plateau after 5 because even though the confidence region areas increase with higher learning rates (see Figures 2C and 2D), the higher learning rates beyond this point give less accurate decoding (see Figures 2A and 2B). At an LRSF of 5, for the Zernike model with 3.3 msec updating, the coverage probability of the 0.95 confidence regions was 0.67 (0.75) for Animal 1 (2) (black circles in Figures 2E and 2F). This is a significant improvement over the 0.31 (0.40) for the spatial gaussian model with either the random walk or AR(1) path models (red circles in Figures 2E and 2F).
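The coverage probabilities reported above follow from checking equation 2.14 at each update. A sketch with invented positions and covariances (5.99 is the 0.95 quantile of the $\chi^2$ distribution with 2 degrees of freedom, rounded to 6 in the text):

```python
import numpy as np

CHI2_95_DF2 = 5.99   # 0.95 quantile of chi-square with 2 degrees of freedom

def covers(x_true, x_est, W_est):
    """Equation 2.14: is x_true inside the approximate 0.95 confidence region?"""
    d = np.asarray(x_true, dtype=float) - np.asarray(x_est, dtype=float)
    return float(d @ np.linalg.inv(W_est) @ d) <= CHI2_95_DF2

def coverage_probability(x_true_seq, x_est_seq, W_seq):
    """Running coverage j_K / K over a decoding period."""
    hits = [covers(xt, xe, W) for xt, xe, W in zip(x_true_seq, x_est_seq, W_seq)]
    return float(np.mean(hits))

# Invented example: one estimate whose region covers the truth, one that misses.
running_coverage = coverage_probability(
    x_true_seq=[[1.0, 1.0], [3.0, 3.0]],
    x_est_seq=[[0.0, 0.0], [0.0, 0.0]],
    W_seq=[np.eye(2), np.eye(2)])
```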
We term the decoding algorithm using the Zernike model with LRSF = 5 and a 3.3 msec update the optimal Zernike model. Illustrations of the recursive filter decoding algorithm with the spatial gaussian and optimal Zernike models are shown, along with a reverse correlation analysis, for a 60 sec continuous segment of the path from Animal 1 in Figure 3. The estimated trajectories from the optimal Zernike model resembled the true trajectories more closely than did those of the spatial gaussian model and had the smallest median error in each of the four 15 sec segments. In contrast, the reverse correlation algorithm performed poorly over the entire segment. The differences in performance of the algorithms in decoding the spike train data for both animals can best be appreciated in the videos of this analysis on our Web site (http://neurostat.mgh.harvard.edu). To quantify further the relative performance of the decoding algorithms, we computed the decoding error, defined for each algorithm as the difference between the true and estimated path at each 33 msec time step for the 10 min of decoding. The box plot summaries of these error distributions for the three decoding algorithms are shown in Figure 4. For Animal 1 (2),
Figure 3: Continuous 60 sec segment of the true path (gray) displayed in four 15 sec intervals with the spatial gaussian decoding estimate with LRSF = 1 and 33 msec update (black, left column), the optimal Zernike model estimate (black, center column), and the reverse correlation estimate (black, right column). Numbers are the median error for the segment.
Figure 4: Box plot summary of decoding error (true position − estimated position) distributions. Box plot summaries of the decoding error distributions for the spatial gaussian model with LRSF = 1 and 33 msec updating (left), the optimal Zernike model (center), and the reverse correlation (right) for Animals 1 (A) and 2 (B). The lower border of the box is the 25th percentile of the distribution, and the upper border is the 75th percentile. The white bar within the box is the median of the distribution. The distance between the 25th and 75th percentiles is the interquartile range (IQR). The lower (upper) whisker is at 1.5 times the IQR below (above) the 25th (75th) percentile. All the black bars below (above) the lower (upper) whiskers are far outliers. For reference, less than 0.35% of the observations from a gaussian distribution would lie beyond the 75th percentile plus 1.5 × IQR.
the median decoding error for the spatial gaussian model was 7.7 (7.9) cm, with a 25th percentile of 5.2 (4.9) cm, a 75th percentile of 10.7 (10.9) cm, a minimum of 0.12 (0.06) cm, and a maximum of 44.6 (30.7) cm. For Animal 1 (2), the median decoding error for the Zernike model was 5.9 (5.5) cm, with a 25th percentile of 3.9 (3.3) cm, a 75th percentile of 8.6 (8.4) cm, a minimum of 0.8 (0.4) cm, and a maximum of 22.3 (43.8) cm. The decoding error distribution for Animal 1 was smaller for the optimal Zernike model than for the spatial gaussian model (see Figure 4). For Animal 2, the decoding error distribution for the Zernike model was smaller up to the upper whisker of the box plot; beyond this point, however, it had a larger tail than the distribution for the corresponding spatial gaussian model. For both animals, the reverse correlation decoding errors were much larger. For Animal 1 (2), the median decoding error for the reverse correlation method was 27.5 (27.3) cm, with a 25th percentile of 14.7 (14.3) cm, a 75th percentile of 49.7 (43.8) cm, a minimum of 0.2 (0.7) cm, and a maximum of 143.3 (146.8) cm. The optimal Zernike model was also noticeably more accurate than the spatial gaussian model near the borders of the environment (see Figure 5). This is because the Zernike model gave a more accurate description of the place fields (see Figure 1) and because, by equation 2.5, the support of the Zernike model is restricted to the circle.

Figure 5: Decoding analysis: 95% confidence regions. Fifteen sec of true path (gray), the optimal Zernike model estimate (black, A), the corresponding spatial gaussian model estimate (black, B), and 0.95 confidence regions (ellipses) computed at position estimates spaced 1.5 sec apart. The median decoding error for the Zernike (gaussian) model is 5.99 (7.28) cm, and the median confidence region area is 9.32 (10.62) cm² on this segment.

The more realistic trajectory estimates, the smaller decoding error, and the greater coverage probability with smaller confidence regions (see Figures 2B and 2C, 4, and 5) show that the recursive filter decoding algorithm with the Zernike model was more accurate than either this algorithm with the spatial gaussian model or the reverse correlation procedure.

3.4 Instantaneous Entropy, Entropy Rate, and Mutual Information. Along with the decoding estimate at each time step, we computed from the decoding analysis with the optimal Zernike model the instantaneous entropy, the entropy rate, and the Shannon mutual information between the animal's position and the ensemble spiking activity. Figure 6 shows the instantaneous entropy, entropy rate, confidence region area, decoding error, and coverage probability for the 60 sec segment of data from Animal 1 considered in Figure 3. By equations 2.14 and 2.15, the number of bits in the entropy of position given the ensemble spiking activity (see Figure 6A) is directly proportional to the area of the 0.95 confidence region at that instant (see Figure 6C). The entropy provided a different assessment of the algorithm's performance than the decoding error. For example, at 12 sec, the decoding error was 20 cm (see Figure 6D), the confidence region area was 172 cm² (see Figure 6C), and the entropy was 9.3 bits (see Figure 6A). However, at 44 sec, the decoding error was nearly the same, yet the confidence region area was 72 cm² and the entropy was 8.0 bits. Because the entropy rate approximates the derivative of the entropy, it was positive when the instantaneous entropy increased and negative when it decreased (see Figure 6B).

Figure 6: Decoding and mutual information analysis. (A) Instantaneous entropy. (B) Entropy rate. (C) Square root of the confidence region area. (D) Decoding error. (E) Coverage probability for the continuous 60 sec segment in Figure 3, computed by decoding with the optimal Zernike model.

Figure 7: Box plots of the (A) instantaneous entropy, (B) entropy rate, and (C) Shannon mutual information distributions for the optimal Zernike model for Animal 1 (left) and Animal 2 (right) for the 10 min of decoding. See Figure 4 for an interpretation of the box plots.

For both animals, we verified that at each time step k of the algorithm, |W_{k|k−1}| > |W_{k−1|k−1}|. Therefore, as stated in section 2.6, the negative entropy rates were due solely to the ensemble spiking activity. The mean and median of the entropy rate were zero during this 60 sec segment, suggesting that there was no overall increase or decrease in the uncertainty about position given the ensemble spiking activity. The instantaneous coverage probabilities for this 60 sec segment ranged from 0.71 to 0.85 (see Figure 6E). The value of 0.73 at the end of the segment is close to the total decoding-stage coverage probability of 0.67 for this animal. Together, the decoding error, the entropy, the entropy rate, the confidence regions, and the coverage probability give a comprehensive assessment of algorithm performance that cannot be appreciated from any single measure alone. To appreciate the range of entropies and entropy rates observed during the 10 minutes of decoding with the optimal Zernike model, we summarized the distributions of these values for each animal as box plots in Figure 7. During the 10 min of decoding with the optimal Zernike model, the median
entropy was 6.9 (7.0) bits for Animal 1 (2), with a minimum of 3.5 (5.2) bits and a maximum of 9.3 (12.0) bits (see Figure 7). For the entire 10 min decoding stage, as in the analysis of the 60 sec segment of the ensemble spiking activity in Figure 6, the median entropy rate was −0.08 (−0.09) and the mean entropy rate was 0.002 (0.004) for Animal 1 (2). The corresponding 25th and 75th percentiles of the entropy rate distributions were −0.57 (−0.63) and 0.67 (0.73) for Animal 1 (2), respectively. These findings show that there was no overall growth or decline in the entropy of position given the ensemble spiking activity during the decoding time interval. We computed the mutual information in equation 2.17 by using Monte Carlo methods to simulate 50 realizations of the path (equation 2.7) and of the ensemble spiking activity given the path model (equation 2.3) and the maximum likelihood estimates of the Zernike model parameters computed in the encoding analysis. The median and mean mutual information between position and ensemble spiking activity was 9.38 (9.45) bits, with 25th and 75th percentiles of 9.37 (9.43) and 9.41 for Animal 1 (2), respectively.

4 Discussion

Our Bayes' filter previously estimated the rat's position with a median error of 8 cm during 10 min of decoding using ~30 place cells, with a coverage probability of ~0.35 (Brown et al., 1998). The recursive point process decoding algorithm with the optimal Zernike model reduced the median error to less than 6 cm and increased the coverage probability to ~0.70, suggesting that the ensemble spiking activity carries more information than we previously estimated. At any instant, approximately 30% of the 10^5 (3 × 10^4) neurons in the hippocampus are active in a given environment (Wilson & McNaughton, 1993). Therefore, our results based on ~30 neurons suggest that this brain region maintains a precise, dynamic representation of the animal's spatial location.
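The entropy and confidence-region quantities of section 3.4 are both functions of the posterior covariance under the gaussian approximation, and they rise and fall together. The sketch below is our illustration, not code from the paper: for a p-dimensional gaussian posterior with covariance W, the entropy is ½ log₂((2πe)^p |W|) bits, and for p = 2 the 0.95 confidence ellipse has area π q |W|^{1/2}, where q = −2 ln(0.05) is the 0.95 quantile of the chi-square distribution with two degrees of freedom (which is exponential, so the quantile is available in closed form). The example covariance is made up.

```python
import numpy as np

def gaussian_entropy_bits(W):
    """Entropy (bits) of a gaussian with covariance W: 0.5*log2((2*pi*e)^p |W|)."""
    W = np.asarray(W, dtype=float)
    p = W.shape[0]
    return 0.5 * np.log2((2.0 * np.pi * np.e) ** p * np.linalg.det(W))

def confidence_ellipse_area(W, level=0.95):
    """Area of the `level` confidence ellipse for a 2D gaussian posterior.

    For 2 degrees of freedom the chi-square quantile is -2*ln(1 - level),
    since chi-square(2) is the exponential distribution with mean 2.
    """
    chi2_q = -2.0 * np.log(1.0 - level)
    return np.pi * chi2_q * np.sqrt(np.linalg.det(W))

# Example posterior covariance in cm^2; the numbers are illustrative only.
W = np.array([[9.0, 2.0], [2.0, 16.0]])
H = gaussian_entropy_bits(W)         # instantaneous entropy (Figure 6A analog)
A = confidence_ellipse_area(W)       # 0.95 region area (Figure 6C analog)
# The entropy rate of section 3.4 approximates the discrete derivative
# (H_k - H_{k-1}) / dt of this entropy along the decoding recursion.
```

Because the entropy is linear in log |W| while the ellipse area is linear in |W|^{1/2}, a shrinking confidence region always lowers the entropy and vice versa.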
4.1 Decoding, Coverage Probability, Entropy, and Mutual Information. The central concept in our decoding paradigm is the recursive estimation of the posterior probability density (see equation 2.8). The decoding estimates, entropy, mutual information, confidence regions, and coverage probability are all functions of the posterior probability density that can be computed directly once this density has been estimated. The relation among these functions reflects the distinction between the message in the code, certain properties of the code, and the accuracy with which the algorithm reads the code. The decoded position estimate, together with its 0.95 confidence region, gives a read-out of the message in the code at a given instant. In contrast, the entropy of the animal's position given the ensemble spiking activity, and the area of the 0.95 confidence region at that instant, are second-order properties of the code. The smaller the number of bits in the entropy, the smaller is the confidence region. In this way, the entropy rate,
like the changes in the areas of the confidence regions, identifies when the spiking activity contributes "meaningfully" to the decoding estimation. Because the mutual information is computed by integrating the entropy with respect to the marginal probability mass function of the ensemble spiking activity (equation 2.17), it measures the average number of bits the ensemble provides about the path at each time point. Hence, the entropy provides a measure of uncertainty for the current experiment, whereas the mutual information defines the number of bits that the neural ensemble conveys about the animal's position on average, that is, over many realizations. The decoding error and the coverage probability measure the accuracy with which the algorithm reads the neural code. The coverage probability simultaneously measures the accuracy of the algorithm's first-order (position estimate) and second-order (position uncertainty) properties. Although the coverage probability of no model reached the expected 0.95, the coverage probability of each Zernike model was greater than, and its confidence region areas smaller than, those of the corresponding spatial gaussian model. Moreover, the coverage probability of the optimal Zernike model was greater than that of our original spatial gaussian model with either the random walk or the AR(1) path model. These findings suggest that the new algorithm with the new spatial model is more accurate. Our computation of the instantaneous entropy and the entropy rate follows directly from applying the definition of entropy to the Bayes' rule Chapman-Kolmogorov updating formulas in equations 2.8 and 2.9. We used these calculations to make explicit the relation between recursively computing a decoding estimate and computing mutual information. In the current experiments, the entropy was constant at approximately 7 bits, and the median and mean entropy rates were zero.
This suggests that the uncertainty in position determined by the ensemble of hippocampal neurons did not change during the experiment. Because place fields do evolve as an animal moves through both familiar (Mehta, Lee, & Wilson, 2002) and novel environments (Frank, Stanley, & Brown, 2003), it would not have been surprising to see an increase in entropy and more positive entropy rates as the decoding proceeded. This is because, as the place fields evolve, their estimates computed during the encoding stage no longer represent the spatial locations that the neurons encode during the decoding stage. To carry out decoding in this case would require combining our decoding algorithms with adaptive estimation algorithms (Brown et al., 2001) to track place field evolution during decoding. Equation 2.16 and our analysis in section 2.6 showed that, by the construction of our algorithm, a negative entropy rate at any step or, equivalently, a decrease in the entropy came solely from the spiking activity and not from the AR(1) path model. This was because the AR(1) path model contributed a positive entropy rate at each step, since |W_{k|k−1}| > |W_{k−1|k−1}| at each step. Furthermore, because under the Poisson assumption used in the current analysis the spiking activity of neurons in
nonoverlapping intervals is independent, the decrease in entropy due to the spiking activity did not arise from correlated activity. The point process conditional intensity framework is well suited to carrying out the important next step of repeating the current entropy and mutual information analysis with models of history-dependent spiking activity. Our parametric modeling paradigm provides dynamic estimates of the entropy, the entropy rate, and the mutual information, using the Bayes' rule and Chapman-Kolmogorov equations (see equations 2.8 and 2.9) to model explicitly the temporal evolution of the relation between position and ensemble spiking activity. The paradigm relates decoding and mutual information analyses and offers a plausible alternative to static mutual information analyses computed over long time intervals. Moreover, our approach to estimating entropy and mutual information obviates concerns about choosing word lengths and about the accuracy of information estimation, particularly when word lengths are large relative to the length of the spike train (Victor, 2002).

4.2 The State-Space Paradigm Applied to Neural Spike Train Decoding. The current decoding algorithm is part of the paradigm we have been developing to conduct state estimation from point process measurements (Smith & Brown, 2003). We reviewed there the relation of our paradigm to other approaches to state estimation from point process observations. We previously compared our Bayes' filter algorithm to several other approaches (Brown et al., 1998). In this study, we compared our new decoding algorithm to reverse correlation because it was not included in our previous comparison and because it is one of the most widely used methods for decoding ensemble neural spike train activity (Warland et al., 1997; Stanley et al., 1999; Serruya et al., 2002).
The reverse correlation algorithm, while simple to implement, performed significantly worse than any version of our algorithm because it made no use of the spatial and temporal structure in this problem. This finding suggests that applying our algorithm to other decoding and neural prosthetic control problems where reverse correlation methods have performed successfully (Bialek et al., 1991; Warland et al., 1997; Stanley et al., 1999; Wessberg et al., 2000; Serruya et al., 2002) may yield important improvements. The decoding error, confidence regions, and coverage probabilities used together are more meaningful measures of algorithm performance than the R² coefficient typically reported in reverse correlation analyses (Wessberg et al., 2000; Serruya et al., 2002; Taylor et al., 2002). Equations 2.10 to 2.13 generalize our Bayes' filter decoding algorithm to one using arbitrary point process and linear stochastic state-space models. To define our new algorithm or, equivalently, the posterior probability density (see equation 2.8), it suffices to specify these two models. This general formulation was necessary for our current analysis because our original algorithm depended critically on the form of the spatial gaussian model (Brown et al., 1998) and hence could not be applied to the Zernike model. The current formulation of the algorithm can use any point process model
that can be defined in terms of a conditional intensity function (see equation A.1). The inhomogeneous Poisson observation model based on the Zernike expansion gave more accurate decoding results than the one based on the gaussian model for two reasons. First, the Zernike model was more flexible and represented more realistically the complex place field shapes in the circular environment, as indicated by the plots in Figure 1 and the BIC goodness-of-fit assessments. In particular, the support of the Zernike models was restricted to the circular environment, whereas the support of the gaussian model was all of R². Second, the algorithm with the Zernike Poisson model decoded more accurately because this model more accurately estimated the probability of a neuron's spiking or not spiking in a given location. The decoding update (see equation 2.11) "listens" to both spiking and silent neurons in each time window. Because few spikes occur within the 3.3 and 33 msec time windows, the new information (innovations) at each update comes mostly from the neurons that do not spike. Using the information from the neurons that do not spike in order to decode was a feature of our Bayes' filter algorithm (Brown et al., 1998). Wiener and Richmond (2003) recently noted this feature as well. This observation is consistent with the idea that once a brain region develops a representation, the neurons that do not spike, or are inhibited, can convey as much information as those that do spike. We chose an AR(1) model as our linear stochastic state-space model. There were no differences between the decoding results for the random walk model and those for the AR(1) model. However, because the AR(1) model is stationary, unlike the random walk model, it could be used to compute meaningful estimates of mutual information. Scaling the model's white noise variance (see equation 2.7) scaled the learning rate.
This scaling significantly enhanced the algorithm's accuracy by balancing the weight given to the prediction based on the previous estimate (see equation 2.10) against the modification produced by the innovations (see equation 2.12). Further improvements in our algorithm can be readily incorporated into the current framework. First, these include combining the Zernike spatial model with models of known hippocampal temporal dynamics such as bursting (Quirk & Wilson, 1999; Barbieri et al., 2001), theta phase modulation (Skaggs, McNaughton, Wilson, & Barnes, 1996; Brown et al., 1998; Jensen & Lisman, 2000), and phase precession (O'Keefe & Reece, 1993). Second, we can consider higher-order linear state-space models (Shoham, 2001; Gao, Black, Bienenstock, Shoham, & Donoghue, 2002; Smith & Brown, 2003) or, perhaps, nonlinear state-space models to describe the path dynamics more accurately (Kitagawa & Gersh, 1996). Finally, we developed gaussian approximations to evaluate the Bayes' rule and Chapman-Kolmogorov equations in equations 2.8 and 2.9 as a plausible initial approach. We can improve our algorithm and evaluate the accuracy of this approximation by applying nongaussian approximations (Pawitan, 2001), exact numerical integration techniques (Genz & Kass, 1994; Kitagawa & Gersh, 1996), and Monte Carlo methods (Doucet, DeFreitas, Gordon, & Smith, 2001; Shoham, 2001)
to evaluate these equations. We suggest our point process linear state-space model framework as a broadly applicable approach for studying decoding problems as well as for designing algorithms to control neural prosthetic devices and brain-machine interfaces.

Appendix: Derivation of the Decoding Algorithm

During the decoding interval (0, T_d], we record the spiking activity of C neurons. Let 0 < u^c_1 < u^c_2 < ... < u^c_{J_c} ≤ T_d be the set of J_c spike times from neuron c, for c = 1, ..., C. For t ∈ (0, T_d], let N^c_{0:t} be the sample path of the spike times from neuron c in (0, t]. It is defined as the event N^c_{0:t} = {0 < u^c_1 < u^c_2 < ... < u^c_j ≤ t ∩ N^c(t) = j}, where N^c(t) is the number of spikes in (0, t] from neuron c and j ≤ J_c. Let N_{0:t} = {N^1_{0:t}, ..., N^C_{0:t}} be the ensemble spiking activity in (0, t]. To define a general point process probability model for a neuron, it suffices to specify its conditional intensity function for t ∈ (0, T_d], defined as

$$\lambda^c(t \mid N_{0:t}) = \lim_{\Delta \to 0} \frac{\Pr(N^c(t + \Delta) - N^c(t) = 1 \mid N_{0:t})}{\Delta}. \tag{A.1}$$
The conditional intensity function is a history-dependent rate function that generalizes the definition of the Poisson rate (Barbieri et al., 2001; Brown et al., 2001; Brown et al., 2002; Brown et al., 2003; Daley & Vere-Jones, 2003). If the point process is an inhomogeneous Poisson process, then the conditional intensity is λ^c(t | N_{0:t}) = λ^c(t). It follows that λ^c(t | N_{0:t})Δ is the probability of a spike in [t, t + Δ) when there is history dependence in the spike train. From equation A.1, we denote the point process observation model on the updating lattice by the conditional intensity function λ^c(x_k | N_{0:k}) for neuron c, c = 1, ..., C, evaluated at x_k, the animal's position at kΔ, and given N_{0:k}, the spiking history in (0, kΔ]. For small Δ, and under the assumption that the spiking activity of the neurons in the ensemble is conditionally independent, the joint probability density, or likelihood function, of the spike train observations in the interval ((k − 1)Δ, kΔ] is (Brown et al., 1998; Smith & Brown, 2003)

$$p(N_{k-1:k} \mid x_k, N_{0:k}) = \prod_{c=1}^{C} p(N^c_{k-1:k} \mid x_k, N_{0:k}) = \exp\left\{ \sum_{c=1}^{C} N^c_{k-1:k} \log \lambda^c(x_k \mid N_{0:k}) - \sum_{c=1}^{C} \lambda^c(x_k \mid N_{0:k})\, \Delta \right\}. \tag{A.2}$$
The probability model in equation A.2 depends on the parameter $\hat{\theta}^c$ estimated in the encoding analysis for each neuron. We have omitted this parameter to simplify the notation.
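As a concrete reading of equation A.2, the sketch below evaluates the ensemble log likelihood of one bin of spike counts over a grid of candidate positions. It is our illustration only: gaussian place-field intensities stand in for the fitted Zernike (or spatial gaussian) models, whose parameters are not given here, and all numerical values are made up.

```python
import numpy as np

def log_likelihood_A2(counts, positions, centers, widths, max_rates, dt):
    """Log of equation A.2 for an inhomogeneous Poisson ensemble.

    counts    : (C,) spike counts N^c_{k-1:k} in a bin of width dt (sec)
    positions : (M, 2) candidate positions x_k
    centers, widths, max_rates : (C, 2), (C,), (C,) place-field parameters
    Returns the (M,) log likelihood over the position grid.
    """
    # lam[m, c] = intensity of neuron c at candidate position m
    d2 = ((positions[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    lam = max_rates[None, :] * np.exp(-d2 / (2.0 * widths[None, :] ** 2))
    lam = np.maximum(lam, 1e-12)  # guard the log at far positions
    # sum_c [ N^c log(lam^c) - lam^c * dt ]  for each candidate position
    return (counts[None, :] * np.log(lam) - lam * dt).sum(axis=1)

# Illustrative example: 3 neurons, a 33 msec bin, 3 candidate positions.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
widths = np.array([5.0, 5.0, 5.0])
max_rates = np.array([20.0, 20.0, 20.0])  # spikes/sec at field center
counts = np.array([1, 0, 0])              # only neuron 1 spiked
grid = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]])
ll = log_likelihood_A2(counts, grid, centers, widths, max_rates, 0.033)
```

The likelihood is highest at the field center of the neuron that spiked, while the −λΔ terms penalize positions where silent neurons should have fired.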
To evaluate the system in equations 2.8 and 2.9, we apply the maximum a posteriori derivation of the Kalman filter and approximate the p(x_k | N_{0:k}) as gaussian probability densities by recursively computing their means (modes) and the negative inverses of their Hessian matrices (covariance matrices) (Mendel, 1995; Brown et al., 1998). To initiate our construction of the recursive algorithm, we assume that the mean x_{k−1|k−1} and covariance matrix W_{k−1|k−1} have been estimated at time (k − 1)Δ. That is, we take p(x_{k−1} | N_{0:k−1}), the posterior probability density at time (k − 1)Δ, to be the gaussian probability density with mean x_{k−1|k−1} and covariance matrix W_{k−1|k−1}. The next step is to compute p(x_k | N_{0:k−1}), the one-step prediction probability density at time kΔ, that is, the probability density of the predicted position at kΔ given the spiking activity in (0, (k − 1)Δ]. To do this, we must evaluate equation 2.9. By equation 2.7, the state model p(x_k | x_{k−1}) is the gaussian probability density with mean μ_x + F x_{k−1} and covariance matrix R W_ε. Because both p(x_{k−1} | N_{0:k−1}) and p(x_k | x_{k−1}) are gaussian probability densities and x_{k−1} enters linearly into the mean of both, p(x_k | N_{0:k−1}) is also a gaussian probability density, and it suffices to compute the mean x_{k|k−1} and covariance matrix W_{k|k−1} in order to define it. It follows from standard properties of integrals of gaussian functions that the mean and covariance matrix are, respectively,

$$x_{k|k-1} = E(x_k \mid N_{0:k-1}) = \mu_x + F x_{k-1|k-1} \tag{A.3}$$

$$W_{k|k-1} = \mathrm{Var}(x_k \mid N_{0:k-1}) = F W_{k-1|k-1} F' + R W_{\varepsilon}. \tag{A.4}$$

Equations A.3 and A.4 are, respectively, the one-step prediction estimate and the one-step prediction variance, or learning rate, in equations 2.10 and 2.11 of the algorithm. The probability density defined by equations A.3 and A.4 is

$$p(x_k \mid N_{0:k-1}) = (2\pi)^{-p/2}\, |W_{k|k-1}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x_k - x_{k|k-1})' W_{k|k-1}^{-1} (x_k - x_{k|k-1}) \right\}. \tag{A.5}$$
Substituting equations A.2 and A.5 into equation 2.8 and neglecting the denominator and other terms that do not depend on position, we obtain an explicit expression for the probability density of position at time kΔ given the spiking activity in (0, kΔ]:

$$p(x_k \mid N_{0:k}) \propto \exp\left\{ -\tfrac{1}{2} (x_k - x_{k|k-1})' W_{k|k-1}^{-1} (x_k - x_{k|k-1}) \right\} \times \exp\left\{ \sum_{c=1}^{C} N^c_{k-1:k} \log \lambda^c(x_k \mid N_{0:k}) - \sum_{c=1}^{C} \lambda^c(x_k \mid N_{0:k})\, \Delta \right\}. \tag{A.6}$$
To evaluate equation 2.8, or equivalently equation A.6, we derive a gaussian approximation to it by constructing a quadratic approximation to log p(x_k | N_{0:k}). This corresponds to taking the first two terms in the Taylor series expansion of log p(x_k | N_{0:k}) (Tanner, 1996; Brown et al., 1998). From equation A.6, log p(x_k | N_{0:k}) is

$$\log p(x_k \mid N_{0:k}) \propto -\tfrac{1}{2} (x_k - x_{k|k-1})' W_{k|k-1}^{-1} (x_k - x_{k|k-1}) + \sum_{c=1}^{C} N^c_{k-1:k} \log \lambda^c(x_k \mid N_{0:k}) - \sum_{c=1}^{C} \lambda^c(x_k \mid N_{0:k})\, \Delta. \tag{A.7}$$
The gradient and Hessian matrix required for the quadratic approximation are, respectively,

$$\nabla \log p(x_k \mid N_{0:k}) = -W_{k|k-1}^{-1} (x_k - x_{k|k-1}) + \sum_{c=1}^{C} \nabla \log \lambda^c(x_k \mid N_{0:k}) \left[ N^c_{k-1:k} - \lambda^c(x_k \mid N_{0:k})\, \Delta \right] \tag{A.8}$$

$$\nabla^2 \log p(x_k \mid N_{0:k}) = -W_{k|k-1}^{-1} + \sum_{c=1}^{C} \left\{ \nabla^2 \log \lambda^c(x_k \mid N_{0:k}) \left[ N^c_{k-1:k} - \lambda^c(x_k \mid N_{0:k})\, \Delta \right] - \nabla \log \lambda^c(x_k \mid N_{0:k}) \left[ \nabla \lambda^c(x_k \mid N_{0:k})\, \Delta \right]' \right\}. \tag{A.9}$$
Solving for x_k in equation A.8 yields the posterior mode in equation 2.12, and because under the gaussian approximation the covariance matrix is the negative inverse of the Hessian (Tanner, 1996; Brown et al., 1998), equation A.9 yields equation 2.13. Equations A.8 and A.9 generalize equations A.12 and A.13 in Brown et al. (1998) to an arbitrary point process model.

Acknowledgments

Support was provided in part by NIH grants MH59733, MH61637, MH65018, and DA015644; NSF grant 0081548; and grants from DARPA and ONR. We are grateful to Jonathan Victor for helpful discussions and to an anonymous referee whose suggestions helped significantly improve the work presented in this article.

References

Barbieri, R., Frank, L. M., Quirk, M. C., Wilson, M. A., & Brown, E. N. (2002). Construction and analysis of non-gaussian place field models of neural spiking activity. Neurocomputing, 44–46, 309–314.
Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (2001). Construction and analysis of non-Poisson stimulus-response models of neural spiking activity. J. Neurosci. Methods, 105, 25–37.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Born, M., & Wolf, E. (1989). Principles of optics (6th ed.). New York: Pergamon.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time series analysis, forecasting and control (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Brown, E. N., Barbieri, R., Eden, U. T., & Frank, L. M. (2003). Likelihood methods for neural data analysis. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach (pp. 253–286). London: Chapman and Hall.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Comp., 14(2), 325–346.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.
Brown, E. N., Nguyen, D. P., Frank, L. M., Wilson, M. A., & Solo, V. (2001). An analysis of neural receptive field plasticity by point process adaptive filtering. Proceedings of the National Academy of Sciences, 98, 12261–12266.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. L. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neurosci., 7, 664–670.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Daley, D., & Vere-Jones, D. (2003). An introduction to the theory of point processes (2nd ed.). New York: Springer-Verlag.
Dan, Y., Alonso, J. M., Usrey, W. M., & Reid, R. C. (1998).
Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neuroscience, 1(6), 501–507.
Donoghue, J. P. (2002). Connecting cortex to machines: Recent advances in brain interfaces. Nature Neurosci. Suppl., 5, 1085–1088.
Doucet, A., DeFreitas, N., Gordon, N., & Smith, A. (2001). Sequential Monte Carlo methods in practice. New York: Springer-Verlag.
Fahrmeir, L., & Tutz, G. (2001). Multivariate statistical modelling based on generalized linear models (2nd ed.). New York: Springer-Verlag.
Frank, L. M., Stanley, G. B., & Brown, E. N. (2003, November). New places and new representations: Place field plasticity in the hippocampus. Paper presented at the Society for Neuroscience conference, New Orleans, LA.
Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., & Donoghue, J. (2002). Probabilistic inference of hand motion from neural activity in motor cortex. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Genz, A., & Kass, R. E. (1994). Subregion-adaptive integration of functions having a dominant peak. J. Comp. Graph. Stat., 6, 92–11.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Jensen, O., & Lisman, J. E. (2000). Position reconstruction from an ensemble of hippocampal place cells: Contribution of theta phase coding. J. Neurophysiol., 83, 2602–2609.
Johnson, D. H., Gruner, C. M., Baggerly, K., & Shehagiri, C. (2001). Information-theoretic analysis of neural coding. J. Comp. Neurosci., 10, 47–69.
Kitagawa, G., & Gersh, W. (1996). Smoothness priors analysis of time series. New York: Springer-Verlag.
Mehta, M. R., Lee, A. L., & Wilson, M. A. (2002). Role of experience and oscillations in transforming a rate code into a temporal code. Nature, 417(6), 741–746.
Mendel, J. M. (1995). Lessons in estimation theory for signal processing, communications, and control. Englewood Cliffs, NJ: Prentice Hall.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34, 171–175.
O'Keefe, J., & Nadel, N. (1978). The hippocampus as a cognitive map. New York: Oxford University Press.
O'Keefe, J., & Reece, M. L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330.
Pawitan, Y. (2001). In all likelihood: Statistical modelling and inference using likelihood. New York: Oxford University Press.
Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nature Reviews Neurosci., 1, 125–132.
Priestley, M. B. (1981). Spectral analysis and time series. Orlando, FL: Academic Press.
Quirk, M. C., & Wilson, M. A. (1999). Interaction between spike waveform classification and temporal sequence detection. J. Neurosci. Methods, 94, 41–52.
Reich, D. S., Melcher, F., & Victor, J. D. (2001).
Independent and redundant information in nearby cortical neurons. Science, 294, 2566–2568.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416, 141–142.
Shoham, S. (2001). Advances towards an implantable motor cortical interface. Unpublished doctoral dissertation, University of Utah.
Skaggs, W. E., McNaughton, B. L., Gothard, K., & Marcus, E. (1993). An information-theoretic approach to deciphering the hippocampal code. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing, 5 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.
Skaggs, W. E., McNaughton, B. L., Wilson, M. A., & Barnes, C. A. (1996). Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus, 6, 149–172.
Smith, A. C., & Brown, E. N. (2003). Estimating a state-space model from point process observations. Neural Comp., 15, 965–991.
Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. J. Neurosci., 19, 8036–8042.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Let., 80, 197–200.
Tanner, M. A. (1996). Tools for statistical inference. New York: Springer-Verlag.
Taylor, D. M., Tillery, S. I. H., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296, 1829–1832.
Twum-Danso, N. T. (1997). Estimation, information and neural signals. Unpublished doctoral dissertation, Harvard University, Cambridge, MA.
Victor, J. D. (2002). Binless strategies for estimation of information from neural data. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 66(5–1), 051903.
Warland, D. K., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. J. Neurophysiol., 78, 2336–2350.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408, 361–365.
Wickelgren, I. (2003). Tapping the mind. Science, 299, 496–499.
Wiener, M. C., & Richmond, B. J. (2003). Decoding spike trains instant by instant using order statistics and the mixture-of-Poissons model. J. Neurosci., 23, 2394–2406.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T.
J. (1998). Interpreting neuronal population activity by reconstruction: unied framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044. Received April 2, 2003; accepted July 30, 2003.
LETTER
Communicated by Eytan Ruppin
A Network Model of Perceptual Suppression Induced by Transcranial Magnetic Stimulation Yoichi Miyawaki yoichi [email protected] RIKEN Brain Science Institute, Saitama 351-0198, Japan
Masato Okada [email protected] RIKEN Brain Science Institute, Saitama 351-0198, Japan; “Intelligent Cooperation and Control,” PRESTO, JST, c/o RIKEN Brain Science Institute, Saitama 351-0198, Japan
We modeled the inhibitory effects of transcranial magnetic stimulation (TMS) on a neural population. TMS is a noninvasive technique with high temporal resolution that can stimulate the brain via a brief magnetic pulse from a coil placed on the scalp. Because of these advantages, TMS is extensively used as a powerful tool in experimental studies of motor, perceptual, and other functions in humans. However, the mechanisms by which TMS interferes with neural activity, especially from a theoretical standpoint, remain largely unknown. In this study, we focused on the temporal properties of TMS-induced perceptual suppression, and we computationally analyzed the response of a simple network model of a sensory feature detector system to a TMS-like perturbation. The perturbation caused the mean activity to increase transiently and then decrease for a long period, accompanied by a loss in the degree of activity localization. When the afferent input consisted of two phases, a strong transient component followed by a weak sustained component, there was a critical latency period of the perturbation during which the network activity was completely suppressed and converged to the resting state. The suppressive period widened as the afferent input intensity decreased, reaching more than 10 times the time constant of a neuron. These results agree well with typical experimental data for occipital TMS and support the conclusion that dynamical interactions in a neural population play an important role in TMS-induced perceptual suppression.
Neural Computation 16, 309–331 (2004)
© 2003 Massachusetts Institute of Technology

1 Introduction

TMS can noninvasively stimulate the human brain with high temporal accuracy via a brief magnetic pulse produced by a coil placed on the scalp (Barker, Jalinous, & Freeston, 1985). Due to these outstanding characteristics, TMS has been widely used in experimental research on human brain functions such as motor control, memory, and perception. Most TMS studies have reported inhibitory effects on these functions, although positive percepts such as phosphenes (Marg & Rudiak, 1994; Meyer, Diehl, Steinmetz, Britton, & Benecke, 1991) and unmasking (Amassian et al., 1993) also occur. For example, TMS applied to the motor cortex suppresses the surface electromyogram (EMG) of an upper limb (Hallett, 1995; Inghilleri, Berardelli, Cruccu, & Manfredi, 1993), and TMS applied to the visual cortex causes a visual field deficit contralateral to the stimulated site (Amassian et al., 1989; Kamitani & Shimojo, 1999; Kastner, Demmer, & Ziemann, 1998). Such TMS studies have strong advantages over classical lesion studies of patients, because brain functions can be interfered with noninvasively and with temporal precision: TMS enables us to create a temporary, virtual lesion in a healthy brain (Hilgetag, Theoret, & Pascual-Leone, 2001; Walsh & Cowey, 2000). Although these experimental outcomes are clear, the most puzzling question is how TMS interferes with neural activity and suppresses these functions. The neural substrates underlying the inhibitory effects, especially their theoretical aspects, are still largely unknown. To date, theoretical and computational studies have mostly taken a biophysical approach to uncovering the neural substrates of magnetic stimulation. These studies are motivated by similarities to electrical stimulation models of the peripheral nerve (Rattay, 1986), and they have demonstrated spike generation by an induced electric field on a long, straight axon (Roth & Basser, 1990; Basser & Roth, 2000). Nagarajan, Durand, and Warman (1993) demonstrated spike generation in a finite cable model, which is more suitable for the case of cortical stimulation. These studies focused on the excitatory effects of magnetic stimulation, yet TMS typically exhibits inhibitory effects.
Kamitani, Bhalodi, Kubota, and Shimojo (2001), on the other hand, demonstrated inhibitory effects of magnetic stimulation in a biophysical model. They applied the equivalent current induced by a single magnetic pulse to a realistic compartmental model of a reconstructed 3D cortical neuron, producing an initial burst of firing followed by a silent period of a duration comparable to that shown by experimental data. Several experimental studies suggest the involvement of the intracortical network in TMS-induced neural interference. Priori, Berardelli, Mercuri, Inghilleri, and Manfredi (1995) demonstrated that hyperventilation, which reduces extracellular Ca2+ and decreases synaptic transmission, shortened the duration of the suppression of the surface EMG induced by TMS of the motor cortex. Other researchers reported that administration of GABAergic drugs could modulate the amplitude of the motor-evoked potential of an upper limb induced by TMS of the motor cortex (Ziemann, Lonnecker, Steinhoff, & Paulus, 1996a, 1996b; Inghilleri, Berardelli, Marchetti, & Manfredi, 1996). These results suggest that TMS-induced suppression is transsynaptically mediated by the GABAergic inhibitory network. In addition to this experimental evidence, we have to take into account that the spatial extent of an
electric field induced by TMS is considerably large compared with the size of a single cortical neuron, suggesting that rather than stimulating a specific single neuron, TMS simultaneously stimulates a large number of interconnected neurons in the target area under the coil. This evidence indicates that transsynaptic interaction at the neural population level might be deeply involved in the suppression processes induced by TMS. Another key issue suggesting the involvement of a neural population in TMS-induced suppression is the relationship between the timing of TMS and the degree of suppression. Most experiments on occipital TMS, other than repetitive stimulation (rTMS), have demonstrated that there is a critical latency period during which TMS produces a suppressive effect on visual perception. The most suppressive timing depends on the experimental conditions, such as the stimulated site (Hotson & Anand, 1999; Beckers & Zeki, 1995; Beckers & Homberg, 1992) and the contrast of the presented visual stimulus (Kammer & Nusseck, 1998; Masur, Papke, & Oberwittler, 1993; Miller, Fendrich, Eliassen, Demirel, & Gazzaniga, 1996). Under proper conditions, induction timing throughout a range of over 100 ms around the optimal timing is still effective for suppression (see Figure 1; Kamitani & Shimojo, 1999; Amassian et al., 1989; Corthout, Uttl, Walsh, Hallett, & Cowey, 1999; Kammer & Nusseck, 1998). These long-lasting effects cannot be reproduced by the membrane dynamics of a single neuron with a small time constant alone. Even if a slow membrane mechanism is incorporated, it seems difficult to explain the temporal variation in the degree of suppression. No experimental or theoretical study has reproduced the temporally selective suppression peculiar to TMS from single-neuron properties alone.
Instead, a plausible hypothesis is that these properties are determined by the temporal relationship between the induction timing of TMS and the macroscopic state of a neural population evolving toward the final excitation pattern given by an afferent input and synaptic interactions. To explain the temporal properties of TMS-induced perceptual suppression theoretically and computationally at the neural population level, we employ a simple neural network model of a sensory feature detector system, and we analyze its dynamical response to a TMS-like perturbation. We focus in particular on the optimal induction timing, the suppressive time range, and the effective duration for the perturbation to suppress the network activity.

2 Methods

2.1 Neural Population Model. We employed a simple analog neural network model (see equations 2.1–2.5) that has been well analyzed as a model of a sensory feature detector system. A schematic of the model's architecture is shown in Figure 2. The details of the model are described in the literature (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Hansel &
Figure 1: A temporal profile of the inhibitory effect typically observed in occipital TMS experiments. TMS delivered with a certain delay after the onset of a visual stimulus degrades perceptual performance. For example, the character recognition rate decreases (Amassian et al., 1989), motion discrimination is impaired (Beckers & Zeki, 1995; Matthews, Luber, Qian, & Lisanby, 2001), and part of the visual field is erased (Kamitani & Shimojo, 1999; Kastner et al., 1998). There is some variation in the optimal induction timing and the effective time range for suppression, but in most cases of single-pulse TMS applied near the occipital pole, a TMS pulse delayed by about 100 ms after visual stimulus presentation suppresses its percept most powerfully, and the effective time range covers about 100 ms around the optimal timing.
Sompolinsky, 1998).

$$\tau_m \frac{dm(\theta, t)}{dt} = -m(\theta, t) + g[h(\theta, t)] \qquad (2.1)$$

$$h(\theta, t) = \int_{-\pi/2}^{\pi/2} \frac{d\theta'}{\pi}\, J(\theta - \theta')\, m(\theta', t) + h_{\mathrm{ext}}(\theta, t) \qquad (2.2)$$

$$J(\theta - \theta') = -J_0 + J_2 \cos 2(\theta - \theta') \qquad (2.3)$$

$$h_{\mathrm{ext}}(\theta, t) = c(t)\,[1 - \varepsilon + \varepsilon \cos 2(\theta - \theta_0)]. \qquad (2.4)$$

Here, m(θ, t) is the output of neuron θ, and τ_m is a microscopic characteristic time analogous to the membrane time constant of a neuron. g[f] is a quasi-linear output function,

$$g[f] = \begin{cases} 0 & (f < T) \\ \beta (f - T) & (T \le f < T + 1/\beta) \\ 1 & (f \ge T + 1/\beta), \end{cases} \qquad (2.5)$$

where T is the threshold of the neuron and β is the gain factor. h(θ, t) denotes the input to neuron θ. The integration range on the right side of equation 2.2 is generally defined by the extent of the connections of neuron θ. Here, for simplicity, we assume that m(θ, t) has a periodic boundary
Figure 2: Network architecture and TMS induction. In this model, sharp tuning and a local excitation emerge from uniform inhibition and feature-specific interactions in the network, even if the afferent input is almost isotropic. The model can be considered a model of a sensory feature detector system and, in particular, is widely accepted as a model of the orientation tuning mechanism in V1 (Ben-Yishai et al., 1995; Sompolinsky & Shapley, 1997; Somers et al., 1995; Adorjan et al., 1999; Ringach, Hawken, & Shapley, 1997; Bredfelt & Ringach, 2002). A TMS-like perturbation is assumed to be delivered uniformly to all neurons, because the spatial extent of the neural population in the local network is significantly small compared with the spatial gradient of the electric field induced by the TMS coil.
condition (−π/2 ≤ θ ≤ π/2). The connections of each neuron are limited to this periodic range. The afferent input, h_ext(θ, t), has its maximal amplitude c(t) at θ = θ_0. ε is an afferent tuning coefficient, which describes how strongly the afferent input to the target population is already localized around θ_0 (0 ≤ ε ≤ 1/2). Here, we consider the condition in which the center of the afferent input is constant during stimulation (θ_0(t) = const.). A difference in θ_0 shifts the activity profile laterally but does not affect the essence of the results; hence, we set θ_0 = 0 for simplicity. J(θ − θ′) is the synaptic weight from neuron θ′ to neuron θ, which consists of two parts: the uniform inhibition J_0 and the feature-specific interaction J_2. J_0 sharpens the activity profile of the network by raising an effective threshold through all-to-all inhibition. J_2 sharpens the activity profile in a feature-specific manner, in which the activities of neighboring neurons in feature space are mutually enhanced and those of distant ones are suppressed. As long as neurons interact in this manner, we can regard θ as a neuron responding to a specific feature in an arbitrary one-dimensional feature space. The most intuitive and widely accepted example represented by this
model is the case of the orientation tuning mechanism in the primary visual cortex (Ben-Yishai et al., 1995; Hansel & Sompolinsky, 1998; Sompolinsky & Shapley, 1997; Somers, Nelson, & Sur, 1995; Adorjan, Levitt, Lund, & Obermayer, 1999; Ringach, Hawken, & Shapley, 1997; Bredfelt & Ringach, 2002; Carandini & Ringach, 1997). Assuming that the coded feature is the orientation of a stimulus, we can regard θ as a neuron responding to angle θ, h_ext as an input from the lateral geniculate nucleus (LGN), and J as a recurrent interaction in the primary visual cortex (V1).

2.2 Dynamics of the Order Parameters. Assuming all neurons are inactive in the initial state (m(θ, 0) = 0), the dynamics of m(θ, t) can be fully described by the zeroth- and second-order Fourier coefficients of m(θ, t), because the synaptic input and the afferent input have only zeroth- and second-order Fourier components. The zeroth and second Fourier coefficients of m(θ, t) are thus the order parameters of this network model. The equations for the order parameter dynamics are derived from the Fourier transformation of equation 2.1,

$$\tau_m \frac{dm_0(t)}{dt} = -m_0(t) + \int_{-\pi/2}^{\pi/2} \frac{d\theta}{\pi}\, g[h(\theta, t)] \qquad (2.6)$$

$$\tau_m \frac{dm_{2\cos}(t)}{dt} = -m_{2\cos}(t) + \int_{-\pi/2}^{\pi/2} \frac{d\theta}{\pi}\, g[h(\theta, t)] \cos 2\theta, \qquad (2.7)$$

where

$$m_0(t) = \int_{-\pi/2}^{\pi/2} \frac{d\theta}{\pi}\, m(\theta, t) \qquad (2.8)$$

$$m_{2\cos}(t) = \int_{-\pi/2}^{\pi/2} \frac{d\theta}{\pi}\, m(\theta, t) \cos 2\theta. \qquad (2.9)$$

m_0, the zeroth-order Fourier coefficient, represents the mean activity of the neurons averaged over the whole network. m_2cos, the cosine term of the second-order Fourier coefficient, represents the degree of modulation in the activity profile of the network. The sine term of the second-order Fourier coefficient is always 0 because θ_0(t) = 0 and m is an even function; we can hence omit it. The variables m_0 and m_2cos are sufficient to describe the network state. h(θ, t) can also be described in terms of these Fourier coefficients,

$$h(\theta, t) = -J_0 m_0(t) + c(t)(1 - \varepsilon) + (\varepsilon c(t) + J_2 m_{2\cos}(t)) \cos 2\theta, \qquad (2.10)$$
and the integral terms of equations 2.6 and 2.7 can be written as combinations over the subthreshold, linear, and saturated regions, $G(\beta(h - T); \theta_c) - G(\beta(h - T); \theta_u) + G(1; \theta_u)$ and $G(\beta(h - T)\cos 2\theta; \theta_c) - G(\beta(h - T)\cos 2\theta; \theta_u) + G(\cos 2\theta; \theta_u)$, where $G(f; \alpha) = \frac{1}{\pi}\int_{-\alpha}^{\alpha} f\, d\theta$, $h(\theta_c, t) = T$, and $h(\theta_u, t) = T + 1/\beta$. We can thus derive self-consistent equations for the order parameters:

$$M_0 = \frac{\beta}{\pi}\left[\frac{2\Theta_u}{\beta} + (\varepsilon C + J_2 M_{2\cos})\left\{\sin 2\Theta_c - \sin 2\Theta_u - 2\cos 2\Theta_c\,(\Theta_c - \Theta_u)\right\}\right] \qquad (2.11)$$

$$M_{2\cos} = \frac{\beta}{\pi}\left[\frac{\sin 2\Theta_u}{\beta} + (\varepsilon C + J_2 M_{2\cos})\left\{(\Theta_c - \Theta_u) + \frac{1}{4}(\sin 4\Theta_c - \sin 4\Theta_u) - \cos 2\Theta_c\,(\sin 2\Theta_c - \sin 2\Theta_u)\right\}\right] \qquad (2.12)$$

$$T = -J_0 M_0 + C(1 - \varepsilon) + (\varepsilon C + J_2 M_{2\cos}) \cos 2\Theta_c \qquad (2.13)$$

$$T + 1/\beta = -J_0 M_0 + C(1 - \varepsilon) + (\varepsilon C + J_2 M_{2\cos}) \cos 2\Theta_u, \qquad (2.14)$$
where the capitalized quantities M_0, M_2cos, Θ_c, Θ_u, and C denote the equilibrium states (t → ∞) of the corresponding parameters. The network dynamics and the equilibrium state are obtained numerically from these equations.

2.3 TMS Induction. Previous simulation studies using compartmental models demonstrated that the effect of TMS on a single neuron can be represented by the collection of transmembrane currents in each compartment (Kamitani et al., 2001; Nagarajan et al., 1993). Hence, the TMS term can be incorporated at the same level as the synaptic input. The spatial extent of the neural population dealt with here is considered to be so small that the induced electric field is almost uniform over such a limited area. We therefore assume that a TMS-induced perturbation is constant across all neurons (see Figure 2), and we simply modify the input function into

$$\hat{h}(\theta, t) = h(\theta, t) + I_{\mathrm{TMS}}(t). \qquad (2.15)$$
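The full model (equations 2.1–2.5 with the TMS term of equation 2.15) can be sketched numerically. The following Python fragment is a minimal illustration, not the authors' code: the grid size N, the Euler step dt, and the convention τ_m = 1 (time measured in units of the membrane time constant) are our own discretization choices, while the parameter values follow Figures 3 and 4.

```python
import numpy as np

# Discretize theta on [-pi/2, pi/2) with N neurons; parameters from Figs. 3-4.
N, dt = 128, 0.01
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)
J0, J2, beta, T, eps = 73.0, 110.0, 0.25, 1.0, 0.1
J = -J0 + J2 * np.cos(2.0 * (theta[:, None] - theta[None, :]))  # eq. 2.3

def g(f):
    """Quasi-linear gain of equation 2.5."""
    return np.clip(beta * (f - T), 0.0, 1.0)

def c_afferent(t, A_t=2.0, A_s=0.3, D_t=4.0):
    """Rectangular afferent amplitude: transient then sustained component."""
    if t < 0.0:
        return 0.0
    return A_t if t < D_t else A_s

def simulate(t_end, soa=None, I_tms=15.0, D_tms=0.1):
    """Euler-integrate eq. 2.1; returns times and the m(theta, t) history."""
    m = np.zeros(N)
    ts = np.arange(0.0, t_end, dt)
    hist = np.empty((len(ts), N))
    for i, t in enumerate(ts):
        h_ext = c_afferent(t) * (1.0 - eps + eps * np.cos(2.0 * theta))  # eq. 2.4
        h = J @ m / N + h_ext                                            # eq. 2.2
        if soa is not None and soa <= t < soa + D_tms:
            h = h + I_tms                                                # eq. 2.15
        m = m + dt * (-m + g(h))                                         # eq. 2.1
        hist[i] = m
    return ts, hist

ts, hist = simulate(t_end=8.0)   # control run, no TMS-like perturbation
m0 = hist.mean(axis=1)           # mean-activity order parameter, eq. 2.8
print(m0.max())
```

Because the (1/π)∫ dθ integrals become simple means over the N grid points, the order parameters of equations 2.8 and 2.9 are one-line reductions of the stored activity profile.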
The order parameter equations are modified accordingly by replacing h with ĥ. Here we employ a simple rectangular input of amplitude I_TMS and duration D_TMS as the TMS-like perturbation.

3 Results

3.1 Bistability and Afferent Input. To evaluate whether a TMS-like perturbation is suppressive, we examine the equilibrium state of the network. However, if the intensity of the afferent input is always kept constant above threshold, there is only one attractor in the network dynamics we deal with
here. In such a case, the final state of the network is uniquely determined by the intensity of the afferent input and is not influenced by a perturbation of any amplitude, however large. To obtain a dependence on the perturbation conditions, we adopt a parameter set of synaptic weights J_0, J_2 and a gain β that operates the network in a bistable mode. Under this condition, the self-consistent equations admit two stable solutions as equilibrium states: the excitatory state (nonzero attractor) and the resting state (zero attractor). Figure 3A is the phase diagram illustrating the boundary between the monostable and bistable regimes. In the bistable regime, below the thick line for a particular β, the network can maintain the excitatory state even if the afferent input decreases below threshold (M_0, M_2cos ≠ 0 for C < T). There is another important boundary (thin line) below which the network can stay in the excitatory state even if the afferent input has completely vanished (M_0, M_2cos ≠ 0 for C = 0). We assume a sensory feature detector model, in particular a model of orientation tuning in the primary visual cortex, so the network should return to the resting state when the afferent signal is terminated. These two boundaries shift depending on the gain β. We therefore choose a parameter set of the synaptic weights J_0 and J_2 that lies between the two boundaries for the particular β. An example of the bistability is shown in Figure 3B. The network holds two equilibrium values for the same afferent input in the hatched region of the figure. The upper line corresponds to the excitatory state, in which the network has a local excitation (M_0, M_2cos ≠ 0), and the lower line corresponds to the resting state, in which all neurons are inactive (M_0, M_2cos = 0). The minimum amplitude of afferent input exhibiting bistability is bounded away from zero, and the network goes to the resting state if the afferent input is terminated.
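The zero-attractor half of this bistability is easy to check numerically: with the subthreshold sustained component alone (A_s = 0.3 < T = 1) and no preceding suprathreshold transient, the network never leaves the resting state. A minimal sketch, not the authors' code (the grid size and step are our own choices; parameters follow Figure 3):

```python
import numpy as np

N, dt = 128, 0.01
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)
J0, J2, beta, T, eps = 73.0, 110.0, 0.25, 1.0, 0.1
J = -J0 + J2 * np.cos(2.0 * (theta[:, None] - theta[None, :]))

def g(f):
    return np.clip(beta * (f - T), 0.0, 1.0)

m = np.zeros(N)
for _ in range(int(round(10.0 / dt))):      # 10 tau_m of simulated time
    # subthreshold input only: h never exceeds A_s = 0.3 < T, so g stays 0
    h = J @ m / N + 0.3 * (1.0 - eps + eps * np.cos(2.0 * theta))
    m = m + dt * (-m + g(h))

print(m.max())   # remains exactly 0: the subthreshold input cannot ignite the bump
```

Starting instead from the excitatory state reached after a suprathreshold transient traces out the upper branch of the hysteresis loop in Figure 3B.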
The network converges to either attractor depending on the condition of the TMS-like perturbation. Another parameter, the afferent tuning coefficient ε, has a strong effect if there are no synaptic interactions and the network activity is determined only by the afferent projection. Here, however, we consider the network model with synaptic interactions, which dominate the properties of the network activity. In this case, a difference in the afferent tuning coefficient did not have a significant impact on the phase diagram or the overall network behavior (see also Ben-Yishai et al., 1995). A typical example of an afferent input that leads the network into the bistable regime is one consisting of a suprathreshold transient component and a subthreshold sustained component. This kind of time course is a common, natural feature of neural signals observed in the LGN, visual cortex, and other brain areas (e.g., Schmolesky et al., 1998). We employ the simplest such input, with a rectangular shape, consisting of a transient component (amplitude A_t > T, duration D_t) and a sustained component (amplitude A_s < T). An example of this afferent input is shown at the bottom of Figure 4. The duration of the transient component, D_t, was fixed at 4τ_m,
Figure 3: (A) Phase diagram of the network model. The thick line indicates the boundary below which the network exhibits bistability. The thin line indicates the boundary below which the network can hold the excitatory state even if the afferent input is terminated. These boundaries shift depending on the gain β (β_0 = 0.25). (B) An example of the bistability of the network, showing hysteresis for an afferent input. The dotted line indicates the final state of the network starting from the initial state m_0(0) = m_2cos(0) = 0, and the solid line indicates the final state of the network starting after the suprathreshold afferent input is given (thin line: M_0; thick line: M_2cos). An afferent input consisting of a suprathreshold transient component followed by a subthreshold sustained component leads the network into bistability with the zero and nonzero attractors (the hatched region). The network parameters used here are ε = 0.1, β = 0.25, J_0 = 73, J_2 = 110, and T = 1.
Figure 4: The time course of the network response to a brief pulsed perturbation (thick lines) and the response in the control condition without perturbation (thin lines). m_0 (solid line) transiently increases and then decreases for a relatively long period, accompanied by a reduction in m_2cos (dotted line). The network activity persists if the perturbation is applied far from the afferent input onset, whereas the network activity is completely suppressed and the order parameters converge to zero if the perturbation is applied close to the afferent input onset. The parameters used here are A_t = 2, A_s = 0.3, D_t = 4τ_m, I_TMS = 15, and D_TMS = 0.1τ_m. The other parameters are the same as in Figure 3.
corresponding to 40 ms if τ_m is assumed to be 10 ms, that is, within a physiologically valid range compared with the duration of the transient component of neural signals in the early visual cortex (e.g., Schmolesky et al., 1998). The amplitudes of the transient and sustained components, A_t and A_s, were varied to examine the relationship between the intensity of the afferent input and the degree of TMS-induced suppression.

3.2 Network Dynamics and Critical Timing. The time courses of the order parameters are illustrated in Figure 4. The transition of the network state can also be depicted as a trajectory in the m_0–m_2cos plane (see Figure 5), in which the relationship between the network state and the attractors is easily grasped. Without the perturbation, the network state moves toward the attractor generated temporarily by the suprathreshold transient component (0 < t < D_t). It then turns back toward, and finally reaches, the relatively weak attractor generated by the subthreshold sustained component (the thin lines in Figure 4 and the top left graph of Figure 5). When the perturbation is applied, the network state deviates from the trajectory of the control condition (the thick lines in Figures 4 and 5). Here, we show five examples of typical stimulus onset asynchrony (SOA), the interval between the perturbation and the onset of the afferent input. Regardless of the induction timing of the perturbation, the mean activity of the network increases transiently and then enters a relatively long-lasting suppressive phase. This response property is clearly observable if the network is perturbed after it has almost converged to the equilibrium state and a local excitation has built up (e.g., SOA = 15 in Figure 4).
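The protocol behind Figures 4 and 5 can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the grid and step size are our own choices, and whether a given run ends at the nonzero or the zero attractor depends on the SOA, as described above. One property holds by construction: a strong uniform pulse pushes every neuron's input up by I_TMS, so the mean activity m_0 rises during the pulse before the long-lasting suppressive phase.

```python
import numpy as np

N, dt = 128, 0.01
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)
J0, J2, beta, T, eps = 73.0, 110.0, 0.25, 1.0, 0.1
J = -J0 + J2 * np.cos(2.0 * (theta[:, None] - theta[None, :]))
g = lambda f: np.clip(beta * (f - T), 0.0, 1.0)

def run(soa, t_end=30.0, I_tms=15.0, D_tms=0.1):
    """Return the time courses of the order parameters m0(t) and m2cos(t)."""
    m = np.zeros(N)
    m0_t, m2_t = [], []
    for i in range(int(round(t_end / dt))):
        t = i * dt
        c = 2.0 if t < 4.0 else 0.3                    # A_t = 2 for 4 tau_m, then A_s = 0.3
        h = J @ m / N + c * (1.0 - eps + eps * np.cos(2.0 * theta))
        if soa <= t < soa + D_tms:
            h = h + I_tms                              # uniform TMS-like perturbation
        m = m + dt * (-m + g(h))
        m0_t.append(m.mean())                          # eq. 2.8
        m2_t.append((m * np.cos(2.0 * theta)).mean())  # eq. 2.9
    return np.array(m0_t), np.array(m2_t)

m0_early, _ = run(soa=0.0)    # pulse at the afferent input onset
m0_late, _ = run(soa=15.0)    # pulse long after the local excitation has formed
```

Plotting m0 against m2cos for each SOA reproduces the kind of trajectory portrait shown in Figure 5.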
The activity modulation of the network (m_2cos) does not receive a direct impact from the perturbation if the activity is constant across the network, because there are then no second-order Fourier components and the perturbation is balanced (e.g., SOA = −1 and −2 in Figure 4). If the network has a local excitation, m_2cos is decreased by the perturbation: the neurons in the inactive part of the network are equally activated by the perturbation, so the inhibitory input via the lateral inhibitory connections from the inactive part to the active part is also enhanced, and the local excitation is consequently decreased and broadened (see also section 4.2). Thus, the mean activity and the degree of activity modulation (in other words, the feature selectivity) are temporarily lost because of the perturbation, regardless of the induction timing. In this parameter setting, the duration of this transient deviation in the trajectory of the network state reaches about 10 times the time constant τ_m. The final state of the network critically depends on the induction timing of the perturbation. If the perturbation is applied temporally close to the onset of the afferent input, the network is suppressed and cannot reach the excitatory state (nonzero attractor), and it finally drops to the resting state (zero attractor). If the perturbation is delivered temporally far from
Figure 5: The trajectory of the network state in the m_0–m_2cos plane (solid line for each SOA condition; dotted line for the control condition without perturbation). The network converges to either of the two attractors (gray disks), depending on the induction timing of the perturbation (thick arrows). If the perturbation is applied close to the afferent input arrival, the network cannot reach the excitatory state (nonzero attractor) because of the trajectory deviation, and it finally returns to the resting state (zero attractor). Otherwise, the network retains a local excitation, even though it is perturbed and transiently deviates from the ordinary trajectory. The thin arrows indicate the direction in which the network state moves. The parameters used here are the same as in Figure 4.
the onset of the afferent input, the network transiently deviates from the original trajectory, but it still maintains its activity and finally converges again to the excitatory state. These results indicate that the perturbation can completely suppress the network activity only within a limited SOA range. Furthermore, it is easily imagined that susceptibility to the perturbation is not uniform throughout the suppressive period and that there is a best timing at which the perturbation is most effective for suppression. These properties can be measured by the minimum intensity of the perturbation needed for suppression at each SOA. In Figure 6A, we show the temporal profiles of the minimum intensity of the suppressive perturbation. The minimum intensity of the suppressive perturbation increases with the intensity of the afferent input; however, these curves consistently keep a basin shape with the bottom at SOA = 0, regardless of the intensity of the afferent input. These results indicate that the network is most susceptible to the perturbation at the timing of the afferent input onset, and the susceptibility decreases as the SOA increases. The perturbation succeeds in suppressing the network activity in the parameter region inside these basin curves, and thus the horizontal width of a curve (= W_eff in Figure 6) indicates the suppressive SOA range for a perturbation of a particular intensity. If the afferent input is strong, a higher-intensity perturbation is necessary to suppress the network activity, and the suppressive SOA range is limited to just around the afferent input onset. The relationship between the suppressive SOA range and the afferent intensity is illustrated in Figure 6B for the transient component and Figure 6C for the sustained component.
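Basin curves like those in Figure 6A can be measured numerically by scanning the pulse intensity at each SOA and recording the smallest intensity that leaves the network at rest at the end of a long run. The sketch below is our own reconstruction of that measurement, not the authors' code; the scan step, run length, and numerical "resting" criterion are arbitrary choices, so the resulting values only approximate the published curves.

```python
import numpy as np

N, dt = 96, 0.02
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)
J0, J2, beta, T, eps = 73.0, 110.0, 0.25, 1.0, 0.1
J = -J0 + J2 * np.cos(2.0 * (theta[:, None] - theta[None, :]))
g = lambda f: np.clip(beta * (f - T), 0.0, 1.0)

def final_m0(soa, I_tms, D_tms=0.1, t_end=40.0):
    """Mean activity at the end of a run with a pulse of intensity I_tms at soa."""
    t0 = min(0.0, soa)            # allow pulses delivered before the afferent onset
    m = np.zeros(N)
    for i in range(int(round((t_end - t0) / dt))):
        t = t0 + i * dt
        c = 0.0 if t < 0.0 else (2.0 if t < 4.0 else 0.3)
        h = J @ m / N + c * (1.0 - eps + eps * np.cos(2.0 * theta))
        if soa <= t < soa + D_tms:
            h = h + I_tms
        m = m + dt * (-m + g(h))
    return m.mean()

def min_suppressive_intensity(soa, I_max=40.0, dI=2.0):
    """Smallest scanned intensity whose final state is numerically at rest."""
    for I in np.arange(0.0, I_max + dI, dI):
        if final_m0(soa, I) < 1e-6:
            return float(I)
    return np.inf                 # no scanned intensity suppresses at this SOA

curve = {soa: min_suppressive_intensity(soa) for soa in (-2.0, 0.0, 2.0)}
print(curve)
```

Repeating the scan over a denser SOA grid and over several afferent amplitudes A_t, A_s would trace out the family of basin curves and the W_eff widths discussed above.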
The suppressive SOA range is increased by lowering the amplitude of the transient component and the sustained component of the afferent input; it reaches more than 10 times the time constant τ_m if the perturbation is sufficiently strong and the afferent input is just slightly above the threshold at which the network has a local excitation. As described in section 3.1, the synaptic weights and the gain were chosen so that the network exhibits bistability but takes the resting state if the afferent input is terminated. We also examined other parameter sets that fulfill these two requirements but observed little change in the major results: the perturbation transiently increased the mean activity and then decreased both the mean activity and the degree of activity modulation. The timing most susceptible to the perturbation consistently stayed around the afferent input onset regardless of changes in the network parameters, and the suppressive SOA range also remained comparable to experimental data when a sufficiently strong perturbation was given with the afferent input near the threshold above which the network has a local excitation.

3.3 Inhibitory Strength-Duration Curve: Effective Duration of the Perturbation. One of the most significant features of TMS is that the induced magnetic field is very brief. Here, we address the issue of whether the brief-
Figure 6: (A) The minimum intensity of the suppressive perturbation. The perturbation suppresses the network activity inside these basin curves. The width of each curve indicates the suppressive SOA range for a particular intensity of the perturbation. For example, in the case of A_t = 1.5 and I_TMS = 12, the network is suppressed if the SOA is between −3.55 and 6.42, so the suppressive SOA range (W_eff) is 9.97τ_m. The parameters are the same as in Figure 4. (B, C) The suppressive SOA range takes its maximum value just above the lower limit of the afferent input (vertical dotted line) at which the network has a local excitation, and it decreases as either the transient or the sustained component increases. The parameters are the same as in Figure 4 except that A_s is fixed at 0.3 in B and A_t is fixed at 1.5 in C.
Figure 7: Temporal variation of the inhibitory strength-duration curves. The network activity is suppressed in the upper-right parameter region of each curve. The bottoms of the inhibitory strength-duration curves in the transient period are almost flat and lower than those for the other SOAs. During this period, the perturbation suppresses the network activity efficiently and robustly with respect to variation in the duration of the perturbation. The parameters used here are the same as in Figure 4.
ness of the perturbation is crucial to the inhibitory effect by examining the relationship between the duration and the intensity of a perturbation sufficient to suppress the network activity. The relationship between the current strength and the duration required for excitation of neural tissue is usually called the strength-duration curve. In contrast to the ordinary strength-duration curve, here we define the inhibitory strength-duration curve: the relationship between the minimum intensity (I_TMS) and duration (D_TMS) of a perturbation that suppresses the network activity. The network state changes dynamically after the afferent input onset, so the inhibitory strength-duration curve also depends on the induction timing of the perturbation. Figure 7 shows the temporal variation of the inhibitory strength-duration curves at six typical SOAs. Because of the low-pass effect of the network, there is a lower limit on the duration for the perturbation to be suppressive. This is represented by the vertical edge on the left side of each curve (the cut-off duration). If the duration is shorter than the cut-off value, the perturbation cannot suppress the network activity, even if its intensity is extremely high. To suppress the network activity, the duration of the perturbation must be larger than the cut-off value at each SOA (e.g., D_TMS > 0.03τ_m is necessary at SOA = 0). Another important feature of Figure 7 is that the inhibitory strength-duration curves are not monotonic and exhibit minima in a certain duration
range. These results indicate that there is an optimal duration for suppressing the network activity efficiently with a perturbation of lower intensity. For example, a duration of 0.1τ_m is the most effective at SOA = 0, under which condition an intensity of only 2.45 is required for suppression. If the duration of the perturbation is around 0.1τ_m, which is comparable to a typical TMS pulse duration (see also section 4.1), the inhibitory strength-duration curves for SOA = 0 and 1 exhibit lower values than under the other SOA conditions. This means that the transient period around the afferent input onset is the most susceptible to the perturbation (see also Figure 6A). Furthermore, in the transient period, the bottom of the inhibitory strength-duration curve is almost flat with respect to the duration of the perturbation. The network can hence be suppressed by a similar intensity even if the duration changes to a certain degree. For the other SOAs, the minimum intensity for suppression is easily affected by a change in duration. For example, if the duration is varied between 0.5 and 0.05τ_m, the minimum intensity for suppression varies by 10.4 at SOA = 5, whereas it varies by only 1.23 at SOA = 0. These results indicate that a perturbation close to the afferent input onset can suppress the network activity efficiently, with a lower intensity, and more robustly with respect to variation in the duration of the perturbation.

4 Discussion

4.1 Consistency with Experiments. We have described the suppressive effects induced in a simple neural network model by a TMS-like perturbation. There is a critical period in which the perturbation can completely suppress the network activity. In addition, the network activity is suppressed efficiently and robustly if the perturbation is close to the onset of the afferent input. Assuming that the time constant τ_m is 10 ms, these results can be considered from a quantitative viewpoint.
The duration of the perturbation we use in this article then corresponds to 1 ms, which is of the same order as the pulse duration produced by a typical TMS system. The SOA range in which the perturbation can suppress the network would be over 100 ms, which is also consistent with experimental data for occipital TMS (Amassian et al., 1989; Kamitani & Shimojo, 1999; Kammer & Nusseck, 1998; Masur et al., 1993). We also observed the parametric relationship between the intensity of the afferent input and the degree of suppression (see Figure 6). Raising the intensity of the afferent input increased the minimum intensity of the suppressive perturbation and narrowed the suppressive SOA range. Only when the afferent input was slightly above the threshold at which the network has a local excitation did the suppressive SOA range reach more than 100 ms. Parallel results have been obtained in experimental studies of occipital TMS. Most of these experiments have demonstrated a suppressive SOA range of about 100 ms, but the range actually varies with the visibility of the stimuli. Kammer and Nusseck (1998) demonstrated that raising the contrast of the visual stimuli decreased the error rate of the recognition task and narrowed
A Network Model of TMS
325
the suppressive SOA range. Amassian, Cracco, Maccabee, and Cracco (2002) also reported that suppression cannot be achieved by occipital TMS if the visual stimuli (alphabetical letters in their experiments) can be recognized too easily. These results indicate that to induce a powerful suppression over a wide time range of over 100 ms, the afferent input given as sensory stimuli should be just slightly above the threshold. This is indeed the condition under which the suppressive SOA range reached a large value of over 100 ms in our simulation. Another important point regarding the temporal properties of TMS is the peak timing at which the perturbation acts most suppressively. In most cases of occipital stimulation, a delay of about 100 ms from presentation of the visual stimulus is the most effective for suppressing its percept (see Figure 1). However, this delay varies according to the cortical area where TMS is applied. Beckers and Zeki (1995) and Hotson and Anand (1999) used visual motion stimuli and found a difference in the optimal SOA for suppressing motion perception between V1 and V5 stimulation. Corthout, Uttl, Walsh et al. (1999) and Corthout, Uttl, Ziemann, Cowey, & Hallett (1999), on the other hand, observed that TMS over the occipital pole could induce two distinct suppressive periods, and they discussed these periods in terms of the time course of feedforward and feedback visual processing. In most of these studies, variations in the absolute timing of the suppressive SOA have been considered to reflect differences in the arrival timing of the afferent input and the transmission delay from one cortical area to another (Pascual-Leone & Walsh, 2001). In addition, the suppressive SOA period starts later as the intensity of the visual stimulus is decreased. This delay has also been considered to be due to the reduced transmission speed on the afferent visual pathway (Kammer & Nusseck, 1998; Miller et al., 1996; Masur et al., 1993; Wilson & Anstis, 1969).
The results presented here are measured from the onset of the afferent input to the network, not from the timing of the presentation of the external stimulus. Considering absolute timing, the simulation results therefore need to be offset by the delay in transmission of the neural signals to the target neural population. The temporal profile of the degree of suppression, however, can be considered independent of this absolute timing bias. In typical experiments, the temporal profile of TMS-induced perceptual suppression has been depicted directly with a continuous measure such as the percentage of correct answers (see Figure 1), whereas in our simulation, the minimum intensity of the suppressive perturbation was used as the measure because the suppression occurs in an all-or-nothing fashion. Although the measures are different, both results equivalently represent the temporal variation of the susceptibility to a brief perturbation. They show quite similar temporal profiles and a parallel dependence on the intensity of the afferent stimuli, and the range of the suppressive period is also of the same order. Thus, assuming the proper delay in the afferent input arrival, the temporally selective
suppression in the network model agrees well with experimental data for TMS-induced perceptual suppression.

4.2 Neural Mechanisms of TMS-Induced Suppression. The neural mechanisms of TMS-induced suppression constitute the major issue still under debate and also the primary question of this article. A recent computational study suggested the involvement of a calcium-dependent potassium current inducing a long hyperpolarization period after stimulation (Kamitani et al., 2001). This mechanism might contribute to the suppression as an elemental component. On the other hand, several experimental studies have suggested that TMS-induced suppression is mediated by the activation of GABAergic inhibitory interneurons, based on evidence from experiments using GABAergic drugs (Ziemann et al., 1996a, 1996b; Inghilleri et al., 1996), hyperventilation (Priori et al., 1995), and multiple-pulse techniques (Ziemann, Rothwell, & Ridding, 1996; Romeo et al., 2000; Inghilleri et al., 1993). These reports imply the involvement of an inhibitory network in suppression, although the concrete mechanisms have not yet been clearly described. As Walsh and Cowey (2000) pointed out in their review, TMS is highly unlikely to evoke a particular activity pattern of the kind coordinated by a local cortical network or an afferent projection to the target area. Rather, TMS might induce random activity regardless of the activity pattern existing in the target area, akin to injecting random noise into ordered neural processes (see Figure 8A). In this article, we simply approximated TMS as a uniform perturbation by which all neurons could be stimulated regardless of the present activity pattern in the network. Walsh's idea and our modeling are essentially equivalent in the sense that TMS delivers neural stimulation unrelated to the cortical activity pattern formed by the afferent input and the cortical network.
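The counter-suppression scenario of Figure 8B can be caricatured with just two rate units: an "active" unit A receiving afferent drive and an initially silent competitor B, coupled by mutual inhibition. All parameter values below are illustrative choices, not fitted values from the ring model in this article; a brief uniform pulse recruits B, whose inhibition then transiently pulls A down:

```python
import numpy as np

def simulate(pulse_amp, pulse_on, pulse_off, T=300.0, dt=0.1, tau=10.0):
    """Euler-integrate two mutually inhibiting rate units with a
    threshold-saturation gain; a uniform pulse drives both units."""
    g = lambda x: np.clip(x, 0.0, 1.0)
    w_exc, w_inh = 1.2, 2.0          # self-excitation, lateral inhibition
    i_aff, theta = 1.0, 0.5          # afferent drive to A only, threshold
    rA = rB = 0.0
    ts, rAs, rBs = [], [], []
    for step in range(int(T / dt)):
        t = step * dt
        p = pulse_amp if pulse_on <= t < pulse_off else 0.0
        dA = -rA + g(w_exc * rA - w_inh * rB + i_aff + p - theta)
        dB = -rB + g(w_exc * rB - w_inh * rA + p - theta)
        rA += dt / tau * dA
        rB += dt / tau * dB
        ts.append(t); rAs.append(rA); rBs.append(rB)
    return np.array(ts), np.array(rAs), np.array(rBs)

# A pulse lasting one membrane time constant recruits the silent unit B,
# which then transiently counter-suppresses the active unit A.
ts, rAs, rBs = simulate(pulse_amp=3.0, pulse_on=150.0, pulse_off=160.0)
```

Unit A sits near saturation before the pulse, B is recruited during it, and A dips below its pre-pulse rate once the pulse ends — a rate-based analogue of the counter-suppression sketched in Figure 8B.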
As a consequence of such a perturbation, the inactive neurons initially suppressed by the active neurons can also be activated, and they then start to act suppressively on the active neurons via lateral inhibitory connections (see Figure 8B). After the local excitation has built up well, the inhibitory input from the active neurons to the inactive neurons becomes so strong that it is difficult for the perturbation to pull the activity of the inactive neurons above threshold, resulting in failure to produce a counter-suppression of the active neurons. This temporal relationship between the perturbation and the local excitation limits the critical time range for suppressing the network activity after the afferent input onset. Before the afferent input is given, only the uniform inhibition J_0 has a dominant effect because there are no second-order Fourier components. Thus, the origin of the suppressive effect in that period is recurrent inhibition brought about by J_0 after a brief excitation induced by the perturbation; its relaxation time limits the critical time range for suppressing the network activity. This is a basic framework for the dynamics of TMS-induced suppression provided by the network model of feature selectivity in the visual cortex
Figure 8: (A) The hypothesis of random stimulation of a neural population by TMS. TMS may have difficulty inducing a particular localized activity pattern; rather, it could randomly stimulate the population of neurons under the coil (see also Walsh & Cowey, 2000). (B) The basic framework of TMS-induced suppression in the network with feature-specific interaction. TMS is modeled as a uniform excitatory perturbation across the entire network, so that the activity of all neurons is increased first. If the perturbation is strong enough to pull the activity of inactive neurons up above the threshold, the active neurons receive a counter-suppression from the inactive neurons via inhibitory connections.
and the TMS model of a uniform perturbation. However, the functional architecture of feature-selective excitatory and inhibitory interaction like that we used in this article might not be necessary to produce similar results. The essential property of our network model is bistability generated by recurrent excitatory and inhibitory synaptic connections and afferent input channels. Hence, a sparsely connected two-population (excitatory and inhibitory) model with suitable synaptic connections and external input channels might be sufficient (e.g., Amit & Brunel, 1997; van Vreeswijk & Sompolinsky, 1996). Suppose a strong perturbation were given to both populations in such a model: inhibitory units would be activated directly or transsynaptically, and the total activity would then be decreased through inhibitory interactions. TMS would act like a reset input in the excitatory-inhibitory balanced network. In brain areas other than the visual cortex, we can also observe TMS-induced suppressive effects. For example, TMS of the motor cortex induces a muscle twitch and suppression of a surface EMG on an upper limb. The functional architecture of the motor cortex is distinct from that of the visual cortex; however, the essential property for the TMS-induced suppression we discussed above is a ubiquitous feature of the brain and also exists in the motor cortex. Thus, the basic mechanisms of TMS-induced suppression might be shared across different cortical areas, and the typical experimental results of TMS of the motor cortex could also be explained by a general model like a two-population network. There would
also be specificity originating from differences in the functional architecture and the coding scheme of neural signals in each cortical area. It is necessary to take such area specificity into account to identify neural mechanisms that yield phenomena unique to the stimulated cortical site. The issue of area specificity remains open for future study.

In this article, we focused on perceptual suppression by TMS; in particular, we assumed stimulation of the visual cortex and chose a model with the typical functional architecture for feature selectivity or, more specifically, a model of the orientation tuning function in the primary visual cortex. We obtained temporal properties of the suppressive effect consistent with the typical data from TMS experiments in the occipital area, in which a visual stimulus evokes a coordinated neural activity pattern organized according to a columnar structure and retinotopic topology (e.g., Tootell et al., 1998). Therefore, TMS might randomly activate the neurons in the target area, and such random activity may suppressively affect the coordinated neural processes via local inhibitory connections. These results suggest that TMS-induced suppression is transsynaptically mediated by the inhibitory network and that dynamical interaction in a neural population plays an important role in the temporal properties of the suppression.

Acknowledgments

This work was supported by the Special Postdoctoral Researchers Program of RIKEN, Grant-in-Aid for Scientific Research on Priority Areas No. 14084212, and Grant-in-Aid for Scientific Research (C) No. 14580438.

References

Adorjan, P., Levitt, J., Lund, J., & Obermayer, K. (1999). A model for the intracortical origin of orientation preference and tuning in macaque striate cortex. Vis. Neurosci., 16, 303–318.
Amassian, V., Cracco, R., Maccabee, P., & Cracco, J. (2002). Visual system. In A. Pascual-Leone, N. Davey, J. Rothwell, E. Wassermann, & B. Puri (Eds.), Handbook of transcranial magnetic stimulation (chap. 5, pp. 323–334). London: Arnold.
Amassian, V., Cracco, R., Maccabee, P., Cracco, J., Rudell, A., & Eberle, L. (1989). Suppression of visual perception by magnetic coil stimulation of human occipital cortex. Electroencephalogr. Clin. Neurophysiol., 74, 458–462.
Amassian, V., Cracco, R., Maccabee, P., Cracco, J., Rudell, A., & Eberle, L. (1993). Unmasking human visual perception with the magnetic coil and its relationship to hemispheric asymmetry. Brain Res., 605, 312–316.
Amit, D., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Barker, A., Jalinous, R., & Freeston, I. (1985). Non-invasive magnetic stimulation of human motor cortex. Lancet, 1, 1106–1107.
Basser, P., & Roth, B. (2000). New currents in electrical stimulation of excitable tissues. Annu. Rev. Biomed. Eng., 2, 377–397.
Beckers, G., & Homberg, V. (1992). Cerebral visual motion blindness: Transitory akinetopsia induced by transcranial magnetic stimulation of human area V5. Proc. R. Soc. Lond. B. Biol. Sci., 249, 173–178.
Beckers, G., & Zeki, S. (1995). The consequences of inactivating areas V1 and V5 on visual motion perception. Brain, 118, 49–60.
Ben-Yishai, R., Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bredfeldt, C., & Ringach, D. (2002). Dynamics of spatial frequency tuning in macaque V1. J. Neurosci., 22, 1976–1984.
Carandini, M., & Ringach, D. (1997). Predictions of a recurrent model of orientation selectivity. Vision Res., 37, 3061–3071.
Corthout, E., Uttl, B., Walsh, V., Hallett, M., & Cowey, A. (1999). Timing of activity in early visual cortex as revealed by transcranial magnetic stimulation. Neuroreport, 10, 2631–2634.
Corthout, E., Uttl, B., Ziemann, U., Cowey, A., & Hallett, M. (1999). Two periods of processing in the (circum)striate visual cortex as revealed by transcranial magnetic stimulation. Neuropsychologia, 37, 137–145.
Hallett, M. (1995). Transcranial magnetic stimulation: Negative effects. Adv. Neurol., 67, 107–113.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 499–567). Cambridge, MA: MIT Press.
Hilgetag, C., Theoret, H., & Pascual-Leone, A. (2001). Enhanced visual spatial attention ipsilateral to RTMS-induced "virtual lesions" of human parietal cortex. Nat. Neurosci., 4, 953–957.
Hotson, J., & Anand, S. (1999). The selectivity and timing of motion processing in human temporo-parieto-occipital and occipital cortex: A transcranial magnetic stimulation study. Neuropsychologia, 37, 169–179.
Inghilleri, M., Berardelli, A., Cruccu, G., & Manfredi, M. (1993). Silent period evoked by transcranial stimulation of the human cortex and cervicomedullary junction. J. Physiol., 466, 521–534.
Inghilleri, M., Berardelli, A., Marchetti, P., & Manfredi, M. (1996). Effects of diazepam, baclofen and thiopental on the silent period evoked by transcranial magnetic stimulation in humans. Exp. Brain Res., 109, 467–472.
Kamitani, Y., Bhalodi, V., Kubota, Y., & Shimojo, S. (2001). A model of magnetic stimulation of neocortical neurons. Neurocomputing, 38–40, 697–703.
Kamitani, Y., & Shimojo, S. (1999). Manifestation of scotomas created by transcranial magnetic stimulation of human visual cortex. Nat. Neurosci., 2, 767–771.
Kammer, T., & Nusseck, H. (1998). Are recognition deficits following occipital lobe TMS explained by raised detection thresholds? Neuropsychologia, 36, 1161–1166.
Kastner, S., Demmer, I., & Ziemann, U. (1998). Transient visual field defects induced by transcranial magnetic stimulation over human occipital pole. Exp. Brain Res., 118, 19–26.
Marg, E., & Rudiak, D. (1994). Phosphenes induced by magnetic stimulation over the occipital brain: Description and probable site of stimulation. Optom. Vis. Sci., 71, 301–311.
Masur, H., Papke, K., & Oberwittler, C. (1993). Suppression of visual perception by transcranial magnetic stimulation—experimental findings in healthy subjects and patients with optic neuritis. Electroencephalogr. Clin. Neurophysiol., 86, 259–267.
Matthews, N., Luber, B., Qian, N., & Lisanby, S. (2001). Transcranial magnetic stimulation differentially affects speed and direction judgments. Exp. Brain Res., 140, 397–406.
Meyer, B., Diehl, R., Steinmetz, H., Britton, T., & Benecke, R. (1991). Magnetic stimuli applied over motor and visual cortex: Influence of coil position and field polarity on motor responses, phosphenes, and eye movements. Electroencephalogr. Clin. Neurophysiol. Suppl., 43, 121–134.
Miller, M., Fendrich, R., Eliassen, J., Demirel, S., & Gazzaniga, M. (1996). Transcranial magnetic stimulation: Delays in visual suppression due to luminance changes. Neuroreport, 7, 1740–1744.
Nagarajan, S., Durand, D., & Warman, E. (1993). Effects of induced electric fields on finite neuronal structures: A simulation study. IEEE Trans. Biomed. Eng., 40, 1175–1188.
Pascual-Leone, A., & Walsh, V. (2001). Fast backprojections from the motion to the primary visual area necessary for visual awareness. Science, 292, 510–512.
Priori, A., Berardelli, A., Mercuri, B., Inghilleri, M., & Manfredi, M. (1995). The effect of hyperventilation on motor cortical inhibition in humans: A study of the electromyographic silent period evoked by transcranial brain stimulation. Electroencephalogr. Clin. Neurophysiol., 97, 69–72.
Rattay, F. (1986). Analysis of models for external stimulation of axons. IEEE Trans. Biomed. Eng., 33, 974–977.
Ringach, D., Hawken, M., & Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387, 281–284.
Romeo, S., Gilio, F., Pedace, F., Ozkaynak, S., Inghilleri, M., Manfredi, M., & Berardelli, A. (2000). Changes in the cortical silent period after repetitive magnetic stimulation of cortical motor areas. Exp. Brain Res., 135, 504–510.
Roth, B., & Basser, P. (1990). A model of the stimulation of a nerve fiber by electromagnetic induction. IEEE Trans. Biomed. Eng., 37, 588–597.
Schmolesky, M., Wang, Y., Hanes, D., Thompson, K., Leutgeb, S., Schall, J., & Leventhal, A. (1998). Signal timing across the macaque visual system. J. Neurophysiol., 79, 3272–3278.
Somers, D., Nelson, S., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.
Sompolinsky, H., & Shapley, R. (1997). New perspectives on the mechanisms for orientation selectivity. Curr. Opin. Neurobiol., 7, 514–522.
Tootell, R., Hadjikhani, N., Vanduffel, W., Liu, A., Mendola, J., Sereno, M., & Dale, A. (1998). Functional analysis of primary visual cortex (V1) in humans. Proc. Natl. Acad. Sci. USA, 95, 811–817.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Walsh, V., & Cowey, A. (2000). Transcranial magnetic stimulation and cognitive neuroscience. Nat. Rev. Neurosci., 1, 73–79. Wilson, J., & Anstis, S. (1969). Visual delay as a function of luminance. Am. J. Psychol., 82, 350–358. Ziemann, U., Lonnecker, S., Steinhoff, B., & Paulus, W. (1996a). The effect of lorazepam on the motor cortical excitability in man. Exp. Brain Res., 109, 127–135. Ziemann, U., Lonnecker, S., Steinhoff, B., & Paulus, W. (1996b). Effects of antiepileptic drugs on motor cortex excitability in humans: A transcranial magnetic stimulation study. Ann. Neurol., 40, 367–378. Ziemann, U., Rothwell, J., & Ridding, M. (1996). Interaction between intracortical inhibition and facilitation in human motor cortex. J. Physiol., 496, 873–881. Received December 27, 2002; accepted July 8, 2003.
LETTER
Communicated by Kwokleung Chan
Adaptive Two-Pass Median Filter Based on Support Vector Machines for Image Restoration Tzu-Chao Lin [email protected]
Pao-Ta Yu [email protected] Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan 62107, R.O.C.
In this letter, a novel adaptive filter, the adaptive two-pass median (ATM) filter based on support vector machines (SVMs), is proposed to preserve more image details while effectively suppressing impulse noise for image restoration. The proposed filter is composed of a noise decision maker and two-pass median filters. Our new approach basically uses an SVM impulse detector to judge whether the input pixel is noise. If a pixel is detected as corrupted, the noise-free reduction median filter is triggered to replace it; otherwise, it remains unchanged. Then, to improve the quality of the restored image, a decision impulse filter is put to work in the second-pass filtering procedure. In suppressing both fixed-valued and random-valued impulses without degrading the quality of fine details, the results of our extensive experiments demonstrate that the proposed filter outperforms earlier median-based filters in the literature. Our new filter also provides excellent robustness at various percentages of impulse noise.

1 Introduction

Noise cancellation is an important step in image processing applications such as edge detection, image segmentation, and data compression. In the literature, nonlinear methods seem to perform better than linear methods in the case of corruption by impulse noise. The median filter, one of the most popular nonlinear filters, has been extensively studied due to its ability to suppress impulse noise in a computationally efficient manner (Astola & Kuosmanen, 1997). Although noise suppression is indeed achieved, the median filter tends to blur fine details and lines in many cases; that is, it usually removes desirable details. This problem becomes especially serious when the size of the filter window is large. Some modified median-based filters have been proposed to overcome this problem, including weighted median (WM) filters and fuzzy rules–based median filters as well as decision-based median filters (Yin,

Neural Computation 16, 333–354 (2004)
© 2003 Massachusetts Institute of Technology
334
T. Lin and P. Yu
Yang, Gabbouj, & Neuvo, 1996; Ko & Lee, 1991; Sun & Neuvo, 1994; Chen, Ma, & Chen, 1999; Chen & Wu, 2001; Abreu & Mitra, 1995; Arakawa, 1996). Acceptable results have been obtained by using these filters. Nevertheless, weighted median (WM) filters still tend either to remove fine details from the image or to retain too much impulse noise; that is, a trade-off exists between noise reduction and detail preservation (Ko & Lee, 1991). On the other hand, fuzzy rules learning-based median filters are controlled by fuzzy rules that judge the possibility of the existence of impulse noise (Arakawa, 1996; Tsai & Yu, 1999, 2000; Yu & Chen, 1995; Yu, Lee, & Kuo, 1995). The learning process is realized by training over a reference image. Because the values of the membership function parameters depend on the reference image, some situations may remain untrained; in other words, the generalization capability is poor (Arakawa, 1996). Recently, decision-based median filters, realized by thresholding operations, have been studied (Sun & Neuvo, 1994; Chen et al., 1999). They mainly use a predefined threshold value to control the filter so that it is activated only for contaminated pixels; that is, damage to good pixels is avoided. The switching schemes (Schemes I and II) of Sun and Neuvo (1994) prove to be more effective than uniformly applied methods. In both schemes, the final output switches between the output of the median filter and the identity. The tristate decision-making mechanism of Chen et al. (1999) incorporates the median filter and the center weighted median (CWM) filter, which is a special case of the WM filter, taking either the identity, the output of the median filter, or that of the CWM filter as the new pixel value. In general, decision-based median filters can be realized quite easily, but it is difficult for them to base their thresholding on local signal statistics when judging whether impulse noise exists.
Moreover, these hard switching schemes may work well on large fixed-valued impulses but poorly on random-valued ones, or vice versa. In this letter, we propose a novel adaptive two-pass median (ATM) filter based on SVMs and switching schemes to overcome the drawbacks of these methods. Basically, the proposed filter applies a new SVM approach: an impulse detection algorithm is put to use before the filtering process, and the detection results are used to decide whether a pixel should be modified. SVMs have been proposed recently as new kinds of feedforward networks for pattern recognition because of their good generalization capability (Haykin, 1999; Vapnik, 1995, 1998). In addition, to improve the filtering performance, the second-pass filter uses a simple decision impulse filter to remove first-pass residual noise. With this new filtering framework, our filter can effectively eliminate both fixed-valued and random-valued impulse noise, and the preservation of the original signals is greatly improved in comparison with other median-based filters, without sacrificing too much computational complexity. Furthermore, the new filter also shows excellent robustness at different impulse ratios, and it is independent of the reference image in the training phase.
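The hard-switching idea behind the decision-based filters discussed above can be sketched in a few lines: compare each pixel with its window median and replace it only when the deviation exceeds a threshold. The threshold value and window size below are illustrative, not the settings of any particular published filter:

```python
import numpy as np

def switching_median(image, threshold=40, win=1):
    """Hard-switching median filter: replace a pixel with its window
    median only when the pixel deviates from that median by more than
    `threshold` (a Sun & Neuvo-style decision; values illustrative)."""
    h, w = image.shape
    padded = np.pad(image, win, mode='edge')   # replicate borders
    out = image.copy()
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 2 * win + 1, j:j + 2 * win + 1]
            med = np.median(window)
            if abs(float(image[i, j]) - med) > threshold:
                out[i, j] = med                # likely impulse: filter it
    return out

# A lone impulse is removed while a mild detail pixel is preserved.
img = np.full((5, 5), 50.0)
img[2, 2] = 255.0    # impulse
img[0, 3] = 70.0     # genuine detail, within the threshold
out = switching_median(img, threshold=40)
```

The switch is what buys detail preservation: uncorrupted pixels pass through as the identity instead of being smoothed by the median.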
Adaptive Median Filter Based on Support Vector Machines
335
In section 2, some basic concepts are reviewed, and the noise model is defined. In section 3, the design of the new SVM technique is illustrated. In section 4, the design of the new ATM filter is portrayed in detail. In section 5, extensive simulations demonstrate that the proposed ATM filter outperforms other median-based filters. The conclusion is given in section 6.

2 Basic Concepts and Impulse Noise Model

2.1 Basic Concepts. Before introducing the proposed ATM filter, some notation must be defined. Let X denote a two-dimensional h × m image represented by

X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{h1} & x_{h2} & \cdots & x_{hj} & \cdots & x_{hm}
\end{bmatrix} = [x_{ij}]_{h \times m},   (2.1)

where h and m are its height and width, respectively, and x_{ij} ∈ {0, 1, 2, …, 255} is the pixel value at position (i, j) in X. A filter window (or sliding window) whose size is S = (2τ + 1)^2 = 2n + 1 (S is odd, in general) covers the image X at location (i, j) to obtain an observed sample matrix W_{ij} defined by

W_{ij} = \begin{bmatrix}
x_{i-τ,j-τ} & \cdots & x_{i-τ,j} & \cdots & x_{i-τ,j+τ} \\
\vdots      &        & \vdots    &        & \vdots      \\
x_{i,j-τ}   & \cdots & x_{ij}    & \cdots & x_{i,j+τ}   \\
\vdots      &        & \vdots    &        & \vdots      \\
x_{i+τ,j-τ} & \cdots & x_{i+τ,j} & \cdots & x_{i+τ,j+τ}
\end{bmatrix}_{(2τ+1) × (2τ+1)},   (2.2)

where 1 ≤ i ≤ h and 1 ≤ j ≤ m. The central pixel value of the filter window W_{ij} is x_{ij}. Let the filter window W_{ij} slide over the image X from left to right and top to bottom in raster scan fashion. By applying the row-major method to the signal set W_{ij} in equation 2.2, for convenience, the observed sample matrix can be represented by

W_{ij} = (x_{i−τ,j−τ}, x_{i−τ,j−τ+1}, …, x_{ij}, …, x_{i+τ,j+τ−1}, x_{i+τ,j+τ}),   (2.3)
where the subscript ij can be replaced with a scalar running index, k = (i − 1) × m + j. Hence, in order to transfer the (2τ + 1) × (2τ + 1) window into a one-dimensional vector, equation 2.3 can be written as

w(k) = (x_{−n}(k), x_{−n+1}(k), …, x_0(k), …, x_{n−1}(k), x_n(k)),   (2.4)
Figure 1: A 3 × 3 filter window around the center pixel x_0(k).
where x_0(k) (or x(k)) is the original central pixel value at location k. For example, we consider a 3 × 3 filter window centered around x_0(k), as shown in Figure 1, such that

w(k) = (x_{−4}(k), …, x_{−1}(k), x_0(k), x_1(k), …, x_4(k)).   (2.5)
The output value of a median filter at location k, for a filter window of size 2n + 1, is denoted m(k):

m(k) = MED w(k) = MED{x_{−n}(k), …, x_{−1}(k), x_0(k), x_1(k), …, x_n(k)},   (2.6)
where MED denotes the median operation.

2.2 The Impulse Noise Model. In this work, source images corrupted by impulse noise with probability of occurrence p can be described as

x(k) = \begin{cases} n(k), & \text{with probability } p \\ a(k), & \text{with probability } 1 − p, \end{cases}   (2.7)

where n(k) denotes a pixel contaminated by impulse noise at noise ratio p, and a(k) denotes a noise-free pixel. There are two types of impulse noise: fixed-valued and random-valued impulses. In a gray-scale image, a fixed-valued impulse, also known as salt-and-pepper noise, is equal or close to the maximum or minimum of the allowable dynamic range, each with equal probability (i.e., p/2), while a random-valued impulse is uniformly distributed over the range [0, 255] at ratio p.
3 SVM Classifier

Among neural network models, there are many candidates for the classification problem (Lin & Yu, 2003). Since SVMs have high generalization ability without the addition of a priori knowledge, even when the dimension of the input space is very high, the SVM approach is considered a good candidate for the classifier. We first describe the basic theory of SVMs for classification (Haykin, 1999; Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Cristianini & John, 2000) and then apply the SVM technique to our noise detector.

3.1 Support Vector Machines. SVMs are basically aimed at dealing with two-class classification problems. Generally, the two classes are separated by an optimal discrimination function, which is induced from training examples. Let {(x_i, y_i), i = 1, …, N} be a set of training examples, where each example x_i ∈ R^t (t is the dimension of the input space) belongs to a class labeled by y_i ∈ {−1, 1}, and consider a hyperplane w · x + b = 0 that divides the set of examples so that all examples with the same label are on the same side of the hyperplane. If the set of examples is separated without error and the margin is maximal, then the optimal separating hyperplane (OSH) is obtained. The margin can be seen as a measure of the generalization ability; that is, the larger the margin, the better the generalization expected (Vapnik, 1995, 1998). A linearly separating hyperplane in canonical form must satisfy the constraint

y_i[(w · x_i) + b] ≥ 1,  i = 1, …, N.   (3.1)

The distance of an example x to the hyperplane is

d(w, b; x) = |w · x + b| / ‖w‖.   (3.2)

The margin between the closest examples (called the support vectors) and the hyperplane is 1/‖w‖. Hence, the optimal separating hyperplane is the one that minimizes the cost function

Φ(w) = (1/2)‖w‖².   (3.3)

The solution to the optimization problem in equation 3.3 under the constraint of equation 3.1 can be obtained by Lagrange multipliers. In this work, we need to consider both the linearly nonseparable case and the nonlinear case, since the noisy training examples involve both. The nonlinear
discrimination function is of the form

f(x) = sign( \sum_{i=1}^{N} α_i y_i K(x_i, x) + b ),   (3.4)
where K(x_i, x) is a positive definite kernel function, and the nonnegative variables α_i are the Lagrange multipliers. Three kernel functions are widely used: polynomial kernels, radial basis function (RBF) kernels, and sigmoidal kernels (Haykin, 1999). In this work, polynomial kernels are used to implement the proposed filter (Lin, 2001, 2002; Chang, Hsu, & Lin, 2000; Hsu & Lin, 2002; Chang & Lin, 2002).

3.2 Impulse Noise Detector Using SVMs. In this subsection, we design a decision median filter based on the SVM technique. The structure of the SVM classifier is illustrated in Figure 2. The class label d is equal to 1 or −1, meaning that the input example x(k) is a noisy pixel or a noise-free pixel, respectively. Note that x(k) and z(k) represent the input pixel value and the output value at location k, respectively. Figure 3 shows the feedforward network architecture of support vector machines for noise detection. Before training the SVMs, the observation vector must be extracted from the filter window w(k) to reduce the size of the SVM structure. The details of feature extraction are discussed in the next section.

4 The Design of the ATM Filter

The proposed ATM filter uses the SVM technique to classify each signal as either noise free or noise corrupted so that only noisy signals are filtered and good signals are preserved. Subsection 4.1 describes the structure of the ATM filter. Subsection 4.2 then presents a new approach to decide whether
Figure 2: Structure of the decision median filter.
Adaptive Median Filter Based on Support Vector Machines
Figure 3: Feedforward network architecture of support vector machines.
a pixel is corrupted by impulse noise. After that, the SVM impulse detector is devised to solve the classification problem. The two-pass filtering process is described in subsection 4.3.

4.1 The Structure of the ATM Filter. In section 3, we presented the SVM two-class recognition scheme; that is, SVMs recognize the input signals as two separate classes: noise-free and noise-corrupted signals. We now describe the detailed structure of the proposed ATM filter, which consists of two-pass median filters to remove impulse noise, as shown in Figures 4a and 4b. In general, to reduce impulse noise effectively without degrading the quality of the original image, the filtering should act only on the corrupted pixels; the noise-free pixels should be kept unchanged. This is achieved by the SVM detection process in the first pass of the proposed ATM filter. The first-pass filtering has two main steps, as shown in Figure 4a. Before the noise reduction filtering, the input signals are first
Figure 4: ATM filter. (a) First-pass ATM filter. (b) Second-pass ATM filter.
judged by the SVM impulse detector to be noise-free or noise-corrupted pixels. A binary flag map $I(i, j)$, $1 \le i \le h$, $1 \le j \le m$, is used to indicate whether the pixels have been detected as impulses in the test image. (The feature extraction and SVM impulse detector are discussed in the next subsection.) The input signal $x(k)$ is determined to be noise free or noise corrupted depending on the binary flag map $I(i, j)$. If a pixel is determined to be corrupted, it is replaced with an estimated median selected from only the noise-free pixels within the filter window $w(k)$; otherwise, it is kept unchanged. We call this special filter the noise-free reduction median
(NFR) filter. Because the SVM impulse detector might mistakenly leave some noisy pixels undetected or misdetect good pixels, a second-pass filtering by the decision median filter is added to remove the remaining noisy pixels. In order to remove only residual impulses with very small signal distortion, a hard switching scheme is also part of our new filter.

4.2 A New Approach to Judge Whether an Impulse Noise Exists. In this work, we propose a new and efficient classifier approach to identify noise using the SVM method. The SVM impulse detector, as shown in Figure 4a, uses the local characteristics in the filter window $w(k)$ as observation vectors to classify the signals into two classes: noise-free and noise-corrupted signals. This approach can minimize not only the risk of misclassifying the examples in the training set (i.e., training errors) but also that of misclassifying the examples of the test set (i.e., generalization errors). The design of the detection approach consists of two steps: (1) feature extraction and (2) training the SVM impulse detector.

4.2.1 Feature Extraction Using a Local Information Measure. Before the noise-reduction filtering, the observation vector must be extracted from the filter window $w(k)$ to identify noisy pixels (Lin & Yu, in press). One can define the following three variables to generate the observation vector $O(k)$ used as the input data of the SVM impulse detector. In general, the amplitudes of most impulse noises are more prominent than the fine changes of signals. Thus, the difference between the value $x(k)$ of the input pixel and the output value of the median filter provides an efficient measurement for identifying noisy pixels.

Definition 1. The variable $u(k)$ denotes the absolute difference between the input $x(k)$ and the median value $m(k)$:

$$u(k) = |x(k) - m(k)|. \qquad (4.1)$$
Note that $u(k)$ is a measure indicating how likely it is that the input $x(k)$ is contaminated. A large value of $u(k)$ indicates that the input $x(k)$ is dissimilar to the median value of the filter window $w(k)$; that is, the central pixel $x(k)$ is corrupted by impulse noise. The correctness of the SVM impulse detector is decisive in the filtering process, because the whole noise filtering depends on the noise detection results. Most impulse noises can be detected using the variable $u(k)$ as the indicator. Nevertheless, using only the variable $u(k)$ as the noise detector has two problems. The first problem is that if we take the value of $u(k)$ alone to judge whether an impulse noise exists, it is difficult to separate impulse noise fully. For example, line components are usually present in an image, and their width may be just 1 pixel; therefore, if $x(k)$ is located on such a line, it may be interpreted as impulse noise and removed. This problem is shown in the following
example, where a skew line is located in a lter window as follows: 0
200 @ 10 10
10 200 10
1 10 10 A : 200
To avoid incorrect judgments, it is necessary to add other observations. Thus, another variable, $v(k)$, is provided, defined as follows:

Definition 2.

$$v(k) = \frac{|x_0(k) - x_{c1}(k)| + |x_0(k) - x_{c2}(k)|}{2}, \qquad (4.2)$$
where $|x_0(k) - x_{c1}(k)| \le |x_0(k) - x_{c2}(k)| \le |x_0(k) - x_i(k)|$, $-n \le i \le n$, $i \ne 0, c1, c2$. Note that $x_{c1}(k)$ and $x_{c2}(k)$ are selected as the pixel values closest to that of $x(k)$ among its adjacent pixels in the filter window $w(k)$, as shown in Figure 1. If the variable $v(k)$ is applied, then a line component in the filter window $w(k)$ will not be detected as noise but will be preserved, because its $v(k)$ is small. The second problem is that a good pixel with a large $u(k)$ is usually misjudged as impulse noise. Such pixels may exist in edge areas. Therefore, it is important to distinguish noise components from edge components for effective noise filtering and edge preservation. For example, edges in a $3 \times 3$ filter window can generally be represented as in Figure 5 (Lee, Hwang, & Sohn, 1998). That is, edges generally have more than three similar pixels in the $3 \times 3$ filter window, and they should not be detected
Figure 5: Edges in a 3 × 3 window.
as noisy pixels. To overcome this problem, another variable, $q(k)$, should be provided.

Definition 3.

$$c_{w_0}(k) = \mathrm{MED}\{x_{-n}(k), \ldots, x_{-1}(k), w_0 \diamond x_0(k), \ldots, x_n(k)\}, \qquad (4.3)$$

where

$$\mathrm{MED}\{x_{-n}(k), \ldots, x_{-1}(k), w_0 \diamond x_0(k), \ldots, x_n(k)\} = \mathrm{MED}\{x_{-n}(k), \ldots, x_{-1}(k), \underbrace{x_0(k), \ldots, x_0(k)}_{w_0 \text{ times}}, \ldots, x_n(k)\}.$$

MED denotes the median operation, $w_0$ denotes a nonnegative integer weight, and $w_0 \diamond x_0(k)$ means that there are $w_0$ copies of the input pixel $x_0(k)$ (Ko & Lee, 1991).

Definition 4.

$$q(k) = |x_0(k) - c_3(k)|. \qquad (4.4)$$
If the variable $q(k)$ with center weight $w_0 = 3$ is applied, the edge components in the filter window $w(k)$ will not be detected as noise but will be preserved, so we can decide that no impulse noise is located at the pixel $x(k)$. In this work, based on the above three variables $u(k)$, $v(k)$, and $q(k)$, the observation vectors are given by

$$O(k) = (u(k), v(k), q(k)). \qquad (4.5)$$
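As an illustration, the three features can be computed from a 3 × 3 filter window as in the following Python sketch (a hypothetical NumPy implementation; the function name and boundary conventions are ours, not from the article):

```python
import numpy as np

def observation_vector(window, w0=3):
    """Compute the observation vector O(k) = (u, v, q) for a 3x3 window."""
    w = np.asarray(window, dtype=float).ravel()
    x0 = w[len(w) // 2]                     # center pixel x0(k) of the 3x3 window
    # u(k): absolute difference from the plain median (definition 1)
    u = abs(x0 - np.median(w))
    # v(k): mean distance to the two neighbours closest in value (definition 2)
    neighbours = np.delete(w, len(w) // 2)  # all window pixels except the center
    d = np.sort(np.abs(neighbours - x0))
    v = (d[0] + d[1]) / 2.0
    # q(k): distance from the center-weighted median with weight w0 (defs. 3-4)
    weighted = np.concatenate([neighbours, np.repeat(x0, w0)])
    q = abs(x0 - np.median(weighted))
    return u, v, q
```

On the skew-line example above, the center pixel yields a large $u(k)$ but $v(k) = 0$, so the line is not flagged as noise.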
The observation vectors $O(k)$ are derived to extract the feature information from the filter window for training the SVM impulse detector. This feature information will also be used in the filtering stage.

4.2.2 Training the SVM Impulse Detector. In the above steps, local features are extracted in an unsupervised manner. For training the SVM impulse detector, supervised class information of the training examples is incorporated with these extracted observation vectors $O(k)$. After learning, the discrimination function between the noise-free and noise-corrupted classes is obtained; it is represented by several support vectors together with their combination coefficients, as shown in equation 3.4. That is, the optimal separating hyperplane is obtained. The training process is shown in Figure 6.
Figure 6: Training framework of the SVM impulse detector.
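As a concrete illustration of this training framework, the following Python sketch trains a polynomial-kernel SVM on synthetic observation vectors (assuming scikit-learn is available; the data are invented stand-ins, not the article's training set):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical observation vectors (u, v, q): noise-free pixels have small
# values, noise-corrupted pixels large ones.
clean = rng.uniform(0, 20, size=(200, 3))
noisy = rng.uniform(100, 255, size=(200, 3))
X = np.vstack([clean, noisy])
y = np.concatenate([-np.ones(200), np.ones(200)])  # d = -1 noise-free, d = +1 noisy

# Polynomial kernel, as used for the filter in the article.
detector = SVC(kernel="poly", degree=3, coef0=1.0)
detector.fit(X, y)

# The learned discrimination function (equation 3.4) classifies new pixels.
print(detector.predict([[5.0, 2.0, 1.0], [190.0, 150.0, 180.0]]))
```

The article uses LIBSVM rather than scikit-learn, but the training interface is analogous.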
4.3 Two-Pass Filtering

4.3.1 The First-Pass Filtering. The ATM first-pass filter consists mainly of two steps: an SVM impulse detector and a noise-free reduction filter, as shown in Figure 4a.

Step 1: SVM impulse detector. First, the feature information $O(k)$ is extracted from the test image, and then the SVM impulse detector decides whether $x(k)$ is a noise-free or noise-corrupted pixel according to the discrimination function learned in the training stage. The flag map $I(i, j)$ records the
location of the impulse noise in the test image and is called a noise flag map:

$$I(i, j) = \begin{cases} 1, & \text{if the SVM classifies the pixel as noisy} \\ 0, & \text{otherwise.} \end{cases} \qquad (4.6)$$
Step 2: Noise-free reduction filter. If the input pixel is classified as an impulse according to the noise flag map $I(i, j)$, the noise-free reduction median filter replaces it with the median value; otherwise, the input pixel is not replaced, and its original intensity is the output. The median value here is selected from only the noise-free pixels, as decided by the flag map $I(i, j)$, within the filter window $w(k)$ (Wang & Zhang, 1999). The noise-free pixels can be sorted in ascending order, which defines the vector

$$g(k) = (g_1(k), g_2(k), \ldots, g_R(k)), \qquad (4.7)$$
where $g_1(k) \le g_2(k) \le \cdots \le g_R(k)$ are elements of $w(k)$ that are good pixels according to their corresponding $I(i, j)$, and $R$ denotes the number of pixels with $I(i, j) = 0$ in the filter window $w(k)$. If $R$ is odd, then

$$m(k) = \mathrm{MED}\, g(k) \qquad (4.8)$$
$$= \mathrm{MED}\{g_1(k), g_2(k), \ldots, g_R(k)\}. \qquad (4.9)$$
If $R$ is even but not 0, then

$$m(k) = (g_{R/2}(k) + g_{R/2+1}(k))/2, \qquad (4.10)$$

where $g_{R/2}(k)$ is the $(R/2)$th smallest value and $g_{R/2+1}(k)$ is the $(R/2+1)$th smallest value of the sorted noise-free pixels $g(k)$. The value of $x(k)$ is modified only when the pixel $x(k)$ is an impulse and $R$ is greater than 0. The output $y_0(k)$ of the first-pass filter is then obtained by

$$y_0(k) = \begin{cases} m(k), & \text{if } I(i, j) = 1 \text{ and } R > 0 \\ x(k), & \text{otherwise.} \end{cases} \qquad (4.11)$$
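The first-pass steps above can be sketched in Python as follows (a minimal NumPy implementation under our own boundary-handling assumptions; the flag map is taken as given):

```python
import numpy as np

def nfr_filter(image, flag, size=3):
    """Noise-free reduction (NFR) median filter (equations 4.7-4.11).

    `image` is a 2-D array and `flag` the binary noise map I(i, j) from the
    SVM impulse detector; flagged pixels are replaced by the median of the
    noise-free pixels in the surrounding window.
    """
    image = np.asarray(image, dtype=float)
    flag = np.asarray(flag)
    pad = size // 2
    out = image.copy()
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            if flag[i, j] != 1:
                continue                              # good pixels pass unchanged
            i0, i1 = max(i - pad, 0), min(i + pad + 1, h)
            j0, j1 = max(j - pad, 0), min(j + pad + 1, w)
            win = image[i0:i1, j0:j1]
            good = win[flag[i0:i1, j0:j1] == 0]       # noise-free pixels g(k)
            if good.size > 0:                         # R > 0
                out[i, j] = np.median(good)           # m(k) for both odd and even R
    return out
```

Note that `np.median` averages the two middle values for even $R$, matching equation 4.10.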
4.3.2 The Second-Pass Filtering. The first-pass filtering involves the SVM impulse detector and the noise-free reduction filter. Owing to possible mistakes by the SVM impulse detector, two problems need to be addressed. First, undetected noisy pixels may remain in the restored image because the SVM impulse detector has not flagged them as noisy. Second, misdetected pixels may appear in the restored image, causing the noise-free reduction filter to modify input pixels even though they are good. Therefore, in addition to the SVM impulse detector and the noise-free reduction filter, we need a decision impulse filter to improve the performance of the ATM filter by reducing both undetected and misdetected pixels.
The decision impulse filter of the second-pass filtering detects the corrupted pixels based on the difference between the output $y_0(k)$ of the first-pass filter and the output value $y_{0\mathrm{med}}(k)$ of the median filter in the filter window $w(k)$, as shown in Figure 4b. The difference is compared against a thresholding value $T$. The second-pass ATM filter then uses the threshold to decide the final output $y(k)$:

$$y(k) = \begin{cases} y_{0\mathrm{med}}(k), & \text{if } |y_0(k) - y_{0\mathrm{med}}(k)| \ge T \\ y_0(k), & \text{otherwise.} \end{cases} \qquad (4.12)$$
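Equation 4.12 amounts to a hard switch, which can be written directly (a sketch assuming NumPy; the names are ours):

```python
import numpy as np

def second_pass(y0, y0_med, T):
    """Hard-switching decision filter of equation 4.12: keep the first-pass
    output y0(k) unless it differs from the median output by at least T."""
    y0 = np.asarray(y0, dtype=float)
    y0_med = np.asarray(y0_med, dtype=float)
    return np.where(np.abs(y0 - y0_med) >= T, y0_med, y0)
```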
A potential advantage of the decision impulse filter is that pixels that are not corrupted by impulse noise are unaffected by the filtering; moreover, residual noisy pixels can also be removed. Thus, the new ATM filter can suppress both fixed-valued and random-valued impulses without degrading the fine details, unlike other median-based filters.

5 Experimental Results

The proposed ATM filter has been tested to see how well it removes impulse noise and enhances image restoration performance. Extensive experiments have been conducted on a number of 512 × 512 test images to evaluate and compare the performance of the proposed ATM filter with those of a number of existing impulse removal techniques, which are variants of the standard median filter. The peak signal-to-noise ratio (PSNR) criterion is adopted to measure the restoration performance quantitatively. The optimal separating hyperplane is obtained from a training reference image in the training process; that is, the images being filtered are outside the training set. The ATM filter is more powerful than other median-based filters in generalization capability. Moreover, the performance is not strongly dependent on the training reference image; for example, the filtering results obtained with the Couple training image are close to those obtained with Lena or other training images. Note that to demonstrate the generalization capability of the ATM filter, the optimized discrimination function is used to restore images outside the training set, with the Couple image, shown in Figure 7, used as the training reference throughout the following experiments.
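For reference, the PSNR criterion used throughout the experiments can be computed as follows (the standard definition for 8-bit images, sketched in NumPy; this is not code from the article):

```python
import numpy as np

def psnr(original, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between an original and a restored
    image; assumes 8-bit intensities (peak = 255)."""
    err = np.asarray(original, dtype=float) - np.asarray(restored, dtype=float)
    mse = np.mean(err ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```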
The following experiments examine the effectiveness of our new filter: (1) the effects of the threshold T on the ATM filter, (2) a comparison with several other median-based filters in terms of noise suppression and detail preservation, and (3) a demonstration of the robustness of the ATM filter with respect to different impulse noise percentages. The influence of the threshold T on the second-pass filter is investigated in the first experiment. The PSNR values obtained by adjusting the threshold T for different test images are graphically presented
Figure 7: Original training Couple image.
in Figure 8. The proper threshold values, which do not require a laborious search, are obtained experimentally for different test images. With the value 255 as the threshold T, the filter degenerates into the ATM(I) filter (the first pass of ATM). Note that a threshold T of 0 corresponds to a standard median filter. The median filter may degrade the quality of fine details even as it suppresses noise. Figure 8 shows that the PSNR is significantly improved by using our ATM filter with appropriate thresholds. For the majority of images we tested, the threshold T employed by the ATM filter lies approximately in the range [20, 46] at p = 20%. We observe that the threshold range is quite consistent across fixed-valued and random-valued impulses; thus, the performance is rather stable. The second experiment compares the ATM filter with the standard median (MED) filter, the CWM filter (Ko & Lee, 1991), switching scheme I (SWM-I) (Sun & Neuvo, 1994), switching scheme II (SWM-II) (Sun & Neuvo, 1994), the tristate median (TSM) filter (Chen et al., 1999), and the fuzzy median (FM) filter (Arakawa, 1996) in terms of noise removal capability (measured in PSNR). Table 1 compares the PSNR results of removing both fixed-valued and random-valued impulse noise at p = 20%. Table 1 reveals that the ATM filter achieves significant improvement over the other filters in suppressing both types of impulse noise. Note that
Figure 8: Effect of threshold T with respect to PSNR for the ATM filter. Training processes are all conducted with the Couple reference image. (a) Fixed-valued impulse noise. (b) Random-valued impulse noise. (Boat, Goldhill, Lake, Bridge, and Lena are the different test images.)
ATM(I) represents the first-pass filter and ATM represents our final proposed filter in Table 1. Figure 9 shows the comparative restoration results of the MED, FM, ATM(I), and ATM filters for the image Lena corrupted by 20% random-valued impulse noise. The ATM filter produces a restored image of better subjective visual quality through more noise
Table 1: Comparative Restoration Results in PSNR (dB) for 20% Impulse Noises.

(a) Fixed-valued noise

Image     MED    CWM    SWM-I  SWM-II  TSM    FM     ATM(I)  ATM
Boat      29.20  29.81  31.57  30.99   31.16  30.86  32.06   33.67
Goldhill  28.84  29.87  31.95  30.90   31.55  30.95  31.91   34.08
Lake      27.19  28.11  28.82  29.03   29.73  28.61  28.93   31.26
Bridge    24.98  25.67  26.76  26.74   27.65  27.41  27.98   28.41
Lena      30.18  30.38  33.37  32.43   31.84  31.32  32.53   36.57

(b) Random-valued noise

Image     MED    CWM    SWM-I  SWM-II  TSM    FM     ATM(I)  ATM
Boat      30.14  30.99  31.57  30.99   32.29  32.09  31.75   32.56
Goldhill  29.71  30.89  31.95  30.90   32.44  32.23  31.74   32.41
Lake      27.84  28.89  28.82  29.03   30.22  27.37  26.69   30.22
Bridge    25.33  26.34  26.76  26.74   27.39  27.67  27.43   27.77
Lena      31.72  32.42  33.37  32.43   34.13  33.38  33.02   34.45

Notes: Boat, Goldhill, Lake, Bridge, and Lena are the different test images. The FM and ATM filters are trained on the Couple image at p = 20%.
suppression and detail preservation. In fact, even the ATM(I) filter alone is enough to obtain satisfactory restored images. The third experiment demonstrates the robustness of the trained discrimination function at different percentages of impulse noise, regardless of what is used in the training set. In this experiment, a 20% impulsive reference image, Couple, is taken as the training image, independent of the actual corruption percentage. Figure 10 shows the comparative PSNR results for the restored image Boat when corrupted by fixed-valued and random-valued impulse noise of 10% to 30%. As Figure 10 shows, the ATM filter exhibits satisfactory robustness even if an exact learning image is not available or there is no accurate information about the impulse noise in training. Figure 11 shows the restoration results for the image Boat corrupted by random-valued impulse noise of 10% to 30%.
Figure 9: Subjective visual qualities of restored images of Lena. (a) Original image. (b) Image corrupted by 20% random-valued impulse noise, and filtered by (c) the MED filter, (d) the FM filter, (e) the ATM(I) filter, and (f) the ATM filter.
6 Conclusion

In this article, a novel adaptive two-pass median filter has been proposed that improves on median-based filters by preserving more image details while effectively suppressing impulse noise. The proposed filter is composed of
Figure 10: Comparative restoration results in PSNR (dB) for filtering the Boat image corrupted by impulse noise at different percentages. Training processes for the FM and ATM filters are all conducted with the Couple reference image at p = 20%. (a) Fixed-valued impulse noise. (b) Random-valued impulse noise.
Figure 11: Robust restored Boat image. (a) Original image. (b) Noisy image (20%). (c) Filtered image degraded by 10%. (d) Filtered image degraded by 20%. (e) Filtered image degraded by 25%. (f) Filtered image degraded by 30% of random-valued impulsive noise.
a noise decision maker and a two-pass median filter. A new approach, based on the SVM impulse detector, is attached to our new filter to judge whether the input pixel is noise. Owing to the excellent generalization capability of SVMs, the decision maker can reliably decide whether a pixel is noisy, so that the mean square error of the filter output can be minimized. The extensive experimental results presented here demonstrate that the proposed ATM filter is superior to a number of well-accepted median-based filters in the literature. The ATM filter shows desirable robustness in suppressing noise while providing satisfactory image quality.

Acknowledgments

We thank Chih-Jen Lin for providing the software LIBSVM, a library for support vector machines (version 2.32).

References

Abreu, E., & Mitra, S. K. (1995). A signal-dependent rank ordered mean (SDROM) filter: A new approach for removal of impulses from highly corrupted images. In Proceedings of IEEE ICASSP. Detroit, MI: IEEE.
Arakawa, K. (1996). Median filters based on fuzzy rules and its application to image restoration. Fuzzy Sets and Systems, 77, 3–13.
Astola, J., & Kuosmanen, P. (1997). Fundamentals of nonlinear digital filtering. Boca Raton, FL: CRC.
Chang, C.-C., Hsu, C.-W., & Lin, C.-J. (2000). The analysis of decomposition methods for support vector machines. IEEE Trans. Neural Networks, 11, 1003–1008.
Chang, C.-C., & Lin, C.-J. (2002). Training nu-support vector regression: Theory and algorithms. Neural Computation, 14, 1959–1977.
Chen, T., Ma, K. K., & Chen, L. H. (1999). Tri-state median filter for image denoising. IEEE Trans. Image Processing, 8, 1834–1838.
Chen, T., & Wu, H. R. (2001). Application of partition-based median type filters for suppressing noise in images. IEEE Trans. Image Processing, 6(10), 829–836.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Hsu, C.-W., & Lin, C.-J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46, 291–314.
Ko, S. J., & Lee, Y. H. (1991). Center weighted median filters and their applications to image enhancement. IEEE Trans. Circuits and Systems, 38, 984–993.
Lee, K.-C., Hwang, J., & Sohn, K.-H. (1998). Detection-estimation based approach for impulsive noise removal. Electronics Letters, 5(34), 449–450.
Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Trans. Neural Networks, 6, 1288–1298.
Lin, C.-J. (2002). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Trans. Neural Networks, 13, 1045–1052.
Lin, T.-C., & Yu, P.-T. (2003). Centroid neural network adaptive resonance theory for vector quantization. Signal Processing, 83, 649–654.
Lin, T.-C., & Yu, P.-T. (in press). Partition fuzzy median filter based on fuzzy rules for image restoration. Fuzzy Sets and Systems.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Networks, 12, 181–201.
Sun, T., & Neuvo, Y. (1994). Detail-preserving median based filters in image processing. Pattern Recognition Letters, 15, 341–347.
Tsai, H.-H., & Yu, P.-T. (1999). Adaptive fuzzy hybrid multichannel filters for removal of impulsive noise from color images. Signal Processing, 74, 127–151.
Tsai, H.-H., & Yu, P.-T. (2000). Genetic-based fuzzy hybrid multichannel filter for color image restoration. Fuzzy Sets and Systems, 114, 203–224.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wang, Z., & Zhang, D. (1999). Progressive switching median filter for the removal of impulse noise from highly corrupted images. IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing, 1(46), 78–80.
Yin, L., Yang, R., Gabbouj, M., & Neuvo, Y. (1996). Weighted median filters: A tutorial. IEEE Trans. Circuits Sys., 3(43), 157–192.
Yu, P.-T., & Chen, R. C. (1995). An optimal design of fuzzy (m, n) rank order filtering with hard decision neural learning. In Proceedings of IEEE ISCAS, International Symposium on Circuits and Systems (pp. 567–593). Seattle, WA: IEEE.
Yu, P.-T., Lee, C.-S., & Kuo, Y.-H. (1995). Weighted fuzzy mean filters for heavy-tailed noise removal. In Proceedings of IEEE ISUMA-NAFIPS, Third International Symposium on Uncertainty Modeling and Analysis (pp. 601–606).
Received May 21, 2003; accepted July 10, 2003.
Communicated by David Saad
LETTER
Improving Generalization Performance of Natural Gradient Learning Using Optimized Regularization by NIC Hyeyoung Park [email protected] Brain Science Institute, RIKEN, Saitama, Japan
Noboru Murata [email protected] Waseda University, Tokyo, Japan
Shun-ichi Amari [email protected] Brain Science Institute, RIKEN, Saitama, Japan
Natural gradient learning is known to be efficient in escaping plateaus, which are a main cause of the slow learning speed of neural networks. The adaptive natural gradient learning method for practical implementation has also been developed, and its advantage in real-world problems has been confirmed. In this letter, we deal with the generalization performance of the natural gradient method. Since natural gradient learning makes the parameters fit the training data quickly, the overfitting phenomenon may easily occur, which results in poor generalization performance. To solve this problem, we introduce a regularization term in natural gradient learning and propose an efficient method for optimizing the scale of regularization by using a generalized Akaike information criterion (the network information criterion, NIC). We discuss the properties of the regularization strength optimized by NIC through theoretical analysis as well as computer simulations. We confirm the computational efficiency and generalization performance of the proposed method in real-world applications through computational experiments on benchmark problems.

Neural Computation 16, 355–382 (2004)

1 Introduction

The natural gradient method is a stochastic gradient method originating in information geometry. Amari (1998) proposed the concept of natural gradient learning and proved that it is Fisher efficient in the case that the negative log likelihood is used as the loss function. Its dynamical properties have also been studied by statistical-mechanical analysis in the large-system-size limit, and it has been shown that the natural gradient can escape from plateaus more efficiently than the standard gradient (Rattray, Saad, &
© 2003 Massachusetts Institute of Technology
Amari, 1998). The convergence properties of on-line learning methods, including natural gradient, have also been analyzed using stochastic approximation theory (Bottou, 1998). In order to implement the concept of natural gradient for learning in feedforward neural networks, Amari, Park, and Fukumizu (2000) proposed an adaptive method of obtaining an estimate of the natural gradient. It is called the adaptive natural gradient method, and they extended it to various stochastic neural network models. A number of computational experiments on well-known benchmark problems have confirmed that adaptive natural gradient learning can be applied successfully to various practical applications and that the method can alleviate or even avoid plateaus. Besides learning speed, generalization performance is another important problem in neural networks. When a network structure, an error function, and training data are fixed, all learning algorithms based on gradient descent have the same equilibrium points in the error surface. Thus, in a theoretical sense, it is hard to say that the generalization performance of the natural gradient method differs much from that of the standard gradient method. In a practical sense, however, the results may be different. Let us assume the following typical practical situation. First, we do not know the optimal complexity of network models for a given problem, and thus we use a large network with sufficient complexity. Second, since we do not know the minimum error that can be achieved by the network, we stop learning when the decrement of training error has been very small for a while and no improvement by learning is expected. In this situation, the solution obtained by the natural gradient method can frequently differ from that obtained by the standard method, because standard gradient learning is subject to being trapped in a long plateau, which can easily be misinterpreted as the end of the learning process.
In this case, it is obvious that the generalization performances of the methods differ. Therefore, more careful consideration of generalization performance is necessary when natural gradient learning is applied to practical problems. Park (2001) has investigated this situation and has suggested addressing it by using a regularization term. When applying the regularization method, it is very important to optimize the scale of regularization in order to obtain good generalization performance. Various methods have been developed for obtaining the optimal scale of regularization (Sigurdsson, Larsen, & Hansen, 2000). A simple method is to use a validation set for optimizing the scale. Composing a validation set, however, requires setting aside part of the learning data set; thus, this method is not desirable when the learning data are insufficient. To solve this problem, the cross-validation method (Stone, 1974) has been widely used. To get an accurate estimate of the regularization parameter, however, cross-validation is computationally expensive. There have also been theoretical approaches to estimating the generalization error (Bishop, 1995; Moody, 1992; Rissanen, 1978; Vapnik,
1995) and applying it to optimization of the regularization strength. However, these methods have shortcomings in terms of efficient practical implementation, and some of them need additional assumptions. In addition, there have been few theoretical analyses of the properties of the optimal values obtained using those estimators. In this letter, we use the concept of the network information criterion (NIC) in order to optimize the regularization strength. The NIC (Murata, Yoshizawa, & Amari, 1994), one of the estimators of generalization error, has been developed for general error functions and network models. Since natural gradient learning (Amari, 1998) and its adaptive version for stochastic neural networks (Park, Amari, & Fukumizu, 2000) have been derived from the same theoretical background as NIC, the natural gradient and NIC can be combined and applied to various stochastic learning models. In addition, since the theoretical meaning of applying NIC to optimize the regularization strength has been discussed by Murata (2001), the proposed method has strong theoretical justification as well. In this letter, we confirm the theoretical results by computer simulations. Furthermore, since the NIC and natural gradient learning use the common Fisher information matrix, we can also achieve computational efficiency by sharing the matrix information, as we discuss in section 3.

2 Theory of Learning with Regularization

2.1 Natural Gradient Learning. We begin with a brief introduction of the natural gradient method. Since natural gradient learning is a kind of stochastic gradient descent learning method, we consider a stochastic neural network model defined by

$$y = S\{f(x, \theta)\}, \qquad (2.1)$$

where $x$ is an $n$-dimensional input subject to a probability distribution $q(x)$; $\theta = (\theta^1, \ldots, \theta^m)^\tau$ is an $m$-dimensional column vector that plays the role of a coordinate system in the space of networks; $\tau$ denotes transposition; and the deterministic function $f(x, \theta)$ specifies a network structure. Although the most popular type of $f$ is given by $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$, which is called a three-layer perceptron, the discussion in this letter places no restriction on the shape of the function $f$ except for some regularity conditions such as differentiability. The output $y$ is emitted through a stochastic process denoted by $S\{\cdot\}$. The stochastic process can be defined by a conditional probability density function $p(y \mid x; \theta)$ conditioned on the input $x$ and the weight parameter $\theta$, which can be regarded as the parameter of the density function. Although the most typical example of $p(y \mid x; \theta)$ is the gaussian noise model of the form

$$p(y \mid x; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\{y - f(x, \theta)\}^2\right), \qquad (2.2)$$
natural gradient learning and NIC can be applied to various types of stochastic models (see Park et al., 2000, and Murata, et al., 1994, for details). Moreover, we proceed with the scalar output for simplicity, but its generalization to multivariate outputs can also be easily done. From this stochastic viewpoint, we can consider a space of probability density functions fp.x; yI µ / j µ 2 <m g and dene an appropriate pointwise loss function d.z ; µ / D d.x; y; µ/ for each piece of data z D .x; y/ and parameter µ on the space. The ultimate goal of learning is to obtain the optimal parameter dened by
θ* = argmin_θ E_Z[d(Z, θ)],   (2.3)
where E_Z denotes the expectation with respect to the true distribution of the random variable Z. Here, optimality is discussed in terms of minimizing the expected loss under a given environment represented by a distribution of input and output, p(z; θ) = p(x, y; θ) = p(y | x; θ) q(x). A natural way of estimating the optimal parameter is to adopt the empirical minimum loss estimator defined by

θ̂ = argmin_θ L(D, θ) = argmin_θ (1/n) Σ_{p=1}^n d(z_p, θ),   (2.4)
when a sample set D = {(x_p, y_p)}_{p=1,…,n} is observed. To obtain θ̂, gradient descent learning is widely used. The natural gradient learning method is based on the fact that the space of p(x, y; θ) is a Riemannian space in which the metric tensor is given by the Fisher information matrix F(θ) defined by

F(θ) = E_x[ E_{y|x;θ}[ ∇ log p(y | x; θ) ∇ log p(y | x; θ)^T ] ].   (2.5)
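To make the definition concrete, the sketch below estimates the Fisher information matrix empirically under the gaussian noise model of equation 2.2, for which ∇ log p(y | x; θ) = {y − f(x; θ)} ∇f(x; θ) and the conditional expectation over y reduces F(θ) to E_x[∇f ∇f^T]. The linear toy "network" and the input distribution are illustrative assumptions, not part of the original analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x, theta):
    """Gradient of the network output with respect to theta.
    For an illustrative linear 'network' f(x; theta) = theta . x, it is x."""
    return x

def empirical_fisher(xs, theta):
    """F~(theta) = (1/n) sum_p grad_f(x_p) grad_f(x_p)^T
    (gaussian noise model with unit variance, as in equation 2.2)."""
    F = np.zeros((len(theta), len(theta)))
    for x in xs:
        g = grad_f(x, theta)
        F += np.outer(g, g)
    return F / len(xs)

theta = np.array([1.0, -0.5, 0.3])
xs = rng.normal(size=(1000, 3))   # inputs drawn from an assumed q(x) = N(0, I)
F = empirical_fisher(xs, theta)
print(np.round(F, 2))
```

For q(x) = N(0, I) the estimate approaches the identity matrix, which makes it easy to sanity-check.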
Here, we use ∇ to denote the differential operator with respect to the parameter θ, and we will also use ∂_i in some cases to denote the partial differential operator with respect to the ith element of θ. E_x[·] and E_{y|x;θ}[·] denote the expectations with respect to q(x) and p(y | x; θ), respectively. Using the inverse of the Fisher information matrix of equation 2.5, we can obtain the natural gradient ∇̃L and its learning algorithm for stochastic systems,

θ_{t+1} = θ_t − η_t ∇̃L(θ_t) = θ_t − η_t (F(θ_t))^{-1} ∇L(θ_t),   (2.6)
where ∇L(θ) is the derivative of the loss function L with respect to θ. Since it is difficult to obtain the Fisher information matrix F(θ) and its inverse in a practical implementation, we need an approximation method.

Natural Gradient with Regularization Using NIC

The adaptive natural gradient (Amari et al., 2000) was originally developed for estimating the inverse of the Fisher information matrix adaptively in online learning. In this letter, we present a similar estimation method for batch learning. Since we have a set of training data {(x_p, y_p)}_{p=1,…,n} in batch learning, we can use the empirical distribution to derive an estimator F̃(θ) of the Fisher information matrix,

F̃_n(θ) = (1/n) Σ_{p=1}^n h_p(θ) h_p(θ)^T,   (2.7)
where h_p(θ) h_p(θ)^T = E_{y|x;θ}[ ∇ log p(y | x_p; θ) ∇ log p(y | x_p; θ)^T ]. The explicit form of h_p(θ) can be obtained analytically by defining a specific form of p(y | x_p; θ). In the case of the gaussian noise model of equation 2.2, h_p(θ) is simply given by ∇f(x_p; θ) (see Park et al., 2000, for details). Then its inverse, F̃_n^{-1} = (F̃_n(θ))^{-1}, can be calculated successively by using the iterative equation,

F̃_n^{-1} = (n/(n−1)) [ F̃_{n−1}^{-1} − (F̃_{n−1}^{-1} h_n h_n^T F̃_{n−1}^{-1}) / (n − 1 + h_n^T F̃_{n−1}^{-1} h_n) ],   (2.8)
where the initial matrix F_0 is a positive definite symmetric matrix such as the identity matrix. This iterative technique can also be applied to calculating the NIC, as we show in section 3. We should emphasize here that the definition of the Fisher information matrix depends not on the error function directly but on the stochastic neural network model that we use. Therefore, the natural gradient can be applied to arbitrary error functions satisfying some regularity conditions, such as differentiability. Although a typical error function can be defined as L(θ) of equation 2.4, we can also use other types of error functions, as we show in section 3.

2.2 Regularization Method. Due to noise in the observed samples and the limited number of samples, the estimate θ̂ often diverges somewhat from θ* depending on the sample set, which results in large generalization errors. In order to decrease the generalization error E_Z[d(Z, θ̂)], the regularization method is widely used. In the regularization method, we adopt an error function including a regularization term, which can be defined by

L_reg(D, α, θ) = (1/n) Σ_{p=1}^n d(z_p, θ) + (α/n) r(θ),   (2.9)

including a regularization parameter α. Note that in this definition, the contribution of the regularization is scaled as 1/n. By minimizing this error
function, we obtain an estimator,

θ̄ = argmin_θ { (1/n) Σ_{p=1}^n d(z_p, θ) + (α/n) r(θ) }.   (2.10)
In order to get good generalization performance using the regularization method, it is very important to optimize the regularization strength α with respect to the expected generalization error E_D[E_Z[d(Z, θ̄)]] = E_θ̄[E_Z[d(Z, θ̄)]], where both E_D[·] and E_θ̄[·] denote the expectation with respect to θ̄, which depends on the sample set D. However, there have been few theoretical works on the optimization of α. In this section, we review the relationship between the regularization strength and the generalization error discussed in Amari and Murata (1997) and Murata (2001), and propose a modified NIC for optimizing the regularization strength α. Before giving a formula for the optimal value of α, let us define two matrices that appear several times in the following discussions:

G* = E_Z[ ∇d(Z, θ*) ∇d(Z, θ*)^T ],   Q* = E_Z[ ∇∇d(Z, θ*) ].   (2.11)
We also use the notation q*_{ij}, g*_{ij}, q*^{ij}, and g*^{ij} for the elements at the ith row and the jth column of the matrices Q*, G*, Q*^{-1}, and G*^{-1}, respectively. The asymptotic relationships among θ*, θ̂, and θ̄ are given by the following lemmas.

Lemma 1. The ensemble average and covariance matrix of the estimator θ̂ derived by minimizing the empirical loss are given by

E_D[θ̂] = θ* + (1/n) b + o(1/n),   (2.12)
V_D[θ̂] = (1/n) Q*^{-1} G* Q*^{-1} + o(1/n),   (2.13)
where the ith element of the vector b is given by

b^i = Σ_{jkl} { q*^{ij} q*^{kl} ( s*_{jkl} − (1/2) Σ_{k'l'} q*^{k'l'} t*_{jkk'} g*_{ll'} ) },   (2.14)

and t*_{ijk} = E_Z[∂_i ∂_j ∂_k d(Z, θ*)], s*_{ijk} = E_Z[∂_i ∂_j d(Z, θ*) ∂_k d(Z, θ*)].
Lemma 2.
The estimator is modified by the regularization term as

θ̄ − θ̂ = −(α/n) Q*^{-1} ∇r(θ̄) + O_p(1/n²).   (2.15)
Note that the difference between θ̄ and θ̂ given in this lemma is a bias caused by the regularization. The proofs of these two lemmas are given in appendixes A and B. Note the difference between o(·) in equations 2.12 and 2.13 and O_p(·) in equation 2.15. Since θ* and E_D[θ̄] do not depend on D, the higher-order term o(1/n) is a deterministic quantity. On the contrary, θ̂ and θ̄ depend on D, and thus the higher-order term O_p(1/n²) is a stochastic quantity. Using the relationships among the estimators, we can obtain the expected generalization error at θ̄.

Lemma 3.
The expected generalization error at θ̄ is estimated by

E_D E_Z[d(Z, θ̄)] = E_Z[d(Z, θ*)] + (1/2n) tr{Q*^{-1} G*}
  + (1/2n²) { (b − α Q*^{-1} ∇r(θ*))^T Q* (b − α Q*^{-1} ∇r(θ*))
  − 2α tr{Q*^{-1} G* Q*^{-1} ∇∇r(θ*)} } + (higher-order terms).   (2.16)
Here, the higher-order terms include nondominant terms with respect to n and α. The proof of this lemma is given in appendix C. From these results, we can find the value of α that minimizes the term of order O(1/n²) in equation 2.16,

(b − α Q*^{-1} ∇r(θ*))^T Q* (b − α Q*^{-1} ∇r(θ*)) − 2α tr{Q*^{-1} G* Q*^{-1} ∇∇r(θ*)}.   (2.17)
It is instructive to give some intuitive meaning to the two terms in expression 2.17. The first term is the sum of two biases: the original bias (b) of θ̂ and the additional bias (−α Q*^{-1} ∇r(θ*)) of θ̄. The second term is related to the mutual interaction of the variance (Q*^{-1} G* Q*^{-1}) of θ̂ and the Hessian of the regularization function (−2α ∇∇r(θ*)), which is the α-dependent part of the total variance. In this sense, expression 2.17 expresses a kind of bias-variance dilemma. When α is too small, both the bias and the variance are large (see Figure 1a). As α increases, the bias and variance decrease, and their sum decreases as well. However, when α is too large, the bias starts to increase again (see Figure 1c). Therefore, we need to find an optimal value of α that minimizes the sum of the bias and variance, and the optimization of α using the expected generalization error 2.16 can
Figure 1: Bias-variance dilemma related to α. Panels: (a) small α, (b) optimal α, (c) large α, (d) bias and variance with respect to α, where bias = b − α Q*^{-1} ∇r(θ*) and variance = const. − 2α tr{Q*^{-1} G* Q*^{-1} ∇∇r(θ*)}.
be regarded as a solution of this bias-variance dilemma in the higher order (see Figures 1b and 1d). The optimal α is given by theorem 1.

Theorem 1. The optimal regularization strength α in equation 2.9, which asymptotically minimizes the expected generalization error, is given by

α_opt = ( b^T ∇r(θ*) + tr{Q*^{-1} G* Q*^{-1} ∇∇r(θ*)} ) / ( ∇r(θ*)^T Q*^{-1} ∇r(θ*) ).   (2.18)
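As a concrete reading of equation 2.18, the sketch below evaluates α_opt for the weight decay regularizer r(θ) = ||θ||²/2, so that ∇r(θ*) = θ* and ∇∇r(θ*) = I. All numerical values (θ*, Q*, G*, b) are hypothetical stand-ins; in practice these true quantities are unknown, which motivates the NIC-based estimate discussed next:

```python
import numpy as np

def alpha_opt(b, Q, G, grad_r, hess_r):
    """Theorem 1 (equation 2.18): optimal regularization strength."""
    Qinv = np.linalg.inv(Q)
    num = b @ grad_r + np.trace(Qinv @ G @ Qinv @ hess_r)
    den = grad_r @ Qinv @ grad_r
    return num / den

# Hypothetical stand-ins for the unknown true quantities.
theta_star = np.array([0.8, -0.4])
Q_star = np.array([[2.0, 0.3], [0.3, 1.5]])   # Hessian-type matrix Q*
G_star = np.array([[1.2, 0.1], [0.1, 0.9]])   # gradient covariance G*
b = np.array([0.05, -0.02])                   # O(1/n) bias vector of theta-hat

grad_r = theta_star            # grad r(theta) = theta for r(theta) = ||theta||^2 / 2
hess_r = np.eye(2)             # Hessian of r is the identity

a = alpha_opt(b, Q_star, G_star, grad_r, hess_r)
print(a)
```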
In practice, the above α_opt is not accessible because we know neither the optimal parameter θ* nor the true distribution p(z). Obviously, it is possible to use the empirical distribution and θ̄ instead of p(z) and θ* so as to get an approximation of α_opt. However, this is still computationally expensive for learning systems with a large number of modifiable parameters, because the second and third differentials of the loss function need to be calculated. Using the concept of NIC, we propose a method for obtaining an estimate of the optimal scale α in practice without directly calculating the third differentials of the loss function. The NIC, which is a generalized AIC (Akaike information criterion; Akaike, 1974), is an estimator of the expected generalization error for arbitrary error functions. From the precise derivation of the NIC (Murata et al., 1994), we have an estimator of the generalization error for the error function L_gen, which is written as

E_Z[ d(Z, θ̄) + (α/n) r(θ̄) ] ≈ (1/n) Σ_{p=1}^n d(z_p, θ̄) + (α/n) r(θ̄) + (1/2n) tr{Q̄^{-1} Ḡ} + (1/2n) tr{Q̄*^{-1} Ḡ*},   (2.19)
where

Ḡ = (1/n) Σ_{p=1}^n ( ∇d(z_p, θ̄) + (α/n) ∇r(θ̄) ) ( ∇d(z_p, θ̄) + (α/n) ∇r(θ̄) )^T,
Q̄ = (1/n) Σ_{p=1}^n ( ∇∇d(z_p, θ̄) + (α/n) ∇∇r(θ̄) ),
Ḡ* = E_Z[ ( ∇d(Z, θ*) + (α/n) ∇r(θ*) ) ( ∇d(Z, θ*) + (α/n) ∇r(θ*) )^T ],
Q̄* = E_Z[ ∇∇d(Z, θ*) + (α/n) ∇∇r(θ*) ].

Roughly speaking, the third term in equation 2.19 is given from an asymptotic expansion around θ̄, and the fourth term from an asymptotic expansion around θ*. The fourth term is not accessible because we do not know the true parameter θ*. Therefore, in practice, the matrices Ḡ* and Q̄* are replaced with Ḡ and Q̄, respectively, leading to the original definition of the NIC given by Murata et al. (1994):

NIC(α; D) = (1/n) Σ_{p=1}^n d(z_p, θ̄) + (α/n) r(θ̄) + (1/n) tr{Q̄^{-1} Ḡ}.   (2.20)
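For a concrete instance of equation 2.20, the sketch below evaluates the NIC for a linear-in-θ model with squared loss d(z, θ) = {y − θ·x}²/2 and weight decay r(θ) = ||θ||²/2, using the definitions of Ḡ and Q̄ above. The data-generating setup is an illustrative assumption:

```python
import numpy as np

def nic(alpha, xs, ys, theta):
    """NIC (equation 2.20) for d(z, theta) = (y - theta.x)^2 / 2 with weight
    decay r(theta) = ||theta||^2 / 2 (a linear toy model, for illustration)."""
    n, m = len(xs), len(theta)
    train_err = np.mean([(y - theta @ x) ** 2 / 2 for x, y in zip(xs, ys)])
    G = np.zeros((m, m))
    Q = np.zeros((m, m))
    for x, y in zip(xs, ys):
        g = -(y - theta @ x) * x + (alpha / n) * theta   # grad d + (alpha/n) grad r
        G += np.outer(g, g)
        Q += np.outer(x, x) + (alpha / n) * np.eye(m)    # hess d + (alpha/n) hess r
    G, Q = G / n, Q / n
    return train_err + (alpha / n) * (theta @ theta) / 2 \
           + np.trace(np.linalg.solve(Q, G)) / n

rng = np.random.default_rng(1)
xs = rng.normal(size=(200, 3))
theta_true = np.array([1.0, 0.0, -1.0])
ys = xs @ theta_true + 0.1 * rng.normal(size=200)
print(nic(alpha=1.0, xs=xs, ys=ys, theta=theta_true))
```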
What we would like to get is an α optimized with respect to the expected generalization error for the original loss d(Z, θ̄), without the regularization penalty (α/n) r(θ̄). From the fact that

E_D E_Z[d(Z, θ̄)] = NIC(α; D) − (α/n) r(θ̄),   (2.21)
we need to neglect the regularization term (α/n) r(θ̄) in equation 2.19. Moreover, noting that the term tr{Q̄*^{-1} Ḡ*} depends not on D but on α, it is desirable to extract the α-dependent part from the term. From this concern, we define a modified NIC for regularization as

NIC_reg(α; D) = (1/n) Σ_{p=1}^n d(z_p, θ̄) + (1/2n) tr{Q̃^{-1} Ḡ},   (2.22)
where

Q̃ = (1/n) Σ_{p=1}^n ( ∇∇d(z_p, θ̄) + (2α/n) ∇∇r(θ̄) ).
Here, θ̄, Ḡ, and Q̃ are functions of α and the data set D. By expanding equation 2.22 around α/n = 0 and minimizing it with respect to α in leading order, we obtain the relationship of theorem 2. The assumption α/n ≈ 0 is valid under the condition that n is sufficiently larger than α, which is usually satisfied in real applications. We should also remark that the matrix Q̃ is different from the Hessian Q̄. The factor 2 in the regularization term of Q̃ is necessary to reflect the influence of the α-dependent part of tr{Q̄*^{-1} Ḡ*}. This modified definition of the NIC is justified in the following theorem and its proof.

Theorem 2. By optimizing α with respect to NIC_reg, equation 2.22, we obtain the asymptotic equality

α̂_opt = α̂_opt(D) = ( b̂^T ∇r(θ̂) + tr{Q̂^{-1} Ĝ Q̂^{-1} ∇∇r(θ̂)} ) / ( ∇r(θ̂)^T Q̂^{-1} ∇r(θ̂) ),   (2.23)
where

Ĝ = (1/n) Σ_{p=1}^n ∇d(z_p, θ̂) ∇d(z_p, θ̂)^T,   Q̂ = (1/n) Σ_{p=1}^n ∇∇d(z_p, θ̂),
and b̂ is defined in the same manner as b, using θ̂ instead of θ*. The proof is given in appendix D. We should discuss the relationship between α̂_opt and α_opt. The true optimum α_opt is obtained by minimizing the asymptotic expansion of the expected generalization error E_D E_Z[d(Z, θ̄)] around θ*, which is given in lemma 3. This quantity is deterministically obtained if we know θ*. In real situations, we do not know θ*, and thus we exploit the concept of the NIC as an estimator of E_D E_Z[d(Z, θ̄)]. Since the NIC uses the practically obtained estimator θ̂ for a given data set, our estimator α̂_opt is also a data-dependent quantity. The relationship between α̂_opt and α_opt can thus be derived from the relationship between θ̂ and θ*. From the relationship θ* = θ̂ + O_p(1/√n) and equation 2.23, we obtain

α̂_opt = α_opt + O_p(1/√n).   (2.24)
From this, we can see that it is possible to substitute minimization of the modified NIC for minimization of the expected generalization error when the number of data is sufficiently large.

3 Proposed Method

From the result of theorem 2, we can say that the optimum α_opt, which minimizes the expected generalization error, is asymptotically equivalent to the optimum α̂_opt, which minimizes NIC_reg, equation 2.22. By exploiting NIC_reg for optimizing α, instead of using equation 2.23 for direct calculation of α̂_opt, we can avoid the heavy computation of b̂, which requires the values of (number of parameters)³ entries, that is, s_ijk and t_ijk. Based on this fact, we propose a computationally efficient algorithm for obtaining θ̄ with the minimum generalization error as follows. Corollary 3.
By minimizing the two loss functions simultaneously as

θ̄(α) = argmin_θ { Σ_{p=1}^n d(z_p, θ) + α r(θ) },   (3.1)
α̂_opt = argmin_α { Σ_{p=1}^n d(z_p, θ̄(α)) + (1/2) tr{Q̃^{-1}(α) Ḡ(α)} },   (3.2)
we can obtain an estimator θ̄ with the optimally scaled α̂_opt that gives the minimum expected generalization error.

In order to obtain θ̄(α) for each α, we employ natural gradient learning. As mentioned in the previous section, natural gradient learning can easily be extended to arbitrary error functions. In the regularization method, we consider an error function with a regularization term, L_reg(D, α, θ), given by equation 2.9. Since the definition of the Fisher information matrix does not depend on the error function, the adaptive natural gradient learning algorithm with the regularization term can be written as

θ_{t+1} = θ_t − η_t (F̃(θ_t))^{-1} ∇{ Σ_{p=1}^n d(z_p, θ_t) + α r(θ_t) },   (3.3)
where the Fisher information matrix and its inverse (F̃(θ_t))^{-1} can be obtained by the method addressed in section 2.1. One can see that the shape of this updating rule for natural gradient learning with regularization is different from that of the second-order methods. In the case of the second-order methods, the Hessian matrix depends on
the cost function, and the explicit form of the matrix includes additional terms given by r(θ) (see Bishop, 1995, for details). On the contrary, since the Fisher information matrix in the natural gradient depends only on the probability density function p(y | x; θ), the forms of L(θ) and r(θ) have no influence on the form of (F̃(θ))^{-1}. However, we should note that the equilibria of the two gradient methods are the same, because both use the same ∇L. On the other hand, the problem of calculating the optimal α is rewritten as the problem of searching for the minimum of a one-parameter function, equation 3.2. Compared to the widely used cross-validation method, the proposed method saves computational cost. In the case of K-fold cross validation, networks must be trained K times in order to obtain an estimate of the generalization error for each α. In the case of the proposed NIC-based method, however, we can obtain an estimate of the generalization error by calculating the NIC defined in equation 2.22. Therefore, the proposed method can find an optimal value of α about K times faster than the K-fold cross-validation method, especially when the learning cost is high. In section 4, we compare the real processing times of the proposed method with those of other validation methods. In searching for the optimal α, we can use the line search method, which is used in the cross-validation method as well. We also need to discuss the computational complexity of calculating the NIC. From equation 3.2, we see that Ḡ(α) and Q̃^{-1}(α) must be calculated at each α. However, noting that the two matrices are closely related to the Fisher information matrix F(θ) used in natural gradient learning, we can obtain them with low computational cost. In this article, we show the adaptive method for calculating Q̃^{-1} and Ḡ for the gaussian noise model, equation 2.2, which is a typical model.
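The overall strategy of corollary 3, train θ̄(α) for each candidate α and score it with the NIC_reg criterion of equation 3.2, can be sketched as follows. Plain gradient descent on a linear model with squared loss and weight decay stands in for adaptive natural gradient learning on a multilayer perceptron, and a coarse grid stands in for the line search; this is an illustration of the procedure under those simplifying assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit(xs, ys, alpha, steps=2000, lr=0.1):
    """theta-bar(alpha): minimize sum_p (y_p - theta.x_p)^2/2 + alpha*||theta||^2/2."""
    theta = np.zeros(xs.shape[1])
    n = len(xs)
    for _ in range(steps):
        grad = -(ys - xs @ theta) @ xs + alpha * theta
        theta -= lr * grad / n
    return theta

def nic_reg(xs, ys, theta, alpha):
    """Criterion of equation 3.2: training error + tr(Q~^{-1} G-bar)/2."""
    n, m = xs.shape
    resid = ys - xs @ theta
    train = float(resid @ resid) / 2
    G = sum(np.outer(-r * x + (alpha / n) * theta,
                     -r * x + (alpha / n) * theta) for r, x in zip(resid, xs)) / n
    Qt = xs.T @ xs / n + (2 * alpha / n) * np.eye(m)   # note the factor 2 (section 2.2)
    return train + np.trace(np.linalg.solve(Qt, G)) / 2

xs = rng.normal(size=(100, 5))
ys = xs @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.3 * rng.normal(size=100)

# Coarse line search over the regularization strength alpha (corollary 3).
alphas = [0.0, 0.1, 0.5, 1.0, 5.0, 20.0]
scores = {a: nic_reg(xs, ys, fit(xs, ys, a), a) for a in alphas}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The one-dimensional search over α is the only outer loop; each candidate requires a single training run plus one evaluation of the criterion.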
The method can be extended to various stochastic models, as discussed in Park et al. (2000). Taking the negative log likelihood as the pointwise loss function and the weight decay term as the regularization term, the cost function is given by

Σ_{p=1}^n d(z_p, θ) + α r(θ) = (1/2) Σ_{p=1}^n {y_p − f(x_p, θ)}² + (α/2) ||θ||².   (3.4)
Then Ḡ can be estimated by

Ḡ = (1/n) Σ_{p=1}^n {y_p − f(x_p, θ̄)}² F̃(θ̄) + (α/n)² θ̄ θ̄^T.   (3.5)
The term F̃(θ̄) can be obtained during the natural gradient learning process, and thus we do not need to recalculate it for obtaining the NIC. For the inverse of Q̃ (= Q̃_n^{-1}), we can apply the iterative method of equation 2.8,

Q̃_n^{-1} = (n/(n−1)) [ Q̃_{n−1}^{-1} − (Q̃_{n−1}^{-1} ∇f_n ∇f_n^T Q̃_{n−1}^{-1}) / (n − 1 + ∇f_n^T Q̃_{n−1}^{-1} ∇f_n) ],   (3.6)
where ∇f_n = ∇f(x_n, θ̄), Q̃_0^{-1} = I/(2α), and I is the identity matrix. Here, we used the Gauss-Newton approximation for the Hessian, which is written as

Q̃ ≈ (1/n) Σ_{p=1}^n ∇f(x_p, θ̄) ∇f(x_p, θ̄)^T + (2α/n) I.   (3.7)
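The recursion of equation 3.6 (like equation 2.8) is an instance of the Sherman-Morrison rank-one update, so Q̃^{-1} can be maintained without ever inverting a matrix. The sketch below uses the algebraically equivalent unnormalized form A = 2αI + Σ_p ∇f_p ∇f_p^T with Q̃^{-1} = n A^{-1}, which avoids the n/(n−1) prefactor at n = 1; the gradient vectors are random stand-ins for ∇f(x_p, θ̄):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, alpha = 4, 50, 0.5

grads = rng.normal(size=(n, m))          # stand-ins for the gradients grad f(x_p; theta-bar)

# Accumulate A = 2*alpha*I + sum_p g_p g_p^T and its inverse by Sherman-Morrison;
# then Q~ = A / n, so Q~^{-1} = n * A^{-1}.  This is the same rank-one recursion
# as equation 3.6, written in unnormalized form.
A_inv = np.eye(m) / (2 * alpha)          # inverse of the initial matrix 2*alpha*I
for g in grads:
    Ag = A_inv @ g                        # A_inv stays symmetric, so Ag Ag^T is correct
    A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)

Q_inv = n * A_inv

# Check against direct inversion of Q~ = (1/n) sum g g^T + (2*alpha/n) I.
Q_direct = grads.T @ grads / n + (2 * alpha / n) * np.eye(m)
print(np.max(np.abs(Q_inv - np.linalg.inv(Q_direct))))
```

Each update costs O(m²) rather than the O(m³) of a fresh inversion, which is what makes the per-example bookkeeping cheap.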
From this, we can say that calculating the NIC takes the same order of cost as a one-parameter update in natural gradient learning.

4 Computational Experiments

We investigate the properties of the proposed method in comparison with other methods through computer simulations. For a given training data set D, we employ natural gradient learning to obtain an estimate of the weight parameter,

θ̄(D, α) = argmin_θ { Σ_{z_p ∈ D} d(z_p, θ) + α r(θ) }.
In order to optimize α, we employ four criteria: the generalization error (L_gen), the modified NIC (NIC_reg), the validation error (L_val), and the cross-validation error (L_cv). In practical implementation, the generalization error is approximated by a sample mean over a test data set D_tst, which can be written as

L_gen(D, α) = E_Z[d(Z, θ̄(D, α))] ≈ (1/|D_tst|) Σ_{z_p ∈ D_tst} d(z_p, θ̄(D, α)),   (4.1)
where jDtstj is the number of examples in the set Dtst . The NICreg .D; ®/ can be calculated equation 2.22 for a given training set D. For validation error, we divided the training set D into two subsets: Dtrn for obtaining µN .Dtrn ; ®/ and Dval for obtaining the validation error; and the criterion is calculated by Lval.D; ®/ D
X 1 d.z p ; µN .Dtrn ; ®//: jDval j z 2D p
val
(4.2)
In the case of K-fold cross validation, we divide the data set D into K subsets, D_1, D_2, …, D_K. For the kth training, we leave out the subset D_k for validation and use the rest, D_k^- = D_1 ∪ ⋯ ∪ D_{k−1} ∪ D_{k+1} ∪ ⋯ ∪ D_K, to obtain θ̄(D_k^-, α). Through K trainings, the cross-validation error is calculated by

L_cv(D, α) = (1/K) Σ_{k=1}^K { (1/|D_k|) Σ_{z_p ∈ D_k} d(z_p, θ̄(D_k^-, α)) }.   (4.3)
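Equation 4.3 requires K trainings per candidate α, which is precisely the cost the NIC-based criterion avoids. A sketch of the computation, with closed-form ridge regression standing in for the trained network θ̄(D_k^-, α) (an illustrative simplification):

```python
import numpy as np

rng = np.random.default_rng(4)

def theta_bar(xs, ys, alpha):
    """Closed-form ridge stand-in for the trained network theta-bar(D, alpha)."""
    m = xs.shape[1]
    return np.linalg.solve(xs.T @ xs + alpha * np.eye(m), xs.T @ ys)

def cv_error(xs, ys, alpha, K=10):
    """L_cv(D, alpha): average held-out loss over K folds (equation 4.3)."""
    n = len(xs)
    folds = np.array_split(np.arange(n), K)
    total = 0.0
    for held in folds:
        keep = np.setdiff1d(np.arange(n), held)       # D_k^- = D minus fold k
        th = theta_bar(xs[keep], ys[keep], alpha)     # one training per fold
        total += np.mean((ys[held] - xs[held] @ th) ** 2 / 2)
    return total / K

xs = rng.normal(size=(250, 3))
ys = xs @ np.array([1.0, -1.0, 0.0]) + 0.1 * rng.normal(size=250)
print(cv_error(xs, ys, alpha=1.0))
```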
By minimizing these criteria with respect to α, we obtain four optimal values: α_D^gen, α_D^NIC, α_D^val, and α_D^cv. These α_D are stochastic values depending on the data set D and are distributed around the optimum α_opt that minimizes the expected generalization error, E_D[L_gen(D, α)]. The optimum can be estimated by

α_opt = α_opt^gen = argmin_α E_D[L_gen(D, α)].   (4.4)
Therefore, the goodness of a criterion can be evaluated by the distributional properties of the corresponding α_D and L_gen(α_D).

4.1 Toy Problem. To investigate the distributional properties of α_D and L_gen(α_D), we first conducted computer simulations using a simple toy problem. We used the gaussian noise model with simple weight decay regularization, as discussed in section 3. The training data are generated by a teacher network with a two input–three hidden–one output structure (see Figure 2). The output y includes gaussian noise with zero mean and 10^{-2} variance. To get average results, we generated 100 different training sets with 250 examples in each set. We trained three student network models with
Figure 2: Input-output mapping of the teacher network.
Table 1: Average Results on Toy Problem for Different Sizes of Network Models.

                                       Three Hiddens   Four Hiddens   Five Hiddens
Average          E_D[L_gen(0)]           0.00533         0.00560        0.00562
generalization   E_D[L_gen(α_D^gen)]     0.00527         0.00534        0.00540
error            E_D[L_gen(α_D^NIC)]     0.00531         0.00543        0.00551
                 E_D[L_gen(α_D^val)]     0.00540         0.00579        0.00608
                 E_D[L_gen(α_D^cv)]      0.00533         0.00540        0.00557
Relative         NIC                     1.0             1.0            1.0
processing       Validation              0.64            0.71           0.75
time             Cross valid             7.76            8.40           8.51
different numbers of hidden units: three, four, and five. For approximating L_gen, we used a test set with 10^5 examples generated from the teacher network. For obtaining L_val, we divided each training set into two subsets: D_trn with 200 data and D_val with 50 data. For obtaining L_cv, we employed 10-fold cross validation. The average results are given in Table 1, which shows the average of L_gen(α_D) for the four different criteria together with L_gen(0) at α = 0. The criterion L_gen(α_D^gen) is ideal, giving the best generalization performance for a given data set, but it cannot be used in practical applications. The simple validation method is even worse than naive learning without regularization. The proposed method gives competent performance compared to the cross-validation method, which has a high computational cost (see the relative processing times in Table 1). We can also see the effect of regularization on the different sizes of network models. As the number of hidden units increases, the difference between L_gen with regularization and L_gen without regularization also increases. Nevertheless, for all methods, the smaller model gives better generalization performance than the larger one. We conducted further investigation of the detailed properties of the criteria for the smallest (optimal) network model. Figure 3 shows the average curves of the criteria with respect to α over 100 learning tasks with different data sets. From the figure, we can see that the curve of E_D[NIC_reg(α)] has a slightly steeper slope than that of E_D[L_gen(α)]. This tendency corresponds to a property of the AIC, namely, selecting a larger model in model selection tasks. Nevertheless, the curves of E_D[NIC_reg(α)] and E_D[L_cv(α)] have similar slopes, and they are closer to that of E_D[L_gen(α)] than is the curve of E_D[L_val(α)]. The optimum α_opt can be estimated by the minimum of the average curve of E_D[L_gen(α)] (i.e., α_opt^gen), and it gives 0.02684.
In addition to the average curves, we also plotted the median and quartiles of L_gen(α) in the inset box of Figure 3. From the graph, we can see that the random fluctuation of L_gen(α) depending on the data set D is quite large, which causes a large
Figure 3: Average curves of the four criteria (the generalization error, the modified NIC, the validation error, and the cross-validation error; large graph), and median and quartiles of the generalization errors (small graph).
fluctuation of the optimal α_D^gen as well. When each data set D is not so large, the random fluctuation of the data in D causes a large variance of L_gen(α; D) and α_D^gen. Concerning this phenomenon, we can say that it is more important to investigate the distributional properties of the optimal values α_D^gen, α_D^NIC, α_D^val, and α_D^cv depending on each training set D. Figure 4 shows the distributions of the obtained optimal α_D. In the figure, one can see that α_D^NIC and α_D^cv concentrate in the range of values smaller than α_opt^gen (0.02684), and they have smaller mean and variance than α_D^gen. The small value of the mean E_D[α_D^NIC] is supposed to correspond to the property of the AIC, which chooses a more complex model than the optimal one, because a small regularization strength implies a small penalty on the complicated function shapes produced by the learning network. Nevertheless, the small variances of α_D^NIC and α_D^cv are desirable properties, giving robustness against the stochastic noise in given data sets. For the same reason, the large variance of α_D^val is the main cause of the poor performance of the simple validation method. For some data sets, the generalization performance of the simple validation method is very poor, and this results in poor average performance. In addition to the properties of α_D itself, the data-dependent property of L_gen(α_D) is also important. In Figure 5, we show the correlation between the
Figure 4: Distributions of the optimal values of α depending on the training data set, for the network model with three hidden units. Reported per-histogram statistics: α_D^gen, mean 0.028, variance 8.3 × 10^{-4}; α_D^NIC, mean 0.011, variance 4.9 × 10^{-5}; α_D^val, mean 0.046, variance 3.5 × 10^{-3}; α_D^cv, mean 0.008, variance 1.1 × 10^{-4}.
ideal generalization performance L_gen(α_D^gen) and the generalization performance of the other methods: L_gen(α_D^NIC), L_gen(α_D^val), L_gen(α_D^cv), and L_gen(0). As shown in the figure, the generalization error obtained using the NIC has a slightly higher correlation with L_gen(α_D^gen) than those obtained using the other criteria (see the correlation coefficients in the upper-right corner of each box). The high correlation also implies the stability of the proposed method. It is necessary to determine whether the high correlation is preserved in larger models. Figure 6 shows the correlation between L_gen(α_D^gen) and L_gen(α_D^NIC) (upper row) and the correlation between L_gen(α_D^gen) and L_gen(α_D^cv) (lower row) for the models with three, four, and five hidden units (one per column). The histograms to the left of the data plots show the distribution of the generalization errors (L_gen(α_D^NIC) in the upper row and L_gen(α_D^cv) in the lower row), and the ones at the bottom show the distribution of L_gen(α_D^gen). For both the NIC and cross validation, we can see that the correlation decreased as the network size increased. At the same time, the mean and variance of the generalization errors also increased with the network size. This deterioration is expected to be improved by using a more sophisticated function for the regularization term. Nevertheless, we can still see good correlations, and the proposed method is as good as the cross-validation method.
Figure 5: Correlations among generalization errors at the optimized values of α using different criteria, for the network model with three hidden units; r denotes the correlation coefficient. Each panel plots one of L_gen(α_D^NIC), L_gen(α_D^val), L_gen(α_D^cv), and L_gen(0) against L_gen(α_D^gen); the reported correlation coefficients are r = 0.941 (for the NIC), 0.894, 0.871, and 0.486 (the last for the simple validation method).
4.2 Real-World Problems. In order to check the practical applicability of the proposed method, we conducted experiments on three real-world problems: the sonar data classification problem (Gorman & Sejnowski, 1988), the pen-based digit recognition problem, and the computer system activity estimation problem. The sonar data and the pen-digit data are well-known benchmark data for classification tasks and are available from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html). The computer system activity estimation problem is also public benchmark data for regression tasks and is available from the DELVE database (http://www.cs.toronto.edu/~delve/). The task of the sonar data problem is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The data are composed of two sets: a training set with 104 data and a test set with 104 data. We used a multilayer perceptron with 60 inputs–6 hiddens–1 output and trained the network 10 times with
Figure 6: Correlations and distributions of generalization errors for different sizes of network model (three, four, and five hidden units). Reported correlation coefficients with L_gen(α_D^gen): upper row, L_gen(α_D^NIC), r = 0.94, 0.79, 0.67; lower row, L_gen(α_D^cv), r = 0.89, 0.92, 0.62.
different initializations so as to get average results. The learning stopped when the classification rate for the training data became 100% or the decrease in error became smaller than 10^{-4}. In order to find the optimal α, we started from α = 0.5 and conducted a line search by decreasing the value using α_new = 0.7 × α_old. We applied the four criteria L_gen, NIC_reg, L_val, and L_cv. For obtaining L_gen, we used the test data set with 104 data. For obtaining L_val, we divided the training set into two subsets: one with 78 data for training and the other with 26 data for validation. For obtaining L_cv, we employed the eight-fold cross-validation method. The task of the pen-based digit recognition problem is to recognize digits captured from a pressure-sensitive tablet. The raw data are preprocessed so as to be represented by 16-dimensional feature vectors. For the recognition task, we used a multilayer perceptron with 16 inputs–15 hiddens–10 outputs and trained the network 10 times with different initializations so as to get average results. The stopping condition and the search strategy for α are the same as those for the sonar data problem. We used 700 data for training and 1000 data for testing. For obtaining L_gen, we used the test data set. For obtaining L_val, we divided the training set into two subsets: one with 500 data for training and the other with 200 data for validation. For obtaining L_cv, we employed the five-fold cross-validation method. For these two classification problems, the number of hidden units
was chosen so as to obtain 100% classification rates on the training data sets when α = 0. The task of the computer system activity estimation problem is to estimate the number of users in a multiuser computer system from 12 types of system information, such as the number of system calls per second. We used a multilayer perceptron with 12 inputs–15 hiddens–1 output and trained the network 10 times with different initializations to get average results. The learning stopped when the decrease in training error became smaller than 10^{-6}. The line search condition for optimizing α is the same as that for the two previous problems. We used 512 data for training and 4096 data for testing and obtaining L_gen. For obtaining L_val, we divided the training set into one subset with 384 data for training and another with 128 data for validation. For obtaining L_cv, we employed the four-fold cross-validation method. The average results are given in Table 2. For the classification problems, we also calculated the classification error rates (CE) to evaluate the generalization performance. The E[·] in the table denotes the average over a number of trainings with different initializations. From the results, we confirmed that the proposed method can achieve good generalization performance in practical applications. Furthermore, we show the relative processing times of the methods using NIC_reg, L_val, and L_cv in Table 2. From the results, we can say that the proposed method is competent for practical applications in the sense of computational efficiency as well as generalization performance.
Table 2: Average Values of Test Error and/or the Classification Rate on Test Data for Different Criteria.

                                    SONAR    PEN-DIGITS   COMP-ACT
Average generalization error
  E[L_gen(0)]                       0.053    0.056        0.1702
  E[L_gen(α_D^gen)]                 0.041    0.042        0.0279
  E[L_gen(α_D^NIC)]                 0.045    0.044        0.0293
  E[L_gen(α_D^val)]                 0.051    0.046        0.0420
  E[L_gen(α_D^cv)]                  0.046    0.044        0.0408
Average classification error
  E[CE(0)]                          13.2%    6.64%        -
  E[CE(α_D^gen)]                    10.8%    4.98%        -
  E[CE(α_D^NIC)]                    11.8%    5.35%        -
  E[CE(α_D^val)]                    14.5%    5.43%        -
  E[CE(α_D^cv)]                     11.9%    5.30%        -
Relative processing time
  NIC                               1.0      1.0          1.0
  Validation                        0.65     0.97         0.88
  Cross validation                  6.44     3.42         3.04
Natural Gradient with Regularization Using NIC
5 Conclusion and Discussion

In this article, we dealt with the generalization performance of natural gradient learning. Because the natural gradient method avoids plateaus and converges fast, the generalization problem of natural gradient learning can be more serious than that of the standard gradient method. To solve this problem, we presented a natural gradient learning method with a regularization term and proposed an efficient, theoretically justified method of estimating the optimal scale of regularization using a modified NIC. By combining the natural gradient and the modified NIC, we can build a learning strategy that gives good generalization performance and can be applied to various stochastic neural network models. Since it has been proved that optimization with respect to the NIC is asymptotically equivalent to optimization with respect to the ensemble average of the generalization error, we can expect the proposed method to maximize the generalization performance of the trained networks. Compared to the cross-validation method, the optimization process for the regularization strength is computationally efficient. We also investigated the properties of the optimal regularization strength obtained by the proposed method through computer simulations. Its tendency to concentrate on small values corresponds to a property of AIC, which is likely to choose a more complex model than the optimal one in model selection. Hagiwara (2002) noted that this tendency is due to the singular structure of the space of neural networks. As a future study, it would be important to investigate the relationship between singularities and the regularization strength more deeply. Nevertheless, the proposed method showed superior generalization performance compared to the simple validation method and the cross-validation method.

Appendix A. 
Proof of Lemma 1

Hereafter, we use tensor-like expressions and Einstein's summation convention without further notice, such as $a^i b_i = \sum_i a^i b_i$. We also use the differential operator $\partial_i : f \mapsto \partial f / \partial \theta^i$ for derivatives with respect to the parameter $\theta = (\theta^1, \ldots, \theta^m)^\top$. Let us define $\omega = (\omega^1, \ldots, \omega^m)^\top$ by

$$\hat\theta - \theta^* = \frac{\omega}{\sqrt{n}}, \tag{A.1}$$
and expand the derivative of the empirical loss around the optimal parameters $\theta^*$. By using $\sum_{p=1}^n \partial_i d(z_p, \hat\theta) = 0$, we obtain

$$
\frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_i d(z_p, \hat\theta)
= \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_i d(z_p, \theta^*)
+ \left( \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j d(z_p, \theta^*) - q^*_{ij} + q^*_{ij} \right) \omega^j
+ \frac{1}{2\sqrt{n}} \left( \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j \partial_k d(z_p, \theta^*) - t^*_{ijk} + t^*_{ijk} \right) \omega^j \omega^k
+ (\text{higher-order terms}) = 0. \tag{A.2}
$$
By solving equation A.2 for $\omega$, we obtain the following relation:

$$
\omega^i = - q_*^{ij} \left( \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_j d(z_p, \theta^*) \right)
- q_*^{ij} \left( \frac{1}{n} \sum_{p=1}^n \partial_j \partial_k d(z_p, \theta^*) - q^*_{jk} \right) \omega^k
- q_*^{ij}\, \frac{1}{2\sqrt{n}} \left( \frac{1}{n} \sum_{p=1}^n \partial_j \partial_k \partial_l d(z_p, \theta^*) - t^*_{jkl} + t^*_{jkl} \right) \omega^k \omega^l
+ (\text{higher-order terms}). \tag{A.3}
$$
Knowing that

$$E_{Z_1,\ldots,Z_n}\!\left[ \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_i d(Z_p, \theta^*) \right] = 0,$$

$$E_{Z_1,\ldots,Z_n}\!\left[ \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_i d(Z_p, \theta^*)\; \frac{1}{\sqrt{n}} \sum_{p'=1}^n \partial_j d(Z_{p'}, \theta^*) \right] = g^*_{ij},$$

$$E_{Z_1,\ldots,Z_n}\!\left[ \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_i d(Z_p, \theta^*)\; \frac{1}{\sqrt{n}} \sum_{p'=1}^n \partial_j d(Z_{p'}, \theta^*)\; \frac{1}{\sqrt{n}} \sum_{p''=1}^n \partial_k d(Z_{p''}, \theta^*) \right] = O\!\left(\frac{1}{\sqrt{n}}\right),$$

where $E_{Z_1,\ldots,Z_n}[\cdot]$ denotes an expectation with respect to the independent and identically distributed random variables $Z_1, \ldots, Z_n$, and that

$$\left( \frac{1}{n} \sum_{p=1}^n \partial_j \partial_k d(z_p, \theta^*) - q^*_{jk} \right), \qquad \left( \frac{1}{n} \sum_{p=1}^n \partial_j \partial_k \partial_l d(z_p, \theta^*) - t^*_{jkl} \right)$$
are of the order $O_p(1/\sqrt{n})$, and substituting equation A.3 for $\omega^i$ repeatedly, the average of $\omega$ can be calculated as

$$
E_{Z_1,\ldots,Z_n}[\omega^i]
= \frac{1}{\sqrt{n}}\, q_*^{ij} q_*^{kl}\, E_{Z_1,\ldots,Z_n}\!\left[ \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_j \partial_k d(Z_p, \theta^*)\; \frac{1}{\sqrt{n}} \sum_{p'=1}^n \partial_l d(Z_{p'}, \theta^*) \right]
- \frac{1}{2\sqrt{n}}\, q_*^{ij} q_*^{kl} q_*^{k'l'} t^*_{jkk'}\, E_{Z_1,\ldots,Z_n}\!\left[ \frac{1}{\sqrt{n}} \sum_{p=1}^n \partial_l d(Z_p, \theta^*)\; \frac{1}{\sqrt{n}} \sum_{p'=1}^n \partial_{l'} d(Z_{p'}, \theta^*) \right]
+ o\!\left(\frac{1}{\sqrt{n}}\right)
$$
$$
= \frac{1}{\sqrt{n}} \left( q_*^{ij} q_*^{kl} s^*_{jkl} - \frac{1}{2}\, q_*^{ij} q_*^{kl} q_*^{k'l'} t^*_{jkk'} g^*_{ll'} \right) + o\!\left(\frac{1}{\sqrt{n}}\right) \tag{A.4}
$$
$$
= \frac{1}{\sqrt{n}}\, b^i + o\!\left(\frac{1}{\sqrt{n}}\right). \tag{A.5}
$$

Consequently, we obtain

$$E_D[\hat\theta - \theta^*] = E_D\!\left[\frac{\omega}{\sqrt{n}}\right] = \frac{1}{n}\, b + o\!\left(\frac{1}{n}\right). \tag{A.6}$$

The covariance of the estimator is given by

$$
V_D[\hat\theta] = E_D\!\left[ \left\{ \hat\theta - \theta^* - \frac{1}{n} b + o\!\left(\frac{1}{n}\right) \right\} \left\{ \hat\theta - \theta^* - \frac{1}{n} b + o\!\left(\frac{1}{n}\right) \right\}^{\!\top} \right]
= \frac{1}{n}\, Q_*^{-1} G_* Q_*^{-1} + o\!\left(\frac{1}{n}\right) \tag{A.7}
$$
from ordinary asymptotic theory.

Appendix B. Proof of Lemma 2

Let

$$\zeta = (\zeta^1, \ldots, \zeta^m) = \bar\theta - \hat\theta \tag{B.1}$$
be the difference between the two estimators. Taking account of

$$\frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \bar\theta) + \frac{\alpha}{n}\, \partial_i r(\bar\theta) = 0 \tag{B.2}$$

and

$$\frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta) = 0, \tag{B.3}$$
we can expand the empirical loss with the regularization term as follows:

$$
\frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta + \zeta) + \frac{\alpha}{n}\, \partial_i r(\bar\theta)
= \frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta)
+ \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j d(z_p, \hat\theta)\, \zeta^j
+ \frac{\alpha}{n}\, \partial_i r(\bar\theta) + (\text{higher-order terms})
$$
$$
= \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j d(z_p, \hat\theta)\, \zeta^j + \frac{\alpha}{n}\, \partial_i r(\bar\theta) + (\text{higher-order terms}) = 0. \tag{B.4}
$$
Here we define the following quantities,

$$\hat g_{ij} = \frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta)\, \partial_j d(z_p, \hat\theta), \tag{B.5}$$

$$\hat q_{ij} = \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j d(z_p, \hat\theta), \tag{B.6}$$

which are similar to $g_{ij}(\theta^*)$ and $q_{ij}(\theta^*)$ but are defined at the parameter $\hat\theta$ and averaged with respect to the empirical distribution based on the given examples. Then we obtain

$$
\zeta^i = -\frac{\alpha}{n}\, \hat q^{ij} \partial_j r(\bar\theta) + O\!\left(\frac{1}{n^2}\right)
= -\frac{\alpha}{n}\, \hat q^{ij} \partial_j \hat r + O\!\left(\frac{1}{n^2}\right), \tag{B.7}
$$

where $\hat r = r(\hat\theta)$.

Appendix C. Proof of Lemma 3

Taking the regularization term and the bias into account, the estimator is decomposed as

$$
\bar\theta = \theta^* + \frac{\omega}{\sqrt{n}} + \frac{b}{n}
- \frac{\alpha}{n}\, Q_*^{-1} \nabla r(\theta^*)
- \frac{\alpha}{n}\, Q_*^{-1} \nabla\nabla r(\theta^*)\, \frac{\omega}{\sqrt{n}}
+ (\text{higher-order terms}), \tag{C.1}
$$

where $\omega = (\omega^1, \ldots, \omega^m)^\top$ is a random variable that is subject to a normal distribution $N(0, \Sigma)$, $\Sigma = Q^{-1}(\theta^*)\, G(\theta^*)\, Q^{-1}(\theta^*)$.
By letting

$$\dot r^i = q_*^{ij}\, \partial_j r(\theta^*), \qquad \ddot r^i_j = q_*^{ik}\, \partial_j \partial_k r(\theta^*),$$

the ensemble average of the pointwise loss is given by

$$
E_D E_Z[d(Z, \bar\theta)] = E_Z[d(Z, \theta^*)]
+ \frac{1}{2}\, E_Z\!\left[\partial_i \partial_j d(Z, \theta^*)\right]
E_D\!\left[ \left\{ \frac{\omega^k}{\sqrt{n}} \left( \delta^i_k - \frac{\alpha}{n} \ddot r^i_k \right) + \frac{1}{n} \left( b^i - \alpha \dot r^i \right) \right\}
\left\{ \frac{\omega^l}{\sqrt{n}} \left( \delta^j_l - \frac{\alpha}{n} \ddot r^j_l \right) + \frac{1}{n} \left( b^j - \alpha \dot r^j \right) \right\} \right]
+ (\text{higher-order terms})
$$
$$
= E_Z[d(Z, \theta^*)]
+ \frac{1}{2n} \left( \delta^i_k - \frac{\alpha}{n} \ddot r^i_k \right) \left( \delta^j_l - \frac{\alpha}{n} \ddot r^j_l \right) q_*^{kk'} q_*^{ll'} g^*_{k'l'}\, q^*_{ij}
+ \frac{1}{2n^2} \left( b^i - \alpha \dot r^i \right) \left( b^j - \alpha \dot r^j \right) q^*_{ij}
+ (\text{higher-order terms}), \tag{C.2}
$$

where $\delta^i_k$ is the Kronecker delta, which has the value 1 when $i = k$ and 0 otherwise. Here, the equality $E_Z[\partial_i d(Z, \theta^*)] = 0$ is used. By rewriting equation C.2 in matrix form, we can easily obtain equation 2.16 in lemma 3.

Appendix D. Proof of Theorem 2

The modified $\mathrm{NIC}_{\mathrm{reg}}$, equation 2.22, can be rewritten as

$$\frac{1}{n} \sum_{p=1}^n d(z_p, \bar\theta) + \frac{1}{2n}\, \tilde q^{jk} \bar g_{jk}, \tag{D.1}$$

where $\tilde q^{jk}$ and $\bar g_{jk}$ are the elements at the $j$th row and $k$th column of $\tilde Q^{-1}$ and $\bar G$, respectively. By using equation B.7, the first term becomes

$$
\frac{1}{n} \sum_{p=1}^n d(z_p, \hat\theta + \zeta)
= \frac{1}{n} \sum_{p=1}^n d(z_p, \hat\theta)
+ \frac{1}{2} \left( \frac{\alpha}{n} \right)^{\!2} \hat q_{ij}\, \hat q^{ik} \hat q^{jl}\, \partial_k \hat r\, \partial_l \hat r + O\!\left(\frac{1}{n^3}\right)
= \frac{1}{n} \sum_{p=1}^n d(z_p, \hat\theta)
+ \frac{1}{2} \left( \frac{\alpha}{n} \right)^{\!2} \hat q^{ij}\, \partial_i \hat r\, \partial_j \hat r + O\!\left(\frac{1}{n^3}\right). \tag{D.2}
$$
Noting that

$$
\tilde q_{ij} = \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j d(z_p, \hat\theta + \zeta) + \frac{\alpha}{n^2}\, \partial_i \partial_j r(\hat\theta + \zeta)
= \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j d(z_p, \hat\theta)
+ \frac{1}{n} \sum_{p=1}^n \partial_i \partial_j \partial_k d(z_p, \hat\theta)\, \zeta^k
+ \frac{\alpha}{n^2}\, \partial_i \partial_j r(\hat\theta)
+ \frac{\alpha}{n^2}\, \partial_i \partial_j \partial_k r(\hat\theta)\, \zeta^k
+ (\text{higher-order terms})
$$
$$
= \hat q_{ij} - \frac{\alpha}{n} \left( \hat t_{ijk}\, \hat q^{kl}\, \partial_l r(\hat\theta) - \frac{1}{n}\, \partial_i \partial_j r(\hat\theta) \right) + O\!\left(\frac{1}{n^2}\right) \tag{D.3}
$$

and

$$
\bar g_{ij} = \frac{1}{n} \sum_{p=1}^n \left( \partial_i d(z_p, \hat\theta + \zeta) + \frac{\alpha}{n}\, \partial_i r(\hat\theta + \zeta) \right) \left( \partial_j d(z_p, \hat\theta + \zeta) + \frac{\alpha}{n}\, \partial_j r(\hat\theta + \zeta) \right)
$$
$$
= \frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta)\, \partial_j d(z_p, \hat\theta)
+ \frac{1}{n} \sum_{p=1}^n \partial_i \partial_k d(z_p, \hat\theta)\, \zeta^k\, \partial_j d(z_p, \hat\theta)
+ \frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta)\, \partial_j \partial_k d(z_p, \hat\theta)\, \zeta^k
+ \frac{1}{n} \sum_{p=1}^n \partial_i d(z_p, \hat\theta)\, \frac{\alpha}{n}\, \partial_j r(\hat\theta)
+ \frac{1}{n} \sum_{p=1}^n \partial_j d(z_p, \hat\theta)\, \frac{\alpha}{n}\, \partial_i r(\hat\theta)
+ (\text{higher-order terms})
$$
$$
= \hat g_{ij} - \frac{\alpha}{n} \left( \hat s_{kij} + \hat s_{kji} \right) \hat q^{kl}\, \partial_l \hat r + O\!\left(\frac{1}{n^2}\right), \tag{D.4}
$$
and that the equality

$$\frac{\partial q^{ij}}{\partial q_{kl}} = - q^{ik} q^{lj} \tag{D.5}$$

holds for the derivative of the inverse matrix, the second term becomes

$$
\frac{1}{2n}\, \tilde q^{ij} \bar g_{ij}
= \frac{1}{2n}\, \hat q^{ij} \hat g_{ij}
- \frac{1}{2n}\, \frac{\alpha}{n}\, \hat q^{ij}\, \partial_i \hat r \left( (\hat s_{jkl} + \hat s_{jlk})\, \hat q^{kl} - \hat t_{jkl}\, \hat q^{kk'} \hat q^{ll'} \hat g_{k'l'} \right)
- \frac{1}{n}\, \frac{\alpha}{n}\, \hat q^{ik} \hat q^{jl} \hat g_{kl}\, \partial_i \partial_j \hat r
+ O\!\left(\frac{1}{n^3}\right)
$$
$$
= \frac{1}{2n}\, \hat q^{ij} \hat g_{ij}
- \frac{1}{n}\, \frac{\alpha}{n} \left( \hat b^i\, \partial_i \hat r + \hat q^{ik} \hat q^{jl} \hat g_{kl}\, \partial_i \partial_j \hat r \right)
+ O\!\left(\frac{1}{n^3}\right). \tag{D.6}
$$
Neglecting higher-order terms and finding the value of $\alpha$ that minimizes the quadratic form

$$
\frac{1}{2} \left( \frac{\alpha}{n} \right)^{\!2} \hat q^{ij}\, \partial_i \hat r\, \partial_j \hat r
- \frac{1}{n}\, \frac{\alpha}{n} \left( \hat b^i\, \partial_i \hat r + \hat q^{ik} \hat q^{jl} \hat g_{kl}\, \partial_i \partial_j \hat r \right), \tag{D.7}
$$

we obtain

$$
\hat\alpha_{\mathrm{opt}}
= \frac{\hat b^i\, \partial_i \hat r + \hat q^{ik} \hat q^{jl} \hat g_{kl}\, \partial_i \partial_j \hat r}{\hat q^{ij}\, \partial_i \hat r\, \partial_j \hat r}
= \frac{\hat b^\top \nabla r(\hat\theta) + \mathrm{tr}\{\hat Q^{-1} \hat G \hat Q^{-1} \nabla\nabla r(\hat\theta)\}}{\nabla r(\hat\theta)^\top\, \hat Q^{-1}\, \nabla r(\hat\theta)}. \tag{D.8}
$$
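Equation D.8 transcribes directly into matrix code. The sketch below assumes that estimates of $\hat b$, $\nabla r(\hat\theta)$, $\nabla\nabla r(\hat\theta)$, $\hat Q^{-1}$, and $\hat G$ are already available as arrays; it is an illustration, not the authors' implementation:

```python
import numpy as np

def alpha_opt(b, grad_r, hess_r, Q_inv, G):
    """Optimal regularization strength of equation D.8.

    b      : estimated bias vector b-hat, shape (m,)
    grad_r : gradient of the regularizer at theta-hat, shape (m,)
    hess_r : Hessian of the regularizer at theta-hat, shape (m, m)
    Q_inv  : inverse Hessian estimate Q-hat^{-1}, shape (m, m)
    G      : Fisher-like matrix G-hat, shape (m, m)
    """
    num = b @ grad_r + np.trace(Q_inv @ G @ Q_inv @ hess_r)
    den = grad_r @ Q_inv @ grad_r
    return num / den
```

In practice one would plug in the empirical quantities of equations B.5 and B.6 evaluated at the trained parameters.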
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. on Automatic Control, 16, 716–724.
Amari, S. (1998). Natural gradient works efficiently. Neural Computation, 10, 251–276.
Amari, S., & Murata, N. (1997). Statistical analysis of regularization constant—From Bayes, MDL, and NIC points of view. Lecture Notes in Computer Science, 1240, 284–293.
Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12, 1399–1409.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning and neural networks. Cambridge: Cambridge University Press.
Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75–89.
Hagiwara, K. (2002). On the problem in model selection of neural network regression in overrealizable scenario. Neural Computation, 14, 1979–2002.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Murata, N. (2001). Bias of estimators and regularization terms. Dagstuhl seminar 01301. Available on-line: http://www.dagstuhl.de/Seminars/.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Trans. on Neural Networks, 5(6), 865–872.
Park, H. (2001). Practical consideration on generalization property of natural gradient learning. Lecture Notes in Computer Science, 2084, 402–409.
Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13, 755–764.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461–5464.
Rissanen, J. (1978). Modelling by shortest data description. Automatica, 14, 465–471.
Sigurdsson, S., Larsen, J., & Hansen, L. K. (2000). On comparison of adaptive regularization methods. In Proc. of Neural Networks for Signal Processing X (pp. 221–230). Piscataway, NJ: IEEE Press.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36, 111–133.
Vapnik, V. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag.

Received December 6, 2002; accepted July 10, 2003.
Communicated by Max Welling
LETTER
One-Bit-Matching Conjecture for Independent Component Analysis Zhi-Yong Liu [email protected]
Kai-Chun Chiu [email protected]
Lei Xu [email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
The one-bit-matching conjecture for independent component analysis (ICA) can be understood from different perspectives but is basically stated as "all the sources can be separated as long as there is a one-to-one same-sign correspondence between the kurtosis signs of all source probability density functions (pdf's) and the kurtosis signs of all model pdf's" (Xu, Cheung, & Amari, 1998a). This conjecture has been widely believed in the ICA community and implicitly supported by many ICA studies, such as the Extended Infomax (Lee, Girolami, & Sejnowski, 1999) and the soft switching algorithm (Welling & Weber, 2001). However, there has been no mathematical proof to confirm the conjecture theoretically. In this article, only skewness and kurtosis are considered, and such a mathematical proof is given under the assumption that the skewness of the model densities vanishes. Moreover, empirical experiments demonstrate the robustness of the conjecture as the vanishing skewness assumption breaks. As a by-product, we also show that the kurtosis maximization criterion (Moreau & Macchi, 1996) is actually a special case of the minimum mutual information criterion for ICA.

1 Introduction

Independent component analysis (ICA) aims at blindly separating the independent sources $s$ from their linear mixture $x = As$ via

$$y = Wx, \qquad x \in \mathbb{R}^m,\; y \in \mathbb{R}^n,\; W \in \mathbb{R}^{n \times m}. \tag{1.1}$$
The recovered $y$ is required to be as component-wise independent as possible, where independence is defined as

$$q(y) = \prod_{j=1}^n q(y^{(j)}). \tag{1.2}$$

Neural Computation 16, 383–399 (2004) © 2003 Massachusetts Institute of Technology
Z. Liu, K. Chiu, and L. Xu
This effort was initiated by Tong, Inouye, and Liu (1993). They showed that $y$ recovers $s$ up to constant scales and a permutation of components when the components of $y$ become component-wise independent and at most one of them is gaussian. The problem was further formalized by Comon (1994) under the name ICA. Although ICA has been studied from different perspectives, such as minimum mutual information (MMI) (Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996) and maximum likelihood (ML) (Cardoso, 1997), in the case that $W$ is invertible, all such approaches are equivalent to minimizing the following cost function,

$$D = -H(y) - \sum_{i=1}^n \int p_W(y_i; W) \log p_i(y_i)\, dy_i, \tag{1.3}$$

where

$$H(y) = -\int p(y) \log p(y)\, dy$$

is the entropy of $y$, $p_i(y_i)$ is the predetermined model probability density function (pdf), and $p_W(y_i; W)$ is the distribution of $y = Wx$. However, with each model pdf $p_i(y_i)$ prefixed, this approach works only in the cases where the components of $y$ are either all subgaussian (Amari et al., 1996) or all supergaussian (Bell & Sejnowski, 1995). In the cases where supergaussian and subgaussian sources coexist in an unknown manner, each model pdf $p_i(y_i)$ is suggested to be a flexibly adjustable density that is learned together with $W$, with the help of either a mixture of sigmoid functions that learns the cumulative distribution function (cdf) of each source (Xu, Yang, & Amari, 1996; Xu, Cheung, Yang, & Amari, 1997) or a mixture of parametric pdf's (Xu, 1997; Xu, Cheung, & Amari, 1998b). A so-called learned parametric mixture-based ICA (LPMICA) algorithm has been derived, with successful results on sources that can be either subgaussian or supergaussian, as well as any combination of both types (Xu et al., 1997, 1998b). The mixture model was also adopted for the ICA algorithms by Pearlmutter and Parra (1996), although it did not explicitly target separating mixed sub- and supergaussian sources. Later, Attias (1999) also studied mixture-model-based ICA, which can be regarded as a noise-free degeneration of the independent factor analysis (IFA) model. Interestingly, it has also been found that a rough estimate of each source pdf or cdf may be enough for source separation. For instance, a simple sigmoid function such as $\tanh(x)$ seems to work well on supergaussian sources (Bell & Sejnowski, 1995), and a mixture of only two or three gaussians may already be enough (Xu, Cheung, & Amari, 1998a; Xu et al., 1998b) for mixed sub- and supergaussian sources. This leads to the so-called one-bit-matching conjecture (Xu et al., 1998a), which states that
One-Bit-Matching Conjecture for ICA
"all the sources can be separated as long as there is a one-to-one same-sign correspondence between the kurtosis signs of all source pdf's and the kurtosis signs of all model pdf's." In recent years, this conjecture has also been implicitly supported by several other ICA studies (Girolami, 1998; Everson & Roberts, 1999; Lee, Girolami, & Sejnowski, 1999; Welling & Weber, 2001). Although the one-bit-matching conjecture has been widely accepted in the ICA community, there is no theoretical proof for it. In the literature, a mathematical proof (Cheung & Xu, 2000) was given for the case involving only two subgaussian sources, but the result cannot be extended to a model with more than two sources or with mixed sub- and supergaussian sources. Moreover, to guarantee that a general adaptive ICA algorithm is stable at the correct separation points, the constraints on the nonlinear function $\varphi_i(y_i) = -\frac{d}{dy_i} \log p_i(y_i)$ were studied by Amari and Chen (1997), but that work did not touch on the circumstances under which the sources can be separated. When only skewness and kurtosis are under consideration, this letter provides a mathematical proof of the one-bit-matching conjecture under the assumption of zero skewness for the model pdf's. The entire proof proceeds in three stages. First, the observed mixture and recovered source signals are prewhitened to zero mean and identity covariance matrix. Next, an equivalence is established between minimization of the cost function 1.3 and maximization of a weighted sum of matching scores between the kurtoses of source pdf's and model pdf's. Finally, we show that maximizing the weighted sum recovers the sources up to a permutation and sign indeterminacy. Meanwhile, as a by-product, we also show that kurtosis maximization (Moreau & Macchi, 1996)-based ICA can be taken as a special case of ICA based on equation 1.3. The rest of the letter is organized in the following way. Section 2 is devoted to a detailed proof of the conjecture. 
Section 3 empirically demonstrates the robustness of the conjecture against the vanishing skewness assumption for the model pdf's, and section 4 concludes the letter.

2 A Theorem on the One-Bit-Matching Conjecture

In this section, we prove the theorem on the one-bit-matching conjecture according to the three stages described above.

Lemma 1. Assume that the independent sources $s$, the observed samples $x$, and the output $y$ are all prewhitened with zero mean and identity covariance matrix. We have

$$\gamma_i^y = \sum_{j=1}^n r_{ij}^3\, \gamma_j^s \tag{2.1}$$

$$\nu_i^y = \sum_{j=1}^n r_{ij}^4\, \nu_j^s, \tag{2.2}$$
where $\gamma_i^y$, $\nu_i^y$, $\gamma_j^s$, and $\nu_j^s$ denote the skewness and kurtosis of $y_i$ and the skewness and kurtosis of the source $s_j$, respectively, and $(r_{ij})_{n \times n}$ is an orthonormal matrix.

Proof. Based on the prewhitening assumption, we have $y = Wx = WAs = Rs$ with

$$E(yy^\top) = R\, E(ss^\top)\, R^\top \;\Rightarrow\; RR^\top = I.$$

That is, $R = (r_{ij})_{n \times n}$ is an orthonormal matrix. Provided that $s = [s_1, s_2, \ldots, s_n]^\top$ is component-wise independent with $E(s_i) = 0$ and $E(s_i^2) = 1$, we can obtain

$$\gamma_i^y = E\!\left[\left(\sum_{j=1}^n r_{ij} s_j\right)^{\!3}\right] = \sum_{j=1}^n r_{ij}^3\, E(s_j^3) = \sum_{j=1}^n r_{ij}^3\, \gamma_j^s \tag{2.3}$$

$$
\nu_i^y = E(y_i^4) - 3 = E\!\left[\left(\sum_{j=1}^n r_{ij} s_j\right)^{\!4}\right] - 3
= \sum_{j=1}^n r_{ij}^4\, E(s_j^4) + 6 \sum_{j=1}^{n-1} \sum_{r=j+1}^{n} r_{ij}^2 r_{ir}^2 - 3 \left( \sum_{j=1}^n r_{ij}^2 \right)^{\!2}
= \sum_{j=1}^n r_{ij}^4\, E(s_j^4) - 3 \sum_{j=1}^n r_{ij}^4
= \sum_{j=1}^n r_{ij}^4 \left( E(s_j^4) - 3 \right) = \sum_{j=1}^n r_{ij}^4\, \nu_j^s, \tag{2.4}
$$
where $\sum_{j=1}^n r_{ij}^2 = 1$ due to the orthonormality of $R$; $\gamma_i^y$ and $\nu_i^y$, respectively, denote the skewness and kurtosis of $y_i$, which in practice are computed from the samples; and $\gamma_j^s$ and $\nu_j^s$ denote the skewness and kurtosis of the source $s_j$, respectively. Meanwhile, the orthonormality of $R$ further results in the entropy $H(y) = H(s)$ in equation 1.3 being a constant. Thus, minimizing 1.3 is equivalent to maximizing

$$\hat D = \sum_{i=1}^n \int p_W(y_i; W) \log p_i(y_i)\, dy_i, \tag{2.5}$$

where $p_W(y_i; W)$ is obtained via $y = Rs$. Based on the truncated Gram-Charlier expansion (Stuart & Ord, 1994) up to kurtosis, we can then get the
following approximation for $p_W(y_i; W)$:

$$p_W(y_i; W) \approx g(y_i) \left( 1 + \frac{\gamma_i^y}{6} H_3(y_i) + \frac{\nu_i^y}{24} H_4(y_i) \right), \tag{2.6}$$

where $g(y_i)$ denotes the standard gaussian density $g(y_i) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{y_i^2}{2})$, $\gamma_i^y$ and $\nu_i^y$ are given by equations 2.3 and 2.4, respectively, and the Chebyshev-Hermite polynomials $H_3(y_i)$ and $H_4(y_i)$ are defined as follows:

$$H_3(y_i) = y_i^3 - 3y_i \tag{2.7}$$

$$H_4(y_i) = y_i^4 - 6y_i^2 + 3. \tag{2.8}$$
We choose the Gram-Charlier expansion because it clearly shows how the higher-order statistics affect the pdf, and the polynomials involved, that is, $H_3(y_i)$ and $H_4(y_i)$, have an orthogonality property (Stuart & Ord, 1994).

Theorem 1. When only the skewness and kurtosis are under consideration and when the skewness of the model pdf's is zero:

1. Maximizing equation 2.5 is equivalent to maximizing

$$\sum_{i=1}^n \sum_{j=1}^n r_{ij}^4\, \nu_j^s\, k_i^m, \tag{2.9}$$

where $k_i^m$ is defined as

$$k_i^m = \int g(y_i)\, \frac{H_4(y_i)}{24} \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i.$$

2. $k_i^m$ is a constant that possesses the same sign as $\nu_i^m$.
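Claim 2 can be checked numerically by direct quadrature of the integral defining $k_i^m$ (an illustration; the integration range is an assumption chosen so that the argument of the logarithm stays positive for the $\nu^m$ values tested, and the gaussian weight makes the discarded tails negligible):

```python
import numpy as np

def k_m(nu_m, lo=-3.0, hi=3.0, num=20001):
    """Quadrature of k^m = int g(y) (H4(y)/24) log(1 + nu_m H4(y)/24) dy."""
    y = np.linspace(lo, hi, num)
    g = np.exp(-y ** 2 / 2) / np.sqrt(2 * np.pi)
    H4 = y ** 4 - 6 * y ** 2 + 3
    integrand = g * (H4 / 24) * np.log1p(nu_m * H4 / 24)
    return float(np.sum(integrand) * (y[1] - y[0]))  # simple quadrature
```

Since the product of $H_4(y)/24$ and $\log(1 + \nu^m H_4(y)/24)$ has the sign of $\nu^m$ pointwise, the quadrature returns a positive value for $\nu^m > 0$ and a negative value for $\nu^m < 0$.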
Proof. Assuming zero skewness, the prefixed model pdf $p_i(y_i)$ in equation 2.5 can be approximated by the following truncated Gram-Charlier expansion:

$$p_i(y_i) \approx g(y_i) \left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right), \tag{2.10}$$

where $\nu_i^m$ denotes the kurtosis of $p_i(y_i)$.
Putting equations 2.10 and 2.6 into 2.5, we maximize the following cost function:

$$
J(R) = \sum_{i=1}^n \int g(y_i) \left( 1 + \frac{\gamma_i^y}{6} H_3(y_i) + \frac{\nu_i^y}{24} H_4(y_i) \right)
\log\!\left[ g(y_i) \left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) \right] dy_i
$$
$$
= \sum_{i=1}^n \int g(y_i) \left( 1 + \frac{\gamma_i^y}{6} H_3(y_i) + \frac{\nu_i^y}{24} H_4(y_i) \right)
\log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i - \frac{n}{2} (1 + \log 2\pi)
$$
$$
= \sum_{i=1}^n \int g(y_i)\, \frac{\nu_i^y}{24} H_4(y_i) \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i
+ \sum_{i=1}^n \int g(y_i)\, \frac{\gamma_i^y}{6} H_3(y_i) \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i
+ \sum_{i=1}^n \int g(y_i) \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i
- \frac{n}{2} (1 + \log 2\pi)
$$
$$
= \sum_{i=1}^n \nu_i^y \int g(y_i)\, \frac{H_4(y_i)}{24} \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i
+ C - \frac{n}{2} (1 + \log 2\pi), \tag{2.11}
$$

where the term $\sum_{i=1}^n \int g(y_i)\, \frac{\gamma_i^y}{6} H_3(y_i) \log( 1 + \frac{\nu_i^m}{24} H_4(y_i))\, dy_i = 0$ (its integrand is odd), and, since the parameter under consideration is $\nu_i^y = \sum_{j=1}^n r_{ij}^4 \nu_j^s$, the term $\sum_{i=1}^n \int g(y_i) \log(1 + \frac{\nu_i^m}{24} H_4(y_i))\, dy_i$ can be treated as a constant $C$ with respect to $R$. Thus, the problem is further simplified to maximizing the following cost function $\hat J(R)$:

$$
\hat J(R) = \sum_{i=1}^n \nu_i^y \int g(y_i)\, \frac{H_4(y_i)}{24} \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i
= \sum_{i=1}^n \sum_{j=1}^n r_{ij}^4\, \nu_j^s \int g(y_i)\, \frac{H_4(y_i)}{24} \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i
= \sum_{i=1}^n \sum_{j=1}^n r_{ij}^4\, k_i^m\, \nu_j^s, \tag{2.12}
$$

where

$$k_i^m \triangleq \int g(y_i)\, \frac{H_4(y_i)}{24} \log\!\left( 1 + \frac{\nu_i^m}{24} H_4(y_i) \right) dy_i. \tag{2.13}$$
Note the three factors $g(y_i)$, $\frac{H_4(y_i)}{24}$, and $\log(1 + \frac{\nu_i^m}{24} H_4(y_i))$ involved in the integration in equation 2.13. The first, the standard gaussian, satisfies $g(y_i) > 0$. The product of the last two has the same sign as $\nu_i^m$ whether $H_4(y_i) > 0$ or $H_4(y_i) < 0$. Thus, the sign of the product of the three factors is always the same as that of $\nu_i^m$ (except at the four isolated points $y_i$ where the product is zero), and this makes the constant $k_i^m$ have the same sign as $\nu_i^m$.

As a by-product, theorem 1 also reveals that the conventional kurtosis maximization criterion can be taken as a special case of the MMI criterion given by equation 1.3. As shown in equation 2.12, by setting $\nu_1^m = \nu_2^m = \cdots = \nu_n^m = \nu^m$, which implies $k_1^m = k_2^m = \cdots = k_n^m = k^m$, we have
n X n X iD1 jD1
s m r4ij km i ºj D k
n X n X iD1 jD1
r4ij ºjs D km
n X
y
ºi :
(2.14)
iD1
This is exactly the kurtosis maximization criterion (Moreau & Macchi, 1996) by setting either km as 1 for supergaussian sources or ¡1 for subgaussian sources. Such a linkage was also observed by Cardoso (1999) under the following much stricter condition: ³ ´ ºm ºm log 1 C i H4 .yi / ¼ i H4 .yi /: 24 24 So far we have shown that the objective function 2.9 is equivalent to equation 2.5 when only the skewness and kurtosis are under consideration and the model skewness vanishes. Then we proceed to prove that maximizing equation 2.9 recovers the sources up to permutation and sign indeterminacy. The result is summarized by the following one-bit-matching theorem. Theorem 2. When only the skewness and kurtosis are under consideration and when the model skewness vanishes, all the sources can be separated as long as there is a one-to-one same-sign-correspondence between the kurtosis signs of all source pdf’s and the kurtosis signs of all model pdf’s. That is, maximization of equation 2.9 can be reachable only by an identity matrix up to permutation and sign indeterminacy.
Proof. Let $R_{n \times n} = [r_1, \ldots, r_n]^\top$ with $r_i = [r_{i1}, \ldots, r_{in}]$ be an orthonormal matrix. The objective 2.9 can be rewritten as follows:

$$\max \hat J(R) \;\; \text{s.t.}\; R^\top R = I, \qquad \hat J(R) = \sum_{i=1}^n \hat J(r_i), \quad \hat J(r_i) = k_i^m \sum_{j=1}^n r_{ij}^4\, \nu_j^s. \tag{2.15}$$

Consider an ordering $(i_1, \ldots, i_n)$ that is any permutation of $(1, \ldots, n)$, and consider the constraint $c = \{c_j\}_{j=1}^n$, where

$$c_j = \left\{ r_{i_j} \text{ is orthogonal to } W_{i_j},\; W_{i_j} = [r_{i_1}, \ldots, r_{i_{j-1}}]^\top,\; \|r_{i_j}\|^2 = 1 \right\} \quad (j = 1, \ldots, n),$$

with $c_1$ degenerated to $\|r_{i_1}\|^2 = 1$. There are in total $n!$ such constraints, which are jointly equivalent to the orthonormality constraint $R^\top R = I$. Thus, the maximization of $\hat J(R)$ under the orthonormality constraint is equivalent to the maximization of $\hat J(R)$ under these $n!$ constraints all together. Since the subscript of $k_i^m$ is the same as that of $r_i$ in 2.15, the problem is equivalent to considering the $n!$ orderings $k \in \mathcal{K}$ of $\{k_i^m\}_{i=1}^n$ under the fixed ordering of $\{c_i\}_{i=1}^n$. Thus, we have

$$\max_{R^\top R = I} \hat J(R) = \max_{k \in \mathcal{K}} \left\{ \sum_{i=1}^n \max_{c_i} \hat J(r_i) \right\}, \tag{2.16}$$

where, for each $k \in \mathcal{K}$, the maximization is implemented sequentially from the first term to the last term in $\sum_{i=1}^n \max_{c_i} \hat J(r_i)$.

For every $k$, under the one-bit-matching condition, there is at least one $\nu_j^s$ ($1 \le j \le n$) that possesses the same sign as $k_1^m$. Since $\sum_{j=1}^n r_{1j}^2 = 1$, $\max \hat J(r_1) = k_1^m \nu_{l(1)}^s$ is reached at $r_1 = [0, \ldots, r_{1l(1)}, \ldots, 0]$ with $r_{1l(1)} = \pm 1$, where $l(1) = \arg\max_i \nu_i^s$ if $k_1^m > 0$ and $l(1) = \arg\min_i \nu_i^s$ if $k_1^m < 0$. Then, under the constraints $\sum_{j=1}^n r_{2j}^2 = 1$ and $r_2 \perp r_1$, which implies $r_{2l(1)} = 0$, $\max \hat J(r_2) = k_2^m \nu_{l(2)}^s$, where $l(2) = \arg\max_{i \ne l(1)} \nu_i^s$ if $k_2^m > 0$ and $l(2) = \arg\min_{i \ne l(1)} \nu_i^s$ if $k_2^m < 0$. The one-bit-matching condition guarantees that this process can proceed sequentially until we get $\max \hat J(r_n) = k_n^m \nu_{l(n)}^s$. As a result, $\sum_{i=1}^n \max_{c_i} \hat J(r_i) = \hat J(\Pi_k) = \sum_{i=1}^n k_i^m \nu_{l(i)}^s$, where $\Pi_k$ is a permutation matrix. Moreover, considering all the orderings in $\mathcal{K}$, we finally reach a set $\{\Pi_k\}_{k \in \mathcal{K}}$ of at most $n!$ permutation matrices.

We further find one $\Pi$ such that $\hat J(\Pi) = \max_{k \in \mathcal{K}} \hat J(\Pi_k)$, for which we show $\Pi = I$, corresponding to the particular ordering $k_1^m > \cdots > k_n^m$ and $\nu_1^s > \cdots > \nu_n^s$. Such an ordering has the following property:

$$k_i^m \nu_i^s + k_j^m \nu_q^s - k_i^m \nu_q^s - k_j^m \nu_i^s = (k_i^m - k_j^m)(\nu_i^s - \nu_q^s) > 0 \quad \text{for } j, q > i. \tag{2.17}$$

Now consider any $\Pi = (\pi_{ij})_{n \times n} \in \{\Pi_k\}_{k \in \mathcal{K}}$, $\Pi \ne I$. If $\pi_{11} = 1$, we go directly on to consider $\pi_{22}$; otherwise, there must exist $\pi_{1j} = 1$ and $\pi_{i1} = 1$ for some $i, j > 1$. We modify $\Pi$ to $\Pi^{(1)}$ by letting $\pi_{11}^{(1)} = 1$, $\pi_{ij}^{(1)} = 1$, $\pi_{i1}^{(1)} = 0$, $\pi_{1j}^{(1)} = 0$, so that $\hat J(\Pi) < \hat J(\Pi^{(1)})$ due to 2.17. Continuing the same process, we get $\hat J(\Pi) < \hat J(\Pi^{(1)}) < \cdots < \hat J(\Pi^{(n)})$ with $\Pi^{(n)} = I$.

In summary, for any orthonormal matrix $R \ne I$, we can conclude that $\hat J(R) < \hat J(I)$. In a special case where, for instance, $k_i^m = k_{i+1}^m$, the maximum of $\hat J(R)$ can also be reached by the permutation matrix $\hat\Pi$ obtained by switching the $i$th and $(i+1)$th rows of $I$. Similarly, when two or more of the $k_i^m$ are equal to each other, the maximum of $\hat J(R)$ can also be reached by the permutation matrix obtained by switching the corresponding rows. In other words, the maximum of $\hat J(R)$ is also reachable by such permutation matrices in these special cases.
Remark 1. In the proof above, it can be observed that the one-bit-matching condition plays a key role. Without it, we cannot always ensure that there is a $\nu_q^s$ that possesses the same sign as $k_i^m$, and thus maximization of $\hat J(r_i)$ need not force $r_i$ to take the form $[0, \ldots, \pm 1, \ldots, 0]$. For example, with $k_1^m = 3$, $k_2^m = 2$, $k_3^m = -1$ and $\nu_1^s = 1$, $\nu_2^s = -4$, $\nu_3^s = -5$, the maximum of 2.9 is $5/3$, obtained by setting

$$(r_{ij}^4)_{n \times n} = \begin{pmatrix} 4/9 & 1/9 & 0 \\ 1/9 & 4/9 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

instead of a permutation matrix.
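The counterexample can be verified with a few lines of code (an illustration using exactly the numbers of remark 1):

```python
import numpy as np
from itertools import permutations

# With the one-bit-matching condition violated, the given non-permutation
# matrix of fourth powers beats every permutation matrix on objective 2.9.
k = np.array([3.0, 2.0, -1.0])       # k_1^m, k_2^m, k_3^m
nu = np.array([1.0, -4.0, -5.0])     # nu_1^s, nu_2^s, nu_3^s

R4 = np.array([[4/9, 1/9, 0.0],      # entries r_ij^4 of an orthonormal R
               [1/9, 4/9, 0.0],
               [0.0, 0.0, 1.0]])
J_star = k @ R4 @ nu                 # objective 2.9 at this matrix

# A permutation matrix picks one nu per k; the best such assignment:
J_perm = max(sum(k[i] * nu[p[i]] for i in range(3))
             for p in permutations(range(3)))
```

Here `J_star` evaluates to $5/3$, while the best permutation only attains $0$, so a non-permutation rotation is strictly better.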
Remark 2. The proof above consists of a continuous optimization that leads to one permutation matrix and a combinatorial optimization that reaches $\hat J(I)$. Actually, any permutation matrix already corresponds to a separable solution for ICA. This also means that a local maximization of $\hat J(R)$ that leads to a permutation already provides a successful solution for ICA. In fact, this explains the success of gradient-based algorithms for continuous optimization, such as the natural gradient algorithm (Amari et al., 1996).

3 An Empirical Study on the Conjecture with Nonvanishing Model Skewness

The one-bit-matching theorem is proved under the assumption that the model skewness vanishes. Although the assumption is quite weak in practice (for instance, any symmetric pdf has zero skewness), it is still interesting to study the case in which the assumption of zero model skewness is not satisfied. We give some empirical evidence for it based on two experiments.
3.1 On Demonstrating the Case with Small Model Skewness. This experiment demonstrates the case in which model densities with small skewness are chosen. The experiment is based on a two-source model, with the 50,000 source samples $s_i$ generated by the following pdf approximation,

$$p(s) = g(s) \left( 1 + \frac{\gamma^s}{6} H_3(s) + \frac{\nu^s}{24} H_4(s) \right), \tag{3.1}$$

where $\gamma^s$ and $\nu^s$ denote the skewness and kurtosis, respectively. The observed samples $x$ are mixed from $s$ by the rotation matrix

$$A = \begin{pmatrix} \cos\xi & -\sin\xi \\ \sin\xi & \cos\xi \end{pmatrix}$$

with $\xi = \pi/6$ as follows:

$$x = As. \tag{3.2}$$

Based on the output $y$ obtained by a rotation of $x$ via

$$y = Wx = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix} x, \tag{3.3}$$

objective 2.5 can be empirically obtained as follows:

$$\hat\phi = \arg\max_\phi \hat D(\phi), \qquad \hat D(\phi) = \sum_{i=1}^2 \frac{1}{50000} \sum_{t=1}^{50000} \log p_i(y_i(t)), \tag{3.4}$$

where the model pdf $p_i(y_i(t))$ is constrained to the approximate form

$$p(y) = g(y) \left( 1 + \frac{\gamma^m}{6} H_3(y) + \frac{\nu^m}{24} H_4(y) \right), \tag{3.5}$$

with $\gamma^m$ and $\nu^m$ denoting the model skewness and kurtosis, respectively. In the experiment, we demonstrate the following three cases:

Case 1: $\gamma_1^s = \gamma_2^s = 0$, and $\gamma_1^m = 0.2$, $\gamma_2^m = -0.2$

Case 2: $\gamma_1^s = 0.2$, $\gamma_2^s = -0.2$, and $\gamma_1^m = \gamma_2^m = 0$

Case 3: $\gamma_1^s = 0.2$, $\gamma_2^s = -0.2$, and $\gamma_1^m = 0.2$, $\gamma_2^m = -0.2$

The proved theorem covers case 2, which we use here for comparison. Meanwhile, in all cases we fix $\nu_1^s = \nu_1^m = 0.5$ and $\nu_2^s = \nu_2^m = 1$. The $\hat D(\phi)$ versus $\phi \in [0, 2\pi]$ obtained for the three cases are shown in Figures 1, 2, and 3, respectively.
[Figure 1: $\hat D$ vs. $\phi$ for case 1 ($\gamma_1^s = \gamma_2^s = 0$, $\gamma_1^m = 0.2$, $\gamma_2^m = -0.2$). The curve peaks near $\phi \approx 5\pi/6$ and $\phi \approx 11\pi/6$.]
From the experimental results, we see that the obtained $\phi$'s corresponding to the maximized $\hat D$ in all three cases are around $5\pi/6$ or $11\pi/6$, which makes the rotation matrix $R = WA$ approximately equal to $-I_2$ or $I_2$, respectively. That is, all three settings recovered the original two sources up to sign indeterminacy, even when the model pdf's have a small nonzero skewness, as in cases 1 and 3.

3.2 On Demonstrating the Breakpoint of the Conjecture. We proceed to probe the conjecture by gradually increasing the skewness of the model densities, to get empirical evidence on the extent to which the conjecture holds or breaks down in practice. In this experiment, the 50,000 source samples of $s_1$ and $s_2$ are generated by the following mixture of two gaussians,

$$p(y \mid \theta) = 0.5\, G(y \mid \theta, (2 - \theta)^2) + 0.5\, G(y \mid -\theta, 1), \qquad 0 \le \theta \le 1, \tag{3.6}$$

with $\theta_1 = 0.1$ and $\theta_2 = 0.9$, respectively, where $G(y \mid m, \sigma^2)$ denotes a gaussian pdf with mean $m$ and variance $\sigma^2$. By a standard technique (Stuart & Ord, 1994), the skewness $\gamma$ and kurtosis $\nu$ of a mixture of $k$ gaussians
Z. Liu, K. Chiu, and L. Xu

Figure 2: D̂ vs. φ for case 2.
can be obtained as follows:

\gamma = \mu_{30} + 3\mu_{12} + 2\mu_{10}^3 - 3\mu_{10}\mu_{02} - 3\mu_{10}\mu_{20},

\nu = \mu_{40} + 6\mu_{22} + 3\mu_{04} + 12\mu_{10}^2\mu_{02} + 12\mu_{10}^2\mu_{20} - 12\mu_{10}\mu_{12} - 4\mu_{10}\mu_{30} - 3\mu_{02}^2 - 3\mu_{20}^2 - 6\mu_{02}\mu_{20} - 6\mu_{10}^4,  (3.7)

where \mu_{pq} \triangleq \sum_{j=1}^{k} \alpha_j m_j^p \sigma_j^q. In our case, they become, respectively,

\gamma = 1.5\theta^3 - 6\theta^2 + 4.5\theta,  (3.8)

\nu = -1.25\theta^4 - 6\theta^3 + 16.5\theta^2 - 18\theta + 6.75,  (3.9)
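A quick numerical check of equations 3.8 and 3.9 (our own sketch; the function names are ours) reproduces the source statistics quoted below for θ₁ = 0.1 and θ₂ = 0.9:

```python
def mix_skewness(theta):
    """Skewness of the two-gaussian mixture of equation 3.6 (equation 3.8)."""
    return 1.5 * theta**3 - 6.0 * theta**2 + 4.5 * theta

def mix_kurtosis(theta):
    """Kurtosis of the two-gaussian mixture of equation 3.6 (equation 3.9)."""
    return (-1.25 * theta**4 - 6.0 * theta**3
            + 16.5 * theta**2 - 18.0 * theta + 6.75)

# theta_1 = 0.1: supergaussian (positive kurtosis); theta_2 = 0.9: subgaussian
print(mix_skewness(0.1), mix_kurtosis(0.1))   # ~0.39, ~5.11
print(mix_skewness(0.9), mix_kurtosis(0.9))   # ~0.28, ~-1.28
```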
with the change of γ and ν as θ varies shown in Figure 4. As a result, the skewness γ and kurtosis ν of the two sources are as follows: γ₁ˢ = 0.39, ν₁ˢ = 5.11; γ₂ˢ = 0.28, ν₂ˢ = −1.28. That is, s₁ is supergaussian and s₂ is subgaussian.

Figure 3: D̂ vs. φ for case 3. (As in Figures 1 and 2, the maxima of D̂(φ) lie near 5π/6 and 11π/6.)

The observations x and outputs y are obtained in the same way as in the previous experiment, using equations 3.2 and 3.3, respectively, but now with θ = π/4. Meanwhile, φ is again obtained using equation 3.4, and the two model pdfs p₁(y₁ | θ₁) and p₂(y₂ | θ₂) still have the form given by equation 3.6, but with θ₁ increased from 0 to 0.6 using

\theta_1 = 0.6\tau, \quad 0 \le \tau \le 1,  (3.10)

and θ₂ decreased from 1 to 0.7 using

\theta_2 = 1 - 0.3\tau, \quad 0 \le \tau \le 1.  (3.11)

In this way, p₁(y₁ | θ₁) is constrained to be supergaussian while p₂(y₂ | θ₂) always remains subgaussian; meanwhile, increasing τ decreases the kurtosis (in absolute value) and increases the skewness of both p₁ and p₂, as illustrated in Figure 6.
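For concreteness, here is a minimal sketch (ours, not the article's code) of drawing source samples from equation 3.6 and of the parameter schedules of equations 3.10 and 3.11:

```python
import random

def sample_mixture(theta, n, rng):
    """Draw n samples from equation 3.6:
    0.5*G(theta, (2-theta)^2) + 0.5*G(-theta, 1)."""
    return [rng.gauss(theta, 2.0 - theta) if rng.random() < 0.5
            else rng.gauss(-theta, 1.0) for _ in range(n)]

def theta1(tau):
    return 0.6 * tau          # equation 3.10: increases from 0 to 0.6

def theta2(tau):
    return 1.0 - 0.3 * tau    # equation 3.11: decreases from 1 to 0.7

rng = random.Random(0)
s1 = sample_mixture(0.1, 50000, rng)   # the supergaussian source s1
print(abs(sum(s1) / len(s1)) < 0.05)   # the mixture mean is 0 by construction
```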
Figure 4: Skewness and kurtosis vs. θ.
The performance of the recovery is measured by the following error metric (Amari et al., 1996),

E = \sum_{i=1}^{d}\left(\sum_{j=1}^{d} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1\right) + \sum_{j=1}^{d}\left(\sum_{i=1}^{d} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1\right),  (3.12)
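Equation 3.12 is straightforward to compute. Below is a small pure-Python sketch of the index (our own, for a square matrix P = WA given as nested lists); it is zero exactly when P is a scaled permutation, that is, when the sources are recovered up to order and sign:

```python
def amari_error(P):
    """Amari et al. (1996) separation-performance index of equation 3.12."""
    d = len(P)
    row = sum(sum(abs(P[i][j]) for j in range(d))
              / max(abs(P[i][k]) for k in range(d)) - 1.0
              for i in range(d))
    col = sum(sum(abs(P[i][j]) for i in range(d))
              / max(abs(P[k][j]) for k in range(d)) - 1.0
              for j in range(d))
    return row + col

print(amari_error([[1.0, 0.0], [0.0, -2.0]]))   # 0.0: a scaled permutation
```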
where P \triangleq WA. The obtained error E versus τ is shown in Figure 5, from which we notice that the performance worsens as τ increases. Typically, if we set the threshold at E = 0.40, which corresponds to the matrix

P = WA = \begin{pmatrix} 0.9934 & -0.1144 \\ 0.1144 & 0.9934 \end{pmatrix},

source separation by ICA via maximizing equation 1.3 fails when τ > 0.89. This corresponds to 0.6 ≥ θ₁ ≥ 0.53 and 0.7 ≤ θ₂ ≤ 0.73, or 0.93 ≥ γ₁ᵐ ≥ 0.86, 0.43 ≤ ν₁ᵐ ≤ 0.85
Figure 5: The obtained error E vs. τ.
and 0.73 ≥ γ₂ᵐ ≥ 0.67, −0.12 ≥ ν₂ᵐ ≥ −0.29, as illustrated by Figure 6. Apparently, small kurtosis coupled with large skewness poses a threat to the one-bit-matching conjecture. An intuitive explanation for the breakdown observed in this experiment is that once the model skewness becomes relatively large, it can no longer guarantee, even roughly, that kᵢᵐ in equation 2.9 has the same sign as νᵢᵐ, so the one-bit-matching conjecture breaks down in practice.

4 Conclusions

When only skewness and kurtosis are under consideration, we have theoretically proved the so-called one-bit-matching conjecture for ICA under the assumption of vanishing skewness for the model pdfs. We also empirically demonstrated the robustness of the conjecture when this vanishing-skewness assumption is mildly violated and, as a by-product, showed that the kurtosis maximization criterion is actually a special case of the MMI criterion.
Figure 6: Illustration of the ICA performance as the model pdf varies. (The model skewness γ₁ᵐ, γ₂ᵐ and kurtosis ν₁ᵐ, ν₂ᵐ as functions of τ, with the breakdown point and the supergaussian/subgaussian regions marked.)
Acknowledgments

We thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. Z.-Y. Liu thanks Jinwen Ma for some helpful discussions. The work described in this article was fully supported by a grant from the Research Grants Council of the Hong Kong SAR (Project No. CUHK 4336/02E).

References

Amari, S. I., & Chen, T. P. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351.
Amari, S. I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind separation of sources. In Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Attias, H. (1999). Independent factor analysis. Neural Computation, 11, 803–851.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J. F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4, 112–114.
Cardoso, J. F. (1999). High order contrasts for independent component analysis. Neural Computation, 11, 157–192.
Cheung, C. C., & Xu, L. (2000). Some global and local convergence analysis on the information-theoretic independent component analysis approach. Neurocomputing, 30, 79–102.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Everson, R., & Roberts, S. (1999). Independent component analysis: A flexible nonlinearity and decorrelating manifold approach. Neural Computation, 11, 1957–1983.
Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10, 2103–2114.
Lee, T. W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Computation, 11, 417–441.
Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 19–46.
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In Proc. of Int. Conf. on Neural Information Processing. Hong Kong: Springer-Verlag.
Stuart, A., & Ord, J. (1994). Kendall's advanced theory of statistics, Vol. 1: Distribution theory. London: Edward Arnold.
Tong, L., Inouye, Y., & Liu, R. (1993). Waveform-preserving blind estimation of multiple independent sources. Signal Processing, 41, 2461–2470.
Welling, M., & Weber, M. (2001). A constrained EM algorithm for independent component analysis. Neural Computation, 13, 677–689.
Xu, L. (1997). Bayesian ying-yang learning based ICA models. In Proc. 1997 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing VII (pp. 476–485). Florida.
Xu, L., Cheung, C. C., & Amari, S. I. (1998a). Further results on nonlinearity and separation capability of a linear mixture ICA method and learned LPM. In C. Fyfe (Ed.), Proceedings of I&ANN'98 (pp. 39–45).
Xu, L., Cheung, C. C., & Amari, S. I. (1998b). Learned parametric mixture based ICA algorithm. Neurocomputing, 22, 69–80.
Xu, L., Cheung, C. C., Yang, H. H., & Amari, S. I. (1997). Independent component analysis by the information-theoretic approach with mixture of density. In Proc. of 1997 IEEE Intl. Conf. on Neural Networks (IEEE-INNS IJCNN97) (Vol. 3, pp. 1821–1826). Houston, TX.
Xu, L., Yang, H. H., & Amari, S. I. (1996). Signal source separation by mixtures: Accumulative distribution functions or mixture of bell-shape density distribution functions. Research proposal presented at FRONTIER FORUM. Japan: Institute of Physical and Chemical Research.

Received January 2, 2003; accepted July 10, 2003.
LETTER
Communicated by Andrew Barto
Jacobian Conditioning Analysis for Model Validation

Isabelle Rivals
[email protected]

Léon Personnaz
[email protected]
Équipe de Statistique Appliquée, École Supérieure de Physique et de Chimie Industrielles, F75005 Paris, France
Our aim is to stress the importance of Jacobian matrix conditioning for model validation. We also comment on Monari and Dreyfus (2002), where, following Rivals and Personnaz (2000), it is proposed to discard neural candidates that are likely to overfit and/or for which quantities of interest such as confidence intervals cannot be computed accurately. In Rivals and Personnaz (2000), we argued that such models should be discarded on the basis of the condition number of their Jacobian matrix. Monari and Dreyfus (2002), however, suggest making the decision on the basis of the computed values of the leverages, the diagonal elements of the projection matrix on the range of the Jacobian, or "hat" matrix: they propose to discard a model if computed leverages fall outside some theoretical bounds, claiming that this is the symptom of a rank-deficient Jacobian. We question this proposition because, theoretically, the hat matrix is defined whatever the rank of the Jacobian, and because, in practice, the computed leverages of very ill-conditioned networks may respect their theoretical bounds while confidence intervals cannot be estimated accurately enough, two facts that have escaped Monari and Dreyfus's attention. We also recall the most accurate way to estimate the leverages and the properties of these estimates. Finally, we make an additional comment concerning the performance estimation in Monari and Dreyfus (2002).

1 On the Jacobian Matrix of a Nonlinear Model

We deal with the modeling of processes having a certain n-input vector x = [x₁, x₂, …, xₙ]ᵀ and a measured scalar output y. We assume that for any fixed value xᵃ of x,

yᵃ = μ(xᵃ) + wᵃ,  (1.1)

where μ is the regression function and wᵃ is a random variable with zero expectation. A training set {x^k, y^k}_{k=1 to N} is available. The goal is to validate a set of candidate models approximating the regression as accurately as

Neural Computation 16, 401–418 (2004)
© 2003 Massachusetts Institute of Technology
I. Rivals and L. Personnaz
possible and well conditioned enough for a meaningful estimation of a confidence interval.

1.1 Theoretical Results. We consider a family of parameterized functions F = {f(x, θ); x ∈ Rⁿ, θ ∈ R^q} implemented by a neural network. A least-squares estimate θ_LS of the parameters minimizes the quadratic cost function

J(\theta) = \frac{1}{2} \sum_{k=1}^{N} (y^k - f(x^k, \theta))^2.  (1.2)

The Jacobian matrix Z of the model with parameter θ_LS plays an important role in the statistical properties of least-squares estimation. It is defined as the (N, q) matrix with elements

z_{ki} = \left. \frac{\partial f(x^k, \theta)}{\partial \theta_i} \right|_{\theta = \theta_{LS}}.  (1.3)
If F contains the regression, Z is full rank, and the noise w is homoscedastic with variance σ², an estimate of the (1 − α)% confidence interval for the regression at any input xᵃ is asymptotically given by

f(x^a, \theta_{LS}) \pm g(\alpha)\, s \sqrt{(z^a)^T (Z^T Z)^{-1} z^a},  (1.4)

with zᵃ = ∂f(xᵃ, θ)/∂θ |_{θ=θ_LS}, where the function g denotes the inverse of the gaussian cumulative distribution, and where s² = Σ_{k=1}^{N} (e^k)²/(N − q), with e^k = y^k − f(x^k, θ_LS) denoting the kth residual. In Rivals and Personnaz (2000), we further showed that, whatever F, if Z is full rank, the kth leave-one-out error e^k_(k) can be approximated by

e^k_{(k)} \approx \frac{e^k}{1 - h_{kk}} \quad \text{for } k = 1 \text{ to } N,  (1.5)

where the leverages {h_kk} are the diagonal elements of the orthogonal projection matrix on the range of Z, the "hat" matrix H = ZZᴵ (Zᴵ denotes the pseudo-inverse of Z). The matrix Z being full rank, we have H = Z(ZᵀZ)⁻¹Zᵀ. Like any other orthogonal projection matrix, its diagonal elements verify

0 \le h_{kk} \le 1 \quad \text{for } k = 1 \text{ to } N.  (1.6)

For a linear as well as for a nonlinear model, a leverage value measures the influence of the corresponding example on the parameter estimate θ_LS. In
practice, a leverage value larger than 0.5 is considered to indicate a very influential example. For a model linear in its parameters, relation 1.5 is an equality. The case h_kk = 1 implies e^k = 0, and e^k_(k) is not determined. Conversely, however, e^k = 0 does not imply h_kk = 1, and a small residual does not necessarily correspond to a leverage close to one. For a nonlinear model, equation 1.5 being only an approximation, h_kk = 1 does not imply that the corresponding residual is zero or among the smallest residual values. Another property of the diagonal elements of a projection matrix is

\mathrm{trace}(H) = \sum_{k=1}^{N} h_{kk} = \mathrm{rank}(Z).  (1.7)
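For a model linear in its parameters, the identity of equation 1.5 can be verified directly. The sketch below (our own illustration, not from the article) fits a straight line, computes the leverages of the design [1, x] in closed form, and checks e^k/(1 − h_kk) against an explicit refit with example k removed:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def line_leverages(xs):
    """Diagonal of the hat matrix for the design [1, x] (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1.0 / n + (x - mx) ** 2 / sxx for x in xs]

random.seed(0)
xs = [i / 9.0 for i in range(10)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.1) for x in xs]
a, b = fit_line(xs, ys)
res = [y - (a + b * x) for x, y in zip(xs, ys)]
h = line_leverages(xs)

# For a linear model, e_k / (1 - h_kk) is the EXACT leave-one-out residual.
for k in range(len(xs)):
    ak, bk = fit_line(xs[:k] + xs[k + 1:], ys[:k] + ys[k + 1:])
    assert abs((ys[k] - (ak + bk * xs[k])) - res[k] / (1.0 - h[k])) < 1e-10
print(sum(h))  # trace of the hat matrix = rank(Z) = 2 (equation 1.7)
```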
The results expressed in equations 1.4 and 1.5 assume that rank(Z) = q, the number of parameters of the neural network. If rank(Z) < q, some parameters are useless; that is, the model is unnecessarily complex given the training set. Thus, models such that rank(Z) < q should be discarded. Note, however, that if rank(Z) < q, the hat matrix, and hence the leverages, are still defined (see appendix A) and verify equations 1.6 and 1.7. Theoretically, if rank(Z) < q, at least one of Z's singular values is zero (see appendix A).

1.2 Numerical Considerations. In practice, it is difficult to make a statement about the rank deficiency of Z. Although there exist efficient algorithms for the estimation of the singular values (Golub & Reinsch, 1970; Anderson et al., 1999), these algorithms generally do not lead to exactly zero singular values except in particular cases. Thus, the notion of the rank of Z is not usable for deciding to discard a model. The decision should instead be based on whether the inverse of ZᵀZ, needed for the estimation of a confidence interval, can be computed accurately. The most accurate estimation is performed using the singular value decomposition (SVD) of Z, Z = UΣVᵀ, rather than the Cholesky or LU decomposition of ZᵀZ or the QR decomposition of Z (Golub & Van Loan, 1983). We denote by Û, V̂, and {σ̂ᵢ} the computed versions of U, V, and of the singular values {σᵢ} obtained with a Golub-Reinsch-like SVD algorithm (used by Linpack, Lapack, and Matlab, for example). According to appendix A, one obtains the computed version of (ZᵀZ)⁻¹ using

\widehat{(Z^T Z)^{-1}} = \hat V \, \widehat{(\Sigma^T \Sigma)^{-1}} \, \hat V^T, \quad \text{with } [\widehat{(\Sigma^T \Sigma)^{-1}}]_{ii} = \frac{1}{\hat\sigma_i^2} \text{ for } i = 1 \text{ to } q, \; [\widehat{(\Sigma^T \Sigma)^{-1}}]_{ij} = 0 \text{ for } i \ne j.  (1.8)

The question is how to detect whether this computed version is accurate. The absolute error on the ith singular value is bounded by (Golub & Van Loan, 1983)

|\sigma_i - \hat\sigma_i| \le \sigma_1 \varepsilon \quad \text{for } i = 1 \text{ to } q,  (1.9)
where ε is the computer unit roundoff (ε ≈ 10⁻¹⁶ for usual computers). The relative error on the smallest singular value is hence bounded by

\frac{|\sigma_q - \hat\sigma_q|}{\sigma_q} \le \frac{\sigma_1}{\sigma_q}\,\varepsilon = \kappa(Z)\,\varepsilon.  (1.10)

It is directly related to the condition number κ(Z) of Z, the ratio of its largest to its smallest singular value. When κ(Z) reaches 1/ε = 10¹⁶, the relative error on the smallest singular value may be 100%. Since equation 1.8 involves the inverse of the squared Jacobian, and hence the inverse of the computed squared singular values, neural candidates with κ̂(Z) > 10⁸ should be discarded, as recommended in Rivals and Personnaz (2000).

If κ̂(Z) ≪ 10⁸, one can compute (ZᵀZ)⁻¹ with good precision using equation 1.8. Moreover, according to the error bound of equation 1.10, for a network with κ(Z) ≤ 10⁸, the relative precision on the smallest singular value σ_q is smaller than 10⁻⁸, and it is even smaller for the other singular values. The precision on κ(Z) itself is hence excellent when κ̂(Z) ≤ 10⁸.

According to equation 1.9, it makes sense to estimate r = rank(Z), if needed, as the number of computed singular values that are larger than the threshold σ̂₁ε:

\hat r = \mathrm{card}\{\hat\sigma_i > \hat\sigma_1 \varepsilon, \; i = 1 \text{ to } q\}.  (1.11)
However, the decision to discard a network should not be based on r̂, because one may have r̂ = q while κ(Z) > 10⁸, and hence while (ZᵀZ)⁻¹ is inaccurate. When r = q, one could think of computing the leverages using equation 1.8, since H = Z(ZᵀZ)⁻¹Zᵀ. However, as mentioned in Rivals and Personnaz (2000), it is better to use (see appendix A)

\hat h_{kk(R\&P)} = \sum_{i=1}^{\hat r} (\hat u_{ki})^2 \quad \text{for } k = 1 \text{ to } N,  (1.12)
where r̂ is computed with equation 1.11. As opposed to equation 1.8, expression 1.12 does not involve the inverse of the square of possibly inaccurate small singular values. The leverages computed with equation 1.12 are hence less sensitive to the ill conditioning of Z than (ZᵀZ)⁻¹ computed with equation 1.8. However, we must consider the error on the q first columns of U, which span the range of Z. Let U_q denote the matrix of the q first columns of U. The angle between the range of U_q and that of its computed version Û_q (see appendix A) is approximately bounded by (Anderson et al., 1999)

\mathrm{angle}(R(U_q), R(\hat U_q)) \le \frac{\sigma_1 \varepsilon}{\min_{j \ne i} |\sigma_i - \sigma_j|}.  (1.13)
Thus, even if a model is very ill conditioned, as long as the singular values of its Jacobian are not too close to one another, the leverages can be computed accurately with equation 1.12. It can also be shown that Û is quasi-orthogonal (Golub & Van Loan, 1983):

\hat U = W + \Delta W \quad \text{with } W^T W = W W^T = I_N \text{ and } \|\Delta W\|_2 \le \varepsilon,  (1.14)
where I_N denotes the identity matrix of order N. Thus, for leverage values obtained with equation 1.12, even if Û is not an accurate version of U, relations 1.6 and 1.7 are satisfied to roughly unit roundoff.¹ Note that if one is interested only in the leverages, they can be computed as accurately using the QR decomposition as with equation 1.12 (Dongarra, Moler, Bunch, & Stewart, 1979) (see appendix A). The advantage is that the QR decomposition demands far fewer computations than the SVD.

2 Comment on the Proposition of Monari and Dreyfus (2002)

Monari and Dreyfus do not consider the notion of condition number but focus on the values of the leverages. They choose to discard a model when relations 1.6 and 1.7 are not satisfied by the computed values of its leverages. There are two problems with this proposition:

1. Property 1.7 is an equality, but Monari and Dreyfus do not specify a numerical threshold that could be used to make a decision. There is a similar problem with equation 1.6.

2. Instead of using equation 1.12, Monari and Dreyfus compute the leverages according to

\hat h_{kk(M\&D)} = \sum_{i=1}^{q} \left( \frac{1}{\hat\sigma_i} \sum_{j=1}^{q} z_{kj} \hat v_{ji} \right)^2 \quad \text{for } k = 1 \text{ to } N,  (2.1)
equation A.4 in Monari and Dreyfus (2002). This equation is derived from the expression Z V (\Sigma^T \Sigma)^{-1} V^T Z^T of the hat matrix if Z is full rank,

¹ Proof that equation 1.6 is satisfied to unit roundoff: \hat h_{kk(R\&P)} = \sum_{i=1}^{\hat r} (\hat u_{ki})^2 \le \sum_{i=1}^{N} (\hat u_{ki})^2 = 1 to unit roundoff. Proof that equation 1.7 is satisfied to unit roundoff: \sum_{k=1}^{N} \hat h_{kk(R\&P)} = \mathrm{trace}(\hat U_q (\hat U_q)^T) = \mathrm{trace}((\hat U_q)^T \hat U_q) = \hat r to unit roundoff.
an expression that is, strangely enough, obtained by using the SVD of Z for (ZᵀZ)⁻¹ (expression A.5 of this article), but not for Z itself. Whereas the computed values of the leverages using equation 1.12 are accurate provided the singular values are not too close (property 1.13) and always satisfy equations 1.6 and 1.7 to unit roundoff (property 1.14), there is no such result concerning the computed values obtained with equation 2.1. This has the following consequences:

• Assuming that a consistent threshold were specified for equations 1.6 and 1.7, because the leverages computed with equation 2.1 may be inaccurate, models well enough conditioned for the accurate estimation of confidence intervals may wrongly be discarded.

• Conversely, in the case of models whose Jacobian Z is ill conditioned, the leverages computed with equation 2.1 (and a fortiori with equation 1.12) may satisfy equations 1.6 and 1.7, and hence the corresponding models may not be discarded, while confidence intervals are meaningless.

In the next section, these consequences are illustrated with numerical examples.

3 Numerical Examples

3.1 Accuracy of the Leverages Computed with Equations 1.12 and 2.1. We construct an (N, 2) matrix Z such that its condition number can be adjusted by tuning a parameter α, while the range of Z does not depend on the value of α:

Z = [\mathbf{1} \;\; \mathbf{1} + \alpha c] = \begin{pmatrix} 1 & 1 + \alpha c^1 \\ \vdots & \vdots \\ 1 & 1 + \alpha c^N \end{pmatrix},  (3.1)

where the {c^k} are realizations of gaussian random variables (see the Matlab program given in appendix B). For a given c, R(Z) = span(1, c), and hence the true leverages {h_kk} have the same values for all α ≠ 0. As an example, for a single realization of Z, with N = 4 and α = 10⁻¹², we obtain the results reproduced in appendix B. In order to give statistically significant results, we performed 10,000 realizations of the matrix Z. Table 1 displays the averaged results.
When using equation 1.12, the computed values {ĥ_kk(R&P)} satisfy equation 1.6, and their sum satisfies equation 1.7 to roughly unit roundoff for values of κ̂(Z) up to 10¹⁶: as a matter of fact, this is ensured by property 1.14 of Û. Moreover, according to property 1.13, since σ̂₁ − σ̂₂ ≈ 2.8 for all α, the values of the {ĥ_kk(R&P)} themselves are accurate.
Table 1: Estimation of the Leverages with Formulas 1.12 ({ĥ_kk(R&P)}) and 2.1 ({ĥ_kk(M&D)}).

α        ⟨κ̂(Z)⟩         ⟨|r̂ − Σ_k ĥ_kk(R&P)|⟩    ⟨|q − Σ_k ĥ_kk(M&D)|⟩
10⁻⁶     3.2154·10⁶      3.7881·10⁻¹⁶              1.2029·10⁻¹⁰
10⁻⁸     3.2154·10⁸      1.8516·10⁻¹⁶              1.2316·10⁻⁸
10⁻¹²    3.2154·10¹²     1.8443·10⁻¹⁶              1.2213·10⁻⁴
10⁻¹⁵    Inf             1.5901·10⁻¹⁶              –
When using equation 2.1, the sum of the {ĥ_kk(M&D)} is far from satisfying equation 1.7 to unit roundoff, even when Z is well conditioned. The {ĥ_kk(M&D)} may also fail to satisfy equation 1.6, whereas the {ĥ_kk(R&P)} always do (run the program of appendix B with α = 10⁻¹⁵). This example illustrates the lower accuracy of equation 2.1 with respect to equation 1.12. If the values of the leverages are central to a given procedure, they definitely should be computed according to equation 1.12. The results are similar for any N and any vector c, as well as for Jacobian matrices of actual problems.²

3.2 Neural Modeling. We consider a single-input process simulated with

y^k = \mathrm{sinc}(10(x^k + 1)) + w^k \quad \text{for } k = 1 \text{ to } N,  (3.2)

where sinc denotes the cardinal sine function and N = 50. The noise values {w^k} are drawn from a gaussian distribution with variance σ² = 2·10⁻³, and the input values {x^k} from a uniform distribution in [−1, 1]. Neural models with one layer of n_h "tanh" hidden neurons and a linear output neuron are considered, except the network without hidden neurons, which consists of a single linear neuron. They are trained 50 times using a quasi-Newton algorithm starting from different small random initial parameter values, in order to increase the chance of reaching an absolute minimum; the parameters corresponding to the lowest value of the cost function 1.2 are kept. The corresponding mean square training error is denoted by MSTE. The approximate leave-one-out score computed with equations 1.5 and 1.12 is denoted by ALOOS. The ALOOS is to be compared to a better, unbiased estimate of the performance computed on an independent test set of 500 examples drawn from the same distribution as the training set (such a test set is usually not available in real life), the MSPE.

² The economic QR decomposition of Z can also be used (see appendix A): the values computed with equation A.13 do not differ from those computed with equation 1.12 by more than roughly the computer unit roundoff (check with the Matlab program of appendix B).
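As footnote 2 notes, an orthonormal basis of R(Z) is all that is needed for the leverages. A pure-Python sketch (ours; the article uses Matlab and library SVD/QR routines) with modified Gram-Schmidt, applied to the matrix Z = [1, 1 + αc] of equation 3.1, shows that the leverages indeed do not depend on α:

```python
def leverages_via_qr(Z):
    """h_kk = squared row norms of Q from a thin QR of Z
    (modified Gram-Schmidt; assumes Z has full column rank)."""
    n, q = len(Z), len(Z[0])
    Q = []
    for i in range(q):
        v = [Z[k][i] for k in range(n)]
        for u in Q:                      # orthogonalize against previous columns
            proj = sum(u[k] * v[k] for k in range(n))
            v = [v[k] - proj * u[k] for k in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        Q.append([x / norm for x in v])
    return [sum(Q[i][k] ** 2 for i in range(q)) for k in range(n)]

c = [-1.3, 0.4, 2.2, -0.7]
for alpha in (1e-3, 1e-6):
    Z = [[1.0, 1.0 + alpha * ck] for ck in c]
    # R(Z) = span(1, c), so the result is the same for every alpha != 0:
    print([round(h, 6) for h in leverages_via_qr(Z)])  # the two lists are identical
```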
Table 2: Training of Neural Models with an Increasing Number n_h of Hidden Neurons.

n_h    MSTE          ALOOS         MSPE          κ̂(Z)
0      3.699·10⁻²    4.136·10⁻²    7.039·10⁻²    1.8
1      9.506·10⁻³    1.144·10⁻²    1.083·10⁻²    6.7·10²
2      3.181·10⁻³    4.831·10⁻³    6.866·10⁻³    8.4·10²
3      2.153·10⁻³    4.039·10⁻³    4.783·10⁻³    2.1·10⁹
4      1.888·10⁻³    5.316·10⁻⁷    4.436·10⁻³    9.0·10⁹

Note: The bottom two rows correspond to κ̂(Z) > 10⁸ (ill-conditioned networks).
The simulations are programmed in the C language. The SVD is performed with the Numerical Recipes routine "svdcmp" (Press, Teukolsky, Vetterling, & Flannery, 2002), but in double precision. The leverages {ĥ_kk(M&D)} (see equation 2.1) and the {ĥ_kk(R&P)} (see equation 1.12) are then computed as in the Matlab program given in appendix B. The results obtained are shown in Table 2. The outputs and the residuals of the networks with two and three hidden neurons are shown in Figures 1 and 2, respectively. Although the model with three hidden neurons has a slightly lower MSPE than the network with two, it is unreliable in the sense that one is unable to estimate correct confidence intervals for the regression with this network: computing (ZᵀZ)⁻¹ with equation 1.8 and multiplying it by ZᵀZ leads to a matrix that differs significantly from the identity matrix (by more than two for some of its elements). Fortunately, following Rivals and Personnaz (2000), we can discard this network right away on the basis of its condition number, which is too large.

Suppose that κ̂(Z) is ignored and that, following Monari and Dreyfus (2002), only the computed leverage values are considered. The results obtained with equations 1.12 and 2.1 are given in Table 3. Both relations 1.6 and 1.7 are satisfied for the networks with three and even four hidden neurons, whether the leverages are computed with equation 1.12 ({ĥ_kk(R&P)}) or equation 2.1 ({ĥ_kk(M&D)}).

Table 3: Observing the Computed Leverage Values.

n_h    max(ĥ_kk(R&P))     r̂ − Σ_k ĥ_kk(R&P)    max(ĥ_kk(M&D))     q − Σ_k ĥ_kk(M&D)
0      8.1648771·10⁻²     2.22·10⁻¹⁶            8.1648771·10⁻²     0.0
1      8.5215146·10⁻¹     0.0                   8.5215146·10⁻¹     1.1·10⁻¹⁴
2      8.6468122·10⁻¹     0.0                   8.6468122·10⁻¹     9.8·10⁻¹⁵
3      8.8121097·10⁻¹     1.8·10⁻¹⁵             8.8121097·10⁻¹     6.2·10⁻¹¹
4      9.9999983·10⁻¹     0.0                   9.9999981·10⁻¹     1.6·10⁻⁸

It proves that checking only equations 1.6 and 1.7 for the leverage values does not lead to discarding the unusable models with three and four hidden neurons. Let us take a closer look at the performance of the models with two and three hidden neurons, and at the interpretation of the leverage values. Figures 1 and 2 exhibit the training examples and the residuals corresponding to leverage values larger than 0.5. For both networks, the largest leverage value corresponds to an example that lies at the boundary of the input domain explored by the training set. This is a typical situation of an influential example. For the network with two hidden neurons, a second leverage value is larger than 0.5: the fact that the corresponding example is located at an inflexion point of the model output is the sign of its large influence on the parameter estimate. Let us now examine the network with four hidden neurons. Four of its leverage values are larger than 0.5. These values are the following (from the
Figure 1: Network with two hidden neurons (κ̂(Z) = 8.4·10²). (a) Regression (dotted line), model output (continuous line), training set (circles). (b) Residuals. The training example and the residual corresponding to the largest leverage (8.65·10⁻¹) are marked with a circle filled in black. A second leverage is larger than 0.5 (5.89·10⁻¹), and the corresponding training example and residual are marked with a circle filled in gray.
Figure 2: Network with three hidden neurons (κ̂(Z) = 2.1·10⁹). (a) Regression (dotted line), model output (continuous line), training set (circles). (b) Residuals. The training example and the residual corresponding to the largest leverage (8.81·10⁻¹) are marked with a circle filled in black.

smallest to the largest corresponding abscissa; see Figure 3):

ĥ_{38 38(R&P)} = 7.3268195·10⁻¹
ĥ_{39 39(R&P)} = 9.9999983·10⁻¹
ĥ_{33 33(R&P)} = 9.9981558·10⁻¹
ĥ_{1 1(R&P)} = 8.4477304·10⁻¹

The large leverage values correspond to the two extreme examples (38 and 1) and two overfitted examples (39 and 33). In fact, large leverage values (i.e., values close to one but not necessarily larger than one) are the symptom of local overfitting at the corresponding training examples, or of extreme examples of the training set that are relatively isolated, and hence influential. However, checking only equations 1.6 and 1.7 for the leverage values would not lead to the detection of the misbehavior of this network. Finally, this example shows that ill conditioning is not systematically related to leverage values close to one: the largest leverage value of the very ill-conditioned neural network with three hidden neurons (κ̂(Z) = 2.1·10⁹) equals 8.81·10⁻¹ and is, hence, not much larger than the largest leverage value of the well-conditioned neural network with two hidden
Figure 3: Network with four hidden neurons (κ̂(Z) = 9.0·10⁹). (a) Regression (dotted line), model output (continuous line), training set (circles). (b) Residuals. The training example and the residual corresponding to the largest leverage (9.9999983·10⁻¹) are marked with a circle filled in black. The training examples and residuals corresponding to three other leverages larger than 0.5 are marked with circles filled in gray.

neurons (κ̂(Z) = 8.4·10²), which equals 8.65·10⁻¹. Ill conditioning is not systematically related to local overfitting but rather to a global parameter redundancy.

4 Conclusion

Our conclusion is threefold:

• In order to validate only neural candidates whose approximate parameter covariance matrix and confidence intervals can be reliably estimated, the first condition should be that the condition number of their Jacobian does not exceed the square root of the inverse of the computer unit roundoff (usually 10⁸).

• If a procedure relies heavily on the computed values of the diagonal elements of the hat matrix, the leverages, the latter should be computed according to expression 1.12, as recommended in Rivals and Personnaz (2000), rather than according to expression 2.1 given in Monari and Dreyfus (2002). Only the computation according to equation 1.12
ensures that the computed hat matrix is a projection matrix and that it is accurate.

• For candidates whose condition number is small enough and for which the leverages have been computed as accurately as possible according to equation 1.12, one may additionally check that none of the leverage values is close to one, as already proposed in Rivals and Personnaz (1998). Leverage values close to, but not necessarily larger than, one are indeed the symptom of overfitted examples or of isolated examples at the border of the input domain delimited by the training set.

5 A Further Comment on Monari and Dreyfus (2002)

This comment concerns the estimation of the performance of the selection method presented in Monari and Dreyfus (2002), for its comparison to selection methods proposed by other authors. As in Anders and Korn (1999), the process to be modeled is simulated, and its output is corrupted by a gaussian noise with known variance σ². In order to perform statistically significant comparisons between selection methods, 1000 realizations of a training set of size N are generated. A separate test set of 500 examples is used for estimating the generalization mean square error (GMSE) of the selected models, and the following performance index is computed:
\rho = \frac{GMSE - \sigma^2}{\sigma^2},  (5.1)
equation 5.3 in Monari and Dreyfus (2002). In the case N = 100, two values of ⟨ρ⟩, the average value of ρ over the 1000 training sets, are given: (1) a value of 126% corresponding to the above definition and (2) a value of 27% corresponding to a GMSE computed on part of the test set only. Strangely enough, 3% of the examples of the test set are considered as outliers and discarded from the test set. This value of 27% is compared to the values of ⟨ρ⟩ obtained by other selection procedures with the whole test set. This second value of 27% is meaningless. Putting aside the fact that considering examples of the performance-estimation set as outliers is questionable, let us call GMSE* the GMSE obtained in (2). In the case most favorable to Monari and Dreyfus (2002), the case where we assume that the examples they discarded correspond to the largest values of the gaussian noise (and not to model errors), this GMSE* should be compared not to σ² but to the variance σ*² of a noise with a truncated gaussian distribution (without its two 1.5% tails). In the example, σ² = 5·10⁻³ and σ*² = 4.4·10⁻³. Thus, the ratio
GMSE¤ ¡ ¤
¤
>
GMSE¤ ¡
would be more representative of the actual model performance.
(5.2)
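As a minimal Python sketch of the index of equation 5.1 and of the comparison in equation 5.2: the variances below are the ones quoted in the text, but the GMSE* value is a hypothetical number of our choosing, used only to illustrate the inequality.

```python
# Sketch of the performance index rho = (GMSE - sigma^2) / sigma^2 (equation 5.1).
# sigma2 and sigma2_star are the values quoted in the text; gmse_star is a
# hypothetical GMSE* chosen for illustration.

def rho(gmse, noise_var):
    """Relative excess of the generalization MSE over the noise variance."""
    return (gmse - noise_var) / noise_var

sigma2 = 5e-3         # variance of the full gaussian noise
sigma2_star = 4.4e-3  # variance after truncating the two 1.5% tails (from the text)
gmse_star = 5.6e-3    # hypothetical GMSE* measured on the truncated test set

# Since sigma2_star < sigma2, comparing GMSE* to the truncated-noise variance
# yields a larger index, as stated in equation 5.2:
assert rho(gmse_star, sigma2_star) > rho(gmse_star, sigma2)
```

The inequality of equation 5.2 holds for any positive GMSE*, since ρ is a decreasing function of the noise variance.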
Jacobian Conditioning Analysis for Model Validation
413
To conclude, the second value of ⟨ρ⟩ obtained in Monari and Dreyfus (2002) by discarding some examples of the test set can by no means be compared to those obtained by other selection procedures correctly using the whole test set for the performance estimation.

Appendix A

This appendix summarizes the results used in this article; for details, see Golub and Van Loan (1983).

A.1 Theorem for the Singular Value Decomposition (SVD). Consider an (N, q) matrix Z with N ≥ q and rank(Z) = r ≤ q. There exist an (N, N) orthogonal matrix U and a (q, q) orthogonal matrix V such that

Z = U Σ Vᵀ, (A.1)

where Σ is an (N, q) matrix such that [Σ]_ij = 0 for i ≠ j, and whose diagonal elements {[Σ]_ii}, denoted by {σ_i}, are termed the singular values of Z, with

σ_1 ≥ σ_2 ≥ ··· ≥ σ_q ≥ 0. (A.2)

If rank(Z) = r < q, then σ_1 ≥ ··· ≥ σ_r ≥ σ_{r+1} = ··· = 0. The r first columns of U form an orthonormal basis of the range of Z.

A.2 Condition Number Using SVD. If rank(Z) = q, and using the matrix two-norm,³ the condition number κ(Z) of Z is expressed as

κ(Z) = ‖Z‖₂ ‖Z⁻¹‖₂ = σ_1 / σ_q. (A.3)

If κ(Z) is large, the matrix Z is ill conditioned. We have the property

κ(ZᵀZ) = (κ(Z))². (A.4)
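The property of equation A.4 can be checked numerically. The following pure-Python sketch (the example matrix is our own) uses the fact that, for a two-column matrix, the singular values are the square roots of the closed-form eigenvalues of the 2×2 matrix ZᵀZ, so no linear-algebra library is needed:

```python
import math

# Check of equations A.3 and A.4 for an (N, 2) matrix Z: the singular values
# of Z are the square roots of the eigenvalues of the 2x2 matrix Z'Z.

def singular_values_2col(Z):
    """Singular values of an (N, 2) matrix via the eigenvalues of Z'Z."""
    a = sum(row[0] * row[0] for row in Z)
    b = sum(row[0] * row[1] for row in Z)
    c = sum(row[1] * row[1] for row in Z)
    d = math.sqrt((a - c) ** 2 + 4 * b * b)
    lam1, lam2 = (a + c + d) / 2, (a + c - d) / 2
    return math.sqrt(lam1), math.sqrt(max(lam2, 0.0))

Z = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]   # here Z'Z = [[2, 2], [2, 5]]
s1, s2 = singular_values_2col(Z)
kappa = s1 / s2              # kappa(Z) = sigma_1 / sigma_q (equation A.3)
# The eigenvalues of Z'Z are 6 and 1, so kappa(Z'Z) = 6 = kappa(Z)^2 (equation A.4):
assert abs(kappa ** 2 - 6.0) < 1e-9
```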
A.3 Inverse of ZᵀZ Using SVD. If rank(Z) = q, the inverse of ZᵀZ exists and, using the SVD of Z, can be expressed as

(ZᵀZ)⁻¹ = V (ΣᵀΣ)⁻¹ Vᵀ, (A.5)

³ The two-norm of a matrix A is defined as: ‖A‖₂ = sup_{x≠0} ‖Ax‖₂ / ‖x‖₂.
414
I. Rivals and L. Personnaz
where (ΣᵀΣ)⁻¹ is a (q, q) diagonal matrix with

[(ΣᵀΣ)⁻¹]_ii = 1/σ_i², for i = 1 to q. (A.6)

A.4 Pseudo-Inverse of Z Using SVD. Any (N, q) matrix Z with rank r ≤ q has a pseudo-inverse. It equals

Zᴵ = V Σᴵ Uᵀ, (A.7)

where Σᴵ is a (q, N) matrix whose only nonzero elements are the first r diagonal elements:

[Σᴵ]_ii = 1/σ_i, for i = 1 to r. (A.8)

A.5 Orthogonal Projection Matrix on the Range of Z Using SVD. The (N, N) projection matrix H on the range of any (N, q) matrix Z is given by

H = Z Zᴵ. (A.9)

Using the SVD of Z, we obtain

H = U Σ Vᵀ V Σᴵ Uᵀ = U Σ Σᴵ Uᵀ, (A.10)

where the matrix Σ Σᴵ is hence an (N, N) diagonal matrix whose r first diagonal elements are equal to 1 and all the others to 0 (see Golub & Van Loan, 1983). Thus, the diagonal elements of H, the leverages, are given by

h_kk = Σ_{i=1}^r (u_ki)², for k = 1 to N. (A.11)

A.6 Theorem for the QR Decomposition. Consider an (N, q) matrix Z with N ≥ q and rank(Z) = q. There exist an (N, N) orthogonal matrix Q and an upper triangular (q, q) matrix R such that

Z = Q [R; 0]. (A.12)

The q first columns of Q form an orthonormal basis of the range of Z.

A.7 Leverages Using QR. Using the QR decomposition of Z, we obtain:

h_kk = Σ_{i=1}^q (q_ki)², for k = 1 to N. (A.13)
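The QR-based leverage formula of equation A.13 can be sketched in pure Python (an illustrative sketch, not the authors' code; the example matrix and the modified Gram-Schmidt orthonormalization are our own choices):

```python
import math

# Leverages via the QR decomposition (equation A.13): h_kk = sum_i q_ki^2,
# with the orthonormal columns of Q obtained here by modified Gram-Schmidt.
# Pure-Python sketch for a small full-rank (N, q) matrix.

def leverages_qr(Z):
    N, q = len(Z), len(Z[0])
    cols = [[Z[k][j] for k in range(N)] for j in range(q)]
    Q = []
    for v in cols:
        w = v[:]
        for u in Q:  # remove components along previous orthonormal columns
            proj = sum(wi * ui for wi, ui in zip(w, u))
            w = [wi - proj * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        Q.append([wi / norm for wi in w])
    # h_kk = sum over the q orthonormal columns of q_ki^2
    return [sum(u[k] ** 2 for u in Q) for k in range(N)]

Z = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 4.0]]
h = leverages_qr(Z)
# The leverages lie in [0, 1] and sum to q, the number of columns:
assert all(0.0 <= hk <= 1.0 for hk in h)
assert abs(sum(h) - 2.0) < 1e-12
```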
A.8 Angle Between Two Subspaces. Let S1 and S2 denote the ranges of two (N, q) matrices Z1 and Z2, and H1 and H2 the orthogonal projection matrices on S1 and S2. The distance between the two subspaces S1 and S2 is defined as

dist(S1, S2) = ‖H1 − H2‖₂. (A.14)

The angle between S1 and S2 is defined as

angle(S1, S2) = arcsin(dist(S1, S2)). (A.15)
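As a minimal sketch of equations A.14 and A.15 (our own example, not from the source): for two one-dimensional subspaces of R², the projection matrices are H_i = u_i u_iᵀ for unit vectors u_i, and the two-norm of the symmetric difference H1 − H2 is its largest eigenvalue magnitude, computable in closed form for a 2×2 matrix.

```python
import math

# dist(S1, S2) = ||H1 - H2||_2 and angle(S1, S2) = arcsin(dist) for two lines
# in R^2 spanned by unit vectors u1 and u2 (equations A.14 and A.15).

def dist_subspaces(u1, u2):
    H1 = [[u1[i] * u1[j] for j in range(2)] for i in range(2)]
    H2 = [[u2[i] * u2[j] for j in range(2)] for i in range(2)]
    a = H1[0][0] - H2[0][0]
    b = H1[0][1] - H2[0][1]
    c = H1[1][1] - H2[1][1]
    # eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]
    d = math.sqrt((a - c) ** 2 + 4 * b * b)
    return max(abs((a + c + d) / 2), abs((a + c - d) / 2))

theta = 0.3
u1 = (1.0, 0.0)
u2 = (math.cos(theta), math.sin(theta))
dist = dist_subspaces(u1, u2)
# arcsin(dist) recovers the angle between the two lines:
assert abs(math.asin(dist) - theta) < 1e-9
```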
Appendix B

Following is the text of a Matlab program that constructs an ill-conditioned matrix Z (for a small value of α) and computes the leverage values using SVD, formulas 1.12 and 2.1, and also using the more economical QR decomposition, which is as accurate as equation 1.12:

clc
clear all
format compact
format short;
% construction of the (N,q) matrix Z
randn('seed',12345);
N=4; q=2;
alpha = 1e-12
c = randn(N,1);
Z = [ones(N,1) ones(N,1)+alpha*c];
condZ = cond(Z)
% singular value decomposition of Z
[U,S,V] = svd(Z);
s = diag(S);
diff_s = s(1)-s(2)
% "True" leverages
Z1 = [ones(N,1) c];
[U1,S1,V1] = svd(Z1);
diagH_true = zeros(N,1);
for k=1:N
    for i=1:q
        diagH_true(k) = diagH_true(k) + U1(k,i)^2;
    end
end
diagH_true = diagH_true
% Rivals and Personnaz estimates (1.12) of the leverages
tol = max(s)*eps;
r = sum(s > tol);
diagH_RP = zeros(N,1);
for k=1:N
    for i=1:r
        diagH_RP(k) = diagH_RP(k) + U(k,i)^2;
    end
end
diagH_RP = diagH_RP
r_sumd_RP = r-sum(diagH_RP)
% Monari and Dreyfus estimates (2.1) of the leverages
diagH_MD = zeros(N,1);
for k=1:N
    for i=1:q
        toto = 0;
        for j=1:q
            toto = toto + Z(k,j)*V(j,i);
        end
        diagH_MD(k) = diagH_MD(k) + (toto/s(i))^2;
    end
end
diagH_MD = diagH_MD
q_sumd_MD = q-sum(diagH_MD)
% Economic estimates of the leverages using the QR decomposition
[Q,R] = qr(Z);
diagH_QR = zeros(N,1);
for k=1:N
    for i=1:q
        diagH_QR(k) = diagH_QR(k) + Q(k,i)^2;
    end
end
diagH_QR = diagH_QR
r_sumd_QR = r-sum(diagH_QR)
Output of the program:

alpha = 1.0000e-12
condZ = 2.8525e+12
diff_s = 2.8284
diagH_true =
    0.2719
    0.2580
    0.7758
    0.6943
diagH_RP =
    0.2719
    0.2580
    0.7759
    0.6943
r_sumd_RP = 0
diagH_MD =
    0.2720
    0.2579
    0.7756
    0.6944
q_sumd_MD = 1.8643e-05
diagH_QR =
    0.2719
    0.2580
    0.7759
    0.6943
r_sumd_QR = 0
References

Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323.

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., & Sorensen, D. (1999). LAPACK User's Guide (3rd ed.). Philadelphia: SIAM.

Dongarra, J., Moler, C. B., Bunch, J. R., & Stewart, G. W. (1979). LINPACK User's Guide. Philadelphia: SIAM.

Golub, G. H., & Reinsch, C. (1970). Singular value decomposition and least-squares solutions. Numerische Mathematik, 14, 403–420.
Golub, G. H., & Van Loan, C. F. (1983). Matrix computations. Baltimore, MD: Johns Hopkins University Press.

Monari, G., & Dreyfus, G. (2002). Local overfitting control via leverages. Neural Computation, 14, 1481–1506.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2002). Numerical recipes in C. Cambridge: Cambridge University Press.

Rivals, I., & Personnaz, L. (1998). Construction of confidence intervals in neural modeling using a linear Taylor expansion. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling. Leuven.

Rivals, I., & Personnaz, L. (2000). Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 13, 463–484.

Received May 31, 2002; accepted January 21, 2003.
LETTER
Communicated by Andrew Barto
Reply to the Comments on “Local Overfitting Control via Leverages” in “Jacobian Conditioning Analysis for Model Validation” by I. Rivals and L. Personnaz

Yacine Oussar
[email protected]
Gaétan Monari
[email protected]
Gérard Dreyfus
[email protected]
ESPCI, Laboratoire d’Electronique, F-75005 Paris, France
“Jacobian Conditioning Analysis for Model Validation” by Rivals and Personnaz in this issue is a comment on Monari and Dreyfus (2002). In this reply, we disprove their claims. We point to flawed reasoning in the theoretical comments and to errors and inconsistencies in the numerical examples. Our replies are substantiated by seven counterexamples, inspired by actual data, which show that the comments on the accuracy of the computation of the leverages are unsupported and that following the approach they advocate leads to discarding valid models or validating overfitted models.

1 Introduction

“Jacobian Conditioning Analysis for Model Validation” by Rivals and Personnaz (this issue) is a detailed comment on Monari and Dreyfus (2002), from the second sentence of their abstract to the last paragraph of their text. The authors claim that we “followed” a previous, controversial (Larsen & Hansen, 2001) article of theirs (Rivals & Personnaz, 2000). In this reply, we disprove all of their claims. We point to flawed reasoning in the theoretical comments and to errors and inconsistencies in the numerical examples. Our replies to the comments are substantiated by seven counterexamples, which show that their comments on the accuracy of the computation of the leverages are unsupported and that following their approach leads to making wrong decisions: discarding valid models or validating overfitted models. This letter begins by disproving the comments in “Jacobian Conditioning Analysis” on the accuracy of the computation of the leverages. We then show that the authors’ comments on model validation are erroneous, and we provide several counterexamples in which decisions made on the basis

Neural Computation 16, 419–443 (2004)
© 2003 Massachusetts Institute of Technology
420
Y. Oussar, G. Monari, and G. Dreyfus
of the condition-number selection criterion that the authors advocate are wrong. We conclude by summarizing the arguments that disprove all three conclusions of Rivals and Personnaz’s comments.

2 Reply to the Comments on the Accuracy of the Computation of the Leverages

Section 1 of “Jacobian Conditioning Analysis” is entitled “On the Jacobian Matrix of a Nonlinear Model.” The authors first recall theoretical results, and they recall the validation method advocated in Rivals and Personnaz (2000). That part will be replied to in section 3 of this letter. Section 2 of “Jacobian Conditioning Analysis” is entitled “Comment on the Proposition of Monari and Dreyfus (2002).” The authors claim that relation A.4 of our article (relation 2.1 of their comment) is not accurate enough for the purpose of model validation and that a more classical computation method should be used instead. In order to substantiate their claims, they exhibit a small handcrafted numerical example (section 3.1 of their article): we show below that it is irrelevant to model selection and that it is inconsistent with the claims made by the authors in the previous section of their article. They show that the traditional method can compute the leverages with an accuracy of 10⁻¹⁶; however, we show in the following that the authors fail to provide evidence that such accuracy is relevant in the context of nonlinear model selection and that they fail to provide evidence that our method does not meet the actual accuracy requirements in that context.

The recommended approach to a problem in numerical analysis consists of asking two questions: (1) What numerical accuracy should be achieved in order to get insight into the problem at hand? (2) How can the above accuracy be achieved? Obviously, that approach should be used in discussing the accuracy of the computation of the leverages for model selection. In Monari and Dreyfus (2002), we discussed nonlinear model selection in the context of machine learning.
The computation of the leverages of the observations is one of the ingredients of the original validation method that we describe. Therefore, question 1, which is not asked in “Jacobian Conditioning Analysis,” is: For the purpose of model selection, what is the desired accuracy for the computation of the leverages of models obtained by training from examples? The answer to that question is straightforward: the accuracy should be of the magnitude of the “noise” on the leverages. Since training is generally performed by iterative optimization of a cost function, it is stopped when the two models obtained in two successive iterations are considered “identical.” Therefore, the numerical noise on the leverages is the difference between the values of the leverages of two models that are considered identical. Two models are considered identical if some criterion is met, for example, if the variation of the cost function is smaller than a prescribed value, or if the variation of the magnitude of the parameter vector is smaller than a prescribed value, or similar criteria. Rivals and Personnaz
Reply to Rivals and Personnaz
421
do not address the question of estimating the numerical noise on the leverages; they claim that the accuracy should be on the order of 10⁻¹⁶, without providing any evidence that such accuracy is relevant to the problem at hand. Therefore, their criticisms are unsupported in the context of machine learning. Conversely, we describe, in section 3, examples of nonlinear model selection. Neural networks were trained from examples by minimizing the least-squares cost function with the Levenberg-Marquardt algorithm. Training was terminated when the magnitude of the relative variation of the cost function between two iterations of the algorithm became smaller than 10⁻¹⁰, which is an extremely conservative stopping criterion (a typical stopping criterion would be a relative difference of 10⁻⁵ or even more). The root mean square of the variations of the leverages was computed as

Δh(k) = sqrt( (1/N) Σ_{i=1}^N [h_ii(k + 1) − h_ii(k)]² ),

where h_ii(k) is the leverage of observation i at iteration k of the training algorithm, and N is the number of examples. Iterations k and k + 1 were chosen such that the relative variation of the cost function was smaller than 10⁻¹⁰. The values of Δh were consistently found to be substantially larger than 10⁻¹⁰: clearly, there is no point in computing the leverages with an accuracy of 10⁻¹⁶, while the noise on the leverages is actually larger by more than six orders of magnitude, even in unusually conservative conditions. Furthermore, the discrepancy between the leverages computed by the traditional method (advocated in the comment) and by our method was smaller than 10⁻¹²: it is smaller, by several orders of magnitude, than the noise on the leverages and hence is not significant. Actually, in most real-life applications, training will be terminated much earlier; therefore, the root-mean-square difference between the leverages of two models that are considered equivalent will be larger by still many more orders of magnitude.
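The noise-on-leverages estimate Δh(k) can be sketched as follows; the two leverage vectors below are made-up illustrative values for two training iterations that a stopping criterion would consider identical, not data from the source.

```python
import math

# Sketch of Delta-h(k): the RMS difference between the leverages at two
# successive training iterations. The leverage vectors are hypothetical.

def leverage_noise(h_k, h_k1):
    """Root mean square of the leverage variations between iterations k and k+1."""
    N = len(h_k)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(h_k, h_k1)) / N)

h_iter_k  = [0.21000010, 0.18999987, 0.35000042, 0.24999961]
h_iter_k1 = [0.21000031, 0.18999970, 0.35000013, 0.24999986]

dh = leverage_noise(h_iter_k, h_iter_k1)
# Any discrepancy between two leverage-computation methods that is much
# smaller than dh is not significant for model selection.
assert 1e-7 < dh < 1e-6
```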
In order to substantiate their claims, Rivals and Personnaz provide a handcrafted numerical example whose results are shown in Table 1 of their article. In the following, we show that those results are inconsistent with the authors’ claims in “Jacobian Conditioning Analysis” and irrelevant to model selection. The authors consider a Jacobian matrix with two columns: the elements of one of them are equal to one, and the second column vector is equal to the sum of the first column vector and of a “small” vector, which is a normally distributed random vector multiplied by a scalar ranging from 10⁻⁶ to 10⁻¹⁵. Most results presented in that table are inconsistent and irrelevant, for the following reasons:

1. Rivals and Personnaz fail to state that most models used to derive the numerical results presented in their Table 1 are actually discarded by
the selection criterion that they advocate. Out of 10,000 random realizations of the Jacobian matrix, all models with a = 10⁻¹² and a = 10⁻¹⁵ are discarded for having a condition number larger than 10⁺⁸; 9,988 models out of 10,000 with a = 10⁻⁸ are discarded for the same reason; conversely, all models with a = 10⁻⁶ are accepted. Therefore, the only results of Table 1 that are relevant to model selection, according to the selection criterion advocated by the authors, are the results pertaining to a = 10⁻⁶ (first row of the table). The other results are irrelevant, since there is no point in discussing the accuracy of the computation of leverages for models that are discarded by the selection criterion advocated by the authors. Moreover, the results reported in the first row show that our method computes the leverages with an accuracy of 10⁻¹⁰, which is on the order of the noise on the leverages, as shown above.

2. The structure of the Jacobian matrix indicates that the model has two parameters, θ₀ and θ₁, and is of the form

y(x) = θ₀ + f(x, θ₁), with ∂f/∂θ₁ = 1 + ε(x, θ₁),

where the values of ε(x, θ₁) can be modeled as realizations of a random variable equal to a normal variable multiplied by a factor of 10⁻⁶ to 10⁻¹⁵. Therefore, the model is of the form

y(x) = θ₀ + θ₁ + γ(x, θ₁), with ∂γ/∂θ₁ = ε(x, θ₁).

No modeler would ever design a model with such a functional form. An obvious reparameterization of the model consists of defining a new parameter θ₂ = θ₀ + θ₁, so that the first column is still made of 1’s (derivative of y with respect to θ₂) and the second column is made of the values of ε(x, θ₁), modeled as small, random gaussian variables; thus, the Jacobian matrix is much better conditioned. If the same experiment is performed as in Table 1, after reparameterization, (1) for a = 10⁻⁶, the difference between the number of parameters and the sum of the leverages computed by our method becomes 2 · 10⁻¹⁶, so that the traditional method is not more accurate than ours; and (2) for a = 10⁻⁸, about 25% of the models are accepted by the selection criterion advocated by Rivals and Personnaz, and, as above, the accuracy of the computation of the leverages by our method is on the order of 10⁻¹⁶, that is, it is comparable to that of the traditional method.

Therefore, the example displayed in Table 1 of “Jacobian Conditioning Analysis” is handcrafted to prove that the traditional method of computing the leverages is more accurate than ours in that specific case; however, that
case is irrelevant to model selection because (1) there is no point in computing accurately the leverages of models that, according to the selection criterion advocated in the article, would actually be discarded, and (2) no knowledgeable modeler would design a model having a Jacobian matrix of that form. To summarize, in “Jacobian Conditioning Analysis,” Rivals and Personnaz do not provide any evidence that an accuracy of 10⁻¹⁶ for the computation of the leverages is desirable in the context of machine learning or that our method does not achieve the accuracy required in realistic conditions. Therefore, their claim that the traditional method is superior to ours in the machine learning context is unsupported. In the next sections, we provide additional counterexamples to support the above conclusions further. Furthermore, it should be noted that the issue of numerical accuracy was far from central in our paper, relation A.4 being presented in an appendix. Therefore, we did not find it necessary to elaborate on that question, since it can be answered in a straightforward fashion, as shown above.

3 Reply to the Comments on Model Validation

In the previous section, we showed that the second conclusion stated in section 4 of “Jacobian Conditioning Analysis” is unsupported. In this section, we disprove the other two conclusions of the comment, together with claims made in Rivals and Personnaz (2000). We show that, contrary to the statement of the authors, we did not follow the approach advocated in that article, since it is not correct from a numerical analysis point of view and leads to making wrong decisions for model validation.

3.1 Reply to Section 1.1 of “Jacobian Conditioning Analysis”. The use of the Jacobian matrix (relations 1.1 and 1.2 of the article) for the analysis of the identifiability of nonlinear models is classical in statistics. It can be traced back to 1956. Relation 1.4 describes one of several standard confidence interval estimates for nonlinear models.
It can be found in textbooks on nonlinear regression (see, e.g., Seber & Wild, 1989). In view of the contents of section 3.2 of our reply, it is relevant to note that many other confidence interval estimation methods for nonlinear models can be found in the literature (Tibshirani, 1996). Relation 1.5 in “Jacobian Conditioning Analysis” is an unfounded claim. The authors claim that in Rivals and Personnaz (2000), they established an upper bound of the leave-one-out error. Nothing of that kind can be found in that article, in which, following Hansen and Larsen (1996) and Sorensen, Norgard, Hansen, and Larsen (1996), they derived an approximation of the leave-one-out error under the assumption of the validity of a first-order Taylor approximation, in parameter space, of the output of the model. We provided a more rigorous proof in Monari (1999) and Monari and Dreyfus (2000). Appendix 1 is standard textbook material.
3.2 Reply to Section 1.2 of “Jacobian Conditioning Analysis”. Section 1.2, entitled “Numerical Considerations,” is erroneous in two respects:

1. Model selection aims at finding the statistical model that generalizes best among different candidate models. Poor generalization may be due to either poor training, which is very easy to detect, or overfitting. Overfitting occurs when the model has too many adjustable parameters in view of the complexity of the problem. Therefore, model selection and model validation aim at detecting, and discarding, models that are likely to exhibit overfitting. It has long been known in statistics that a preliminary screening can be performed by ascertaining that the Jacobian matrix of the model has full rank. Rivals and Personnaz dismiss that method and state that “the notion of the rank of Z is not usable in order to take the decision to discard a model.” They claim that the criterion should be “whether the inverse of ZᵀZ needed for the estimation of a confidence interval can be estimated accurately.” That shift of focus to a completely different issue is a scientific reasoning flaw: the fact that the confidence intervals that the authors advocate cannot be computed “accurately” (see the next paragraph for a discussion of that issue) for a given model does not mean that the model should be discarded: it means that the confidence interval estimation method should be discarded. Another confidence interval estimate should be used instead, one that does not rely on matrix inversion (e.g., bootstrap methods). Thus, the comments on our approach to model validation are unfounded, and the discussion of model selection presented in Rivals and Personnaz (2000) is essentially irrelevant. That will be further substantiated by six counterexamples in section 3.4.

2.
Rivals and Personnaz state that “a decision [of discarding a model] should be taken on the basis of whether the inverse of ZᵀZ needed for the estimation of a confidence interval can be estimated accurately.” We have just shown that statement to be erroneous; nevertheless, let us discuss that statement from a numerical point of view. The authors claim that the confidence intervals should be estimated accurately, but they do not explain what “accurately” means. More specifically, they do not ask the relevant question: How accurately should the confidence interval be estimated in order to get insight into the problem of model validation? A part of the answer is the following: since the estimation of the confidence interval is derived from a first-order Taylor expansion of the output of the model, in parameter space, around a minimum of the cost function, there is no point in computing the confidence interval, hence the inverse of ZᵀZ, with a numerical accuracy that is better than the accuracy of the Taylor development. Despite the fact that many authors investigated that issue (see, e.g., Bates & Watts, 1998), Rivals and Personnaz claim that the condition number of the Jacobian matrix should be smaller than 10⁸, regardless of
the problem and of the model.¹ That cannot be true, since the inaccuracy due to the first-order Taylor expansion may be much higher than the numerical accuracy required by that criterion and since the accuracy of the Taylor expansion is problem dependent. Therefore, that “universal” condition-number selection criterion can be expected to lead to discarding perfectly valid models, as shown below with simple counterexamples.

To summarize our reply to the theoretical part of the comments concerning model validation:

1. From a basic point of view, there is a flaw in the scientific reasoning on which those comments are based. Instead of discarding models that do not generalize well, the selection method that Rivals and Personnaz advocate discards models for which the authors’ favorite confidence interval estimation method fails. Actually, the authors should blame their confidence interval estimates for not being accurate instead of blaming the model for not lending itself to their confidence interval estimation method.

2. From a numerical point of view, their selection criterion does not take into account the accuracy of the approximations on which the estimation of the confidence intervals is based, so that the criterion may be much more stringent than actually required.

The first misconception may lead to accepting models that are invalid; the second may lead to rejecting models that are valid. Both situations will be exemplified below by seven counterexamples.

3.3 Reply to the Numerical Example. As a further criticism of our article, section 3.2 of “Jacobian Conditioning Analysis,” entitled “Neural Modeling,” considers the simplest example that was presented in our article: the neural modeling of a sin x/x function with added noise. Rivals and Personnaz consider, just as we did, models with one to four neurons.
They discuss that problem in much more detail than we did, and they come to the conclusion that two-neuron architectures are appropriate, which is exactly the conclusion we stated in our article. Therefore, their example can hardly be considered as supporting their criticisms. Moreover, their numerical example contains errors and inconsistencies:

1. Table 2 of “Jacobian Conditioning Analysis” displays numerical results related to that example. In that table, as well as in the rest of the article, the authors use notations that are different from the notations of the article they are commenting on, which makes things confusing

¹ The same statement was made recently in Rivals and Personnaz (2003), reporting results obtained in 2000.
to nonspecialist readers. Moreover, they do not provide any definition of the quantities that they use. The quantity dubbed ALOOS seems to be the square of the estimated leave-one-out score denoted by Ep in our article. The quantity called MSTE is the square of the training root mean square error TMSE of our article. The quantity termed MSPE is the square of the validation root mean square error denoted by VMSE in our article. Those quantities are defined as

TMSE = sqrt( (1/N) Σ_{i=1}^N R_i² ), MSTE = TMSE²,

Ep = sqrt( (1/N) Σ_{i=1}^N [R_i / (1 − h_ii)]² ), ALOOS = Ep²,

VMSE = sqrt( (1/N_V) Σ_{i=1}^{N_V} R_i² ), MSPE = VMSE²,

where R_i is the modeling error on example i of the relevant set, h_ii is the leverage of example i, N is the number of examples in the training set, and N_V is the number of examples in the validation set. The last line of the table reports an MSTE of 1.9 · 10⁻³ and an ALOOS of 5.3 · 10⁻⁷. That cannot be true. Since 0 < h_ii < 1, each term of the sum in Ep is larger than the corresponding term in TMSE, so that one has Ep > TMSE or, equivalently, ALOOS > MSTE. Therefore, the result reported for four hidden neurons is wrong by at least four orders of magnitude.

2. Rivals and Personnaz agree with us that a two-hidden-neuron architecture is the most appropriate for the problem under investigation. Although our article does not discuss the models with three hidden neurons, they claim that our method would have accepted models with three hidden neurons, whereas they dismiss that architecture. That is worth investigating. For architectures with three hidden neurons, we performed 100 trainings with different initial values of the parameters, in the conditions that Rivals and Personnaz describe. That generates a relatively small number of significantly different models. Taking advantage of the fact that the “true” regression function f is known, the mean square distance D between the model and the true
Table 1: Mean Square Prediction Errors, Squared Distance, Condition Number of the Jacobian Matrix, and Number of Occurrences for Models with Three Hidden Neurons.

Minimum        MSPE          D²            K                       Occurrences
Minimum 1      2.9 · 10⁻³    1.2 · 10⁻³    2 · 10⁺⁹ to 7 · 10⁺⁹    16
Minimum 2      3.8 · 10⁻³    2.1 · 10⁻³    10⁺¹⁶                   12
Minimum 3      4.3 · 10⁻³    1.6 · 10⁻³    4.0 · 10⁺⁶              67
Other minima   > 8 · 10⁻³    > 5 · 10⁻³    > 10⁺⁸ to 10⁺¹⁷         5
regression function was computed as

D = sqrt( (1/N_D) Σ_{k=1}^{N_D} [f(x_k) − g(x_k)]² ),

where N_D = 5,000 (the points x_k being drawn from a uniform distribution), f(x) = sinc[10(x + 1)/π], and g(x) is the output of the model. D is the best estimate of the theoretical risk, that is, of the generalization ability of the model. Table 1 shows the MSPE, the distance D, the condition number K of the Jacobian matrix, and the number of occurrences of the cost function minimum. The best model (the model with the smallest D, which also has the smallest MSPE) is discarded by the condition-number selection criterion advocated by Rivals and Personnaz. It is also worth noting that models that correspond to the same minimum of the cost function have widely varying condition numbers, even though the MSPEs agree to four decimal places. Other examples of similar situations, where the condition-number selection criterion rejects valid models, are exhibited below.

3. Rivals and Personnaz claim that they generated a training set by adding noise to the function sinc[10(x + 1)]. From a cursory look at Figures 1 and 2 of their article, it is clear that such is not the case. Actually, it seems that sinc[10(x + 1)/π] was implemented.

To summarize, in order to prove that their selection method is superior to ours, Rivals and Personnaz investigated one of the examples presented in our article. Their conclusion is exactly the same as ours, so that their example cannot be considered as evidence of the superiority of their approach. Moreover, we showed that their example contains a result that is erroneous by at least four orders of magnitude. In addition, their validation method leads to discarding perfectly valid models, for reasons that we explained in section 3.2 here.
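The distance D above can be sketched in a few lines of Python. The grid below stands in for the 5,000 uniformly drawn points, and the domain [−1, 1] is our assumption for illustration; the model g is a placeholder, not one of the trained networks.

```python
import math

# Sketch of the distance D between the regression function
# f(x) = sinc[10(x + 1)/pi] (sinc u = sin(u)/u) and a candidate model g.
# A fixed grid on an assumed domain [-1, 1] replaces the 5,000 random points.

def sinc(u):
    return math.sin(u) / u if u != 0.0 else 1.0

def f(x):
    return sinc(10.0 * (x + 1.0) / math.pi)

def distance(g, xs):
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in xs) / len(xs))

xs = [-1.0 + 2.0 * k / 999 for k in range(1000)]

# A model equal to the regression function has zero distance,
# and a uniformly biased model has a positive one:
assert distance(f, xs) == 0.0
assert distance(lambda x: f(x) + 0.1, xs) > 0.0
```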
Figure 1: Liquidus temperature vs. lithium oxide concentration.
3.4 Additional Counterexamples. The following counterexamples disprove the claim that the stated limit on the condition number of the Jacobian matrix is an appropriate screening criterion, which invalidates the first conclusion in “Jacobian Conditioning Analysis” and a large part of the contents of Rivals and Personnaz (2000). The counterexamples show additionally that, as we have already demonstrated, the accuracy of the computation of the leverages is not critical for model selection; that invalidates the second point of the conclusion of the comment, as well as another part of Rivals and Personnaz (2000). The problem that we address here is inspired by real data. The quantity to be modeled is a thermodynamic parameter (the liquidus temperature) of industrial glasses, as a function of the oxide contents of the latter. The estimation of the liquidus temperature is important for glass manufacturing processes (a detailed description of the application can be found in Dreyfus & Dreyfus, 2003). Figure 1 shows the simplest instance of real data on that problem. It is interesting because the singular points actually have a physical meaning, related to phase transitions. That application prompted us to investigate the modeling of data generated from the function

y = |sin x| + cos x, (3.1)
which is plotted in Figure 2. We exhibit three pairs of counterexamples. In each pair, one counterexample shows the condition-number selection criterion discarding a valid model, and the other counterexample shows that criterion validating a
Figure 2: Academic example inspired from Figure 1.
model that is overfitted. In all of the following, the values of the leverages computed by our method and by the traditional method advocated in “Jacobian Conditioning Analysis” are in excellent agreement (e.g., they agree within nine decimal places for counterexample 1 and 11 decimal places for counterexample 2). Therefore, those examples do not provide any support to the claim that the traditional method is superior to ours in the machine learning context. In all the following, neural network training was performed by the Levenberg-Marquardt algorithm. For a given number of hidden neurons, 100 trainings were performed with different parameter initializations. Seven thousand equally spaced examples were generated as a validation set. Experiments were performed under Matlab on a standard PC. Distance D (defined in section 3.3) was also computed from 7,000 equally spaced points. Following the notations of Monari and Dreyfus (2002), we denote by TMSE the root-mean-square error on the training set (as defined in section 3.3) and by VMSE the equivalent quantity computed on the validation set of 7,000 examples (an excellent estimate of the generalization error). Before discussing the counterexamples, it may be useful to emphasize the main point of Monari and Dreyfus (2002): we claim that overfitting can be efficiently monitored by checking the distribution of the leverages (hence the title of our article, “Local Overfitting Control via Leverages”).
430
Y. Oussar, G. Monari, and G. Dreyfus
The leverages obey the following relations:

\[ 0 \le h_{ii} \le 1 \quad \forall i, \qquad \sum_{i=1}^{N} h_{ii} = q. \]

Since the sum of the leverages is equal to the number of parameters q, the leverage of example i can be interpreted as the fraction of the degrees of freedom of the model that is devoted to fitting the model to observation i. Therefore, ideally, all examples should have essentially the same leverages, equal to q/N: if one or several points have leverages close to 1, the model has devoted a large fraction of its degrees of freedom to fitting those points and, hence, may have fitted the noise on those examples very accurately. In other words, the more peaked the distribution of the leverages around q/N, the less prone to overfitting the model. Rivals and Personnaz were unaware of that point in their previous articles, as evidenced by the fact that they never used the word leverage prior to “Jacobian Conditioning Analysis.” In order to give a quantitative assessment of the “distance” of the model to a model where all leverages are equal to q/N, we defined (Monari & Dreyfus, 2002) the parameter

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\frac{N}{q}\, h_{ii}}. \]

The closer μ is to 1, the more peaked the distribution of the leverages around its mean q/N; μ = 1 if all leverages are equal to q/N. Hence, the closer μ is to 1, the less overfitted the model. Alternatively, one may use the normalized standard deviation σ_n of the leverages, defined as

\[ \sigma_n = \sqrt{\frac{N}{q\,(N-q)} \sum_{i=1}^{N} \left( h_{ii} - \frac{q}{N} \right)^{2}}. \]

σ_n = 0 if all leverages are equal to q/N, and σ_n = 1 in the worst case of overfitting, where q leverages are equal to 1 and (N − q) leverages are equal to zero. Hence, the smaller σ_n, the less prone to overfitting the model. Both quantities are computed in the following counterexamples.

3.4.1 Counterexamples 1 and 2. In a first set of experiments, 35 equally spaced points were generated from relation 3.1 to serve as a training set. Uniform noise was added, with standard deviation 0.1. Figure 3 shows the generating function, the training data, and the output of a model with five hidden neurons, together with the values of the leverages.
The relevant figures for that model are summarized in the first row of Table 2.
Figure 3: Counterexample 1. The condition-number selection criterion discards a valid model.
Table 2: Relevant Data for Counterexamples 1 and 2.

                   TMSE    VMSE    μ       σ_n    Distance D   K           Leverages > 0.95
Counterexample 1   0.078   0.117   0.984   0.38   0.062        1.5 × 10⁹   3
Counterexample 2   0.066   0.123   0.979   0.56   0.072        2.9 × 10⁶   11

Notes: TMSE, VMSE, μ, σ_n, and D as defined in the text. K: condition number of the Jacobian matrix. Last column: number of leverages that are larger than 0.95.
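The leverage-based statistics reported in these tables can be computed from the model's Jacobian. A minimal NumPy sketch (function names are ours; for a neural model, Z is the N × q Jacobian of the outputs with respect to the parameters, evaluated at the least-squares solution):

```python
import numpy as np

def leverages(Z):
    # Diagonal of the hat matrix H = Z (Z^T Z)^{-1} Z^T, computed via a thin
    # QR factorization for numerical stability.
    Q, _ = np.linalg.qr(Z)
    return np.sum(Q ** 2, axis=1)

def mu(h, q):
    # mu = (1/N) * sum_i sqrt(N * h_ii / q); equals 1 when all h_ii = q/N.
    N = h.size
    return np.mean(np.sqrt(N * h / q))

def sigma_n(h, q):
    # Normalized standard deviation of the leverages; 0 when all h_ii = q/N,
    # 1 when q leverages equal 1 and the remaining N - q equal 0.
    N = h.size
    return np.sqrt(N / (q * (N - q)) * np.sum((h - q / N) ** 2))
```

The QR route avoids forming (ZᵀZ)⁻¹ explicitly, which matters precisely when the Jacobian is ill conditioned.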
The condition number of its Jacobian matrix exceeds the limit stated in “Jacobian Conditioning Analysis” by more than one order of magnitude, so that it should be discarded according to those comments. Actually it is a valid model: its generalization error is small since its VMSE, computed from 7000 examples, is slightly larger than the standard deviation of the noise, and the distance D is even smaller. μ is close to 1, and σ_n is far from 1. Finally, the leverage values computed by the traditional method and by ours (relation A.4 of Monari & Dreyfus, 2002) are in excellent agreement: the root mean square of the differences between the leverages computed by our method and by the traditional method is on the order of 3.7 × 10⁻¹⁰. The three points with high leverages are located at the boundaries of the input range, as expected. The W-shape of the leverage graph indicates that the points located in the vicinity of the main minimum are influential, as expected. Thus, counterexample 1 is an example of the condition-number selection criterion stated in “Jacobian Conditioning Analysis” discarding a valid model.

Counterexample 2 (see Figure 4) is a model with seven hidden neurons, trained in the same conditions and with the same data as counterexample 1. The characteristics of the model are shown in the second row of Table 2. The root mean square of the differences between the leverages computed by our method and by the traditional method advocated by Rivals and Personnaz is on the order of 1.7 × 10⁻¹². The condition number of the Jacobian matrix of the model is way below the limit stated by Rivals and Personnaz. Hence, according to their comment, the model should be accepted. However, it is clear from Figure 4 that the model is strongly overfitted. That is further substantiated by three facts:

1. The validation error VMSE is twice as large as the training error TMSE.

2. Eleven leverages (almost one out of three) are larger than 0.95, and 13 of them are larger than 0.90. The high leverages are located between x = 1 and x = 2, where overfitting is clearly apparent in Figure 4, and also at the boundaries of the input range, as usual.

3. μ is substantially smaller than 1, or, equivalently, the standard deviation of the leverages σ_n is larger than that of counterexample 1.

The selection method that Rivals and Personnaz advocate nevertheless fails to detect that gross overfitting. As usual, computing the leverages by the traditional method and by relation A.4 of our article does not make any significant difference. The difference between the two models is also clear from Figure 5, which shows the histograms of the leverages for the model that is discarded by the condition-number selection criterion (top figure) and for the model that is accepted by that criterion (bottom figure). As explained above, the distribution of the leverages should be as peaked as possible around q/N. Clearly, the leverage distribution for the model accepted by the suggested condition-number selection criterion is extremely far from complying with that condition, having a very large number of leverages that are close to 1. To summarize, counterexample 2 shows an example of the condition-number selection criterion failing to discard an overfitted model.

Figure 4: Counterexample 2. The condition-number selection criterion fails to detect overfitting.

3.4.2 Counterexamples 3 and 4. The second set of experiments was performed as follows. One hundred neural networks were first trained with a large number of points (350) generated by the regression function 3.1, without added noise.

Figure 5: Leverage distributions of counterexample 1 (top) and counterexample 2 (bottom).

Then the values of the parameters of those models were used as initial values for a retraining with only 35 points with added noise. Since the latter training starts with initial parameters of excellent models, it may be expected that the resulting models are the best models that one can hope to obtain, given the limited number of points and the noise in the training set. Figure 6 shows a good model with five hidden neurons, whose characteristics are reported in the first row of Table 3. Despite its performance, the model is rejected by the condition-number selection criterion, since its condition number is larger, by one order of magnitude, than the rejection limit specified by its authors.

Figure 6: Counterexample 3. The condition-number selection criterion discards a valid model.

Figure 7 shows the behavior of a model that also has five hidden neurons; its characteristics are summarized in the second row of Table 3. That model is much worse than counterexample 3: its VMSE is higher, while its TMSE is smaller (a clear sign of overfitting), the distance between the model and the regression function is higher, and 20% of the leverages are larger than 0.95. μ is substantially smaller than for counterexample 3, and σ_n is almost twice as large as that of counterexample 3. Nevertheless, the condition number is one order of magnitude below the rejection limit, so that the condition-number
Table 3: Relevant Data for Counterexamples 3 and 4.

                   TMSE    VMSE   μ      σ_n    Distance D   K           Leverages > 0.95
Counterexample 3   0.078   0.12   0.99   0.36   0.063        10⁹         2
Counterexample 4   0.068   0.13   0.95   0.62   0.079        1.7 × 10⁷   7

Notes: TMSE, VMSE, μ, σ_n, and D as defined in the text. K: condition number of the Jacobian matrix. Last column: number of leverages that are larger than 0.95.
Table 4: Relevant Data for Counterexamples 5 and 6.

                   TMSE    VMSE   μ      σ_n    Distance D   K            Leverages > 0.95
Counterexample 5   0.082   0.12   0.97   0.47   0.069        4.2 × 10¹⁴   1
Counterexample 6   0.068   0.14   0.95   0.62   0.091        1.3 × 10⁵    6

Notes: TMSE, VMSE, μ, σ_n, and D as defined in the text. K: condition number of the Jacobian matrix. Last column: number of leverages that are larger than 0.95.
selection criterion fails to detect that gross overfitting. Furthermore, the leverages computed by the traditional method and by our method agree within 10⁻¹¹.

3.4.3 Counterexamples 5 and 6. In a final set of numerical experiments, the training set was constructed with a larger number of examples in the vicinity of the singular points, the total number of points being kept constant, equal to 35. Training was performed with random parameter initialization, as in counterexamples 1 and 2. Figure 8 shows the behavior of a model with five hidden neurons, whose characteristics are summarized in Table 4. As expected, no overfitting occurs in the vicinity of the minima of the function, but since the total number of points was kept constant, leverages become higher between the minima. Nevertheless, this is a very reasonable model given the training data. It is discarded by the condition-number selection criterion, since its condition number is larger than the rejection limit by six orders of magnitude. By contrast, Figure 9 shows a model with five hidden neurons, whose characteristics are summarized in the second row of Table 4. This is again an overfitted model, whose distance to the regression function and VMSE are much poorer than those of counterexample 5; nevertheless, it is accepted by the condition-number selection criterion.

3.5 Relevance of the Condition-Number Selection Criterion to Overfitting. In the previous section, we showed several examples of the condition-number selection criterion accepting overfitted models or discarding valid models. The above counterexamples are just a selection among many more
Figure 7: Counterexample 4. The condition-number selection criterion fails to discard an overfitted model.
similar counterexamples, so that it is natural to wonder how frequently such situations will occur. More specifically, one can ask the following question: What is the probability that the parameters of a network with one input and five hidden neurons can be estimated reliably from 35 equally spaced points (counterexamples 1, 2, 3, and 4)? In order to gain some insight into that question, the following numerical experiment was performed: 10,000 different neural networks with one input and five hidden neurons were generated, with random parameters, uniformly distributed with variance 10. For each model, the leverages of 35 equally spaced points between −1 and +1, and the normalized standard deviation σ_n of their distribution, were computed. The condition number K of the Jacobian matrix was also computed.

Figure 8: Counterexample 5. The condition-number selection criterion discards a valid model.
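A single draw of this experiment can be sketched as follows. The network parameterization, the finite-difference Jacobian, and the uniform half-width recovered from the stated variance of 10 are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def net(x, theta):
    # One-input MLP: five tanh hidden neurons and a linear output neuron,
    # q = 16 parameters in total (5 input weights, 5 biases, 5 output
    # weights, 1 output bias).
    a, b, w, bias = theta[:5], theta[5:10], theta[10:15], theta[15]
    return np.tanh(np.outer(x, a) + b) @ w + bias

def jacobian(x, theta, eps=1e-6):
    # Central finite-difference Jacobian of the outputs w.r.t. the parameters.
    J = np.empty((x.size, theta.size))
    for k in range(theta.size):
        d = np.zeros_like(theta)
        d[k] = eps
        J[:, k] = (net(x, theta + d) - net(x, theta - d)) / (2.0 * eps)
    return J

x = np.linspace(-1.0, 1.0, 35)           # 35 equally spaced points in [-1, +1]
half_width = np.sqrt(3.0 * 10.0)         # uniform on [-h, h] has variance h^2 / 3 = 10
theta = rng.uniform(-half_width, half_width, 16)

J = jacobian(x, theta)
K = np.linalg.cond(J)                    # condition number of the Jacobian

Q, _ = np.linalg.qr(J)
h = np.sum(Q ** 2, axis=1)               # leverages of the 35 points
N, q = J.shape
s_n = np.sqrt(N / (q * (N - q)) * np.sum((h - q / N) ** 2))
```

Repeating the draw 10,000 times and scattering s_n against K reproduces the kind of plot described for Figure 10.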
Figure 9: Counterexample 6. The condition-number selection criterion fails to discard an overfitted model.
Figure 10 shows σ_n as a function of K: each network is shown as a dot; dots lying on the x-axis are networks whose Jacobian matrix is rank deficient. For models whose Jacobian matrix has full rank, no trend can be found in that graph: thus, the condition number has essentially nothing to do with the distribution of the leverages and, hence, is essentially irrelevant to overfitting. Any model located to the right of the vertical line would be discarded by the condition-number selection criterion, despite the fact that some of them have excellent leverage distributions and, hence, are functions whose parameters can legitimately be estimated from data pertaining to 35 equally spaced points. Actually, the “best” network (the network with the most peaked leverage distribution, i.e., with the smallest σ_n) is discarded, whereas several poor networks (with large values of σ_n), including the network with the largest σ_n, are accepted. Similarly, no clear trend can be found in the graph of σ_n versus K when data are more abundant.

Figure 10: Normalized standard deviation of the distribution of the leverages of 35 equally spaced points, and Jacobian matrix condition number, for 1000 neural networks with five hidden neurons.

4 Conclusion

“Jacobian Conditioning Analysis for Model Validation” is essentially a comment on our article (Monari & Dreyfus, 2002). In this reply, we disproved all comments made in “Jacobian Conditioning Analysis.” The first conclusion of Rivals and Personnaz states that the condition number of the Jacobian matrix of the model should be used as a criterion for model validation: a model with a condition number larger than 10⁸ should be discarded; they made the same statement in Rivals and Personnaz (2000).
We proved that statement wrong in two respects:

1. The models that are discarded by that criterion are models for which a particular type of confidence intervals, obtained by a specific estimation method, cannot be computed accurately. This does not mean that the model should be discarded; it means that the confidence interval estimation method should be discarded. That misconception may lead to discarding valid models.

2. Rivals and Personnaz's comment on the accuracy required to compute the confidence intervals is erroneous, because they overlook the fact that the confidence intervals stem from a first-order Taylor expansion of the model output. Therefore, there is no point in computing the confidence intervals with an accuracy that is better than the accuracy of that first-order approximation. Consequently, the accuracy requested for the computation of the confidence interval is completely problem dependent: the “universal” criterion exhibited by the authors cannot be valid.

In addition to disproving, on theoretical grounds, the comments made by Rivals and Personnaz, we presented six counterexamples: three of them are instances of the authors' selection criterion discarding valid models; the other three are instances of the authors' criterion accepting overfitted models. In short, the condition-number selection criterion, advocated in the first conclusion of Rivals and Personnaz, is at best useless and very frequently leads to wrong decisions. The claim that we “followed” that approach is unsupported.

In the second paragraph of their conclusion, Rivals and Personnaz state that the numerical method for computing the leverages, which was indicated in an appendix of our article (Monari & Dreyfus, 2002), is not accurate enough. We proved that their statement is unsupported.
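The Taylor-expansion argument in point 2 can be recapped explicitly (a standard linearization from nonlinear least-squares theory, e.g., Seber & Wild, 1989; the notation below is ours, not necessarily that of either article):

```latex
f(x,\hat{\theta}) \;\approx\; f(x,\theta^{*}) \;+\; z(x)^{\top}\bigl(\hat{\theta}-\theta^{*}\bigr),
\qquad
z(x) \;=\; \left.\frac{\partial f(x,\theta)}{\partial \theta}\right|_{\theta=\theta^{*}}.
```

With Jacobian \(Z = [\,z(x_1)\ \cdots\ z(x_N)\,]^{\top}\) and leverages \(h_{ii} = \bigl[Z(Z^{\top}Z)^{-1}Z^{\top}\bigr]_{ii}\), the resulting approximate confidence interval on the model output at a training point \(x_i\) scales as \(s\sqrt{h_{ii}}\). Because the interval is only as accurate as this first-order expansion, computing \(h_{ii}\) to a precision far beyond that of the expansion cannot improve it.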
They claim that their method can reach an accuracy of 10⁻¹⁶; however, they do not provide any example, in the context of machine learning, where such accuracy is required, that is, where the “noise” on the estimation of the leverages is on the order of 10⁻¹⁶. Conversely, we provided examples where the difference between the leverages of two models that are obtained at two successive iterations after convergence of the training algorithm (i.e., between two models that are considered “identical”) exceeds 10⁻¹⁶ by several orders of magnitude. Therefore, the accuracy of the method advocated by Rivals and Personnaz is irrelevant in such situations, and they did not provide any example where it might be relevant. In addition to disproving their point theoretically, we have shown that even in the very example that Rivals and Personnaz proffered, the differences between the leverages computed by the traditional method and those computed by our method are negligibly small. In order to substantiate their claims, Rivals and Personnaz study the numerical example that we investigated in Monari and Dreyfus (2002). They
come exactly to the conclusion that we reached, so their example cannot be considered a convincing counterexample. However, they reach that conclusion by faulty reasoning and computing. We pointed to inconsistencies in their presentation, and we provided a proof that one of their numerical results is wrong by at least four orders of magnitude.

The third paragraph of the conclusion of “Jacobian Conditioning Analysis” is: “For candidates whose condition number is small enough and for which the leverages have been computed as accurately as possible according to equation 1.12, one may check additionally if none of the leverage values is close to one, as already proposed in Rivals and Personnaz (1998).” That statement is not acceptable for three reasons:

1. We showed that Rivals and Personnaz provide no evidence that using equation 1.12 is necessary or that our method for the computation of the leverages is inappropriate for model validation.

2. We showed that models with a “small enough” condition number, that is, a condition number below the limit stated by Rivals and Personnaz, may have several leverages close to 1 and, hence, exhibit strong overfitting (counterexamples 2, 4, and 6), and that, conversely, models with high condition numbers may have reasonable leverages and, hence, be acceptable (counterexamples 1, 3, and 5).

3. Rivals and Personnaz did not state in Rivals and Personnaz (1998) that leverages should be checked for values close to 1. They made a different suggestion: checking that the sum of the leverages is equal to the number of parameters, and that all leverages are smaller than 1. Actually, they did not realize the significance of leverages close to 1 before reading our article (Monari & Dreyfus, 2002), as evidenced by the fact that the very word leverage is used neither in Rivals and Personnaz (1998) nor in Rivals and Personnaz (2002).
By contrast, the last sentence of the conclusion of “Jacobian Conditioning Analysis” (“Leverage values close to, but not necessarily larger than, one are indeed the symptom of overfitted examples or of isolated examples at the border of the input domain delimited by the training set”) is unquestionable: it is actually the very central idea of our work (Monari & Dreyfus, 2002). To summarize, in this letter, we disprove all comments of Rivals and Personnaz in “Jacobian Conditioning Analysis” on our previous article (Monari & Dreyfus, 2002).² Additionally, our replies invalidate a substantial part of the contents of other articles by the same authors on the same subject.
² Rivals and Personnaz add, after their conclusion, a section entitled “Other Comment for Monari and Dreyfus (2002).” In view of the fact that all the other claims of Rivals and Personnaz were disproved, we will not discuss that ultimate comment.
References

Bates, D. M., & Watts, D. G. (1998). Nonlinear regression analysis and its applications. New York: Wiley.

Dreyfus, C., & Dreyfus, G. (2003). A machine learning approach to the estimation of the liquidus temperature of glass-forming oxide blends. Journal of Non-Crystalline Solids, 318, 63–78.

Hansen, L. K., & Larsen, J. (1996). Linear unlearning for cross-validation. Advances in Computational Mathematics, 5, 269–280.

Larsen, J., & Hansen, L. K. (2001). Comments for: Rivals I., Personnaz L., Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 15, 141–142.

Monari, G. (1999). Sélection de modèles non linéaires par leave-one-out: étude théorique et application des réseaux de neurones au procédé de soudage par points. Thèse de l'Université Pierre et Marie Curie, Paris.

Monari, G., & Dreyfus, G. (2000). Withdrawing an example from the training set: An analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing, 35, 195–201.

Monari, G., & Dreyfus, G. (2002). Local overfitting control via leverages. Neural Computation, 14, 1481–1506.

Rivals, I., & Personnaz, L. (1998). Construction of confidence intervals in neural modeling using a linear Taylor expansion. In J. A. Suykens & J. Vandewalle (Eds.), Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling. Leuven, Belgium: Katholieke Universiteit Leuven.

Rivals, I., & Personnaz, L. (2000). Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 13, 463–484.

Rivals, I., & Personnaz, L. (2003). MLPs (mono-layer polynomials and multilayer perceptrons) for nonlinear modeling. Journal of Machine Learning Research, 3, 1383–1398.

Seber, G. A. F., & Wild, C. J. (1989). Nonlinear regression. New York: Wiley.

Sorensen, P. H., Norgard, M., Hansen, L. K., & Larsen, J. (1996). Cross-validation with LULOO. In S. I. Amari, L. Xu, L. W. Chan, I. King, & K. S. Leung (Eds.), Proceedings of the International Conference on Neural Information Processing—ICONIP'96 (pp. 1305–1310). New York: Springer-Verlag.

Tibshirani, R. J. (1996). A comparison of some error estimates for neural models. Neural Computation, 8, 152–163.

Received March 31, 2003; accepted July 10, 2003.
ARTICLE
Communicated by David Fitzpatrick
Geometrical Computations Explain Projection Patterns of Long-Range Horizontal Connections in Visual Cortex Ohad Ben-Shahar [email protected]
Steven Zucker [email protected] Department of Computer Science and the Interdepartmental Neuroscience Program, Yale University, New Haven, CT 06520, U.S.A.
Neurons in primary visual cortex respond selectively to oriented stimuli such as edges and lines. The long-range horizontal connections between them are thought to facilitate contour integration. While many physiological and psychophysical findings suggest that collinear or association field models of good continuation dictate particular projection patterns of horizontal connections to guide this integration process, significant evidence of interactions inconsistent with these hypotheses is accumulating. We first show that natural random variations around the collinear and association field models cannot account for these inconsistencies, a fact that motivates the search for more principled explanations. We then develop a model of long-range projection fields that formalizes good continuation based on differential geometry. The analysis implicates curvature(s) in a fundamental way, and the resulting model explains both consistent data and apparent outliers. It quantitatively predicts the (typically ignored) spread in projection distribution, its nonmonotonic variance, and the differences found among individual neurons. Surprisingly, and for the first time, this model also indicates that texture (and shading) continuation can serve as alternative and complementary functional explanations to contour integration. Because current anatomical data support both (curve and texture) integration models equally and because both are important computationally, new testable predictions are derived to allow their differentiation and identification.
1 Introduction

The receptive fields (RFs) of neurons in visual cortex characterize their response to patterns of light in the visual field. In primary visual cortex, this response is often selective for stimulus orientation in a small region (Hubel & Wiesel, 1977). The clustered long-range horizontal connections between such cells (Rockland & Lund, 1982) link those with nonoverlapping RFs and

Neural Computation 16, 445–476 (2004)  © 2004 Massachusetts Institute of Technology
446
O. Ben-Shahar and S. Zucker
are thought to facilitate contour integration (Field, Hayes, & Hess, 1993). However, there is no direct physiological evidence that these connections only support curve integration, while there also remains much ambiguity about the precise connections required to support the integration of curves. Our goal in this article is to address both of these concerns.

1.1 Biological Data and Integration Models. The argument that associates long-range horizontal connections with curve integration begins with the realization that the finite spatial extent of RFs and their broad orientation tuning lead to significant uncertainties in the position and the local orientation measured from visual stimuli. This causes a further uncertainty in determining which of the many nearby RFs signal the next section of a curve (see Figure 1a). All of these uncertainties underlying curve integration can be reduced by interactions between neurons whose RFs are close in retinotopic coordinates. Starting with Mitchison and Crick (1982) and their hypothesis about interactions between iso-oriented RFs, physiological and anatomical findings have been accumulating to suggest a roughly collinear interaction. The main evidence supporting this conclusion is based on the distribution of angular differences between preferred orientations of connected cells. These distributions are computed by taking the orientation difference between a target cell and every other cell it is connected to with a long-range horizontal connection. Indeed, as is exemplified in Figure 1b, these distributions have been shown to be unimodal on average, with maximal interaction between iso-oriented RFs (Ts'o, Gilbert, & Wiesel, 1986; Gilbert & Wiesel, 1989; Weliky, Kandler, Fitzpatrick, & Katz, 1995; Schmidt, Goebel, Löwel, & Singer, 1997; Buzás, Eysel, & Kisvárday, 1998; Bosking, Zhang, Schofield, & Fitzpatrick, 1997; Malach, Amir, Harel, & Grinvald, 1993; Sincich & Blasdel, 2001; Schmidt & Löwel, 2002).
Furthermore, direct anatomical studies reveal long-range interactions between coaxial cells (Bosking et al., 1997; Schmidt et al., 1997), and indirect psychophysical experiments report a general association field (Field et al., 1993; Kapadia, Ito, Gilbert, & Westheimer, 1995; Kapadia, Westheimer, & Gilbert, 2000), which emphasizes straight or slowly varying continuations while allowing some support for more rapidly varying continuations as well (see Figure 2a). With the accumulation of these data, however, are a growing number of observations that are difficult to reconcile with the intuition that neural spatial integration is based on collinearity or that it serves only curve integration. Facilitatory interaction between cells of significant orientation difference (Kapadia et al., 1995), short-range coaxial inhibition (Polat & Sagi, 1993), iso-orientation side facilitation (Adini, Sagi, & Tsodyks, 1997), and strong correlations between iso-oriented, nonoverlapping, and parallel receptive fields (Ts'o et al., 1986) are functionally inconsistent. Evidence of cross-orientation (Matsubara, Cynader, Swindale, & Stryker, 1985; Kisvárday, Tóth, Rausch, & Eysel, 1997) and nonaxial (Gilbert & Wiesel, 1989) connections, plus roughly
[Figure 1 appears here. Panel a is a schematic of possible curve continuations through a receptive field; panels b and c (titled “Bosking et al. 1997”) plot median number of connections (%) against orientation difference.]
Figure 1: Visual integration and the distribution of long-range projections. (a) Broad tuning in orientation and position introduces uncertainty in curve integration even if a single curve model (thick red curve) is assumed through the RF. Determining which nearby RF the curve continues through can be facilitated by interaction between neurons with mutually aligned, retinotopically close RFs. (b) A fundamental measurable property of long-range connections is their distribution in the orientation domain, that is, the percentage of connections between interconnected neurons as a function of preferred orientation (angular) difference. This graph shows the median distribution of lateral connections (distance > 500 μm) of seven cell clusters in primary visual cortex of tree shrew (redrawn from Bosking et al., 1997, their Fig. 6c). Qualitatively similar (though coarser) measurements are available on primates as well (Malach et al., 1993). (c) Connectivity distributions of individual cell clusters reveal significant variability and qualitative differences between them. Shown here are distributions from two injection sites from Bosking et al. (1997).
[Figure 2 appears here. Panel a is a schematic of roughly collinear continuations versus an association field; panel b plots number of connections (%) against orientation difference for the collinearity (solid) and general association field (dashed) connectivity models.]
Figure 2: Collinear facilitation, association fields, and their predicted distribution of connections. (a) Informally, two visual integration or continuation models are typically considered in the physiological and psychophysical literatures. Collinearity, the predominant model, predicts only a few possible curve continuations (top). On the other hand, many possible continuations reveal an association field (bottom), similar to those observed psychophysically (Field et al., 1993). (b) The corresponding distributions derived from the collinearity and association field models. Observe that collinearity predicts a very narrow distribution, which is clearly at odds with the significant spread frequently measured anatomically or electrophysiologically (compare to Figure 1b). The association field leads to a wider spread, but like collinearity, it predicts a fixed distribution for all cells, a hypothesis refuted in recent studies (see the text). The collinearity distribution (solid) was calculated from the field depicted in Figure 8a, while the association field distribution (dashed) was calculated from the field in Figure 8e. The dashed horizontal line depicts the uniform distribution.
isotropic retinotopic extent (Malach et al., 1993; Sincich & Blasdel, 2001), suggest anatomical inconsistencies. These inconsistencies prompt a closer examination of the interactions within visual cortex and their population statistics. As the evidence suggests, individual cells, or small collections of adjacent cells captured in tracer injections, may have qualitatively different connectivity distributions (Bosking et al., 1997): some are narrow and high while others are very wide, as is illustrated in Figure 1c. When averaged, the pooled distribution of long-range connections (e.g., those extending beyond 500 μm in Bosking et al., 1997) is (see Figure 3a):

• Unimodal.
• Peaks at zero orientation offset.
• Indicates a nonnegligible fraction of connections linking cells of significantly different orientation preferences (Malach et al., 1993; Kisvárday, Kim, Eysel, & Bonhoeffer, 1994; Kisvárday et al., 1997; Bosking et al., 1997).

• Crosses the uniform distribution at approximately ±40 degrees.¹
• Has a nonmonotonically changing variance as the orientation difference increases (Malach et al., 1993; Bosking et al., 1997).

Neither collinearity nor association field models predict all of these features. While both models imply unimodal pooled distributions over orientation differences (see Figure 2b), they also suggest a fixed projection field, and thus neither predicts any variance for the pooled distribution, let alone a nonmonotonic one. Furthermore, collinearity is clearly at odds with the significant spread in the distribution of connections over orientation differences, whether it is measured via extracellular injections (e.g., Bosking et al., 1997) or the more elaborate intracellular protocol (Buzás et al., 1998). The data in Bosking et al. (1997) contain one injection site of possibly different connection distribution, which may substantially contribute to the nonmonotonic nature of the variance. Since the variance will become central to this article, we examined whether this statistical feature depends critically on this one, possibly outlier, measurement. We reanalyzed the data from Bosking et al. (1997) after removing the data from this injection site and calculating the statistical properties of the rest. We further examined the robustness of the nonmonotonicity by running two additional analyses: one in which we removed the sample points (one from each orientation bin) that contribute the most to the variance, and another in which we removed those sample points (again, one from each bin) that maximized local changes in the variance. In all these tests, including the last one, which flattens the variance the most, the trimodal nonmonotonicity and the two local minima at ±30 degrees were preserved. All these findings suggest that the nonmonotonicity of the variance is a critical feature that deserves attention from both biologists and modelers.

1.2 Integration Models and Random Physiological Variations.
It is tempting to explain the apparent anomalies and inconsistencies between the predicted and measured distributions of long-range horizontal connections as random physiological variations, for example, by asserting that anatomy only approximates the correct connections. We tested this explanation by applying different noise models to the collinearity and association field connectivity distributions from Figure 2 and checked whether the resultant pooled distributions possess the properties listed above. The results of the most natural noise model are illustrated in Figure 3b. Under this model, each long-range horizontal connection, ideally designated to connect cells of orientation difference Δθ, is shifted to connect cells of orientation difference Δθ + ε_σ, where ε_σ is a wrapped gaussian (i.e., normally distributed and wrapped on S¹) random variable with zero mean and variance σ (see the appendix for details). As the figure shows, it takes an overwhelming amount of noise (s.d. ≥ 35 degrees) to transform the collinear distribution into one that resembles the measured data in terms of spread and peak height, but the nonmonotonic behavior of the variance is never reproduced. (For space considerations, we omit the results of other connection-based noise models, and of the noisy distributions based on the association field model, all of which were even less reminiscent of the measured physiological data.)

A second possible source for the inconsistencies between the predicted and measured distributions may be the extracellular injection protocol commonly in use by physiologists to trace long-range horizontal connections (e.g., Gilbert & Wiesel, 1989; Malach et al., 1993; Kisvárday et al., 1994, 1997; Bosking et al., 1997; Schmidt et al., 1997; Sincich & Blasdel, 2001). Due to the site-selection procedure used, cells stained by these injections are likely to have similar orientation preferences (e.g., Bosking et al., 1997, p. 2113, or Schmidt et al., 1997, p. 1084). However, their orientation tuning may nevertheless be different, sometimes significantly (note such a cell in Bosking et al., 1997, Fig. 4B). Consequently, the distribution of presynaptic terminals (boutons) traced from the injection site may incorporate an artificial, random spread relative to the single orientation typically assumed at the injection site.

¹ This crossing point provides a reference for the bias of projection patterns toward particular orientations; considering the offsets where the connection distribution crosses the uniform line quantifies this bias in a way independent of scale or quantization level.

O. Ben-Shahar and S. Zucker
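The connection-level noise model just described is straightforward to sketch computationally. The following is a minimal illustration, not the authors' code (the function names and the delta-like base distribution used in the usage note are ours): each ideal connection offset Δθ is jittered by a wrapped gaussian and re-binned, so one can check how much noise is needed to spread a collinear base distribution.

```python
import numpy as np

def wrap_deg(a):
    """Wrap orientation differences (degrees) into [-90, 90)."""
    return (np.asarray(a, dtype=float) + 90.0) % 180.0 - 90.0

def perturb_distribution(base_hist, sigma=35.0, n_connections=20000, rng=None):
    """Jitter each sampled connection offset by wrapped gaussian noise.

    base_hist maps orientation-difference bin centers (degrees, uniformly
    spaced) to probabilities; returns the noisy, re-binned distribution.
    """
    rng = np.random.default_rng(rng)
    offsets = np.array(sorted(base_hist), dtype=float)
    probs = np.array([base_hist[o] for o in offsets], dtype=float)
    probs = probs / probs.sum()
    width = offsets[1] - offsets[0]
    # sample ideal offsets from the base distribution, then add wrapped noise
    ideal = rng.choice(offsets, size=n_connections, p=probs)
    noisy = wrap_deg(ideal + rng.normal(0.0, sigma, size=n_connections))
    edges = np.append(offsets, offsets[-1] + width) - width / 2.0
    counts, _ = np.histogram(noisy, bins=edges)
    return dict(zip(offsets, counts / counts.sum()))
```

With a purely collinear base (all mass at Δθ = 0) and σ = 35°, the peak drops to roughly the measured ~11% level, but because the underlying projection field is fixed, no nonmonotonic variance can emerge, consistent with the argument above.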
Preliminary evidence from a recently developed single-cell protocol (Buzás et al., 1998) suggests that leakage in the injection site cannot bridge the gap between the predicted collinear distribution and those measured anatomically. However, we also examined this possibility computationally by modeling the leakage in the injection site as a wrapped gaussian random variable of predefined variance.² The base distributions (collinear or association field) of the computational cells selected by this process were then summed and normalized, and the resultant (random) distribution was attributed to the original cell representing the injection site. Repeating this process many times yielded a collection of (different) distributions, for which we calculated an average and variance (see the appendix for details). The results are illustrated in Figure 3c. As with random variations at the level of individual connections, here too it takes an overwhelming amount of noise (s.d. ≥ 35 degrees) to transform the collinear distribution into one that resembles the measured data in terms of spread and peak height, but the nonmonotonic behavior of the variance is never reproduced.
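The injection-site leakage simulation can be sketched similarly. The code below is a hypothetical re-implementation with our own naming, not the authors' code: each simulated site stains several cells whose preferred orientations scatter around the nominal one, their base distributions are summed and normalized, and the procedure is repeated across sites to obtain a mean distribution and its variability.

```python
import numpy as np

def simulate_injection_sites(base_hist, sigma=35.0, cells_per_site=20,
                             n_sites=100, rng=None):
    """Monte Carlo model of tracer leakage at extracellular injection sites.

    base_hist maps orientation-difference bin centers (degrees, uniformly
    spaced) to probabilities. Each stained cell contributes the base
    distribution circularly shifted by its own (wrapped gaussian) offset.
    Returns (offsets, mean distribution, s.d. across sites).
    """
    rng = np.random.default_rng(rng)
    offsets = np.array(sorted(base_hist), dtype=float)
    base = np.array([base_hist[o] for o in offsets], dtype=float)
    base = base / base.sum()
    width = offsets[1] - offsets[0]
    n_bins = len(offsets)
    site_dists = np.empty((n_sites, n_bins))
    for s in range(n_sites):
        # offsets of the stained cells' preferences, rounded to whole bins
        shifts = np.round(rng.normal(0.0, sigma, cells_per_site) / width).astype(int)
        pooled = np.zeros(n_bins)
        for k in shifts:
            pooled += np.roll(base, k % n_bins)  # wrap on the orientation circle
        site_dists[s] = pooled / pooled.sum()
    return offsets, site_dists.mean(axis=0), site_dists.std(axis=0)
```

As in the text, making σ large enough to spread a collinear base toward the measured first-order statistics still yields a unimodal expected distribution; the simulation provides no mechanism for variance minima away from zero.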
² A wrapped gaussian model was particularly suitable here due to the site-selection procedure typically used in the extracellular injection protocol; see the appendix.
Projection Patterns of Long-Range Horizontal Connections in V1
Figure 3: Results of a statistical perturbation of the collinear connectivity distribution. (a) Mean connection distribution computed from the data in Bosking et al. (1997), shown here for reference. Error bars are ±1 standard deviation. Note the unimodal distribution that peaks at approximately 11%, the wide spread, the crossing of the uniform distribution (dashed horizontal line) around ±40 degrees, and the nonmonotonic variance. Can all these features be replicated by applying noise to the base distribution induced by the standard collinearity model? (b) Result of simulating physiological deviation at the individual connection level. The dashed line is the base collinear distribution. The gray region is the superposition of individual applications of the noise model to the base distribution. The solid graph is the expected distribution, and error bars are ±1 standard deviation. Permitting large enough developmental variations in the connections to model the first-order statistics (shown here is the result of wrapped gaussian independent and identically distributed noise of s.d. = 35°) significantly violates the underlying connectivity principle of good continuation but still cannot model the second-order statistics. (c) Results of simulating measurement errors due to leakage in the injection site. All parts are coded as in b. Again, permitting large enough injection spread to model the first-order statistics (shown here is the result of gaussian noise of s.d. = 35°, assuming 20 cells per injection site; Bosking et al., 1997) cannot model the second-order statistics.
The thinking around long-range horizontal connections has been dominated by their first-order statistics and their peak at zero orientation offset. However, the nonmonotonicity of the variance was first reported almost a decade ago (Fig. 3d in Malach et al., 1993), and we have further confirmed it from the more detailed measurements in Bosking et al. (1997), as illustrated in Figure 3a. Since neither collinearity nor association field models can explain this aspect of the physiological data, even if much noise is allowed, it is necessary to consider whether this and the other subtle properties of the pooled data reflect genuine functional properties of long-range horizontal connections. We therefore developed a geometric model of projection patterns and examined quantitatively both the pooled connection statistics and the connectivity patterns of individual cells that it generates. Since many findings suggest that long-range horizontal connections are primarily excitatory, especially those extending beyond one hypercolumn (Ts'o et al., 1986; Gilbert & Wiesel, 1989; Kapadia et al., 1995; Kisvárday et al., 1997; Buzás et al., 1998; Sincich & Blasdel, 2001), our model concentrates on this class of connections.

2 From Differential Geometry to Integration Models

Curve integration, the hypothesized functional role ascribed to long-range horizontal connections, is naturally based in differential geometry. The tangent, or the local linear approximation to a curve, abstracts orientation preference, and the collection of all possible tangents at each (retinotopic) position can be identified with the orientation hypercolumn (Hubel & Wiesel, 1977). Formally, since position takes values in the plane R² (think of image coordinates x, y) and orientation in the circle S¹ (think of an angle θ varying between 0 and 2π), the primary visual cortex can be abstracted as the product space R² × S¹ (see Figure 4).
Points in this space represent both position and orientation, abstracting visual edges of a given orientation at a particular spatial (i.e., retinotopic) position. It is in this space that our modeling takes place. Since any single tangent is the limit of any smooth curve passing through a given (retinotopic) point in a given direction, the question of curve integration becomes one of determining how two tangents at nearby positions are related. (Collinearity, for example, asserts that the tangent orientation hardly changes for small displacements along the curve.) In general terms, the angular difference between RFs captures only one part of the relationship between nearby tangents; their relative spatial offset must also be considered. Thus, in the mathematical abstraction, relationships between tangents correspond to relationships between points in R² × S¹. Physiologically, these relationships are carried by the long-range horizontal connections, with variation in retinotopic position corresponding to R², and variation along orientation hypercolumns corresponding to S¹ (see Figure 5). Determining them amounts, in mathematical terms, to determining what is called a con-
Figure 4: Abstracting the primary visual cortex as R² × S¹, or position × orientation space. (a) The "ice cube" cartoon of visual cortex (Hubel & Wiesel, 1977) (cytochrome-oxidase blobs and distortions due to the cortical magnification factor are not shown). A tangential penetration in the superficial layers reveals an orientation hypercolumn of cells whose RFs have similar spatial (retinotopic) coordinates. With cells of similar orientation tuning grouped by color, the hypercolumn is cartooned as a horizontal cylinder. (b) With ocular dominance columns omitted, the superficial layers of the primary visual cortex can be viewed as a collection of (horizontally arranged) orientation hypercolumns. (c) Drawing the cylinders vertically emphasizes that RFs of cells within a column overlap in retinotopic coordinates (x, y) and makes explicit this aspect of their organization. (d) Since different hypercolumns correspond to different retinotopic positions, the set of all hypercolumns abstracts the visible subspace of R² × S¹, with each column corresponding to a different vertical fiber in that space. The θ axis in this space corresponds to a tangential penetration within V1 hypercolumns (colors within the column represent different orientation tunings), and the XY plane corresponds to retinotopic coordinates.
nection structure. As we discuss in the rest of this article, the relationship between these two types of connections, the mathematical and the physiological, is more than linguistic.

2.1 The Geometry of Orientation in the Retinal Plane.

Orientation in the 2D (retinal) plane is best represented as a unit length tangent vector Ê(q) attached to a point of interest q = (x, y) ∈ R². Having such a tangent vector attached to every point of an object of interest (e.g., a smooth curve or an oriented texture) results in a unit length vector field (O'Neill, 1966). Assuming good continuation (Wertheimer, 1955), a small translation V from the point q results in a small change (i.e., rotation) in the vector Ê(q). To apply
techniques from differential geometry, a suitable coordinate frame {Ê_T, Ê_N} is placed at the point q, and the basis vector Ê_T is identified with Ê(q)—the tangent vector at q (see Figure 6). Note that Ê_T is drawn at an angle θ—the local orientation measured relative to the horizontal axis in retinotopic coordinates—such that (q, θ) ∈ R² × S¹. Nearby tangents are displaced in both position and orientation according to the covariant derivatives of the underlying pattern. These covariant derivatives, ∇_V Ê_T and ∇_V Ê_N, are naturally represented as vectors in the basis {Ê_T, Ê_N} itself:

\[
\begin{pmatrix} \nabla_{\vec V}\,\hat E_T \\ \nabla_{\vec V}\,\hat E_N \end{pmatrix}
=
\begin{pmatrix} w_{11}(\vec V) & w_{12}(\vec V) \\ w_{21}(\vec V) & w_{22}(\vec V) \end{pmatrix}
\begin{pmatrix} \hat E_T \\ \hat E_N \end{pmatrix}.
\tag{2.1}
\]

The coefficients w_ij(V), known as 1-forms, are functions of the displacement direction vector V, and since the basis {Ê_T, Ê_N} is orthonormal, they are skew symmetric: w_ij(V) = −w_ji(V). Thus, w_11(V) = w_22(V) = 0, and the system reduces to:

\[
\begin{pmatrix} \nabla_{\vec V}\,\hat E_T \\ \nabla_{\vec V}\,\hat E_N \end{pmatrix}
=
\begin{pmatrix} 0 & w_{12}(\vec V) \\ -w_{12}(\vec V) & 0 \end{pmatrix}
\begin{pmatrix} \hat E_T \\ \hat E_N \end{pmatrix}.
\tag{2.2}
\]
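The skew symmetry of the 1-forms invoked above follows from differentiating the orthonormality relations of the frame; a short standard derivation (our addition, not from the original text):

```latex
% Differentiate \hat E_i \cdot \hat E_j = \delta_{ij} along an arbitrary
% direction \vec V, using \nabla_{\vec V}\hat E_i = \sum_k w_{ik}(\vec V)\,\hat E_k
% and the orthonormality of the basis:
0 = \nabla_{\vec V}\!\left(\hat E_i \cdot \hat E_j\right)
  = \left(\nabla_{\vec V}\hat E_i\right)\cdot \hat E_j
    + \hat E_i \cdot \left(\nabla_{\vec V}\hat E_j\right)
  = w_{ij}(\vec V) + w_{ji}(\vec V).
% Hence w_{ij} = -w_{ji}; taking i = j gives w_{11} = w_{22} = 0.
```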
Figure 5: Facing page. Abstracting long-range horizontal connections as relationships between points in R² × S¹. (a) Since visual integration must involve not only the relative orientation between RFs but their spatial offset as well, it is more fully abstracted by relationships between points in R² × S¹. The exact nature of these relationships is determined by the underlying integration model. (b) Redrawing R² × S¹ fibers as orientation hypercolumns in V1 reveals the connection between the integration model in R² × S¹ and the distribution of long-range horizontal connections between the hypercolumns. (c) Collapsing the R² × S¹ abstraction to a cortical orientation map (i.e., flattening each orientation cylinder and redistributing its orientation-selective parts as orientation columns in the superficial cortical layers), the integration model implies a particular set of long-range horizontal connections between orientation domains (colors represent orientation tuning, as in panels a and b and Figure 4). Such links have been identified and measured through optical imaging and anatomical tracing (e.g., Malach et al., 1993; Bosking et al., 1997; Buzás et al., 1998) and thus can be compared to the model's predictions. (d) A real counterpart to the schematic in panel c. Reproduced from Bosking et al. (1997), this image shows an optical image of intrinsic signals combined with long-range horizontal connections traced through extracellular injection of biocytin. The white dots at the upper left corner represent the injection site, while the black dots represent labeled boutons. The white bar in the inset represents the orientation preference at the injection site.
This last system is known as Cartan's connection equation (O'Neill, 1966), and w_12(V) is called the connection form. Since w_12(V) is linear in V, it can be represented in terms of {Ê_T, Ê_N}:

\[
w_{12}(\vec V) = w_{12}(a\,\hat E_T + b\,\hat E_N) = a\,w_{12}(\hat E_T) + b\,w_{12}(\hat E_N).
\]

The relationship between nearby tangents is thus governed by two scalars at each point. We define them as

\[
\kappa_T \triangleq w_{12}(\hat E_T), \qquad \kappa_N \triangleq w_{12}(\hat E_N),
\tag{2.3}
\]

and interpret them as tangential (κ_T) and normal (κ_N) curvatures, since they represent a directional rate of change of orientation in the tangential and normal directions, respectively. While the connection equation describes the local behavior of orientation for the general 2D case, it is equally useful for the 1D case of curves. Now, only ∇_{Ê_T} is relevant, and equation 2.2 simplifies to

\[
\begin{pmatrix} \nabla_{\hat E_T}\,\hat E_T \\ \nabla_{\hat E_T}\,\hat E_N \end{pmatrix}
=
\begin{pmatrix} 0 & w_{12}(\hat E_T) \\ -w_{12}(\hat E_T) & 0 \end{pmatrix}
\begin{pmatrix} \hat E_T \\ \hat E_N \end{pmatrix}.
\tag{2.4}
\]

In its more familiar form, where T, N, and κ replace Ê_T, Ê_N, and κ_T, respectively, this is the classical Frenet equation (O'Neill, 1966) (primes denote derivatives with respect to arc length):

\[
\begin{pmatrix} T' \\ N' \end{pmatrix}
=
\begin{pmatrix} 0 & \kappa \\ -\kappa & 0 \end{pmatrix}
\begin{pmatrix} T \\ N \end{pmatrix}.
\tag{2.5}
\]
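For constant curvature κ, the Frenet system above integrates in closed form: T(s) = (cos κs, sin κs), and integrating T gives the osculating circle that underlies the co-circularity construction discussed below. A small sketch (our own illustrative code, not from the article), giving the position and orientation of a compatible tangent at arc length s in R² × S¹:

```python
import numpy as np

def cocircular_tangent(kappa, s):
    """Position (x, y) and orientation theta (radians) at arc length s
    along the constant-curvature solution of the Frenet equation, for a
    tangent at the origin oriented along the x-axis. Traced over s, the
    points (x, y, theta) form a helix in R^2 x S^1.
    """
    if abs(kappa) < 1e-12:           # zero curvature: the collinear case
        return float(s), 0.0, 0.0
    phi = kappa * s                  # orientation rotates linearly in s
    x = np.sin(phi) / kappa          # integral of cos(kappa * s)
    y = (1.0 - np.cos(phi)) / kappa  # integral of sin(kappa * s)
    return float(x), float(y), float(phi)
```

Quantizing (x, y, θ) to hypercolumn and orientation bins yields a curvature-dependent set of compatible cells of the kind shown later in Figures 7 and 8.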
2.2 Integration Models and Projection Patterns of Horizontal Connections.

The geometrical analysis discussed above and illustrated in Figure 6 shows how the relationship between nearby tangents depends on the covariant derivative: for curves, the connection is dictated by one curvature; for texture flows, or oriented 2D patterns, two curvatures are required. By estimating these quantities at a given retinal point q, it is possible to approximate the underlying geometrical object, and thus a coherent distribution of tangents, around q. This, in turn, can be used to model the set of horizontal connections that are required to facilitate the response of a cell if its RF is embedded in a visual context that reflects good continuation. Naturally, to describe such a local approximation and to use it for building projection patterns, the appropriate domain of integration must be determined. However, since RF measurements provide only the tangent, and possibly curvature (Dobbins, Zucker, & Cynader, 1987; Versavel, Orban, & Lagae, 1990), but not whether the stimulus pattern is a curve (1D) or a texture (2D), it is necessary to consider continuations for both.
Figure 6: Visual integration under good continuation involves the question of how a measurement of orientation at one retinal position relates to another measurement of orientation at a nearby retinal position. Formally, this amounts to specifying how a tangent (orientation measurement) at position q relates to another nearby tangent displaced by a vector V. This tangent displacement amounts to rotation, and as shown above, this rotation can differ for different displacements. Formally, the rotation is specified locally by the covariant derivative ∇_V, and the mathematical analysis is facilitated by defining an appropriate coordinate frame. Shown is the Frenet basis {Ê_T, Ê_N}, where Ê_T corresponds to a unit vector in the orientation's tangential direction and Ê_N corresponds to a unit vector in the normal direction. Associated with this frame is an angle θ defined relative to an external fixed coordinate frame (the black horizontal line). The covariant derivative specifies the frame's initial rate of rotation for any direction vector V. The four different cases in this figure illustrate how this rotation depends on V, both quantitatively (i.e., different magnitudes of rotation) and qualitatively (i.e., clockwise, counterclockwise, or zero rotation). Since displacement is a 2D vector and ∇_V is linear, two numbers are required to fully specify the covariant derivative. These two numbers describe the initial rate of rotation in two independent displacement directions. Using the Frenet basis once again, two natural directions emerge. A pure displacement in the tangential direction (Ê_T) specifies one rotation component, and a pure displacement in the normal direction (Ê_N) specifies the other component. We call them the tangential curvature (κ_T) and the normal curvature (κ_N), respectively. If visual integration based on good continuation relates to 2D patterns of orientation, then both of these curvatures are required. For good continuation along individual curves, only the tangential curvature is required, since displacement is possible only in the tangential direction (that is, along the curve only).
Since estimates of curvature at a point q hold in a neighborhood containing the tangent, the discrete continuation for a curve is commonly obtained by approximating it locally by its osculating circle (do Carmo, 1976) and quantizing. This relationship, which is based on the constancy of curvature around q, is known as co-circularity (Parent & Zucker, 1989; Zucker, Dobbins, & Iverson, 1989; Sigman, Cecchi, Gilbert, & Magnasco, 2001; Geisler, Perry, Super, & Gallogly, 2001), and in R² × S¹ it takes the form of a helix (see Figures 7a and 7b). Different estimates of curvature give rise to different helices whose points define both the spatial position and the local orientation of nearby RFs that are compatible with the estimate at q (see Figure 7c). Together, these compatible cells induce a curvature-based field of long-range horizontal connections (see Figures 7a through 7c and 8a through 8d). While different curvatures induce different projection fields, the "sum" over curvatures gives an association field (see Figure 8e) reminiscent of recent psychophysical findings (Field et al., 1993). Note, however, that as a psychophysical entity, the association field is not necessarily a one-to-one reflection of connectivity patterns in the visual cortex. In fact, representing a "cognitive union" across displays of different continuations, the association field is unlikely to characterize any single cell. Similar considerations can be applied toward the local approximation of
texture flows, although now the construction of a rigorous local model is slightly more challenging. Unlike curves, this model must depend on the estimates of two curvatures at the point q, K_T = κ_T(q) and K_N = κ_N(q); but more important, these estimates cannot be held constant in the neighborhood of q, however small; they must covary for the pattern to be feasible (Ben-Shahar & Zucker, 2003b). Nevertheless, invariances between the curvatures do exist, and formal considerations of good continuation have been shown to yield a unique approximation that, in R² × S¹, takes the form of a right helicoid (see Figures 7c and 7d) and whose orientation function has
Figure 7: Facing page. Differential geometry, integration models, and horizontal connections between RFs. (a) An estimate of the tangent (light blue vector) and curvature at a point q permits modeling a curve with the osculating circle as a good-continuation approximation in its neighborhood. Given the approximation, compatible (green) and incompatible (pink) tangents at nearby locations can be explicitly derived. (b) With height representing orientation (see the scale along the θ-axis), the osculating circle lifts to a helix in R² × S¹ whose points define both the spatial location and orientation of compatible nearby tangents. Color-coded as in a, the green point is compatible with the blue one, while the pink points are incompatible with it. (c) The consistent structure in a and b illustrated as RFs and their spatial arrangement. As an abstraction for visual integration, the ideal geometrical model—the osculating circle—induces a discrete set of RFs, which can facilitate the response of the central cell. Shown here is an example for one particular curvature tuning at the central cell. (d) For textures, the determination of good continuation requires two curvatures at a point. Based on these curvatures, a local model of good continuation can determine the position, orientation, and curvatures of (spatially) nearby coherent points. Given these two curvatures at a point, there exists a unique model of good continuation that guarantees identical covariation of the curvature functions. Given the approximation, compatible (green) and incompatible (pink) flow patches at nearby locations can be explicitly derived. (e) In R² × S¹, our model for 2D orientation good continuation lifts to a right helicoid, whose points define both the spatial location and orientation of compatible (green) nearby flow tangents. (f) As an abstraction for visual integration, the ideal geometric model—the right helicoid—induces a discrete set of RFs, which can facilitate the response of the central cell. Shown here is an example for one particular curvature tuning at the central cell. Note that broad RF tuning means that both the helix and the helicoid must be dilated appropriately, thus resulting in compatible "volumes" in R² × S¹ and possibly multiple compatible orientations at given spatial positions. This dilation should be reflected in the set of compatible RFs and the horizontal links to them, but to avoid clutter, we omit it from this figure. The effect of this dilation is illustrated in Figure 8 and consequently in all our calculations.
the following expression:

\[
\theta(x, y) = \tan^{-1}\!\left(\frac{K_T\,x + K_N\,y}{1 + K_N\,x - K_T\,y}\right).
\tag{2.6}
\]
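Equation 2.6 is directly computable; a minimal sketch (our own naming, using a quadrant-aware arctangent so the branch of the inverse tangent is handled):

```python
import numpy as np

def helicoid_orientation(x, y, kT, kN):
    """Equation 2.6: orientation theta(x, y), in radians, of the right
    helicoidal good-continuation model, for tangential curvature kT and
    normal curvature kN estimated at the origin (theta(0, 0) = 0)."""
    return float(np.arctan2(kT * x + kN * y, 1.0 + kN * x - kT * y))
```

Note that the partial derivatives at the origin recover the two curvatures, with the rate of rotation in the tangential direction equal to K_T and in the normal direction equal to K_N, consistent with equation 2.3.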
The unique property of this object is that it induces an identical covariation of the two curvature functions κ_T and κ_N in the neighborhood of the point q. The osculating helicoid is the formal 2D analog of the osculating circle and, as with co-circularity for curves, the fields of connections between neurons that this model generates (see Figures 8f through 8j) depend intrinsically on curvature(s). Such connectivity structures can be used to compute coherent texture and shading flows in a neural, distributed fashion (Ben-Shahar & Zucker, 2003b). Two examples are shown in Figure 9.

3 Results

The computational connection fields generated above contain all the geometrical information needed for predictions about the long-range horizontal connections of individual cells (or, after some averaging, those of tracer injection sites) in visual cortex. Thus, we now turn to the central question: How well do these connectivity maps match the available data about projection fields in visual cortex? In particular, do they make better predictions than those arising from collinearity or association field models? To answer these questions, we focused on anatomical studies that report population statistics (Malach et al., 1993; Bosking et al., 1997) and compared their data to predictions produced by performing "computational anatomy" on our model.³ We randomly sampled the population of model-generated fields analogously to the way anatomists sample cells, or injection sites, in neural tissue, and computed both individual and population statistics of their connection distributions. To generate robust predictions, we repeated these sampling procedures many times and calculated the expected values and standard errors of the frequency distribution.

3.1 Computational Anatomy Predicts Biological Measurements.

Figure 10 illustrates the main results computed from our models and compares them to the corresponding anatomical data reported in the literature (Malach et al., 1993; Bosking et al., 1997).
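The "computational anatomy" sampling procedure just described can be sketched as follows (a hypothetical re-implementation with our own naming; in practice the field histograms would come from the model-generated projection fields):

```python
import numpy as np

def computational_anatomy(field_hists, n_sample=7, n_repeats=100, rng=None):
    """Mimic anatomical sampling: draw n_sample model projection fields
    (histograms over orientation difference), pool them, and repeat to
    estimate the expected pooled distribution and its standard error.

    field_hists: array of shape (n_fields, n_bins), one normalized
    connection histogram per model cell (or injection site).
    """
    rng = np.random.default_rng(rng)
    field_hists = np.asarray(field_hists, dtype=float)
    pooled = np.empty((n_repeats, field_hists.shape[1]))
    for r in range(n_repeats):
        # sample fields without replacement, as anatomists sample sites
        idx = rng.choice(len(field_hists), size=n_sample, replace=False)
        mean_hist = field_hists[idx].mean(axis=0)
        pooled[r] = mean_hist / mean_hist.sum()
    # expected value and standard error over repeated samplings
    return pooled.mean(axis=0), pooled.std(axis=0) / np.sqrt(n_repeats)
```

The choice of n_sample = 7 mirrors, for illustration, the seven injection sites pooled in Bosking et al. (1997); the per-bin variability across repetitions is what the model's second-order predictions are read from.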
The agreement of the computational predictions with the biological data is striking, both qualitatively and quantitatively. As with the association field, our model correctly predicts the spread of the pooled distribution with a similar peak height (approximately

³ Anatomical studies such as Bosking et al. (1997) and Malach et al. (1993) were preferred to psychophysical or electrophysiological studies, which typically contribute no population statistics and are generally more difficult to interpret directly in terms of the structure of horizontal connections.
Figure 8: Illustration of connection fields for curves (top, based on co-circularity; Parent & Zucker, 1989) and textures (bottom, based on the right helicoidal model; Ben-Shahar & Zucker, 2003b). Each position in these fields represents one orientation hypercolumn, while individual bars represent the orientation preference of single neurons, all of which are connected to the central cell in each field. Multiple bars at any given point represent multiple neurons in the same hypercolumn that are connected to the central cell, a result of the dilation of the compatible structure due to broad RF tuning (see the caption of Figure 7). All fields assume that orientation tuning is quantized to 10 degrees, and their radius of influence is set to four to five nonoverlapping hypercolumns to reflect a 6 to 8 mm cortical range of horizontal connections (Gilbert & Wiesel, 1989) and a hypercolumn diameter of 1.5 mm (to account for ocular dominance domains). (a–d) Examples of co-circularity projection fields (Parent & Zucker, 1989) for cells with an orientation preference of 150 degrees (center bars) and different values of curvature tuning, based on the implementation by Iverson (1994). (a) κ = 0.0 (curvature in units of pixels⁻¹). (b) κ = 0.08. (c) κ = 0.16. (d) κ = 0.24. (e) The union of all projection fields of all cells with the same orientation preference (0 degrees in this case) but different curvature tuning. Note the similarity to the schematic association field in Figure 6b. (f–j) Examples of the texture flow projection fields (Ben-Shahar & Zucker, 2003b) for cells with horizontal orientation preference (center bars) and different curvature tuning. Note the intrinsic dependency on curvatures and the qualitatively different connectivity patterns that they induce. (f) (κ_T, κ_N) = (0.0, 0.0). (g) (κ_T, κ_N) = (0.2, 0.0). (h) (κ_T, κ_N) = (0.0, 0.2). (i) (κ_T, κ_N) = (0.1, 0.1). (j) (κ_T, κ_N) = (0.2, 0.2). Note that while the majority of connections link cells of roughly similar orientation, some connect cells of large orientation differences. The fields shown are just a few examples sampled from the models, both of which contain similar (rotated) connection fields for each of the possible orientation preferences in the central hypercolumn. The circles superimposed on d and i are used to characterize retinotopic distance zones for the predictions made in Figure 15.
Figure 9: Examples of coherent texture (a–d) and shading (e–g) flow computation based on contextual facilitation with right helicoidal connectivity patterns (Ben-Shahar & Zucker, 2003b). (a) Natural image of a tree stump with perceptual texture flow. (b) A manually drawn flow structure as perceived by a typical observer. (c) Noisy orientation field reminiscent of RF responses. The computed measurements are based on the direction of the image intensity gradient. (d) The outcome of applying a contextual and distributed computation (Ben-Shahar & Zucker, 2003b) that facilitates the responses of individual cells based on their interaction with nearby cells through the connectivity structures in Figure 8. Compare to b and note how the measurements in the area of the knot, where no RF is embedded in a coherent context, were rejected altogether. (e) An image of a plane. (f) Measured shading flow field (white) and edges (black). In biological terms, edges are measured by RFs of particular orientation preferences tuned to high spatial frequencies. The shading field may be measured by cells tuned to low frequencies. (g) Applying the right helicoidal-based computation to the shading information results in a coherent shading field on the plane's nose and a complete rejection of the incoherent shading information on the textured background. Such an outcome can be used to segment smoothly curved surfaces in the scene (Ben-Shahar & Zucker, 2003b), to resolve their shape (Lehky & Sejnowski, 1988), to identify shadows (Breton & Zucker, 1996), and to determine the occlusion relationships underlying edge classification (Huggins et al., 2001).
11% for an orientation resolution of 10 degrees) and a similar orientation offset at which it crosses the uniform distribution (approximately ±40 degrees). Unlike the collinearity and association field models, however, ours predicts qualitative differences between the distributions of individual neurons, or injection sites, similar to findings in the literature (see Figure 10c). Most important, our model predicts the consistently nonmonotonic standard deviation. At an orientation resolution of 10 degrees, both the anatomical data and the com-
putational models exhibit variance local minima at approximately ±30 degrees. This property holds for both a random sample of cells (see Figure 11) and the computational population as a whole (not shown for space considerations).

3.2 Curvature Quantization and Population Statistics.

The geometrical model discussed in this article must be quantized in both orientation and curvature before projection patterns can be computed and computational predictions can be made. We fixed the orientation quantization to the same level used in Bosking et al. (1997). Curvature quantization, however, is not addressed in the physiological literature, and thus it is necessary to examine its effect on the resultant connectivity distributions. We note that even with orientation represented to hyperacuity levels, there are sufficient numbers of cells to represent such quantization (Miller & Zucker, 1999). Broad orientation tuning implies discrete orientation quantization and suggests even more discrete curvature quantization. The results presented in Figures 10 and 11 are based on quantizing curvature into five classes.⁴ This is a likely upper bound, given the broad bandpass tuning of cortical neurons that has been observed (Dobbins et al., 1987; Versavel et al., 1990) and modeled (Dobbins, Zucker, & Cynader, 1989). However, to study the effect of curvature quantization, we repeated the entire set of computations with both a smaller (three) and a larger (seven) number of curvature classes. Three is clearly the lower limit, which may correspond to the tree shrew (Bosking et al., 1997) or other simple mammals, and seven is more than required computationally (Ben-Shahar & Zucker, 2003b). We found that all of the properties predicted initially remain invariant under these changes.
In particular, regardless of quantization level, the pooled distribution remains unimodal, it peaks at zero orientation difference with approximately 11%, it crosses the uniform distribution at ±40 degrees, and it has nonmonotonic variance with local minima at ±30 degrees (with somewhat increased variance around zero orientation for higher quantization levels). Qualitative differences between individual neurons are predicted regardless of the number of curvature classes. All these results are illustrated in Figure 12.

3.3 Relationship Between Cells' Distribution and Connections' Distribution. Since both anatomical and computational studies must sample the population of (biological or computational) cells to measure the distribution of their horizontal connections, an important consideration is whether the underlying distribution of cells (based on their curvature tuning) can affect the pooled distribution of connections. For example, if most cells in
4 In the context of curves, these five classes may be labeled as straight, slowly curving to the left, slowly curving to the right, rapidly curving to the left, and rapidly curving to the right.
O. Ben-Shahar and S. Zucker
Figure 10: Comparison of anatomical data and model predictions for the distribution of long-range horizontal connections in the orientation domain. In all graphs, dashed horizontal lines represent the uniform distribution, and error bars represent ±1 standard error. (a) Mean connection distribution of four injection sites from Malach et al. (1993) versus the computational prediction from our models (expected mean, N = 4, 100 repetitions). Note the dominant peak around zero orientation difference and the considerable width of the histogram. The asymmetry in the pooled distribution measured by Malach et al. (1993) likely derives from a bias at the injection site (see their Fig. 4D) rather than being intrinsic. (b) Median distribution of seven injection sites from Bosking et al. (1997) against the computational prediction from our models (expected median, N = 7, 100 repetitions). Note in particular the similarity in the peaks' height and in the orientation offset at which the graphs cross the uniform distribution, and the strongly nonmonotonic behavior of the variance. (c) Two individual injection sites with qualitatively different connection distributions reproduced from Bosking et al. (1997). The counterpart computational instances are sampled from our models. Solid graphs correspond to the fields in Figures 8b and 8i. Dashed graphs correspond to the fields in Figures 8c and 8j.
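The sampling protocol behind these expected distributions (draw N cells or injection sites at random, pool their connection histograms bin-wise, repeat 100 times) can be sketched as follows. This is a minimal illustration, not the authors' code: the toy population and the function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_statistic(cell_hists, n_cells, n_reps=100, stat=np.mean):
    """Mimic the protocol of Figures 10-11: sample n_cells connection
    distributions, pool them bin-wise with `stat`, repeat n_reps times,
    and report the bin-wise mean and standard error over repetitions.

    cell_hists : (n_population, n_bins) array of per-cell connection
                 distributions over orientation-difference bins.
    """
    pooled = np.empty((n_reps, cell_hists.shape[1]))
    for r in range(n_reps):
        idx = rng.choice(len(cell_hists), size=n_cells, replace=False)
        pooled[r] = stat(cell_hists[idx], axis=0)
    return pooled.mean(axis=0), pooled.std(axis=0) / np.sqrt(n_reps)

# Toy population: 50 cells, 18 orientation-difference bins of 10 degrees.
cells = rng.dirichlet(np.ones(18) * 0.5, size=50)
mean_hist, sem = expected_statistic(cells, n_cells=4)               # as in (a)
median_hist, _ = expected_statistic(cells, n_cells=7, stat=np.median)  # as in (b)
```

The same helper covers both the expected-mean (N = 4) and expected-median (N = 7) cases by swapping the pooling statistic.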
Figure 11: Although both the computational and the physiologically measured distributions of the mean are monotonically decreasing, their standard deviation is consistently nonmonotonic. (a) While Bosking et al. (1997) used the population median, we further analyzed their published data (from seven injection sites) to find its mean and standard deviation. It is evident that the standard deviation is nonmonotonic, with two local minima at ±30 degrees (marked with arrows). (b) Expected standard deviation for the texture model. (c) Expected standard deviation for the curve model. Both graphs depict the expected standard deviation for seven randomly selected cells (N = 7, 100 repetitions), and both show a similar nonmonotonic behavior with pronounced standard deviation local minima at approximately ±30 degrees. Note how the computational local minima coincide with the anatomical ones (arrows are copied from a and overlaid on the computational graphs). Compare also to the standard deviation on the median graphs in Figure 10b. Note that as with the distributions themselves, both computational models produce quantitatively similar standard deviation results.
Figure 12: Different quantization levels of curvature tuning have little effect on the expected median distribution and its standard deviation. (a) Anatomical data from Bosking et al. (1997) shown for comparison with the computational predictions. (b) Computational predictions with three curvature classes. (c) Computational predictions with five curvature classes. (d) Computational predictions with seven curvature classes. In all cases, the left column depicts the expected median for seven cells (bars are 1 s.d.), the middle column depicts the expected standard deviation for seven cells, and the right column shows two qualitatively different distributions from two different cells. For space considerations, we show the results from the texture model only.
the population are tuned for zero or very small curvature, the pooled connection distribution may differ from that of a population dominated by high-curvature cells. The results presented in Figures 10 and 11 are based on the assumption that cells of different curvature tuning (or, put differently, of different connectivity patterns) are distributed uniformly. Such an assumption follows naturally from the mathematical abstraction that allocates the same number of computational units to equal portions of R² × S¹. However, if this assumption were wrong, would a bias in the distribution of cells significantly affect the predictions made from our models? Unfortunately, few data about such distributions are available, partially because anatomists need not assume any particular distribution of cells for their measurements of projection fields, and partially because curvature tuning is rarely considered. Some data available on the distribution of endstopped cells (Kato, Bishop, & Orban, 1978; Orban, 1984), in conjunction with the functional equivalence of end stopping with curvature selectivity (Orban, Versavel, & Lagae, 1987; Dobbins et al., 1987, 1989; Versavel et al., 1990), suggest that cells are distributed bimodally in the curvature domain, with peaks at both zero and high curvature tuning. Alternatively, statistical studies of edge correlations in natural images (Dimitrov & Cowan, 1998; August & Zucker, 2000; Sigman et al., 2001; Kaschube, Wolf, Geisel, & Löwel, 2001; Geisler et al., 2001; Pedersen & Lee, 2002) show that collinear co-occurrences are more frequent than others. Although these co-occurrence measurements neither depend on curvature nor necessarily indicate any particular distribution of cells at the computational level, implicitly they may suggest that cells are distributed unimodally in the curvature domain, with a peak at zero curvature only.
Since our model raises the possibility of a curvature bias effect, we therefore redistributed the population of our computational cells according to one or the other of these nonuniform (bimodal and unimodal) distributions, and then repeated the computational anatomy process described in section 3.1. All computations were done on the more general 2D (texture) model. The bimodal distribution was modeled as a radial two-gaussian mixture model (GMM) parameterized by the total curvature κ = √(κT² + κN²) and parameters μ0 = 0, σ0, μ1, and σ1. The unimodal distribution was modeled as a 2D gaussian of zero mean and variances σT and σN in the κT and κN dimensions, respectively. Figure 13 illustrates one example of the resultant statistical measures for the bimodal cell distribution. In this example, σ0 = 0.04 and σ1 = 0.05, where the slight difference accounts for corresponding differences in the two modes as reported in Kato et al. (1978) and Orban (1984). As is shown, this nonuniform distribution hardly changes the expected median, while further emphasizing the nonmonotonic nature of the variance (compared to the statistics obtained with the same number of curvature classes and
Figure 13: A statistical confirmation that the properties of our models persist even when the population of cells is distributed bimodally (Kato et al., 1978; Orban, 1984). Illustrated here is the result from a distribution modeled as a GMM with μ0 = 0, σ0 = 0.04, μ1 = 0.2, and σ1 = 0.05. Since the bimodal nature of the distribution is best represented with a higher number of curvature classes, presented here is the case of the texture model with curvatures quantized to seven classes each. (a) The (radially) bimodal distribution of cells in the curvature domain, normalized for number of cells. The x- and y-axes represent tangential and normal curvature tuning, respectively, and the z-axis represents the number of such curvature-tuned cells for any given orientation tuning. (b) Expected median of seven cells. Error bars are ±1 standard deviation. (c) Expected standard error for seven cells. Compare both graphs to Figure 12d.
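The radial two-gaussian mixture used here can be sketched numerically. The parameter values (σ0 = 0.04, μ1 = 0.2, σ1 = 0.05) follow this example, but the grid, function name, and normalization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def radial_gmm_weights(kt, kn, mu1=0.2, sigma0=0.04, sigma1=0.05):
    """Weight cells by a radial two-gaussian mixture over total curvature.

    kt, kn : arrays of tangential and normal curvature tunings.
    One mode peaks at kappa = 0 (width sigma0), the other at kappa = mu1
    (width sigma1), mimicking the bimodal end-stopping data cited in the text.
    """
    kappa = np.sqrt(kt**2 + kn**2)              # total curvature
    w = (np.exp(-0.5 * (kappa / sigma0)**2)
         + np.exp(-0.5 * ((kappa - mu1) / sigma1)**2))
    return w / w.sum()                          # normalize to a distribution

# Example: a 7x7 grid of curvature classes, as in this figure.
kt, kn = np.meshgrid(np.linspace(-0.2, 0.2, 7), np.linspace(-0.2, 0.2, 7))
w = radial_gmm_weights(kt, kn)
```

Along the κN = 0 row, the resulting weights dip between κ = 0 and |κ| = μ1, which is the bimodality visible in panel (a).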
uniform cell distribution; see Figure 12d). Similar results were obtained with other values of σ0 and σ1 and for the curvature quantized to five classes as well.5 Figure 14 illustrates another example, this time using the unimodal cell distribution mentioned above. In this example, σT = σN = 0.15, such that cells with zero curvature tuning are eight times more frequent than cells tuned to the maximum value of curvature. As expected, this strongly nonuniform distribution slightly elevated the peak of the population statistics, but otherwise all other features that were predicted from the uniform cell distribution, and in particular the nonmonotonic variance, were fully preserved. Similar results were obtained with other values of σT and σN, and for all quantization levels of the curvatures (as in section 3.2). In summary, we have shown that even if cells in primary visual cortex were distributed nonuniformly in their curvature tuning, the pooled distribution of long-range horizontal connections in the orientation domain would preserve its fundamental properties, in particular its wide spread and nonmonotonic variance. Thus, our conclusions are not biased by an implicit assumption about a curvature-dependent distribution of cells.

4 Discussion

The findings presented from our computational anatomy support the functional identification of the long-range horizontal connections with those obtained mathematically. However, the question of why the texture model is necessary becomes unavoidable, and we believe this issue is more than just formal mathematics. Certain physiological and psychophysical findings, such as iso-orientation side facilitation (Adini et al., 1997), functional and anatomical connections between retinotopically parallel receptive fields (Ts'o et al., 1986; Gilbert & Wiesel, 1989), and the roughly isotropic retinotopic extent of projection fields (Malach et al., 1993; Sincich & Blasdel, 2001), suggest the perceptual integration of texture flows rather than curves.
5 Quantization of curvature to three classes was irrelevant in this case because the bimodality of the distribution could not be expressed using too few (three) samples.

Figure 14: A statistical confirmation that the properties of our models persist even when the population of cells is distributed normally (i.e., unimodally). Such a distribution is implicitly suggested by statistics of edges in natural images (Dimitrov & Cowan, 1998; August & Zucker, 2000; Sigman et al., 2001; Kaschube et al., 2001; Geisler et al., 2001; Pedersen & Lee, 2002) in which collinear edges are much more frequent. The case presented here (σT = σN = 0.15) induces a distribution in which cells of zero curvature tuning are eight times more frequent than those of maximal curvature tuning. The graphs in this figure correspond to the texture model with curvatures quantized to three classes each. Similar results were obtained with other quantization levels as well. (a) The normal distribution of cells in the curvature domain, normalized for number of cells. The x- and y-axes represent tangential and normal curvature tuning, respectively, and the z-axis represents the number of such curvature-tuned cells for any given orientation tuning. (b) Expected median of seven cells. Error bars are ±1 standard deviation. (c) Expected standard error for seven cells. Compare both graphs to Figure 12b.

Although this class of patterns may seem less important than curves as a factor in perceptual organization, their perceptual significance has been established (Glass, 1969; Kanizsa, 1979; Todd & Reichel, 1990). Furthermore, recent computational vision research implicates them in the analysis of visual shading (Lehky & Sejnowski, 1988; Huggins, Chen, Belhumeur, & Zucker, 2001), as was demonstrated in Figure 9, and even color (Ben-Shahar & Zucker, 2003a). Whether projection patterns of cells in primary visual cortex come in different flavors (i.e., curve versus texture or shading integration) is an open question. To answer it, one is likely to exploit the many physiologically measurable differences between these classes of projection patterns, as suggested by Figure 8. Unfortunately, the statistical data obtained so far do not distinguish between the two (curve and texture) integration models; without a spatial dimension, the statistical differences between the two models in the orientation domain are too fine to measure relative to the accuracy of current laboratory techniques. Until full spatio-angular data are obtained, however, the inclusion of even weak spatial information is sufficient to generate further testable predictions. In particular, incorporating the retinotopic distance between linked cells into the statistics (or estimating it from their cortical distance) can produce predictions regarding the dependency of the distribution's spread and shape on the integration distance, as illustrated in Figure 15. Some verification of these predictions can be seen in the measurements of Kisvárday et al. (1997; the top row in their Fig. 9 shows developing peaks resembling the ones in Figures 15b and 15c). Similar annular analyses, which focus on sectors of the annuli in directions other than parallel to the RF's preferred orientation, provide measurable differences between curve and texture projection fields. In summary, we have presented mathematical analysis and computational models that predict both the pooled distribution of long-range horizontal connections and the distributions of individual cells and injection sites. For the first time, the modeling goes beyond the unimodal first-order data and falsifies a common conclusion from it. In particular, while coaligned facilitation entails the pooled first-order data, the converse is not necessarily true: these data are also consistent with curvature-dependent connections. The second-order (variance) data, however, remain consistent with curvature-dependent connections but not with coaligned facilitation. The explanatory force of our model derives from sensory integration, and we observed in Section 1 that most researchers limit this to curve integration via collinearity.
We conclude in an enlarged context: differential geometry provides a foundation for connections in visual cortex that predicts both dependency on curvature(s) and an expanded functional capability, including curve, texture, and shading integration. Since the same geometrical analysis applies to many other domains in which orientation and direction fields are fundamental features, coherent motion processing (Series, Georges, Lorenceau, & Frégnac, 2002) and coherent color perception (Ben-Shahar & Zucker, 2003a) might also have been included. Since all follow from the geometry and all are important for vision, more targeted experiments are required to articulate their neural realization.

Appendix: Noise Models

Two basic noise models are used in this article to examine whether variations of the basic collinear distribution (see Figure 2) can produce pooled statistics with properties similar to the biological ones. This appendix describes both procedures in detail.
Figure 15: Model predictions of connection distributions by a retinotopic annulus. Left, middle, and right columns correspond to predictions based on small, medium, and large annuli, respectively (circles in Figures 8d and 8i). All annuli refer to distances beyond one orientation hypercolumn; thus, the small annulus should not be confused with distances less than the diameter of one hypercolumn. In a and b, the same sampling procedure and the same sample set sizes described in Figure 10 were repeated. For lack of space, we omit the very similar graphs of the mean and median of the entire population, and present predictions from the texture model only. (a) Expected mean distribution and standard error (N = 4, 100 repetitions). Note the spread with increased retinotopic distance. (b) Expected median distribution and standard error (N = 7, 100 repetitions). Note the developing symmetric peaks that further depart from the iso-orientation domain as the spatial distance increases. The correspondence of these peaks to the minima of the standard error is remarkable, thus designating them as statistical anchors suitable for empirical verification. (c) Individually tuned cells show the qualitative difference between distributions of high- and low-curvature cells. In particular, note how the distributions of medium- and high-curvature cells (dashed graphs) are the ones that develop the peaks mentioned in b above.
To examine natural random variations at the level of individual connections, each long-range horizontal connection, ideally designated to connect cells of orientation difference Δθ, is shifted to connect cells of orientation difference Δθ + εσ, where εσ is a wrapped gaussian noise with zero mean and variance σ. To do this computation, the base distribution (collinear or association field) from Figure 2, initially given as probabilities over 18 orientation bins of 10 degrees each, was normalized and quantized to a connection histogram in the range [0, N], where N represents the total number of connections a cell makes. To each such connection to orientation difference Δθ we then added a wrapped gaussian noise εσ of zero mean and variance σ, and the new (random) connection was accumulated at the bin Δθ + εσ of the resultant histogram. This process was repeated 200 times to produce 200 different perturbations, from which both the expected distribution and the variance were computed bin-wise. The parameter σ was set to the value that produced an expected distribution of peak height and spread similar to the biological one. Since different anatomical studies and protocols indicate a different number of total connections (e.g., hundreds in Schmidt et al., 1997; approximately 3500 in Buzás et al., 1998; and up to 20,000 for injection sites of approximately 20 cells in Bosking et al., 1997), we repeated this statistical test for normalizations in different ranges. As expected, changing N only scaled the variance uniformly across the expected distribution but did not affect its mean. Thus, for better clarity of its monotonicity, the result in Figure 3b reflects a smaller number of total connections (N = 200), as in, for example, Schmidt et al. (1997).
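A minimal sketch of this first perturbation procedure follows. The function name and the toy collinear base histogram are illustrative assumptions; only the protocol (quantize the base distribution into N connections, jitter each by wrapped gaussian noise, re-bin, repeat, and collect bin-wise statistics) is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_connections(base_hist, sigma_deg, n_total=200, n_reps=200,
                       bin_deg=10.0):
    """Perturb every individual connection by wrapped gaussian noise.

    base_hist gives probabilities over orientation-difference bins covering
    [-90, 90) degrees (18 bins of 10 degrees, as in the text); n_total is
    the total number of connections a cell makes.  Returns the bin-wise
    mean and standard deviation over n_reps perturbations.
    """
    n_bins = len(base_hist)
    # Quantize the base distribution into n_total discrete connections.
    counts = np.round(np.asarray(base_hist, float)
                      / np.sum(base_hist) * n_total).astype(int)
    offsets = np.repeat((np.arange(n_bins) - n_bins // 2) * bin_deg, counts)
    reps = np.empty((n_reps, n_bins))
    for r in range(n_reps):
        noisy = offsets + rng.normal(0.0, sigma_deg, size=offsets.size)
        noisy = (noisy + 90.0) % 180.0 - 90.0      # wrap into [-90, 90)
        idx = np.clip(((noisy + 90.0) // bin_deg).astype(int), 0, n_bins - 1)
        reps[r] = np.bincount(idx, minlength=n_bins) / offsets.size
    return reps.mean(axis=0), reps.std(axis=0)

# Toy collinear base distribution: all mass near zero orientation difference.
base = np.zeros(18)
base[8:10] = 0.5
mean_hist, std_hist = jitter_connections(base, sigma_deg=20.0)
```

Increasing `n_total` with the base probabilities fixed scales the bin-wise variance without changing the expected distribution, which is the effect of the range parameter N described above.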
To examine random variations due to "leakage" of tracer from an injection site of preferred orientation θ0 to nearby orientation columns, we modeled such leakage by selecting i = 1, ..., M cells of preferred orientation θi = θ0 + Δθσ, where Δθσ is a wrapped gaussian random variable of zero mean and variance σ. A normalized and quantized base distribution (collinear or association field) was then centered around each of the θi, and all were summed and normalized to yield a resultant (random) distribution of connections for the injection site at θ0. As before, we repeated this generation process 200 times to produce 200 different perturbations, from which both the expected distribution and the variance were computed bin-wise. The parameter σ was again set to the value that produced an expected distribution of peak height and spread similar to the biological one. The number of cells in an injection was set to M = 20, approximately as reported in Bosking et al. (1997). Unlike random variations at the level of individual connections, the range parameter N had no effect on the variance of the expected distribution.

Acknowledgments

We are grateful to Allan Dobbins, David Fitzpatrick, Kathleen Rockland, Terry Sejnowski, and Michael Stryker for reviewing this manuscript and for
providing valuable comments; to Lee Iverson for the curve compatibility fields in Figure 8; and to an anonymous reviewer for pointing out a possible artifact in the data (in Section 1.1). This research was supported by AFOSR, ONR, and DARPA.

References

Adini, Y., Sagi, D., & Tsodyks, M. (1997). Excitatory-inhibitory network in the visual cortex: Psychophysical evidence. Proc. Natl. Acad. Sci. U.S.A., 94, 10426–10431.
August, J., & Zucker, S. (2000). The curve indicator random field: Curve organization via edge correlation. In K. Boyer & S. Sarkar (Eds.), Perceptual organization for artificial vision systems. Norwell, MA: Kluwer.
Ben-Shahar, O., & Zucker, S. (2003a). Hue fields and color curvatures: A perceptual organization approach to color image denoising. In Proc. Computer Vision and Pattern Recognition (pp. 713–720). Los Alamitos, CA: IEEE Computer Society.
Ben-Shahar, O., & Zucker, S. (2003b). The perceptual organization of texture flows: A contextual inference approach. IEEE Trans. Pattern Anal. Machine Intell., 25(4), 401–417.
Bosking, W., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in the tree shrew striate cortex. J. Neurosci., 17(6), 2112–2127.
Breton, P., & Zucker, S. (1996). Shadows and shading flow fields. In Proc. Computer Vision and Pattern Recognition (pp. 782–789). Los Alamitos, CA: IEEE Computer Society.
Buzás, P., Eysel, U., & Kisvárday, Z. (1998). Functional topography of single cortical cells: An intracellular approach combined with optical imaging. Brain Res. Prot., 3, 199–208.
Dimitrov, A., & Cowan, J. (1998). Spatial decorrelation in orientation-selective cortical cells. Neural Comput., 10, 1779–1795.
do Carmo, M. (1976). Differential geometry of curves and surfaces. Englewood Cliffs, NJ: Prentice-Hall.
Dobbins, A., Zucker, S., & Cynader, M. (1987). Endstopped neurons in the visual cortex as a substrate for calculating curvature. Nature, 329(6138), 438–441.
Dobbins, A., Zucker, S., & Cynader, M. (1989). Endstopping and curvature. Vision Res., 29(10), 1371–1387.
Field, D., Hayes, A., & Hess, R. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.
Geisler, W., Perry, J., Super, B., & Gallogly, D. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Res., 41(6), 711–724.
Gilbert, C., & Wiesel, T. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9(7), 2432–2442.
Glass, L. (1969). Moiré effect from random dots. Nature, 223(5206), 578–580.
Hubel, D., & Wiesel, T. (1977). Functional architecture of macaque monkey visual cortex. Proc. R. Soc. London Ser. B, 198, 1–59.
Huggins, P., Chen, H., Belhumeur, P., & Zucker, S. (2001). Finding folds: On the appearance and identification of occlusion. In Proc. Computer Vision and Pattern Recognition (pp. 718–725). Los Alamitos, CA: IEEE Computer Society.
Iverson, L. A. (1994). Toward discrete geometric models for early vision. Unpublished doctoral dissertation, McGill University.
Kanizsa, G. (1979). Organization in vision: Essays on gestalt perception. New York: Praeger.
Kapadia, M., Ito, M., Gilbert, C., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15, 843–856.
Kapadia, M., Westheimer, G., & Gilbert, C. (2000). Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J. Neurophysiol., 84, 2048–2062.
Kaschube, M., Wolf, F., Geisel, T., & Löwel, S. (2001). The prevalence of colinear contours in the real world. Neurocomputing, 38–40, 1335–1339.
Kato, H., Bishop, P., & Orban, G. (1978). Hypercomplex and the simple/complex cell classification in cat striate cortex. J. Neurophysiol., 41, 1071–1095.
Kisvárday, Z., Kim, D., Eysel, U., & Bonhoeffer, T. (1994). Relationship between lateral inhibition connections and the topography of the orientation map in cat visual cortex. Eur. J. Neurosci., 6, 1619–1632.
Kisvárday, Z., Tóth, É., Rausch, M., & Eysel, U. (1997). Orientation-specific relationship between populations of excitatory and inhibitory lateral connections in the visual cortex of the cat. Cereb. Cortex, 7, 605–618.
Lehky, S., & Sejnowski, T. (1988). Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature, 333, 452–454.
Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993).
Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A., 90, 10469–10473.
Matsubara, J., Cynader, M., Swindale, N., & Stryker, M. (1985). Intrinsic projections within visual cortex: Evidence for orientation-specific local connections. Proc. Natl. Acad. Sci. U.S.A., 82, 935–939.
Miller, D., & Zucker, S. (1999). Computing with self-excitatory cliques: A model and application to hyperacuity-scale computation in visual cortex. Neural Comput., 11, 21–66.
Mitchison, G., & Crick, F. (1982). Long axons within the striate cortex: Their distribution, orientation, and patterns of connections. Proc. Natl. Acad. Sci. U.S.A., 79, 3661–3665.
O'Neill, B. (1966). Elementary differential geometry. Orlando, FL: Academic Press.
Orban, G. (1984). Neural operations in the visual cortex. Berlin: Springer-Verlag.
Orban, G., Versavel, M., & Lagae, L. (1987). How do striate neurons represent curved stimuli? Abstracts of the Society for Neuroscience, 13, 404.10.
Parent, P., & Zucker, S. (1989). Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Anal. Machine Intell., 11(8), 823–839.
476
O. Ben-Shahar and S. Zucker
Pedersen, K., & Lee, A. (2002). Toward a full probability model of edges in natural images (APPTS Tech. Rep. 02-1). Providence, RI: Division of Applied Mathematics, Brown University.
Polat, U., & Sagi, D. (1993). Lateral interactions between spatial channels: Suppression and facilitation revealed by lateral masking experiments. Vision Res., 33(7), 993–999.
Rockland, K., & Lund, J. (1982). Widespread periodic intrinsic connections in the tree shrew visual cortex. Science, 215(19), 1532–1534.
Schmidt, K., Goebel, R., Löwel, S., & Singer, W. (1997). The perceptual grouping criterion of colinearity is reflected by anisotropies in the primary visual cortex. Eur. J. Neurosci., 9, 1083–1089.
Schmidt, K., & Löwel, S. (2002). Long-range intrinsic connections in cat primary visual cortex. In B. Payne & A. Peters (Eds.), The cat primary visual cortex (pp. 387–426). Orlando, FL: Academic Press.
Series, P., Georges, S., Lorenceau, J., & Frégnac, Y. (2002). Orientation-dependent modulation of apparent speed: A model based on the dynamics of feedforward and horizontal connectivity in V1 cortex. Vision Res., 42, 2781–2797.
Sigman, M., Cecchi, G., Gilbert, C., & Magnasco, M. (2001). On a common circle: Natural scenes and gestalt rules. Proc. Natl. Acad. Sci. U.S.A., 98(4), 1935–1940.
Sincich, L., & Blasdel, G. (2001). Oriented axon projections in primary visual cortex of the monkey. J. Neurosci., 21(12), 4416–4426.
Todd, J., & Reichel, F. (1990). Visual perception of smoothly curved surfaces from double-projected contour patterns. J. Exp. Psych.: Human Perception and Performance, 16(3), 665–674.
Ts'o, D., Gilbert, C., & Wiesel, T. (1986). Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6(4), 1160–1170.
Versavel, M., Orban, G., & Lagae, L. (1990). Responses of visual cortical neurons to curved stimuli and chevrons. Vision Res., 30(2), 235–248.
Weliky, M., Kandler, K., Fitzpatrick, D., & Katz, L. (1995). Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron, 15, 541–552.
Wertheimer, M. (1955). Laws of organization in perceptual forms. In W. Ellis (Ed.), A source book of gestalt psychology (pp. 71–88). London: Routledge & Kegan Paul.
Zucker, S., Dobbins, A., & Iverson, L. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Comput., 1, 68–81.

Received April 9, 2003; accepted September 5, 2003.
NOTE
Communicated by Bruce Knight
Mean Instantaneous Firing Frequency Is Always Higher Than the Firing Rate

Petr Lánský
[email protected]
Institute of Physiology, Academy of Sciences of the Czech Republic, 142 20 Prague 4, Czech Republic
Roger Rodriguez
[email protected]
Centre de Physique Théorique, CNRS and Faculté des Sciences de Luminy, Université de la Méditerranée, F-13288 Marseille Cedex 09, France
Laura Sacerdote [email protected] Department of Mathematics, University of Torino, via Carlo Alberto 10, 10 123 Torino, Italy
Frequency coding is considered one of the most common coding strategies employed by neural systems. This fact leads, in experiments as well as in theoretical studies, to the construction of so-called transfer functions, where the output firing frequency is plotted against the input intensity. The term firing frequency can be understood differently in different contexts. Basically, it means that the number of spikes over an interval of preselected length is counted and then divided by the length of the interval, but due to obvious limitations, the length of observation cannot be arbitrarily long. Then firing frequency is defined as the reciprocal of the mean interspike interval. In parallel, an instantaneous firing frequency can be defined as the reciprocal of the length of the current interspike interval, and by taking a mean of these, the definition can be extended to introduce the mean instantaneous firing frequency. All of these definitions of firing frequency are compared in an effort to contribute to a better understanding of the input-output properties of a neuron.

1 Introduction

For a constant signal or under steady-state conditions, characterization of the input-output properties of neurons, as well as of neuronal models, is commonly done by so-called frequency (input-output) transfer functions, in which the output frequency of firing is plotted against the strength (again, often frequency) of the input signal. By constructing the transfer functions,

Neural Computation 16, 477–489 (2004)
© 2004 Massachusetts Institute of Technology
P. Lánský, R. Rodriguez, and L. Sacerdote
it is implicitly presumed that the information in the investigated neuron is coded by the frequency of the action potentials, the classic coding schema in neural systems (Adrian, 1928). Characterization of the input signals by the frequency of evoked action potentials requires giving a proper definition of the firing frequency and extending it to transient signals. Up to now, various definitions of the term spiking frequency have been adopted (Awiszus, 1988; Ermentrout, 1998; Gerstner & van Hemmen, 1992), and the concept of rate coding is carefully treated by Gerstner and Kistler (2002). In the laboratory situation, the crucial question in identifying the firing rate (frequency, intensity) is stationarity of the counting process of spikes; without this stationarity, speaking about the firing rate is meaningless. The most common and intuitive understanding of the firing frequency is based on counting events (spikes) appearing in an interval of prescribed duration and dividing this number by the length of the interval. An argument against using this method is that the duration of the observation interval can be limited; either the required stationarity disappears outside the interval, or its length is out of the control of the researcher. The conditions are often very variable during experiments. For example, in studies of hippocampal place cells, the dwell time of a freely moving animal in a particular spot always changes (Fenton & Muller, 1998). Similarly, in experiments with neurons in sensory systems, the observation periods have to be reduced to the time when the stimuli act on the neuron (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). In other cases, an applied pharmaceutical treatment has a limited duration, and the neuron must be recorded within this time period.
As we will note later, for a short observation period, the sensitive point is not only the limited length of the observation but also the way the counting of spikes starts (whether or not the start is identified with an action potential). Gerstner and Kistler (2002) discuss how to average over time (repetitions of experiments) or over different neurons to improve the estimates of the firing rate. A related method used for determining the firing frequency is based on calculating the inverse of the average interspike interval (ISI). We will see that this method and the previously mentioned one (based on the counting process) are identical under specific conditions, and thus in theoretical studies the reciprocal of the mean ISI is usually declared to be the firing rate (Burkitt & Clark, 2000; Van Rullen & Thorpe, 2001). Of course, for small samples, both methods are influenced by the selection of the beginning and the end of the observation period (see Figure 1), but the aim of this article is not to study the effect of a small sample size. In the extreme situation, with only one spike available during the observation period, calculating the mean ISI fails to provide any information. With two spikes, speaking about a "mean" ISI is possible and probably less influenced by other factors than is the number of spikes in a vaguely determined period. A so-called instantaneous firing frequency can be defined for a single ISI as the inverse of its length (see Pauluis & Baker, 2000, for historical notes on
Instantaneous Firing Frequency Is Always Higher Than the Firing Rate

[Figure 1 appears here: a schematic spike raster; the vertical axis shows the counting process N(t) (number of spikes), and the horizontal axis shows time from 0.00 to 0.20 s.]
Figure 1: A schematic example of experimental data. Observation starts at time zero and lasts for 0.2 s, with three spikes (or four spikes if there is a spike at the time origin). The firing frequencies, calculated in three different ways, are $f_1 = 1/\bar{t} = 20\ (25)\ \mathrm{s}^{-1}$, $f_2 = \#\text{spikes}/\text{period} = 15\ (20)\ \mathrm{s}^{-1}$, and $f_3 = \overline{(1/t)} = 23.81\ (32.54)\ \mathrm{s}^{-1}$, where the numbers in parentheses hold if the observation starts with a spike.
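The three definitions in the caption can be reproduced in a few lines of code. The spike times below are an assumption chosen to be consistent with the caption's numbers (three spikes within a 0.2 s observation window whose ISIs average 0.05 s); they are not taken from the original figure.

```python
# Hypothetical spike train consistent with Figure 1 (times in seconds).
spikes = [0.02, 0.05, 0.12]
period = 0.2

isis = [b - a for a, b in zip(spikes, spikes[1:])]   # interspike intervals
f1 = 1 / (sum(isis) / len(isis))                     # reciprocal of the mean ISI
f2 = len(spikes) / period                            # spike count / observation period
f3 = sum(1 / t for t in isis) / len(isis)            # mean instantaneous frequency

print(f1, f2, f3)   # approximately 20, 15, and 23.81 s^-1
```

Note that the three numbers already disagree on this tiny sample, with the mean instantaneous frequency the largest, as the rest of the note explains.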
this method). This implicitly assumes that the current ISI is the "mean" over a short period of time. Then, in a manner analogous to the firing frequency over an interval of some fixed length, the mean of these instantaneous firing frequencies can define the mean instantaneous firing rate. Similarly, Knight (1972) and Barbi, Carelli, Frediani, and Petracchi (1975) define the instantaneous rate as the reciprocal of the period since the last spike, and thus both definitions coincide at the moment of spike generation. These two methods, the inverse of the mean ISI and the mean of the inverse ISI, are equally suitable for experimental and simulated (modeled) data. Formally, if the available ISIs are denoted $\{t_1, \ldots, t_n\}$, which are independent realizations of a random variable $T$, then either $1/\bar{t} = 1/\left(\frac{1}{n}\sum_{i=1}^{n} t_i\right)$ or $\overline{(1/t)} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{t_i}$ is calculated. This corresponds to a situation in which it is assumed that the ISIs are realizations of a random variable $T$ and either $1/E(T)$ or $E(1/T)$, where $E$ denotes the mean throughout this note, is evaluated. The differences between the definitions of firing frequency are shown in a simple numerical example (see Figure 1). In this note, we compare these methods of firing-rate quantification for common neuronal models and point out the possible differences and implications for inference on real data. We do not deal here with time-variable and stochastic rates.

2 Basic Results

The standard definition of the rate function of discharge (firing) is (Johnson, 1996)

$$f(t) = \lim_{\Delta t \to 0} \frac{E(N(t + \Delta t) - N(t))}{\Delta t}, \qquad (2.1)$$
where $N$ is the counting process of spikes. If this function is independent of $t$ and thus constant, which is the presumption considered in this note, it is called the firing rate. A natural way to calculate the firing rate of a neuron is to divide the number of elicited spikes, $N(t)$, by the length of the observation period, $t$. The mean of the ISIs, $E(T)$, is connected to the mean of the counting process, $E(N(t))$, by the asymptotic formula

$$\frac{1}{E(T)} = \lim_{t \to \infty} \frac{E(N(t))}{t} \qquad (2.2)$$

(see, e.g., Cox & Lewis, 1966; Rudd & Brown, 1997). Formula 2.2 holds true for finite $t$ only under the condition that $N(t)$ is a stationary point process. Cox and Lewis (1966) show in detail how $E(N(t))$ is related to the inverse of the mean ISI. A necessary condition for the stationarity of the counting process is that it starts at an arbitrary time, and this implies that the sequence of ISIs cannot be stationary. However, for a renewal process, disregarding the time before the first spike (i.e., starting the sequence of ISIs with the first spike) solves the problem (nonrenewal models are outside the scope of this note). Nevertheless, as illustrated in Figure 1, equation 2.2 does not hold if the mean ISI is replaced by the sample average and the mean of the counting process by its sample value. The theoretical result given by equation 2.2 can be used if the observation period is sufficiently long. This usually is not the case when estimating firing rates, as the observation period contains rather few spikes. Gerstner and Kistler (2002) give 100 to 500 ms as the usual length of the observation period, and this obviously permits only a few spikes. To deal with this difficulty, the averaging over time appearing in equation 2.2 is replaced by averaging over different neurons or over repetitions of the experiment on the same neuron. In this case, the inference has to be based on realizations of the random variable $T$. Hence, we focus on the estimation and comparison of the firing frequency, $1/E(T)$, and the mean instantaneous frequency, $E(1/T)$. In theoretical inference, we can use the fact that the function $1/t$ is convex, and by using Jensen's inequality (e.g., Rao, 1965) we can prove that for any positive random variable $T$,

$$E(1/T) \ge 1/E(T), \qquad (2.3)$$
which permits us to conclude that the mean instantaneous frequency is always higher than, or at least equal to, the firing frequency. This result will be illustrated and quantified for several common stochastic models of membrane depolarization and also for several generic statistical descriptors often used for the characterization of experimental data.
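Inequality 2.3 is easy to check by Monte Carlo sampling. The sketch below uses gamma-distributed ISIs (the model treated in section 3, where both sides have closed forms, equations 3.2 and 3.4: $1/E(T) = \lambda/v$ and $E(1/T) = \lambda/(v-1)$); the parameter values are arbitrary illustrative choices.

```python
import random

random.seed(0)
v, lam = 3.0, 15.0                      # gamma shape and rate (illustrative values)
n = 200000

# random.gammavariate takes (shape, scale); the scale is 1/rate.
isis = [random.gammavariate(v, 1.0 / lam) for _ in range(n)]

rate = 1 / (sum(isis) / n)              # firing frequency 1/E(T): lambda/v = 5
inst = sum(1 / t for t in isis) / n     # mean instantaneous frequency: lambda/(v-1) = 7.5

print(rate, inst)                       # inst exceeds rate, as Jensen's inequality requires
```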
3 Generic Models of Spike Trains

3.1 Gamma Distribution of ISIs. The gamma distribution is often and successfully fitted to experimentally observed histograms of ISIs and is often taken as a theoretical model from which further conclusions are derived (e.g., Baker & Gerstein, 2001). The probability density function of $T$ is

$$g(t) = \frac{\lambda^v e^{-\lambda t} t^{v-1}}{\Gamma(v)}, \qquad (3.1)$$

where $\Gamma$ is the gamma function and $v > 0$ and $\lambda > 0$ are the parameters. (Below, probability density will be denoted by $g(\cdot)$.) The statistical moments are well known for this distribution, and thus we can directly write

$$1/E(T) = \frac{\lambda}{v}. \qquad (3.2)$$

For $X = 1/T$, we have

$$g(x) = \frac{1}{x\Gamma(v)}\, e^{-\lambda/x} (\lambda/x)^v, \qquad (3.3)$$

and the mean of distribution 3.3 can be calculated, which yields, for $v > 1$,

$$E(1/T) = \frac{\lambda}{v-1}. \qquad (3.4)$$
Compared with equation 3.2, we can see that the difference between the two rates decreases with increasing $v$. This is expected, since for increasing $v$ the interval distribution becomes more sharply peaked (i.e., approaches the deterministic case), and in the deterministic case the two measures become identical.

3.2 Poisson Process and Its Modification. The spikes are generated in accordance with a Poisson process if $v = 1$ in model 3.1. Then from equation 3.2 it follows that $1/E(T) = \lambda$, and equation 3.4 implies that $E(1/T) = \infty$. This striking difference is caused by the existence of very short ISIs and has been noted by Johnson (1996). Let us thus assume that the model is the dead-time Poisson process (modeling a refractory period), in which intervals between events are exponentially distributed but cannot be shorter than a constant $\delta$. Then the ISI density is $\lambda \exp(-\lambda(t - \delta))$ and $1/E(T) = \lambda/(\lambda\delta + 1)$. The mean instantaneous frequency is

$$E(1/T) = \int_0^\infty \frac{\lambda \exp(-\lambda z)}{z + \delta}\, dz, \qquad (3.5)$$
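The integral in equation 3.5 has no elementary closed form, but it is easy to evaluate numerically and compare with the firing rate $\lambda/(\lambda\delta + 1)$. The values of $\lambda$ and $\delta$ below are illustrative assumptions.

```python
import math

lam, delta = 20.0, 0.01   # rate 20 s^-1, 10 ms dead time (illustrative values)

# Firing rate: E(T) = delta + 1/lam, so 1/E(T) = lam/(lam*delta + 1).
rate = lam / (lam * delta + 1)

# Mean instantaneous frequency, eq. 3.5: integrate lam*exp(-lam*z)/(z + delta)
# over z >= 0 by a simple midpoint rule; the integrand is negligible past z = 2.
dz = 1e-5
inst = sum(lam * math.exp(-lam * (z + dz / 2)) / (z + dz / 2 + delta) * dz
           for z in (i * dz for i in range(int(2 / dz))))

print(rate, inst)   # the instantaneous measure is markedly larger, but finite
```

With these numbers the firing rate is about 16.7 s^-1 while the mean instantaneous frequency is near 30 s^-1, so the dead time makes $E(1/T)$ finite but still well above $1/E(T)$.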
[Figure 2 appears here: four panels, a to d; the vertical axes show frequency, and the horizontal axes show $1/\delta$ (panel a) and noise amplitude (panels b to d).]
Figure 2: Comparison of the input-output curves in dependence on the parameters of the models. (a) The firing rate (lower curve) and the mean instantaneous firing rate (upper curve) in dependence on the inverse of the length of the refractory period, $1/\delta$, for Poissonian firing with rate $\lambda = 20\,\mathrm{s}^{-1}$. The frequencies and $1/\delta$ are given in $\mathrm{s}^{-1}$. (b-d) The firing rates (solid lines) and the mean instantaneous firing rates (dashed lines) are plotted as a function of the amplitude of the noise for different levels of the input. (b) LIF neuronal model: from top to bottom, $\mu = 1.5$, 1, 0.5, and 0.1; the firing threshold $S = 1$, $\tau = 1$. (c) Feller neuronal model: the same levels of the input $\mu$ and the same parameters as for the LIF, with $V_I = -1$. (d) The Morris-Lecar neuronal model: from top to bottom, $I_{\mathrm{ext}} = 0.145$, 0.125, 0.105, and 0.085.
which is finite. In Figure 2a, we show the firing rate and the mean instantaneous firing rate in dependence on the inverse of the length of the refractory period, $\delta$. We can see that both definitions tend to give the same result when the refractory period is increased. This is expected, since the process becomes almost deterministic when the refractory period dominates $T$.

4 Simple Stochastic Models

4.1 Perfect Integrate-and-Fire (Gerstein-Mandelbrot Model). This simple model is closer to statistical descriptors like those in the previous section than to the models considered below, which seek to provide a realistic description of neurons. This means that even if the data fit the model perfectly, it can be used for their characterization, but few biophysical conclusions can be deduced from this fact. The probability density function of $T$ is known as the inverse gaussian distribution,

$$g(t) = \frac{S}{\sigma\sqrt{2\pi t^3}}\, \exp\left\{-\frac{(S - \mu t)^2}{2\sigma^2 t}\right\}, \qquad (4.1)$$

where $\mu$, $\sigma^2$, and $S$ are constants characterizing the neuron and its input (Tuckwell, 1988). Its statistical moments are well known, and thus we can directly write

$$1/E(T) = \frac{\mu}{S}. \qquad (4.2)$$

For $X = 1/T$, we have

$$g(x) = \frac{S}{\sigma\sqrt{2\pi x}}\, \exp\left\{-\frac{(Sx - \mu)^2}{2\sigma^2 x}\right\}, \qquad (4.3)$$

and the mean of distribution 4.3 can be calculated, which yields

$$E(1/T) = \frac{\mu}{S} + \frac{\sigma^2}{S^2}. \qquad (4.4)$$
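Equation 4.4 can be checked by sampling from the inverse gaussian distribution, whose mean is $S/\mu$ and whose shape parameter is $S^2/\sigma^2$. The sketch below uses the standard Michael-Schucany-Haas transformation sampler and illustrative parameter values.

```python
import math
import random

random.seed(3)
mu, sigma, S = 1.0, 0.5, 1.0               # drift, noise, threshold (illustrative)
m, lam = S / mu, S * S / (sigma * sigma)   # inverse gaussian mean and shape

def sample_ig(m, lam):
    """Michael-Schucany-Haas sampler for the inverse gaussian distribution."""
    y = random.gauss(0.0, 1.0) ** 2
    x = m + m * m * y / (2 * lam) - (m / (2 * lam)) * math.sqrt(4 * m * lam * y + m * m * y * y)
    return x if random.random() <= m / (m + x) else m * m / x

n = 100000
ts = [sample_ig(m, lam) for _ in range(n)]

rate = 1 / (sum(ts) / n)             # 1/E(T): mu/S = 1
inst = sum(1 / t for t in ts) / n    # E(1/T): mu/S + sigma^2/S^2 = 1.25

print(rate, inst)
```

The sampled gap between the two measures matches the analytic $\sigma^2/S^2$ term of equation 4.4.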
Comparing equations 4.2 and 4.4, we can see that the difference between the mean instantaneous frequency and the firing frequency increases with increasing $\sigma$, which characterizes the noise, and with decreasing $S$, which is the firing threshold of the model.

4.2 Leaky Integrate-and-Fire (Ornstein-Uhlenbeck Model). The Ornstein-Uhlenbeck stochastic process restricted by a threshold, called the leaky integrate-and-fire (LIF) model, is the most common compromise between tractability and realism in neuronal modeling (Tuckwell, 1988; Gerstner & Kistler, 2002). In this model, the behavior of the depolarization $V$ of the membrane is described by the stochastic differential equation

$$dV = \left(-\frac{V}{\tau} + \mu\right) dt + \sigma\, dW, \quad V(0) = 0, \qquad (4.5)$$
where $\tau > 0$ is the time constant of the neuron, $W$ is a standard Wiener process (gaussian noise with a unit delta-function correlation), $\mu$ and $\sigma^2$ are constants characterizing the input, and $t = 0$ is the time of the last spike generation. The time of the previous spike can be identified with time zero because the process of intervals generated by the first passages of process 4.5 across the threshold $S$ is a renewal process. The solution of the first-passage-time problem is not a simple task for model 4.5, and numerical and simulation techniques have been widely used
(Ricciardi & Sacerdote, 1979; Ricciardi & Sato, 1990; Musila, Suta, & Lansky, 1996). The Laplace transform of the first-passage-time probability density function is available (see Tuckwell, 1988, for historical references), and from it, the mean can be derived. The firing rate, $f = 1/E(T)$, can be approximated by the following linear function,

$$f = \frac{1}{\pi\tau S}\left(\sigma\sqrt{\pi\tau} + 2\tau\mu - S\right) \qquad (4.6)$$

(Lansky & Sacerdote, 2001). The ranges in which this approximation is valid are restricted to the values of the parameters for which the quantities $\mu\sqrt{\tau}/\sigma$ and $(\mu\sqrt{\tau} - S/\sqrt{\tau})/\sigma$ are small. This means that for sufficiently large amplitudes of noise, the response function is linear. Linearization 4.6 is quite robust and valid over wide ranges of the parameters. Using the fact that the Laplace transform $\tilde{g}(s)$ of the first-passage-time probability density function is available, the mean of $1/T$ can be calculated from the primitive function of this Laplace transform,

$$E(1/T) = -\left[\int^{s} \tilde{g}(w)\, dw\right]_{s=0}, \qquad (4.7)$$

that is, $E(1/T) = \int_0^\infty \tilde{g}(s)\, ds$.
The input-output frequency curves using the firing rate and the mean instantaneous rate are compared in Figure 2b. To improve the comparison with the other models, model 4.5 was transformed into dimensionless variables; that is, time is in units of the time constant $\tau$, and voltage is in units of the firing threshold $S$. We can see that only for a large noise amplitude, almost half of the threshold value, does the difference become substantial.

4.3 Diffusion Model with Inhibitory Reversal Potential (Feller Model). To include some other features of real neurons in the LIF model, reversal potentials can be introduced into model 4.5. In one of the variants of this model, introduced by Lansky and Lanska (1987), the behavior of the depolarization $V$ of the membrane is described by the stochastic differential equation

$$dV = \left(-\frac{V}{\tau} + \mu\right) dt + \sigma\sqrt{V - V_I}\, dW, \quad V(0) = 0, \qquad (4.8)$$

where the parameters have the same interpretation as in equation 4.5 and $V_I < 0$ is the inhibitory reversal potential (Lansky, Sacerdote, & Tomassetti, 1995). As for the LIF model, the Laplace transform of the ISI density can be written in closed form,
$$\tilde{g}(s) = \frac{1}{\Phi\!\left(s\tau,\ \dfrac{2(\mu + 1)\sqrt{-V_I}}{\sigma};\ \dfrac{\sqrt{2\tau}\,(S - V_I)}{\sigma}\right)}, \qquad (4.9)$$
where $\Phi(a, b; x)$ is the Kummer function (Abramowitz & Stegun, 1965). Equation 4.9 was used for numerical evaluation of the mean instantaneous frequency via equation 4.7, and the mean ISI was calculated by using the Siegert formula (Siegert, 1951). To see the difference between the LIF model and model 4.8, we attempted to use the same set of parameters in both models. A problem arises in fixing the noise amplitude, as in model 4.8 it depends on the actual level of the membrane depolarization $V$. The amplitude of the noise was made equal at the resting level, which is one of the methods applied in Lansky et al. (1995). The input-output frequency curves for the firing rate and the mean instantaneous rate are compared in Figure 2c. Again, as with the LIF model, the dimensionless variant of equation 4.8 was used. We can see that the difference between the two definitions of firing frequency is less remarkable here than it is for the LIF model.

5 Biophysical Model with Noise

To illustrate the achieved results on a realistic model of a neuron, we investigated the Morris-Lecar model, which is a simplification of the original Hodgkin-Huxley system. Two kinds of channels, voltage-gated Ca2+ channels and voltage-gated delayed-rectifier K+ channels, are present in this model of an excitable cell membrane (Morris & Lecar, 1981; Rinzel & Ermentrout, 1989). When a noisy external input is applied, the system of equations for the normalized membrane depolarization $V(t)$ and for $X(t)$, representing the fraction of open K+ channels, can be written in the form

$$\frac{dV}{dt} = g_{Ca}\, m(V)(V_{Ca} - V) + g_K X (V_K - V) + g_L (V_L - V) + I_{ext} + \eta(t), \qquad (5.1)$$

and

$$\frac{dX}{dt} = k_X(V)\,(X(V) - X), \qquad (5.2)$$

where the time is also normalized. The calcium current plays the role of the sodium current in the original Hodgkin-Huxley system. However, the calcium channels respond to voltage so rapidly that instantaneous activation is assumed for them, with the associated ionic conductance $g_{Ca}\, m(V)$. $I_{ext}$ is an applied external normalized current, and $\eta(t)$ is a white noise perturbation such that $\langle \eta(s)\eta(t) \rangle = \sigma\, \delta(s - t)$, where $\sigma$ is a constant. The functions $m(V)$, $X(V)$, and $k_X(V)$ are of the form

$$m(V) = \frac{1}{2}\left(1 + \tanh\left(\frac{V - V_1}{V_2}\right)\right),$$
$$X(V) = \frac{1}{2}\left(1 + \tanh\left(\frac{V - V_3}{V_4}\right)\right),$$
$$k_X(V) = \varphi \cosh\left(\frac{V - V_3}{2V_4}\right). \qquad (5.3)$$
In these equations, $k_X(V(t))$ is a relaxation constant for a given $V(t)$; further, $g_{Ca}$, $g_K$, and $g_L$ are constants representing normalized conductances, and $V_{Ca}$, $V_K$, and $V_L$ are normalized resting potentials for the two different kinds of ions and for the leakage current. Finally, $V_1$, $V_2$, $V_3$, $V_4$, and $\varphi$ are constants. The values of all constants were taken from Rinzel and Ermentrout (1989). When constant and gaussian white noise inputs are applied, the random variable $T$ is calculated. In Figure 2d, the inverse of the mean, $1/E(T)$, and the mean $E(1/T)$ are shown in dependence on the amplitude of the noise for different values of the constant input. The same effect as for the simple neuronal models is observed.

6 Discussion

The calculation of the number of spikes per long time window is rather unrealistic from the point of view of the neural system. Thus, it is assumed that the time averaging in real systems is replaced by population averaging, giving the same result (counting the spikes emitted by a neuron during a period of 1 minute is the same as counting the spikes of 600 neurons emitting spikes within 100 milliseconds). This possibility of replacing time averaging by population averaging would be especially important for evaluating firing rates in transient situations, as in evoked activity. The terminology used when speaking about firing rates is not always clear. Sometimes a constant firing rate is called the mean rate, and on other occasions, the function $f(t)$ given by equation 2.1 and averaged over some interval of time is called the mean firing rate. The instantaneous rate is also understood in different ways. In some cases, it is the firing probability in an infinitesimally short interval (Johnson, 1996; Fourcaud & Brunel, 2002). In other cases, the reciprocal of the ISIs or their smoothed version is used; Pauluis and Baker (2000) and Johnson (1996) compare these two approaches.
The difference is well illustrated in the extreme case (deterministic firing, which implies constant ISIs), in which $\sigma$ in the LIF model given by equation 4.5 tends to zero. From one point of view, the firing rate is a sequence of delta functions of time with peaks located at the moments of the spikes, and the instantaneous firing rate is either zero or 1/ISI. From another point of view, the one adopted here, the rate and the instantaneous rate coincide and are equal to the reciprocal of the ISI. While the first approach may be seen as more informative, we note that we presume a priori that the rate is constant over the entire period of observation. Van Rullen and Thorpe (2001) compared the counting method of calculating the firing rate with the ISI method and declared the latter potentially more accurate than the former. However, under the conditions valid in their article, the authors preferred the latency to the first spike as the neuronal code. It is actually again closely related to the ISI distribution but not based on the counting process.

7 Conclusion

We compared two methods for evaluating the firing rate in this note. Although one of them gives a systematically larger value, the difference between them is not so enormous, at least under the conditions investigated here. The only exception is Poissonian firing, which only rarely can be considered a realistic description of neuronal firing. For the other models, only strong noise makes the methods substantially different. One advantage of the mean instantaneous firing rate is that statistical properties of the random variable $1/T$ can be derived, and thus confidence intervals can be found, and testing procedures for comparison of the rates under different experimental situations can be applied. This is not so easy for the firing rate calculated as $1/E(T)$. The only way to overcome this defect of the reciprocal of the mean ISI is to use known properties of spike counts (Treves, Panzeri, Rolls, Booth, & Wakeman, 1999; Settanni & Treves, 2000).

Acknowledgments

We thank an anonymous referee for many helpful comments. This work was supported by the INDAM Research Project and by a grant from the Grant Agency of the Czech Republic, 309/02/0168.

References

Abramowitz, M., & Stegun, I. (Eds.). (1965). Handbook of mathematical functions. New York: Dover.
Adrian, E. D. (1928). The basis of sensation: The action of the sense organs. New York: Norton.
Awiszus, F. (1988). Continuous function determined by spike trains of a neuron subject to stimulation. Biol. Cybern., 58, 321–327.
Baker, S. N., & Gerstein, G. L. (2001). Determination of response latency and its application to normalization of cross-correlation measures. Neural Comput., 13, 1351–1378.
Barbi, M., Carelli, V., Frediani, C., & Petracchi, D. (1975). The self-inhibited leaky integrator: Transfer functions and steady state relations. Biol. Cybern., 20, 51–59.
Burkitt, A. N., & Clark, G. M. (2000). Calculation of interspike intervals for integrate-and-fire neurons with Poisson distribution of synaptic input. Neural Comput., 12, 1789–1820.
Cox, D. R., & Lewis, P. A. W. (1966). Statistical analyses of series of events. London: Methuen.
Ermentrout, E. (1998). Linearization of F-I curves by adaptation. Neural Comput., 10, 1721–1729.
Fenton, A. A., & Muller, R. U. (1998). Place cell discharge is extremely variable during individual passes of the rat through the firing field. Proc. Natl. Acad. Sci. USA, 95, 3182–3187.
Fourcaud, N., & Brunel, N. (2002). Dynamics of firing probability of noisy integrate-and-fire neurons. Neural Comput., 14, 2057–2110.
Gerstner, W., & van Hemmen, J. L. (1992). Universality in neural networks: The importance of the "mean firing rate." Biol. Cybern., 67, 195–205.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Johnson, D. H. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–300.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Lansky, P., & Lanska, V. (1987). Diffusion approximation of the neuronal model with synaptic reversal potentials. Biol. Cybern., 56, 19–26.
Lansky, P., & Sacerdote, L. (2001). The Ornstein-Uhlenbeck neuronal model with the signal-dependent noise. Physics Letters A, 285, 132–140.
Lansky, P., Sacerdote, L., & Tomassetti, F. (1995). On the comparison of Feller and Ornstein-Uhlenbeck models for neural activity. Biol. Cybern., 76, 457–465.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Musila, M., Suta, D., & Lansky, P. (1996). Computation of first passage time moments for stochastic diffusion processes modelling the nerve membrane depolarization. Comp. Meth. Prog. Biomed., 49, 19–27.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to joint-event analysis. Neural Comput., 12, 647–669.
Rao, C. R. (1965). Linear statistical inference and its applications. New York: Wiley.
Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process as a model of neuronal activity. Biol. Cybern., 35, 1–9.
Ricciardi, L. M., & Sato, S. (1990). Diffusion processes and first-passage-time problems. In L. M. Ricciardi (Ed.), Lectures in applied mathematics and informatics. Manchester: Manchester University Press.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods of neuronal modeling: From synapses to networks. Cambridge, MA: MIT Press.
Rudd, M. E., & Brown, L. G. (1997). Noise adaptation in integrate-and-fire neurons. Neural Comput., 9, 1047–1069.
Settanni, G., & Treves, A. (2000). Analytical model for the effects of learning on spike count distributions. Neural Comput., 12, 1773–1788.
Siegert, A. J. F. (1951). On the first passage time probability problem. Phys. Rev., 81, 617–623.
Treves, A., Panzeri, S., Rolls, E. T., Booth, M., & Wakeman, E. A. (1999). Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli. Neural Comput., 11, 601–632.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Comput., 13, 1255–1284.

Received February 4, 2003; accepted July 10, 2003.
NOTE
Communicated by Rajesh P. Rao
Kalman Filter Control Embedded into the Reinforcement Learning Framework

István Szita
[email protected]
András Lőrincz
[email protected]
Department of Information Systems, Eötvös Loránd University, Pázmány Péter sétány 1/C, H-1117 Budapest, Hungary
There is growing interest in using Kalman filter models in brain modeling. The question arises whether Kalman filter models can be used on-line not only for estimation but also for control. The usual method of optimal control of a Kalman filter makes use of an off-line backward recursion, which is not satisfactory for this purpose. Here, it is shown that a slight modification of the linear-quadratic-gaussian Kalman filter model allows the on-line estimation of optimal control by using reinforcement learning and overcomes this difficulty. Moreover, the emerging learning rule for value estimation exhibits a Hebbian form, which is weighted by the error of the value estimation.
1 Motivation

Kalman filters and their various extensions are well-studied and widely applied tools in both state estimation and control. Recently, there has been increasing interest in Kalman filters (KF) or Kalman filter-like structures as models for neurobiological substrates. It has been suggested that Kalman filtering may occur in sensory processing (Rao & Ballard, 1997, 1999) or in the acquisition of behavioral responses (Kakade & Dayan, 2000), may be the underlying computation of the hippocampus (Bousquet, Balakrishnan, & Honavar, 1998), and may be the underlying principle in control architectures (Todorov & Jordan, 2002a, 2002b). Detailed architectural similarities between the Kalman filter and the entorhinal-hippocampal loop, as well as between Kalman filters and the neocortical hierarchy, have been described recently (Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002). The interplay between the dynamics of Kalman filter-like architectures and the learning of the parameters of neuronal networks is promising for explaining known and puzzling phenomena such as priming, repetition suppression, and categorization (Lőrincz, Szirtes, Takács, Biederman, & Vogels, 2002; Kéri et al., 2002).

Neural Computation 16, 491–499 (2004)
© 2004 Massachusetts Institute of Technology
I. Szita and A. Lőrincz
It is well known that Kalman filters provide an on-line estimate of the state of the system. On the other hand, optimal control typically cannot be computed on-line, because it is given by a backward recursion, the Riccati equations. (For on-line parameter estimation of Kalman filters, but without control aspects, see Rao, 1999.) The aim of this article is to demonstrate that Kalman filters can be embedded into a goal-oriented framework. We show that a slight modification of the linear-quadratic-gaussian (LQG) Kalman filter model is sufficient to integrate the LQG model into the reinforcement learning (RL) framework.

2 The Kalman Filter and the LQG Model

Consider a linear dynamical system with state $x_t \in \mathbb{R}^n$, control $u_t \in \mathbb{R}^m$, observation $y_t \in \mathbb{R}^k$, and noises $w_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^k$ (which are assumed to be uncorrelated gaussians with covariance matrices $\Omega^w$ and $\Omega^e$, respectively) in discrete time $t$:

$$x_{t+1} = F x_t + G u_t + w_t \qquad (2.1)$$
$$y_t = H x_t + e_t. \qquad (2.2)$$
Assume that the initial state has mean $\hat{x}_1$ and covariance $\Sigma_1$, and that executing control step $u_t$ in $x_t$ costs

$$c(x_t, u_t) := x_t^T Q x_t + u_t^T R u_t. \qquad (2.3)$$
Further, assume that after the $N$th step, the controller halts and receives a final cost of $x_N^T Q_N x_N$. The task is to find a control sequence with minimum total cost.¹ This problem has the well-known solution

$$\hat{x}_{t+1} = F \hat{x}_t + G u_t + K_t (y_t - H \hat{x}_t) \qquad (2.4)$$
$$K_t = F \Sigma_t H^T (H \Sigma_t H^T + \Omega^e)^{-1} \qquad (2.5)$$
$$\Sigma_{t+1} = \Omega^w + F \Sigma_t F^T - K_t H \Sigma_t F^T \qquad (2.6) \quad \text{(state estimation)}$$

and

$$u_t = -L_t \hat{x}_t \qquad (2.7)$$
$$L_t = (G^T S_{t+1} G + R)^{-1} G^T S_{t+1} F \qquad (2.8)$$
$$S_t = Q_t + F^T S_{t+1} F - F^T S_{t+1} G L_t. \qquad (2.9) \quad \text{(optimal control)}$$
Unfortunately, the optimal control equations are not on-line, because they can be solved only by stepping backward from the final (i.e., the $N$th) step.

¹ In a more general setting, $c$ is a general quadratic function of $x_t$, $u_t$, and an optional $x_t^{\mathrm{track}}$, where $x_t^{\mathrm{track}}$ is a trajectory of a linear system to be tracked.
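Equations 2.4 to 2.9 can be made concrete with a short sketch for a scalar system (all matrices $1 \times 1$). The numerical values below are illustrative assumptions; note that the gains $L_t$ require the complete backward Riccati pass before the forward, on-line filtering pass can run, which is exactly the difficulty discussed above.

```python
import random

random.seed(1)
F, G, H = 0.9, 1.0, 1.0            # scalar dynamics and observation
Q, R, QN = 1.0, 0.1, 1.0           # cost weights, cf. eq. 2.3
Ow, Oe = 0.05, 0.05                # noise variances Omega^w, Omega^e
N = 20

# Off-line backward pass: Riccati recursion for the control gains, eqs. 2.8-2.9.
S = [0.0] * (N + 1)
L = [0.0] * N
S[N] = QN
for t in range(N - 1, -1, -1):
    L[t] = (G * S[t + 1] * F) / (G * S[t + 1] * G + R)      # eq. 2.8
    S[t] = Q + F * S[t + 1] * F - F * S[t + 1] * G * L[t]   # eq. 2.9

# On-line forward pass: Kalman filter state estimation, eqs. 2.4-2.6,
# with the control u_t = -L_t * xhat_t of eq. 2.7.
x, xhat, Sigma = 2.0, 2.0, 1.0
for t in range(N):
    u = -L[t] * xhat                                   # eq. 2.7
    y = H * x + random.gauss(0.0, Oe ** 0.5)           # observation, eq. 2.2
    K = F * Sigma * H / (H * Sigma * H + Oe)           # eq. 2.5
    xhat = F * xhat + G * u + K * (y - H * xhat)       # eq. 2.4
    Sigma = Ow + F * Sigma * F - K * H * Sigma * F     # eq. 2.6
    x = F * x + G * u + random.gauss(0.0, Ow ** 0.5)   # dynamics, eq. 2.1

print(x, xhat)   # the controlled state is driven toward zero
```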
3 Integrating Kalman Filtering into the Reinforcement Learning Framework

First, we slightly modify the problem: the run time of the controller will not be a fixed number $N$. Instead, after each time step, the process will be stopped with some fixed probability $p$ (and then the controller incurs the final cost $c_f(x_f) := x_f^T Q_f x_f$). This modification is commonly used in the RL literature; it makes the problem more amenable to mathematical treatment.

3.1 The Cost-to-Go Function. Let $V_t^*(x)$ be the optimal cost-to-go function at time step $t$,

$$V_t^*(x) := \inf_{u_t, u_{t+1}, \ldots} E[c(x_t, u_t) + c(x_{t+1}, u_{t+1}) + \cdots + c_f(x_f) \mid x_t = x]. \qquad (3.1)$$

Considering that the controller is stopped with probability $p$, equation 3.1 assumes the following form,

$$V_t^*(x) = p \cdot c_f(x) + (1 - p) \inf_u \left( c(x, u) + E_w[V_{t+1}^*(Fx + Gu + w)] \right) \qquad (3.2)$$

for any state $x$. It is easy to show that the optimal cost-to-go function is time independent and is a quadratic function of $x$; that is, it assumes the form

$$V^*(x) = x^T \Pi^* x. \qquad (3.3)$$
Our task is to estimate the optimal value function (the parameter matrix $\Pi^*$) on-line. This can be done by the method of temporal differences. We start with an arbitrary initial cost-to-go function $V_0(x) = x^T \Pi_0 x$. After this, (1) control actions are selected according to the current value function estimate, (2) the value function is updated according to the experience, and (3) these two steps are iterated. The $t$th estimate of $V^*$ is $V_t(x) = x^T \Pi_t x$. The greedy control action according to this is given by

$$u_t = \arg\min_u \left( E[c(x_t, u) + V_t(F x_t + G u + w)] \right)$$
$$= \arg\min_u \left( u^T R u + (F \hat{x}_t + G u)^T \Pi_t (F \hat{x}_t + G u) \right)$$
$$= -(R + G^T \Pi_t G)^{-1} (G^T \Pi_t F)\, \hat{x}_t. \qquad (3.4)$$
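In the scalar case, the closed form in equation 3.4 is easy to verify directly: the greedy action minimizes the one-step quadratic cost. The numbers below are illustrative assumptions.

```python
F, G, R, pi, xhat = 0.9, 1.0, 0.5, 2.0, 1.3   # illustrative scalar values

def q(u):
    """One-step cost u^T R u + (F xhat + G u)^T Pi (F xhat + G u), scalar case."""
    return u * u * R + pi * (F * xhat + G * u) ** 2

u_star = -(G * pi * F / (R + G * G * pi)) * xhat   # eq. 3.4, scalar form

# The derivative of q vanishes at u_star, and nearby perturbations cost more.
print(q(u_star), q(u_star + 0.1), q(u_star - 0.1))
```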
For simplicity, the cost-to-go function will be updated by using the one-step temporal differencing (TD) method, although it could be substituted with more sophisticated methods like multistep TD or eligibility traces. The TD error is

$$\delta_t = \begin{cases} c_f(\hat{x}_t) - V_t(\hat{x}_{t-1}), & \text{if } t = t_{\mathrm{STOP}}, \\ \left( c(\hat{x}_{t-1}, u_{t-1}) + V_t(\hat{x}_t) \right) - V_t(\hat{x}_{t-1}), & \text{otherwise}, \end{cases} \qquad (3.5)$$

and the update rule for the parameter matrix $\Pi_t$ is

$$\Pi_{t+1} = \Pi_t + \alpha_t \cdot \delta_t \cdot \nabla_{\Pi_t} V_t(\hat{x}_t) = \Pi_t + \alpha_t \cdot \delta_t \cdot \hat{x}_t \hat{x}_t^T, \qquad (3.6)$$
where ®t is the learning rate. Note that value-estimation error weighted Hebbian learning rule has emerged. 3.2 Sarsa. The cost-to-go function is used to select control actions, so the estimation of the action-value function Q¤t .x; u/ is more appropriate here. The action-value function is dened as Q¤t .x; u/ :D
inf
utC1 ;utC2 ;:::
E[c.xt ; ut / C c.xtC1 ; utC1 / C ¢ ¢ ¢ C c f .x f / j xt D x; ut D u];
(3.7)
and analogously to $V_t^*$, it can be shown that it is time independent and assumes the form

$$Q^*(x, u) = \begin{pmatrix} x^T & u^T \end{pmatrix} \Theta^* \begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} x^T & u^T \end{pmatrix} \begin{pmatrix} \Theta_{11}^* & \Theta_{12}^* \\ \Theta_{21}^* & \Theta_{22}^* \end{pmatrix} \begin{pmatrix} x \\ u \end{pmatrix}. \qquad (3.8)$$

If the $t$th estimate of $Q^*$ is $Q_t(x, u) = [x^T, u^T]\,\Theta_t\,[x^T, u^T]^T$, then the greedy control action is given by

$$u_t = \arg\min_u E\big(Q_t(x_t, u)\big) = -\Theta_{22}^{-1}\, \Theta_{21}\, \hat{x}_t, \qquad (3.9)$$
where the subscript $t$ of $\Theta$ has been omitted to improve readability. Note that only the value-function estimate, but not the model, is needed to compute $u_t$. The estimation error and the weight update are quite similar to the state-value case:

$$\delta_t = \begin{cases} c_f(\hat{x}_t) - Q_t(\hat{x}_{t-1}, u_{t-1}), & \text{if } t = t_{\mathrm{STOP}}, \\ \big(c(\hat{x}_{t-1}, u_{t-1}) + Q_t(\hat{x}_t, u_t)\big) - Q_t(\hat{x}_{t-1}, u_{t-1}), & \text{otherwise}, \end{cases} \qquad (3.10)$$

$$\Theta_{t+1} = \Theta_t + \alpha_t \cdot \delta_t \cdot \nabla_{\Theta_t} Q_t(\hat{x}_t, u_t) = \Theta_t + \alpha_t \cdot \delta_t \cdot \begin{pmatrix} \hat{x}_t \\ u_t \end{pmatrix} \begin{pmatrix} \hat{x}_t \\ u_t \end{pmatrix}^T. \qquad (3.11)$$
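Both updates amount to a Hebbian outer-product rule scaled by the scalar TD error. The sketch below implements the non-terminal branches of equations 3.5 and 3.10 with made-up states, actions, costs, and learning rate; it follows the convention above of evaluating the gradient at the current estimate $\hat{x}_t$ (respectively $(\hat{x}_t, u_t)$):

```python
# TD(0) update of Pi (eqs. 3.5-3.6) and Sarsa update of Theta (eqs. 3.10-3.11).
# All numerical values are illustrative.
import numpy as np

def td_update(Pi, x_prev, x_curr, cost_prev, alpha):
    V = lambda x: float(x @ Pi @ x)                  # V_t(x) = x^T Pi x
    delta = (cost_prev + V(x_curr)) - V(x_prev)      # TD error, eq. 3.5
    # eq. 3.6 evaluates the gradient x x^T at the current estimate x_curr
    return Pi + alpha * delta * np.outer(x_curr, x_curr)

def sarsa_update(Theta, x_prev, u_prev, x_curr, u_curr, cost_prev, alpha):
    Q = lambda x, u: float(np.concatenate([x, u]) @ Theta @ np.concatenate([x, u]))
    delta = (cost_prev + Q(x_curr, u_curr)) - Q(x_prev, u_prev)   # eq. 3.10
    z = np.concatenate([x_curr, u_curr])                          # joint feature [x; u]
    return Theta + alpha * delta * np.outer(z, z)                 # eq. 3.11

Pi = td_update(np.eye(2), np.array([1.0, 0.0]), np.array([0.9, 0.1]),
               cost_prev=0.5, alpha=0.05)
Theta = sarsa_update(np.eye(3), np.array([1.0, 0.0]), np.array([0.2]),
                     np.array([0.9, 0.1]), np.array([0.1]),
                     cost_prev=0.5, alpha=0.05)
```

Because the update adds a symmetric rank-one term, a symmetric initial matrix stays symmetric, consistent with $\Pi$ and $\Theta$ parameterizing quadratic forms.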
3.3 Convergence. There are numerous results showing that simple RL algorithms with linear function approximation can diverge (see, e.g., Baird, 1995). There are also positive results for the constant-policy case (i.e., for policy evaluation) and for iterative policy improvement (Tsitsiklis & Van Roy, 1996). For our case, a convergence proof can be provided (the complete proof can be found in Szita & Lőrincz, 2003).
1. One can show that an appropriate form of the separation principle holds for value estimation. That is, using the state estimates $\hat{x}_t$ for computing the control is equivalent to using the exact states.

2. One can apply Gordon's (2001) results, which state that under appropriate conditions, TD and Sarsa control with linear function approximation cannot diverge.

3. In our problem, where the system is linear and the costs are quadratic, using Gordon's (2001) technique, one can prove the stronger result:

Theorem. Assume that the system $(F, H)$ is observable; $\Pi_0 \geq \Pi^*$ (or $\Theta_0 \geq \Theta^*$ in the case of Sarsa); there exists an $L$ such that $\|F + GL\| \leq 1/\sqrt{1 - p}$; there exists an $M$ such that $\|\hat{x}_t\| \leq M$ for all $t$; and the constants $\alpha_t$ satisfy $0 < \alpha_t < 1/M^4$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$. Then the combined TD-KF methods converge to the optimal policy with probability 1, for both state-value and action-value estimation.

4 Discussion

4.1 Possible Extensions. We have demonstrated that Kalman filtering can be integrated into the framework of reinforcement learning. There are numerous possibilities for extending this scheme on both sides; both KF and RL can be extended. We list some of these possibilities:

• Advanced RL algorithms. The one-step TD method can be replaced by more efficient algorithms such as TD(λ) (eligibility traces), Sarsa(λ), and Q-learning.

• Parameter estimation. In its present form, the algorithm needs a model of the system (i.e., the parameters F, G, and H). However, with Q-learning, reinforcement learning may not require any model. In turn, the algorithm can be augmented by standard KF parameter-estimation techniques such as expectation maximization (see, e.g., Rao, 1999) and by information-maximization methods (Lőrincz, Szatmáry, & Szirtes, 2002). As a result, an on-line, model-free control algorithm for the linear quadratic regulation problem can be obtained.

• Extended Kalman filtering. Kalman filtering makes the assumption that both the system and the observations are linear.
This restriction can be overcome by using extended Kalman filters (EKF), adding further potential to our approach.

• Kalman smoother. To obtain more accurate and less noisy state estimates, we could use Kalman smoothing instead of the filtering equation. One-step smoothing does not impose much additional computational difficulty, because one-step lookahead is needed anyway for the temporal-differencing method.
• More general quadratic and nonquadratic cost functions. One is not restricted to the simplest quadratic loss functions. The KF equations are independent of the costs, and RL can handle arbitrary costs. Moreover, in many cases, costs can be rewritten in the form of a linear function approximation (see, e.g., Bradtke, 1993), and convergence is then warranted.

4.2 Related Work in Reinforcement Learning. Reinforcement learning has been applied to the linear quadratic regulation problem (Landelius & Knutsson, 1996; Bradtke, 1993; ten Hagen & Kröse, 1998). The main differences are that these works assume fully observed systems and that either Q-learning or recursive least-squares methods were used as the RL component. It is important that our approach can be interpreted as a partially observed Markov decision process (POMDP) with continuous state and action spaces. In general, learning in POMDPs is a very difficult task (see Murphy, 2000, for a review). We consider the LQG approach attractive because it is a POMDP and yet can be handled efficiently owing to its particular properties: both the transitions and the observations are linear, and the uncertainties are gaussian.

4.3 Neurobiological Connections. Modeling neural conditioning by TD learning is not new: a series of papers have been published on this topic (e.g., Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). In these works, it is typically assumed that states can be fully observed. Recently, Daw, Courville, and Touretzky (in press) proposed partially observable semi-Markov processes as an underlying model to deal with nonobservability. The motivation for our work comes from neurobiology. Some novel theoretical works in neurobiology (Rao & Ballard, 1997; Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002; Todorov & Jordan, 2002a) claim that both sensory processing and control may be based on Kalman filter-like structures.
Notably, some unexpected predictions (Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002) have been reinforced recently:

• Properties of neurons corresponding to the internal representation of the Kalman filter (layers V and VI of the entorhinal cortex, EC) versus properties of neurons corresponding to the reconstruction error and the filtered input (EC layers II and III), respectively

• Adaptable long-delay properties of neurons with the putative role of eliminating temporal convolutions arising in Kalman filter-like structures with not-yet-tuned parameters

Both have been confirmed by experiments reported in Egorov, Hamam, Fransén, Hasselmo, and Alonso (2002) and in Henze, Wittner, and Buzsáki (2002), respectively.
Clearly, the brain is a goal-oriented system, and Kalman filters need to be embedded into a goal-oriented framework. Moreover, this framework should exhibit Hebbian learning properties. Further, although batch learning does have some plausibility in neurobiological models (consider, for example, hippocampal replays of time sequences; Skaggs & McNaughton, 1996; Nadasdy, Hirase, Czurko, Csicsvari, & Buzsáki, 1999), the extent of such batch learning should be limited, given that the environment is subject to change. In turn, our work reinforces the efforts dealing with Kalman filter descriptions of neocortical processing. From the point of view of parameter estimation, the Kalman filter seems to be an ideal neurobiological candidate (Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002). On-line estimation of the Kalman gain, however, remains a problem (but see Póczos & Lőrincz, 2003). It has been noted that smoothing should improve the efficiency of the algorithm. Although the importance of smoothing, and of switching between smooth solutions, is striking in the neurobiological context, it remains an open and intriguing issue whether and how the neurobiological framework may allow the integration of smoothing.

5 Conclusions

Recent research in theoretical neurobiology indicates that Kalman filters have intriguing potential in brain modeling. However, the classical method for Kalman filter control requires off-line computations, which are unlikely to take place in the brain. Our aim here was to embed Kalman filters into the reinforcement learning framework. This was achieved by applying the method of temporal differences. Theoretical results from reinforcement learning were used to establish the asymptotic optimality of the algorithm.
Although our algorithm is only asymptotically optimal and may be slower than classical approaches such as solving the Riccati recursions (commonly used in engineering applications), it has several advantages: it works on-line, no model is needed for computing the control law, and the learning is Hebbian, weighted by the error of value estimation. These properties make it an attractive approach for brain modeling and neurobiological applications. Furthermore, the algorithm described here admits several straightforward generalizations, including extended Kalman filters, parameter estimation, nonquadratic cost functions, and eligibility traces, which can extend its applicability and improve its performance.

Acknowledgments

We are grateful to the anonymous referees who called our attention to recent work on the topic. Their suggestions helped us to improve this note. This work was supported by the Hungarian National Science Foundation (Grant No. T-32487).
References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (pp. 30–37). San Francisco: Morgan Kaufmann.

Bousquet, O., Balakrishnan, K., & Honavar, V. (1998). Is the hippocampus a Kalman filter? In Proceedings of the Pacific Symposium on Biocomputing (pp. 655–666). Singapore: World Scientific.

Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 295–302). San Mateo, CA: Morgan Kaufmann.

Daw, N., Courville, A., & Touretzky, D. S. (in press). Timing and partial observability in the dopamine system. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Egorov, A., Hamam, B., Fransén, E., Hasselmo, M., & Alonso, A. (2002). Graded persistent activity in entorhinal cortex neurons. Nature, 420, 173–178.

Gordon, G. J. (2001). Reinforcement learning with function approximation converges to a region. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 1040–1046). Cambridge, MA: MIT Press.

Henze, D., Wittner, L., & Buzsáki, G. (2002). Single granule cells reliably discharge targets in the hippocampal CA3 network in vivo. Nature Neuroscience, 5, 790–795.

Kakade, S., & Dayan, P. (2000). Acquisition in autoshaping. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing, 12 (pp. 24–30). Cambridge, MA: MIT Press.

Kéri, S., Benedek, G., Janka, Z., Aszalós, P., Szatmáry, B., Szirtes, G., & Lőrincz, A. (2002). Categories, prototypes and memory systems in Alzheimer's disease. Trends in Cognitive Science, 6, 132–136.

Landelius, T., & Knutsson, H. (1996). Greedy adaptive critics for LQR problems: Convergence proofs (Tech. Rep. No. LiTH-ISY-R-1896). Linköping, Sweden: Computer Vision Laboratory.

Lőrincz, A., & Buzsáki, G. (2000). The parahippocampal region: Implications for neurological and psychiatric diseases. In H. Scharfman, M. Witter, & R. Schwarz (Eds.), Annals of the New York Academy of Sciences (Vol. 911, pp. 83–111). New York: New York Academy of Sciences.

Lőrincz, A., Szatmáry, B., & Szirtes, G. (2002). Mystery of structure and function of sensory processing areas of the neocortex: A resolution. J. Comp. Neurosci., 13, 187–205.

Lőrincz, A., Szirtes, G., Takács, B., Biederman, I., & Vogels, R. (2002). Relating priming and repetition suppression. Int. J. Neural Systems, 12, 187–202.

Montague, P., Dayan, P., & Sejnowski, T. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947.
Murphy, K. P. (2000). A survey of POMDP solution techniques. Available on-line at: http://www.ai.mit.edu/~murphyk/Papers/pomdp.ps.gz.

Nadasdy, Z., Hirase, H., Czurko, A., Csicsvari, J., & Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience, 19, 9497–9507.

Póczos, B., & Lőrincz, A. (2003). Kalman-filtering using local interactions (Tech. Rep. No. NIPG-ELU-28-02-2003). Budapest: Faculty of Informatics, Eötvös Loránd University. Available on-line at: http://arxiv.org/abs/cs.AI/0302039.

Rao, R. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39, 1963–1989.

Rao, R., & Ballard, D. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9, 721–763.

Rao, R., & Ballard, D. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87.

Schultz, W., Dayan, P., & Montague, P. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.

Skaggs, W., & McNaughton, B. (1996). Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271, 1870–1873.

Szita, I., & Lőrincz, A. (2003). Reinforcement learning with linear function approximation and LQ control converges (Tech. Rep. No. NIPG-ELU-22-06-2003). Budapest: Faculty of Informatics, Eötvös Loránd University. Available on-line at: http://arxiv.org/abs/cs.LG/0306120.

ten Hagen, S., & Kröse, B. (1998). Linear quadratic regulation using reinforcement learning. In F. Verdenius & W. van den Broek (Eds.), Proceedings of the 8th Belgian-Dutch Conf. on Machine Learning (pp. 39–46). Wageningen: BENELEARN-98.

Todorov, E., & Jordan, M. (2002a). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235.

Todorov, E., & Jordan, M. (2002b). Supplementary notes for optimal feedback control as a theory of motor coordination. Available on-line at: http://www.nature.com/neuro/supplements/.

Tsitsiklis, J. N., & Van Roy, B. (1996). An analysis of temporal-difference learning with function approximation (Tech. Rep. No. LIDS-P-2322). Cambridge, MA: MIT.

Received December 16, 2002; accepted July 15, 2003.
LETTER
Communicated by Richard Zemel
Rapid Processing and Unsupervised Learning in a Model of the Cortical Macrocolumn

Jörg Lücke
[email protected]

Christoph von der Malsburg
[email protected]

Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
We study a model of the cortical macrocolumn consisting of a collection of inhibitorily coupled minicolumns. The proposed system overcomes several severe deficits of systems based on single neurons as cerebral functional units, notably limited robustness to damage and unrealistically large computation time. Motivated by neuroanatomical and neurophysiological findings, the utilized dynamics is based on a simple model of a spiking neuron with refractory period, fixed random excitatory interconnection within minicolumns, and instantaneous inhibition within one macrocolumn. A stability analysis of the system's dynamical equations shows that minicolumns can act as monolithic functional units for purposes of critical, fast decisions and learning. Oscillating inhibition (in the gamma frequency range) leads to a phase-coupled population rate code and high sensitivity to small imbalances in minicolumn inputs. Minicolumns are shown to be able to organize their collective inputs without supervision by Hebbian plasticity into selective receptive field shapes, thereby becoming classifiers for input patterns. Using the bars test, we critically compare our system's performance with that of others and demonstrate its ability for distributed neural coding.

1 Introduction

Simulations of artificial neural networks (ANNs) are a standard way to study neural information processing. Although a large amount of data about biological neural networks is available, there remain uncertainties regarding the way in which neurons process incoming action potentials, the way the neurons are interconnected, and the way in which interconnections change dynamically over time. These uncertainties have generated a broad variety of different models of neural networks.
They are based on different assumptions about connectivity (e.g., feedforward or symmetrically interconnected), neuron models (e.g., McCulloch-Pitts, integrate-and-fire, Hodgkin-Huxley), and different modification rules for synaptic weight changes (e.g., Hebbian

Neural Computation 16, 501–533 (2004)
© 2004 Massachusetts Institute of Technology
502
J. Lücke and C. von der Malsburg
learning, backpropagation). ANNs like the Hopfield network (Hopfield, 1982; Hopfield & Tank, 1986) or perceptrons afford deep functional insight on the basis of mathematical analysis that allowed the networks to be successful in various technical applications and influenced our views on learning and information processing in biological neural networks significantly. By now, it has become obvious, however, that the classic ANNs fall short in modeling the generalization abilities or computation times of biological networks. Many important reactions in the brain take place in times so short that individual neurons have time to transmit only very few or just a single action potential (see Nowak & Bullier, 1997, and Thorpe, Fize, & Marlot, 1996, for reaction times in the visual system). If graded signals are to be processed, models based on a single neuron rate code fail to model the measured reaction times. Further, most ANNs do not reflect biologically plausible connectivity because they were motivated by the view that biological information processing is continuously distributed over the cortical surface or that information is processed strictly feedforward through layers of equal neurons. In the last two or three decades, a large amount of anatomical and physiological data has been accumulated, suggesting that the cortex is hierarchically organized in cellular columns as principal building blocks (see Mountcastle, 1997, and Buxhoeveden & Casanova, 2002, for overviews, and Jones, 2000, for a critical discussion). Columnar organization is advantageous (1) with respect to implementation of a neural population rate code able to overcome the computational speed limitations of single neuron rate codes and (2) with respect to connectivity and robustness. With evolutionary growth of the brain, individual building blocks had to connect to more and more other elements.
Groups of neurons can support many more connections than individual neurons, and a network based on neural columns as principal units can be expected to be much more robust against the loss of connections or dropout of neurons. In the cerebral cortex of mammals, neural columns can be identified on different scales. The minicolumn is believed to be the smallest neural entity, consisting of several tens up to a few hundred neurons, which are stacked orthogonally to the cortical surface (Peters & Yilmaze, 1993). The minicolumns themselves combine to what is called a macrocolumn or segregate (Favorov & Diamond, 1990; see Mountcastle, 1997, for an overview). Like minicolumns, macrocolumns can be identified both anatomically and physiologically (Favorov & Diamond, 1990; Elston & Rosa, 2000; Lübke, Egger, Sakmann, & Feldmeyer, 2000) and are shown to process stimuli from the same source, such as an area of the visual field or a patch of a body surface (Favorov & Whitsel, 1988; Favorov & Diamond, 1990). In the primary somatosensory cortex of the cat, macrocolumns were found to contain approximately 80 minicolumns. Although mini- and macrocolumns are best studied in primary sensory areas, they are found in higher cortical areas as well (Peters, Cifuentes, & Sethares, 1997; Constantinidis, Franowicz, & Goldman-Rakic, 2001) and are believed to represent the basic building
blocks of all areas of cortices of higher vertebrates (Mountcastle, 1997; Buxhoeveden & Casanova, 2002). The main part of a minicolumn is a collection of excitatory cells grouped around bundles of dendrites (Peters & Yilmaze, 1993) and axons (Peters & Sethares, 1996). Together with physiological findings (Thomson & Deuchars, 1994), this suggests that the excitatory cells of a minicolumn are mutually strongly interconnected (Mountcastle, 1997; Buxhoeveden & Casanova, 2002). For inhibitory feedback, double-bouquet cells and basket (clutch) cells play a central role (DeFelipe, Hendry, & Jones, 1989; DeFelipe, Hendry, Hashikawa, Molinari, & Jones, 1990; Peters & Sethares, 1997; Budd & Kisvarday, 2001). Dendritic branch and axonal field analysis suggests that the inhibitory cells are stimulated by the activities within the excitatory cells of their minicolumn and project back to a number of minicolumns in their neighborhood. In this article, we study a model of the macrocolumn motivated by the above findings. We model the macrocolumn as a collection of inhibitorily coupled minicolumns that consist of a collection of randomly interconnected excitatory neurons. The excitatory neurons are modeled explicitly. The neuron model is a very abstract one, but it takes into account the neurons' spiking character. The assumptions made allow for a detailed mathematical analysis that captures the basic properties of the neuron dynamics. It turns out that the spiking character of the neurons, in combination with the columnar interconnection structure and fast inhibitory feedback, leads to a dynamics with ideal properties for computing input mediated by afferent fibers to the macrocolumn. We will show that in the absence of input, the macrocolumn dynamics possesses stationary points (i.e., states of ongoing neural activity). The number of these depends exponentially on the number of minicolumns.
The stability of the stationary points is controlled by a single parameter of inhibition and changes at a single critical value of this parameter. The system is operated by letting the inhibition parameter oscillate about its critical value. An isolated macrocolumn is hereby successively forced to spontaneously break the symmetry between alternate stationary points. If the dynamics is weakly coupled to input by afferent fibers that are subject to Hebbian plasticity, a self-organization of minicolumnar receptive fields (RFs) is induced. The self-organization turns the macrocolumn into a decision unit with respect to the input. The result of a decision is identified with the active state of the macrocolumn at the maximum of the inhibitory oscillation (or ν-cycle). Possible ν-cycle periods may be very short, suggesting cortical oscillations in the γ-frequency range as their biological correlates and allowing very rapid functional decisions. The macrocolumn model will be shown to be able to classify input patterns and extract basic features from the input. The latter capability is demonstrated using the bars test (Földiák, 1990), and it can be shown that the presented model is competitive with all other systems able to pass the test. During the bars test, a system has to learn the basic components of input patterns made up of superpositions of these components. In the superpositions, the components do not add up linearly, which presents an additional difficulty. Systems passing the test must be able to represent an input pattern by exploiting constituent combinatorics. The passing of the bars test can be seen as a prerequisite for systems that are intended to handle large and varying natural input, because natural input is expected to be most efficiently represented by combining elementary features. The presented network owes its computational abilities to the properties of the internal macrocolumnar neuron dynamics, which itself is emergent from the columnar interconnection structure, the spiking nature of neurons, and background oscillation. The network reflects, on the one hand, major properties of biological neural information processing and is, on the other hand, competitive with a class of systems that focus on functional performance in tasks of basic feature extraction. In combining these two aspects, the system distinguishes itself from all other column-based systems (Fukai, 1994; Favorov & Kelly, 1996; Somers, Nelson, & Sur, 1995; Fransen & Lansner, 1998; Lao, Favorov, & Lu, 2001) and systems whose ability for extracting basic input constituents was demonstrated using the bars benchmark test (Földiák, 1990; Saund, 1995; Dayan & Zemel, 1995; Marshall, 1995; Hinton, Dayan, Frey, & Neal, 1995; Harpur & Prager, 1996; Frey, Dayan, & Hinton, 1997; Hinton & Ghahramani, 1997; Fyfe, 1997; Charles & Fyfe, 1998; Hochreiter & Schmidhuber, 1999; O'Reilly, 2001; Spratling & Johnson, 2002). In section 2, we define the macrocolumn model and analyze its dynamical properties. In section 3, we introduce Hebbian plasticity of afferents to the macrocolumn and study the resulting input-driven self-organization of the minicolumns' RFs. In section 4, the computational abilities of the system are systematically studied for the problem of pattern classification and for the bars problem.
In section 5, we discuss the system's general properties in comparison with systems that have been applied to the bars problem and discuss our system's relation to neuroscience.

2 Neural Dynamics of the Macrocolumn

We first define and analyze the dynamics of a model of a single minicolumn and then proceed by studying the dynamical properties of the macrocolumn as a set of inhibitorily coupled minicolumns.

2.1 Model of the Minicolumn. We take a minicolumn to consist of excitatory neurons that are randomly interconnected, as motivated by the above-mentioned findings (see Mountcastle, 1997, and Buxhoeveden & Casanova, 2002, for review). The excitatory neurons are modeled as McCulloch-Pitts neurons with a refractory period of one time step. The dynamics of a minicolumn consisting of m neurons is described by the following set of difference
equations (i D 1; : : : ; m):
0 1 m X ni .t C 1/ D H @ Tij nj .t/ ¡ 2A ¢ H.1 ¡ ni .t//; | {z } jD1
» 0 H.x/ :D 1
ref raction
if x · 0 : if x > 0
(2.1)
For the interconnection $T_{ij}$, we assume that each neuron receives $s$ synapses from other neurons of the minicolumn. We further assume that the dendrites and axons interconnect randomly, such that a natural choice for the probability of receiving a synapse from a given neuron is $\frac{1}{m}$ (cf. Anninos, Beek, Csermely, Harth, & Pertile, 1970). The synaptic weights we take to be equal to a constant $c$. Note that for any choice of $c$, the threshold $\Theta$ can be chosen such that the resulting dynamics is the same. Without loss of generality, we therefore choose $c = \frac{1}{s}$, so that $\frac{1}{m}\sum_{i,j=1}^{m} T_{ij} = 1$. To further analyze the dynamics, we describe equations 2.1 in terms of the activation probability of a neuron, $p(t)$, at time $t$. The probability depends, first, on the number of received inputs and, second, on the probability that the neuron was active in the preceding time step. Due to the interconnection $(T_{ij})$, the probability $f_{bn}(x)$ of a neuron receiving $x$ nonzero inputs from its presynaptic neurons is given by the binomial distribution

$$f_{bn}(x) = \binom{s}{x}\, p^x (1 - p)^{s - x}. \qquad (2.2)$$
For $s \gg 1$, the distribution can be approximated by a gaussian probability density (theorem of Moivre-Laplace) of the form

$$f_g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - a}{\sigma}\right)^2}, \qquad a = sp, \quad \sigma = \sqrt{s p (1 - p)}. \qquad (2.3)$$
The probability $p_A(t+1)$ that a neuron receives enough input to exceed threshold at $(t+1)$ is thus given by integrating $f_g(x)$ at $t$ over all $x > \frac{\Theta}{c} = s\Theta$. The probability $p(t+1)$ that the neuron is activated at time $(t+1)$ further depends on the probability $p_B(t+1)$ that it is not refractory at $(t+1)$. The probability $p_B(t+1)$ is directly given by the complement of the probability that the neuron was active in the preceding time step, $p_B(t+1) = (1 - p(t))$. To further analyze the dynamics, so-called coherence effects, which are caused by repeating (cycling) neural activity states, are considered. Such effects are a direct consequence of the random but fixed interconnection matrix $(T_{ij})$ and thus of interdependent neural activation and refraction probabilities. As we will discuss at the end of the section, the coherence effects can be suppressed, for example, by neural threshold noise. If the effects are sufficiently suppressed, we can assume the probabilities $p_A(t+1)$ and $p_B(t+1)$ to
be approximately independent, $p(t+1) = p_A(t+1)\, p_B(t+1)$. This assumption permits a compact description of dynamics 2.1 in terms of the activation probability $p(t)$ (see the appendix for details):

$$p(t+1) = \Phi_s\!\left(\frac{p(t) - \Theta}{\sqrt{p(t)\,(1 - p(t))}}\right) (1 - p(t)), \qquad (2.4)$$
where $\Phi_s(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{s}\,x} e^{-\frac{1}{2} y^2}\, dy$ is the gaussian error integral parameterized by $s$. The inhibitory feedback to the minicolumnar activity is modeled indirectly as a rise of the threshold $\Theta$. It is taken to be present already in the next time step and to be sensed equally by all neurons, which can be motivated by the axonal distribution of inhibitory double-bouquet neurons (DeFelipe et al., 1990; Peters & Sethares, 1997). The inhibitory neurons receive input from the excitatory neurons of the minicolumn. The inhibitory feedback we choose to depend linearly on the overall activity, $\frac{1}{m}\sum_{i=1}^{m} n_i(t)$, of the minicolumn,

$$\Theta = \nu\, \frac{1}{m} \sum_{i=1}^{m} n_i(t) + \Theta_o = \nu\, p(t) + \Theta_o, \qquad (2.5)$$
where $\nu$ is the proportionality factor of inhibition and $\Theta_o$ the constant threshold of the neurons. This choice represents a natural approximation of the feedback and allows for further analytical treatment. Inserting equation 2.5 into 2.4, we get

$$p(t+1) = \Phi_s\!\left(\frac{(1 - \nu)\, p(t) - \Theta_o}{\sqrt{p(t)\,(1 - p(t))}}\right) (1 - p(t)). \qquad (2.6)$$
The difference equation 2.6 can be shown to possess a point of nonzero stable stationary activity for a wide range of parameters $s$, $\Theta_o$, and $\nu$, given by

$$P_\nu^{s,\Theta_o} := \max\left\{ p \;\middle|\; p = \Phi_s\!\left(\frac{(1 - \nu)\, p - \Theta_o}{\sqrt{p\,(1 - p)}}\right) (1 - p) \right\}. \qquad (2.7)$$
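A minimal numerical sketch of this fixed-point computation follows, simply iterating map 2.6 from an initial activity. The parameters s = 50, ν = 0.7, Θ_o = 0.1 are illustrative and are chosen so that a stable nonzero stationary point exists; for these values (1 − ν)/3 = Θ_o, so the fixed point is exactly p = 1/3:

```python
# Iterating the mean-field map of equation 2.6 to find the nonzero stationary
# activity P_nu of equation 2.7; parameter values are illustrative assumptions.
import math

def phi_s(x, s):
    # Gaussian error integral Phi_s(x): standard normal CDF evaluated at sqrt(s)*x
    return 0.5 * (1.0 + math.erf(math.sqrt(s) * x / math.sqrt(2.0)))

def step(p, s, nu, theta_o):
    # outside (0, 1) the mean-field expression is undefined; clamp to zero activity
    if p <= 0.0 or p >= 1.0:
        return 0.0
    arg = ((1.0 - nu) * p - theta_o) / math.sqrt(p * (1.0 - p))
    return phi_s(arg, s) * (1.0 - p)

def P_nu(s, nu, theta_o, p0, iters=1000):
    p = p0
    for _ in range(iters):
        p = step(p, s, nu, theta_o)
    return p

p_star = P_nu(s=50, nu=0.7, theta_o=0.1, p0=0.3)   # converges to 1/3
```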
$P_\nu^{s,\Theta_o}$ (or $P_\nu$ for short) can be numerically determined, and its value is in good agreement with activity probabilities obtained by directly simulating equation 2.1 with inhibition 2.5. The dynamics 2.1 with 2.5 has to be simulated with additional neural threshold noise in order to suppress the coherence effects mentioned above. Fluctuations of the inhibitory feedback caused by the finite number of simulated neurons, $m$, also contribute to the noise but are, for a wide range of parameters, not sufficient on their own for an appropriate suppression of the effects. Note that the coherence effects are most
efficiently suppressed if the interconnection matrix $(T_{ij})$ is re-randomized at each time step. In this case, the assumption $p(t+1) = p_A(t+1)\, p_B(t+1)$ and also the computation of $p_A(t+1)$ via the binomial distribution can be adopted from Anninos et al. (1970). There the matter is thoroughly discussed for a dynamics with another type of inhibition, but the essential arguments carry over to our dynamics. Neural threshold noise as an alternative to re-randomization was first described in Lücke, von der Malsburg, and Würtz (2002). In specific simulations with fixed interconnections and threshold noise, we have validated that the behavior closely matches that of simulations with successively re-randomized interconnections, which shows that the analytical results also apply to the biologically realistic case of fixed interconnections, as used in this work.

2.2 Model of the Macrocolumn. As motivated by the distribution of synapses of inhibitory neurons (DeFelipe et al., 1989, 1990; Peters & Sethares, 1997; Budd & Kisvarday, 2001), the macrocolumn is modeled as a collection of inhibitorily coupled minicolumns. With the same assumptions as above, the dynamics is given by the following set of $k\,m$ difference equations,

$$n_i^\alpha(t+1) = H\Big(\sum_{j=1}^{m} T_{ij}^\alpha\, n_j^\alpha(t) - I(t) - \Theta_o\Big) \cdot H\big(1 - n_i^\alpha(t)\big), \qquad (2.8)$$
where $\alpha = 1, \ldots, k$ counts the minicolumns, $i = 1, \ldots, m$ the neurons of a minicolumn, and where $I(t)$ denotes a time-dependent inhibition. Note that equation system 2.8 assumes that there are no direct connections between excitatory neurons of the different minicolumns. For a fixed $\alpha$, the interconnection $(T_{ij}^\alpha)$ is of the same type as in the previous section. By the statistical considerations of section 2.1, we can replace the $k\,m$ difference equations by a set of $k$ difference equations in terms of the activation probabilities $p_\alpha$ of neurons of the different minicolumns:

$$p_\alpha(t+1) = \Phi_s\!\left(\frac{p_\alpha(t) - I(t) - \Theta_o}{\sqrt{p_\alpha(t)\,(1 - p_\alpha(t))}}\right) (1 - p_\alpha(t)). \qquad (2.9)$$

The inhibitory feedback $I(t)$ is modeled as the maximum of the overall activities in the minicolumns, $p_\beta = \frac{1}{m}\sum_{i=1}^{m} n_i^\beta(t)$,

$$I(t) = \nu \max_{\beta = 1, \ldots, k} \{p_\beta(t)\}, \qquad (2.10)$$
where the maximum operation is assumed to be implemented by the system of inhibitory neurons of the macrocolumn. The maximum operation can be biologically implemented in various ways (Yu, Giese, & Poggio, 2002). Some possibilities are based on shunting inhibition (cf. also Reichardt, Poggio,
508
J. Lucke ¨ and C. von der Malsburg
& Hausen, 1983) whereas others use subtractive inhibition. On the functional side, inhibition proportional to the maximal minicolumnar activity 2.10 results in a qualitatively different and favorable behavior compared to a dynamics with inhibition proportional to the average activity, as was studied in Lucke ¨ et al. (2002). The dynamical difference and its functional implications will be discussed further later in this section. 2.3 Stability Analysis. The dynamical properties of the macrocolumn model can now be studied with a stability analysis of a system of k coupled nonlinear difference equations (® D 1; : : : ; k): Á ! p® .t/ ¡ º max¯ fp¯ .t/g ¡ 2o p p® .t C 1/ D 8s .1 ¡ p® .t// p® .t/.1 ¡ p® .t// (2.11)
E D: G® .p.t//:
First, note that the system possesses the following set of potentially stable stationary points,
Q D fEq j 8i D 1; : : : ; k .qi D 0 _ qi D Pº /g;
(2.12)
for example, for k D 3, .0; 0; 0/, .Pº ; 0; 0/, .0; Pº ; 0/, : : : , .Pº ; Pº ; 0/, : : : , .Pº ; Pº ; Pº /, where Pº is given in equation 2.7. The magnitude of Q is jQj D 2k . The vector with the smallest norm, .0; 0; 0/, will be called qEmin , and the vector with largest norm, .Pº ; Pº ; Pº /, will be called qEmax . The set of Q without the trivial stationary point qEmin will be denoted by QC D Q ¡fE qmin g. To analyze the stability of the stationary points in Q, we rst approximate E D º max¯ fp¯ g with a differentiable function: I .p/ Á E :D º IQ ½ .p/
! ½1
k X
.p¯ / ¯D1
½
E D I .p/: E ) lim IQ ½ .p/ ½!1
(2.13)
A stationary point pE is stable if and only if the magnitudes of all eigenvalues E p/ E (see equation 2.11) are smaller than one. Thanks to of the Jacobian of G. E the symmetries of equation system 2.11 and to the substitution by IQ ½ .p/, eigenvalues can be computed in general and are, in the limit ½ ! 1, given by1 ¸0 D 0;
³ ´ 1 1 ¡ 2Pº 1¡º C ¸1 D p 2o 80s .h.º// ¡ 8s .h.º//; Pº 2 Pº .1 ¡ Pº / 1 Due to the symmetries in the equations, we get, for q E 0 .Eq/, E 2 Q , a symmetric Jacobian G whose eigenvalues are simple functions of the Jacobian matrix entries. The explicit equations for the eigenvalues can then be obtained by long but straightforward calculations.
Processing and Learning in Macrocolumns
1
³
509
1 ¡ 2Pº
1 C .1 ¡ 2Pº / º C ¸2 D p 2 Pº .1 ¡ Pº / .1 ¡ º/ Pº ¡ 2o where h.º/ D p : Pº .1 ¡ Pº /
Pº
´ 2o 80s .h.º// ¡ 8s .h.º//
If for a given vector qE 2 Q, l.E q/ is the number of nonzero entries, then ¸0 is of multiplicity .k ¡ l.Eq//, ¸1 is of multiplicity 1, and ¸2 is of multiplicity .l.E q/ ¡ 1/. For xed parameters s and 2o , the magnitudes of all eigenvalues are smaller than one for º smaller than a critical value ºc . For º > ºc , ¸2 gets larger than one, which implies that all qE 2 Q with l.E q/ ¸ 2 become unstable. Hence, a set of .2k ¡ k ¡ 1/ stationary points of QC loses their stability at the same value ºc . In Figures 1 and 2, the properties of the dynamics are visualized using bifurcation diagrams. The critical value ºc can be computed numerically and is, for s D 15 and 2o D 1s , given by ºc ¼ 0:69. For k D 2, the set QC consists of three nontrivial stationary points, which are all stable
Figure 1: Bifurcation diagram of equation system 2.11 for parameters s D 15, 2o D 1s ; and k D 2. The points of QC D f.P º ; 0/; .0; P º /; .P º ; Pº /g are plotted together with the two unstable points of the dynamics. To obtain a twodimensional bifurcation diagram, the stationary points for given º are projected onto the one-dimensional space with normal vector p12 .1; 1/. The only stationary point not plotted is Eqmin because it projects onto the same point as Eqmax . The solid lines mark stable stationary points, and the dotted lines mark unstable points.
510
J. Lucke ¨ and C. von der Malsburg
Figure 2: Bifurcation diagram of equation system 2.11 for parameters s D 15, 2o D 1s , and k D 3. For each º, the stationary points of the three-dimensional phase space are projected to the plane given by the normal vector p13 .1; 1; 1/. The vectors p1 , p2 , and p3 are projections of the trihedron of the phase space onto this space. The only stationary point of the system that is not plotted is Eqmin because it projects onto the same point as Eqmax . The unstable stationary points are plotted as dotted lines; the stable stationary points, which are always elements of QC , are plotted as solid lines. For º < ºc , all elements of QC are stable, but for º > ºc , only k D 3 stable points remain. All other stable points lose their stability in subcritical bifurcations for the same value of º.
for º < ºc . Apart from the points in QC , there exist two unstable stationary points, which are numerically computed and are given by the dotted lines in Figure 1. For small values of º, the unstable points lie in the vicinity of the antisymmetric stable stationary points .Pº ; 0/ and .0; Pº /, which indicates that they attract only a small volume of neighboring phase space. For increasing º, the unstable points approach the symmetric stable stationary point qEmax D .Pº ; Pº /, which indicates that the phase space volume of points attracted by qEmax gets gradually smaller. In the point of structural instability, º D ºc , qEmax nally loses its stability when it meets the unstable stationary points in a subcritical pitchfork bifurcation. To visualize the crucial and analytically derived property that .2k ¡ k ¡ 1/ stationary points of QC lose their stability for the same value of º, the bifurcation diagram of a network for k D 3 is given in Figure 2. In the
Processing and Learning in Macrocolumns
511
diagram, all stationary points of QC are plotted, and we nd for º < ºc the set’s .2k ¡ 1/ stable stationary points. Apart from the points in QC , we get a number of unstable stationary points, which all lie, for small º, at the same distance from qEmax and in the vicinity of the other points of QC . As º increases, the unstable points are getting closer to the stable points with l.Eq/ ¸ 2 nonzero entries, and for º D ºc , these stable points of QC lose their stability when they are met by the unstable points in subcritical bifurcations. Note that an inhibition proportional to the mean minicolumnar activity instead of the maximum as in equation 2.10 results in a dynamics whose stationary points lose their stability for different values of º (compare also Luecke ¨ et al., 2002). This dynamic property is reected by eigenvalues of E with 0 < ½ < 1. the dynamics’s Jacobian, which depend on l.Eq/ for IQ ½ .p/ 2.4 Input. For the neuron dynamics 2.8, we have gained, by our stability analysis, far-reaching insight into the dynamical properties of the macrocolumnar model. The knowledge can now be exploited to investigate the dynamical behavior of the system if it is subject to perturbations in the form of externally induced input. As is customary in biology, we will denote the positive contribution of a presynaptic neuron to the input of the postsynaptic neuron excitatory postsynaptic potential (EPSP), and we will say that a neuron emits a spike at time t if it is active at time t. For the dynamics as investigated in the previous section, there are essentially three different modes of operation possible: ² For º > ºc , the macrocolumn can serve as a memory unit able to stabilize k different macroscopic states (i.e., stable stationary points of equation system 2.11). The switching between the states is possible by sending a sufciently large quantity of EPSPs to the respective minicolumn. ² For º < ºc , the macrocolumn is able to stabilize 2k ¡ 1 different macroscopic states. 
The transition between the states would be possible by inducing a sufciently large quantity of EPSPs to an appropriate subset of minicolumns. If º is chosen to be only slightly smaller than ºc and if one starts with the stable stationary point qEmax , already small differences in the input to the minicolumns are sufcient for the macrocolumn to change to a corresponding macroscopic state. ² If, for an initial state qEmax , º is continuously increased from º < ºc up to a value larger than ºc , the system will change to another stable stationary point for some value of º < ºc depending on the input. For larger values of º, the system can again change between stable points until º is nally larger than ºc and the dynamics is forced to one of the remaining stable stationary points where just one minicolumn is active. Consider, for instance, external EPSP input to a macrocolumn with k D 3 minicolumns. If the numbers of EPSPs induced per time step,
512
J. Lucke ¨ and C. von der Malsburg
M®EPSP , are different for the three minicolumns, for example, M1EPSP : M2EPSP : M 3EPSP D 0 : 1 : .1 C ²/; for 0 < ² ¿ 1; the dynamics will stabilize the initial state qEmax for small º. If º gets larger, the system will change to the stable point .0; Pº ; Pº / for some º1 < ºc because this point is less deected by the input than qEmax . The deection of .0; Pº ; Pº / caused by the input is sufciently large, however, if º is further increased. The system will therefore nally stabilize the point .0; 0; Pº /. The third possibility is the one with the most useful features. For given inputs, the dynamics rst successively switches off the minicolumns with smallest inputs. These macrostate transitions occur the earlier the larger the differences between the inputs are and can therefore encode neural population rate differences. If a new stable stationary point is reached, the process of switching off a minicolumn continues, each time without the perturbing inuence of the input of the already quiescent columns. For a dynamics whose stationary points lose their stability for different values of º (see Lucke ¨ et al., 2002), the number of active minicolumns is determined by º and not by the input. Note in this context that equation 2.10 is not the only type of inhibition that results in a single critical value of º. Other inhibition functions, for example, the average of active columns,
Iac .t/ :D º
1 X Q ; p¯ .t/ ; A D f¯ j p¯ > pg jAj ¯2A
(2.14)
with 0 < pQ ¿ 1, can also be shown to possess this property. In general, the contribution of quiescent minicolumns to the inhibition has to be negligible as a prerequisite for a dynamics with qualitative behavior comparable to the one with inhibition, equation 2.10. The simplicity of equation 2.10 and its good functional performance were the reasons to choose an inhibition proportional to the maximum in this work. The macrostate transitions that depend on input differences but are induced by an increased parameter º are all performed near symmetry breaking points. The transitions are theoretically innitely sensitive to input differences such that a macrocolumn can serve as an ideal decision unit (see also L ucke ¨ et al., 2002). For appropriate parameters s and 2o , the stabilization of stationary activity is performed in a few iteration steps such that the time to increase º from its minimal to its maximal value lies within a few tens of time steps, which makes decisions very fast in addition. If the inhibition parameter oscillates, the macrocolumns can repeatedly select the strongest input or inputs. In the next section, this mode of operation is exploited and further discussed for a macrocolumn with explicitly modeled afferent bers.
Processing and Learning in Macrocolumns
513
3 Afferent Fibers and Hebbian Plasticity We consider the situation that the excitatory neurons of the macrocolumn, n®i , receive input from an input layer of N external neurons njI , which are of the same type as the excitatory neurons of the minicolumns. In the following, we think of the neurons of the input layer as extracortical neurons in order to analyze the dynamical properties of the macrocolumn in a more convenient way. However, the input neurons can also be considered to be excitatory neurons of other macrocolumns, which would account for lateral excitation within the cortex. An afference from an input neuron, njI , to a neuron of a minicolumn, n®i , will be denoted by R®ij (see Figure 4A for a sketch of the system). Analogous to the internal connectivity, we demand that one neuron n®i receives a xed number of r synapses from neurons of the input layer and that the synaptic weight of a synapse is given by c D 1s . The receptive eld vector of a minicolumn, RE® 2 f0; c; 2c; : : :gN , is dened as the sum of the RF vectors RE®i D .R®i1 ; : : : ; R®iN / of all neurons of the minicolumn ®, P E® 2 RE® D m iD1 Ri . Instead of reanalyzing the dynamics statistically, it is (for r signicantly smaller than s) sufcient to treat the external input to the macrocolumn as a perturbation of the internal dynamics. The macrocolumn will be operated by repeatedly increasing the inhibition factor º from a minimal value ºmin to a maximal value ºmax . The system is hereby forced to select the column(s) with the strongest input at the end of each period or º-cycle (as we will call it from now on). In the beginning of a º-cycle, the system has to be in the state qEmax , which can be achieved under the inuence of noise by setting º to a sufciently small value before starting to increase º at ºmin (see Figure 3B). If for a macrocolumn of k D 3 minicolumns, the RFs, RE® , are already given, the system is able to distinguish even strongly overlapping input patterns. 
The system rst switches off the minicolumns with RFs very different from the presented stimulus and then decides between the remaining minicolumns with RFs similar to the stimulus (see Figure 3). In this way, the system can also gracefully handle simultaneously presented patterns. If a superposition of two patterns corresponding to the RFs of two minicolumns is presented, the dynamics switches off all irrelevant minicolumns except the two corresponding ones, whose activities are symmetrized. It then depends on the choice of ºmax whether this is the nal state or whether the symmetry is broken to favor one of the patterns. We now proceed by introducing Hebbian plasticity of the afferent bers to match neurophysiological experiments which show input-dependent changes of neuron RFs. As the RFs of neurons, RE®i , change, the RFs of the 2
In the neurons’ RF vectors, only afferent connections are considered.
514
J. Lucke ¨ and C. von der Malsburg
Figure 3: Operation and dynamic behavior of a system with parameters m D 100, s D 15, r D 7, 2o D 1s , with k D 3 minicolumns, and an input layer of E ® , of the minicolumns ® D 1; 2; 3 are given N D 16 £ 16 neurons. (A) The RFs, R as two-dimensional plots of the 162 vector entries. The entries are visualized as gray levels (black D 0). To make the RFs more conceivable, we have chosen them to be of the form of simple two-dimensional patterns. The input pattern is chosen to correspond to the RF of minicolumn ® D 3. During the operation of the system, all neurons of the input layer that correspond to white pixels spike with probability 13 ; neurons that correspond to black pixels are not spiking. (B) The periodical change of the parameter of inhibition, º, is visualized. After a short period with º D 0:1, which serves to reset the dynamics to qEmax , º is linearly increased from ºmin to ºmax D 1:12. Three º-cycles with period length Tº D 25 are displayed. (C) The dynamic behavior of the system is visualized, with the activities p® .t/ for the minicolumns ® D 1; 2; 3 plotted against time. In the beginning of a º-cycle, the dynamics tends to symmetrize the activities as predicted by the theoretical results and the bifurcation diagrams. The symmetry is rst broken when minicolumn ® D 1 is switched off because its RF receives the smallest number of EPSPs from the presented input. Afterward the stationary point .0; P º ; Pº / is stabilized; the remaining two minicolumn activities are symmetrized, until minicolumn ® D 2 becomes quiescent because it receives less input than minicolumn ® D 3. The qualitative behavior for each º-cycle is the same, but quantitative differences exist due to the threshold noise of the neurons and nitely many neurons per minicolumn.
Processing and Learning in Macrocolumns
515
Figure 4: (A) Sketch of a macrocolumn with k D 3 minicolumns connected to an input layer of N D 25 neurons. The m D 8 neurons per minicolumn are randomly interconnected; each minicolumnar neuron receives s D 3 synapses from within its minicolumn. The inhibition is symbolically sketched as one inhibitory neuron receiving input from all minicolumns and projecting back to all of them. Each minicolumnar neuron receives r D 2 synapses from neurons E 1 , of minicolumn ® D 1 is of the input layer. The randomly initialized RF, R 2 3 E E fully displayed, whereas RFs R and R are not. Lines within the input layer are displayed only for visualization purposes. There are no connections of neurons within the input layer. (B) Set of three different input patterns of 16£16 pixels. (C) Modications of RFs of a macrocolumn with k D 3 minicolumns and parameters m D 100, s D 15, r D 7, 2o D 1s , E D 0:03, » D 55, and N D 256. For 0 º-cycles, the random initialization of the RFs is displayed. After ve º-cycles (and ve E 1 is presentations of patterns randomly chosen from the set of input patterns), R 2 3 E E slowly specializing to pattern 2, and after 10 and 15 º-cycles, R and R specialize to the patterns 3 and 1, respectively. After 15 º-cycles, the RF specialization further increases until the maximal specialization is reached after about 100 º-cycles. From 100 º-cycles on, the degree of specialization remains unchanged.
516
J. Lucke ¨ and C. von der Malsburg
minicolumns, RE® , consequently change in time as well. As an update rule for the synaptic change, 1R®ij .t/ D R®ij .t/ ¡ R®ij .t ¡ 1/, we use elementary Hebbian plasticity, that is, an afferent connection R®ij is increased if the presynaptic neuron was active at the time-step directly preceding the ring of the postsynaptic neuron. The state of maximal macrocolumnar activity qEmax is the same for all inputs. Only after the selection process, that is, at lower levels of activity due to minicolumn inactivation, the activity state is able to distinguish between inputs. Therefore, synaptic plasticity has to be preP P ® dominant at low levels of macrocolumnar activity, B.t/ D k®D1 m iD1 ni .t/, in order to generate discriminating RFs. A simple and, as it turned out, functionally advantageous way to do this is enabling synaptic modication only if B.t/ is smaller than a threshold » . As activity oscillations are ubiquitous in the cortex, it seems plausible that synaptic plasticity is phase coupled (see Wespata, Tennigkeit, & Singer, 2003, for recent evidence of phase-coupled synaptic modication). For the dynamics of synaptic change, we further demand as a boundary condition that the number of synapses received by a minicolumnar neuron is limited to r in order to avoid unlimited synaptic growths. We get as dynamic equations for the synaptic weights .® D 1; : : : ; k ; i D 1; : : : ; m ; j D 1; : : : ; N/: 1R®ij .t/ D AEt n®i .t/ njI .t ¡ 1/ iff B.t/ < » ; 8i; ® :
N 1X R® .t/ D r : c jD1 ij
(3.1) (3.2)
As our synaptic weights are discrete values, AEt is not a real valued growth factor but a probability that the synaptic weight is increased by c D 1s .3 If R®ij is increased for a given .®; i/, the neuron n®i removes randomly one afferent from the input layer in order to fulll the boundary condition. We operate the system by periodically changing º as in Figure 3B. Throughout the duration of a º-cycle, we present a pattern randomly chosen from a set of input patterns. An input neuron that corresponds to a white pixel is spiking, and a neuron that corresponds to a black pixel is not. The RFs of the neurons are randomly initialized and are modied according to equations 3.1 and 3.2. If the set of training patterns is structured, in the sense that it contains a small number of patterns as in Figure 4B, we can observe a specialization of the RFs of the minicolumns to the different input patterns. In Figure 4C, the modication of the RFs, RE® , of a macrocolumn with minicolumns ® D 1; 2; 3 is displayed, and it can be seen that the system organizes its RFs such that the macrocolumn becomes a decision unit for the input patterns. 3 To be more precise, for each .i; j; ®/ AE 2 f0; cg is an independent Bernoulli sequence t with probability P.0/ D 1 ¡ E and P.c/ D E .
Processing and Learning in Macrocolumns
517
In the beginning, an input pattern affects all minicolumns equally such that the system selects a subset of minicolumns by symmetry breaking. As soon as, initiated by random selection, a RF specializes for one class of input patterns, the corresponding minicolumn is more likely to be activated by patterns of this class, which further increases the specialization of the RF. This is the positive feedback loop of the self-organizing process, which amplies small uctuations and nally leads to an ordered state of the RFs. In addition to self-organizing aspects, we have a competition due to the minicolumn selection process and competition between afferent bers induced by equation 3.2. In order to avoid mutual weakening of different patterns stored in the same minicolumnar RF, the system specializes its RFs to adequately different input patterns. 4 Experiments We have seen that the system is able to specialize its RFs to different input patterns. So far, we have presented three different patterns to a network of three minicolumns (see Figure 4). We now investigate two more general situations. In the rst experimental setting, we present to the network different patterns that can be grouped into different classes. In the second setting, the network’s task is to extract basic constituents of a class of patterns generated by combining different bars, a task known as the bars problem (Foldi´ ¨ ak, 1990). For both tasks, the same network is used with the same set of parameters. All experiments use an input layer of 16 £ 16 input neurons. If a binary (black and white) pattern of 16 £ 16 pixels is presented, the input neuron, njI , of a given pixel spikes with probability 13 if the corresponding pixel is white and is not spiking if the corresponding pixel is black. Gray levels can be coded by intermediate ring rates, but in the following, for simplicity, only binary input is considered. We use a network with m D 100 neurons per minicolumn. 
Each neuron receives s D 15 synapses from presynaptic neurons of the same minicolumn and r D 7 synapses from neurons of the input layer. The neurons’ constant threshold is set to 2o D 1s ¼ 0:067; it is chosen such that a single EPSP is not sufcient to activate a neuron. The constant threshold is subject to gaussian threshold noise with zero mean and a variance of .¾ 2 /2 D 0:01. The oscillation of the inhibition is essentially governed by the parameters ºmin D 0:5, ºmax D 1:12, and the length of a º-cycle is Tº D 25 time steps. To allow short º-cycle periods, we use a º-oscillation as given in Figure 3B, where the rst part (with º D 0:1 and additional noise) serves to reset the dynamics to the stationary point qEmax . Note, however, that self-organization of RFs is also possible with other (e.g., sinusoidal) types of º-oscillations. Hebbian plasticity (see equations 3.1 and 3.2) is determined by the synaptic modication rate E D 0:03 and the parameter » D 55, which determines
518
J. Lucke ¨ and C. von der Malsburg
the network activity for which synaptic modication is possible. The latter is chosen such that synaptic modication is enabled only close to the end of a º-cycle (note that value two to three times larger for » with simultaneously reduced E results in a system with comparable qualitative behavior). All parameters are independent of the number of minicolumns k, which we allow to change for different experiments. The parameters are partly chosen to reect anatomical data, as in the case of the number of neurons per minicolumns m D 100, and partly to optimize performance in the experiments. In the following, we will refer to these parameters as the standard set of parameters. 4.1 Pattern Classication. We have seen that the system is able to specialize its RFs to be sensitive to a number of input patterns. More realistic input would not consist of a repeated presentation of exactly the same patterns as in Figures 4B and 4C but rather of different patterns that can be grouped into different classes. In Figure 5, a pattern classication experiment for such a kind of input is illustrated. For input patterns Va 2 f0; 1g256 as displayed in Figure 5A, we can dene the distance measure, dS .V a ; V b / :D
jA4Bj ; jA [ Bj
(4.1)
where A D fijVia D 1g, B D fijVib D 1g, and where .A4B/ D .A [ B/ ¡ .A \ B/ is the symmetric difference of sets. Distance function, equation 4.1, can directly be derived by analyzing RF-mediated input to the minicolumns (further detail would go beyond the scope of this article) and can be used to group the input patterns into different classes of mutually similar patterns, as can be seen in Figure 5B. In Figure 5C, a typical modication of RFs of a macrocolumn with six minicolumns is displayed, and it can be observed that the system builds up representations of all classes identiable in Figure 5B. If fewer minicolumns than pattern classes are available, the system builds up larger classes of mutually similar patterns (see Figure 5D). If more minicolumns than major classes are available, the system further subdivides the pattern classes (see Figure 5E). The subdivision may in this respect slightly differ from simulation to simulation. For example, for k D 9 the “square class” is in many cases represented only by one and the “plus class” by three instead of two minicolumns, as in Figure 5E. In the experiment, it can further be observed that the nal representation depends on the substructure of the pattern classes rather than on their size; for example, the plus pattern appears more frequently than the St. Andrew’s cross pattern but the St. Andrew’s cross tends to be represented by more RFs (see Figures 5C and 5E). Furthermore, the classication is independent of the number of white pixels per pattern because the impact of the patterns on the minicolumns is normalized by boundary condition 3.2. The independence is affected only by patterns approximately lling out the
Processing and Learning in Macrocolumns
519
Figure 5: (A) The set of input patterns is displayed. During each º-cycle, one randomly chosen pattern of this set is presented. (B) Distance matrix generated using the distance measure dS . The line and column index enumerates the 42 input patterns in the same order as they appear in A. (C) The modication of the RFs of a macrocolumn with k D 6 minicolumns and the standard set of parameters is displayed. After 250 and 1000 º-cycles, four and six different pattern classes are represented, respectively. The RFs’ degree of specialization further E 1 and R E 2 further subdivide the increases thereafter, and it can be seen that RFs R 1 E pattern class formerly represented by R only. (D) Final RF specialization (after 250 º-cycles) if a macrocolumn with k D 3 minicolumns is used with the same input. (E) Final RF specialization (after 10000 º-cycles) if a macrocolumn with k D 9 is used.
whole input layer or by patterns having a number of white pixels close to zero. 4.2 Feature Extraction. So far, we have seen that if the training patterns contain v classes of patterns, the system is able to identify these classes if k ¸ v. There are situations, however, when the training patterns cannot
520
J. Lucke ¨ and C. von der Malsburg
easily be grouped into pattern classes. This is the case, for instance, if we present from a number of v patterns not only the patterns themselves but also all possible superpositions. If k < 2v , the system is not able to store all patterns in different RFs. We mentioned in section 3, however, that the internal dynamics of a macrocolumn is especially suitable for taking into account pattern superpositions such that we can nevertheless expect the system to generate appropriate RFs. A method to evaluate the ability of a network to handle input that can be represented only by a combination of different patterns is the bars test. It was rst introduced in Foldi´ ¨ ak (1990) and soon became a benchmark test for generalization and combinatorial abilities of learning systems. The training patterns of the bars test consist of horizontal and vertical bars. On a quadratic input layer, 2b (nonoverlapping) horizontal and 2b (nonoverlapping) vertical bars can be displayed (with b an even integer), each with probability 1b and all of equal size (see Figure 6A for some examples with b D 8). Note that overlapping horizontal and vertical bars do not add up linearly because two overlapping white pixels do not add but result in a white pixel as well. The bars test is passed if, after a training phase, the system has built up representations of all bars and is able to correctly classify input consisting of superpositions of the learned bars. We can operate our system without modication, with the same set of parameters as for the experiment in Figure 5, and it turns out that it passes the bars test without difculty. The only prerequisite for a correct representation is that the number of minicolumns k is greater than or equal to the number of different bars, k ¸ b. If k > b, the RFs of some minicolumns remain uncommitted or specialize for a bar already represented by another minicolumn. 
For a bars test with b = 8 different bars, Figure 6B shows the modification of the RFs of a macrocolumn with k = 10 minicolumns. Starting from random initialization, the RFs specialize to different single bars even though the input patterns consist mainly of bar superpositions. In Figure 6B, a representation of all bars is clearly visible after 1000 ν-cycles, and the representation can be seen to stabilize further thereafter. During the learning phase, an RF sometimes specializes to a combination of two or more bars, as can be seen by looking at RF RE8 in Figure 6 (after 500 ν-cycles). Such an RF is not stable, however, because the parts of the RF that correspond to different bars compete via equation 3.2. The RF therefore rapidly specializes to one bar if another RF becomes sensitive to the other. An example is given by RFs RE7 and RE8 in Figure 6 from 500 to 1000 ν-cycles. In the experiment of Figure 6, two RFs remain unspecialized. In other experiments, or for a longer learning phase, the two supernumerary RFs often specialize to an already represented bar and in this way increase redundancy. The bars test has been used in different versions, with different numbers of bars and different systems. In Hinton et al. (1995), for instance, 8 bars were used; Hochreiter and Schmidhuber (1999) and others used 10 bars; Hinton and Ghahramani (1997) used 12 bars; and Földiák (1990) and others used 16 bars.
Processing and Learning in Macrocolumns
521
Figure 6: (A) A selection of 33 typical input patterns of the bars test with eight different bars. (B) Typical example of the self-organization of the RFs of a macrocolumn with 10 minicolumns and the standard set of parameters. During each ν-cycle, a randomly generated input pattern of the type shown in (A) is presented. After about 250 ν-cycles, the network has already found representations of seven bars. After 1000 ν-cycles, representations of all bars are found and are further stabilized.
522
J. Lücke and C. von der Malsburg
To allow for comparison with these systems, we measured the performance of our system for bars tests of 8, 10, 12, 14, and 16 bars. For all tests, we used the same system, always with the standard set of parameters and an input layer of 16 × 16 neurons. The different bars tests required different bar widths in order to cover the input layer appropriately. For the bars test with b = 8 bars, a bar width of four pixels was used (see Figure 6); for b = 10, three pixels; and for b = 12, 14, and 16, bars were two pixels wide. Consequently, the input layer is not uniformly covered for 10, 12, and 14 bars. In Figures 7A, 7B, and 7C, the results of different test series are presented. In Figure 7A, the number of minicolumns is equal to the number of different bars; in Figure 7B, the number of minicolumns exceeds the number of bars by 2; and in Figure 7C, a surplus of 4 minicolumns is available. A measurement point in the diagrams corresponds to the number of ν-cycles after which there is a 50% probability that all bars are represented; for example, in 200 runs with 8 bars and k = 10 minicolumns, there were 100 runs in which a representation was found after 1050 ν-cycles (see the first measurement point in Figure 7B). A bar is taken to be represented by a minicolumn if the minicolumn remains active in 9 of 10 ν-cycles in which the bar is presented. A macrocolumn is said to have found a representation of all bars if all bars are represented by at least one minicolumn and no minicolumn represents two different bars. In all runs, the system finally found a correct representation. Once a representation was found, it remained stable in the sense that the minicolumns remained specialized for the same bars. For a given experimental setting, there can be relatively large differences between individual runs, however. For 8 bars and k = 10, for example, a correct representation of the bars was not found after about 2100 ν-cycles in 20% of the 200 runs (indicated by the upper bound of the error bar), whereas another 20% of the experiments found representations after 400 ν-cycles (lower bound).
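The input generation just described (b bars on a 16 × 16 layer, with most patterns being superpositions of several bars) can be sketched as follows. This is only an illustration: the function name is ours, and the per-bar appearance probability defaults to the usual value p_o = 1/b stated in the text.

```python
import random

def bars_pattern(b=8, size=16, width=4, p=None, rng=random):
    """Generate one bars-test input pattern as a size x size 0/1 grid.

    Each of the b bars (b/2 horizontal and b/2 vertical strips of the
    given width) is switched on independently with probability p
    (default 1/b), so most patterns are superpositions of several bars.
    """
    if p is None:
        p = 1.0 / b
    img = [[0] * size for _ in range(size)]
    for i in range(b // 2):
        span = range(i * width, (i + 1) * width)
        if rng.random() < p:              # horizontal bar i
            for r in span:
                img[r] = [1] * size
        if rng.random() < p:              # vertical bar i
            for r in range(size):
                for c in span:
                    img[r][c] = 1
    return img
```

With b = 8 and width 4, the four horizontal and four vertical strips tile the 16 × 16 layer exactly, matching the uniform coverage described for the 8-bar case.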
The reason is that all bars but one find representations very early, whereas the remaining bar might take a long time to be represented, an effect that is also observable in Spratling and Johnson (2002) for the noisy bars test. In Figures 7B and 7C, a large reduction of learning time can be observed if the number of minicolumns is larger than the number of presented bars. A surplus of two minicolumns reduces the learning time to less than half of that with no surplus, and a surplus of four minicolumns reduces it to roughly a fourth. For the results in Figures 7A through 7C, we used a newly generated bars image for every ν-cycle. The same experiments can be carried out, however, by choosing randomly from a fixed set of u generated bars images. If u is several times larger than b, the results are qualitatively and quantitatively comparable. For a bars test with 8 bars and k = 10 minicolumns, for instance, u = 50 input patterns are fully sufficient to build up a correct representation of single bars. The results of Figures 7A through 7C further show that learning time in terms of ν-cycles decreases as the number of bars does. This is to be expected because the system has to learn a decreasing number of independent input constituents. On the other hand, there is an increasing overlap of bars, which makes it harder for the system to differentiate between two bars (compare Dayan & Zemel, 1995, and Hochreiter & Schmidhuber, 1999).
Figure 7: (A–C) Results of bars tests with b = 8, 10, 12, 14, 16 bars and a macrocolumn with the standard set of parameters. In A, the number of minicolumns of the macrocolumn used is always equal to the number of different bars. In B, the number of minicolumns exceeds the number of bars by two, and in C, there is a surplus of four minicolumns. (D–F) Results of bars tests with b = 8 bars and a macrocolumn with k = 10 minicolumns and standard parameters. In D, the input patterns are perturbed with bit-flip noise of 0 to 12%. In E, the bar widths are varied, and in F, the generation of the input patterns is altered such that for each run, four randomly chosen bars appear with probability (1/8)(1 − γ), whereas the other four appear with probability (1/8)(1 + γ). The measurement points of A, B, and C were obtained from 200 runs each, and those of D, E, and F from 100 runs each. In each case, the number of ν-cycles is given after which a representation of all bars is found with a probability of 0.5. The lower and upper bounds of the error bars correspond to probabilities of 0.2 and 0.8, respectively. For each run, a newly generated macrocolumn with newly initialized RFs was used.
The positive effect of fewer constituents is predominant in our system. The negative effect of more overlap can, however, be held responsible for an increase of learning time when the bar widths are varied, in an experiment discussed below. Once a system has learned a correct representation of the bars, it can be used to analyze bars images. To test the accuracy of the recognition, we trained a macrocolumn with images generated according to the bars test until it found a representation. After some additional learning to stabilize the representation further, it was tested with newly generated bars images of the same type as the training images. If an image is presented, the minicolumns corresponding to the bars appearing in the image remain active longer than minicolumns associated with bars not appearing in the test image. At the end of a ν-cycle, a minicolumn is either active or not. A test image is considered to be correctly recognized if, for all minicolumns that correspond to bars appearing in the image, the probability of remaining active is above average, and if the probabilities for minicolumns corresponding to all other bars lie below average. Six macrocolumns of 16 minicolumns, trained with bars images of 16 bars, were each tested 100,000 times with images generated according to the bars test, except that, for convenience, we required each image to contain at least one bar. The networks classified the input correctly in all but two of the 600,000 cases. In the first case, one of seven bars was not recognized, and in the second, one of eight bars. In the usual bars test, an individual bar is always displayed identically, the bars are of the same size, and all bars occur with exactly the same probability. Systems solving the bars test can therefore be suspected of exploiting these artificial assumptions.
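The recognition criterion just described can be stated compactly. The following helper is ours (not the authors' code) and only restates the above-average/below-average rule for the minicolumns' probabilities of remaining active:

```python
def correctly_recognized(active_prob, bars_present):
    """Recognition criterion for one test image.

    active_prob: dict mapping each minicolumn to its probability of
    remaining active at the end of a nu-cycle; bars_present: set of
    minicolumns whose bars appear in the image.  The image counts as
    correctly recognized if every minicolumn of a present bar is above
    the average probability and every other minicolumn is below it.
    """
    mean = sum(active_prob.values()) / len(active_prob)
    return (all(active_prob[m] > mean for m in bars_present) and
            all(active_prob[m] < mean
                for m in active_prob if m not in bars_present))
```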
The system of Földiák (1990) not only exploits the fact that the bars occur with the same probability but also needs to know the exact value of the bars' probability of occurrence. How much a system relies on the assumptions of the bars test can be tested by relaxing them, and we present three test series showing the corresponding behavior of our system. For all three series, we use a bars test with b = 8 bars and a macrocolumn with k = 10 minicolumns and the standard set of parameters. To test robustness against perturbed bar images, we presented input images with bit-flip noise during the learning phase (see Figure 8). In Figure 7D, the learning time is plotted for different degrees of noise. As can be observed, even low levels of noise have positive effects. However, with an increasing noise level, the final degree of specialization of the minicolumns' RFs is reduced. In Figure 8, the final specialization degree corresponds to the displayed RFs after about 2000 or 5000 ν-cycles. Compared to the final degree of specialization in Figure 6B, it can be seen that in the noisy case, the RFs have more overlap. The overlap increases with increasing noise, which leads to an increasing instability of the representation of all bars, until the system cannot find a representation of the bars anymore. For the standard set of parameters and for a bars test with 8 bars, no representations can be found for noise levels above about 12%. By decreasing the learning rate ε, the robustness against noise can be increased such that representations can be found for noise levels above 15%.
Figure 8: (A) A selection of 33 typical input patterns of a bars test with b = 8 different bars and 8% bit-flip noise. (B) Typical RF specialization corresponding to this input. RFs of a macrocolumn with 10 minicolumns and the standard set of parameters are displayed. After about 500 ν-cycles, representations of all bars are recognizable, and after about 2000 ν-cycles, the maximal degree of specialization is reached.
In the second test series, the bar size is varied. For b = 8, the bars are usually w = 4 pixels wide. If w = (w1, w2, w3, w4) denotes the bar widths for the four vertical as well as for the four horizontal bars, we can define δw = Σ_{i=1}^{b/2} |w_i − 4| as a measure of the bar width variation. In Figure 7E, the results for the test series w = (4, 4, 4, 4), (3, 4, 4, 5), (3, 3, 5, 5), (2, 3, 5, 6), (1, 3, 5, 7) are given. The learning time increases with increasing δw, presumably because the maximal bar overlap increases; for example, for w = (1, 3, 5, 7), the horizontal 7-pixel-wide bar covers nearly half of the 1-pixel-wide vertical bar.
The robustness of the system against relaxing the assumption that all bars occur with equal probability is investigated in the third test series. We reduce the appearance probability of four randomly chosen bars to the value p = p_o(1 − γ) and increase the appearance probability of the four other bars by the same amount, p = p_o(1 + γ). Here, γ is a parameter in the interval [0, 1], and p_o = 1/b = 1/8 is the usual appearance probability. In Figure 7F, the results for γ = 0.0, ..., 0.8 are given; for γ = 0.9, the corresponding measurements are 3750, 8050, and 32,850 ν-cycles for probabilities of 0.2, 0.5, and 0.8, respectively, of obtaining a correct representation. The measurements show that the system reliably learns a correct representation even if half of the bars appear nearly 20 times more frequently than the others, and it needs a longer learning phase only if half of the bars occur more than four times more frequently. The bars' appearance probability can also be varied globally. If all bars appear with the same probability p_o, and p_o is increased to values larger than 1/b, the probability of finding only a few bars or a single bar in an input image gets gradually smaller. For k = 10, b = 8, and the standard set of parameters, learning time increases as p_o gets larger. For p_o = 1.5/b, the system found a stable representation in all of 100 runs and needed fewer than 1600 ν-cycles to find representations in 50% of them. In addition to a longer learning phase, the RFs representing the different bars become less disjoint, until the final representation becomes unstable for values of p_o larger than about 1.8/b. Representations for input with higher values of p_o can be found, however, if the synaptic modification rate ε is reduced, which in general stabilizes the representation.
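The two input perturbations used in the noise and unequal-probability test series can be sketched as follows; the helper names are ours, and the code only restates the probabilities given in the text (each pixel flipped with probability q; half of the bars at p_o(1 − γ), the other half at p_o(1 + γ)):

```python
import random

def flip_noise(img, q, rng=random):
    """Return a copy of a 0/1 image with each pixel flipped with probability q."""
    return [[(1 - v) if rng.random() < q else v for v in row] for row in img]

def biased_bar_probs(b=8, gamma=0.0, rng=random):
    """Per-bar appearance probabilities for the unequal-probability series.

    Half of the bars (randomly chosen) appear with p_o * (1 - gamma),
    the other half with p_o * (1 + gamma), where p_o = 1 / b.
    """
    po = 1.0 / b
    bars = list(range(b))
    rng.shuffle(bars)
    probs = [0.0] * b
    for i in bars[: b // 2]:
        probs[i] = po * (1 - gamma)
    for i in bars[b // 2:]:
        probs[i] = po * (1 + gamma)
    return probs
```

Note that the probabilities always sum to b · p_o = 1, so the expected number of bars per image is unchanged; only the balance between bars is skewed. For γ = 0.9, the frequent bars appear with probability 0.2375 and the rare ones with 0.0125, the 19-fold ratio referred to above as "nearly 20 times".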
For ε = 0.005 instead of ε = 0.03, the system always finds stable representations for p_o = 2.0/b (after fewer than 18,600 ν-cycles in 50% of 100 runs). However, even for very low values of ε, there is a limit at p_o somewhat larger than 2.0/b, beyond which no stable representation can be found anymore. We have seen that the same network solves problems such as pattern classification and basic feature extraction. As demonstrated in the bars test, the network can build up a representation of the input, which allows us to classify patterns using distributed neural coding. The network found correct representations of all bars in all 5700 simulations that were carried out to acquire the data given in Figure 7. After the learning phase, the classification for the usual bars test shows a reliability of virtually 100%. All experimental data given in Figures 5 to 8 were obtained with the same parameters. Different sets of parameters lead to different results, and for an individual task, the parameters can be optimized to obtain shorter learning times or higher robustness. We have chosen, however, to use the standard set of parameters for all experiments in order to demonstrate the universality and robustness of the system's dynamics.
5 Discussion

From an elementary neuron model and a random but column-based interconnection, we derived a neural dynamics with properties that make the macrocolumn an ideal decision unit for input to its minicolumns. The dynamics is best exploited with an oscillating gain factor of the inhibition. If the afferent fibers to the macrocolumn are subject to elementary Hebbian plasticity that is also phase-locked to the oscillation of inhibition, we get a system that self-organizes the RFs of its minicolumns. The system is able to classify input patterns into different groups or to extract basic constituents of the input patterns, as was demonstrated using the bars test. The way the system represents the input depends only on the nature of the input, as the same system with the same set of parameters was used for the pattern-grouping task and for the bars test.

5.1 Computational Aspects. There are various systems capable of learning without supervision. Important (not necessarily disjoint) classes are different types of ANNs, probabilistic models, and, in a more general sense, independent component analysis (ICA) and principal component analysis implementations. Among these systems, only a few are able to build up efficient combinatorial representations of the input. The bars test is a means of testing this ability, and it represents in this respect a hard problem because, in general, its components, the bars, do not add up linearly. Linear methods like ICA therefore fail to pass the test (see, e.g., Hochreiter & Schmidhuber, 1999). The problem is (in its more or less difficult versions) solved by only a small subset of systems (Földiák, 1990; Saund, 1995; Dayan & Zemel, 1995; Marshall, 1995; Hinton et al., 1995; Harpur & Prager, 1996; Frey et al., 1997; Hinton & Ghahramani, 1997; Fyfe, 1997; Charles & Fyfe, 1998; Hochreiter & Schmidhuber, 1999; O'Reilly, 2001; Spratling & Johnson, 2002). Some of them need additional knowledge about the input; Földiák (1990) and Marshall (1995), for example, require that all bars occur with equal probability. Other systems, such as those of Dayan and Zemel (1995) and Hinton and Ghahramani (1997), use hierarchical approaches. If these systems are applied to the bars test, a pattern is first represented as containing horizontal or vertical patterns, and then exact instances of those patterns are represented at the next level. The system presented in this article is not hierarchical. However, the dynamics can be extended to allow for hierarchical learning in the sense that the input patterns are first subdivided into larger classes of patterns on the basis of the distance measure, equation 4.1. Such an extended system increases the parameter ν_max during learning. A system based on this mechanism is currently being studied in our lab. Note, however, that such a system learns hierarchically but does not represent a pattern hierarchically, as the systems of Dayan and Zemel (1995) and Hinton and Ghahramani (1997) do.
To compare systems that solve the bars test, their behavior under relaxation of the bars test assumptions is one important criterion; their reliability (some systems do not always find correct representations) and the time they need to find a representation are others. Comparison among the systems is difficult in many cases, however, because important data (e.g., concerning robustness or reliability) are often missing. Even if data are available (e.g., in terms of the number of presentations of input images required to build up a correct representation), comparison remains difficult because systems specialized to the bars test assumptions⁴ can be expected to be much faster than systems that can also be applied to more general input.⁵ Our system was therefore tested against relaxations of the bars test assumptions and was shown to behave favorably (see Figures 7D–7F). In terms of pattern presentations, only the systems of Földiák (1990) and Spratling and Johnson (2002) are faster than the presented network.⁶ In Földiák (1990), the probability of bar occurrence has to be known ahead of time, however, and Spratling and Johnson (2002) do not report on the robustness of their system when bars are of different sizes or appear with different probabilities. A further possibility for comparing systems is the complexity of computation. A typical system with N input units and k internal computational units with all-to-all connectivity needs O(Nk + k²) elementary computations for one update in the learning phase. Spratling and Johnson (2002) report O(Nk²) computations, whereas our system needs only O(Nk + k) because it does not use internal all-to-all connectivity.⁷

5.2 Neuroscientific Aspects. As discussed in section 1, we designed our model of the cortical macrocolumn in accordance with relevant neuroanatomical and neurophysiological facts.
We show that on discrimination and learning tasks, the resulting system can overcome two serious problems raised by the concept of single neurons as the brain's decision units: reaction time and limited fault tolerance. The essential components of our model are column-based interconnections, discrete neural spike signals, oscillatory activity, and Hebbian plasticity. These neural characteristics, usually seen as independent of each other, are shown here to form a natural alliance with important functional consequences. The model requires little genetic information, being based on sparse, asymmetric, and random interconnections within the minicolumn. Our model makes several simplifying assumptions, using an abstract neuron model, discrete time, and direct inhibition. Experimental predictions of the model should therefore be treated with caution.

⁴ The system of Földiák (1990), for example, required a learning time of 1200 presentations for 16 bars.
⁵ Hochreiter and Schmidhuber (1999), for example, needed 5000 passes through a training set of 500 patterns for 10 bars.
⁶ The system of Spratling and Johnson (2002) needs 210 cycles to arrive at a correct representation in the majority of runs for 16 bars and a specially chosen set of parameters.
⁷ If N grows, the number of available afferents per input unit can be kept constant by proportionally increasing r and reducing the synaptic weights of the afferents accordingly.

A fundamental property of our system is the ability to sustain neural activity without input. The property is based on the random interconnection matrix within a minicolumn: a relatively high number of EPSPs in one time step results in a relatively high number in the next. The amount of EPSPs is controlled by inhibitory feedback and the refractoriness of the neurons. As studies of continuous-time systems suggest (e.g., Wilson & Cowan, 1973), this mechanism can be implemented in a continuous-time version of the presented minicolumn model as well, such that, with a continuous inhibition between the minicolumns similar to equation 2.10, the qualitative dynamical behavior of the discrete model can be expected to carry over to a continuous one based, for example, on an integrate-and-fire or Hodgkin-Huxley neuron model. It can even be expected that convergence to stable stationary points of the dynamics is faster than in the discrete-time case, which would allow for a shorter ν-cycle period and, consequently, an even faster reaction time. For this reason, and because they allow a better comparison with neurophysiology, continuous-time systems are the subject of further studies. Our system realizes neural populations with well-defined global behavior while realistically using local update rules for individual neurons and synapses. The resulting population code is based on a collective firing rate, evaluated by the macrocolumnar dynamics as an average over each minicolumn's population at a particular phase relative to the oscillating inhibition. We tentatively identify our inhibitory cycle with cortical oscillations in the gamma frequency range, about 30 to 60 Hz. Recent neurophysiological experiments (Perez-Orive et al., 2002; see Singer, 2003, for a review) support this view of a phase-coupled population rate code.
For evidence of the phase dependence of Hebbian modification, see Wespatat et al. (2003), where membrane potential oscillations of 20 to 40 Hz were artificially induced in pyramidal cells. A central issue for understanding the brain is the neural code. The currently dominant view is the single-neuron hypothesis (Barlow, 1972), according to which essential decisions of the brain can be linked directly to firing decisions of individual neurons. A fundamental difficulty for this view is the reaction times of the brain. These can be so short that single neurons can fire only once, which makes it impossible to express graded signals (see, however, the time-of-arrival hypothesis of Thorpe, 1988, which also advocates a firing phase). A population code, on the other hand, can be the basis for very fast information processing. In our model with the standard set of parameters, individual neurons typically fire only 2 to 10 times before the macrocolumn makes a decision, and yet the decision is based in a precise, graded fashion on the input (if T_ν is reduced to T_ν = 10, the system shows qualitatively the same behavior, but neurons spike only one to four times before the first macrostate transition).
The other fundamental weakness of the single-neuron hypothesis is its lack of robustness against damage and accidents of wiring. The usual proposal to repair this weakness is a population code, and our model may be seen as an essential step toward establishing one. The minicolumn has a collective receptive field. This makes it fault-tolerant with respect to accidents in the afferent connections; the same can be said about the intracortical connections. Moreover, self-tuning of the activity dynamics of minicolumns (e.g., of the parameters ν_min and ν_max) can make them robust to lesions or imperfections in ontogenesis. In summary, our model, motivated by macrocolumn connectivity, has neurodynamic properties that solve important conceptual problems of neurophysiology. The spiking character of neurons, column-based interconnection structure, oscillatory inhibition, and Hebbian plasticity are shown to combine into an advanced information-processing system that can solve problems such as pattern classification and, specifically, the bars benchmark problem, on which it is highly competitive with other recent systems.

Appendix: Derivation of the Neuron Dynamics in Terms of Neuron Activation Probability

For the dynamics 2.1 with the above-described interconnection T_ij, the probability p(t + 1) that a neuron is activated at time (t + 1) can be approximated by the product of the probability p_A(t + 1) that it receives enough input to exceed threshold and the probability p_B(t + 1) that the neuron is not refractory. Using equation 2.3, we get, in the limit s → ∞,

  p(t + 1) = p_A(t + 1) · p_B(t + 1)
           = ( ∫_{s/2}^{∞} f_g(x) dx ) · (1 − p(t))
           = ( ∫_{s/2}^{∞} (1/(√(2π) σ)) e^{−((x − a)/σ)²/2} dx ) · (1 − p(t)),   [a = s p(t), σ = √(s p(t)(1 − p(t)))]
           = ( (1/√(2π)) ∫_{−∞}^{(a − s/2)/σ} e^{−x²/2} dx ) · (1 − p(t))
           = Φ_s( (p(t) − 1/2) / √(p(t)(1 − p(t))) ) · (1 − p(t)),

where

  Φ_s(x) = (1/√(2π)) ∫_{−∞}^{√s · x} e^{−y²/2} dy.

The approximation has proven to be applicable even for a relatively low neuron number m and for relatively small values of s.
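The recursion p(t + 1) = Φ_s((p(t) − 1/2)/√(p(t)(1 − p(t)))) · (1 − p(t)) from the appendix can be evaluated numerically, since Φ_s(x) is just the standard normal CDF taken at √s · x and is therefore expressible through the error function. The following sketch (function names are ours) iterates the map for a given s:

```python
import math

def phi_s(x, s):
    """Phi_s(x): standard normal CDF evaluated at sqrt(s) * x."""
    return 0.5 * (1.0 + math.erf(math.sqrt(s) * x / math.sqrt(2.0)))

def next_p(p, s):
    """One iteration of p(t+1) = Phi_s((p - 1/2)/sqrt(p(1-p))) * (1 - p).

    The boundary cases are handled explicitly: p = 0 gives no
    suprathreshold input (p_A = 0), and p = 1 leaves every neuron
    refractory (p_B = 0), so both map to 0.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    z = (p - 0.5) / math.sqrt(p * (1.0 - p))
    return phi_s(z, s) * (1.0 - p)
```

For example, p = 1/2 maps to Φ_s(0) · 1/2 = 1/4 for any s, since Φ_s(0) = 1/2.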
Acknowledgments

We thank Rolf P. Würtz for many discussions and suggestions and Jan D. Bouecke for helping us implement parts of the system. Partial funding by the EU in the RTN MUHCI (HPRN-CT-2000-00111), the German BMBF in the project LOKI (01 IN 504 E9), and the Körber Prize awarded to C. v. d. M. in 2000 is gratefully acknowledged.

References

Anninos, P. A., Beek, B., Csermely, T. J., Harth, E. M., & Pertile, G. (1970). Dynamics of neural structures. J. Theor. Biol., 26, 121–148.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology. Perception, 1, 371–394.
Budd, J. M. L., & Kisvarday, Z. F. (2001). Local lateral connectivity of inhibitory clutch cells in layer 4 of cat visual cortex. Exp. Brain Res., 140(2), 245–250.
Buxhoeveden, D. P., & Casanova, M. F. (2002). The minicolumn and evolution of the brain. Brain, Behavior and Evolution, 60, 125–151.
Charles, D., & Fyfe, C. (1998). Modelling multiple-cause structure using rectification constraints. Network: Computation in Neural Systems, 9, 167–182.
Constantinidis, C., Franowicz, M. N., & Goldman-Rakic, P. S. (2001). Coding specificity in cortical microcircuits: A multiple-electrode analysis of primate prefrontal cortex. J. Neuroscience, 21, 3646–3655.
Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
DeFelipe, J., Hendry, S. H. C., Hashikawa, T., Molinari, M., & Jones, E. G. (1990). A microcolumnar structure of monkey cerebral cortex revealed by immunocytochemical studies of double bouquet cell axons. Neuroscience, 37, 655–673.
DeFelipe, J., Hendry, S. H. C., & Jones, E. G. (1989). Synapses of double bouquet cells in monkey cerebral cortex. Brain Res., 503, 49–54.
Elston, G. N., & Rosa, M. P. G. (2000). Pyramidal cells, patches, and cortical columns: A comparative study of infragranular neurons in TEO, TE, and the superior temporal polysensory areas of the macaque monkey. J. Neuroscience, 20, 1–5.
Favorov, O. V., & Diamond, M. (1990). Demonstration of discrete place-defined columns, segregates, in cat SI. J. Comparative Neurology, 298, 97–112.
Favorov, O. V., & Kelly, D. G. (1996). Stimulus-response diversity in local neuronal populations of the cerebral cortex. Neuroreport, 7(14), 2293–2301.
Favorov, O. V., & Whitsel, B. L. (1988). Spatial organization of the peripheral input to area 1 cell columns I. The detection of "segregates". Brain Res., 472, 25–42.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Fransén, E., & Lansner, A. (1998). A model of cortical associative memory based on a horizontal network of connected columns. Network: Computation in Neural Systems, 9(2), 235–264.
Frey, B. J., Dayan, P., & Hinton, G. E. (1997). A simple algorithm that discovers efficient perceptual codes. In M. Jenkin & L. R. Harris (Eds.), Computational and psychophysical mechanisms of visual coding. Cambridge: Cambridge University Press.
Fukai, T. (1994). A model of cortical memory processing based on columnar organization. Biological Cybernetics, 70, 427–434.
Fyfe, C. (1997). A neural network for PCA and beyond. Neural Processing Letters, 6, 33–41.
Harpur, G. F., & Prager, R. W. (1996). Development of low entropy coding in a recurrent network. Network: Computation in Neural Systems, 7, 277–284.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Royal Soc. London, Series B, Biological Sciences, 352, 1177–1190.
Hochreiter, S., & Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Computation, 11, 679–714.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.
Jones, E. G. (2000). Microcolumns in the cerebral cortex. PNAS, 97, 5019–5021.
Lao, R., Favorov, O. V., & Lu, J. P. (2001). Nonlinear dynamical properties of a somatosensory cortical model. Information Science, 132, 53–66.
Lübke, J., Egger, V., Sakmann, B., & Feldmeyer, D. (2000). Columnar organization of dendrites and axons of single and synaptically coupled excitatory spiny neurons in layer 4 of the rat barrel cortex. J. Neuroscience, 20, 5300–5311.
Lücke, J., von der Malsburg, C., & Würtz, R. P. (2002). Macrocolumns as decision units. In Artificial Neural Networks—ICANN 2002, LNCS 2415 (pp. 57–62). New York: Springer-Verlag.
Marshall, J. A. (1995). Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty, multiplicity, and scale. Neural Networks, 8, 335–362.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722.
Nowak, L. G., & Bullier, J. (1997). The timing of information transfer in the visual system. Cerebral Cortex, 12, 205–241.
O'Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13, 1199–1241.
Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and sparsening of odor representations in the mushroom body. Science, 297(5580), 359–365.
Peters, A., Cifuentes, J. M., & Sethares, C. (1997). The organization of pyramidal cells in area 18 of the rhesus monkey. Cerebral Cortex, 7, 405–421.
Peters, A., & Sethares, C. (1996). Myelinated axons and the pyramidal cell modules in monkey primary visual cortex. J. Comp. Neurology, 365, 232–255.
Peters, A., & Sethares, C. (1997). The organization of double bouquet cells in monkey striate cortex. J. Neurocytology, 26, 779–797.
Peters, A., & Yilmaze, E. (1993). Neuronal organization in area 17 of cat visual cortex. Cerebral Cortex, 3, 49–68.
Reichardt, W., Poggio, T., & Hausen, K. (1983). Figure-ground discrimination by relative movement in the visual system of the fly. Biol. Cybern., 46 (Suppl.), 1–30.
Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7, 51–71.
Singer, W. (2003). Synchronization, binding and expectancy. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1136–1143). Cambridge, MA: MIT Press.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neuroscience, 15, 5448–5465.
Spratling, M. W., & Johnson, M. H. (2002). Preintegration lateral inhibition enhances unsupervised learning. Neural Computation, 14, 2157–2179.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends in Neuroscience, 17, 119–126.
Thorpe, S. (1988). Identification of rapidly presented images by the human visual system. Perception, 17(A77), 415.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Wespatat, V., Tennigkeit, F., & Singer, W. (2003). Phase sensitivity of Hebbian modifications in oscillating cells of rat visual cortex. Manuscript submitted for publication.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Yu, A. J., Giese, M. A., & Poggio, T. A. (2002). Biophysiologically plausible implementations of the maximum operation. Neural Computation, 14, 2857–2881.
Received March 25, 2003; accepted August 4, 2003.
LETTER
Communicated by Helge Ritter
Temporally Asymmetric Learning Supports Sequence Processing in Multi-Winner Self-Organizing Maps

Reiner Schulz
[email protected]

James A. Reggia
[email protected]
Departments of Computer Science and Neurology, and UMIACS, University of Maryland, College Park, MD 20742, U.S.A.
We examine the extent to which modified Kohonen self-organizing maps (SOMs) can learn unique representations of temporal sequences while still supporting map formation. Two biologically inspired extensions are made to traditional SOMs: selection of multiple simultaneous rather than single "winners" and the use of local intramap connections that are trained according to a temporally asymmetric Hebbian learning rule. The extended SOM is then trained with variable-length temporal sequences that are composed of phoneme feature vectors, with each sequence corresponding to the phonetic transcription of a noun. The model transforms each input sequence into a spatial representation (final activation pattern on the map). Training improves this transformation by, for example, increasing the uniqueness of the spatial representations of distinct sequences, while still retaining map formation based on input patterns. The closeness of the spatial representations of two sequences is found to correlate significantly with the sequences' similarity. The extended model presented here raises the possibility that SOMs may ultimately prove useful as visualization tools for temporal sequences and as preprocessors for sequence pattern recognition systems.

1 Introduction

A self-organizing map (SOM) is an artificial neural network whose nodes are usually arranged in a two-dimensional grid. The SOM learns, in an unsupervised fashion, an organized transformation that converts a pattern from a typically high-dimensional input space to a pattern, or map, across the discrete surface of the grid. There currently exist two qualitatively different types of SOMs (see Table 1), both of which are actively in use. The first type, iterative multi-winner SOMs, has been largely of interest in theoretical neuroscience (Cho & Reggia, 1994; Pearson, Finkel, & Edelman, 1987; Sutton, Reggia, Armentrout, & D'Autrechy, 1994; von der Malsburg, 1973).
The activity of each map node is the result of a relatively neurobiologically plausible, albeit computationally slow, iterative process during which laterally connected map nodes that are close to one another compete for activation. As a result, an activation pattern emerges in which multiple separated clusters of nodes (the "winners") become active in response to an input pattern, so a distributed (or "coarse coding") representation is being used. The second and more widely used type, one-step single-winner SOMs, is more oriented toward computational applications and less toward neuroscience (Callan, Kent, Roy, & Tasko, 1999; Kaski, Honkela, Lagus, & Kohonen, 1998; Kohonen, 1982; Kokkonen & Torkkola, 1990). These models are characterized by a significantly more efficient but neurobiologically implausible mechanism for the computation of the SOM's response to an input: the node that the input initially activates the most is declared the lone winner of a global competition, and it alone represents the input pattern, so a local representation is being used. The majority of past work on SOMs of both classes, as well as related non-SOM methods (Bishop, Svensén, & Williams, 1998), has involved static, that is, time-invariant, input patterns where a map's activation pattern in response to one input is not influenced by previous inputs. The results of these studies do not carry over directly to temporal sequences of inputs, a significant shortcoming given that sequential inputs are very common (e.g., language, motion in visual fields, motor feedback). In response to this problem, several extensions to the basic SOM method have been proposed during the past decade to support temporal sequence processing. These extended SOMs are very diverse, so we consider them first in terms of the tasks they address and second in terms of the methodologies they adopt.

Neural Computation 16, 535–561 (2004)
© 2004 Massachusetts Institute of Technology
R. Schulz and J. Reggia

Table 1: Typical Features of the Two Types of Self-Organizing Maps.

Feature                   | Iterative Multi-Winner                                           | One-Step Single-Winner
Seminal work              | von der Malsburg (1973)                                          | Kohonen (1982)
Primary applications      | Neuroscience: modeling sensorimotor cortex                       | Computer science: data visualization, speech processing
Input-to-map connectivity | Divergent, but localized                                         | Full
Intramap connectivity     | Lateral (excite immediate neighbors, inhibit more distant ones)  | None (implicit neighborhoods)
Activation dynamics       | Multiple winners: nonlinear differential equations               | Single global winner: node most activated by the input
Learning rule             | Hebbian/competitive                                              | Hebbian/competitive
Computational cost        | High                                                             | Low
Memory capacity           | High                                                             | Low
Examples                  | Cho & Reggia (1994); Pearson et al. (1987); Sutton et al. (1994); von der Malsburg (1973) | Callan et al. (1999); Kaski et al. (1998); Kohonen (1982); Kokkonen and Torkkola (1990); Principe et al. (1998)
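The one-step single-winner response just described reduces to an arg-max over afferent activations. A minimal sketch in Python (the function and variable names are our own, not code from the paper):

```python
# Minimal sketch of the classic one-step single-winner SOM response:
# the map node whose afferent weight vector best matches the input is
# declared the lone winner. Names here are illustrative only.

def single_winner(W, x):
    """Return the index of the map node most activated by input x,
    using the dot product as the activation measure."""
    def activation(i):
        return sum(wi * xi for wi, xi in zip(W[i], x))
    return max(range(len(W)), key=activation)
```

For example, `single_winner([[1.0, 0.0], [0.0, 1.0]], [0.2, 0.9])` selects the second node, whose weight vector lies closest to the input.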
Multi-Winner Maps
The specific temporal processing tasks that have been addressed are prediction, recall, recognition, and representation. Prediction is concerned with the accurate computation of the next element in a sequence from previously observed sequence elements. In Principe, Wang, and Motter (1998), for example, an SOM was successful at predicting artificial chaotic time series as well as controlling a wind tunnel that required the prediction of wind speed changes. In Rao and Sejnowski (2000), an SOM-like network of two recurrently connected chains of neurons learned to predict the next in a series of left-to-right or right-to-left moving stimuli. The recall task takes prediction a step further, requiring that the SOM reproduce all elements of a sequence in the correct temporal order when given an initial cue, for example, the first element of the sequence. This has been accomplished in Kopecz (1995) and Abbott and Blum (1996) with 2D fully laterally connected SOMs for one or two low-dimensional sequences. In Gerstner, Ritz, and van Hemmen (1993), a fully laterally connected network of 1000 nodes (not arranged according to any topology) was shown to be capable of storing and retrieving four sequences, and its theoretical capacity was estimated at 100 sequences. Recognition of temporal sequences has generally focused on identifying a given input sequence as a member of a class by mapping it onto a particular map location or locations that correspond to class prototypes learned from previously seen sequences. There have been many efforts to achieve this (Chappell & Taylor, 1993; Euliano & Principe, 1999; Kangas, 1990; Somervuo, 1999, 2003; Varsta, Millan, & Heikkonen, 1997; Wiemer, 2003).
Finally, and most directly related to our work, is the problem of transforming temporal sequences into relatively unique spatial representations, that is, into relatively unique final activation patterns on the map grid that represent the sequences and thus might be viewed as reminiscent of "cell assemblies" (Hebb, 1949). Such a time-to-space representation may be beneficial in data visualization and as an initial input processing step in a larger neural system for sequence recognition (Chappell & Taylor, 1993). To our knowledge, the only other study to address this task was that of James and Miikkulainen (1995), but several of the sequence recognition models above are also necessarily concerned with how the prototypes are arranged on the map relative to one another. These past temporal sequence processing SOMs can also be viewed from the perspective of the diverse methodologies they have proposed. The simplest approach has been to leave the original one-shot single-winner SOM model untouched and to preprocess sequential inputs via an external short-term memory. For example, in some studies, a fixed number of successive input patterns were concatenated to form a single static pattern (Kangas, 1990). Others have suggested averaging the patterns in a sequence over time and feeding the average as a static pattern to the network (Carpinteiro, 1999). However, these approaches assume that the range of interpattern relations across time is quite limited. Many other forms of short-term memory are reviewed elsewhere (Barreto, Araujo, & Kremer, 2003; Mozer, 1993). Another approach has employed leaky-integrator or other temporal neuron models
as the map nodes (Chappell & Taylor, 1993; Varsta et al., 1997), while yet another idea has been to capture temporal relations in the input via lateral connections between the map nodes, rendering the SOM a truly recurrent neural network (Abbott & Blum, 1996; Gerstner et al., 1993; Kopecz, 1995; Rao & Sejnowski, 2000). Finally, in Euliano and Principe (1999) and Wiemer (2003), spreading wavefronts of activation (activity diffusion) were used to alter the typical activation dynamics of the one-shot single-winner SOM so that learning is characteristically affected by the temporal order of the inputs. At present, there is no general consensus as to how best to process sequences with SOMs, and this topic remains a very active focus of current neurocomputational research (Barreto et al., 2003). In this context, the goal of our work is to extend the one-step single-winner SOMs in biologically plausible ways to make them more effective in processing and representing large sets of variable-length sequences. Unlike most past related work described above, we focus solely on the task of developing a unique spatial representation for each of the input sequences, with the idea that this is also a precursor for effective pattern recognition. To achieve our goal, we extend traditional one-step single-winner SOMs (see Table 1) in two biologically motivated ways. First, our model supports map formation in the presence of multiple simultaneous winners rather than a single winner. Multiple islands of simultaneous activation (i.e., multiple "winners") are often observed in neocortical regions (e.g., Donoghue & Sanes, 1992; Georgopoulos, Kettner, & Schwartz, 1988). Such a coarse coding in principle allows for the spatial encoding or representation of a much larger number of sequences.
However, for computational efficiency and unlike past multi-winner SOMs, we retain the one-step, noniterative selection of winners, making our model a one-step multi-winner SOM that can be distinguished from both classes of previously studied SOMs. Second, as a mechanism for supporting sequence processing, we introduce into SOMs for the first time the use of temporally asymmetric Hebbian learning to train local, or range-limited, intramap connections. These local lateral connections are very different from those used in past multi-winner SOMs: they are not used to produce a "Mexican hat" pattern of lateral interactions, and they are adaptive. Further, their adaptation is temporally asymmetric in a fashion inspired by recent experimental evidence showing that changes in biological synaptic efficacy in cortex (Markram, Luebke, Frotscher, & Sakmann, 1997) and other structures of the brain (Bi & Poo, 1998, 2001; Zhang, Tao, Holt, Harris, & Poo, 1998) are sometimes due to temporally asymmetric Hebbian learning: a synapse is strengthened (long-term potentiation, LTP) if presynaptic action potentials precede excitatory postsynaptic potentials by typically 20 to 50 ms, and weakened (long-term depression, LTD) if the time course is reversed. While a few past modeling studies have used temporally asymmetric Hebbian learning to store and retrieve sequences (Abbott & Blum, 1996;
Gerstner et al., 1993; Rao & Sejnowski, 2000), these past studies were not concerned with either map formation or the transformation of sequences into spatial representations as we consider here. Our model can be distinguished from that of James and Miikkulainen (1995), which successfully dealt with the representation task but did not use multi-winner SOMs, lateral connectivity, or temporally asymmetric Hebbian learning as we do here. Our approach is also very different from the pattern recognition system of Somervuo (1999) which, after initial training of a standard one-shot single-winner SOM on nonsequential inputs, uses an external construction algorithm to convert the SOM into a network with connections between arbitrarily distant nodes (i.e., its lateral connections are neither local nor learned with temporally asymmetric Hebbian learning). In summary, the fundamental hypothesis we examine in this article is that training a one-step multi-winner SOM whose short-range lateral connections undergo temporally asymmetric Hebbian learning transforms variable-length temporal sequences into reasonably unique spatial patterns of activity, even while map formation of the input space persists.

2 Methods

While our model is very general, to assess its functionality, we use specific sequences of feature vectors. Each vector in a sequence is the feature-based encoding of an English phoneme. Each sequence corresponds to the phonetic transcription, based on the NetTalk corpus (Sejnowski & Rosenberg, 1987), of a 2- to 10-phoneme noun naming an object, taken from the widely used Snodgrass-Vanderwart word corpus (Snodgrass & Vanderwart, 1980). For example, /h ε l ə k a p t ə r/ is the phonetic sequence transcription of helicopter, and /p/, the fourth-from-last phoneme in the transcription, is equivalent to a distinct tuple of 34 binary feature values: [consonantal = 1, vocalic = 0, compact = 0, diffuse = 1, grave = 1, acute = 0, nasal = 0, oral = 1, tense = 1, lax = 0, …].
See the appendix for a complete set of phoneme encodings and further details about the input data. In this context, the SOM's task is the unsupervised acquisition of an internal representation for the spoken names of a set of objects, the representation for each name ideally being unique. Initially, before the first vector of an input sequence is presented to the SOM, all map nodes are inactive. From this initial state, the activation pattern develops deterministically at discrete time steps (one time step per input phoneme vector, hence "one-step") based on the current input vector and the activation pattern at the previous time step. This implies that, for example, in the case of bow (/b o/) and bowl (/b o l/), after the input of /o/, the respective activation patterns are identical. For bow, this is the final activation pattern, and hence its spatial representation. According to our hypothesis, the last feature vector /l/ of bowl should trigger a change in the
Figure 1: Network structure. (A) Global architecture. (B) A map node with its weighted connections from the input layer and from the map nodes in its immediate 8-neighborhood. The different widths of the solid lines (arcs) indicate that in general, the efficacies of the connections differ.
map activation pattern of the trained one-step multi-winner SOM so that the spatial representation for bowl differs from that of bow. Figure 1a shows the basic architecture of our one-step multi-winner SOM. The recurrent network transforms an input sequence of patterns into a final single static output pattern (the sequence's spatial representation) where each component of the output corresponds to a node of the map. The map itself consists of a regular, rectangular grid of R rows by C columns of Q = R·C nodes. We measure the distance on the map between two nodes i and i′ at positions (r, c) and (r′, c′) using the box-distance metric, d(i, i′) = max(|r − r′|, |c − c′|). As illustrated in Figure 1b, an arbitrary map node i receives a connection from each node of the input layer as well as from each map node within a circumscribed connection neighborhood, N_conn(i) = {j | d(i, j) ≤ r_conn}, centered at and including i. Every connection to i carries a real-valued synaptic weight indicating its efficacy. If the input layer consists of P nodes, then the connection to the ith map node from the jth input is weighted by w_ij, and w_i ∈ R_+^P is the afferent weight vector of i. Analogously, the weights on the incoming lateral connections of map node i are stored in the lateral weight vector v_i ∈ R_+^Q. In particular, v_ij corresponds to the weight on the lateral connection from j to i (d(i, j) ≤ r_conn), v_ii = β ∈ R is an immutable weight on the self-connection of i, and v_ij = 0 if i and j are not connected (d(i, j) > r_conn). The level of activation of an arbitrary input or map node ranges between 0 (inactive) and 1 (fully active). The activation levels of all P input nodes compose the afferent vector x ∈ [0, 1]^P, normalized to be of unit length. Similarly, the activation levels of all Q map nodes form a vector y ∈ [0, 1]^Q, the activation pattern of the map. For any map node i, only those components of
y that correspond to activation levels of map nodes in i's connection neighborhood N_conn(i) are directly relevant since i receives lateral connections only from those nodes. The activation levels of all nodes are updated synchronously at discrete time steps, one step per input vector in a sequence. Thus, the afferent input vector as well as the map activation pattern are time variant. Given an input sequence X = x(1), …, x(k), the map activation pattern, initialized as y(0) = 0, evolves over a period of k time steps. The final map activation pattern y(k) is then said to be the spatial representation of temporal sequence X. At the beginning of time step t ≥ 1, the net input h is computed independently for each map node i as

    h_i(t) = α w_i^T x(t) + (1 − α) v_i^T y(t − 1),        (2.1)

where the fixed parameter α ∈ [0, 1] determines the relative contributions of afferent versus lateral input vectors, and T indicates the transpose of the column vectors w_i and v_i. To compute y(t) from h(t), we use a computationally efficient one-step mechanism that approximates the competitive activation dynamics (Mexican hat pattern) that has been implemented in some past iterative multi-winner SOMs via a computationally expensive numerical solution of differential equations (Reggia, D'Autrechy, Sutton, & Weinrich, 1992). However, unlike traditional one-step SOM models, multiple winners occur: every map node i that receives a net input greater than that of all other map nodes within i's connection neighborhood is taken to be a winner. Since parameter r_conn is usually chosen to be small relative to the size of the entire map, typically multiple winner nodes exist. Each winner is then made the center of a "peak" of activation. The distribution of activation within a single peak is such that winner node i at its center is maximally active (y_i = 1), and the activation level of each nonwinner node j within i's connection neighborhood decreases exponentially with increasing distance between j and i. The activation peak centered at i does not extend beyond the connection neighborhood of i. However, two or more peaks may partially overlap, in which case their contributions to the activation level of a map node in the region of overlap are added, but may not exceed 1. Specifically, if the set V(t) of winner nodes at time t is

    V(t) = {i | ∀ j ≠ i: j ∈ N_conn(i) ⇒ h_j(t) < h_i(t)},        (2.2)

then the activation of map node j is

    y_j(t) = min( 1, Σ_{i ∈ V(t)} [ γ^{d(i,j)} if j ∈ N_conn(i), else 0 ] ),        (2.3)

where γ ∈ [0, 1] determines the shape of each peak of activation (smaller γ means faster drop-off).
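The map geometry and the one-step multi-winner activation dynamics described above (box distance, connection neighborhoods, and equations 2.1 through 2.3) can be sketched in pure Python. This is an illustrative reimplementation under our own naming, not code from the paper; nodes are indexed row-major.

```python
# Illustrative sketch (our own naming) of the map geometry and the
# one-step multi-winner activation dynamics, eqs. 2.1-2.3.

def box_distance(a, b):
    """d(i, i') = max(|r - r'|, |c - c'|) for nodes at (r, c), (r', c')."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def neighborhoods(rows, cols, r_conn):
    """Return (nbrs, coords), where nbrs[i] = N_conn(i) = {j | d(i, j) <= r_conn}
    and node i sits at coords[i] = (r, c), indexed row-major."""
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    nbrs = [[j for j, cj in enumerate(coords) if box_distance(ci, cj) <= r_conn]
            for ci in coords]
    return nbrs, coords

def net_inputs(W, V, x, y_prev, alpha):
    """Eq. 2.1: h_i(t) = alpha * w_i^T x(t) + (1 - alpha) * v_i^T y(t-1)."""
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    return [alpha * dot(w, x) + (1 - alpha) * dot(v, y_prev)
            for w, v in zip(W, V)]

def activations(h, nbrs, coords, gamma):
    """Eqs. 2.2-2.3: nodes that beat every other node in their own
    neighborhood are winners; each winner becomes a peak whose activation
    falls off as gamma**d(i, j), with overlapping peaks capped at 1."""
    winners = [i for i in range(len(h))
               if all(h[j] < h[i] for j in nbrs[i] if j != i)]
    y = [0.0] * len(h)
    for i in winners:
        for j in nbrs[i]:
            y[j] = min(1.0, y[j] + gamma ** box_distance(coords[i], coords[j]))
    return y
```

A sequence's spatial representation then follows by starting from an all-zero pattern y(0) = 0 and applying one such update per input vector, as the text describes.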
To test our central hypothesis—that our model learns to spatially represent the sequences in the training set fairly uniquely—we use the 1-norm to quantify the difference between two final activation patterns y and y′: d(y, y′) = ||y − y′||_1 = Σ_{i=1}^{Q} |y_i − y_i′|. Using distance measure d, we assess the overall performance of our model by measuring, over all distinct sequences X (of length k) and X′ (of length k′) from the training set, the distance between the spatial representations of X and X′. We use three quantitative measures of how the model performs overall in separating different sequences into unique final spatial representations. First, we count the number of pairs of distinct sequences in the training set for which the model ends up with the same final map activation pattern: |Z| = |{{X, X′}: X ≠ X′, d(y_X(k), y_X′(k′)) = 0}|. The model uniquely represents all sequences if |Z| = 0; otherwise, there are pairs of distinct sequences that the model "confuses." The second measure is the minimum distance d_min between two spatial representations computed over all pairs of distinct sequences in the training set: d_min = min_{X ≠ X′} d(y_X(k), y_X′(k′)). Notice that d_min = 0 for as long as |Z| > 0 and |Z| = 0 as soon as d_min > 0, and that d_min and |Z| are often complementary, not redundant. Training could, for example, significantly increase d_min from a pretraining value already greater than zero, while |Z| remains 0. Or training may decrease |Z| to a smaller value still greater than zero, while d_min remains 0. Finally, we measure the average distance between two spatial representations computed over all pairs of distinct sequences in the training set: d̄ = (1/|S|) Σ_{{X,X′}: X ≠ X′} d(y_X(k), y_X′(k′)), where |S| is the number of sequences in the training set.
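The three uniqueness measures can be computed in a few lines. The sketch below uses our own names and, following the text, divides the summed pair distances by the number of sequences |S| rather than the number of pairs:

```python
# A minimal sketch (our own naming) of the three uniqueness measures:
# the 1-norm distance d, the confusion count |Z|, d_min, and d-bar.

def d(y1, y2):
    """1-norm distance between two final activation patterns."""
    return sum(abs(a - b) for a, b in zip(y1, y2))

def uniqueness_measures(reps):
    """reps: one final activation pattern per training sequence.
    Returns (|Z|, d_min, d_bar) over all pairs of distinct sequences;
    as in the text, d_bar divides by |S|, the number of sequences."""
    pair_dists = [d(reps[i], reps[j])
                  for i in range(len(reps)) for j in range(i + 1, len(reps))]
    num_confused = sum(1 for dist in pair_dists if dist == 0)   # |Z|
    return num_confused, min(pair_dists), sum(pair_dists) / len(reps)
```

For three patterns where the first and third coincide, the function reports one confused pair and a minimum distance of zero, matching the complementary behavior of |Z| and d_min described above.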
Before training, each weight is independently initialized with a random value from the interval [0, 1], each afferent weight vector is normalized to unit length, and each lateral weight vector is normalized such that Σ_{j ≠ i} v_ij = 1 for all i. The one-step multi-winner SOM learns by adjusting its weights in response to the repeated input of all temporal sequences in the training set in random order. The number of training epochs is 1000. The input of a single arbitrary temporal sequence of length k causes the SOM to pass through k time steps. At the end of each time step t, after the construction of activation pattern y(t), the weights of the SOM are modified. For the afferent weight vector w_i of the ith map node, the learning rule is:

    w_i(t) = w_i(t − 1) + μ y_i(t) x(t)        (2.4)
    w_i(t) = w_i(t) / ||w_i(t)||_2             (2.5)

Equation 2.4 implements typical temporally symmetric Hebbian (or competitive) learning, where μ ∈ (0, 1] is the afferent learning rate. Renormalization in equation 2.5 restricts w_i to move across the surface of the unit hypersphere, generally in the direction of the current afferent input x(t). In contrast, the learning rule for the lateral weights is unusual in being a temporally asymmetric variant of Hebbian learning. As noted earlier, recent
experimental studies have found this learning rule to sometimes govern changes in synaptic efficacy in cortex (Markram et al., 1997) and other parts of the brain (Bi & Poo, 1998, 2001; Zhang et al., 1998). Given two map nodes i and j where 0 < d(i, j) ≤ r_conn, the efficacy of the connection v_ij to i from j at time t is increased proportional to y_j(t − 1), the activity of j at the previous time step, times max(0, y_i(t) − y_i(t − 1)), the increase in the activity of i relative to the previous time step:

    v_ij(t) = v_ij(t − 1) + η y_j(t − 1) max(0, y_i(t) − y_i(t − 1))   if j ≠ i and j ∈ N_conn(i)
    v_ij(t) = v_ij(t − 1)                                              otherwise        (2.6)

    v_ij(t) = v_ij(t) / Σ_{j ≠ i} v_ij(t)   for i ≠ j        (2.7)

where η ∈ (0, 1] is the lateral learning rate. In general, equation 2.6 is intended to capture a notion of cause and effect across time: the connection to i from j is strengthened if j's preceding activity contributes to an increase in the activation of i. This is consistent with the results of a previous analysis of temporally asymmetric Hebbian learning, which concludes that the overall change in synaptic efficacy is proportional to the rate of change in postsynaptic activity (Roberts, 1999). Note that the subsequent renormalization may result in a net decrease of a connection's efficacy due to competition for growth with the other incoming lateral connections of i. This relates to previous observations that temporally asymmetric Hebbian learning is inherently competitive and self-stabilizing due to a balance of weight increases (LTP) and decreases (LTD) (Song, Miller, & Abbott, 2000; Royer & Pare, 2003). We use a simple method, that is, explicit renormalization, to implement such competition and stability. As is typical for the training of traditional one-step, single-winner SOMs, the values of certain parameters in our learning rule depend on how far training has progressed. For example, training of one-step single-winner SOMs is often divided into two phases: a rough ordering or self-organization phase corresponding to large values for γ and μ in our model, and a convergent phase corresponding to small values for γ and μ. Analogously, in our model, parameters γ, μ, and η monotonically decrease in a nonlinear fashion from some initial value to a smaller final value. For example, γ(t) = γ_fin + (γ_init − γ_fin)/(1 + e^{(t − γ_infl)/γ_σ}), where t is the fraction of completed training epochs, γ_init (γ_fin) determines γ's initial (final) value, γ_infl is the point of inflection, and γ_σ determines the rate of decline. Similar functions are used for μ and η.
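The two learning rules and the sigmoid parameter schedule can be sketched as follows. This is an illustrative reimplementation under our own naming and assumes, per the text, that the self-weight β is never modified and that at least one incoming lateral weight stays positive:

```python
import math

# Illustrative sketch (our own naming) of the learning rules.
# Eqs. 2.4-2.5: Hebbian afferent update with L2 renormalization.
# Eqs. 2.6-2.7: temporally asymmetric lateral update, then
# renormalization of the incoming weights (self-weight beta untouched).

def update_afferent(w_i, x, y_i, mu):
    """w_i <- (w_i + mu * y_i * x) / ||w_i + mu * y_i * x||_2."""
    w = [wj + mu * y_i * xj for wj, xj in zip(w_i, x)]
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [wj / norm for wj in w]

def update_lateral(v_i, i, y_prev, y_now, eta, nbrs_i):
    """Eq. 2.6 then eq. 2.7 for the incoming lateral weights of node i;
    nbrs_i is N_conn(i)."""
    v = list(v_i)
    gain = max(0.0, y_now[i] - y_prev[i])        # increase in i's activity
    for j in nbrs_i:
        if j != i:
            v[j] += eta * y_prev[j] * gain       # eq. 2.6
    total = sum(vj for j, vj in enumerate(v) if j != i)
    return [vj if j == i else vj / total         # eq. 2.7
            for j, vj in enumerate(v)]

def schedule(t, init, fin, t_infl, sigma):
    """Sigmoid decline used for gamma, mu, and eta; t is the fraction
    of completed training epochs."""
    return fin + (init - fin) / (1 + math.exp((t - t_infl) / sigma))
```

Note how renormalization in `update_lateral` lets an unstrengthened connection lose efficacy, implementing the competition between LTP and LTD described above.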
3 Results

We first demonstrate that with a suitable set of parameters, training of our model improves the uniqueness of the transformation of sequences into spatial representations. Next, we measure the changes in model performance as the size of the training set is systematically varied and examine the formation of feature maps and patterns in the lateral connectivity of our model. Finally, we investigate how the difference between the spatial representations of any two distinct sequences in trained maps relates to the similarity or dissimilarity of these two sequences.

Table 2: Best Parameter Set for the One-Shot Multi-Winner SOM.

r_conn = 4, α = .64, β = .05; γ_init = .37, γ_fin = 0, γ_infl = .2, γ_σ = .16; μ_init = .44, μ_fin = 0, μ_infl = .4, μ_σ = .0001; η_init = .62, η_fin = 0, η_infl = .8, η_σ = .04

Performance: 30 by 20 nodes, 60 training sequences
               d̄              d_min        |Z|
Pretraining    16.5 (1.003)    0 (0)        11.8 (4.27)
Posttraining   23.1 (0.998)    0.4 (0.88)   1.4 (1.27)

Performance: 40 by 30 nodes, 175 training sequences
               d̄              d_min        |Z|
Pretraining    31.1 (1.10)     0 (0)        58.9 (9.55)
Posttraining   45.1 (1.18)     0 (0)        11.0 (6.58)

3.1 Learning Unique Representations. An initial exploration of the parameter space using particle swarm optimization methods (Kennedy, Eberhart, & Shi, 2001) readily established a value for each of the model parameters (see Table 2) such that training significantly improves the performance of the model across all three performance (uniqueness) measures. The great extent of the parameter search space and the computational cost of training the map forced us to limit both the size of the model and the training set to restrain the time needed to train the model repeatedly. These initial experiments were done with a 30 × 20 node network and 60 randomly chosen distinct sequences. We initially expected that strongly self-inhibitory map nodes (β strongly negative) and a very limited influence of the afferent inputs on the activation dynamics (small α) would be critical to the formation of unique spatial representations. However, both the particle swarm optimization algorithm and a systematic manual trial-and-error exploration of the permissible range of values determined that self-inhibition was not optimal and that a much higher value (0.64) for α, corresponding to a much stronger influence of the afferent input, produces better performance. The performance results in the middle of Table 2 give the means and standard deviations (in parentheses) for the three performance measures, each
Figure 2: Pretraining (light gray) and posttraining (dark gray) histograms of the distances between the spatial representations for every two distinct sequences from the training set containing 60 sequences. The histogram counts have been averaged over 20 independent experiments. The error bars represent one standard deviation.
averaged over 20 independent experiments, obtained with the listed parameter values and the 30 × 20 node network. An independent experiment constitutes initializing the model using different random initial weights, measuring pretraining performance, training the model, and measuring posttraining performance. Figure 2 shows that training significantly reduces the number of sequences that the model transforms into identical or very similar spatial representations. Training also increases the overall distance between the spatial representations of two distinct sequences in general, indicated in the figure by a posttraining right shift of the distance distribution's peak in conjunction with the reduction of the distribution's left tail. As an example of how our model learns to better separate two distinct but similar sequences, consider the two sequences /b ɔ l/ (ball) and /b o l/ (bowl). For the model, they are relatively hard to distinguish, since they are short and differ only in the intermediate similar phoneme. Nevertheless, on average, the distance d between the two sequences' final activation patterns increases from 3.0, prior to training, to 10.0, after training. This is illustrated in Figure 3 for a single representative experiment. The figure shows pretraining (top) and posttraining (bottom) plots of the development over time of the activation pattern that the winner nodes form on the map at each time step. Time proceeds along the horizontal axis, and the pattern of
Figure 3: Pretraining (A) and posttraining (B) traces of winner nodes for the sequences /b ɔ l/ (ball; • and dotted lines) and /b o l/ (bowl; x and dashed lines). If a winner node is connected by lines to winner nodes at the previous time step, then the node would not have won without the input it received from the previous winners. The final winner nodes or, equivalently, activation patterns (right-most grid) are almost identical prior to training (A), but substantially more distinct after training (B). After training, more winners depend on the input they receive from previous winners via lateral connections (many more lines in B), indicating that the lateral connectivity much more strongly influences the activation dynamics after training, and is thus likely to be a major cause of the more unique final activation patterns.
winner nodes (• for ball, x for bowl) is shown after input of each phoneme on a grid that represents the map. The initial patterns of winner nodes are identical for both sequences after seeing just the first identical phoneme, but the patterns diverge at subsequent time steps, leading to different final patterns, shown on the right-most grid, even though the final afferent inputs to the map are identical (the phoneme /l/). The lines (dotted for ball, dashed for bowl) in this figure that sometimes connect winner nodes of subsequent time steps can be viewed as causal relationships: a winner node is connected to all nodes in its connected neighborhood that won at the previous time step only if the node would not have won at the current time step without the lateral input from these previous winners. Several qualitative differences between the pretraining and posttraining activation dynamics of the model were observable in general. First, the total number of winner nodes on the map tended to increase after training, indicating a more efficient use of the map space that is available to provide an encoding (not observable in Figure 3). Second, the fraction of winner nodes that win because of the lateral input from previous winners increases: after training, the lateral connectivity of the model apparently influences the activation dynamics much more strongly. Finally, and most important, the distance between the final activation patterns of two distinct sequences usually tends to increase. This suggests that the temporally asymmetric training of the model's lateral connections is the major cause of the overall increase in uniqueness of the sequences' spatial representations. That this latter result is quite general, and much more dramatic for more different sequences than the two rather similar ones illustrated here, is seen in Figure 2.
Despite a significant increase in performance, the relatively small networks used meant that our model learned all unique representations in only 4 of the 20 independent experiments conducted with 60 sequences. In the other 16 experiments, only 7 pairs of sequences were transformed into the same spatial representation after training. These pairs ended in two successive consonants (horse/box, needle/eagle, iron/corn), shared a relatively long common suffix (sweater/helicopter, iron/corn), or both started with /k/ and ended with similar consonants (cup/couch, cup/coat, couch/coat). A systematic exploration of varying parameters one at a time showed that the model's good performance was relatively insensitive to significant parameter changes. However, for parameter α, which controls the relative influences of afferent and lateral inputs, we found only a narrow range of values 0.6 ≤ α ≤ 0.7 for which |Z| reaches nearly minimal values and d_min exceeds zero. Unlike the other parameters, the choice of α appears critical to optimal performance. We conducted seven independent experiments with a larger 40 × 30 version of our model that was trained using the same parameter values with the complete set of 175 distinct sequences; the results are shown at the bottom of Table 2. A comparison of the performance values for the two different network sizes in Table 2 shows that the twice-as-large model trained with
roughly three times the number of distinct sequences performed approximately the same. Training of the larger model increased d̄ by 50%, did not increase d_min, and decreased |Z| by 84%, which compares to an increase of d̄ by 40%, a very small increase of d_min, and a decrease of |Z| by 88% for the smaller model. The total number, over all experiments, of pairs of confused but distinct sequences increases by roughly a factor of six, from 7 for the smaller model trained with 60 sequences to 43 for the larger model trained with 175 sequences. Note, however, that the total number of pairs of distinct sequences increases by roughly a factor of 8.6, from 1770 to 15,225. Some representative examples of sequence pairs confused by the larger network model in different runs are: bell/ball, tiger/spider, tomato/potato, fly/butterfly, bottle/beetle, box/ox, and sweater/tiger.

3.2 Effects of Memory Load on Performance. Memory load as used here refers to the number of sequences in the training set in relation to the number of map nodes. To assess memory load effects on the performance of our smaller 30 × 20 model, using the parameter values from Table 2, we varied the number of sequences in the training set from 10 to 175. All training sets, except the set containing all 175 sequences, were generated by the successive subtraction (addition) of 10 randomly chosen distinct sequences from (to) the training set containing 60 sequences that we used in section 3.1. Figure 4 shows that after training, the three performance measures for our model react differently to variations in the size of the training set, measured as the number of pairs of distinct sequences that can be formed from sequences in the training set. The mean distance between the spatial representations of any two distinct sequences, d̄ (d_mean), remains roughly constant over the entire range of tested training set sizes.
The average minimum representation distance, d_min, quickly drops to zero, so with this small network, there are often a few confused sequence pairs present. The data suggest an overall near-linear dependency of |Z| on the number of pairs of distinct training sequences, or a roughly quadratic increase of |Z| in terms of the number of training sequences.

3.3 Map Formation. Past afferent input vectors have substantial influence, via recurrent lateral connections, on the activation dynamics of our model during the processing of a sequence. This effect is nonexistent in more typical one-step single-winner SOMs, which, when trained on a set of unsequenced input vectors, form maps where similar inputs (inputs that are close to one another in input space) are mapped onto or represented by nodes that are close to one another on the map, and vice versa. We assumed that the reverberation of past inputs in our model would disturb and prevent the formation of feature maps. This assumption turned out to be incorrect. As can be seen in Figure 5, which is a representative example, feature maps of single phonemes, that is, single-input feature vectors, did form. For
Figure 4: The influence of training set size on the three performance measures d_mean = d̄, d_min, and |Z|. For each training set size, four independent experiments were conducted, except for the set of size 60, for which the number of independent experiments was 20. Shown are performance means, each with an error bar of one estimated standard deviation (its direction is chosen to minimize clutter and is irrelevant otherwise). While d_mean = d̄ remains constant, d_min very quickly drops to zero. |Z| increases proportionally to the number of pairs of distinct sequences.
example, the model clearly separated clusters of vowels from consonants. These are the two top-level categories any reasonable clustering algorithm would identify because they are most dissimilar based on the set of feature vectors. Unlike in traditional single-winner SOMs, the maps formed by our model exhibit substantial redundancy. For example, in Figure 5, 12 isolated clusters of vowels are visible, where the clusters are similar to one another in terms of which particular vowels they contain. These redundant clusters arise due to the distributed representation used and are reminiscent of the multiple redundant clusters sometimes seen in biological sensorimotor cortex (see, e.g., Donoghue & Sanes, 1992; Georgopoulos et al., 1988). Internally, clusters are more homogeneous than at their periphery; that is, nodes in the center of a cluster are more similar to their immediate neighbors (lighter cells) than nodes on the periphery of a cluster (darker cells). The same applies to the areas of the map that have become sensitive to consonants. The result is a multitude of small, nonredundant feature maps with no discernible boundaries between them that form one globally redundant map. Surprisingly, qualitatively similar maps form for α-values as
Figure 5: The bottom half of a trained map measuring 40 × 30 nodes after training (175 distinct sequences; model parameters as in Table 2). Each cell is one node that is labeled by the phoneme (white characters for vowels, dark characters for consonants) whose feature vector is closest to the node's afferent weight vector. Vowels have been separated on the map from consonants based on their phoneme feature vectors. Multiple vowel clusters or "islands" can be seen at different locations in a "sea" of consonants. A cell's background brightness corresponds to the average dot product between the afferent weight vectors of the node it represents and that node's immediate neighbors: the brighter the node, the more similar its input sensitivity is to that of its immediate neighbors. Lateral connections with a weight > 0.2 are shown as arrows pointing from the source toward the destination node. The length of an arrow is proportional to the square root of the connection weight. The distance of the destination node equals the number of concentric arcs at the arrow's base. The arrow is black if it points from a vowel to a consonant or vice versa; it is white if it points from a vowel (consonant) to a vowel (consonant). The pattern of strong lateral connections suggests that they represent frequent phoneme transitions in the training sequences. In the training sequences, a vowel is almost always followed by a consonant, and on the map, most connections originating at vowels indeed terminate at consonants (black arrows).
Table 3: Correlation of Lateral Weights and Phoneme Transition Frequency.

Training Set and Model Size     Pretraining        Posttraining
60 Sequences, 30 × 20 Nodes     −0.0664 (0.0165)   0.6939 (0.0277)
175 Sequences, 40 × 30 Nodes    −0.0536 (0.0141)   0.6263 (0.0311)
low as 0.2, that is, even if the afferent inputs have very little influence on the activation dynamics of the model. Figure 5 also shows all lateral connections whose weights have increased significantly during training.¹ A visual inspection suggests that nodes sensitive to vowels tend to send strong connections to nodes sensitive to consonants. It is one of the properties of the training set that in all but three cases (out of a total of 222), a vowel in a sequence is followed by a consonant. This gives rise to the hypothesis that strong lateral connections coincide with the frequent transition from a particular input phoneme to a particular next input phoneme in the sequences of the training set. To test this, we measured the correlation between lateral connection weights and phoneme transition frequencies. We recorded, for each possible input phoneme transition x(t) to x(t+1), the sum of the weights on all lateral connections from a node i to a node j (i ≠ j), where x(t) maximizes w_i^T x(t) and x(t+1) maximizes w_j^T x(t+1). We then paired each greater-than-zero sum² with the absolute frequency with which the respective phoneme transition x(t), x(t+1) occurs in the training sequences. These pairs are the data points from which the correlation coefficient is computed. This was done repeatedly and independently for both a small (30 × 20 nodes trained with 60 distinct sequences; 20 independent experiments) and a large (40 × 30 nodes trained with 175 distinct sequences; 7 independent experiments) version of our model, prior to and after training. Table 3 summarizes the results by providing the mean (and standard deviation) of each correlation coefficient, as estimated from the results of the respective number of independent experiments. Prior to training, the then random lateral connection weights are not correlated with phoneme transition frequencies.
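The correlation test just described can be sketched as follows. This is a simplified illustration rather than the authors' code: `W[i, j]` is assumed to hold the lateral weight from node i to node j, `aff[i]` node i's afferent weight vector, and `feat[p]` the feature vector of phoneme p; with multiple simultaneous winners the actual bookkeeping is more involved.

```python
import numpy as np
from collections import Counter

def transition_weight_correlation(W, aff, feat, sequences):
    """Correlate lateral weights with phoneme transition frequencies.

    For each observed transition p -> q, take the weight on the lateral
    connection from the node best matching p to the node best matching q,
    pair it with the transition's absolute frequency, and return the
    correlation coefficient over all greater-than-zero weights.
    """
    # Node whose afferent weight vector best matches each phoneme.
    best = {p: max(range(len(aff)), key=lambda i: np.dot(aff[i], v))
            for p, v in feat.items()}
    counts = Counter((s[t], s[t + 1]) for s in sequences
                     for t in range(len(s) - 1))
    sums, freqs = [], []
    for (p, q), n in counts.items():
        i, j = best[p], best[q]
        w = W[i, j] if i != j else 0.0
        if w > 0.0:  # zero sums are excluded from the analysis
            sums.append(w)
            freqs.append(n)
    return np.corrcoef(sums, freqs)[0, 1]
```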
After training, the two quantities are very highly positively correlated, lending strong support to our hypothesis.

3.4 Representation Distance and Sequence Similarity. We now consider whether similar input sequences are transformed into similar spatial

¹ The threshold is 0.2. Prior to training for a map of 40 × 30 (30 × 20) nodes, the mean lateral weight is 0.0008 (0.0017) with a standard deviation of 0.0040 (0.0058).
² Lateral connections with a weight equal to zero are considered nonexistent. Hence, sums equal to zero are excluded from the analysis.
representations on the map. To measure the similarity of spatial representations (final activation patterns), we use both the 1-norm distance d that we have used all along plus a winner separation distance. Recall that at the end of training, when the parameter γ determining activation peak widths approaches zero (see equation 2.3 and γ_n in Table 2), only the winner nodes are significantly (and fully) active. Let the (row, column) positions of the winner nodes in the spatial representation y following the final phoneme of one input sequence be (r_1, c_1), (r_2, c_2), ..., (r_k, c_k), and the positions in y′ for a different input sequence be (r′_1, c′_1), ..., (r′_{k′}, c′_{k′}), and without loss of generality take k ≥ k′. We then define the winner separation distance d_sep between y and y′ to be the average distance on the map from a winner node in y to the closest winner node in y′: d_sep = (1/k) Σ_{i=1}^{k} min_{1≤j≤k′} (|r_i − r′_j| + |c_i − c′_j|). For comparison purposes, we also need a measure or measures of the similarity of any two input sequences of phonemes used for training. In general terms, the similarity (dissimilarity) of two sequences is typically measured in terms of the two sequences' optimal alignment or "edit distance," and so we adopt this method here. The algorithm for computing optimal alignment is described in detail in, for example, Gusfield (1997). In short, an alignment of two sequences is a recipe for translating one sequence into the other using essentially two operations: the insertion of a special blank element into a sequence and the substitution of an element in one sequence with an element at the same position in the other sequence. Each substitution operation in an alignment is associated with a cost or score that is a function of the two elements being substituted. The sum over all substitutions in an alignment is the score (cost) of the alignment.
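The winner separation distance d_sep, and the Tversky feature count used in this section as an alignment score, can both be written down directly; a minimal sketch (function names are ours):

```python
import numpy as np

def winner_separation(winners_a, winners_b):
    """d_sep: average city-block distance on the map from each winner in
    the larger set to the closest winner in the other set (k >= k')."""
    if len(winners_a) < len(winners_b):  # ensure the first set is larger
        winners_a, winners_b = winners_b, winners_a
    return sum(min(abs(r - r2) + abs(c - c2) for r2, c2 in winners_b)
               for r, c in winners_a) / len(winners_a)

def tversky_feature_count(x, y):
    """Shared features minus non-shared features, for binary vectors."""
    x, y = np.asarray(x), np.asarray(y)
    shared = int(np.sum((x == 1) & (y == 1)))
    only_x = int(np.sum((x == 1) & (y == 0)))
    only_y = int(np.sum((x == 0) & (y == 1)))
    return shared - only_x - only_y
```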
An optimal alignment maximizes (minimizes) the score (cost) of translating one sequence into the other. To avoid length-based biases, we normalize the score (cost) of each optimal alignment by its length. We adopt the convention that each inserted blank equals the blank's immediate predecessor in the sequence. As all input elements are binary-valued feature vectors, we adopt a Hamming distance cost measure (very similar results were also found using Euclidean distances instead). As a score measure, we use the Tversky feature count (Tversky, 1977; Tversky & Gati, 1978), a well-established method in linguistics for measuring the similarity of phonemes. With this latter measure, if two phonemes are encoded by the feature vectors x and x′, then their similarity equals the difference between the number of features they share and the number of features they do not share: |{i: x_i = x′_i = 1}| − |{i: x_i = 1, x′_i = 0}| − |{i: x_i = 0, x′_i = 1}|. The correlation analysis was performed for each of the four possible combinations of a representation distance measure (d or d_sep) compared to a sequence similarity (Tversky) or dissimilarity (Hamming) measure. Two different size versions of the model were used (30 × 20 nodes and 60 distinct sequences versus 40 × 30 nodes and 175 distinct sequences). The small (large) instance of the model was initialized 20 (7) times with different random initial weights and subsequently trained. In each of these
Table 4: Correlation of Representation Distance and Sequence Similarity.

Feature vector distance    60 Sequences, 30 × 20 Nodes       175 Sequences, 40 × 30 Nodes
or similarity measure      1-norm          Winner separation 1-norm          Winner separation
                           Pre     Post    Pre     Post      Pre     Post    Pre     Post
Hamming distance           .3284   .2893   .3041   .3779     .3874   .3594   .3863   .4141
Tversky feature count      −.3438  −.2917  −.3214  −.4054    −.3950  −.3641  −.3946  −.4245

Note: Italic entries indicate a statistically significant (p ≤ 0.01) decrease and bold entries an increase of the posttraining relative to the pretraining absolute level of correlation.
independent experiments, the four correlation coefficients were computed prior to and after training. The correlation coefficients, averaged over the respective number of independent experiments, are listed in Table 4. Overall, these results show that both before and after training, there is a substantial positive correlation between input sequence Hamming distances and their final activation pattern distances, and a substantial negative correlation between input sequence similarities (Tversky's measure) and their final activation pattern distances. Most intriguing is that the magnitudes of the correlation measured in terms of winner node separation are always increased by training.

4 Discussion

Most past work on SOMs has focused on processing nonsequential input patterns and has used Kohonen's approach to map formation. The latter bases learning on a single global winner node for each input pattern, and uses a one-step "best match" winner selection process for computational efficiency. While very successful for the nonsequential tasks for which it is intended (Kaski et al., 1998), various past approaches extending such SOMs have been and continue to be developed to process temporal sequences (see section 1).
In this article, we have examined the specific question of the extent to which Kohonen SOMs can be modified to learn a unique spatial representation or encoding of temporal sequences while still retaining their traditional map formation properties. To facilitate sequence processing, we extended the standard one-step single-winner SOM methodology in two ways, both inspired by biological phenomena in cerebral cortex. First, instead of global single-winner activation dynamics, we used multiple simultaneous winner nodes. Such a distributed or coarse representation is motivated by its potential to encode or represent a larger number of temporal sequences. Using multiple local activation peaks like this is also more consistent with activity patterns in the cerebral cortex, and for this reason has been adopted in several past SOMs directed at modeling neurobiological observations (Cho & Reggia, 1994; Pearson et al., 1987; Sutton et al., 1994; von der Malsburg, 1973). However, unlike these past studies with nonsequence processing tasks, we retained the one-step winner selection of Kohonen SOMs for computational efficiency. The second enhancement we made was to add local intramap lateral connections that undergo temporally asymmetric Hebbian learning. The motivation for this type of connection was to enable the now recurrent map network to capture temporal transitions via lateral shifts in activation peak locations. As discussed above, this extension also derives from biological data that have demonstrated such temporally asymmetric learning experimentally (Bi & Poo, 1998, 2001; Markram et al., 1997; Zhang et al., 1998). Our learning rule (see equations 2.6 and 2.7) intuitively tries to capture and enhance the causal relationships between activation peaks at one time instant and subsequent nearby activation peaks at the next time instance.
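The flavor of such a temporally asymmetric rule can be sketched as below. This is an illustrative toy update in the spirit of equations 2.6 and 2.7 (which are not reproduced here), not the paper's exact rule: the lateral weight from node i to node j grows when i is active at time t−1 and j is active at time t, capturing the causal order of activation peaks.

```python
import numpy as np

def asymmetric_hebbian_step(W, a_prev, a_now, eta=0.1):
    """W[i, j] is the lateral weight i -> j; strengthen connections where
    the presynaptic node fired one step before the postsynaptic node."""
    return W + eta * np.outer(a_prev, a_now)
```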
With these two extensions, the resulting one-step multiwinner SOM was found to be remarkably effective in developing unique spatial representations (unique final activation patterns across the map) for sizable sets of real-world temporal sequences. Even with the relatively small networks we used, maps could learn unique encodings for almost all of 60 sequences, or 175 sequences for the somewhat larger maps. While not perfect (typically a very few sequences remained confused after training), the learning process clearly and consistently increased the uniqueness of representations over time. As similar input sequences tended to produce similar final activation patterns over the map, not surprisingly the confused input phoneme sequences often were similar, especially in their initial or final subsequences (e.g., ball/bell and spider/sweater). A somewhat unexpected finding, at least to us, was that in spite of the sequential nature of the input, the multiple simultaneous winner nodes, and the lateral intramap connectivity that influenced selection of winning nodes, well-organized maps of the individual phoneme input patterns still formed. These were reminiscent of maps seen with traditional one-shot (Kohonen) SOMs, with similar phonemes being generally adjacent to one another. For example, there was clear-cut separation of vowel and consonant phonemes
from one another. Of course, since unlike with traditional SOMs, our model has multiple simultaneous winner nodes, multiple copies of such maps were present. This finding was very robust to variations in the weighting given to afferent versus lateral connections (parameter α). We believe the findings of this study imply that SOMs have a substantially greater role to play as useful tools for sequence processing than is generally recognized. Still, there is clearly room for future research to improve on the capabilities of SOMs in this regard. Perhaps most important, future theoretical and experimental studies are needed of ways to guarantee the uniqueness of the spatial representations that are learned for similar input sequences. While it might be true that simply using larger networks could resolve this issue, a more satisfying solution would use methods that increase the effectiveness of the time-to-space mapping without enlarging the maps. Some methods, which we did not examine here, that might be explored include the use of noise during training to encourage more separation of the final activation patterns of very similar sequences, or increasing the time span of learning effects on lateral intramap connections from one time step to two or three (reaching back further in time has proven effective in improving supervised sequence learning in some past non-SOM systems).

Appendix: Training Data

The words (nouns) used in this work are derived from the Snodgrass-Vanderwart corpus (Snodgrass & Vanderwart, 1980) and their phonemes based on the NetTalk corpus (Berndt, D'Autrechy, & Reggia, 1994; Sejnowski & Rosenberg, 1987).
The Snodgrass-Vanderwart corpus contains 260 names of physical objects (e.g., apple), from which we eliminated all multiword names (e.g., spool of thread), words for which, in experiments, subjects did not select the "correct" name for the corresponding picture at least 90% of the time (using % Corr(1) in Snodgrass & Yuditsky, 1996), and nouns that are not part of the NetTalk corpus. This leaves 175 nouns that we use as training data. The phoneme sequences corresponding to the selected nouns are taken from the NetTalk corpus. Altogether, 27 consonants and 15 vowels and diphthongs occur in the NetTalk corpus, for a total of 42 phonemes. Three of the consonants, /ul/, /um/, and /un/, which rarely or never are part of a selected noun (14, 0, and 3 times), are not distinguished but are considered equivalent to /l/, /m/, and /n/. Construction of distinctive feature vectors for each phoneme is challenging, as experts in phonology, phonetics, and linguistics sometimes disagree on what an ideal set of distinctive features should be (see, e.g., Frisch, 1996). Our distinctive features were not based on any modeling considerations but on well-known previously published feature sets. They provide a unique representation for each distinguished phoneme that captures at least some of the regularities that make some phonemes similar to others. All 34 components of a feature vector (input pattern), prior to normalization, are binary
Table 5: Distinctive Features: Vowels. [Feature matrix: the alignment of the +/. entries is not reliably recoverable from the source. Columns are the vowels and diphthongs (keyboard codes o, ah, ay, oo, e, uh−, ee, ih, eh, ae, uh+, u, aw, er, ai, au); rows are the 34 binary distinctive features: Consonantal, Vocalic, Compact, Diffuse, Grave, Acute, Nasal, Oral, Tense, Lax, Continuant, Interrupted, Strident, Mellow, ±Voicing, ±Duration, ±(Af)Frication, Liquid, Glide, Retroflex, and six F2 and six F1 formant intervals (VH, H, HM, LM, L, VL). Each entry is + (feature present) or . (feature absent).]
valued: + for a present feature (numerical value 1.0) and − for an absent feature (0.0) (see Tables 5, 6, and 7). The consonant features were taken mostly from the Jakobson, Fant, and Halle (1951) feature system (Singh, 1976), augmented for completeness with additional phonemes (e.g., /r/) and features by Singh and colleagues (Singh & Black, 1966; Singh, 1976). The vowel features include some of the same features as consonants, plus features based on the F1 and F2 formants, each divided into six discrete frequency intervals (VH = very high, H = high, HM = high-medium, LM = low-medium, L = low, VL = very low), taken from Paget (1976). The diphthongs such as /ai/ and /au/ were taken as the average of their two nondiphthong components for simplicity.
Table 6: Distinctive Features: Consonants, Part I. [Feature matrix: the alignment of the +/. entries is not reliably recoverable from the source. Columns are the consonants (keyboard codes p, b, m, t, d, n, tch, dj, k, g, f, v, th−, th+); rows are the same 34 binary distinctive features as in Table 5, with + for a present feature and . for an absent one.]
For normalization, each feature vector is projected onto the unit hypersphere in the next higher dimension. The additional component v_r stores the minimal distance between the original feature vector and the surface of the smallest hypersphere enclosing all feature vectors: v_r = r − ||v||_2, where r is the length of the largest feature vector. The thus extended feature vectors are then normalized to unit length to prevent input vectors with relatively large norms from having a greater influence on the activation dynamics. The prior projection step preserves topological information such as the nearest-neighbor relation between the vectors.
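This normalization scheme can be sketched as follows (the function name and array layout are our assumptions): append the extra component v_r = r − ||v||_2 to each vector, then rescale the extended vector to unit length.

```python
import numpy as np

def normalize_features(vectors):
    """Append v_r = r - ||v||_2 (r = norm of the largest feature vector),
    then normalize each extended vector to unit length."""
    V = np.asarray(vectors, dtype=float)
    r = max(np.linalg.norm(v) for v in V)  # radius of the enclosing sphere
    out = []
    for v in V:
        ext = np.append(v, r - np.linalg.norm(v))
        out.append(ext / np.linalg.norm(ext))
    return np.array(out)
```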
Table 7: Distinctive Features: Consonants, Part II. [Feature matrix: the alignment of the +/. entries is not reliably recoverable from the source. Columns are the consonants (keyboard codes s, z, sh, zh, w, r, l, y, h, ng); rows are the same 34 binary distinctive features as in Table 5, with + for a present feature and . for an absent one.]
Acknowledgments

This work was supported by NINDS Award NS35460 and DOD Award N000140210810.

References

Abbott, L., & Blum, K. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cerebral Cortex, 6(3), 406–416.
Barreto, G., Araujo, A., & Kremer, S. (2003). A taxonomy for spatiotemporal connectionist networks revisited: The unsupervised case. Neural Computation, 15(6), 1255–1320.
Berndt, R., D'Autrechy, C., & Reggia, J. (1994). Functional pronunciation units in English words. J. Experimental Psychology: Learning, Memory and Cognition, 20, 977–991.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neuroscience, 18(24), 10464–10472.
Bi, G., & Poo, M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Review of Neuroscience, 24, 139–166.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Callan, D. E., Kent, R. D., Roy, N., & Tasko, S. M. (1999). Self-organizing map for the classification of normal and disordered female voices. Journal of Speech, Language, and Hearing Research, 42, 355–366.
Carpinteiro, O. (1999). A hierarchical self-organizing map model for sequence recognition. Neural Processing Letters, 9(3), 209–220.
Chappell, G. J., & Taylor, J. G. (1993). The temporal Kohonen map. Neural Networks, 6, 441–445.
Cho, S., & Reggia, J. (1994). Map formation in proprioceptive cortex. Int. J. Neural Systems, 5(2), 87–101.
Donoghue, J., Leibovic, S., & Sanes, J. (1992). Organization of the forelimb area in squirrel monkey motor cortex. Experimental Brain Research, 89, 1–19.
Euliano, N. R., & Principe, J. C. (1999). A spatio-temporal memory based on SOMs with activity diffusion. In E. Oja & S. Kaski (Eds.), Kohonen maps (pp. 253–266). Amsterdam: Elsevier.
Frisch, S. (1996). Similarity and frequency in phonology. Unpublished doctoral dissertation, Northwestern University.
Georgopoulos, A., Kettner, R., & Schwartz, A. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the directions of movement by a neural population. J. Neuroscience, 8, 2928–2937.
Gerstner, W., Ritz, R., & van Hemmen, J. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biological Cybernetics, 69(5–6), 503–515.
Gusfield, D. (1997). Algorithms on strings, trees, and sequences. Cambridge: Cambridge University Press.
Hebb, D. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Jakobson, R., Fant, G., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge, MA: MIT Press.
James, D. L., & Miikkulainen, R. (1995). SARDNET: A self-organizing feature map for sequences. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 577–584). Cambridge, MA: MIT Press.
Kangas, J. (1990). Time-delayed self-organizing maps. In Proc. of IJCNN, International Joint Conference on Neural Networks, San Diego (Vol. 2, pp. 331–336). Los Alamitos, CA: IEEE Computer Society Press.
Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM—self-organizing maps of document collections. Neurocomputing, 21(1), 101–117.
Kennedy, J., Eberhart, R., & Shi, Y. (2001). Swarm intelligence. San Francisco: Morgan Kaufmann.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cyb., 43(1), 59–69.
Kokkonen, M., & Torkkola, K. (1990). Using self-organizing maps and multilayered feed-forward nets to obtain phonemic transcriptions of spoken utterances. Speech Communication, 9(5–6), 541–549.
Kopecz, K. (1995). Unsupervised learning of sequences on maps with lateral connectivity. In F. Fogelman-Soulie & P. Gallinari (Eds.), Proceedings of the International Conference on Artificial Neural Networks (Vol. 1, pp. 431–436). Nanterre, France: Editions EC2.
Markram, H., Luebke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297), 213–215.
Mozer, M. (1993). Neural network architectures for temporal sequence processing. In A. Weigend & N. Gershenfeld (Eds.), Time series prediction (pp. 243–264). Reading, MA: Addison-Wesley.
Paget, R. (1976). Vowel resonances. In D. Fry (Ed.), Acoustic phonetics (pp. 95–103). Cambridge: Cambridge University Press.
Pearson, J., Finkel, L., & Edelman, G. (1987). Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. Neuroscience, 7, 4209–4223.
Principe, J. C., Wang, L., & Motter, M. A. (1998). Local dynamic modeling with self-organizing maps and applications to nonlinear system identification and control. Proceedings of the IEEE, 86(11), 2240–2258.
Rao, R., & Sejnowski, T. (2000). Predictive learning of temporal sequences in recurrent neocortical circuits. In S. Solla, T. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–171). Cambridge, MA: MIT Press.
Reggia, J., D'Autrechy, C., Sutton, G., & Weinrich, M. (1992). A competitive distribution theory of neo-cortical dynamics. Neural Computation, 4, 287–317.
Roberts, P. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Computational Neuroscience, 7(3), 235–246.
Royer, S., & Paré, D. (2003). Conservation of total synaptic weight through balanced synaptic depression and potentiation. Nature, 422, 518–522.
Sejnowski, T., & Rosenberg, C. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145–168.
Singh, S. (1976). Distinctive features: Theory and validation. Baltimore, MD: University Park Press.
Singh, S., & Black, J. (1966). A study of twenty-six intervocalic consonants as spoken and recognized by four language groups. J. Acoustic Society of America, 39(2), 372–387.
Snodgrass, J., & Vanderwart, M. (1980). A standardized set of 260 pictures. J. Experimental Psychology: Human Learning and Memory, 6, 174–215.
Snodgrass, J., & Yuditsky, T. (1996). Naming times for the Snodgrass and Vanderwart pictures. Behavior Research Methods, Instruments and Computers, 28(4), 516–536.
Somervuo, P. (1999). Time topology for the self-organizing map. In IJCNN'99, International Joint Conference on Neural Networks: Proceedings (Vol. 3, pp. 1900–1905). Piscataway, NJ: IEEE Service Center.
Somervuo, P. (2003). Speech dimensionality analysis on hypercubical self-organizing maps. Neural Processing Letters, 17(2), 125–136.
Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3(9), 919–926.
Sutton, G., Reggia, J., Armentrout, S., & D'Autrechy, C. (1994). Cortical map reorganization as a competitive process. Neural Computation, 6, 1–13.
Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.
Tversky, A., & Gati, I. (1978). Studies of similarity. In E. Rosch & B. Lloyd (Eds.), Judgement under uncertainty: Heuristics and biases. Mahwah, NJ: Erlbaum.
Varsta, M., Millan, J., & Heikkonen, J. (1997). A recurrent self-organizing map for temporal sequence processing. In W. Gerstner, A. Germond, M. Hasler, & J. Nicoud (Eds.), Proc. Internat. Conf. Artif. Neural Networks (pp. 421–426). Berlin: Springer-Verlag.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Wiemer, J. (2003). The time-organized map algorithm: Extending the self-organizing map to spatiotemporal signals. Neural Computation, 15(5), 1143–1171.
Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received April 15, 2003; accepted September 8, 2003.
LETTER
Communicated by Misha Tsodyks
Neuronal Bases of Perceptual Learning Revealed by a Synaptic Balance Scheme Osamu Hoshino [email protected] Department of Human Welfare Engineering, Oita University, 700 Dannoharu, Oita 870-1192, Japan
Our ability to perceive external sensory stimuli improves as we experience the same stimulus repeatedly. This perceptual enhancement, called perceptual learning, has been demonstrated for various sensory systems, such as vision, audition, and somatosensation. I investigated the contribution of the balance between lateral excitatory and inhibitory synapses to perceptual learning. I constructed a simple associative neural network model in which sensory features were expressed by the activities of specific cell assemblies. Each neuron is sensitive to a specific sensory feature, and the neurons belonging to the same cell assembly are sensitive to the same feature. During perceptual learning, the network was presented repeatedly with a stimulus composed of a sensory feature and noise, and the lateral excitatory and inhibitory synaptic connection strengths between neurons were modified according to a pulse-timing-based Hebbian rule. Perceptual learning enhanced the cognitive performance of the network, increasing the signal-to-noise ratio of neuronal activity. I suggest here that the alteration of the synaptic balance may be essential for perceptual learning, especially when the brain tries to adopt the most suitable strategy (signal enhancement, noise reduction, or both) for a given perceptual task.

1 Introduction

Our ability to perceive external sensory stimuli improves as we experience the same stimulus repeatedly. This perceptual enhancement is called perceptual learning and has been demonstrated for various sensory systems, such as vision, audition, and somatosensation (for a review, see Goldstone, 1998). Depending on the type of cognitive processing, the brain may use a different strategy: signal enhancement (Poggio, Fahle, & Edelman, 1992; Weiss, Edelman, & Fahle, 1993), noise reduction (McLaren, Kaye, & Mackintosh, 1988; McLaren, 1994; Gold, Bennett, & Sekuler, 1999), or both (Dosher & Lu, 1998).
Neural Computation 16, 563–594 (2004)
© 2004 Massachusetts Institute of Technology

It is often claimed that improvement in cognitive performance by perceptual learning may arise from the alteration of cortical circuitry, that is,
changes in synaptic connection strengths between relevant neurons (Gilbert, 1994, 1996). Karni and Sagi (1991) have proposed a Hebbian process for synaptic alteration, in which use-dependent synaptic enhancement is induced by concurrent pre- and postsynaptic activation. Weiss and colleagues (1993) carried out a simulation study of perceptual learning in hyperacuity in the early visual system. They demonstrated that signal but not noise was amplified by perceptual learning, in which activity-dependent modulation of feedforward synaptic connection strengths played a crucial role. It has also been proposed that lateral synaptic modulation within cortical areas may play a crucial role in perceptual learning. Karni and Sagi (1991) have suggested that lateral synaptic modulation based on the Hebbian rule is essential for perceptual learning in vision. In particular, lateral inhibitory synaptic modulation is effective for reducing background noise and therefore strengthens the "pop-out" of a target stimulus. Schwartz, Maquet, and Frith (2002) have suggested that perceptual learning in visual texture discrimination tasks may involve changes in connections within the primary visual cortex. Hoshino, Inoue, Kashimori, and Kambara (2001) have demonstrated, by simulating a neural network model, that excitatory and inhibitory synaptic modulation within the network enhances cognitive ability in sensory-feature identification tasks, reducing the time required to identify sensory stimuli. Crist, Li, and Gilbert (2001) have suggested that a change in excitatory-inhibitory synaptic balance may be essential for perceptual learning in area V1. Although there is clear evidence supporting the relevance of cortical plasticity to perceptual learning, little is known about the functional significance of the modulation of synaptic balance in perceptual learning.
The purpose of the present study is to investigate how the lateral excitatory-inhibitory synaptic balance contributes to perceptual learning. I propose a synaptic balance scheme for perceptual learning, on the basis of which I try to develop a unified understanding of the neuronal mechanism underlying perceptual learning. I investigate how the brain adopts the most suitable strategy (signal enhancement, noise reduction, or both) for a given perceptual task. I construct a simple associative neural network model in which sensory features are expressed by the activities of specific cell assemblies. When the network is stimulated with a sensory feature, the neurons of the cell assembly corresponding to the applied feature fire action potentials, while the other neurons do not. During a perceptual learning process, the network is stimulated with a given feature repeatedly, and the lateral excitatory and inhibitory synaptic connection strengths between neurons are modified according to a pulse-timing-based Hebbian rule. After perceptual learning, neuronal responses to that feature are recorded. I evaluate the performance of the network by the signal-to-noise ratio of neuronal activity. The ratio is defined by (feature-induced neuronal
activity)/(noise-induced neuronal activity). This idea comes from the notion that there is a close relationship between an increase in the signal-to-noise ratio of neuronal activity and an improvement in cognitive performance (Gilbert, 1994). Other performance measures, such as reaction times to stimuli or the detectability of stimuli, which are often used for assessing the performance of associative neural networks (Hoshino et al., 2001; Hoshino, Zheng, & Kuroiwa, 2002), would also be possible. However, as will be shown in later sections, the signal-to-noise ratio of neuronal activity can clearly demonstrate the relevance of the synaptic balance to perceptual learning. I use a degraded feature stimulus as input, which is the superposition of two firing patterns of the network, expressing an original feature stimulus and a noise stimulus. Neurons relevant to the original feature, and some irrelevant ones, tend to be activated when the network is stimulated with that combined (feature plus noise) input. As the level of noise increases, the activity of the relevant neurons decreases and that of the irrelevant neurons increases. To investigate the relevance of the synaptic balance to perceptual learning, I vary the rates of excitatory and inhibitory synaptic modification and analyze neuronal behaviors before, during, and after perceptual learning. For the simulations, I assume perceptual learning processes that are restricted to early sensory systems such as V1 in vision, A1 in audition, or S1 in somatosensation (Gilbert, 1994; Buonomano & Merzenich, 1998). V1 neurons are orientation specific, A1 neurons are sound-frequency specific, and S1 neurons are tuned to specific somatosensory input from the body surface. That is, these neurons have simple and small receptive fields and are therefore single modal, responding to single sensory features but not to multiple sensory features. This notion allows me to build a neural network model consisting of single-modal neurons.
However, it is well known that perceptual learning is not limited to lower cortical areas but takes place in higher cortical areas as well. In general, since higher cortical areas integrate lower-level sensory information, the neurons of the higher areas have complex receptive fields and therefore tend to respond to multiple sensory features (Rolls & Baylis, 1994; Sakurai, 1996). I extend the simulation study to investigate how perceptual learning proceeds if the network contains multimodal neurons. I simulate a simple neural network model that contains both types (single modal and bimodal) of neurons. The bimodal neurons are sensitive to two different features.

2 Neural Network Model

Figure 1a shows the structure of the neural network model. The model consists of an input (IP) and an output (OP) network. The IP network receives an external input and sends action potentials to the OP network by divergent and convergent feedforward projections. The OP network consists of neural units, each consisting of a pair of a pyramidal neuron and an interneuron. In each unit, an excitatory (inhibitory) synaptic connection is made from the
Figure 1: The neural network model. (a) The model consists of an input (IP) and an output (OP) network. The IP network receives an external input and sends action potentials to the OP network via divergent and convergent feedforward projections. When the IP network is stimulated with feature FX (X = 1, 2, 3, 4, 5), the OP neurons corresponding to FX are activated. These groups of OP neurons have no overlapping region with each other. (b) Degraded input stimuli. The black circles and gray circles, which denote the firing state of neurons, represent signal and noise, respectively. The open circles denote the resting state of neurons.
pyramidal neuron (interneuron) to the interneuron (pyramidal neuron). The pyramidal neurons are mutually connected with excitatory and inhibitory synaptic connections. I assume here excitatory and inhibitory synapses between pyramidal neurons. The excitatory synapses could be made by axo-dendritic connections between pyramidal neurons, and the inhibitory synapses could be expressed as the excitation of local interneurons (Martin, 1991; Amit, 1998), such as basket cells, Martinotti cells, or bitufted cells, which are GABAergic inhibitory neurons in the cortex (Gupta, Wang, & Markram, 2000). The OP network is constructed as a so-called associative neural network. Sensory features expressed by the activities of specific cell assemblies, or specific firing patterns of the neural network, are memorized into the network dynamics through Hebbian learning. After learning, the neurons of each cell assembly become sensitive to one of the memorized features. These cell assemblies have no overlapping region with each other, thereby making feature-selective neurons. It is well known that feature-selective neurons generally prevail at early visual stages (Logothetis, 1998), such as the motion-selective neuronal columns in the MT area (Kandel, 1991). The network dynamics allows a certain cell assembly to be activated, and therefore selected from among the others, when the network is stimulated with the relevant sensory feature. For simplicity, the IP network contains only pyramidal neurons, and there are no connections between them. That is, the IP network works exclusively as an input network. I apply degraded features as input stimuli to the IP network. These degraded stimuli are expressed as the superposition of the original feature (e.g., F2) and spatial noise, as shown in Figure 1b, where the black and gray circles denote the firing state of neurons and the open circles denote the resting state of neurons. The black circles represent signal information about feature F2, while the gray circles, which are irrelevant to feature F2, represent noise. As the level of deviation from the original feature representation (F2_0 or F2) increases, the F2-sensitive OP neurons receive less input from F2, while the other OP neurons receive greater input from noise.
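The non-overlapping cell assemblies described above can be sketched as index sets over the OP neurons. The block layout and assembly size below are my assumptions for illustration, not values taken from the paper:

```python
import numpy as np

# Illustrative sketch (layout assumed, not from the paper) of non-overlapping
# cell assemblies: each feature FX is represented by its own block of OP
# neurons, so each neuron is sensitive to exactly one memorized feature.
N_OP, n_features = 81, 5
size = N_OP // n_features               # 16 neurons per assembly here
assembly = {f"F{k + 1}": np.arange(k * size, (k + 1) * size)
            for k in range(n_features)}

def responds_to(neuron, feature):
    """True if the neuron belongs to the feature's cell assembly."""
    return neuron in assembly[feature]
```

Because the index blocks are disjoint, stimulating one feature can only drive its own assembly directly; any activity elsewhere must come through the lateral connections.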
Note that the numbers of on-neurons (black and gray circles) of the degraded stimuli are the same. The dynamic evolution of the membrane potentials of pyramidal neurons and interneurons is defined by

$$\tau_p^{IP}\frac{du_{p,i}^{IP}(t)}{dt} = -\left(u_{p,i}^{IP}(t) - u_{p,rest}^{IP}\right) + I_{p,i}^{IP}(t), \tag{2.1}$$

$$\tau_p^{OP}\frac{du_{p,i}^{OP}(t)}{dt} = -\left(u_{p,i}^{OP}(t) - u_{p,rest}^{OP}\right) + \sum_{j=1}^{N_{OP}}\left(w_{pp,ij}^{ex}(t) + w_{pp,ij}^{ih}(t)\right)S_{p,j}^{OP}(t) + w_{pr}S_{r,i}^{OP}(t) + \sum_{j=1}^{N_{IP}} L_{ij}^{OP,IP} S_{p,j}^{IP}(t - \Delta t^{OP,IP}), \tag{2.2}$$

$$\tau_r^{OP}\frac{du_{r,i}^{OP}(t)}{dt} = -\left(u_{r,i}^{OP}(t) - u_{r,rest}^{OP}\right) + w_{rp}S_{p,i}^{OP}(t), \tag{2.3}$$

$$\mathrm{Prob}\left[S_{x,i}^{Y}(t) = 1\right] = f_x\left[u_{x,i}^{Y}(t)\right], \quad (x = p, r;\; Y = IP, OP), \tag{2.4}$$

$$f_x[y] = \frac{1}{1 + e^{-\eta_x (y - \theta_x)}}, \tag{2.5}$$

where u_p,i^Y(t) and u_r,i^Y(t) are the membrane potentials of the ith pyramidal neuron and the ith interneuron of the Y (Y = IP, OP) network at time t, respectively. u_p,rest^Y and u_r,rest^Y are the resting potentials of pyramidal neurons and interneurons, respectively. τ_p^Y and τ_r^Y are the decay times of membrane potentials. N_Y is the number of neuron units of the Y network. w_pp,ij^ex(t) and w_pp,ij^ih(t) are, respectively, the excitatory and inhibitory synaptic strengths from pyramidal neuron j to i. w_pr (w_rp) is the strength of the synaptic connection from interneuron (pyramidal neuron) to pyramidal neuron (interneuron). S_p,i^Y(t) and S_r,i^Y(t) are the action potentials of the ith pyramidal neuron and the ith interneuron, respectively. L_ij^OP,IP is the strength of the synaptic connection from IP pyramidal neuron j to OP pyramidal neuron i. The value 1 or 0 is chosen for L_ij^OP,IP to make divergent and convergent feedforward projections, whereby a specific group of feature-selective OP neurons (a cell assembly) tends to be activated when the IP network is stimulated with a given feature. Δt^OP,IP denotes a delay time in signal transmission from the IP to the OP network. I_p,i^IP(t) is an external input to IP pyramidal neuron i. η_x and θ_x are the steepness and the threshold of the sigmoid function f_x, respectively, for neurons of kind x. Equations 2.4 and 2.5 define the probability of neuronal firing; that is, the probability that S_x,i^Y(t) = 1 is given by f_x, and otherwise S_x,i^Y(t) = 0. When the ith neuron fires, S_x,i^Y(t) takes on a value of 1 for 1 msec, followed by a value of 0 for another 1 msec. This is a simplified representation of an action potential and a refractory period and sets a maximum firing rate of 500 Hz. After firing, the membrane potential is reset to the resting potential (u_p,rest^Y = -0.5, u_r,rest^Y = -0.5). The values of the other network parameters used in this study are as follows: N_IP = N_OP = 81, τ_p^IP = τ_p^OP = 100 msec, τ_r^OP = 100 msec, θ_p = 0.1, θ_r = 0.9, η_p = 10.2, η_r = 21.0, and Δt^OP,IP = 10 msec. Synaptic connections within the OP network are modified according to a pulse-timing-based Hebbian rule, defined by
$$\tau_w\frac{dw_{pp,ij}^{ex}(t)}{dt} = -w_{pp,ij}^{ex}(t) + \int_0^{\infty} \alpha_{ex} K(s)\, S_{p,i}^{OP}(t)\, S_{p,j}^{OP}(t+s)\, ds + \int_{-\infty}^{0} \alpha_{ex} K(s)\, S_{p,i}^{OP}(t-s)\, S_{p,j}^{OP}(t)\, ds, \tag{2.6}$$

$$\tau_w\frac{dw_{pp,ij}^{ih}(t)}{dt} = -w_{pp,ij}^{ih}(t) + \int_0^{\infty} \alpha_{ih} K(s)\left[S_{p,i}^{OP}(t) - 1\right] S_{p,j}^{OP}(t+s)\, ds + \int_{-\infty}^{0} \alpha_{ih} K(s)\left[S_{p,i}^{OP}(t-s) - 1\right] S_{p,j}^{OP}(t)\, ds, \tag{2.7}$$
Figure 2: (a) The pairing function K(s) for Hebbian learning. For details, see the text. (b) Raster plots of firing pulses of the OP cell assemblies, which have specific sensitivity to feature FX, during and after the feature-memorization process. Each bar indicates the period of stimulation with feature FX (X = 1, 2, 3, 4, 5), during which synaptic construction takes place.
where τ_w is a time constant and α_ex (α_ih) is the rate of excitatory (inhibitory) synaptic modification. The kernel K(s) (see Figure 2a) defines a pairing function for the spike-timing-based Hebbian learning rule. This pulse-timing-based Hebbian learning accords with recent experiments (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998) demonstrating that synaptic plasticity depends on the relative timing of presynaptic and postsynaptic spikes. The second and third terms on the right-hand side of equations 2.6 and 2.7 represent LTP (long-term potentiation) and LTD (long-term depression) for excitatory (see equation 2.6) and inhibitory (see equation 2.7) synaptic connections, respectively. According to K(s), when neuron j fires just before (after) neuron i fires, the excitatory connection w_pp,ij^ex(t)
is strengthened (weakened), and thus LTP (LTD) takes place for excitatory synapses. When neuron j fires just before (after) neuron i ceases to fire, the inhibitory connection w_pp,ij^ih(t) is strengthened (weakened), and thus LTP (LTD) takes place for inhibitory synapses. Note that w_pp,ij^ex(t) and w_pp,ij^ih(t) are assigned to take a positive and a negative value, respectively. Parameter values for the synaptic dynamics are as follows: τ_w = 50 msec, α_ex = 20, and α_ih = 3 for the feature-memorization process. These values are used exclusively to reduce the time required for memorizing the features into the network as dynamic cell assemblies. τ_w = 500 msec, α_ex = 0–20.0, and α_ih = 0–1.0 are used for the various perceptual learning processes. For simplicity, the synaptic connection strengths between pyramidal neurons and interneurons are held at constant values, w_pr = -1.0 and w_rp = 1.0. Other network parameter values used in the model are chosen to obtain suitable model performance so that the essential neuronal behaviors of the network can be demonstrated clearly. The simplifications made here allow a more extensive exploration of the model by making it easier to identify critical parameters among a large number of possible candidates. Each stimulus is expressed by an on-off pixel pattern {ξ_i(FX_D)}, as shown in Figure 1b, where D denotes the level of deviation from the original feature FX (or FX_0). When stimulus FX_D is presented to the IP network, the amount of stimulation received by neuron i of the IP network is defined by

$$I_{p,i}^{IP}(t) = \epsilon\, \xi_i(FX_D), \quad (X = 1\text{--}5;\; D = 0, 10, 20, 30, 40, 50), \tag{2.8}$$

where ε denotes the intensity of stimulus FX_D. ξ_i(FX_D) takes on a value of 1 for an on (black or gray circle) pixel and 0 for an off (open circle) pixel. ε = 100 for the feature-memorization process, and ε = 0.2 for perceptual learning. The high intensity, ε = 100, is used exclusively to accelerate the feature-memorization process, thereby reducing the computational load for the perceptual learning processes that follow. Figure 2b is the raster plot of firing pulses of the OP cell assemblies during and after the feature-memorization process. Stimulation with feature FX, as indicated by a bar, activates the OP cell assembly corresponding to FX. During the stimulation periods, synaptic connections between the OP pyramidal neurons are modified according to the Hebbian rule defined by equations 2.6 and 2.7. After the memorization process, the OP neurons fire action potentials in a random manner (see time = 6,000–10,000) at a firing rate of 10 to 30 Hz. Note that the neural network model has ongoing spontaneous activity even without external stimulation. The mutual excitatory connections within the cell assemblies might be responsible for the spontaneous neuronal firing. The randomness of firing might be due mainly to the self-inhibition of pyramidal neurons via inhibitory interneurons. In V1 of alert macaque monkeys performing cognitive tasks, similar spontaneous neuronal firing (10–30 Hz) has been reported (Peres & Hochstein, 1994).
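Equation 2.8 and the degraded stimuli of Figure 1b can be sketched as follows. The feature layout and the swap procedure are my illustration of the constraint, stated above, that the total number of on-neurons stays fixed as D increases:

```python
import numpy as np

# Illustrative sketch of a degraded stimulus FX_D (equation 2.8, Figure 1b):
# D percent of the feature's "on" pixels are replaced by noise pixels drawn
# from outside the feature, so the total on-neuron count is preserved.
rng = np.random.default_rng(1)

N_IP = 81
feature = np.zeros(N_IP)
feature[10:26] = 1.0                # hypothetical 16-pixel feature pattern

def degrade(xi, D, rng):
    """Return xi(FX_D): swap D% of on-pixels for off-pixels (noise)."""
    on = np.flatnonzero(xi)
    off = np.flatnonzero(xi == 0)
    k = int(round(len(on) * D / 100.0))
    out = xi.copy()
    out[rng.choice(on, k, replace=False)] = 0.0   # drop part of the signal
    out[rng.choice(off, k, replace=False)] = 1.0  # add equal-sized noise
    return out

eps = 0.2                           # stimulus intensity for learning trials
stim = degrade(feature, D=20, rng=rng)
I_ext = eps * stim                  # I_p,i(t) = eps * xi_i(FX_D)
```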
The ongoing spontaneous neuronal activity enables the OP network to respond quickly to an applied stimulus, activating the cell assembly corresponding to the stimulus. After the stimulus is switched off, the network returns to the ongoing spontaneous state. The dynamic properties of the ongoing spontaneous state and its relevance to neuronal information processing have been investigated in detail (Hoshino, Kashimori, & Kambara, 1996, 1998; Hoshino, Usuba, Kashimori, & Kambara, 1997; Hoshino et al., 2001; Hoshino et al., 2002). I chose 0 for the initial synaptic values, w_pp,ij^ex(t) = w_pp,ij^ih(t) = 0, in equations 2.2, 2.6, and 2.7. The reason I made no connections at t = 0 is to accelerate the learning process, by which the stable dynamic cell assemblies that represent the sensory features can be created as long-term memories. Nevertheless, the same result would be obtained with other initial synaptic strengths (e.g., random values for sparse connections), though the time required to create the stable dynamic cell assemblies would be prolonged due to the reconstruction of the synaptic structure.

3 Results

3.1 Neuronal Behavior. I stimulated the IP network repeatedly with a degraded stimulus (F2_20; see Figure 1b) and recorded neuronal responses, where the rates of synaptic modification were set to α_ex = 10.0 and α_ih = 0.2. Figure 3a shows the raster plots of action potentials of the OP pyramidal neurons that have specific sensitivity to feature FX. During perceptual learning (marked by L), synaptic connections between the OP pyramidal neurons were modified according to equations 2.6 and 2.7. After perceptual learning, the same stimulus (F2_20) was presented (marked by P), with no synaptic modulation. I counted the number of neuronal spikes in a time window of 10 msec. Figures 3b and 3c show the total spike counts for the neurons that are sensitive to features F2 and F1, respectively.

As the network goes through the five trials of perceptual learning, the activity of the F2-sensitive neurons increases, while that of the other neurons (e.g., the F1-sensitive neurons of Figure 3c) does not change or decreases slightly. The increase in neuronal activity arises from the positive synaptic enhancement between F2-sensitive neurons, according to equation 2.6. Since the activity of the F2-sensitive neurons prevails during perceptual learning, the activities of the other neurons, that is, the FX (X ≠ 2)-sensitive neurons, tend to be suppressed by inhibitory connections from the F2-sensitive neurons. Note that these inhibitory connections were established across the cell assemblies during the feature-memorization process (see Figure 2b). I assessed the cognitive performance of the OP network as a function of the number of trials of perceptual learning. The simulation is the same as that
Figure 3: (a) Raster plots of action potentials of the OP cell assemblies that have specific sensitivity to feature FX (X = 1, 2, 3, 4, 5). In perceptual learning, the IP network was stimulated repeatedly with the degraded stimulus F2_20 (L), during which the synaptic connections between the OP pyramidal neurons were modified according to equations 2.6 and 2.7. After perceptual learning, the same stimulus was presented (P), with no synaptic modulation. (b, c) Total spike counts within a bin size of 10 msec for the OP neurons that are sensitive to feature (b) F2 or (c) F1. (d) Cognitive performance of the OP network as a function of the number of trials (up to 20) of perceptual learning. The sensitivity of an F2-sensitive neuron (open circles) and an F1-sensitive neuron (filled circles) to the stimulation (P) was measured.
in Figure 3a except that the stimulus (F2_20) was presented up to 20 times (not shown). The firing rate was measured for an F2-sensitive and an F1-sensitive neuron. Note that the F1-sensitive neuron is irrelevant to the applied feature (F2). As shown in Figure 3d, the sensitivity of the F2-sensitive neuron gradually increases as the trials proceed (open circles) and saturates at a certain value, ~170 Hz. This result implies that cognitive performance saturates at some level (Adini, Sagi, & Tsodyks, 2002). The key parameters that determine the maximal firing rate (~170 Hz) are the time constant (τ_p^OP) and resting potential (u_p,rest^OP) of pyramidal neurons (see equation 2.2). The minimal firing rate (~3 Hz) of the F1-sensitive neuron (filled circles) is due to strong suppression by the active F2-sensitive neurons via inhibitory synapses, which have also been enhanced by perceptual learning (see equation 2.7). When synaptic modulation took place even during interstimulus intervals, the sensitivity of the network to the stimulus (F2_20) deteriorated (not shown). The decrease in synaptic strength due to natural decay, which is determined by the synaptic time constant τ_w in equation 2.6, is responsible for this deterioration, weakening the synaptic connections among F2-sensitive neurons. A longer synaptic time constant should be chosen to prevent such deterioration. Note that the synaptic connections between F2-sensitive neurons are unlikely to be enhanced during the interstimulus intervals, because coincident action potentials are then rare.

3.2 Synaptic Balance and Perceptual Performance. To investigate the relevance of the balance between excitatory and inhibitory connections to perceptual learning, I varied the ratio α_ex/α_ih of the rates of excitatory and inhibitory synaptic modification during perceptual learning (see equations 2.6 and 2.7).
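The performance measures used here (spike counts in 10-msec windows, firing rates, and their signal-to-noise ratio) can be sketched with synthetic spike trains standing in for the recorded rasters; the train periods below are chosen only to reproduce rates of the same order as those reported:

```python
import numpy as np

# Sketch of the performance measure: firing rates from 1-msec-resolution
# spike trains, spike counts in 10-msec windows (as in Figures 3b-c), and
# the signal-to-noise ratio (relevant over irrelevant firing rate).
# Deterministic stand-in trains, not data from the simulations.
T, bin_width = 1000, 10                      # msec of recording, bin size

f2_spikes = (np.arange(T) % 6 == 0)          # ~167 Hz: F2-sensitive neuron
f1_spikes = (np.arange(T) % 250 == 0)        # 4 Hz: suppressed F1 neuron

def rate_hz(spikes):
    """Mean firing rate of a 1-msec-resolution 0/1 spike train."""
    return 1000.0 * spikes.mean()

def binned_counts(spikes, width=bin_width):
    """Spike counts in consecutive 10-msec windows."""
    return spikes.reshape(-1, width).sum(axis=1)

snr = rate_hz(f2_spikes) / rate_hz(f1_spikes)
```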
Figure 4 shows how neurons respond to stimulus F2_D after perceptual learning in three typical cases: α_ex/α_ih = 10/0.2 (see Figure 4a), 10/0.5 (see Figure 4b), and 20/0.5 (see Figure 4c). As shown at the left of Figure 4a, when the network is stimulated repeatedly, the activity of the F2-sensitive neurons increases (solid lines), while that of the other neurons (e.g., the F1-sensitive neurons indicated by the dotted lines) does not change for D < 30% and increases for D > 30%. This result may imply that signal but not noise changes at low noise levels and that both signal and noise enhancement occur at higher noise levels. I evaluated the performance of the model by the signal-to-noise ratio of neuronal activity, that is, the ratio of the firing rate of F2-sensitive neurons to that of F1-sensitive neurons. Note that both (feature-induced and noise-induced) neuronal activities include the ongoing spontaneous neuronal activity. As shown at the right of Figure 4a, the network performance can be improved if the noise level is low (D < 30%). The poor performance for higher-noise stimuli (D > 30%) is presumably due to the enhancement of the noise. As shown at the left of Figure 4b, when the rate of inhibitory synaptic modulation increases, the noise but not the signal tends to change. That
Figure 4: Dependence of neuronal responses and cognitive performance on the synaptic balance. The IP network is stimulated repeatedly with F2_20, where (a) α_ex/α_ih = 10/0.2, (b) α_ex/α_ih = 10/0.5, and (c) α_ex/α_ih = 20/0.5. Left: Firing rates of an F2-sensitive neuron (solid lines) and an F1-sensitive neuron (dotted lines). Right: Signal-to-noise ratio of neuronal activity. The circles, triangles, squares, diamonds, and asterisks indicate 1, 2, 3, 4, and 5 trials of perceptual learning, respectively.
is, only the noise decreases through perceptual learning. The noise reduction is not sufficient (dotted lines) when the network is stimulated at higher noise levels (D > 40%), and therefore the signal-to-noise ratio deteriorates, as shown at the right of Figure 4b. Figure 4c shows the case in which both signal enhancement and noise reduction take place. One notable finding is that the synaptic balance set by α_ex/α_ih = 10/0.5 (see Figure 4b) provides good performance in lower-noise environments (D < 30%), while the balance set by α_ex/α_ih = 20/0.5 (see Figure 4c) can still enhance performance in higher-noise environments (D > 30%). These results imply that the balance between excitatory and inhibitory connections shapes the neuronal responses to both components (signal and noise) and therefore the network performance. To investigate the precise relationship between the synaptic balance and neuronal behaviors, I varied the ratio α_ex/α_ih finely and recorded neuronal responses. Figure 5 shows the dependence of neuronal behaviors on the synaptic balance (α_ex/α_ih). Figures 5a and 5b show, respectively, the isocontours of sensitivity changes (% change in firing rate) of an F2-sensitive and an F1-sensitive neuron after perceptual learning (see time = 22,000–23,000 in Figure 3a). The subdomains of Figure 5c delineate the parameter spaces that produce five distinct neuronal behaviors: signal enhancement without noise change (I), noise reduction without signal change (II), signal enhancement and noise reduction (III), signal and noise enhancement (IV), and signal and noise reduction (V). Although subdomains I and II should strictly lie on the 0% contour lines (filled arrows), the two dotted regions roughly indicate the parameter spaces within which these two behaviors (signal enhancement without noise change, and noise reduction without signal change) are easily obtained, because of the sparseness of the isocontour lines.
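The five regimes of Figure 5c amount to a classification of the post-learning changes in signal (relevant-neuron) and noise (irrelevant-neuron) activity. The tolerance used below to decide "no change" is an assumption for illustration, since in the paper regions I and II strictly lie on the 0% contours:

```python
# Hedged sketch mapping (% change in signal, % change in noise) after
# perceptual learning to the five regions of Figure 5c. The tolerance
# `tol` is illustrative, not a value from the paper.

def regime(d_signal, d_noise, tol=1.0):
    """Classify post-learning firing-rate changes into regions I-V."""
    sig_up = d_signal > tol
    sig_flat = abs(d_signal) <= tol
    noise_up = d_noise > tol
    noise_flat = abs(d_noise) <= tol
    if sig_up and noise_flat:
        return "I: signal enhancement, no noise change"
    if sig_flat and d_noise < -tol:
        return "II: noise reduction, no signal change"
    if sig_up and d_noise < -tol:
        return "III: signal enhancement and noise reduction"
    if sig_up and noise_up:
        return "IV: signal and noise enhancement"
    if d_signal < -tol and d_noise < -tol:
        return "V: signal and noise reduction"
    return "boundary/unclassified"
```

For example, a large signal increase with a simultaneous noise decrease falls in region III, the regime the text identifies as combining both strategies.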
As revealed by Figures 4b and 4c, the synaptic balances expressed by regions II and III are quite effective for improving network performance, increasing the signal-to-noise ratio. An interesting finding is that noise reduction alone can improve cognitive performance more than signal enhancement alone.

3.3 Perceptual Learning in Higher Sensory Systems. In the above simulations, I assumed an early sensory system, where neurons are feature specific. That is, the neurons that respond to a certain feature stimulus have no sensitivity to another feature. However, perceptual learning is not limited to early sensory systems; it takes place in higher sensory systems as well, where neurons may have complex receptive fields and therefore tend to respond to multiple sensory features. I extend the simulation study to investigate how perceptual learning proceeds if the network contains bimodal neurons. I modified the neural network model as shown in Figure 6a, where the F2- and F4-sensitive IP neurons project to (dotted lines) the same group of OP neurons (the dotted circle). When the IP network is stimulated with feature
Figure 5: Dependence of neuronal behaviors on the synaptic balance α_ex/α_ih. (a) Isocontours of sensitivity changes (% change in firing rate) of an F2-sensitive neuron after perceptual learning (see time = 22,000–23,000 in Figure 3a). (b) Isocontours for an F1-sensitive neuron. (c) Overall neuronal behavior. I: signal enhancement without noise change; II: noise reduction without signal change; III: signal enhancement and noise reduction; IV: signal and noise enhancement; V: signal and noise reduction. Signal enhancement without noise change and noise reduction without signal change are easily obtained within the dotted regions, because of the sparseness of the isocontour lines.
Figure 6: A neural network model that contains two types (single modal and bimodal) of neurons. The F2- and F4-sensitive IP neurons project to (dotted lines) the bimodal OP neurons (dotted circle). The bimodal F2-4 OP neurons are sensitive to both (F2 and F4) features.
F2, the OP neurons indicated by F2 and F2-4 are activated simultaneously. Note that "F2-4" denotes the specific OP neurons indicated by the arrow labeled "F2-4" in Figure 6 and means that these neurons are sensitive to both F2 and F4. Similarly, when the IP network is stimulated with feature F4, the OP neurons indicated by F4 and F2-4 are activated simultaneously. Thus, the F2 and F4 OP neurons are single modal, while the F2-4 OP neurons are bimodal. The five sensory features are sequentially memorized (time = 1000–5500 in Figure 7a) through Hebbian learning. Since the F2-4 bimodal neurons are sensitive to both features, they are activated by F2 (time = 2000–2500) and by F4 (time = 4000–4500). Note that the F2 neurons are also activated by F4 stimulation (see time = 4000–4500). This arises from an indirect association process: the F4 stimulation (time = 4000–4500) directly activates the F4 and F2-4 OP neurons, which then activate the F2 OP neurons by associating F2 with F2-4. Such an associative property between F2 and F2-4 was established during time = 2000–2500. When the network is stimulated with F2 (time = 8000–9000), the F2-sensitive neurons and the bimodal F2-4 neurons are strongly activated, while the F4 neurons are weakly activated. Similarly, F4 stimulation strongly activates the F4 and F2-4 neurons (time = 12,000–13,000), while the F2 neurons are weakly activated. Figures 7b, 7c, and 7d, respectively, show how the
O. Hoshino
Figure 7: (a) Raster plots of firing pulses of the OP cell assemblies during the feature memorization period (time = 1000–5500) and the stimulation periods with F2 (time = 8000–9000) and F4 (time = 12,000–13,000). Each bar indicates the period of stimulation with feature FX. Total spike counts within a bin size of 10 msec for the OP neurons that are sensitive to feature (b) F2, (c) F4, or (d) F2-4.
F2, F4, and F2-4 neurons respond to the F2 (time = 8000–9000) or F4 (time = 12,000–13,000) stimulation, where the numbers of neuronal spikes that occur in a time window of 10 msec were counted. The stronger response of the F2 (F4) neurons during F2 (F4) stimulation arises from the direct activation of these neurons by the F2 (F4) stimulus, while the weaker response of the F4 (F2) neurons during F2 (F4) stimulation arises from the indirect activation of these neurons. The indirect activation is mediated by the F2-4 bimodal neurons. Note that the weak activation of the F4 neurons (time = 8000–9000) and the F2 neurons (time = 12,000–13,000), which seems irrelevant to the applied F2 and F4 features, respectively, might be considered a kind of noise. I show in Figure 8 how perceptual learning can reduce that noise signal. Before perceptual learning, the F2 stimulus activates the relevant F2 and F2-4 neurons and also the irrelevant F4 neurons (time = 8000–9000). After perceptual learning, during which the feature F2 is presented three times (time = 12,000–12,500, 14,000–14,500, 16,000–16,500), the responsiveness of the F4 neurons decreases, while that of the F2 neurons remains vigorous (time = 20,000–21,000). This result indicates that the noise activity of the F4 neurons has been suppressed and therefore that the cortical representation has been reorganized. Figure 9a shows that cortical reorganization. After perceptual learning, the F4-sensitive neurons do not respond to feature F2, thereby reducing the number of neurons that are sensitive to feature F2. Such a reduction in the number of feature-sensitive neurons has in fact been reported in experiments on perceptual learning. I discuss this issue in section 4.1. It is interesting to ask how the F4 neurons have been removed from the cortical representation. Before perceptual learning, the F2, F4, and F2-4 neurons simultaneously respond to the F2 stimulation (see time = 8000–9000 of Figure 8).
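The question of why some cell assemblies merge and others do not is addressed below with cross-correlation functions of spike times (Figures 9b and 9c). As a rough sketch (not the paper's code; the spike times, bin width, and lag window below are illustrative assumptions), such a correlogram can be computed like this:

```python
import numpy as np

def cross_correlogram(spikes_a, spikes_b, max_lag=20.0, bin_ms=1.0):
    """Histogram of time lags (b - a) between spike pairs within max_lag msec.

    spikes_a, spikes_b: 1-D arrays of spike times in msec.
    A peak near a small positive lag indicates near-coincident firing;
    a trough near zero lag indicates anticorrelation.
    """
    lags = []
    for t in spikes_a:
        nearby = spikes_b[(spikes_b >= t - max_lag) & (spikes_b <= t + max_lag)]
        lags.extend(nearby - t)
    edges = np.arange(-max_lag, max_lag + bin_ms, bin_ms)
    counts, _ = np.histogram(lags, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, counts

# Two trains firing ~3 msec apart should give a peak near +3 msec.
a = np.arange(0.0, 1000.0, 50.0)   # reference train
b = a + 3.0                        # follower train, 3 msec lag
centers, counts = cross_correlogram(a, b)
print(centers[np.argmax(counts)])  # → 3.5 (the 1-msec bin covering +3 msec)
```

The same histogram applied to anticorrelated trains would show a dip rather than a peak around zero lag, which is the signature exploited in Figure 9c.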
It might be inferred from the simultaneous activation of the three cell assemblies (F2, F4, F2-4) that these cell assemblies are likely to merge into one large dynamic cell assembly through perceptual learning. That is, Hebbian learning, which takes place in the perceptual learning process, strengthens the synaptic connections between these neurons. This might result in the formation of one large dynamic cell assembly that consists of the F2, F4, and F2-4 neurons. However, the result was quite different: the F2 and F2-4 neurons, but not the F4 neurons, were included in the merged dynamic cell assembly, as revealed in Figure 8a (see time = 20,000–21,000) and as shown at the right of Figure 9a. To answer the question of why the F4 neurons were excluded from the merged dynamic cell assembly, I calculated cross-correlation functions of action potentials during the F2 stimulation period before perceptual learning (time = 8000–9000 of Figure 8). The F2 and F2-4 neuronal spikes coincide (the filled arrow of Figure 9b) with a small time lag (~3 msec), while the F4 and F2-4 neuronal spikes do not (Figure 9c). The disynaptic delay (the open arrow of Figure 9c) is too long to enhance their positive synaptic connections according to kernel K(s) (see Figure 2a). It seems that the time required to associate F4 with F2-4 might be the main cause of that delay and thus of the reduced coincidence, preventing excitatory synaptic enhancement between the F2-4 and F4 neurons. In addition to the prevention of excitatory synaptic enhancement, the anticorrelations within several milliseconds (see the filled arrows of Figure 9c) allow the inhibitory synaptic connections to be enhanced according to equation 2.7. Such a combinatorial effect (i.e., the prevention of excitatory synaptic enhancement and the enhancement of inhibitory synaptic connections) may be essential for the suppression of the nontrained feature (F4).

Figure 8: (a) Raster plots of firing pulses of the OP cell assemblies before, during, and after perceptual learning. Each bar indicates the period of stimulation with feature F2. L denotes perceptual learning. Before perceptual learning, the F2, F4, and F2-4 neurons are activated by F2 stimulation (time = 8000–9000). After perceptual learning, the F2 and F2-4 neurons, but not the F4 neurons, are activated by F2 stimulation (time = 20,000–21,000). Total spike counts within a bin size of 10 msec for the OP neurons that are sensitive to feature (b) F2, (c) F4, or (d) F2-4. αex/αih = 3.0/1.0.

Figure 9: (a) Reorganization of cortical representation through perceptual learning. Stimulation with F2 before (left) and after (right) perceptual learning. The activity levels of the OP neurons are indicated by a white-to-black scale. (b) A cross-correlation function of action potentials between an F2- and an F2-4-sensitive neuron during the F2 stimulation period before perceptual learning (time = 8000–9000 of Figure 8). The F2 and F2-4 neuronal spikes coincide (the filled arrow) with a small time lag (~3 msec). (c) A cross-correlation function for an F4- and an F2-4-sensitive neuron. The positive peak (the open arrow) reflects a disynaptic delay between the two neurons. The filled arrows indicate anticorrelated firings.

3.4 Relevance to Experimental Observations. When the inhibitory and excitatory synapses were modulated separately during perceptual learning, that is, inhibitory synaptic modulation followed by an excitatory one and vice versa, noise reduction and signal enhancement were processed separately (not shown). I obtained the same cognitive performance in both cases as that in Figure 8. Furthermore, I could not find any difference in cognitive performance even if the LTD part was removed from the kernel K(s) (not shown). These results imply that the balance between excitatory and inhibitory synapses, but not LTD, is crucial for perceptual learning. The lack of an LTD effect is presumably due to the small contribution of the LTD part compared to that of the LTP part of kernel K(s). In fact, as will be presented in the next paragraphs, LTD greatly affects the network performance if the LTD part of the kernel is allowed to spread wider than the LTP part. I made simulations for the bimodal case (see Figure 8a), in which the time window of the LTD part spreads wider than that of LTP. As the LTD part for excitatory synapses spreads (see the solid → dashed → dotted line of Figure 10a), neuronal activity (time = 20,000–21,000) for the nontrained feature (F4) tends to be suppressed (see Figures 10b →
10c → 10d, where αex/αih = 3.0/0.5). The suppression is presumably due to synaptic depression in the excitatory connections from the F2-4 to the F4 neurons, whose action potentials are anticorrelated (see Figure 9c). When the LTD part for inhibitory synapses spreads (see the solid → dashed → dotted line of Figure 11a), the neuronal activity (time = 20,000–21,000) for the nontrained feature (F4) is not suppressed but rather enhanced (see Figures 11b → 11c → 11d). The enhancement is presumably due to synaptic depression in the inhibitory connections from the F2-4 to the F4 neurons, which fire with anticorrelation (see Figure 9c). Concerning the roles of LTD, Feldman (2000) has experimentally demonstrated that long LTD windows can effectively depress postsynaptic action potentials that are uncorrelated with presynaptic action potentials. Song and Abbott (2001) have demonstrated by simulation that only if LTD dominates LTP will a faithful representation develop that remains stable. I also suggest here that the irrelevant neuronal activity (or noise) corresponding
Figure 10: Effects of LTD spread for excitatory synaptic modulation on bimodal perceptual learning. (a) LTD spread (solid → dashed → dotted line) and corresponding ((b) → (c) → (d)) raster plots of action potentials of the F2- and F4-sensitive neurons. Each solid bar indicates the period of stimulation with feature F2. L denotes perceptual learning. αex/αih = 3.0/0.5.
Figure 11: Effects of LTD spread for inhibitory synaptic modulation on bimodal perceptual learning. (a) LTD spread (solid → dashed → dotted line) and corresponding ((b) → (c) → (d)) raster plots of action potentials of the F2- and F4-sensitive neurons. Each solid bar indicates the period of stimulation with feature F2. L denotes perceptual learning. αex/αih = 3.0/0.5.
to the nontrained feature might effectively be suppressed if LTD dominates LTP for excitatory synapses. The membrane time constants τp^OP and τr^OP in equations 2.2 and 2.3 are critical parameters for the simulation and typically range between 10 and 100 msec (Dayan & Abbott, 2001). When these membrane time constants were reduced from 100 to 10 msec in the simulation of Figure 10b, the activity of the neurons responsible for the nontrained feature (F4) did not decrease (not shown), and therefore the suppression mechanism no longer worked. However, as shown in Figure 12a, if the inhibitory modulation rate (αih in equation 2.7) is increased from 0.5 to 1.0, suppression of the nontrained feature occurs even with such a small membrane time constant. Figure 12b shows a cross-correlation function of action potentials between an F4 (single-modal) and an F2-4 (bimodal) neuron during the F2 stimulation period before perceptual learning (time = 8000–9000 of Figure 12a). The positive peak (the open arrow) indicates that the F4 and F2-4 spikes coincide (within ~3 msec), which implies that the F4 neuronal activity is not suppressed but rather enhanced (see equation 2.6). However, the anticorrelation (the filled arrow) allows the inhibitory synaptic connections to be enhanced as well (see equation 2.7). This implies that the suppression mechanism could work if inhibitory modulation dominates excitatory modulation. I suggest that neuronal suppression of the nontrained feature might be possible for reasonable time constant values (5–10 msec), provided that the synaptic balance αex/αih is properly chosen. Concerning response times to sensory stimulation, it is well known that human subjects can respond to old (previously learned) words and nonwords more quickly (by tens of milliseconds) than to new (unlearned) words and nonwords (Haist, Musen, & Squire, 1991; Musen & Squire, 1993).
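The effect of letting the LTD window spread wider than the LTP window can be illustrated with a generic exponential STDP kernel. This is a hedged sketch: the paper's actual kernel K(s) (Figure 2a) is not reproduced in this excerpt, so the functional form, amplitudes, and time constants below are assumptions.

```python
import numpy as np

# Assumed generic STDP kernel with independent LTP and LTD time constants.
# Widening tau_ltd mimics the "LTD spread" manipulation of Figures 10 and 11.
def stdp_kernel(s, a_ltp=1.0, tau_ltp=10.0, a_ltd=0.5, tau_ltd=10.0):
    """Weight change for a spike-time difference s = t_post - t_pre (msec)."""
    return np.where(s >= 0,
                    a_ltp * np.exp(-s / tau_ltp),   # pre before post: LTP
                    -a_ltd * np.exp(s / tau_ltd))   # post before pre: LTD

s = np.linspace(-100.0, 100.0, 2001)
ds = s[1] - s[0]
for tau_ltd in (10.0, 30.0, 90.0):  # cf. solid -> dashed -> dotted of Fig. 10a
    net = stdp_kernel(s, tau_ltd=tau_ltd).sum() * ds  # net area under kernel
    print(f"tau_ltd={tau_ltd:5.1f} msec: net weight change {net:+.1f}")
```

With these assumed parameters, the net area under the kernel flips from positive to negative as the LTD window widens, consistent with the suggestion that effective noise suppression requires LTD to dominate LTP.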
Cave and Squire (1992) have demonstrated that human subjects can name old pictures more quickly (by tens of milliseconds) than new pictures. Figure 13, reconstructed from Figure 8, shows that the response time of the F2-sensitive neurons has been reduced in the shrunken cortical representation (see the right of Figure 9a), that is, after perceptual learning. In the expanded cortical representation (see the left of Figure 9a), before perceptual learning, the response latency of the neurons is ~60 msec from the onset of stimulation (top), while that for the shrunken cortical representation is ~25 msec (bottom). The reduction in response time (Δt = ~35 msec) in the shrunken cortical representation is presumably due to the stabilization of the dynamic cell assembly corresponding to feature F2, which is mediated by the enhancement of excitatory synaptic connections between the F2-sensitive neurons according to equation 2.6. This result may provide some insight into the neuronal mechanisms underlying the reduction of response times to sensory stimulation through perceptual learning.
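The latency comparison in Figure 13 amounts to measuring first-spike times relative to stimulus onset. A minimal sketch (the spike times below are illustrative values chosen to match the ~60 msec and ~25 msec latencies read off the figure, not simulation output):

```python
import numpy as np

def response_latency(spike_times, stim_onset, window=200.0):
    """First spike time after stimulus onset (msec), or None if no response.

    spike_times: sorted 1-D array of spike times for one neuron (msec).
    """
    after = spike_times[(spike_times >= stim_onset) &
                        (spike_times < stim_onset + window)]
    return float(after[0] - stim_onset) if after.size else None

# Illustrative spike times: ~60 msec latency before learning, ~25 msec after.
before = np.array([8060.0, 8075.0, 8090.0])
after_learning = np.array([8025.0, 8032.0, 8041.0])
dt = response_latency(before, 8000.0) - response_latency(after_learning, 8000.0)
print(dt)  # → 35.0
```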
Figure 12: Effects of membrane time constants on bimodal perceptual learning. (a) Raster plots of action potentials of the OP cell assemblies, where τp^OP and τr^OP were reduced from 100 to 10 msec, and the inhibitory modulation rate (αih in equation 2.7) was increased from 0.5 to 1.0. (b) A cross-correlation function of action potentials between an F4 (single-modal) and an F2-4 (bimodal) neuron during the F2 stimulation period before perceptual learning (time = 8000–9000). The positive peak (the open arrow) indicates that the F4 and F2-4 spikes coincide (within ~3 msec). The filled arrow indicates anticorrelation. αex/αih = 3.0/1.0.
4 Discussion

In this section, I discuss the functional significance of the synaptic balance in perceptual learning and then related simulation studies that have tried to clarify the neuronal mechanisms of perceptual learning based on synaptic plasticity. Finally, I discuss the assumptions that were made in this study.
Figure 13: Raster plots of action potentials of the F2-sensitive neurons during stimulation (F2) periods before (top) and after (bottom) perceptual learning (see Figure 8). The horizontal bars indicate stimulation periods. The response times of the F2-sensitive neurons are reduced by ~35 msec (Δt).
4.1 Significance of Synaptic Balance. As noted in section 1, there have been three theories of perceptual learning: signal enhancement, noise reduction, or both. I have focused my study on the relevance of the lateral excitatory-inhibitory synaptic balance to these theories. I have shown that the positive synaptic connections contribute to signal enhancement (see Figure 4a), while the negative ones contribute to noise reduction (see Figure 4b). I have shown in Figure 4c that signal enhancement and noise reduction take place if the synaptic balance falls within a certain range (III of Figure 5c), where the cognitive performance of the neural network was greatly improved through perceptual learning even at higher noise levels (deviation > 30% in Figure 4c). I suggest that the alteration of the synaptic balance may be essential when the brain tries to adopt the most suitable strategy—signal enhancement, noise reduction, or both—for a given perceptual task. Concerning the relevance of neuronal behavior to perceptual learning, two contradictory experimental results have been reported: an increase (Skrandies, Lang, & Jedynak, 1996; Schwartz et al., 2002; Karni & Bertini,
1997) or a decrease (Wiggs & Martin, 1998; Schiltz et al., 1999) in neuronal activity through perceptual learning. I consider that such a difference in neuronal activity may arise from variety in synaptic modulation, where the brain may select the appropriate synaptic balance from among the possible candidates (see Figure 5). Researchers (Jenkins, Merzenich, Ochs, Allard, & Guic-Robles, 1990; Recanzone, Schreiner, & Merzenich, 1993) have suggested that repeated sensory stimulation increases the number of cortical neurons relevant to the stimulus, while other researchers (Wiggs & Martin, 1998; Ghose, Yang, & Maunsell, 2002) have suggested a decrease in that number. The increase in the number of neurons, which means an expansion of the cortical representation of sensory information, might be established by the enhancement of excitatory synaptic connections, which adds new neuronal members to the original cortical representation. How, then, does the decrease in the number of neurons, or the shrinkage of cortical representation, occur? I consider that the cortical shrinkage may arise from the enhancement of lateral inhibitory connections, whereby highly redundant sensory signals might be transformed into a more efficient code. For example, Figure 9a shows that the irrelevant information about F4, an associative signal with the applied stimulus F2, has been completely removed after perceptual learning. For the removal of the F4 neurons from the cortical representation, the enhancement of the lateral inhibitory connections was crucial. I evaluated the performance of the network in terms of the signal-to-noise ratio of neuronal activity, which was defined as the ratio of feature-induced to noise-induced neuronal activity.
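The signal-to-noise measure just described can be sketched as the mean rate of neurons sensitive to the applied feature divided by the mean rate of the remaining neurons. The firing rates below are illustrative numbers, not simulation output, and the paper's exact definition may differ in detail.

```python
import numpy as np

def snr(rates, feature_idx):
    """Ratio of feature-induced to noise-induced neuronal activity (a sketch)."""
    mask = np.zeros(len(rates), dtype=bool)
    mask[feature_idx] = True
    signal = rates[mask].mean()    # feature-induced activity
    noise = rates[~mask].mean()    # noise-induced activity
    return signal / noise if noise > 0 else float("inf")

# Hypothetical rates: neurons 0-1 are F2/F2-4 sensitive, 2-3 are F4 sensitive.
rates_before = np.array([40.0, 38.0, 15.0, 14.0])  # F4 assembly leaks activity
rates_after = np.array([42.0, 40.0, 2.0, 1.5])     # F4 suppressed by learning
print(snr(rates_before, [0, 1]) < snr(rates_after, [0, 1]))  # → True
```

Suppressing the irrelevant (F4) activity raises the ratio even when the signal rates barely change, which is the "noise reduction" route to improved performance.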
To assess early perceptual learning, many cognitive performance measures have been proposed, such as detectability of motion directions (Ball & Sekuler, 1982), global orientations (Ahissar & Hochstein, 1993), wave gratings (Fiorentini & Berardi, 1980), object features (Ahissar & Hochstein, 1993), hyperacuity (Poggio et al., 1992), and texture segregation (Karni & Sagi, 1991). In all these cognitive processes, the performance of subjects has been greatly improved by perceptual learning. I believe that the signal-to-noise ratio of neuronal activity may be one of the fundamental neural codes for cognitive processing, which is presumably transferred to later cortical areas, such as prefrontal cortex or motor cortex, for making final decisions in given cognitive tasks.

4.2 Related Simulation Studies. Several neural network models have been proposed for perceptual learning. Weiss et al. (1993) made a model of Vernier hyperacuity. The model was a two-layered feedforward neural network. In the input network, the neurons that process signal information were affected by the activity of nearby neurons. The neuron of the output layer made a decision in orientation detection tasks. The researchers modified the feedforward synaptic connections and demonstrated that signal enhancement but not noise reduction occurred. I suggest that signal enhancement, noise reduction, or both is available even within single cortical networks when the brain alters the synaptic balance between positive and negative lateral connections. Peres and Hochstein (1994) made a recurrent neural network model for orientation detection tasks. The network consisted of hypercolumns, each of which contained orientation-sensitive neurons. The researchers demonstrated that the network performance was greatly improved if the synaptic ratio (intracolumnar excitatory connections)/(intercolumnar inhibitory connections) reached a certain value. The "synaptic balance scheme" that I propose here may support their result and may provide a unified understanding of the neuronal mechanism underlying perceptual learning. Herzog and Fahle (1998) proposed a three-layered neural network model for Vernier discrimination tasks. The researchers demonstrated that external feedback from another cortical region controlled a search process for visual stimuli and the speed of perceptual learning. In this study, I did not incorporate such a feedback contribution. I modeled an early sensory system, where bottom-up but not top-down signals seem to contribute to perceptual learning (Watanabe, Nanez, Koyama, Mukai, Liederman, & Sasaki, 2002). Adini et al. (2002) proposed a neural network model and demonstrated that perceptual learning enhanced the discrimination of visual contrast. The model consisted of excitatory (E) and inhibitory (I) neuronal subpopulations that were interconnected through both excitatory and inhibitory synaptic connections. The synaptic strengths were modulated according to the spike timing of E and I. The cognitive performance of the E subpopulation was improved if the inhibitory influence of I on E was reduced. The present neural network model has an interconnection between a pyramidal neuron and an interneuron, which is similar to that between the E and I subpopulations.
However, its functional role is quite different: the cognitive performance of the present model is improved by augmenting the excitatory influence among pyramidal neurons within cell assemblies, while that of the Adini et al. model is improved by reducing the inhibitory influence of I on E. This result may provide another plausible neuronal mechanism for perceptual learning. It might also be possible that the two distinct (inhibitory and excitatory) influences cooperate to improve overall cognitive performance. The model examined here has mutual inhibitory connections across cell assemblies. I have demonstrated that noise components can be reduced when the inhibitory connections are augmented, thereby improving the signal-to-noise ratio. The notable finding is that the alteration of the balance between excitatory and inhibitory synaptic connections is essential for improvement in overall cognitive performance. Another possible way of enhancing the signal-to-noise ratio is to increase the neuronal threshold for action potentials. This would also lead to the compactification of the neuronal representation when presented with repetitive
stimuli, because the "Eisberg" effect cuts off the noise and leaves only the salient activities (signal) alive. This might be the classic explanation of how the compactification and the enhancement of the signal-to-noise ratio work.

4.3 Assumptions. I assumed here the self-inhibition of pyramidal neurons via their accompanying interneurons. This self-inhibition is essential for the performance of the model presented here. If the pyramidal neurons do not inhibit themselves, the ongoing state of the network becomes stationary, and the current state of the network is determined by its previous state. Due to the robustness of the conventional associative neural network model without self-inhibition, the same firing pattern corresponding to F2 tended to appear even when presented with the different stimuli F2D (D = 0, 10, 20, 30, 40, or 50), where the noise contribution determined by D (see Figure 1b) was no longer reflected in neuronal activity (not shown). The self-inhibition was essential for mediating the ongoing spontaneous neuronal activity and thus for assessing the noise contribution to perceptual learning. The interneurons function to suppress the activities of the pyramidal neurons. This neuronal suppression allows the network state to escape from the basin of the point attractor in which the current network state is trapped, and thus enables the network to respond easily to an applied stimulus. (For details, see Hoshino et al., 1996, 1997, 1998, 2001, 2002.) Such self-inhibition by neighboring inhibitory interneurons has in fact been found in the visual cortex (Martin, 2002). To reduce total simulation time, I used "one-shot" learning, in which the five sensory features were presented only once in the memorization process (see Figure 2b). The intense stimulation, ε = 100 in equation 2.8, was necessary for creating the stable dynamic cell assemblies (or the point attractors) in the network dynamics in such a short time interval.
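The role of self-inhibition in preventing the network from latching into a point attractor can be caricatured with a two-variable rate model. This is not the paper's equations 2.2–2.3; the dynamics, parameters, and gains below are illustrative assumptions.

```python
import numpy as np

# Caricature: a pyramidal rate r is driven by its own recurrent excitation
# (w * r) and suppressed, with gain g, by a slower interneuron trace h that
# tracks r. With g = 0 the unit latches at a high fixed point (stationary
# state); with g > 0 the interneuron pushes activity back down, so the state
# can leave the attractor and remain responsive to new input.
def simulate(g, steps=400, dt=1.0, tau_r=10.0, tau_h=50.0, w=1.2, bias=0.05):
    r, h = 0.1, 0.0
    trace = []
    for _ in range(steps):
        r += dt / tau_r * (-r + np.tanh(w * r - g * h + bias))
        h += dt / tau_h * (-h + r)   # interneuron slowly tracks the rate
        trace.append(r)
    return np.array(trace)

no_inhib = simulate(g=0.0)    # latches near its high fixed point
with_inhib = simulate(g=2.0)  # repeatedly pulled back down
print(no_inhib[-1] > 0.5, with_inhib.mean() < no_inhib.mean())  # → True True
```

The contrast between the two runs mirrors the text: without self-inhibition the state is frozen by its own history, while with self-inhibition the activity stays mobile.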
In real situations, we are generally exposed repeatedly to sensory features or events over a longer period, and we store them as long-term memories. If I chose a weak intensity and let the model learn these features, similar dynamic cell assemblies would be created as well, though the simulation time would have to be prolonged. The anticorrelation learning rule that I have proposed here (see equation 2.7) was quite effective for both the construction and the modulation of inhibitory synapses. It allowed the inhibitory connections to the nonstimulated (inactive) neurons to grow (see Figure 2b), thereby successfully constructing the inhibitory synapses. Synaptic modulation according to the anticorrelation learning rule enabled the network to suppress a nontrained feature if it showed up (see Figure 8). Moreover, although noise is a rare event by definition, the anticorrelation learning rule enabled the network to reduce the noise effectively and therefore to enhance the signal-to-noise ratio, which seems difficult with another method of synaptic modulation, an inverted-correlation learning rule, as explained in the next paragraph. The inverted-correlation learning rule, obtained with the kernel −K(s) (see Figure 2a), whose plausibility might be supported by an experimental study
(Holmgren & Zilberter, 2001), can modulate the inhibitory synapses as well. Buchs and Senn (2002) successfully used the −K(s) kernel for learning direction selectivity, where one needs to modulate existing inhibitory synapses but not build them up. The inverted-correlation rule will only modify inhibitory connections between active neurons; it could not construct them, because that rule is based on the correlation between the active neurons. Furthermore, the inverted-correlation learning rule for inhibitory synapses tends to correlate synaptic activities and also to correlate signal and noise, and it is therefore less effective in increasing the signal-to-noise ratio due to the rareness of the noise event. I suggest that the anticorrelation learning rule could be an effective way of constructing and modulating the inhibitory synaptic connections.

Acknowledgments

I thank the anonymous referees for giving me valuable comments and suggestions on the earlier draft.

References

Adini, Y., Sagi, D., & Tsodyks, M. (2002). Context-enabled learning in the human visual system. Nature, 415, 790–793.
Ahissar, M., & Hochstein, S. (1993). Attentional control of early perceptual learning. Proc. Natl. Acad. Sci. USA, 90, 5718–5722.
Amit, D. J. (1998). Simulation in neurobiology: Theory or experiment? Trends Neurosci., 21, 231–237.
Ball, K., & Sekuler, R. (1982). A specific and enduring improvement in visual motion discrimination. Science, 218, 697–698.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modification in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Buchs, N. J., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. J. Comput. Neurosci., 13, 167–186.
Buonomano, D. V., & Merzenich, M. M. (1998). Cortical plasticity from synapses to maps. Annu. Rev. Neurosci., 21, 149–186.
Cave, C. B., & Squire, L. R. (1992).
Intact and long-lasting repetition priming in amnesia. J. Experimental Psychology, 18, 509–520.
Crist, R. E., Li, W., & Gilbert, C. D. (2001). Learning to see: Experience and attention in primary visual cortex. Nature Neurosci., 4, 519–525.
Dayan, P., & Abbott, L. F. (2001). Model neurons I: Neuroelectronics. In P. Dayan & L. F. Abbott (Eds.), Theoretical neuroscience (pp. 153–194). Cambridge, MA: MIT Press.
Dosher, B. A., & Lu, Z. L. (1998). Perceptual learning reflects external noise filtering and internal noise reduction through channel reweighting. Proc. Natl. Acad. Sci. USA, 95, 13988–13993.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fiorentini, A., & Berardi, N. (1980). Perceptual learning specific for orientation and spatial frequency. Nature, 287, 43–44.
Ghose, G. M., Yang, T., & Maunsell, J. H. (2002). Physiological correlates of perceptual learning in monkey V1 and V2. J. Neurophysiol., 87, 1867–1888.
Gilbert, C. D. (1994). Neural dynamics and perceptual learning. Curr. Biol., 4, 627–629.
Gilbert, C. D. (1996). Plasticity in visual perception and physiology. Curr. Opin. Neurobiol., 6, 269–274.
Gold, J., Bennett, P. J., & Sekuler, A. B. (1999). Signal but not noise changes with perceptual learning. Nature, 402, 176–178.
Goldstone, R. L. (1998). Perceptual learning. Annu. Rev. Psychol., 49, 585–612.
Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Haist, F., Musen, G., & Squire, L. R. (1991). Intact priming of words and nonwords in amnesia. Psychobiology, 19, 275–285.
Herzog, M., & Fahle, M. (1998). Modeling perceptual learning: Difficulties and how they can be overcome. Biol. Cybern., 78, 107–117.
Holmgren, C. D., & Zilberter, Y. (2001). Coincident spiking activity induces long-term changes in inhibition of neocortical pyramidal cells. J. Neurosci., 21, 8270–8277.
Hoshino, O., Inoue, S., Kashimori, Y., & Kambara, T. (2001). A hierarchical dynamical map as a basic frame for cortical mapping and its application to priming. Neural Comput., 13, 1781–1810.
Hoshino, O., Kashimori, Y., & Kambara, T. (1996). Self-organized phase transitions in neural networks as a neural mechanism of information processing. Proc. Natl. Acad. Sci. USA, 93, 3303–3307.
Hoshino, O., Kashimori, Y., & Kambara, T. (1998). An olfactory recognition model based on spatio-temporal encoding of odor quality in olfactory bulb. Biol. Cybern., 79, 109–120.
Hoshino, O., Usuba, N., Kashimori, Y., & Kambara, T. (1997). Role of itinerancy among attractors as dynamical map in distributed coding scheme. Neural Networks, 10, 1375–1390.
Hoshino, O., Zheng, M. H., & Kuroiwa, K. (2002). Roles of dynamic linkage of stable attractors across cortical networks in recalling long-term memory. Biol. Cybern., 88, 163–176.
Jenkins, W. M., Merzenich, M. M., Ochs, M. T., Allard, T., & Guic-Robles, E. (1990). Functional reorganization of primary somatosensory cortex in adult owl monkeys after behaviorally controlled tactile stimulation. J. Neurophysiol., 63, 82–104.
Kandel, E. R. (1991). Perception of motion, depth, and form. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science (pp. 440–466). East Norwalk, CT: Appleton & Lange.
Karni, A., & Bertini, G. (1997). Learning perceptual skills: Behavioral probes into adult cortical plasticity. Curr. Opin. Neurobiol., 7, 530–535.
Karni, A., & Sagi, D. (1991). Where practice makes perfect in texture discrimination: Evidence for primary visual cortex plasticity. Proc. Natl. Acad. Sci. USA, 88, 4966–4970.
Logothetis, N. (1998). Object vision and visual awareness. Curr. Opin. Neurobiol., 8, 536–544.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–216.
Martin, J. H. (1991). The collective electrical behavior of cortical neurons: The electroencephalogram and the mechanisms of epilepsy. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science (pp. 777–791). East Norwalk, CT: Appleton & Lange.
Martin, K. A. C. (2002). Microcircuits in visual cortex. Curr. Opin. Neurobiol., 12, 418–425.
McLaren, I. (1994). Representation development in associative systems. In J. Hogan & J. Bolhuis (Eds.), Causal mechanisms of behavioral development (pp. 377–402). Cambridge: Cambridge University Press.
McLaren, I., Kaye, H., & Mackintosh, N. (1988). An associative theory of the representation of stimuli: Applications to perceptual learning and latent inhibition. In R. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology (pp. 102–130). Oxford: Clarendon.
Musen, G., & Squire, R. (1993). On the implicit learning of novel associations by amnesic patients and normal subjects. Neuropsychol., 7, 19–135.
Peres, R., & Hochstein, S. (1994). Modeling perceptual learning with multiple interacting elements: A neural network model describing early visual perceptual learning. J. Comput. Neurosci., 1, 323–338.
Poggio, T., Fahle, M., & Edelman, S. (1992). Fast perceptual learning in visual hyperacuity. Science, 256, 1018–1021.
Recanzone, G. H., Schreiner, C. E., & Merzenich, M. M. (1993). Plasticity in the frequency representation of primary auditory cortex following discrimination training in adult owl monkeys. J. Neurosci., 13, 87–103.
Rolls, E.
T., & Baylis, L. L. (1994). Gustatory, olfactory, and visual convergence within the primate orbitofrontal cortex. J. Neurosci., 14, 5437–5452.
Sakurai, Y. (1996). Hippocampal and neocortical cell assemblies encode memory processes for different types of stimuli in the rat. J. Neurosci., 158, 181–184.
Schiltz, C., Bodart, M., Dubois, S., Dejardin, S., Michel, C., Roucoux, M., Crommelinck, M., & Orban, G. A. (1999). Neuronal mechanisms of perceptual learning: Changes in human brain activity with training in orientation discrimination. NeuroImage, 9, 46–62.
Schwartz, S., Maquet, P., & Frith, C. (2002). Neural correlates of perceptual learning: A functional MRI study of visual texture discrimination. Proc. Natl. Acad. Sci. USA, 99, 17137–17142.
Skrandies, W., Lang, G., & Jedynak, A. (1996). Sensory thresholds and neurophysiological correlates of human perceptual learning. Spat. Vis., 9, 475–489.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 339–350.
Watanabe, T., Nanez, J. E., Koyama, S., Mukai, I., Liederman, J., & Sasaki, Y. (2002). Greater plasticity in lower-level than higher-level visual motion processing in a passive perceptual learning task. Nature Neurosci., 5, 1003–1009.
Weiss, Y., Edelman, S., & Fahle, M. (1993). Models of perceptual learning in Vernier hyperacuity. Neural Comput., 5, 695–718.
Wiggs, C. L., & Martin, A. (1998). Properties and mechanisms of perceptual priming. Curr. Opin. Neurobiol., 8, 227–233.

Received February 7, 2003; accepted September 5, 2003.
LETTER
Communicated by Wulfram Gerstner
How the Shape of Pre- and Postsynaptic Signals Can Influence STDP: A Biophysical Model

Ausra Saudargiene
[email protected]
Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland, and Department of Informatics, Vytautas Magnus University, Kaunas, Lithuania
Bernd Porr
[email protected]
Florentin Wörgötter
[email protected]
Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland
Spike-timing-dependent plasticity (STDP) is described by long-term potentiation (LTP), when a presynaptic event precedes a postsynaptic event, and by long-term depression (LTD), when the temporal order is reversed. In this article, we present a biophysical model of STDP based on a differential Hebbian learning rule (ISO learning). This rule correlates presynaptically the NMDA channel conductance with the derivative of the membrane potential at the synapse as the postsynaptic signal. The model is able to reproduce the generic STDP weight change characteristic. We find that (1) the actual shape of the weight change curve strongly depends on the NMDA channel characteristics and on the shape of the membrane potential at the synapse. (2) The typical antisymmetrical STDP curve (LTD and LTP) can become similar to a standard Hebbian characteristic (LTP only) without having to change the learning rule. This occurs if the membrane depolarization has a shallow onset and is long lasting. (3) It is known that the membrane potential varies along the dendrite as a result of the active or passive backpropagation of somatic spikes or because of local dendritic processes. As a consequence, our model predicts that learning properties will be different at different locations on the dendritic tree. In conclusion, such site-specific synaptic plasticity would provide a neuron with powerful learning capabilities.

1 Introduction

Hebbian (correlation-based) learning requires that pre- and postsynaptic spikes arrive within a certain small time window, which leads to an increase of the synaptic weight (Hebb, 1949). Originally it had been supposed that the temporal order of both signals is irrelevant (Bliss & Lømo, 1970, 1973;

Neural Computation 16, 595–625 (2004)
© 2004 Massachusetts Institute of Technology
Bliss & Gardner-Medwin, 1973). However, rather early, first indications arose that temporal order is indeed important (Levy & Steward, 1983; Gustafsson, Wigström, Abraham, & Huang, 1987; Debanne, Gähwiler, & Thompson, 1994). Later, this was termed spike-timing-dependent plasticity (STDP), which refers to the observation that many synapses will decrease in strength when the postsynaptic signal precedes the presynaptic signal (defined here as T < 0), while they will grow if the temporal order is reversed (thus, T > 0) (Markram, Lübke, Frotscher, & Sakmann, 1997; Magee & Johnston, 1997; Bi & Poo, 2001). T denotes the temporal interval between post- and presynaptic signals (T := t_post − t_pre). This leads to the characteristic antisymmetrical weight change curve measured by several groups. Antisymmetrical learning curves were first observed in the entirely different context of classical conditioning models requiring much slower timescales. In their seminal study, Sutton and Barto (1981) introduced a learning rule based on the temporal difference between subsequent output signals, and they observed that this rule leads to inhibitory conditioning when the temporal order of conditioned and unconditioned stimulus is reversed. In their hands, this was an unwanted effect, because inhibitory conditioning is only very rarely observed in experiments (Prokasy, Hall, & Fawcett, 1962; Mackintosh, 1974, 1983; Gormezano, Kehoe, & Marshall, 1983). It gave, however, a hint that this class of "differential" algorithms would in general produce antisymmetrical learning curves. The TD learning rule (Sutton, 1988) also belongs to this class of algorithms. Accordingly, Rao and Sejnowski (2001) successfully used the TD algorithm to implement STDP. In the original TD learning rule, one specific signal is used as a dedicated reward, which is treated differently from the other inputs.
Thus, Rao and Sejnowski (2001) had to change the TD rule to some degree in order to better adapt it to STDP (see also section 4). The differential Hebbian ISO learning rule, recently introduced by us (Porr & Wörgötter, 2003a) in the context of machine control (Porr & Wörgötter, 2003b), on the other hand, treats all input lines equivalently. This prompted us to ask whether the ISO rule could also be applied to spiking neurons in a biophysically more realistic model of STDP. The mechanisms that underlie STDP are associated with the biophysics of long-term potentiation (LTP) and long-term depression (LTD; Martinez & Derrick, 1996; Malenka & Nicoll, 1999; Bennett, 2000). This involves complex calcium dynamics and the concerted action of several enzymes, such as α-calcium-calmodulin-dependent protein kinase II (CaMKII; e.g., see Teyler & DiScenna, 1987). Several kinetic models have been developed (Senn, Markram, & Tsodyks, 2000; Castellani, Quinlan, Cooper, & Shouval, 2001; Shouval, Bear, & Cooper, 2002) in order to arrive at a better understanding of some of these aspects, and some models reach a relatively high level of biophysical and biochemical complexity. As a consequence, however, they contain many degrees of freedom. The question of STDP will be addressed here in the context of a single-compartment neuron model applying the ISO learning rule. We motivate
the use of this rule for modeling STDP by the fact that it is a differential Hebbian learning rule that correlates input and output signals and produces antisymmetrical weight change curves (Porr & Wörgötter, 2003a). On the right timescale, these properties are similar to STDP, such that the ISO learning rule should in principle be applicable in this context too. Currently it is generally assumed that backpropagating spikes provide the necessary postsynaptic signal that represents the temporal reference for the considered synapse. This view has recently been questioned (see Goldberg, Holthoff, & Yuste, 2002, for a review), and a stronger emphasis has been laid on local dendritic processes. Therefore, we will specifically investigate how different possible sources of postsynaptic depolarization, modeled by different shapes of the potential change, will influence learning. The central finding of this modeling exercise is that the ISO learning rule leads in a robust and generic way to STDP, while the shape of the input signals distinctively influences the shape of the weight change curve. We believe that this study may help to further our understanding of more complex (compartmentalized or kinetic) models because the question of how a certain STDP curve arises is reduced to the question of how the cellular parameters lead to the underlying input signal shapes.

2 Methods

2.1 Components of the Membrane Model. The model represents a small, nonspiking, dendritic compartment with synaptic connections that can take the shape of an AMPA or an NMDA characteristic. Thus, conductances g_A of AMPA and g_N of NMDA channels were modeled by state-variable equations:

g_i := g_A(t) = ḡ_A ĝ_A(t) = ḡ_A t e^(−t/t_peak)   (2.1)

g_i := g_N(t) = ḡ_N ĝ_N(t) = ḡ_N (e^(−t/τ₁) − e^(−t/τ₂)) / (1 + η [Mg²⁺] e^(−γV)).   (2.2)

This slightly more complex notation is used because we will need the normalized conductance time functions ĝ_A,N(t) on their own when introducing the learning rule. All equations used in this study are numerically evaluated in 0.1 ms time steps. V is the membrane potential. Peak conductances are given by the ḡ values: ḡ_A = 5.436 nS/ms, ḡ_N = 4 nS. The other parameters were t_peak = 0.5 ms, τ₁ = 40 ms, τ₂ = 0.33 ms, η = 0.33/mM, [Mg²⁺] = 1 mM, γ = 0.06/mV (Koch, 1999). Reversal potentials used were E_A = E_N = 0 mV.
The conventional membrane equation (equation 2.3) was used to determine the momentarily existing membrane potential:

C dV(t)/dt = Σ_i (ρ_i + Δρ_i) g_i(t) (E_i − V(t)) + (V_rest − V(t))/R,   (2.3)

with R = 100 MΩ, C = 50 pF, and V_rest = −70 mV (Koch, 1999). Here we have introduced synaptic weights ρ_i and their weight changes Δρ_i. This is done purely for convenience, because weights and peak conductances could also be combined multiplicatively. However, as we will see below, it makes sense to keep them separate, because the peak conductances ḡ can then serve as reference values for the growth (or shrinkage) of the synaptic weights. These equations were modeled using C++ in the Z-domain (Köhn & Wörgötter, 1998) in order to speed up simulations. A single synapse was assumed as the so-called plastic synapse (PS), on which the influence of the ISO learning rule was tested. This synapse can consist of varying NMDA and AMPA components, and we will call it ρ₁. Note that only the NMDA component drives the learning (see below); the AMPA component will only (mildly) influence the membrane potential, thereby possibly exerting a second-order influence on the learning. It will, however, turn out in the course of this study that the secondary AMPA influence is so small that it can be neglected in most cases. The plastic synapse receives the presynaptic spikes modeled as δ-function input to equations 2.1 and 2.2. The influence of the NMDA component of the plastic synapse on the membrane potential is dependent on the membrane's depolarization level. We assume in this model that this is determined by the postsynaptic activity, and we tested how different postsynaptic events influence the weight change curve. To this end, three cases will be discussed: a postsynaptic influence that takes the shape of (1) an AMPA response, (2) an NMDA response, and (3) a backpropagating spike (BP spike; see Figure 1A). We will call these influences the postsynaptic depolarization source (DS). Since we will treat these cases one by one, we can associate them with the same weight (i.e., amplitude factor) ρ₀. When using a BP spike, we have generally set ρ₀ = 1.
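Equation 2.3 can be integrated with a simple forward-Euler scheme in the paper's 0.1 ms time steps. A Python sketch for a single synaptic conductance with reversal potential 0 mV (the helper name and the rectangular test conductance are ours, not the paper's):

```python
def simulate_membrane(g_syn, rho=1.0, e_syn=0.0, dt=0.1, t_end=100.0):
    """Forward-Euler integration of eq. 2.3 for one synaptic conductance.

    g_syn: function mapping t (ms) to a conductance in nS.
    Returns the membrane potential trace in mV, one entry per time step.
    """
    R, C, V_REST = 100.0, 50.0, -70.0   # MOhm, pF, mV (values from the text)
    V, trace = V_REST, []
    for i in range(int(t_end / dt)):
        t = i * dt
        i_syn = rho * g_syn(t) * (e_syn - V)   # nS * mV = pA
        i_leak = 1000.0 * (V_REST - V) / R     # mV / MOhm = nA; *1000 -> pA
        V += dt * (i_syn + i_leak) / C         # pA / pF = mV/ms
        trace.append(V)
    return trace

# Hypothetical test input: a 2 ms, 5 nS rectangular conductance pulse at t = 10 ms
pulse = lambda t: 5.0 if 10.0 <= t < 12.0 else 0.0
```

Without input the trace stays at V_rest; with the pulse it depolarizes toward the 0 mV reversal potential and relaxes back with the 5 ms membrane time constant (RC = 100 MΩ × 50 pF).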
Technically this was achieved by triggering the depolarization source with another δ-pulse that was shifted by a temporal interval T in relation to the presynaptic event. Physiologically this is meant to be linked to the postsynaptic spike. Strictly, this association, however, is valid only for the BP spike, which is causally related to the postsynaptic spike. The other (AMPA- or NMDA-) depolarization source events need not arise from such a causal relation but can be associated with other independently converging influences from other synapses. The possibility for neuronal synchronization (Singer & Gray, 1995) with different lead or lag supports the possibility that clusters of other synapses could lead to the required depolarization. At this point, we note that it is not possible to rigorously define T in all
Figure 1: Schematic diagram of the model. (A) Components of the membrane model. The inset shows how to match the NMDA conductance function ĝ_N (see equation 2.2) with a resonator impulse response h₁. (B) Components of ISO learning. (C) Typical weight change curve obtained with ISO learning. Parameters to obtain this curve were Q = 0.51, f = 0.1 Hz.
instances. Experimentally, T is associated with the difference between pre- and postsynaptic spike. At the site of the plastic synapse, one would, however, expect a 1 to 2 ms larger T due to the delay in backpropagating the spike. If other clusters of synapses are driving the (heterosynaptic) plasticity, T would have to be defined as the interval between the cluster activity and that at the plastic synapse. Figure 1A shows the different modeled depolarization sources in the context of a single-compartment model. At the summation point, the membrane potential is determined by the three depicted DS influences (see equations 2.1–2.3) as well as by the influence that comes from the plastic synapse. For practical purposes, we define T as the difference between the events as it occurs at the site of the plastic synapse, thus neglecting possible delays that are currently experimentally unresolvable. One can assume that all active processes involved in generating BP spikes will cease at (or close to) the synaptic density. Thus, locally, only the electrotonic membrane properties will prevail. They are at the synaptic density determining the membrane potential, which in turn influences the state of the Mg²⁺ block at the NMDA channels and thus the Ca²⁺ influx, which enters the CaMKII second messenger chain (Teyler & DiScenna, 1987). Therefore,
we have decided to model the BP spike also through a conductance change g_BP at the summation point (see equation 2.4), which mimics the physiologically measured shapes of BP spikes without having to implement active processes (active channels). Note that the actual equations used do not have any physiological meaning; they are used only to design realistic backpropagating spike shapes:

g_BP(t) = ḡ_BP (1/(1 + e^(−t/τ_rise)) − 0.5/(1 + e^(−(t−τ_BP)/τ_fall)) − 0.5).   (2.4)

With the help of this equation, rising (τ_rise) and falling flanks (τ_fall), as well as the total width (τ_BP) of our backpropagating spikes, can be adjusted independently, while their amplitude is controlled by ḡ_BP. This allowed us to design different shapes of backpropagating spikes in a very specific way. BP spikes modeled according to equation 2.4 always have a typical shape. In order to cover transitory cases of a BP spike with a shape that is intermediate to the different shapes obtainable by equation 2.4, we used equation 2.5 (Krukowski & Miller, 2001):

g_BP(t) = ḡ_BP (f e^(−t/τ_a) + (1 − f) e^(−t/τ_b) − e^(−t/τ_c)).   (2.5)
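Both BP-spike generators can be written down directly. In the sketch below, the defaults for eq. 2.4 use the parameter set quoted for Figure 4A and those for eq. 2.5 the intermediate-shape set from the Figure 5 legend; clamping the conductance to zero for t < 0 is our convention for a spike triggered at t = 0:

```python
import math

def g_bp(t, g_bar=59.8, tau_rise=1.0, tau_fall=10.0, tau_bp=25.0):
    """Eq. 2.4: BP-spike conductance (nS); rise, fall, and width adjust independently."""
    if t < 0.0:
        return 0.0
    return g_bar * (1.0 / (1.0 + math.exp(-t / tau_rise))
                    - 0.5 / (1.0 + math.exp(-(t - tau_bp) / tau_fall))
                    - 0.5)

def g_bp_mixed(t, g_bar=10.9, f=0.9, tau_a=7.0, tau_b=30.0, tau_c=4.0):
    """Eq. 2.5: intermediate BP-spike shape (Krukowski & Miller, 2001)."""
    if t < 0.0:
        return 0.0
    return g_bar * (f * math.exp(-t / tau_a)
                    + (1.0 - f) * math.exp(-t / tau_b)
                    - math.exp(-t / tau_c))
```

With the Figure 4A parameters, g_bp rises within a few milliseconds, stays elevated over roughly the width τ_BP, and then relaxes back toward zero with τ_fall.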
Actual parameters for equations 2.4 and 2.5 shall be given in the figure legends. To calculate the membrane potential, we assumed in all cases E_BP = 0 mV. A paired pulse protocol was used to stimulate the inputs. The pulse interval between both inputs (defined as T) was varied between −50 and +50 ms. The interval between pulse pairs was T̄ = 250 ms in order to prevent second-order interactions between pulse pairs (steady-state condition).

2.2 Components of ISO Learning. Figure 1B shows the circuit diagram of rate-based ISO learning for only two (δ-pulse) inputs x₀, x₁ (for a more general description, see Porr & Wörgötter, 2003a). The inputs are first bandpass filtered by means of heavily damped resonators h defined by

h(t) = (1/b) e^(at) sin(bt),   (2.6)

with a := −πf/Q and b := √((2πf)² − a²), where f is the center frequency of the resonator and Q ≥ 0.5 the damping factor. Generally we used Q = 0.51 (Porr & Wörgötter, 2003a), and this strong damping leads essentially to a low-pass behavior. Elsewhere, we have discussed that this would suffice for learning, while using the equations for bandpass filters renders several advantageous mathematical properties (Porr & Wörgötter, 2003a). However, in the context of this study, the "bandpass" filters used are really rather low-pass filters. Such a bandpass (low-pass) filtering takes place in a generic way
at almost all membrane processes, and this allows us to easily associate the abstract operations h with more realistic cellular operations below. The transformed inputs u₀,₁ converge onto the learning unit with weights ρ₀,₁, and its output is given by

v(t) = ρ₀(t)u₀(t) + ρ₁(t)u₁(t), where u₀,₁(t) = x₀,₁(t) * h(t).   (2.7)

The * denotes a convolution. In this study, we keep the weight ρ₀ fixed. The other weight ρ₁ changes by the ISO learning rule, which uses the temporal derivative of the output:

dρ₁/dt = μ u₁(t) v′(t), μ ≪ 1.   (2.8)

In the original article, we had shown that ISO learning produces a linear weight change that can be calculated for all t ≥ 0 by solving

ρ₁ → ρ₁ + Δρ₁   (2.9)

Δρ₁(T) = μ ∫₀^∞ u₁(T + τ) v′(τ) dτ.   (2.10)
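Equations 2.6 to 2.10 can also be checked numerically for δ-pulse inputs: with two identical resonators, the computed Δρ₁(T) comes out positive for T > 0 and, antisymmetrically, negative for T < 0, matching the Figure 1C characteristic. A Python sketch under our own discretization choices (time unit, step size, and integration cutoff are ours):

```python
import math

def h(t, f=0.1, Q=0.51):
    """Eq. 2.6: impulse response of a heavily damped resonator (Q just above 0.5)."""
    if t < 0.0:
        return 0.0
    a = -math.pi * f / Q
    b = math.sqrt((2.0 * math.pi * f) ** 2 - a ** 2)
    return math.exp(a * t) * math.sin(b * t) / b

def iso_weight_change(T, rho0=1.0, rho1=0.5, mu=1.0, dt=0.01, t_end=40.0):
    """Eq. 2.10 for delta-pulse inputs: x1 (plastic weight) at t = 0,
    x0 (fixed weight) at t = T; v' is taken as a backward finite difference."""
    n = int(t_end / dt)
    u1 = [h(i * dt) for i in range(n)]
    v = [rho0 * h(i * dt - T) + rho1 * u1[i] for i in range(n)]  # eq. 2.7
    return mu * sum(u1[i] * (v[i] - v[i - 1]) for i in range(1, n))
```

The self-term ρ₁u₁u₁′ integrates to (almost) zero, so the cross-correlation between u₁ and the shifted u₀ dominates and yields the antisymmetrical curve.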
This integral can be solved analytically and leads to the ISO learning weight change curve shown in Figure 1C for two identical bandpass filters h. Note that this curve becomes skewed if two different bandpass filters are used, which indicates that the shapes of the input functions u are critical in determining the shape of the weight change curve.

2.3 Associating the Membrane Model to ISO Learning. We need to associate the parameters of the ISO learning rule with those in the membrane model for the plastic synapse ρ₁:

• x₁: We assume that the presynaptic spike train at the plastic synapse represents the signal x₁ of ISO learning.

• h₁: The bandpass filter operation h₁ is represented by the conductance functions g of the plastic synapse, and we define

h₁(t) := ĝ_N(t).   (2.11)

• u₁: Since we are only dealing with spike trains modeled as δ-functions, we get u₁(t) = h₁(t) = ĝ_N(t). Corresponding curves are shown in the inset of Figure 1A. This shows that the conductance function essentially captures the characteristic of low-pass filtering the spike at the input. The match between the curves, however, is not exact, immediately indicating that the results of the membrane model will not be identical to those of ISO learning.
• v: The membrane potential V is associated straightforwardly with the output function v from ISO learning. Note that, as opposed to the original linear ISO learning rule, the biophysically adapted version introduced here is no longer linear. The results presented later, however, will show that the adapted rule will still lead to generic antisymmetrical weight change curves. The nonlinearities introduced by the reversal potentials, as well as by the voltage dependence of the NMDA channel, influence the results only qualitatively.

• x₀: The signal x₀ from ISO learning is associated with three possible signals: with a spike arriving at input 1 (AMPA) or 2 (NMDA) or with a BP spike (3 in Figure 1A).

• u₀: It is not necessary to associate u₀ with any component of the membrane model because it does not influence the plastic synapse ρ₁ directly. Instead, this happens only via the derivative of the membrane potential.¹ Essentially, u₀ represents the conductance change for any of the three introduced depolarization sources (see equations 2.1–2.3 and Figure 1A).

¹ In a symmetrical learning situation (with ρ₀ changing also), one could associate u₀ with the corresponding conductances ĝ in a similar way as for u₁.

• ρ₀,₁: Synaptic weights had been directly introduced into the membrane equation (see equation 2.3). ρ₁ is the initial value of the synaptic weight of the plastic synapse. ρ₀ is used as the amplitude factor for a possible second synapse or the BP spike. Thus, ρ₀ defines the strength of the depolarization source.

• Learning rule: As a consequence of these settings, the learning rule of ISO learning is rephrased in the context of this model to

dρ₁/dt = μ u₁(t) v′(t) = μ ĝ_N(t) V′(t).   (2.12)

In section 4, we will address the question of the physiological relevance of the different parts of this learning rule and show how to associate pre- and postsynaptic events with the different terms. Here we only note that this rule in its derivative form can be associated with calcium flow through NMDA channels or, in its integrated form, with the calcium concentration (see section 4). Note that this rule is (as usual) treated in an adiabatic condition, assuming that multiple spike pairs (with interspike interval T), which occur with a temporal distance of T̄ between them, do not influence each other. Thus, for the actual weight change Δρ obtained with one
spike pair at the inputs, we use equation 2.10 and calculate

Δρ₁ = ∫₀^t (dρ₁/dt′) dt′,   (2.13)

where T ≪ t ≪ T̄. We call Δρ an integrated weight change. The learning rate μ takes the unit of Volt⁻¹, because this way Δρ is rendered unit free. To regard μ as a voltage-dependent entity may make sense given the observation that predepolarization enhances the induction of LTP and vice versa (Sourdet & Debanne, 1999). The physiological meaning behind the concept of a synaptic weight is still under debate. Several pre- and postsynaptic mechanisms contribute to the weight (Malenka & Nicoll, 1999), which in this study are subsumed under a single number ρ. However, using these settings, ρ can be treated as a multiplicative factor of the peak conductance ḡ; multiplied together, they can be interpreted as the strength of a given connection in the context of this model. At this point, it is important to note that the final value of μ is rather arbitrary, because it is just a multiplicative factor that changes the slope of the (linear, see below) learning curve. In physiology, there exist (intracellular) amplification mechanisms that could in principle be associated with such a μ-factor. This does not make sense in the context of this model because such mechanisms are not implemented. Thus, we will set the value of μ = 1 and provide an analysis of the range of μ within which the model operates linearly (see Figure 7).

3 Results

In this section, we present results obtained when using a pure NMDA synapse as the plastic synapse. At the end of this section, we discuss the physiologically more realistic case of a mixed AMPA/NMDA synapse, showing how to infer the corresponding results from what we have presented before. We use the three different sources for a postsynaptic depolarization introduced in section 2: a BP spike (3 in Figure 1A), which is currently believed to be the most likely source of depolarization, but also a pure AMPA influence (1 in Figure 1A) or a pure NMDA influence (2 in Figure 1A).
The goal of this section is to distinguish unrealistic from more realistic cases and to arrive at some conclusions concerning the robustness of the obtained results. In all cases, we set the relative initial strength of the plastic synapse to ρ₁ = 0.5, which means that the synaptic weight of this connection was initially at 0.5 ḡ_N. The weight ρ₀, usually set to 1, is kept constant.

3.1 Individual Weight Change Examples. Figure 2A shows the conductance g_N, Figure 2B the membrane potential, and Figure 2C its derivative,
Figure 2: Detailed curves for a single pulse pair experiment with T = 10 ms and a BP spike as the depolarization source, ρ₀ = 1. (A) Conductance change g_N arising from the presynaptic input and, as a consequence of the BP spike, leading to positive feedback at the e^(−γV) term in equation 2.2. (B) Membrane potential change. (C) Derivative of the membrane potential. (D) Resulting integrated weight change. The weight stabilizes as soon as all curves have returned to their equilibrium.
and Figure 2D the development of the plastic synapse ρ₁ for a single input pulse pair, with the first spike arriving at t = 10 ms at the plastic synapse (presynaptic spike) and the BP spike arriving at t = 20 ms. Thus, we have a positive value of T = +10 ms. The initial membrane potential was at resting level (−70 mV) for this simulation. The plastic synapse was assumed to be a pure NMDA synapse. The small increase in the NMDA conductance g_N (see Figure 2A) starting at t = 10 ms is caused by a spike at the plastic synapse. The following large peak is due to the rising membrane potential as soon as the BP spike arrives at 20 ms. The membrane potential (see Figure 2B) increases slightly at t = 10 ms because of the activated NMDA channel and is dominated by the BP spike later. The upper part of the membrane potential curve (see Figure 2B) is not shown in order to make the small NMDA channel response more visible. An integrated weight change Δρ₁ (see Figure 2D) occurs throughout the duration of the membrane potential excursion. It follows the rule given in
Figure 3: Detailed curves for a single pulse pair experiment with T = −10 ms. Panels are the same as those in Figure 2; only in this case, a negative integrated weight change is obtained (D).
equation 2.12, integrated according to equation 2.13, and the weight finally stabilizes slightly above zero as soon as the membrane potential returns to resting. The obtained change is small because only a single pulse pair was used. Essentially the opposite situation is observed when inverting the pulse sequence to T = −10 ms (see Figure 3). The negative excursion of the derivative of the membrane potential (C) is now scaled with the full g_N influence (A), leading to a strong drop of the integral and finally to a reduced weight ρ₁ at steady state (D). Figure 4A shows complete weight change curves obtained at different resting potentials V_rest = −40 to −70 mV. We observe that the shape of the curves remains essentially the same, while the magnitude of the weight change grows slightly when the membrane potential is depolarized. This is in accordance with observations that the amplitude of LTP can be augmented by predepolarizing the cell under study, as a consequence of the voltage dependence of the NMDA channel (Sourdet & Debanne, 1999). Part B of Figure 4 has been obtained with a second synapse as the depolarization source and shall be discussed later.
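The single pulse pair experiments of Figures 2 and 3 combine equations 2.2 to 2.4 with the learning rule of equations 2.12 and 2.13. The Python sketch below reproduces their qualitative outcome, a positive integrated weight change for T = +10 ms and a negative one for T = −10 ms; note that we read ĝ_N in equation 2.12 as the full normalized conductance of equation 2.2, including the Mg²⁺ term, which is one possible reading, and the spike timing t_pre is our own choice:

```python
import math

# Parameter values from section 2 (eqs. 2.2-2.4) and section 3.1
TAU1, TAU2 = 40.0, 0.33           # ms, NMDA time constants
ETA, MG, GAMMA = 0.33, 1.0, 0.06  # Mg2+ block parameters
G_N, G_BP = 4.0, 59.8             # nS, peak conductances (NMDA, BP spike)
R, C, V_REST = 100.0, 50.0, -70.0 # MOhm, pF, mV

def g_hat_n(t, V):
    """Normalized NMDA conductance (our reading includes the Mg2+ term of eq. 2.2)."""
    if t < 0.0:
        return 0.0
    return ((math.exp(-t / TAU1) - math.exp(-t / TAU2))
            / (1.0 + ETA * MG * math.exp(-GAMMA * V)))

def g_bp(t, tau_rise=1.0, tau_fall=10.0, tau_bp=25.0):
    """BP-spike conductance, eq. 2.4, with the Figure 4A parameters."""
    if t < 0.0:
        return 0.0
    return G_BP * (1.0 / (1.0 + math.exp(-t / tau_rise))
                   - 0.5 / (1.0 + math.exp(-(t - tau_bp) / tau_fall)) - 0.5)

def integrated_weight_change(T, rho0=1.0, rho1=0.5, mu=1.0, dt=0.1, t_end=250.0):
    """Eqs. 2.12-2.13 for one pulse pair: pre spike at t_pre, BP spike at t_pre + T."""
    t_pre = 50.0
    V, d_rho = V_REST, 0.0
    for i in range(int(t_end / dt)):
        t = i * dt
        g_n = g_hat_n(t - t_pre, V)
        i_total = (rho1 * G_N * g_n * (0.0 - V)                # plastic NMDA synapse
                   + rho0 * g_bp(t - (t_pre + T)) * (0.0 - V)  # depolarization source
                   + 1000.0 * (V_REST - V) / R)                # leak, in pA
        v_new = V + dt * i_total / C                           # Euler step of eq. 2.3
        d_rho += mu * g_n * (v_new - V) / 1000.0               # mu carries units of 1/V
        V = v_new
    return d_rho
```

The sign of the result follows the mechanism described in the text: for T > 0 the fast depolarizing flank of the BP spike coincides with a large ĝ_N, while for T < 0 the NMDA conductance is mostly scaled against the repolarizing (negative) derivative of the membrane potential.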
Figure 4: Weight change curves obtained at different resting potentials. (A) Using a BP spike as depolarization source. The BP spike is modeled with equation 2.4 and parameters τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 25 ms, ḡ_BP = 59.8 nS, ρ₀ = 1. (B) Using a second NMDA synapse as the depolarization source, ρ₀ = 10.
3.2 Influence of the Shape of the BP Spike. Figure 5 shows 17 weight change curves and the BP spikes with which they were obtained. Note that some of these spike shapes do not necessarily reflect realistic BP spike shapes. Instead, this modeling exercise is meant to cover a rather complete range of composable shapes, such that the characteristics of a weight change curve resulting from any other BP spike shape can be inferred from this diagram. In general, we observe that the negative part of the weight change curve dominates in most cases across all panels, which is in accordance with physiology (Debanne, Gähwiler, & Thompson, 1998; Feldman, 2000).

Figure 5: Facing page. Weight change curves obtained with different BP spikes as the depolarization source. Top panels show the weight change curves and bottom panels the BP spikes with which they were obtained. BP spikes were modeled using equation 2.4, adjusted to the same amplitude. A and B also contain one example obtained with a BP spike with intermediate shape modeled with equation 2.5. This spike starts with the shape of the first BP spike in B and ends with the shape of the last spike. In all cases we used ρ₀ = 1. Individual parameters for the different BP spikes were: A, B: τ_rise = 1 ms: (τ_fall = 1 ms, τ_BP = 9 ms, ḡ_BP = 56 nS); (τ_fall = 5 ms, τ_BP = 15 ms, ḡ_BP = 61.4 nS); (τ_fall = 10 ms, τ_BP = 25 ms, ḡ_BP = 59.8 nS); (τ_fall = 20 ms, τ_BP = 30 ms, ḡ_BP = 67.5 nS). C, D: τ_rise = 5 ms: (τ_fall = 1 ms, τ_BP = 20 ms, ḡ_BP = 56 nS); (τ_fall = 5 ms, τ_BP = 25 ms, ḡ_BP = 64 nS); (τ_fall = 10 ms, τ_BP = 35 ms, ḡ_BP = 63 nS); (τ_fall = 20 ms, τ_BP = 45 ms, ḡ_BP = 67.5 nS). E, F: τ_rise = 10 ms: (τ_fall = 1 ms, τ_BP = 35 ms, ḡ_BP = 56 nS); (τ_fall = 5 ms, τ_BP = 40 ms, ḡ_BP = 62 nS); (τ_fall = 10 ms, τ_BP = 50 ms, ḡ_BP = 63.5 nS); (τ_fall = 20 ms, τ_BP = 60 ms, ḡ_BP = 69 nS). G, H: τ_rise = 20 ms: (τ_fall = 1 ms, τ_BP = 60 ms, ḡ_BP = 57.5 nS); (τ_fall = 5 ms, τ_BP = 70 ms, ḡ_BP = 60 nS); (τ_fall = 10 ms, τ_BP = 80 ms, ḡ_BP = 62 nS); (τ_fall = 20 ms, τ_BP = 90 ms, ḡ_BP = 68 nS). Parameters for the BP spike with intermediate shape in A and B are τ_a = 7 ms, τ_b = 30 ms, τ_c = 4 ms, f = 0.9, ḡ_BP = 10.9 nS.
By comparing the curves within each panel, it can be seen that increasing fall times (τ_fall) of the BP spike mainly lead to an increase of the positive peak of the weight change curve, while the negative peak becomes smaller but more spread out toward negative values of T. By comparing curves across panels, one can assess the influence of increasing rise times (τ_rise). Here we observe that the typical STDP shape of the curves in Figure 5A (zero crossing at about T = 0 ms) becomes similar to standard Hebbian learning for values of T > −20 ms for a rather shallow rise time τ_rise = 20 ms of the BP spikes in Figure 5G. Only for T < −20 ms are negative weight changes again obtained. This effect becomes more pronounced when increasing the rise times to values τ_rise > 20 ms. Such shallow rise times may indeed occur at distal dendrites where, discounting possible active processes, the membrane capacitance has smeared out a BP spike substantially (Magee & Johnston, 1997; Larkum, Zhu, & Sakmann, 2001). When cells are driven, for example, by a stimulus, pre- and postsynaptic spikes will follow each other, often at intervals of less than 20 ms (Froemke & Dan, 2002). Thus, the values of a weight change curve for T beyond ±20 ms are probably often not of relevance for a cell's synaptic plasticity (Froemke & Dan, 2002). Therefore, the shape-dependent leftward shift of the weight change curve, leading to LTP within rather large temporal intervals, could be of some theoretical interest, because it shows that we do not have to alter the learning rule in order to get either differential Hebbian learning or a characteristic that is similar to standard Hebbian learning at realistic interspike intervals. A changing input characteristic will do the trick already. Note that in general, the plastic NMDA synapse will contribute almost nothing to the membrane potential change as compared to the strong influence of the BP spike.

Thus, in the specific case of Figures 5G and 5H, the derivative of the membrane potential will remain positive for rather long durations as soon as the rise time is large. This leads to positive weight changes also for large negative T values. The one example of a BP spike with intermediate shape (see Figures 5A and 5B) shows, quite expectedly, that gradual spike shape transitions will also lead to gradual transitions of the shape of the weight change curves. This supports the notion that other shapes of weight change curves can basically be inferred from these examples.

3.3 Influence of Different NMDA Characteristics. It is known that during development, the relative frequency of different NMDA receptor types (NMDAR_A versus NMDAR_B) changes. This influences the electrophysiological properties of the NMDA channel. Figures 6A and 6B show three different NMDA characteristics, the steepest reflecting an adult stage. The other two stages are observed during development at postnatal days 26 to 29 (τ_decay = 380 ms) and 37 to 38 (τ_decay = 189 ms) in ferret at a +40 mV voltage
Figure 6: Learning curves obtained with different NMDA characteristics. (A) EPSC, (B) conductance of the NMDA synapse, both at +40 mV voltage clamp. (C) Weight change curves. Parameters for equation 2.2 were as follows. Normal (adult) NMDA: ḡ_N = 4 nS, τ₁ = 40 ms, τ₂ = 0.33 ms, which gives an EPSC with τ_decay = 41.7 ms. Young NMDA, before eye opening (26–29 days): ḡ_N = 4.02 nS, τ₁ = 363.1 ms, τ₂ = 0.033 ms, which gives an EPSC with τ_decay = 380 ms. Older NMDA, after eye opening (37–38 days): ḡ_N = 4.2 nS, τ₁ = 173.3 ms, τ₂ = 0.033 ms, which gives an EPSC with τ_decay = 189 ms. The BP spike was modeled according to equation 2.4 with parameters τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 2.5 ms, ḡ_BP = 59.8 nS, ρ₀ = 1.
clamp preparation, where there is no more Mg²⁺ blockage of the NMDA channel. The single decay values for τ_decay were taken from Roberts and Ramoa (1999), but we still modeled the NMDA characteristic using equation 2.2 by fitting our two τ-values to yield the curves reported by Roberts and Ramoa. To obtain the weight change curves, we used a BP spike with a short rise time (1 ms) and a medium fall time (10 ms; compare Figure 5B). Interestingly, we observe that both "young" NMDA synapses yield rather asymmetrical weight change curves with a strongly dominating LTD part. To our knowledge, so far very little is known about the actual physiological learning characteristics of early synapses. There are, however, indications that synaptic elimination dominates the early developmental stages (analyzed in a theoretical study by Chechik, Meilijson, & Ruppin, 1998). The
A. Saudargiene, B. Porr, and F. Wörgötter
theoretical results obtained with our learning rule would possibly point in this direction.

3.4 Multiple Spike Pairs. Figures 7A and 7B show how weights develop when using a sequence of 30 pulse pairs with interpair intervals of 100 ms, which still guarantees the adiabatic condition for interspike intervals below ±50 ms (see Figure 7C). Different learning rates μ ≥ 100 and varying amplitudes of the BP spike were applied to obtain Figures 7A and 7B. These high learning rates were used in order to be able to use only a few pulse pairs for measuring the whole curve. We find two different types of behavior. Curves 1 to 6 show a gradual increase with almost unchanging slope (until saturation, curves 5 and 6); curves 7 to 9 show weak growth until a certain point, from which they grow much faster. Curves with gradual, unchanging growth (1–6) are obtained as soon as the amplitude of the BP spike is large. To obtain them, we have kept the same BP spike amplitude and increased the learning rate. This way, weight growth can be adjusted to different values to match it to the physiologically obtained percentage weight changes if desired (Bi & Poo, 2001). Curves 7 to 9 were obtained with (very) small amplitudes of the BP spike. We observe that the left part of the curves still shows a very shallow increase, followed by a kink (or bend), which continues into a second, steeper part of the curve, followed by saturation in curves 8 and 9. These differences can be explained by looking at how the membrane potential develops over time for the two different cases, as shown in Figure 7B. Each depolarization represents the response of the model to a pulse pair (T = 25 ms). Most of the time, two peaks are seen in the potential; the

Figure 7: Facing page. (A) Progress of the weight growth for a plastic NMDA synapse with initial value ρ1 = 0.5 and BP spikes of different amplitudes. We used T = 5 ms as the temporal interval between pulses.
Different slopes of the curves were obtained with different learning rates μ, which were, for curves 1–6, increasing as μ = 100, 200, 300, 500, 1000, 3400; curve 7, μ = 1000; curves 8 and 9, μ = 3400. The BP spike was modeled according to equation 2.4 for all curves with parameters ρ0 = 1, τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 25 ms; for curves 1–6, ḡ_BP = 59.8 nS; for curves 7 and 9, ḡ_BP = 4 nS; for curve 8, ḡ_BP = 0.4 nS. The resulting BP spike amplitude Vmax was, for curves 1–6, Vmax = −20 mV; curves 7 and 9, Vmax = −61 mV; curve 8, Vmax = −69 mV. Interpair interval 100 ms. (B) Membrane potential traces comparing the cases with and without steep increase. Here we used T = 25 ms to make the individual contributions of both pulses more visible. The top panel corresponds to curve 9, the bottom panel to curve 3. Interpair interval 100 ms. (C) Weight change curves obtained with multiple spike pairs at different interpair intervals of 100, 50, and 25 ms. Parameters were: μ = 1, ρ0 = 1, τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 25 ms, ḡ_BP = 59.8 nS.
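The two growth regimes (curves 1–6 versus 7–9) can be caricatured by a toy iteration in which each pulse pair adds a fixed BP-spike-driven increment plus a self-amplifying term standing in for the e^(−γV) positive feedback of the plastic synapse. The quadratic feedback, all coefficients, and the hard saturation below are illustrative assumptions, not the model's actual dynamics:

```python
import numpy as np

def weight_trajectory(n_pairs, mu, bp_amp, fb=0.15, rho0=0.5, rho_max=2.0):
    """Toy per-pair update: a constant drive from the BP spike (bp_amp)
    plus a hypothetical self-feedback term (fb * rho^2) from the plastic
    synapse's own depolarization; saturation is imposed by clipping."""
    rho = rho0
    traj = [rho]
    for _ in range(n_pairs):
        drive = bp_amp + fb * rho ** 2
        rho = min(rho + mu * drive, rho_max)
        traj.append(rho)
    return np.array(traj)

strong_bp = weight_trajectory(30, mu=0.02, bp_amp=1.0)   # like curves 1-6: near-linear
weak_bp = weight_trajectory(30, mu=0.5, bp_amp=0.02)     # like curves 7-9: kink, then fast growth
```

With a large BP-spike term the per-pair increment is nearly constant (gradual, unchanging slope), while with a small one the self-feedback term eventually dominates, producing the shallow start, the kink, and the run into saturation.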
first comes from the plastic synapse and the second from the BP spike. The bottom trace represents a case where the amplitude of the BP spike was relatively large (as for curves 1–6). As a consequence, the general shape of the potential is very strongly dominated by the BP spike; the plastic synapse does not contribute much to it. In the top trace, we have used a very small BP spike amplitude (as for curves 7–9). This time, the potential is dominated
by the growing plastic synapse. Given that this is a pure NMDA synapse, we obtain positive feedback through the e^(−γV) term in equation 2.2. Hence, we get a very steep increase as soon as this synapse gets "overly" strong. As a result, we get a slope in curves 8 and 9 that is the same as that in curve 6, which was obtained with a large BP spike and the same high learning rate (μ = 3400). For the same reason, the slopes of curves 5 and 7 at the right side of the diagram are also similar. Note that if we use the small learning rate of μ = 1 that we normally applied, we still expect the same basic behavior, but only after many more input pairs and with a much shallower nonlinear increase as soon as the positive feedback sets in. It is not conceivable that the situation shown in curves 7 to 9 directly corresponds to physiology, because one rarely finds pure NMDA synapses, which at the same time would have to grow very strongly before being able to elicit this effect. However, a cluster of mixed AMPA and NMDA plastic synapses at the peripheral dendrite (or at a spine), where the BP spike might be small, could indeed lead to local nonlinear potential changes resulting in such effects. Figure 7C shows weight change curves obtained with different interpair intervals, as indicated in the figure. This shows that the adiabatic condition is violated as soon as interspike intervals T approach the interpair interval. As a consequence, secondary LTP (or LTD) regions appear, as expected from the results of Froemke and Dan (2002). Bi (2002) discusses various cases of spike pair combinations that could lead to such results; explicit experimental proof of this model prediction, however, is still lacking at the moment.
3.5 Weight Change Curves Obtained with a Second Synapse as Depolarization Source. It is known that the postsynaptic depolarization signal is needed in order to remove the Mg²⁺ block at the NMDA channels, without which no Ca²⁺ could enter the cell. A BP spike provides a very strong source of depolarization. More locally, however, other sources of postsynaptic depolarization can also exist, especially when considering clusters of synapses. Here, "any other" synapse could lead to a local depolarization affecting the plastic NMDA synapse under consideration. This would lead to heterosynaptic plasticity, and Goldberg, Holthoff, and Yuste (2002) discuss the biophysical implications of this alternative. Accordingly, Figure 8 shows two groups of curves obtained with a plastic NMDA synapse and different synaptic depolarization sources. Depolarization comes from a cluster of AMPA synapses for the A curves and a cluster of NMDA synapses for the B curves. Both types of curves, A and B, are antisymmetrical, but the generic shape differs. Curves A have a skewed asymmetry and a slight positive offset, while curves B possess almost equal LTD and LTP parts and are shifted above zero for weak depolarization.
Figure 8: Weight change curves obtained with depolarization sources of different amplitudes ρ0. In general, an increase in depolarization source amplitude leads to an increase of the amplitude of the weight change curves. Insets show the shape of the two cut-off curves pointed to by the arrows at a reduced magnification. (A) Using an NMDA synapse as the depolarization source with ρ0 = 0.25; 0.5; 1; 2; 5; 10. (B) Using an AMPA synapse with ρ0 = 0.25; 0.5; 1; 2; 5; 10.
Nothing is known about STDP at synaptic clusters, and our results show that plasticity may have a different form when depolarization is caused by synchronized synapses and not by a BP spike. The different curves in Figures 8A and 8B are obtained by setting ρ0 between 0.25 and 10 and thus assuming different relative strengths of the depolarization source.² In all cases, we observed that the strength of the depolarization source acts as an amplification factor, leaving the shape of the curves essentially unchanged. As noted above, pure amplification is meaningless in the context of these simulations because this can be achieved by a changed learning rate μ or a larger number of paired pulses as well. Interestingly, we found that a changing depolarization source strength not only affects amplification but also induces a shift of the curves with respect to zero. The smallest curves, which were obtained with a weak depolarization source, remain above zero all the time. Thus, in spite of their realistic-looking shape, these curves do not represent STDP. For larger values of DS (ρ0 ≥ 0.5), a zero crossing is observed, and only for the largest curve (DS: ρ0 = 10) does the negative part cover more area than the positive part, which seems to be the generic case for most STDP curves. Furthermore, weight changes are about tenfold smaller compared to the cases above, where we had used a BP spike as the depolarization source

² In principle, it would be possible to keep the relative strength of DS equal to 0.5 and vary the strength of PS. This, however, is essentially symmetrical to the experiments shown, because only the quotient between the strengths of PS and DS determines the shape of the resulting STDP curves. This holds apart from minor effects due to the recurrent influence of the NMDA channel.
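The amplification part of this observation is easy to see in a linearized sketch: because a rule of the form dρ/dt = μ ĝ V′ is linear in the membrane potential, scaling the depolarization source scales the whole weight change curve without changing its shape. The zero-offset shift reported above requires the e^(−γV) nonlinearity of the NMDA conductance, which this sketch deliberately omits; all waveforms are illustrative stand-ins:

```python
import numpy as np

def pulse(t, tau_fast, tau_slow):
    """Normalized difference of exponentials; zero for t <= 0."""
    tt = np.maximum(t, 0.0)
    y = np.exp(-tt / tau_slow) - np.exp(-tt / tau_fast)
    t_pk = np.log(tau_slow / tau_fast) / (1.0 / tau_fast - 1.0 / tau_slow)
    return y / (np.exp(-t_pk / tau_slow) - np.exp(-t_pk / tau_fast))

dt = 0.1
t = np.arange(0.0, 400.0, dt)                  # ms
g = pulse(t, 0.33, 40.0)                       # presynaptic NMDA-like trace

def curve(rho0, Ts):
    """Weight change curve for a depolarization source of gain rho0."""
    out = []
    for T in Ts:
        v = rho0 * pulse(t - T, 1.0, 10.0)     # synaptic depolarization at t = T
        out.append(np.sum(g * np.gradient(v, dt)) * dt)
    return np.array(out)

Ts = np.arange(-40.0, 41.0, 5.0)
c1, c4 = curve(1.0, Ts), curve(4.0, Ts)        # rho0 = 1 versus rho0 = 4
```

In this linearized setting c4 is exactly 4 × c1, so any change of curve shape or offset seen in the full model must come from the nonlinear voltage dependence of the channel.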
(except for the case ρ0 = 10). This is due to the much stronger change in membrane potential arising from a BP spike. As a consequence, a tonic depolarization of the membrane potential will lead to a stronger amplification of the weight change curves when using a second synapse as the depolarization source (see Figure 4B) as compared to the situation where we had used a BP spike.

3.6 Mixed NMDA/AMPA Plastic Synapses. So far we have looked only at pure NMDA synapses as the plastic synapse. This is not in accordance with physiology, because most synapses that contain NMDA components are mixed NMDA and AMPA synapses (as depicted in Figure 1A), moreover normally with a dominating AMPA part (Malenka & Nicoll, 1999). In addition, one finds that during learning, the AMPA component of such synapses undergoes much stronger changes than the NMDA component (Lüscher & Frerking, 2001; Song & Huganir, 2002). This happens in spite of the fact that it is the NMDA component that is the driving force of the learning. This last observation, however, justifies formulating the learning rule used here with only the normalized NMDA conductance function ĝ_N in equations 2.11 and 2.12. Thus, the AMPA component cannot directly enter learning in equation 2.12; it will, however, influence the membrane potential V and thereby exert two effects: (1) it influences the e^(−γV) term in the NMDA conductance function (see equation 2.2), and (2) it directly influences the derivative of the membrane potential. Both effects could change learning. However, at this stage of the analysis, we have arrived at the conclusion that the BP spike is in most cases the strongest and most influential depolarization source, which is in accordance with others (reviewed in Linden, 1999, but see Goldberg et al., 2002).
Above, we have argued that the depolarization that comes from a BP spike is normally so much stronger than that arising from the plastic synapse itself that the influence of the plastic synapse on the membrane potential can be neglected. This still holds for mixed AMPA and NMDA synapses. A realistic single excitatory postsynaptic potential (EPSP) is normally rather small. Thus, as long as the BP spike is still strong, a single EPSP will not substantially influence learning. As soon as the BP spike drops in amplitude, one would have to consider the effect of a mixed plastic synapse. In this case, the learning curve will gradually approach the shape shown in Figure 8B, where we have assumed a pure AMPA component as the depolarization source, and we will get nonlinear learning behavior such as that observed in Figure 7, curves 7 to 9. The wide range of possible effects that might arise from this, however, exceeds the scope of this article.

4 Discussion

This study consists essentially of two parts. The first part is the association between the ISO learning model and its biophysical counterpart, and the
second part concerns the actual findings obtained from the new model. We discuss both parts consecutively. At the end of this section, we compare our model to others found in the literature.

4.1 Discussing the Model Assumptions. ISO learning is a typical artificial neural network algorithm, and it is therefore rather unrelated to biophysics. As a consequence, any kind of biophysical reevaluation cannot immediately go all the way down to individual channel and calcium dynamics. Instead, in this study we have attempted to go one step in this direction by adapting the ISO learning algorithm to a traditional state-variable description of a neuronal (compartmental) model. One central assumption of ISO learning is the filtering of its inputs. In a neuron, low-pass filtering takes place in a very natural way as a consequence of the NMDA channel properties, as well as of the low-pass characteristics of natural membranes. Obviously, these low-pass filters are not identical to the technical bandpass filters used in ISO learning, but the deviations are small enough to yield the same basic STDP-like learning behavior. ISO learning was designed to correlate two inputs with each other in time (e.g., in a temporal sequence learning paradigm). STDP, however, takes place in relation to the temporal structure between a neuron's input (presynaptic) and its output (postsynaptic). In spite of this apparently different algorithmic structure, adaptation of both models is still straightforward when realizing that normally a postsynaptic spike has been elicited by some presynaptic influence. This justifies our approach of assuming either a second (cluster of) synchronized synapse(s), a dendritic spike, or a BP spike as a possible depolarization source (Linden, 1999; Goldberg et al., 2002). The learning rule consists of two components. The second term is given by the derivative of the membrane potential.
In most cases, the membrane potential is strongly dominated by the shape of the BP spike at the moment of pairing, while the contribution of the plastic synapse (or other synapses) can be neglected. This makes V′ a postsynaptic quantity. Given that V′ = I/C, we note that the learning rule can also be rewritten as dρ/dt = (μ/C) ĝ dQ/dt. This shows that the charge transfer dQ/dt across the (postsynaptic) membrane is a major driving force of learning. We can assume that a part of dQ/dt is contributed by calcium flow (Malenka & Nicoll, 1999). Then, after integration, the final weight change Δρ is determined by part of Q, the total amount of calcium ions that crossed the membrane. This interpretation is valid as long as the calcium contributes an approximately fixed part to the total current. The model does not take into account more complex calcium dynamics, buffering, enzymatic reactions, or other processes that take place during physiological weight changes. This was clearly not intended at this level of model complexity. As the first term of the learning rule, we have used the normalized NMDA conductance function ĝ_N, which represents the bandpass filtered input u1
of ISO learning's response to a δ-pulse input. We would argue that ĝ_N essentially subsumes the time course of all processes that occur for an NMDA receptor outside or directly at the membrane: all presynaptic events, for example, glutamate release, binding to the receptors, unbinding, and elimination from the synaptic cleft. The efficiency with which this happens is encoded in the scaling factor ḡ_N. Thus, our learning rule uses a product of a presynaptic (ĝ) and a postsynaptic (V′) influence. From this, it is now clear that the association of V′ with calcium flow must be restricted to that proportion of the calcium that travels through the NMDA channels. Voltage-gated calcium channels, calcium-induced calcium release, and other calcium buffer release mechanisms, which could potentially influence synaptic changes, are not considered in this model. There is, however, wide-ranging support (especially for spines) that synaptic plasticity is indeed strongly dominated by calcium transfer through NMDA channels (Schiller, Schiller, & Clapham, 1998; Yuste, Majewska, Cash, & Denk, 1999; Malenka & Nicoll, 1999) and that the other calcium release mechanisms may play only a minor role (but see, e.g., Huemmeke, Eysel, & Mittmann, 2002). Note that in this study, we do not implement any mechanisms of short-term plasticity (Markram & Tsodyks, 1996; Fortune & Rose, 2001). In principle, this could be done using a fast model for short-term plasticity (Giugliano, Bove, & Grattarola, 1999) as a front end that continuously adjusts the base value of ρ as soon as an input spike train arrives. Here we are also faced with another problem. Any input spike train that fires the cell will lead to complex "pre-post-pre-etc." combinations (Froemke & Dan, 2002), discussed also in Bi (2002). Our model can cope with these effects too, and we obtain additional transitions from LTP to LTD or vice versa depending on the pre-post sequence, as shown in Figure 7C.
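Such a front end could take the form of a standard Tsodyks-Markram-style depression update (the parameters U and τ_rec below are illustrative, not taken from Giugliano et al., 1999): each presynaptic spike transmits with the currently recovered resource fraction and then depletes it, so the effective base value of ρ falls during a burst and recovers between bursts:

```python
import numpy as np

def depression_front_end(spike_times, rho_base, U=0.5, tau_rec=300.0):
    """For each presynaptic spike, return the effective strength
    rho_base * x, where x is the recovered resource fraction. A spike
    uses a fraction U of x; x relaxes back toward 1 with time constant
    tau_rec (ms) between spikes."""
    x = 1.0
    last_t = None
    eff = []
    for t in spike_times:
        if last_t is not None:
            dt = t - last_t
            x = 1.0 - (1.0 - x) * np.exp(-dt / tau_rec)  # recovery since last spike
        eff.append(rho_base * x)
        x *= (1.0 - U)                                   # resource use at this spike
        last_t = t
    return np.array(eff)

train = np.arange(0.0, 200.0, 20.0)          # a 50 Hz burst of 10 spikes
eff = depression_front_end(train, rho_base=1.0)
```

During the burst, eff decreases monotonically toward a steady-state value, which is exactly the kind of continuous adjustment of the base weight that such a front end would feed into the learning rule.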
4.2 Discussion of the Findings. We believe that three of our findings could be of longer-lasting relevance for the understanding of synaptic learning, provided they withstand physiological scrutiny: (1) the shape of the weight change curves relies heavily on the shape of the input functions (see Figures 5 and 8); (2) differential Hebbian learning (STDP) can become similar to standard Hebbian learning (LTP) if the postsynaptic depolarization (i.e., the BP spike) rises shallowly (see Figures 5A and 5B versus Figures 5F and 5G); and (3) weight growth can strongly change its characteristic in the course of learning if the membrane potential is locally dominated by the potential change arising from the plastic synapse itself (see Figure 7).

4.2.1 Finding 1. Physiological studies suggest that weight change curves can indeed have a widely varying shape (reviewed in Abbott & Nelson, 2000, and Roberts & Bell, 2002). In our study, both the NMDA characteristic and the characteristic of the depolarization source influence the shape of the weight change curve. The NMDA characteristic changes during
development, and in the ferret "young" NMDA channels are even slower than those found in adult animals (Roberts & Ramoa, 1999). We find that for "young" synapses, LTD strongly dominates (see Figure 6). Functionally, this would make sense in helping to stabilize (Song, Miller, & Abbott, 2000; Rubin, Lee, & Sompolinsky, 2001; Kempter, Gerstner, & van Hemmen, 2001) an immature network, where false "inverse-causal" correlations could still be frequent. The shape of the membrane potential locally at the synapse is also a source of differences in the shapes of the weight change curves. The physiological properties and morphology of dendritic trees will lead to locally different active back-propagation or passive attenuation of the BP spike (Magee & Johnston, 1997; Larkum et al., 2001). It has recently been shown that synaptic plasticity in distal dendrites may be triggered by local Na⁺- and/or Ca²⁺-mediated dendritic spikes (Golding et al., 2002), which are usually slower than BP spikes (Stuart, Spruston, Sakmann, & Häusser, 1997; Schiller, Schiller, Stuart, & Sakmann, 1997; Häusser, Spruston, & Stuart, 2000; Larkum et al., 2001). As a consequence of these different shapes of the depolarizing potentials, the resulting weight change curves would differ as well. Many theoretical studies on STDP assume some kind of "generic" weight change curve that is applied regardless of the morphology of the neuron (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Song et al., 2000; Kempter, Leibold, Wagner, & van Hemmen, 2001; Rubin et al., 2001). Others assume a gradually changing shape following, for example, the length of a dendrite, without making specific assumptions about the parameters that lead to the different weight change curves (Panchev, Wermter, & Chen, 2002; Sterratt & van Ooyen, 2002). In our study, we argue that the shape of the inputs determines the shape of the weight change curves.
It is interesting to consider that, in this way, local dendritic and spine properties would lead to different learning characteristics. For structures that are strongly electrically decoupled, the temporal structure and possible synchronization of the different inputs would be more important than the causality of pre- and postsynaptic signals. Note, however, that the electrical (de-)coupling of spines is still a matter of debate (Koch, 1999; Kovalchuk, Eilers, Lisman, & Konnerth, 2000; Sabatini, Maravall, & Svoboda, 2001).

4.2.2 Finding 2. Several theoretical articles have shown that differential Hebb rules will lead to STDP-like behavior, while plain (undifferentiated) Hebb rules will lead to temporally undirected LTP (Roberts, 1999; Xie & Seung, 2000; Roberts & Bell, 2002). Here we find that our differential Hebb rule can lead to plain LTP within rather wide correlation windows T as soon as the rising flank of our BP spike is shallow. In this case, the product of ĝ and V′ remains positive for rather large negative temporal shifts of the postsynaptic signal. This finding indicates that at this model level, it is not necessary to assume fundamentally different mechanisms for LTP or STDP.
Yet from physiology, it is known that there are different intracellular processes involved in generating LTD or LTP. Furthermore, LTD is supposed to arise from low calcium influx, while LTP arises as soon as the calcium current is high (Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). In an extended version, our model could be made compatible with these two aspects, because the duration of the depolarization (hence V′) and its temporal location with respect to the NMDA signal will determine how much calcium can flow, and this is different for the different values of T. One could implement two processes that are differently susceptible to these different calcium levels and thereby extend the model accordingly. The implications of being able to change an STDP to an LTP characteristic (or back) depending on the shape of the membrane potential are interesting from a theoretical point of view. Hebbian learning is usually associated with the extraction and condensation of relevant signals (Infomax principle, principal component analysis; Oja, 1982; Linsker, 1988), while STDP addresses the causality of synaptic events. It seems that the same substrate would support both principles and, in a similar way as discussed above, it could be the location at the dendrite that determines what kind of behavior is found. Possibly distal dendrites, where the potential changes can be shallow (Magee & Johnston, 1997; Larkum et al., 2001), discounting possible active processes, would experience LTP, and proximal dendrites STDP. This, too, can be a matter of further theoretical and experimental investigation.

4.2.3 Finding 3. We found that weight growth can be linear or can contain a nonlinear transition when the positive feedback of the NMDA characteristic is dominant. This could again happen only at electrically more strongly decoupled parts of the membrane, such as spines.
There, small currents elicited by an active synapse will lead to large potential changes, which are required for this positive feedback effect. Such local potential changes can no longer be measured behind the spine neck because of its high resistance (Sabatini et al., 2001). In this case, the potential (change) that is the driving force of the calcium current would be strongly influenced by the local structure of the synaptic density, and correlation-based learning will take place between local inputs independent of the cell's soma (i.e., regardless of the firing of the cell).

4.3 Comparison to Other Models. A wide variety of models for STDP have been designed; they can roughly be subdivided into two groups of different biophysical complexity. Some of them are spike based and others rate based. Group 1 could be called abstract models. They assume a certain shape for a weight change curve as the learning rule (Gerstner et al., 1996; Song et al., 2000; Rubin et al., 2001; Kempter, Leibold et al., 2001; Gerstner & Kistler, 2002) that remains unchanged across the local properties of the cell. Thus, these studies cannot discuss cellular properties but focus on network effects
instead. One interesting finding obtained with these models was that STDP leads to self-stabilization of weights and rates in the network as soon as the LTD side dominates over the LTP side in the weight change curve (Song et al., 2000; Kempter, Gerstner, et al., 2001). In addition, it was found that such networks can store patterns (Abbott & Blum, 1996; Seung, 1998; Abbott & Song, 1999; Rao & Sejnowski, 2000; Fusi, 2002). More recently, these models have also been successfully applied to generate (i.e., to develop) some physiological properties such as map structures (Song & Abbott, 2001; Kempter, Leibold et al., 2001; Leibold & van Hemmen, 2002), direction selectivity (Buchs & Senn, 2002; Senn & Buchs, 2003), or temporal receptive fields (Leibold & van Hemmen, 2001). The biophysical realism of the learning rules used (really: weight change curves), however, must remain limited and cannot capture the wide variety of experimentally measured curves. Group 2 could be called state variable models, to which we count our approach. Such models can adopt a rather descriptive approach (Abarbanel, Huerta, & Rabinovich, 2002), where appropriate functions are fit to the measured weight change curves. Others are closer to kinetic models in trying to fit phenomenological kinetic equations (Senn et al., 2000; Castellani et al., 2001; Karmarkar & Buonomano, 2002; Karmarkar, Najarian, & Buonomano, 2002; Shouval et al., 2002). Our approach tries to associate the functions used more closely with biophysics than that of Abarbanel et al. (2002), but, unlike some of the other models, we have not tried to fit any kinetic equations, because model complexity increases substantially when doing so. As a consequence, our model is most closely related to the study of Rao and Sejnowski (2001). These authors used a variant of the TD learning rule to implement STDP.
Dayan (2002) clarifies this issue and argues that the rule of Rao and Sejnowski (2001) is rather a temporal difference rule between output activity values and not between prediction values as in the traditional TD rule. As a consequence, their rule is strongly related to our approach, and they also observe that the shape of the BP spike will influence the weight change curve. We have, however, replaced the 10 ms discretization used by Rao and Sejnowski (2001) for calculating the temporal difference with a real derivative (using 0.1 ms steps), and in our model the presynaptic activity is modeled as a conductance. This recently allowed us to solve the integral equation for the weight change (see equation 2.10) analytically for a slightly simplified set of conductance functions (Porr, Saudargiene, & Wörgötter, 2004). Some of the other models implement a rather high degree of biophysical detail, including calcium, transmitter, and enzyme kinetics (Senn et al., 2000; Castellani et al., 2001). The power of such models lies in the chance to understand and predict intra- or subcellular mechanisms, for example, the aspect of AMPA receptor phosphorylation (Castellani et al., 2001), which is known to centrally influence synaptic strength (Malenka & Nicoll, 1999; Lüscher & Frerking, 2001; Song & Huganir, 2002). The approaches of Shouval et al. (2002) as well as of Karmarkar and
co-workers (Karmarkar & Buonomano, 2002; Karmarkar et al., 2002) are somewhat less detailed. Both models investigate the effects of different calcium concentration levels by assuming certain (e.g., exponential) functional characteristics to govern their changes. This allows them to address the question of how different calcium levels will lead to LTD or LTP (Nishiyama et al., 2000), and one of the models (Karmarkar & Buonomano, 2002) proposes to employ two different coincidence detector mechanisms to this end. An interesting aspect of our study and that of Rao and Sejnowski (2001) is that these models require only a single coincidence detector, because with a differential Hebbian learning rule it is essentially the gradient of the Ca²⁺ concentration that drives the learning, and not the absolute Ca²⁺ concentration (see Bi, 2002, for a detailed discussion of the gradient versus concentration model). Both model types (Shouval et al., 2002; Karmarkar & Buonomano, 2002; Karmarkar et al., 2002) were designed to produce a zero crossing (transition between LTD and LTP) at T = 0, which is not always the case in the measured weight change curves, which show transitions between more LTP- and more LTD-dominated shapes depending on the cell type and the stimulation protocol (Roberts & Bell, 2002). The differential Hebb rule we employed leads to the observed results as a consequence of the fact that the derivative of any generic (unimodal) postsynaptic membrane signal (like a BP spike) will lead to a bimodal curve. The relative temporal location of the presynaptic depolarization signal with respect to the positive (or negative) hump of this bimodal curve will then determine whether the convolution product is positive (weight growth) or negative (weight shrinkage). The model of Shouval et al. (2002) implicitly also assumes such a differential Hebbian characteristic through the bimodal shape of their Ω-function, which they use to capture the calcium influence.
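The unimodal-to-bimodal argument is easy to verify numerically. The sketch below uses an alpha function as a stand-in for the postsynaptic pulse (any smooth unimodal waveform behaves the same way): its derivative has one positive lobe before the peak and one negative lobe after it, which is exactly what produces the two-signed weight change curve:

```python
import numpy as np

dt = 0.1
t = np.arange(0.0, 100.0, dt)            # ms
tau = 10.0
v = (t / tau) * np.exp(1.0 - t / tau)    # unimodal alpha function, peak 1 at t = tau
dv = np.gradient(v, dt)                  # bimodal: positive lobe, then negative lobe
i_peak = int(np.argmax(v))
```

Shifting a presynaptic trace across dv then selects which lobe dominates the overlap integral, turning the sign of the weight change.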
This group also discussed, among other aspects, the role of the shape of the BP spike, and they concluded that a slow afterdepolarization potential (more commonly known as repolarization) must exist in order to generate STDP. This assumption is essentially similar to that of a slow fall time of the BP spike in our study. Thus, in their study too, the shape of the BP spike will influence the shape of the weight change curve. In general, they find that the LTP part of the curve is stronger than the LTD part. This observation would prevent self-stabilization of the activity in network models (Song et al., 2000; Kempter, Gerstner et al., 2001), which require a larger LTD part for achieving this effect. Interestingly, however, Shouval et al. (2002) find a second LTD part for larger positive values of T, which could perhaps be used to counteract such an activity amplification. In the hippocampus, there is conflicting evidence as to whether such a second LTD part exists for large T (Pike, Meredith, Olding, & Paulsen, 1999; Nishiyama et al., 2000).

Acknowledgments

We acknowledge support from SHEFC INCITE and EPSRC, GR/R74574/01. We are grateful to B. Graham, L. Smith, and D. Sterratt for their helpful comments on this manuscript.
References

Abarbanel, H. D. I., Huerta, R., & Rabinovich, M. I. (2002). Dynamical model of long-term synaptic plasticity. Proc. Natl. Acad. Sci. (USA), 99(15), 10132–10137.
Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cereb. Cortex, 6, 406–416.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neurosci., 3, 1178–1183.
Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.
Bennett, M. R. (2000). The concept of long term potentiation of transmission at synapses. Prog. Neurobiol., 60, 109–137.
Bi, G. Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. Cybern., 87, 319–332.
Bi, G.-Q., & Poo, M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annu. Rev. Neurosci., 24, 139–166.
Bliss, T. V., & Gardner-Medwin, A. R. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 357–374.
Bliss, T. V., & Lomo, T. (1970). Plasticity in a monosynaptic cortical pathway. J. Physiol. (Lond.), 207, 61P.
Bliss, T. V., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 331–356.
Buchs, N. J., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. J. Comput. Neurosci., 13, 167–186.
Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. Proc. Natl. Acad. Sci.
(USA), 98(22), 12772–12777. Chechik, G., Meilijson, I., & Ruppin, E. (1998). Synaptic pruning in development: A computational account. Neural Comp., 10(7), 1759–1777. Dayan, P. (2002). Matters temporal. Trends Cogn. Sci., 6(3), 105–106. Debanne, D., Gahwiler, B. T., & Thompson, S. H. (1994). Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CAI of the rat hippocampus in vitro. Proc. Natl. Acad. Sci. (USA), 91, 1148–1152. Debanne, D., Gahwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (Lond.), 507, 237–247. Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56. Fortune, E. S., & Rose, G. J. (2001). Short-term synaptic plasticity as a temporal lter. Trends Neurosci., 24, 381–385. Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modication induced by natural spike trains. Nature, 416, 433–438.
A. Saudargiene, B. Porr, and F. Wörgötter
Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biol. Cybern., 87, 459–470.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. M. (2002). Mathematical formulations of Hebbian learning. Biol. Cybern., 87, 404–415.
Giugliano, M., Bove, M., & Grattarola, M. (1999). Fast calculation of short-term depressing synaptic conductances. Neural Comp., 11, 1413–1426.
Goldberg, J., Holthoff, K., & Yuste, R. (2002). A problem with Hebb and local spikes. Trends Neurosci., 25(9), 433–435.
Golding, N. L., Staff, P. N., & Spruston, N. (2002). Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature, 418, 326–331.
Gormezano, I., Kehoe, E. J., & Marshall, B. S. (1983). Twenty years of classical conditioning research with the rabbit. In J. M. Sprague & A. N. Epstein (Eds.), Progress in psychobiology and physiological psychology (pp. 198–274). Orlando, FL: Academic Press.
Gustafsson, B., Wigström, H., Abraham, W. C., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780.
Häusser, M., Spruston, N., & Stuart, G. J. (2000). Diversity and dynamics of dendritic signaling. Science, 290, 739–744.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley.
Huemmeke, M., Eysel, U. T., & Mittmann, T. (2002). Metabotropic glutamate receptors mediate expression of LTP in slices of rat visual cortex. Eur. J. Neurosci., 15(10), 1641–1645.
Karmarkar, U. R., & Buonomano, D. V. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiol., 88, 507–513.
Karmarkar, U. R., Najarian, M. T., & Buonomano, D. V. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comp., 13, 2709–2741.
Kempter, R., Leibold, C., Wagner, H., & van Hemmen, J. L. (2001). Formation of temporal-feature maps by axonal propagation of synaptic learning. Proc. Natl. Acad. Sci. (USA), 98(7), 4166–4171.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Köhn, J., & Wörgötter, F. (1998). Employing the Z-transform to optimize the calculation of the synaptic conductance of NMDA and other synaptic channels in network simulations. Neural Comp., 10, 1639–1651.
Kovalchuk, Y., Eilers, J., Lisman, J., & Konnerth, A. (2000). NMDA receptor-mediated subthreshold Ca²⁺ signals in spines of hippocampal neurons. J. Neurosci., 20, 1791–1799.
Krukowski, A. E., & Miller, K. D. (2001). Thalamocortical NMDA conductances and intracortical inhibition can explain cortical temporal tuning. Nature Neurosci., 4, 424–430.
How the Shape of Pre- and Postsynaptic Signals Can Influence STDP
Larkum, M. E., Zhu, J. J., & Sakmann, B. (2001). Dendritic mechanisms underlying the coupling of the dendritic with the axonal action potential initiation zone of adult rat layer 5 pyramidal neurons. J. Physiol. (Lond.), 533, 447–466.
Leibold, C., & van Hemmen, J. L. (2001). Temporal receptive fields, spikes, and Hebbian delay selection. Neural Networks, 14(6–7), 805–813.
Leibold, C., & van Hemmen, J. L. (2002). Mapping time. Biol. Cybern., 87, 428–439.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neurosci., 8, 791–797.
Linden, D. J. (1999). The return of the spike: Postsynaptic action potentials and the induction of LTP and LTD. Neuron, 22, 661–666.
Linsker, R. (1988). Self-organisation in a perceptual network. Computer, 21(3), 105–117.
Lüscher, C., & Frerking, M. (2001). Restless AMPA receptors: Implications for synaptic transmission and plasticity. Trends Neurosci., 24(11), 665–670.
Mackintosh, N. J. (1974). The psychology of animal learning. Orlando, FL: Academic Press.
Mackintosh, N. J. (1983). Conditioning and associative learning. Oxford: Oxford University Press.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—a decade of progress? Science, 285, 1870–1874.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Martinez, J. L., & Derrick, B. E. (1996). Long-term potentiation and learning. Annu. Rev. Psychol., 47, 173–203.
Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M., & Kato, K. (2000). Calcium release from internal stores regulates polarity and input specificity of synaptic modification. Nature, 408, 584–588.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273.
Panchev, C., Wermter, S., & Chen, H. (2002). Spike-timing dependent competitive learning of integrate-and-fire neurons with active dendrites. In Lecture Notes in Computer Science: Proc. Int. Conf. Artificial Neural Networks (pp. 896–901). Berlin: Springer-Verlag.
Pike, F. G., Meredith, R. M., Olding, A. A., & Paulsen, O. (1999). Postsynaptic bursting is essential for "Hebbian" induction of associative long-term potentiation at excitatory synapses in rat hippocampus. J. Physiol. (Lond.), 518, 571–576.
Porr, B., Saudargiene, A., & Wörgötter, F. (2004). Analytical solution of spike-timing dependent plasticity based on synaptic biophysics. In Advances in neural information processing systems, 16. Cambridge, MA: MIT Press. In press.
Porr, B., & Wörgötter, F. (2003a). Isotropic sequence order learning. Neural Comp., 15, 831–864.
Porr, B., & Wörgötter, F. (2003b). Isotropic-sequence-order learning in a closed-loop behavioural system. Proc. Roy. Soc. B, 1811(361), 2225–2244.
Prokasy, W. F., Hall, J. F., & Fawcett, J. T. (1962). Adaptation, sensitization, forward and backward conditioning, and pseudo-conditioning of the GSR. Psychol. Rep., 10, 103–106.
Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comp., 13, 2221–2237.
Roberts, E. B., & Ramoa, A. S. (1999). Enhanced NR2A subunit expression and decreased NMDA receptor decay time at the onset of ocular dominance plasticity in the ferret. J. Neurophysiol., 81, 2587–2591.
Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7(3), 235–246.
Roberts, P. D., & Bell, C. C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86(2), 364–367.
Sabatini, B. L., Maravall, M., & Svoboda, K. (2001). Ca²⁺ signaling in dendritic spines. Current Opinion Neurobiol., 11, 349–356.
Schiller, J., Schiller, Y., & Clapham, D. E. (1998). Amplification of calcium influx into dendritic spines during associative pre- and postsynaptic activation: The role of direct calcium influx through the NMDA receptor. Nat. Neurosci., 1, 114–118.
Schiller, J., Schiller, Y., Stuart, G., & Sakmann, B. (1997). Calcium action potentials restricted to distal apical dendrites of rat neocortical pyramidal neurons. J. Physiol., 505, 605–616.
Senn, W., & Buchs, N. J. (2003). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Mathematical analysis. J. Comput. Neurosci., 14, 119–138.
Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comp., 13, 35–67.
Seung, H. S. (1998). Learning continuous attractors in recurrent networks. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 654–660). Cambridge, MA: MIT Press.
Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. (USA), 99(16), 10831–10836.
Singer, W., & Gray, C. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Song, I., & Huganir, R. L. (2002). Regulation of AMPA receptors during synaptic plasticity. Trends Neurosci., 25(11), 578–588.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 1–20.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
Sourdet, V., & Debanne, D. (1999). The role of dendritic filtering in associative long-term synaptic plasticity. Learning and Memory, 6, 422–447.
Sterratt, D. C., & van Ooyen, A. (2002). Does morphology influence temporal plasticity? In J. R. Dorronsoro (Ed.), ICANN 2002 (Vol. LNCS 2415, pp. 186–191). Berlin: Springer-Verlag.
Stuart, G., Spruston, N., Sakmann, B., & Häusser, M. (1997). Action potential initiation and backpropagation in neurons of the mammalian central nervous system. Trends Neurosci., 20, 125–131.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Mach. Learn., 3, 9–44.
Sutton, R., & Barto, A. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychol. Review, 88, 135–170.
Teyler, T. J., & DiScenna, P. (1987). Long-term potentiation. Annu. Rev. Neurosci., 10, 131–161.
Xie, X., & Seung, S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 199–208). Cambridge, MA: MIT Press.
Yuste, R., Majewska, A., Cash, S. S., & Denk, W. (1999). Mechanisms of calcium influx into hippocampal spines: Heterogeneity among spines, coincidence detection by NMDA receptors, and optical quantal analysis. J. Neurosci., 19, 1976–1987.

Received February 14, 2003; accepted August 20, 2003.
Communicated by Ad Aertsen
LETTER
Self-Organizing Dual Coding Based on Spike-Time-Dependent Plasticity

Naoki Masuda [email protected]
Kazuyuki Aihara [email protected]
Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan
It has been a matter of debate how firing rates or spatiotemporal spike patterns carry information in the brain. Recent experimental and theoretical work in part showed that these codes, especially a population rate code and a synchronous code, can be dually used in a single architecture. However, we are not yet able to relate the role of firing rates and synchrony to the spatiotemporal structure of inputs and the architecture of neural networks. In this article, we examine how feedforward neural networks encode multiple input sources in the firing patterns. We apply spike-time-dependent plasticity as a fundamental mechanism to yield synaptic competition and the associated input filtering. We use the Fokker-Planck formalism to analyze the mechanism for synaptic competition in the case of multiple inputs, which underlies the formation of functional clusters in downstream layers in a self-organizing manner. Depending on the types of feedback coupling and shared connectivity, clusters are independently engaged in population rate coding or synchronous coding, or they interact to serve as input filters. Classes of dual codings and functional roles of spike-time-dependent plasticity are also discussed.

1 Introduction

Neural coding is a subject of intense debate. Classically, firing rates of neurons play major roles in, for example, encoding amplitudes of sensory stimuli and delivering instructions to motor neurons. However, the need for temporal averaging is recognized as a severe drawback of single-neuron rate codes, and recent developments in recording techniques have enabled detection of synchronous firing, correlated firing, and their dynamical properties with high temporal resolution (Gray, König, Engel, & Singer, 1989; Aertsen, Gerstein, Habib, & Palm, 1989; Aertsen, Erb, & Palm, 1994; Vaadia et al., 1995; de Oliveira, Thiele, & Hoffmann, 1997; Salinas & Sejnowski, 2001).
Neural Computation 16, 627–663 (2004)
© 2004 Massachusetts Institute of Technology

Figure 1: Schematic diagram showing dual coding. A neural population, consisting of five neurons in this example, responds asynchronously (B, D) or synchronously (C, E) to an identical external stimulus (A). The temporal waveform of the stimulus is better approximated by population firing rates (D) obtained on the basis of asynchronous spike trains (B) than by population firing rates (E) obtained on the basis of synchronous spike trains (C).

These experimental results are suggestive of important functions provided by spatiotemporal spike codings based on individual spike timing and synchrony, which are beyond mere rate codings (von der Malsburg & Schneider, 1986; Abeles, 1991; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996; Gerstner, Kempter, van Hemmen, & Wagner, 1996; Masuda & Aihara, 2002a). However, the limitations on the rate codes mentioned above can be overcome if populations of neurons are pooled (Shadlen & Newsome, 1998; Gerstner, 2000). To address the problem of which code is dominantly used, several articles consider the possibility that temporal spike coding in the sense of synchrony and population rate coding can be used dually in a single neural network (Tsodyks, Uziel, & Markram, 2000; Masuda & Aihara, 2002b, 2003; van Rossum, Turrigiano, & Nelson, 2002). Population rates based on asynchronously firing neurons encode external inputs (see Figure 1A) relatively accurately, as schematically shown in Figures 1B and 1D. Figures 1C and 1E illustrate that synchronous firing or its interval encodes information on instantaneous input amplitudes with a lower resolution, but it is efficient in the sense that postsynaptic firing occurs more easily with synchronous inputs than with asynchronous inputs (Abeles, 1991; Salinas & Sejnowski, 2000, 2001; Stroeve & Gielen, 2001; Masuda & Aihara, 2002b; Kuhn, Rotter, & Aertsen, 2002; Kuhn, Aertsen, & Rotter, 2003). Labeled groups of neurons identified by synchrony are also believed to encode object information (von der Malsburg & Schneider, 1986; Abeles, 1991; Diesmann, Gewaltig, & Aertsen, 1999). These codings can be dynamically switched by modulating, for example, the noise strength (Masuda & Aihara, 2002b; van Rossum et al., 2002), parameters specifying single-neuron properties or network structure (Masuda & Aihara, 2003), or short-term dynamics of neurotransmitter (Tsodyks et al., 2000). Experimental evidence shows that such switching may occur as a result of changes in situations during a task or changes in environments (de Oliveira et al., 1997; Riehle, Grün, Diesmann, & Aertsen, 1997). However, theoretical analysis of dual coding has considered only global synchrony in homogeneous networks, which abandons embedding the spatial structure of inputs in network states and therefore can encode at most one input source. In reality, multiple neuronal assemblies formed via synchronous firing (von der Malsburg & Schneider, 1986; Sompolinsky, Golomb, & Kleinfeld, 1990; Abeles, 1991; Diesmann et al., 1999) or correlated firing (Aertsen et al., 1989, 1994; Vaadia et al., 1995; Fujii et al., 1996; Salinas & Sejnowski, 2001) may coexist in the brain, facilitating the segregation of distinct entities. This framework may be a key to answering the binding problem and the superposition catastrophe (von der Malsburg & Schneider, 1986; Fujii et al., 1996; Salinas & Sejnowski, 2001). The presumed global synchrony actually should be replaced by synchrony constrained to a subgroup of neurons. It is also possible that these clusters of synchronously firing neurons interact to be engaged in further information processing. Clusters are induced by mechanisms such as synaptic learning (Aertsen et al., 1994; Horn, Levy, Meilijson, & Ruppin, 2000; Levy, Horn, Meilijson, & Ruppin, 2001), tuned delays (Fujii et al., 1996), specific coupling structure (Sompolinsky et al., 1990), and local excitatory interaction accompanied by global inhibition (von der Malsburg & Schneider, 1986).
In this letter, we discuss the emergence of clusters and the functional role of clusters in regard to synaptic learning and feedback coupling, especially from the viewpoint of duality between rate coding and temporal coding. A recent distinguished experimental finding associated with learning and cluster formation is spike-time-dependent plasticity (STDP), in which synaptic weights are updated depending on relative fine timing between the presynaptic and postsynaptic spikes (see Abbott & Nelson, 2000, for review). Multiple recordings of pyramidal neurons in layer 5 of the rat neocortex (Markram, Lübke, Frotscher, & Sakmann, 1997), tectal neurons of Xenopus tadpoles (Zhang, Tao, Holt, Harris, & Poo, 1998), and the CA1 region of the rat hippocampus (Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000) revealed that synapses are potentiated when a presynaptic spike arrives about 20 ms or less before a postsynaptic spike. When the order of firing is reversed, the synapse is depressed (Markram et al., 1997; Zhang et al., 1998). The STDP learning, which extends the standard Hebbian learning, implies that spike timing and temporal codes are important for brain function. Model studies have shown possible functions of STDP learning. In feedback neural networks with appropriately tuned delays, STDP generally promotes self-organization of the coupling structure to result in clustered firing. The clustered firing serves as a mechanism for memory retrieval (Levy, Horn, Meilijson, & Ruppin, 2001), realization of synfire chains (Horn et al., 2000), and recovery of cognitive maps (Song & Abbott, 2001). Moreover, numerous studies show a fundamental consequence of the STDP learning in feedforward architecture, or synaptic competition, in which only the feedforward synapses representing more synchronous, precise, or early inputs are more likely to survive (Song, Miller, & Abbott, 2000; Song & Abbott, 2001). Synaptic competition is a winner-take-all sort of input filtering, and it is useful for explaining high coefficients of variation of interspike intervals (Abbott & Song, 1999; Song et al., 2000; Câteau & Fukai, 2003), coincidence detection (Gerstner et al., 1996; Kistler & van Hemmen, 2000), stabilization of firing rates (Kempter, Gerstner, & van Hemmen, 2001), difference learning (Rao & Sejnowski, 2001), column formation (Song & Abbott, 2001), development and recovery of cortical maps (Song & Abbott, 2001), and others (Abbott & Nelson, 2000; Gerstner & Kistler, 2002b). There are two main approaches to understanding synaptic competition mathematically. Methods based on dynamical systems for mean synaptic weights are successful in explaining synaptic competition in which feedforward synapses linking to a common downstream neuron are strengthened only when the inputs are synchronous (Kempter, Gerstner, & van Hemmen, 1999; Kempter et al., 2001; Kistler & van Hemmen, 2000; Gerstner & Kistler, 2002a, 2002b; Gütig, Aharonov, Rotter, & Sompolinsky, 2003). Although mean dynamics are easier to derive, another approach based on the Fokker-Planck equations supplies us with more detailed information, such as the distribution functions of synaptic weights (van Rossum, Bi, & Turrigiano, 2000; Rubin, Lee, & Sompolinsky, 2001; Câteau & Fukai, 2003).
For example, the bimodal distribution of synaptic weights obtained by numerical simulations (Song et al., 2000; Song & Abbott, 2001) is explained quasianalytically. However, these methods largely ignore inputs with temporal and spatial structure, which are related to synchrony presumably important for brain function. Exceptions are Rubin et al. (2001) and Gütig et al. (2003), which describe synaptic dynamics for temporally structured inputs. In this letter, we first analyze the formation of clusters via STDP when multiple inputs are presented. In section 2, we explain the feedforward neural networks used in this article. In section 3, using the Fokker-Planck equations and their reduction to effective deterministic dynamics, we examine how synaptic competition occurs, especially when multiple inputs with equal or different degrees of synchrony are presented to upstream neurons. The results provide the basis of cluster formation in the downstream layers. Then in section 4, we numerically investigate what kinds of information coding in terms of synchronous, clustered, and asynchronous states can be induced in networks. In particular, shared connectivity and feedback coupling within the downstream layers are modulated so that coding schemes switch. Consequently, clustered synchrony, which is related to the functional cell assemblies (Aertsen et al., 1989, 1994; Fujii et al., 1996), is featured. Section 5 is devoted to discussions on various dual codings and dynamical states, including those presented in section 4, and on the functional relation between learning and synchrony.
2 Model

The model neural network is schematically shown in Figure 2. The upstream layer, which is called the sensory layer because it receives external inputs directly, contains n_1 sensory neurons. The firing time of the ith sensory neuron is determined by an inhomogeneous Poisson process with rate function ν_i(t) given as the external input. Accordingly, the probability that the neuron fires once in [t, t + Δt] is ν_i(t)Δt for Δt → 0, regardless of the history of firing. The output spike train reflects possible deterministic information on ν_i(t) and stochasticity generated by, for example, the ongoing background activity (Arieli, Sterkin, Grinvald, & Aertsen, 1996; Shadlen & Newsome, 1998; Brunel, 2000; Gerstner, 2000). The downstream layer, or cortical layer, comprises n_2 cortical neurons that process spike inputs from the sensory layer. Cortical neurons are assumed to be leaky integrate-and-fire (LIF) neurons with membrane leak rate γ and a threshold for firing equal to 1. The membrane potential of the ith cortical neuron is denoted by v_i(t).

Figure 2: Architecture of the feedforward neural network. LIF = leaky integrate-and-fire neurons.

If the jth cortical neuron fires its kth spike at t = T′_{j,k}, it sends a spike after synaptic delay τ = 0.3 ms to the other cortical neurons. The time course of the synaptic current in the ith neuron receiving the spike is written as ε_{i,j} g(t − T′_{j,k} − τ), where ε_{i,j} is the feedback synaptic weight from the jth neuron to the ith neuron. Without losing generality, g(t) satisfies g(t) = 0 (t < 0), g(t) ≥ 0, and lim_{t→∞} g(t) = 0. After firing, the neuron is reset to the resting state v_i = 0. In section 4, we consider three cases of feedback coupling: (1) when ε_{i,j} is turned off, (2) when ε_{i,j} is homogeneous and constant in time, and (3) when ε_{i,j} changes dynamically as a result of learning; all of these cases yield different coding schemes. Generally, stronger coupling tends to induce synchronous firing, whereas weaker coupling leads to asynchronous firing that realizes efficient population rate coding (Tsodyks et al., 2000; Masuda & Aihara, 2003). Furthermore, this tendency is quite common to both excitatory and inhibitory coupling, although eventual firing rates become different (Brunel, 2000; Gerstner & Kistler, 2002b). Therefore, we assume that ε_{i,j} ≥ 0 to focus on the dynamics of the STDP learning. The ith cortical neuron also receives feedforward input with time course w_{i,j} g(t − T_{j,k}) from the jth sensory neuron, which emits its kth spike at t = T_{j,k}. The delay for feedforward synapses is assumed to be absent or, equivalently, uniformly constant. The feedforward synaptic weight w_{i,j} is assumed to change via STDP. The coupling strength w_{i,j} from the jth sensory neuron to the ith cortical neuron is changed by

$$G(\delta t) = \begin{cases} A_+ \exp(-\delta t/\tau_0), & \delta t > 0, \\ -A_- \exp(\delta t/\tau_0), & \delta t \le 0, \end{cases} \tag{2.1}$$

where A_+ and A_− are the sizes of the synaptic modification by a single STDP event, and δt ≡ t_post − t_pre is the time difference between a postsynaptic spike instant t_post and a presynaptic spike instant t_pre. The shape of the learning window is shown in Figure 3A. We confine w in [0, w_max] by resetting it to w_max (or 0) whenever w goes beyond w_max (or below 0). Excluding inhibitory feedforward synapses is justified by the fact that the weights of inhibitory synapses decay to small values as a result of the competitive dynamics, as we will see in section 3. Let us assume A_+ < A_− to prohibit the explosion of synaptic weights and induce synaptic competition (Kempter et al., 1999, 2001; Song et al., 2000; Song & Abbott, 2001; Gütig et al., 2003). For simplicity, we consider STDP only due to the presynaptic spike temporally closest for each postsynaptic spike (Kistler & van Hemmen, 2000; van Rossum et al., 2000; Câteau & Fukai, 2003). To summarize, the dynamics of the ith cortical neuron between two consecutive spikes at t = T′_{i,k} and t = T′_{i,k+1} is represented by

$$\frac{dv_i}{dt} = \sum_{j=1}^{n_1} \sum_k w_{i,j}\, g(t - T_{j,k}) + \sum_{j=1}^{n_2} \sum_{k'} \epsilon_{i,j}\, g(t - T'_{j,k'} - \tau) - \gamma v_i(t), \qquad v_i(T'_{i,k}) = 0. \tag{2.2}$$
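As a concrete illustration, the model of equation 2.2 can be integrated directly. The following sketch is not the authors' code: it drives LIF cortical neurons with inhomogeneous Poisson sensory spikes through an exponential synaptic kernel, with feedback coupling ε turned off (case 1 of section 4), and all parameter values are illustrative assumptions.

```python
import math
import random

def simulate(rate_fn, w, gamma=0.1, dt=0.1, t_max=200.0, tau_g=5.0, seed=0):
    """Euler-integrate dv_i/dt = sum_j w[i][j] g(t - T_jk) - gamma * v_i.

    rate_fn(j, t) is the rate (spikes/ms) of sensory neuron j;
    g(t) = exp(-t/tau_g) for t >= 0 is the synaptic kernel.
    Returns the list of spike times of each cortical neuron.
    """
    rng = random.Random(seed)
    n2, n1 = len(w), len(w[0])
    # s[j] is the exponentially filtered spike train of sensory neuron j:
    # each spike adds 1, and s decays with time constant tau_g, which is
    # equivalent to summing g(t - T_jk) over past spikes.
    s = [0.0] * n1
    v = [0.0] * n2
    spikes = [[] for _ in range(n2)]
    t = 0.0
    decay = math.exp(-dt / tau_g)
    while t < t_max:
        for j in range(n1):
            s[j] *= decay
            if rng.random() < rate_fn(j, t) * dt:  # inhomogeneous Poisson firing
                s[j] += 1.0
        for i in range(n2):
            drive = sum(w[i][j] * s[j] for j in range(n1))
            v[i] += dt * (drive - gamma * v[i])
            if v[i] >= 1.0:                        # threshold for firing is 1
                spikes[i].append(t)
                v[i] = 0.0                         # reset to the resting state
        t += dt
    return spikes
```

For example, `simulate(lambda j, t: 0.02, [[0.5] * 10])` drives one cortical neuron by ten sensory neurons firing at 0.02 spikes/ms each, which is enough input to make it fire repeatedly.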
Figure 3: Asymmetric (A) and symmetric (B) learning windows used for the STDP learning. Single updating magnitudes are normalized by the maximum synaptic weights.
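In code, the asymmetric window of equation 2.1 (Figure 3A) and the clipping of w to [0, w_max] described in the text can be sketched as follows; the parameter values are illustrative assumptions, not those used in the article (note A_plus < A_minus, as assumed to induce competition).

```python
import math

A_PLUS, A_MINUS, TAU_0, W_MAX = 0.01, 0.012, 20.0, 1.0  # illustrative values

def stdp_window(dt):
    """G(dt) of eq. 2.1: potentiation for pre-before-post (dt > 0),
    depression otherwise."""
    if dt > 0:
        return A_PLUS * math.exp(-dt / TAU_0)
    return -A_MINUS * math.exp(dt / TAU_0)

def update_weight(w, t_pre, t_post):
    """Apply one STDP event for a pre/post spike pair and confine w
    to [0, W_MAX], as in the text."""
    w += stdp_window(t_post - t_pre)
    return min(max(w, 0.0), W_MAX)
```

A presynaptic spike at 10 ms followed by a postsynaptic spike at 15 ms potentiates the synapse; reversing the order depresses it.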
3 Analysis of Synaptic Competition with Multiple External Inputs

3.1 Fokker-Planck Formalism for Inhomogeneous Inputs. In this section, we introduce the Fokker-Planck equations to examine how the feedforward synapses get organized through the STDP learning. The asymptotic distributions of synaptic weights have been evaluated for spatially uncorrelated inputs (van Rossum et al., 2002; Câteau & Fukai, 2003) and prescribed input-output correlation (Rubin et al., 2001). On the other hand, powerful analysis by the dynamical system theory (Kempter et al., 1999, 2001) has limitations in the vicinity of the boundaries where the diffusive term plays an important role, as we will see below. Our particular interest is to understand the dynamics and the asymptotics of the synaptic weights when inhomogeneous external inputs are applied. The Fokker-Planck equation for p(w, t), or the probability distribution function of the synaptic weight w at time t, is written as

$$\frac{\partial p(w,t)}{\partial t} = -\frac{\partial}{\partial w}\big(A(w)\,p(w,t)\big) + \frac{1}{2}\frac{\partial^2}{\partial w^2}\big(B(w)\,p(w,t)\big) \tag{3.1}$$

with the following reflective boundary conditions:

$$J(w,t) \equiv A(w)\,p(w,t) - \frac{1}{2}\frac{\partial}{\partial w}\big(B(w)\,p(w,t)\big) = 0 \tag{3.2}$$

at w = 0 and w = w_max. These conditions guarantee in a natural way that the density function p(w, t) does not extend beyond the boundaries. The drift term A(w) and the diffusion term B(w) satisfy

$$A(w) = \int_{-\infty}^{\infty} d\delta t\; G(\delta t)\, T(w, \delta t), \tag{3.3}$$

$$B(w) = \int_{-\infty}^{\infty} d\delta t\; G^2(\delta t)\, T(w, \delta t), \tag{3.4}$$

where T(w, δt) is the rate at which the event that the presynaptic spike time is t_post − δt occurs, given w and a postsynaptic spike at t = t_post (Câteau & Fukai, 2003). Given that w is confined in [0, w_max], we set ∂p(w, t)/∂t = 0 in equation 3.1 to obtain the asymptotic distribution of w as follows:

$$p(w) = N \exp\left( \int^{w} dw'\, \frac{2A(w') - \frac{\partial B(w')}{\partial w'}}{B(w')} \right), \qquad 0 \le w \le w_{\max}, \tag{3.5}$$
where N is the normalization constant. We now derive the drift term for inhomogeneous inputs ν_j(t). The method for calculating actual distributions is explained in section 3.2 for the specific case of two input sources. To calculate the drift term, we ignore the membrane leak in equation 2.2 (van Rossum et al., 2000; Rubin et al., 2001). The initial membrane potentials are also assumed to be distributed uniformly in [0, 1]. With these assumptions, the distribution of membrane potentials is uniform in [0, 1] at the instant of a presynaptic spike, and the firing rate of a postsynaptic neuron at t = t_post conditioned by the presynaptic firing time and the synaptic weight is equal to the rate of increase in the membrane potential. Therefore, using equation 2.2, the conditioned firing rate is given by

$$\nu_i^{\mathrm{post}}(w_{i,j}, t_{\mathrm{post}} \mid t_{\mathrm{pre}}) = w_{i,j}\,g(t_{\mathrm{post}} - t_{\mathrm{pre}}) + \int_{-\infty}^{t_{\mathrm{post}}} dt' \sum_{j'=1}^{n_1} w_{i,j'}\,\nu_{j'}(t')\,g(t_{\mathrm{post}} - t') + \int_{-\infty}^{t_{\mathrm{post}}-\tau} dt' \sum_{i'=1,\,i'\neq i}^{n_2} \epsilon_{i,i'}\,\nu_{i'}^{\mathrm{post}}(t')\,g(t_{\mathrm{post}} - t' - \tau). \tag{3.6}$$
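The uniform-potential argument behind equation 3.6 is easy to check numerically. In this sketch (an illustration, not from the article), membrane potentials of a leak-free population are drawn uniformly from [0, 1); adding a small increment dv to every neuron makes a fraction close to dv cross the threshold at 1, so the instantaneous population firing rate equals the rate of increase of the membrane potential.

```python
import random

def crossing_fraction(dv, n=100000, seed=1):
    """Fraction of potentials, uniform in [0, 1), pushed past threshold 1
    by a deterministic increment dv (Monte Carlo estimate)."""
    rng = random.Random(seed)
    crossed = sum(1 for _ in range(n) if rng.random() + dv >= 1.0)
    return crossed / n
```

For instance, `crossing_fraction(0.05)` returns a value close to 0.05, matching the identification of the conditioned firing rate with the rate of increase of the potential.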
It follows from Bayes' theorem that

$$\nu_j(t_{\mathrm{pre}})\,\nu_i^{\mathrm{post}}(w_{i,j}, t_{\mathrm{post}} \mid t_{\mathrm{pre}}) = \nu_i^{\mathrm{post}}(t_{\mathrm{post}})\, T(w_{i,j},\, t_{\mathrm{post}} - t_{\mathrm{pre}} \mid t_{\mathrm{post}}), \tag{3.7}$$
where ν_i^post(t_post) denotes the firing rate of the ith cortical neuron, and T(w_{i,j}, · | t_post) is the T(w_{i,j}, ·) given t_post. Substitution of t_pre = t_post − δt and equation 3.6 into equation 3.7 yields

$$T(w_{i,j}, \delta t \mid t_{\mathrm{post}}) = \frac{\nu_j(t_{\mathrm{post}} - \delta t)}{\nu_i^{\mathrm{post}}(t_{\mathrm{post}})} \left\{ w_{i,j}\,g(\delta t) + \int_{-\infty}^{t_{\mathrm{post}}} dt' \sum_{j'=1}^{n_1} w_{i,j'}\,\nu_{j'}(t')\,g(t_{\mathrm{post}} - t') + \int_{-\infty}^{t_{\mathrm{post}}-\tau} dt' \sum_{i'=1,\,i'\neq i}^{n_2} \epsilon_{i,i'}\,\nu_{i'}^{\mathrm{post}}(t')\,g(t_{\mathrm{post}} - t' - \tau) \right\}. \tag{3.8}$$
Equation 3.8 guarantees that the effect of single spikes represented in the first term appears only for δt > 0, reflecting the causality of the time course of synaptic current (g(t) = 0, t < 0). However, the second term of equation 3.8 suggests that T(w_{i,j}, δt | t_post) for δt < 0 is more complicated than the simple multiplication of the presynaptic and postsynaptic firing rates (Abbott & Song, 1999; van Rossum et al., 2000; Câteau & Fukai, 2003) because of possible correlation between external inputs (Kempter et al., 1999, 2001; Gerstner, 2001; Gerstner & Kistler, 2002a; Gütig et al., 2003). Because the STDP learning occurs on long timescales, T(w_{i,j}, δt) is averaged over t_post, which is proportional to

$$\langle\nu(\cdot)\rangle_t\, w_{i,j}\, g(\delta t) + \sum_{j'=1}^{n_1} w_{i,j'} \int_0^{\infty} dt'\, g(t')\, \langle\nu_j(\cdot)\,\nu_{j'}(\cdot + \delta t - t')\rangle_t + \sum_{i'=1,\,i'\neq i}^{n_2} \epsilon_{i,i'} \int_0^{\infty} dt'\, g(t')\, \langle\nu_j(\cdot)\,\nu_{i'}^{\mathrm{post}}(\cdot + \delta t - t' - \tau)\rangle_t, \tag{3.9}$$
where ⟨·⟩_t is the time averaging. Moreover, we assume statistical homogeneity of inputs; ⟨ν_j(·)⟩_t = ⟨ν(·)⟩_t (1 ≤ j ≤ n_1). Since the scaling of T(w_{i,j}, δt) merely changes the timescale of equation 3.1, we use equation 3.9 for equation 3.3 to have

$$\begin{aligned} A(w_{i,j}) &= \langle\nu(\cdot)\rangle_t\, w_{i,j} \int_0^{\infty} dt\, g(t)\,G(t) \\ &\quad + \langle\nu(\cdot)\rangle_t^2 \sum_{j'=1}^{n_1} w_{i,j'} \int_{-\infty}^{\infty} dt\, G(t) \int_0^{\infty} dt'\, g(t')\, C_{j,j'}(t - t') \\ &\quad + \langle\nu(\cdot)\rangle_t^2 \sum_{j'=1}^{n_1} w_{i,j'} \int_{-\infty}^{\infty} dt\, G(t) \int_0^{\infty} dt'\, g(t') \\ &\quad + \sum_{i'=1,\,i'\neq i}^{n_2} \epsilon_{i,i'} \int_{-\infty}^{\infty} dt\, G(t) \int_0^{\infty} dt'\, g(t')\, \langle\nu_j(\cdot)\,\nu_{i'}^{\mathrm{post}}(\cdot + t - t' - \tau)\rangle_t, \end{aligned} \tag{3.10}$$
N. Masuda and K. Aihara
where

C_{j,j'}(t) = \frac{\langle \nu_j(\cdot)\, \nu_{j'}(\cdot + t) \rangle_t - \langle \nu(\cdot) \rangle_t^2}{\langle \nu(\cdot) \rangle_t^2}   (3.11)
is the time correlation between \nu_j(t) and \nu_{j'}(t). The diffusion term is derived in a similar manner by combining equations 3.4 and 3.9. Feedforward synapses with larger A(w_{i,j}) are more likely to be potentiated, whereas synapses with smaller A(w_{i,j}) are suppressed. Three factors determine the drift term, which agrees with the results derived by mean-field analysis (Kempter et al., 1999, 2001; Song et al., 2000; Gerstner & Kistler, 2002a). In equation 3.10, the first term represents the single-spike effect, the second term expresses the effect of synchronous inputs, and the third term uniformly depresses the synaptic weights for normalization because \sum_{j'=1}^{n_1} w_{i,j'} > 0, \int_{-\infty}^{\infty} dt\, G(t) < 0, and \int_0^{\infty} dt'\, g(t') > 0. The existence of the second term is consistent with numerical results (Song & Abbott, 2001) and with the fact that inducing potentiation is more effective with synchronous inputs than with asynchronous inputs (Gerstner et al., 1996; Abbott & Song, 1999; Kempter et al., 1999).

3.2 Dynamics of Synaptic Competition. For a deeper understanding of the dynamical aspects of synaptic competition, let us prepare two temporally changing external inputs, each of which covers half of the sensory layer exclusively (Gütig et al., 2003). We set

C_{j,j'}(t) = \begin{cases} c_a\, e^{-|t|/\tau_c}, & 1 \le j, j' \le n_1/2, \\ c_b\, e^{-|t|/\tau_c}, & n_1/2 + 1 \le j, j' \le n_1, \\ 0, & \text{otherwise}, \end{cases}   (3.12)
where c_a > 0 and c_b > 0 are the strengths of the intragroup correlation. This is the correlation function for colored noise. However, the results that follow are not sensitive to the choice of C_{j,j'}(t) as long as C_{j,j'}(t) is symmetric, has a positive maximum at t = 0, and decays to zero as t \to \pm\infty. Furthermore, we assume the following time course of the synaptic current:

g(t) = \begin{cases} \exp(-t/\tau_g), & t \ge 0, \\ 0, & t < 0. \end{cases}   (3.13)
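Before moving on, the asymmetry that this combination of correlation function and synaptic current creates can be checked numerically. This is only a sketch: it takes the colored-noise correlation of equation 3.12 with c = 1 for simplicity, the exponential g of equation 3.13, and an arbitrary grid and cutoff; the time constants are the values used later in section 3.2.

```python
import numpy as np

# F(t) = int_0^inf dt' g(t') C(t - t'): larger for positive t (where the
# learning window potentiates) than for negative t of the same magnitude.
tau_c, tau_g = 14.0, 5.0            # ms, as in the simulations below
step = 0.01
tp = np.arange(0.0, 300.0, step)    # t' grid for the integral

def F(t):
    g = np.exp(-tp / tau_g)         # synaptic current, equation 3.13
    C = np.exp(-np.abs(t - tp) / tau_c)  # colored-noise correlation, c = 1
    return float(np.sum(g * C) * step)
```

For t < 0 the integral has the closed form e^{t/\tau_c}\, \tau_g \tau_c / (\tau_g + \tau_c), which the sketch reproduces, while the value at the mirrored positive lag is markedly larger.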
The second term of equation 3.10 becomes positive because of the cooperation of g(t), which implements biologically realistic time courses of the synaptic current such as the one in equation 3.13, and synchronous inputs. On average, \int_0^{\infty} dt'\, C_{j,j'}(t - t')\, g(t') is actually larger for positive t, for which the synapses are potentiated, than for negative t. This feature is missed if overly simple time courses of the synaptic current, such as delta-shaped spikes
(van Rossum et al., 2000; Câteau & Fukai, 2003), are used. In the numerical simulations below, the second term is not large enough for the first term to be ignored, as it has been in other work (Kempter et al., 1999). The interplay between the effects of single-neuron firing and those of synchronous inputs induces competitive dynamics. We denote by w_{i,a} and w_{i,b} the representative synaptic weights corresponding to w_{i,1}, w_{i,2}, \ldots, w_{i,n_1/2} and w_{i,n_1/2+1}, \ldots, w_{i,n_1}, respectively. To describe the dynamics of these mean weights on the basis of the Fokker-Planck formalism, we ignore the feedback coupling in equation 2.2 for the moment by setting \epsilon_{i,i'} = 0, 1 \le i, i' \le n_2. Since it suffices to consider just one cortical neuron in this case, we omit the subscript of the cortical neuron from w_{i,a} and w_{i,b}. By substituting equations 2.1, 3.12, and 3.13 into equation 3.10, we obtain

A(w_a) = A_1 w_a + A_2 c_a \langle w_a \rangle - A_3 (\langle w_a \rangle + \langle w_b \rangle) \equiv A(w_a, c_a, \langle w_a \rangle, \langle w_b \rangle),   (3.14)
A(w_b) = A(w_b, c_b, \langle w_b \rangle, \langle w_a \rangle),   (3.15)
B(w_a) = \langle \nu(\cdot) \rangle_t\, w_a \frac{\tau_0 \tau_g A_+^2}{\tau_0 + 2\tau_g} + \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2}\, \langle w_a \rangle\, c_a \Bigg[ A_-^2 \frac{\tau_c^2 \tau_g \tau_0}{(\tau_c + \tau_g)(2\tau_c + \tau_0)} + \Bigg( \frac{2\tau_c \tau_g^3 \tau_0}{(\tau_g^2 - \tau_c^2)(\tau_0 + 2\tau_g)} - \frac{\tau_c^2 \tau_g \tau_0}{(\tau_g - \tau_c)(\tau_0 + 2\tau_c)} \Bigg) A_+^2 \Bigg] + \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2}\, (\langle w_a \rangle + \langle w_b \rangle)(A_-^2 + A_+^2) \frac{\tau_0 \tau_g}{2}   (3.16)
\equiv B(w_a, c_a, \langle w_a \rangle, \langle w_b \rangle),   (3.17)
B(w_b) = B(w_b, c_b, \langle w_b \rangle, \langle w_a \rangle),   (3.18)

where \langle w_a \rangle and \langle w_b \rangle denote the ensemble averages with respect to the corresponding probability distribution functions, and

A_1 \equiv \langle \nu(\cdot) \rangle_t \frac{\tau_0 \tau_g}{\tau_0 + \tau_g} A_+,   (3.19)
A_2 \equiv \langle \nu(\cdot) \rangle_t \frac{n_1}{2} \Bigg( -A_- \frac{\tau_c^2 \tau_g \tau_0}{(\tau_c + \tau_g)(\tau_c + \tau_0)} + \Bigg( \frac{2\tau_c \tau_g^3 \tau_0}{(\tau_g^2 - \tau_c^2)(\tau_0 + \tau_g)} - \frac{\tau_c^2 \tau_g \tau_0}{(\tau_g - \tau_c)(\tau_0 + \tau_c)} \Bigg) A_+ \Bigg),   (3.20)
A_3 \equiv \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2}\, (A_- - A_+)\, \tau_0 \tau_g.   (3.21)
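The simple closed-form factors above can be sanity-checked by direct quadrature. This sketch assumes the standard exponential STDP window of equation 2.1, G(t) = A_+ e^{-t/\tau_0} for t > 0 and -A_- e^{t/\tau_0} for t < 0, together with g(t) of equation 3.13; the grid is an arbitrary discretization choice.

```python
import numpy as np

tau_g, tau_0 = 5.0, 20.0                  # ms
A_plus, A_minus = 0.00375, 0.004

dt = 0.001
t = np.arange(dt / 2, 400.0, dt)          # midpoint grid on (0, 400) ms

# First term of equation 3.10:
# int_0^inf g(t) G(t) dt = A_plus * tau_0 * tau_g / (tau_0 + tau_g), cf. eq. 3.19.
a1_factor = (np.exp(-t / tau_g) * A_plus * np.exp(-t / tau_0)).sum() * dt

# Third term: (int_-inf^inf G) * (int_0^inf g) = -(A_minus - A_plus) * tau_0 * tau_g,
# cf. eq. 3.21; both half-lines of G contribute an exponential with tau_0.
int_G = ((A_plus - A_minus) * np.exp(-t / tau_0)).sum() * dt
int_g = np.exp(-t / tau_g).sum() * dt
```

With A_- > A_+ the product int_G * int_g is negative, which is the depression-dominated normalization discussed after equation 3.11.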
Distributions of w_a (or w_b) are obtained by substituting equations 3.14 and 3.17 (or equations 3.15 and 3.18) into equation 3.5. The obtained distributions are in turn used for calculating \langle w_a \rangle and \langle w_b \rangle, which are again used in equations 3.14, 3.15, 3.17, and 3.18. The asymptotic distributions p_a(w_a) and p_b(w_b) are obtained by repeating this procedure until the two distributions converge. The results up to this point would also hold for inhibitory feedforward synapses. However, equation 3.14 indicates that the drift terms for inhibitory synapses are generally smaller than those for excitatory synapses. Consequently, the inhibitory synapses are unlikely to survive in the end. For this reason, only the excitatory synapses are considered in our model. We then numerically calculate the dynamics of \langle w_a \rangle and \langle w_b \rangle with \tau_g = 5 ms, \tau_c = 14 ms, \tau_0 = 20 ms, A_+ = 0.00375, A_- = 0.004, and w_{\max} = 0.07. To inspect the dynamical aspects closely, we change \langle w_a \rangle and \langle w_b \rangle only by a small amount (5% at most) in each time step. The results for two input sources with different degrees of correlation, c_a = 0.35 and c_b = 0.25, are shown in Figure 4A. Synaptic competition is observed, and as expected, the ultimate distribution in which \langle w_a \rangle wins has the larger basin of attraction with respect to the initial condition (Kempter et al., 1999). On the other hand, the results for c_a = c_b = 0.25 shown in Figure 4B indicate that the two inputs are equally likely to win (Song & Abbott, 2001), with the separatrix being the straight line \langle w_a \rangle = \langle w_b \rangle. If we start from random initial distributions of synaptic weights, half of the cortical neurons are devoted to encoding \nu_a(t), whereas the other half encode \nu_b(t). When the initial condition is close to the separatrix, the actual final distribution depends on the fluctuation expressed by equation 3.4 as well as on the initial condition.
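The winner-take-all character of this competition can be sketched with a minimal Euler iteration of the drift equations 3.14-3.15, treating the ensemble means as the values themselves. The constants A1, A2, A3 below are hypothetical values chosen only so that competition occurs; they are not the values derived from the paper's parameters.

```python
# Deterministic sketch of mean-weight competition (equations 3.14-3.15).
# A1, A2, A3 are hypothetical illustrative constants, not derived values.
A1, A2, A3 = 0.2, 1.0, 0.5
ca, cb = 0.35, 0.25            # the more and the less correlated input
w_max = 0.07
wa = wb = w_max / 2            # start on the diagonal separatrix region

dt = 0.1
for _ in range(5000):
    dwa = A1 * wa + A2 * ca * wa - A3 * (wa + wb)
    dwb = A1 * wb + A2 * cb * wb - A3 * (wa + wb)
    wa = min(max(wa + dt * dwa, 0.0), w_max)   # hard bounds [0, w_max]
    wb = min(max(wb + dt * dwb, 0.0), w_max)
```

Because c_a > c_b, the symmetric starting point lies off the stable manifold of the saddle, and the trajectory ends with w_a pinned at w_max and w_b at zero, mirroring Figure 4A.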
This result coincides with the direct extension of the results based on dynamical systems (Kempter et al., 1999, 2001; Gerstner & Kistler, 2002a). In these articles, the cortical or downstream neurons are also assumed to be inherently stochastic, and the correlation between instantaneous firing rates is calculated. It is interesting to see that the same result is also derived when we consider the spiking mechanism in detail.

3.3 Effective Deterministic Dynamics. Figures 4A and 4B show that a saddle point and a separatrix are associated with synaptic competition. To understand the competitive dynamics more clearly, we reduce the stochastic dynamics of learning to deterministic dynamics. Consequently, we investigate how the interplay of the effective drift terms \langle A(w_a) \rangle and \langle A(w_b) \rangle determines the dynamics of \langle w_a \rangle and \langle w_b \rangle. With the detailed calculations developed in appendix A, the results of this mean-field analysis are summarized in Figure 4C. The effect of the diffusion terms tied with the boundary conditions is manifested near the boundaries. The amount of noise specifies the level of the noise floor indicated in the figure. A saddle point and two stable fixed points exist, as shown in Figures 4A and 4B. The separatrix characterizes two basins of attraction, in each of which the input \nu_a(t) or \nu_b(t) is dominantly encoded after learning. Individual synaptic weights are attracted to one of the two stable fixed points, the selection of which depends on the initial conditions as well as on the dynamical noise caused by the stochastic arrival of spikes and the random wiring between layers. If \nu_a(t) is more significant than \nu_b(t), with c_a > c_b, the slope of the nullcline \langle \dot{w}_a \rangle = 0 increases to make the basin for \nu_a(t) larger. Our results extend the previous results (Kempter et al., 1999, 2001) in that the dynamical features of the synaptic weights are treated with realistic boundary conditions and in that the competition between two equally synchronous inputs is explicitly analyzed.

Figure 4: Dynamics of mean synaptic weights when the degrees of synchrony of the two inputs are different (A) or the same (B). The fixed points and the nullclines for (B) are schematically shown in (C).
When c_a = c_b, the analysis in appendix B shows that the saddle point exists if

A_1 + c_a A_2 > 0,   (3.22)
A_1 + c_a A_2 - 2A_3 < 0.   (3.23)
Equation 3.22 is usually satisfied because the synaptic weights grow when the global depression term proportional to A_3 is absent. Equation 3.23 indicates that the depressing effect overrides the potentiating effect at the saddle point, which guarantees that the summed synaptic weights do not diverge (Kempter et al., 2001; Gerstner & Kistler, 2002a). We have relied on mean-field descriptions by averaging the drift terms, or equations 3.14 and 3.15, over the weight space. This reduction renders our results essentially the same as those derived by simpler methods with differential equations (Kempter et al., 1999, 2001; Gerstner & Kistler, 2002a). However, the drift terms actually differ across individual synapses even if the presynaptic neurons belong to the same group. Because the drift terms are monotonically increasing in w_{i,j}, a synapse that happens to have a large w_{i,j} is likely to be enhanced, whereas a synapse with a small w_{i,j} at a certain moment will be depressed. Compared with the differential-equation approaches, the Fokker-Planck formalism enables us to look into the intragroup competition that occurs when there is an excess of presynaptic neurons (Song et al., 2000; Song & Abbott, 2001). It is straightforward to apply the arguments to the cases of more than two inputs (see appendix B). When m inputs that are equally strong in terms of synchrony are received and the network is configured so that synaptic competition occurs, generalization of equations 3.22 and 3.23 results in

\lambda_1 = \cdots = \lambda_{m-1} = A_1 + c_a A_2 > 0,   (3.24)
\lambda_m = A_1 + c_a A_2 - m A_3 < 0.   (3.25)
Accordingly, the saddle point has only a one-dimensional stable manifold, whose eigenvector is (1, 1, \ldots, 1). The other m - 1 unstable eigenmodes underlie the symmetric competitive property of the STDP learning. Let us next assume that one input correlation is higher than the others, as follows:

c_1 > c_2 = c_3 = \cdots = c_m,   (3.26)
where we use c_1 and c_2 instead of c_a and c_b to avoid confusion. Then it follows that

\lambda_1 = A_1 + c_1 A_2 + \frac{(c_1 - c_2) A_2}{2} + \frac{m A_3 \left(-1 + \sqrt{1 + D}\right)}{2},   (3.27)
\lambda_2 = \cdots = \lambda_{m-1} = A_1 + c_1 A_2 > 0,   (3.28)
\lambda_m = A_1 + c_1 A_2 + \frac{(c_1 - c_2) A_2}{2} - \frac{m A_3 \left(1 + \sqrt{1 + D}\right)}{2},   (3.29)
where

D = \frac{(c_1 - c_2)^2 A_2^2 + (2m - 4)(c_1 - c_2) A_2 A_3}{m^2 A_3^2} > 0.   (3.30)
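A useful consistency check on equations 3.27-3.30 is that the degenerate limit c_1 \to c_2 recovers equations 3.24 and 3.25. The sketch below does this with hypothetical constants A1, A2, A3 (chosen only for illustration, not derived from the paper's parameters) and also confirms the eigenvalue ordering in the nondegenerate case.

```python
import math

# Hypothetical illustrative constants; they satisfy A1 + c*A2 > 0 and
# A1 + c*A2 - m*A3 < 0 so that a saddle exists.
A1, A2, A3, m = 0.2, 1.0, 0.5, 3

def eigenvalues(c1, c2):
    """Eigenvalues of equations 3.27-3.29 with D from equation 3.30."""
    D = ((c1 - c2)**2 * A2**2 + (2*m - 4)*(c1 - c2)*A2*A3) / (m**2 * A3**2)
    lam1 = A1 + c1*A2 + (c1 - c2)*A2/2 + m*A3*(-1 + math.sqrt(1 + D))/2
    lam_mid = A1 + c1*A2                 # lambda_2 = ... = lambda_{m-1}
    lam_m = A1 + c1*A2 + (c1 - c2)*A2/2 - m*A3*(1 + math.sqrt(1 + D))/2
    return lam1, lam_mid, lam_m

l1, lmid, lm = eigenvalues(0.35, 0.25)   # nondegenerate case
d1, dmid, dm = eigenvalues(0.25, 0.25)   # degenerate limit c1 -> c2
```

In the degenerate limit, D = 0 and the correction terms vanish, so the eigenvalues collapse to A_1 + c A_2 and A_1 + c A_2 - m A_3 exactly.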
In general, the configuration of synaptic weights is attracted along the eigenmode with the largest eigenvalue (Kempter et al., 2001; Gerstner & Kistler, 2002a). Consequently, only the synapses from the neurons encoding the input with autocorrelation c_1, which correspond to the eigenmode for \lambda_1, are likely to be eventually strengthened at the expense of the others if A_2 > 0. In the nondegenerate case where

c_1 > c_2 > \cdots > c_m   (3.31)

holds, the calculations in appendix B guarantee

\lambda_1 > \lambda_2 > \cdots > \lambda_{m-1} > 0 > \lambda_m.   (3.32)
The basic dynamics of the mean weights associated with the m groups is similar to that of the previous examples. The m-dimensional weight profile contracts along the one-dimensional stable manifold. Then, along the (m-1)-dimensional unstable manifold, it escapes from states in which all the weights are more or less balanced. In nondegenerate cases, the weights chiefly move along the eigenvector associated with \lambda_1, which is \left(\frac{1}{A_1 + c_1 A_2 - \lambda_1}, \frac{1}{A_1 + c_2 A_2 - \lambda_1}, \ldots, \frac{1}{A_1 + c_m A_2 - \lambda_1}\right)^t, where t denotes the transpose. On the basis of equation B.14, the sign of the first component of this eigenvector is different from that of the others, and the mean weight for the input with correlation c_1 grows large while the others converge to tiny values. This is the likely consequence as long as the initial condition is in the appropriate basin of attraction, which is presumably larger than those for the other presynaptic neurons encoding the inputs with correlations c_2, \ldots, c_m. A thorough analysis requires better knowledge of the basins of attraction and the positions of the fixed points, which appears difficult to obtain. Finally, we examine the weight dynamics in the presence of feedback coupling. For simplicity, we ignore axonal and synaptic delays and set \tau = 0 and g(t) = \delta(t) for cortical neurons. Then, using equation 3.10, the drift term for the ith cortical neuron becomes

A(w_{i,a}) = A_1 w_{i,a} + A_2 c_a \langle w_{i,a} \rangle - A_3 (\langle w_{i,a} \rangle + \langle w_{i,b} \rangle) + \sum_{i'=1,\, i' \neq i}^{n_2} \bar{\epsilon}_{i,i'}\, A(w_{i',a}),   (3.33)
A(w_{i,b}) = A_1 w_{i,b} + A_2 c_b \langle w_{i,b} \rangle - A_3 (\langle w_{i,a} \rangle + \langle w_{i,b} \rangle) + \sum_{i'=1,\, i' \neq i}^{n_2} \bar{\epsilon}_{i,i'}\, A(w_{i',b}),   (3.34)

where

\bar{\epsilon}_{i,i'} = \frac{\epsilon_{i,i'}}{\nu_{i'}^{\rm post}(\cdot)}.   (3.35)
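One can illustrate numerically how feedback coupling reorganizes the leading eigenmode of this coupled linear system. The sketch below is our simplification, not the paper's computation: it treats the ensemble means as the values themselves, uses c_a = c_b = c with hypothetical constants A1, A2, A3, eps, and solves equations 3.33-3.34 as the implicit linear system A = (I - C \otimes I_2)^{-1} (I \otimes J_0) w, with J_0 the per-neuron drift Jacobian and C = eps (ones - I).

```python
import numpy as np

# Hypothetical illustrative constants.
A1, A2, A3, c, eps, n2 = 0.2, 1.0, 0.5, 0.3, 0.01, 5

J0 = np.array([[A1 + c*A2 - A3, -A3],
               [-A3, A1 + c*A2 - A3]])          # per-neuron (w_a, w_b) block
C = eps * (np.ones((n2, n2)) - np.eye(n2))      # all-to-all feedback coupling
M = np.linalg.inv(np.eye(2*n2) - np.kron(C, np.eye(2))) @ np.kron(np.eye(n2), J0)

# M is symmetric (the two Kronecker factors commute), so eigh applies.
vals, vecs = np.linalg.eigh(M)
v = vecs[:, np.argmax(vals)]
v = v / v[0]                                    # normalize first component to 1
```

The leading eigenvector comes out proportional to (1, -1, 1, -1, ..., 1, -1): all neurons codevelop the same (w_a up, w_b down) pattern, with the growth rate amplified by the factor 1/(1 - eps (n_2 - 1)).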
Since feedback coupling makes the synaptic weights of different neurons correlated in general, we have to consider the dynamics in the 2n_2-dimensional space of (w_{1,a}, w_{1,b}, w_{2,a}, w_{2,b}, \ldots, w_{n_2,a}, w_{n_2,b}). As shown in appendix C, the eigenvector corresponding to the largest eigenvalue in this space is calculated to be (1, -1, 1, -1, \ldots, 1, -1). This indicates that the weights of the feedforward synapses of different postsynaptic neurons codevelop to represent a single target input. This result is in remarkable contrast with the consequence in the absence of feedback coupling, in which the n_2 leading eigenmodes are degenerate. They are represented as (1, -1, 0, 0, \ldots, 0, 0)^t, (0, 0, 1, -1, 0, 0, \ldots, 0, 0)^t, \ldots, (0, 0, \ldots, 0, 0, 1, -1)^t, and the feedforward synaptic weights of each cortical neuron are allowed to evolve independently. Because of the codevelopment, the analysis of a single cortical neuron in section 3.2 is sufficient for understanding the network behavior.

4 Simulation Results

In this section, we return to the starting point to ask what kinds of coding schemes can operate in the feedforward networks shown in Figure 2 when multiple inputs are presented to the sensory layer. We are particularly interested in the coexistence of multiple clusters guided by multiple input sources, which is a key to solving binding problems and the superposition catastrophe (von der Malsburg & Schneider, 1986; Fujii et al., 1996; Salinas & Sejnowski, 2001). We consider a simple case of two temporal waveforms, each covering a part of the sensory layer in a complementary way:

\nu_i(t) = \begin{cases} 10.0 + 5.0\, \nu_a(t), & 1 \le i \le \bar{n}_1, \\ 10.0 + 5.0\, \nu_b(t), & \bar{n}_1 + 1 \le i \le n_1. \end{cases}   (4.1)
A cortical neuron is assumed to receive synapses from n'_1 sensory neurons according to the following rule. When a cortical neuron is connected to more (or fewer) presynaptic neurons in assembly \{1, 2, \ldots, \bar{n}_1\} than in assembly \{\bar{n}_1 + 1, \ldots, n_1\}, this cortical neuron will come to prefer \nu_a(t) (or \nu_b(t)) as a result of learning. If the wiring structure forces the majority of cortical neurons to select the same assembly of sensory neurons, the cortical layer obviously encodes just one of the inputs. Here we consider nontrivial situations in which
each cortical neuron receives a synapse from a sensory neuron in one of the two assemblies with equal probability. Accordingly, a cortical neuron is connected to n'_1/2 sensory neurons in each assembly on average. Then, in the absence of the recurrent connections, the feedforward synaptic weights of the cortical neurons are subject to the dynamics discussed in section 3. The STDP learning makes each cortical neuron encode \nu_a(t) or \nu_b(t), depending on the initial wiring and the stochastic dynamics. The cortical neurons encoding the same input are expected to form clusters with correlated firing. This kind of self-organizing behavior has been discussed in feedback networks (Horn et al., 2000; Levy et al., 2001), and we examine it for feedforward networks in relation to emergent coding schemes. Relevant to the dynamical states is the shared connectivity between the ith and jth cortical neurons, denoted by r_{i,j} (Shadlen & Newsome, 1998; Masuda & Aihara, 2003). This quantity measures how much input a pair of downstream neurons share. With a large value of r_{i,j}, two cortical neurons receive correlated inputs from the sensory layer, which promote synchrony and limit the performance of the cortical layer as a population rate coder (Masuda & Aihara, 2003). For networks with fixed structure, r_{i,j} is calculated as the number of shared inputs divided by the total number of inputs. Since the synaptic weights are inhomogeneous and dynamically changing, we define r_{i,j} as follows:

r_{i,j} = \sum_{k=1}^{n_1} \min(w_{i,k}, w_{j,k}) \Bigg/ \frac{1}{2} \sum_{k=1}^{n_1} (w_{i,k} + w_{j,k}),   (4.2)
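Equation 4.2 is straightforward to compute from two weight vectors. The following sketch checks it on a toy binary-weight configuration (the numbers are illustrative, not taken from the simulations): two neurons with 50 surviving synapses each, 25 of which target shared sensory neurons.

```python
import numpy as np

def shared_connectivity(w_i, w_j):
    """Weighted shared connectivity r_ij of equation 4.2 for two
    feedforward weight vectors of equal length n1."""
    return np.minimum(w_i, w_j).sum() / (0.5 * (w_i + w_j).sum())

n1 = 240
w_i = np.zeros(n1); w_i[0:50] = 1.0      # synapses on sensory neurons 0..49
w_j = np.zeros(n1); w_j[25:75] = 1.0     # overlap on sensory neurons 25..49
r = shared_connectivity(w_i, w_j)        # 25 shared / 50 average
```

Identical weight vectors give r = 1, disjoint ones give r = 0, matching the limits discussed in the text.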
where \min(x, y) is the smaller of x and y. We start with w_{i,j} = w_{\max}/2 for all the existing synapses. If both the ith and jth cortical neurons purely encode \nu_a(t) after learning, r_{i,j} = n'_1/(2\bar{n}_1), whereas r_{i,j} = n'_1/(2(n_1 - \bar{n}_1)) if both cortical neurons encode \nu_b(t). If they encode different inputs, r_{i,j} ideally converges to 0. Inhomogeneous shared connectivity allows the network to encode the two inputs in different manners (section 4.2). We also vary the feedback coupling strength \epsilon_{i,j}, which can switch network states (Tsodyks et al., 2000; Masuda & Aihara, 2003). The input rates for 5.0k ms \le t < 5.0(k+1) ms are defined by

\nu_a(t) = \nu_{k,a} = 0.7\, \nu_{k-1,a} + \xi_{k,a},   (4.3)
\nu_b(t) = \nu_{k,b} = 0.7\, \nu_{k-1,b} + \xi_{k,b},   (4.4)

where \nu_{-1,a} = \nu_{-1,b} \equiv 0, and \xi_{k,a} and \xi_{k,b} are independently drawn from the gaussian distribution with mean 0 and standard deviation 1. The inputs can be approximated by the Ornstein-Uhlenbeck process with the correlation time equal to \tau_c = 14 ms, which has been used in section 3. Furthermore, we set n_2 = 50, n'_1 = 100, and \tau = 0.3 ms.
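The correspondence between the AR(1) updates and \tau_c = 14 ms follows from the lag-one autocorrelation: 0.7 per 5 ms step gives \tau_c = -5/\ln 0.7 \approx 14 ms. The sketch below checks this, both in closed form and on a simulated sequence (seed and sequence length are arbitrary choices).

```python
import math
import numpy as np

# Equivalent Ornstein-Uhlenbeck correlation time of equations 4.3-4.4.
tau_c = -5.0 / math.log(0.7)        # about 14.02 ms

# Simulated check of the lag-one autocorrelation.
rng = np.random.default_rng(0)
nu = np.zeros(200000)
for k in range(1, nu.size):
    nu[k] = 0.7 * nu[k - 1] + rng.standard_normal()
lag1 = np.corrcoef(nu[:-1], nu[1:])[0, 1]
```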
Figure 5: Comparison of theoretical (smooth lines) and numerical (steps) weight distributions after learning, for a single postsynaptic neuron. (A, B) The weight distributions for the synapses corresponding to \nu_a(t) and \nu_b(t), respectively.
To verify the validity of the analytical methods developed in section 3, we first compare the theoretical and numerical weight distributions for a single postsynaptic neuron in the nonleaky case (\gamma = 0). The two weight distributions corresponding to the presynaptic neurons encoding \nu_a(t) or \nu_b(t) are calculated analytically and numerically for the parameter values described above, with A_+ = 0.00375, A_- = 0.004, and w_{\max} = 0.03. Figure 5 shows the agreement between the theoretical and numerical results. However, the analytical methods cannot directly predict the synaptic weight distributions obtained in the simulations below. Membrane potential leak, which we ignored in the analytical part, actually generates pseudocorrelation between the two input sources. A cortical neuron is more likely to fire when large instantaneous inputs, probably simultaneous large inputs from both input sources, are received. This pseudocorrelation decreases A_3; thus, the numerically obtained distributions are better approximated by the analytical results when the A_+ used in the analysis is larger than the value actually used in the simulations. Therefore, we set A_+ = 0.00333 rather than A_+ = 0.00375 in the following simulations. However, the qualitative features of the learning dynamics are preserved even after this modification. We also set \gamma = 0.05 ms^{-1} and w_{\max} = 0.07. The firing rates are measured by counting the number of spikes with a bin width of 5 ms. In particular, we stipulate that the relevant population firing rates are those based on the n_2/5 neurons that most prefer \nu_a(t) or \nu_b(t). The performance of rate coding by a single neuron or by a population is evaluated by the correlation coefficient between the time-dependent firing rates and \nu_a(t) or \nu_b(t) for the last 800 ms of a simulation run (Masuda & Aihara, 2002b, 2003). The degree of synchrony is measured with the correlation
coefficient between the spike counts of each pair of cortical neurons, with the use of the same bin width.

4.1 Even Shared Connectivities from Two Sources. We first consider the case in which the shared connectivities tied with the two assemblies in the sensory layer are the same: \bar{n}_1 = n_1/2. When feedback coupling is absent, each cortical neuron learns to encode \nu_a(t) or \nu_b(t) with equal probability (see Figure 4B). The simulation results with n_1 = 240 and \epsilon_{i,j} = 0 are shown in Figure 6. In Figure 6, the cortical neurons are sorted so that neurons with smaller indices prefer \nu_a(t). As shown in Figures 6A and 6E, single cortical neurons with small (or large) indices can encode \nu_a(t) (or \nu_b(t)) by rate coding, with only the synapses from the sensory neurons associated with \nu_a(t) (or \nu_b(t)) surviving. Figure 6A shows that the performance of encoding \nu_a(t) (or \nu_b(t)) by firing rates, which is measured by the correlation coefficients introduced above, is higher for population firing rates based on the first (or last) n_2/5 neurons than for single-neuronal firing rates. The cortical neurons are divided into two loose clusters identified by the relatively high degree of synchrony between the neurons encoding the same inputs, as shown in Figure 6B. Despite the weak synchrony, population rate coding is successful in encoding the inputs with high resolution. Figures 6C and 6D, which compare the shared connectivity before and after learning, show that it is self-organized toward a structured state through learning. In other words, the learning causes the cortical layer to specialize in two population rate codes. Let us denote this state by R_aR_b, which means that the cortical layer represents \nu_a(t) and \nu_b(t) by population rate codes (R) with two segregated subpopulations. Next, we introduce feedback coupling by setting \epsilon_{i,j} = 0.006 uniformly. Figure 7B shows that global synchrony is induced by the feedback coupling.
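The rate-coding performance measure used in these comparisons can be sketched as follows. The Poisson spike model, the 4000 ms duration, and the seed below are illustrative stand-ins for the full network simulation, not the paper's setup; the point is only that pooling neurons raises the correlation between binned rates and the input.

```python
import numpy as np

rng = np.random.default_rng(1)
dt_ms, T_ms = 5.0, 4000.0
nbins = int(T_ms / dt_ms)

# AR(1) input signal as in equations 4.3-4.4, one value per 5 ms bin.
s = np.zeros(nbins)
for k in range(1, nbins):
    s[k] = 0.7 * s[k - 1] + rng.standard_normal()

# 50 neurons firing at (10 + 5 s(t)) Hz; Poisson spike counts per bin.
rate_hz = np.clip(10.0 + 5.0 * s, 0.0, None)
counts = rng.poisson(rate_hz * dt_ms / 1000.0, size=(50, nbins))

single = float(np.mean([np.corrcoef(c, s)[0, 1] for c in counts]))
pooled = float(np.corrcoef(counts.sum(axis=0), s)[0, 1])
```

The pooled population rate tracks the input much more faithfully than a typical single neuron, which is the pattern visible in Figure 6A.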
However, the population rate coding performance with feedback coupling (Figure 7A) is much lower than that without feedback coupling (Figure 6A). The cortical layer operates in a synchronous code mode, which is robust and efficient in terms of power consumption at the cost of coding accuracy (Masuda & Aihara, 2002b, 2003; van Rossum et al., 2002). In addition, the information on \nu_a(t) and \nu_b(t) is mixed because of the global coupling for most initial conditions. As the analytical results in section 3.3 predict, either \nu_a(t) or \nu_b(t), chosen randomly, is encoded by the whole cortical layer in the end. This filtering mechanism works not only at the single-neuron level but also at the network level, and it may be closely related to the formation of columnar structure and dynamical neuronal modules through specialization (Song & Abbott, 2001). This state is denoted by S_aS_a, where S denotes synchrony. In this state, the cortical layer, which would consist of two clusters without feedback coupling, encodes \nu_a(t) by globally synchronous firing. The existence of simultaneously active clusters identified by synchronous or correlated firing is a key to illuminating the binding problem and the
Figure 6: Simulation results when the shared connectivities for the two inputs are equal. We set n_1 = 240, n_2 = 50, n'_1 = 100, \bar{n}_1 = 120, and \epsilon_{i,j} = 0 for all i, j. (A) Performance of single-neuron rate coding for \nu_a(t) is indicated by squares, and that for \nu_b(t) is indicated by circles. Performance of population rate coding for \nu_a(t) is indicated by solid lines, and that for \nu_b(t) is indicated by dotted lines. Performance is measured by the correlation coefficient between the firing rates and the external inputs, and population firing rates are calculated by taking the
Figure 7: Simulation results for two equal shared connectivities with constant feedback coupling. We set n_1 = 240, n_2 = 50, n'_1 = 100, \bar{n}_1 = 120, and \epsilon_{i,j} = 0.006. (A) Performance of single-neuron rate coding and population rate coding measured by the correlation coefficient (see the caption of Figure 6 for details). (B) Degree of synchrony between cortical neurons. (C) Shared connectivity between cortical neurons.
Figure 6 (cont.): average firing rates of the n_2/5 = 10 neurons that are most engaged in encoding \nu_a(t) or \nu_b(t). (B) Degree of synchrony between pairs of cortical neurons. The horizontal and vertical axes correspond to the indices of cortical neurons. Brighter points correspond to larger values of the degree of synchrony. For clarity, a black point corresponds to a value less than 0.2, and a white point corresponds to a value more than 0.5. (C) Shared connectivity between pairs of cortical neurons before learning and (D) after learning. A black (or white) point corresponds to a value less than 0.1 (or more than 0.8). (E) The synaptic weights from the sensory neurons to the first and the n_2th cortical neurons after learning.
superposition catastrophe, as discussed in the context of the synfire chain (Abeles, 1991; Diesmann et al., 1999; Horn et al., 2000), dynamical cell assemblies (Aertsen et al., 1989, 1994; Vaadia et al., 1995; Fujii et al., 1996), object segregation (von der Malsburg & Schneider, 1986; Sompolinsky et al., 1990), and memory retrieval (Levy et al., 2001; Gerstner & Kistler, 2002b). It is possible for the network to encode \nu_a(t) and \nu_b(t) by two exclusively synchronous clusters with uniform coupling as described above, but only for a small range of parameter values. A more robust mechanism is to apply a learning rule in which \epsilon_{i,j} is increased only when the ith and jth cortical neurons fire almost synchronously and is decreased otherwise (Nishiyama et al., 2000; Abbott & Nelson, 2000). Here we use a simple learning window (see Figure 3B), which is defined as follows:

G(t) = \begin{cases} 0.0009\, \epsilon_{\max}, & 0 \le |t| \le 4\ {\rm ms}, \\ -0.0006\, \epsilon_{\max}, & 8\ {\rm ms} \le |t| \le 23\ {\rm ms}, \\ 0, & \text{otherwise}, \end{cases}   (4.5)
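A quick property of this window worth checking is that its total integral is negative, so that it is depression-dominated overall and the couplings stay bounded. The sketch below implements equation 4.5 (setting G to zero between and outside the two bands is our reading of Figure 3B) and integrates it numerically.

```python
import numpy as np

eps_max = 0.006

def G(t_ms):
    """Symmetric learning window of equation 4.5."""
    a = abs(t_ms)
    if a <= 4.0:
        return 0.0009 * eps_max
    if 8.0 <= a <= 23.0:
        return -0.0006 * eps_max
    return 0.0

step = 0.001
ts = np.arange(-30.0, 30.0, step)
total = sum(G(t) for t in ts) * step
exact = eps_max * (0.0009 * 8.0 - 0.0006 * 30.0)   # potentiating band minus depressing band
```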
where \epsilon_{\max} = 0.006. We run the learning procedure with the initial conditions \epsilon_{i,j} = 0 for all i, j and the restriction 0 \le \epsilon_{i,j} \le \epsilon_{\max}. Figure 8B, in comparison with Figure 7B, shows that two synchronous clusters coexist. Figure 8D shows that feedback coupling eventually formed, separating the two clusters and enhancing intracluster synchrony. This coding scheme is denoted by S_aS_b. However, it is difficult for these synchronous clusters to emerge with a standard learning window represented by equation 2.1 or Figure 3A. Actually, in recurrent networks or within a layer, synaptic potentiation in one direction accompanies synaptic depression in the other direction. The final coupling strengths for the learning window of equation 2.1, with the learning rate slowed by the ratio 0.0009/0.004, are shown in Figure 8E. No synaptic structure is formed with the asymmetric learning window, leaving the figure uniformly black. Similar to the uncoupled situation, the coding scheme for this situation is R_aR_b. The numerical results are qualitatively the same for various ratios of the two learning rates.

Figure 8: Simulation results for two equal shared connectivities with adaptive feedback coupling for symmetric (A, B, C, D) and asymmetric (E) learning windows. We set n_1 = 240, n_2 = 50, n'_1 = 100, \bar{n}_1 = 120, and \epsilon_{\max} = 0.006. (A) Performance of single-neuron rate coding and population rate coding (see the caption of Figure 6 for details). (B) Degree of synchrony. (C) Shared connectivity. (D) Feedback coupling strength after learning with the symmetric learning window of the form of equation 4.5. A brighter point indicates a stronger coupling weight, and a black (or white) point corresponds to 0 (or \epsilon_{\max}). (E) Feedback coupling strength after learning with the asymmetric learning window of the form of equation 2.1.

4.2 Uneven Shared Connectivities from Two Sources. To investigate inhomogeneous information coding with respect to the two external inputs, we impose different shared connectivities regarding \nu_a(t) and \nu_b(t) by setting n_1 = 315 and \bar{n}_1 = 65. By assumption, a cortical neuron still receives incident spikes from each assembly of sensory neurons with probability 1/2, and the situation does not change for an isolated cortical neuron. However, population firing patterns may differ statistically between the two emergent clusters of cortical neurons because of the uneven shared connectivities. The shared connectivity n'_1/(2\bar{n}_1) = 50/65 of the cortical neurons encoding \nu_a(t) is higher than the shared connectivity n'_1/(2(n_1 - \bar{n}_1)) = 50/250 of the cortical neurons encoding \nu_b(t). The former subpopulation is more apt to synchronize (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000, 2001; Masuda & Aihara, 2003). Although the supposed shared connectivity values are somewhat higher than experimentally observed ones (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000), the higher shared connectivity could occur in networks with dense local coupling. Without feedback coupling, the cortical neurons encoding \nu_a(t) are actually more synchronized (see Figure 9B) and are accompanied by a higher intracluster shared connectivity (see Figure 9C) than are the cortical neurons of the other subpopulation. However, as shown in Figure 9A, the neurons encoding \nu_b(t) can be pooled to realize more accurate rate coding. Two equally synchronous inputs are processed in different ways, and this coding mode is symbolized as S_aR_b. In this way, different codes may be used simultaneously by a network or a specific part of the brain, with its neuronal resources segregated for different modes.
Figure 9: Simulation results for two different shared connectivities without feedback coupling. We set n_1 = 315, n_2 = 50, n'_1 = 100, \bar{n}_1 = 65, and \epsilon_{i,j} = 0. (A) Performance of single-neuron rate coding and population rate coding (see the caption of Figure 6 for details). (B) Degree of synchrony. (C) Shared connectivity.
Figure 10: Simulation results for two different shared connectivities with feedback coupling. We set n_1 = 315, n_2 = 50, n'_1 = 100, \bar{n}_1 = 65, and \epsilon_{i,j} = 0.006. (A) Performance of single-neuron rate coding and population rate coding (see the caption of Figure 6 for details). (B) Degree of synchrony. (C) Shared connectivity.
Figure 10 shows the numerical results when the feedback coupling strength is uniformly equal to \epsilon_{i,j} = 0.006. In this situation, the cluster that was weak in terms of synchrony is absorbed into the strong cluster, so that the whole network encodes \nu_a(t) by synchrony (S_aS_a). This is because a stronger cluster generally sends more correlated feedback inputs to a weaker cluster than the other way around. The neurons receiving more correlated inputs are then likely to become synchronous, with firing times locked to the inputs (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000; Masuda & Aihara, 2003), which in this case are the firing times of the stronger cluster.

5 Discussion

5.1 Various Types of Dual Coding. The results in section 4 show that multiple external inputs can be dealt with in inhomogeneous ways even in
a single network. We have found the rate code mode (R_aR_b), the clustered synchronous code mode (S_aS_b), the global synchronous code mode (S_aS_a), and the mixed code mode (S_aR_b). The analytical results in section 3 suggest that these modes can be extended to the case of a larger number of subpopulations in a layer. These modes are generalizations of our previous work on dual coding (Masuda & Aihara, 2003) to the inhomogeneous case. At the same time, these modes are code-mode generalizations of the synfire chain and of coding schemes based on synchronous clusters (von der Malsburg & Schneider, 1986; Abeles, 1991; Diesmann et al., 1999). In particular, we have shown that the combined effects of input structure, coupling structure, and synaptic learning lead to various network states, including clustered firing. Similar to associative memory models, these complex network states, reflecting inputs and network structure, are probably related to intricate information processing in the brain. Our results suggest that rate codes and temporal codes may be used in the brain dually in terms of time and of neuronal populations. For example, even in synfire chains, the neurons in the background may encode other information with population firing rates. Furthermore, the totally synchronous mode, the asynchronous mode, or other modes may alternate in time. This view is partially supported by experimental evidence (Gray et al., 1989; de Oliveira et al., 1997). The notion of dual coding has been used explicitly and implicitly by many authors. Although its definition may vary somewhat, dual coding generally refers to a mechanism by which the brain uses multiple codes. With our results in mind, let us review the different kinds of proposed dual coding:

• Dual coding by time sharing: Multiple codes, such as the rate code and the synchronous code, can emerge in different situations in the same network.
In other words, time sharing triggered by external stimuli or by switching of cognitive states enables the codes to alternate. At the network level, the theoretical results in this work and others (Aertsen et al., 1994; Tsodyks et al., 2000; Araki & Aihara, 2001; van Rossum et al., 2002; Masuda & Aihara, 2002b, 2003), together with experimental evidence (de Oliveira et al., 1997; Riehle et al., 1997), support this idea. In our notation, this mode corresponds to the situation in which various coding schemes such as R_a R_b, S_a S_b, and S_a S_a appear in turn. At the single-neuron level, the input frequency and the neurotransmitter release probability let the coding scheme of a single neuron switch between temporal coding based on coincidence detection and rate coding (Tsodyks & Markram, 1997).

• Dual coding by a multifunctional encoder: Different properties of neuronal states or spike patterns of a single neuron or a single population may simultaneously encode multiple entities. For example, it has been proposed that a cell assembly encodes the meaning, whereas the firing rates of the constituent neurons encode its significance or intensity
Self-Organizing Dual Coding
653
of inputs (Sompolinsky et al., 1990), the degree of participation of a single neuron in the assembly (Krüger & Becker, 1991), or the types of discriminative stimuli (Sakurai, 1999). Membership of a neuron in an assembly makes sense only when neurons are grouped by, for example, synchrony or correlated firing. In a sense, this dual coding is equivalent to the synchronous mode S in our notation, with its coarse firing rate carrying the quantitative information. In any case, this coding scheme can be understood as the superimposition of S_a and R_a. Even at a single-neuron level, experiments show that information about time is encoded in the spike timing, whereas information about object identity is encoded in the spike count (Berry, Warland, & Meister, 1997). Experimental results indicative of dual coding by time sharing (de Oliveira et al., 1997; Riehle et al., 1997) can also be interpreted as examples of this type of dual coding; synchrony possibly represents internal events, whereas firing rates reflect external inputs.

• Dual coding by different use of subpopulations: Subpopulations of a neural network can be allocated to multiple information sources, as examined in section 4. Model studies suggest that neural networks can be divided into clusters, each encoding an input feature or an internal state by firing rates or synchrony (for example, R_a R_b, S_a S_b, and S_a R_b) (von der Malsburg & Schneider, 1986; Sompolinsky et al., 1990; Abeles, 1991; Fujii et al., 1996). Interaction of the clusters typically results in filtering out of relatively asynchronous clusters, leading to, for example, S_a S_a. By definition, a single neuron cannot operate in this mode. Moreover, this coding scheme is different from that described by the experimental literature cited above.

Real neural codes may be a mixture of the theoretically proposed schemes stated above, enabling proper assignment of neuronal space and time for high performance and capacity.
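The distinction between a rate-coded and a synchrony-coded subpopulation can be illustrated with a small toy simulation. This sketch is not from the paper: the population sizes, rates, and the Fano-factor-style synchrony index below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_bins = 50, 2000  # illustrative sizes, not the paper's parameters

# Rate-coded subpopulation: independent firing, ~0.18 spikes/bin/neuron.
rate_pop = rng.random((n_neurons, n_bins)) < 0.18

# Synchrony-coded subpopulation with the same mean rate: neurons fire
# together on common "events" occurring in 20% of the bins.
events = rng.random(n_bins) < 0.2
sync_pop = (rng.random((n_neurons, n_bins)) < 0.9) & events

def fano(pop):
    """Fano factor of the summed population activity per bin:
    close to 1 for independent firing, large for synchronous firing."""
    counts = pop.sum(axis=0)
    return counts.var() / counts.mean()

print(f"rate-coded Fano: {fano(rate_pop):.2f}, "
      f"synchrony-coded Fano: {fano(sync_pop):.2f}")
```

With matched mean firing rates, the summed activity of the synchronous subpopulation is far more variable across bins, which is one way the S and R modes can coexist in a single layer and still be read out separately.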
The coding scheme can also change as signals pass through the layers; the transition from R to S is generally easy, whereas the transition from S to R is generally difficult (Song & Abbott, 2001; Masuda & Aihara, 2003). In this work, the sensory layer serves as the rate coder R_a R_b because of the stochasticity of firing. We have observed that R_a R_b is passed on to the cortical layer without transformation, or with transformation into S_a R_b, S_a S_a, or S_a S_b, depending on the shared connectivity and the structure of the feedback coupling. We note that density approaches using the Fokker-Planck equations have been used for analyzing the dynamics of firing rates and membrane potentials. As a result, the conditions for synchrony or asynchrony in recurrent networks have been established for simple input schemes such as constant, white gaussian, or step inputs (Brunel, 2000; Gerstner, 2000; Gerstner & Kistler, 2002b). Although it may be interesting to apply these methods to the analysis of dual coding, such analysis requires extensive calculations in the case of spatiotemporal inputs and is therefore a subject of future work.
5.2 Functional Roles of Spike-Time-Dependent Plasticity Revisited. STDP has led to the emergence of various modes via unsupervised learning and the associated synaptic competition. The selection of synapses and self-organization through STDP are not new (see, e.g., Kempter et al., 1999; Kistler & van Hemmen, 2000; Song et al., 2000; van Rossum et al., 2000; Song & Abbott, 2001; Câteau & Fukai, 2003). In this article, we have discussed how learning influences the coding schemes in relation to synchrony, clustered states, and asynchrony. The mechanisms that cause synchrony, such as high shared connectivity and strong feedback coupling, should be contrasted with STDP and its functional roles (Gerstner & Kistler, 2002b; see also section 1). STDP learning enables neurons to respond fast (e.g., firing for low thresholds, Kistler & van Hemmen, 2000; latency reduction, Song et al., 2000), to be precise (e.g., coincidence detection, Kistler & van Hemmen, 2000; the associated temporal coding in barn owl auditory systems, Gerstner et al., 1996), and to be relational or sequential (e.g., difference learning, Rao & Sejnowski, 2001; alternately firing clusters, Horn et al., 2000, and Levy et al., 2001). However, STDP does not necessarily make neurons synchronous, because an accidental near-synchronous firing of two neurons potentiates the synaptic weight in one direction while it depresses the synapse in the opposite direction, if that synapse exists. In more or less synchronous situations, jitter may cause the next near-synchronous firing event to occur with the converse order of firing. After the transient, the coupling strengths between these two neurons are likely to converge to the minimum in both directions, as verified in Figure 8E. In other words, a dynamical state in which the synapses are bidirectionally strong is unstable. Although rough synchrony can emerge via unidirectionally potentiated synapses, the order of firing becomes fixed within the synchronous cluster.
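The instability of bidirectionally strong synapses can be sketched numerically. The rule below is an illustrative assumption, not the paper's exact equation 2.1: an additive exponential window with τ_0 = 20 ms and slightly dominant depression, applied to a reciprocally connected pair whose firing order flips with jitter.

```python
import numpy as np

rng = np.random.default_rng(0)
tau0 = 20.0                      # ms, characteristic STDP timescale
a_plus, a_minus = 0.010, 0.015   # potentiation/depression amplitudes (assumed)
w_ab = w_ba = 0.5                # reciprocal weights between neurons a and b

for _ in range(500):
    # jittered near-synchronous firing: dt = t_b - t_a, sign flips randomly
    dt = rng.normal(0.0, 2.0)
    if dt > 0:                   # a fires first: a->b potentiated, b->a depressed
        w_ab += a_plus * np.exp(-dt / tau0)
        w_ba -= a_minus * np.exp(-dt / tau0)
    else:                        # b fires first: the converse updates
        w_ba += a_plus * np.exp(dt / tau0)
        w_ab -= a_minus * np.exp(dt / tau0)
    w_ab = min(max(w_ab, 0.0), 1.0)  # confine weights to [0, 1]
    w_ba = min(max(w_ba, 0.0), 1.0)

print(w_ab, w_ba)  # both weights decay from their initial value
```

Because jitter flips the order of firing between events, each weight is potentiated roughly half the time and depressed the other half; with depression slightly dominant, both reciprocal weights drift toward the minimum, consistent with the instability described above.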
This order implies sequential relations beyond mere synchrony. Of course, this kind of rough synchrony may be used for practical functions such as column formation, without taking advantage of the possibly meaningful order of firing (Song & Abbott, 2001). Synchrony with a precision of 1 ms is widely found in experiments (Gray et al., 1989; de Oliveira et al., 1997), and its functional role has been discussed extensively (von der Malsburg & Schneider, 1986; Abeles, 1991; Diesmann et al., 1999; Tsodyks et al., 2000). Synchronization may be caused not only by common inputs but also by lateral coupling (Gray et al., 1989; Tsodyks et al., 2000). However, it seems difficult for the asymmetric STDP rule represented by equation 2.1, which has the characteristic timescale of τ_0 = 20 ms, to provide a mechanism for synchronization. Therefore, we have introduced the symmetric learning window for the feedback coupling, represented by equation 4.5. The symmetric learning window is supported by experimental evidence for excitatory (Nishiyama et al., 2000) and inhibitory (Abbott & Nelson, 2000) neurons. Other mechanisms that lead to partially synchronous firing without full synchrony include enhancing synaptic weights for a short time in response to simultaneous increases in
the firing rates of two neurons (Sompolinsky et al., 1990) and the use of global inhibitors that prohibit simultaneous activation of multiple clusters (von der Malsburg & Schneider, 1986; Watanabe, Aihara, & Kondo, 1998). It is a future problem to include inhibitory synapses and to consider the cooperation of various types of synaptic plasticity.

Appendix A

Here we derive the dynamics of the averaged weights ⟨w_a⟩ and ⟨w_b⟩ under synaptic competition. Using equations 3.1 and 3.2, we have

$$
\begin{aligned}
\langle \dot{w}_a \rangle
&= \frac{\partial}{\partial t}\int_0^{w_{\max}} dw\, w\, p_a(w,t) \\
&= -\int_0^{w_{\max}} dw\, w\, \frac{\partial}{\partial w}\bigl(A(w)p_a(w,t)\bigr)
 + \frac{1}{2}\int_0^{w_{\max}} dw\, w\, \frac{\partial^2}{\partial w^2}\bigl(B(w)p_a(w,t)\bigr) \\
&= \int_0^{w_{\max}} dw\, A(w)p_a(w,t)
 - \frac{1}{2}\int_0^{w_{\max}} dw\, \frac{\partial}{\partial w}\bigl(B(w)p_a(w,t)\bigr) \\
&= A_1\langle w_a\rangle + A_2 c_a \langle w_a\rangle + A_3\bigl(\langle w_a\rangle + \langle w_b\rangle\bigr) \\
&\quad + \frac{1}{2}\bigl(B(0, c_a, \langle w_a\rangle, \langle w_b\rangle)\,p_a(0,t)
 - B(w_{\max}, c_a, \langle w_a\rangle, \langle w_b\rangle)\,p_a(w_{\max},t)\bigr),
\end{aligned}
\tag{A.1}
$$
where we have used integration by parts. Consequently, the effective dynamics of ⟨w_a⟩ and ⟨w_b⟩ is written as follows:

$$
\begin{pmatrix} \langle \dot{w}_a \rangle \\ \langle \dot{w}_b \rangle \end{pmatrix}
=
\begin{pmatrix} A_1 + c_a A_2 - A_3 & -A_3 \\ -A_3 & A_1 + c_b A_2 - A_3 \end{pmatrix}
\begin{pmatrix} \langle w_a \rangle \\ \langle w_b \rangle \end{pmatrix}
+ \frac{1}{2}
\begin{pmatrix}
B(0, c_a, \langle w_a\rangle, \langle w_b\rangle)\,p_a(0,t) - B(w_{\max}, c_a, \langle w_a\rangle, \langle w_b\rangle)\,p_a(w_{\max},t) \\
B(0, c_b, \langle w_b\rangle, \langle w_a\rangle)\,p_b(0,t) - B(w_{\max}, c_b, \langle w_b\rangle, \langle w_a\rangle)\,p_b(w_{\max},t)
\end{pmatrix}.
\tag{A.2}
$$
The first term in equation A.2 describes drift-driven dynamics, which is dominant everywhere except near the boundaries. The correction in the second term results from the interplay of diffusion and the boundary conditions. The constraint that the weights are confined to [0, w_max] yields

$$
p_x(w,t) = \delta(w), \quad \text{if } \langle w_x \rangle = 0, \tag{A.3}
$$
$$
p_x(w,t) = \delta(w - w_{\max}), \quad \text{if } \langle w_x \rangle = w_{\max}, \tag{A.4}
$$
where x = a or b. From equations A.3 and A.4 and the fact that the diffusion term is always positive, it follows that

$$
\langle \dot{w}_x \rangle \big|_{\langle w_x \rangle = 0} > 0, \qquad
\langle \dot{w}_x \rangle \big|_{\langle w_x \rangle = w_{\max}} < 0. \tag{A.5}
$$
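The drift term of equation A.2 can be integrated numerically to illustrate the competition. The constants below are illustrative assumptions chosen so that A_1 + c_a A_2 > 0 and A_1 + c_a A_2 − 2A_3 < 0 (the competition conditions referred to in appendix B); hard clipping of the weights stands in for the boundary behavior expressed by equations A.3 through A.5.

```python
import numpy as np

# assumed constants, not fitted values from the paper
a1, a2, a3, c = 0.05, 1.0, 0.2, 0.3
w_max, dt = 1.0, 0.1
drift = np.array([[a1 + c * a2 - a3, -a3],
                  [-a3, a1 + c * a2 - a3]])      # drift matrix of equation A.2

w = np.array([0.55, 0.45])                       # slightly asymmetric start
for _ in range(2000):
    w = np.clip(w + dt * drift @ w, 0.0, w_max)  # Euler step, weights confined

print(w)  # winner-take-all: one weight at w_max, the other at 0
```

The sum mode (eigenvector (1, 1)) decays while the difference mode (eigenvector (1, −1)) grows, so the initially stronger weight wins and the dynamics stays confined to the square, as the boundary argument above requires.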
Accordingly, the dynamics is confined to the square [0, w_max] × [0, w_max]. The second term in equation A.2 changes dramatically only near the boundaries, where p_x(w,t) approaches a singular distribution. If we further suppose that the first (or second) component of this term is monotonically decreasing in ⟨w_a⟩ (or ⟨w_b⟩), the nullclines of equation A.2 look like the ones in Figure 4C.

Appendix B

To derive the conditions for the saddle point, let us assume c_a = c_b. Consequently, a possible fixed point (w*, w*), 0 ≤ w* ≤ w_max, on the straight line ⟨w_a⟩ = ⟨w_b⟩ satisfies

$$
(A_1 + c_a A_2 - 2A_3)\,w^* + \frac{1}{2}\bigl(B(0, c_a, w^*, w^*)\,p_a(0,t) - B(w_{\max}, c_a, w^*, w^*)\,p_a(w_{\max},t)\bigr) = 0. \tag{B.1}
$$
Equation B.1 actually has a solution because equations A.3 and A.4 guarantee that the left-hand side of equation B.1 is larger than 0 for w* = 0 and smaller than 0 for w* = w_max. The stability of the fixed point is determined by the eigenvalues λ_1 and λ_2, which are the solutions of the following equation:

$$
\begin{vmatrix}
A_1 + c_a A_2 - A_3 + \dfrac{\partial B(w^*, w^*)}{\partial \langle w_a \rangle} - \lambda &
-A_3 + \dfrac{\partial B(w^*, w^*)}{\partial \langle w_b \rangle} \\[2mm]
-A_3 + \dfrac{\partial B(w^*, w^*)}{\partial \langle w_b \rangle} &
A_1 + c_a A_2 - A_3 + \dfrac{\partial B(w^*, w^*)}{\partial \langle w_a \rangle} - \lambda
\end{vmatrix}
= 0, \tag{B.2}
$$

where

$$
B(\langle w_a \rangle, \langle w_b \rangle) = \frac{1}{2}\bigl(B(0, c_a, \langle w_a \rangle, \langle w_b \rangle)\,p_a(0,t) - B(w_{\max}, c_a, \langle w_a \rangle, \langle w_b \rangle)\,p_a(w_{\max},t)\bigr). \tag{B.3}
$$
The saddle point (w*, w*) must satisfy the following inequality:

$$
\begin{aligned}
\lambda_1 \lambda_2
&= \left(A_1 + c_a A_2 - A_3 + \frac{\partial B(w^*, w^*)}{\partial \langle w_a \rangle}\right)^2
 - \left(-A_3 + \frac{\partial B(w^*, w^*)}{\partial \langle w_b \rangle}\right)^2 \\
&= \left(A_1 + c_a A_2 + \frac{\partial B(w^*, w^*)}{\partial \langle w_a \rangle} - \frac{\partial B(w^*, w^*)}{\partial \langle w_b \rangle}\right)
 \cdot \left(A_1 + c_a A_2 - 2A_3 + \frac{\partial B(w^*, w^*)}{\partial \langle w_a \rangle} + \frac{\partial B(w^*, w^*)}{\partial \langle w_b \rangle}\right) \\
&< 0.
\end{aligned}
\tag{B.4}
$$
If the saddle point is not extremely close to the boundaries, the effect of the boundary conditions does not change dramatically near the saddle point. We ignore the derivatives of B for this reason, and equation B.4 is reduced to equations 3.22 and 3.23. When m equally synchronous inputs are applied, a nontrivial fixed point is obtained by solving the following set of equations:

$$
w^* = \langle w_a \rangle = \langle w_b \rangle = \cdots = \langle w_m \rangle, \tag{B.5}
$$
$$
(A_1 + c_a A_2 - mA_3)\,w^* + \frac{1}{2}\bigl(B(0, c_a, w^*, \ldots, w^*)\,p_a(0,t) - B(w_{\max}, c_a, w^*, \ldots, w^*)\,p_a(w_{\max},t)\bigr) = 0, \tag{B.6}
$$
which has an interior solution for reasons similar to those for m = 2. If we ignore the derivative of the last term of equation B.6, the eigenvalues λ_1, λ_2, ..., λ_m in equations 3.24 and 3.25 are derived by solving

$$
\begin{vmatrix}
A_1 + c_a A_2 - A_3 - \lambda & -A_3 & -A_3 & \cdots \\
-A_3 & A_1 + c_a A_2 - A_3 - \lambda & -A_3 & \cdots \\
-A_3 & -A_3 & A_1 + c_a A_2 - A_3 - \lambda & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{vmatrix}
= 0. \tag{B.7}
$$
We next consider the case in which all the input correlations are different, as represented in equation 3.31. For simplicity, the differences between the correlations are assumed to be small so that equation B.5 approximately holds. Then the characteristic equation similar to equation B.7 reads
as follows:

$$
\sum_{i=1}^{m} \frac{A_3}{A_1 + c_i A_2 - \lambda} = 1. \tag{B.8}
$$
To get insight into the ranges of the eigenvalues, let us put

$$
\Lambda(\lambda) \equiv \left( \sum_{i=1}^{m} \frac{1}{A_1 + c_i A_2 - \lambda} \right)^{-1}. \tag{B.9}
$$
Then the eigenvalues are the solutions of Λ(λ) = A_3. For Λ(λ), there exist x_1 < x_2 < ⋯ < x_{m−1} such that

$$
\Lambda(\lambda) \to \infty \quad \text{as } \lambda \to x_i + 0 \text{ or } -\infty, \tag{B.10}
$$
$$
\Lambda(\lambda) \to -\infty \quad \text{as } \lambda \to x_i - 0 \text{ or } \infty, \tag{B.11}
$$
$$
\frac{d\Lambda(\lambda)}{d\lambda} < 0, \quad \lambda \neq x_i, \; 1 \leq i \leq m, \tag{B.12}
$$
$$
\Lambda(A_1 + c_i A_2) = 0, \quad 1 \leq i \leq m. \tag{B.13}
$$
Therefore, for positive A_3, we obtain

$$
A_1 + c_1 A_2 > \lambda_1 > A_1 + c_2 A_2 > \lambda_2 > \cdots > A_1 + c_m A_2 > \lambda_m. \tag{B.14}
$$
The conditions for the synaptic competition are again generalized versions of equations 3.24 and 3.25, expressed as follows:

$$
A_1 + c_i A_2 > 0, \quad 1 \leq i \leq m - 1, \tag{B.15}
$$
$$
A_1 + c_m A_2 - mA_3 < 0. \tag{B.16}
$$

Combining equations B.14 and B.15 yields

$$
\lambda_1 > \lambda_2 > \cdots > \lambda_{m-1} > 0. \tag{B.17}
$$
With equation B.16, λ_m < 0 is assured by

$$
\Lambda(0) < \left( \frac{m}{A_1 + c_1 A_2} \right)^{-1} < A_3. \tag{B.18}
$$
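The interlacing in equation B.14 and the sign pattern in equation B.17 are easy to check numerically, since the matrix in equation B.7 (with distinct c_i) is a diagonal matrix minus A_3 times the all-ones matrix. The constants below are illustrative assumptions chosen to satisfy equations B.15 and B.16, not values from the paper.

```python
import numpy as np

a1, a2, a3 = 0.05, 1.0, 0.2            # assumed constants
c = np.array([0.9, 0.6, 0.3])          # distinct input correlations, c1 > c2 > c3
m = len(c)
d = a1 + c * a2                        # the poles A1 + c_i A2 of Lambda(lambda)

# characteristic matrix: diagonal A1 + c_i A2 - A3, off-diagonal -A3
M = np.diag(d) - a3 * np.ones((m, m))
lam = np.sort(np.linalg.eigvalsh(M))[::-1]  # eigenvalues, descending

# interlacing of equation B.14: d1 > lam1 > d2 > lam2 > d3 > lam3
print(d, lam)
```

For these values the competition conditions hold (each A_1 + c_i A_2 > 0 and A_1 + c_m A_2 − mA_3 < 0), and the computed spectrum shows m − 1 positive eigenvalues and one negative eigenvalue, as equations B.14 through B.17 predict.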
Appendix C

We analyze the dynamics of synaptic weights in the presence of feedback coupling. Let us assume for simplicity that ε_{i,j} = ε and that the network receives
two equally strong input sources, that is, c_a = c_b = c. Using equations 3.33 and 3.34, the effective deterministic dynamics is written as

$$
\begin{pmatrix}
\langle \dot{w}_{1,a} \rangle \\ \langle \dot{w}_{1,b} \rangle \\ \langle \dot{w}_{2,a} \rangle \\ \langle \dot{w}_{2,b} \rangle \\ \vdots \\ \langle \dot{w}_{n_2,b} \rangle
\end{pmatrix}
=
\begin{pmatrix}
A_1 + cA_2 - A_3 & -A_3 & 0 & 0 & 0 & \cdots \\
-A_3 & A_1 + cA_2 - A_3 & 0 & 0 & 0 & \cdots \\
0 & 0 & A_1 + cA_2 - A_3 & -A_3 & 0 & \cdots \\
0 & 0 & -A_3 & A_1 + cA_2 - A_3 & 0 & \cdots \\
\vdots & & & & \ddots &
\end{pmatrix}
\begin{pmatrix}
\langle w_{1,a} \rangle \\ \langle w_{1,b} \rangle \\ \langle w_{2,a} \rangle \\ \langle w_{2,b} \rangle \\ \vdots \\ \langle w_{n_2,b} \rangle
\end{pmatrix}
+ \varepsilon
\begin{pmatrix}
0 & 0 & 1 & 0 & 1 & 0 & \cdots \\
0 & 0 & 0 & 1 & 0 & 1 & \cdots \\
1 & 0 & 0 & 0 & 1 & 0 & \cdots \\
0 & 1 & 0 & 0 & 0 & 1 & \cdots \\
\vdots & & & & & & \ddots
\end{pmatrix}
\begin{pmatrix}
\langle \dot{w}_{1,a} \rangle \\ \langle \dot{w}_{1,b} \rangle \\ \langle \dot{w}_{2,a} \rangle \\ \langle \dot{w}_{2,b} \rangle \\ \vdots \\ \langle \dot{w}_{n_2,b} \rangle
\end{pmatrix}.
\tag{C.1}
$$
Substituting the following eigenmode

$$
\mathbf{w} = \begin{pmatrix} \langle w_{1,a} \rangle \\ \langle w_{1,b} \rangle \\ \vdots \\ \langle w_{n_2,b} \rangle \end{pmatrix} \exp(\lambda t) \tag{C.2}
$$

into equation C.1, we obtain the characteristic equation as follows:

$$
\begin{vmatrix}
A_1 + cA_2 - A_3 - \lambda & -A_3 & \varepsilon\lambda & 0 & \varepsilon\lambda & 0 & \cdots \\
-A_3 & A_1 + cA_2 - A_3 - \lambda & 0 & \varepsilon\lambda & 0 & \varepsilon\lambda & \cdots \\
\varepsilon\lambda & 0 & A_1 + cA_2 - A_3 - \lambda & -A_3 & \varepsilon\lambda & 0 & \cdots \\
0 & \varepsilon\lambda & -A_3 & A_1 + cA_2 - A_3 - \lambda & 0 & \varepsilon\lambda & \cdots \\
\vdots & & & & & & \ddots
\end{vmatrix}
= 0. \tag{C.3}
$$
The eigenvalues of equation C.3 are calculated as follows:

$$
\lambda_1 = \frac{A_1 + cA_2}{1 - (n_2 - 1)\varepsilon}, \quad
\lambda_2 = \frac{A_1 + cA_2 - 2A_3}{1 - (n_2 - 1)\varepsilon}, \quad
\lambda_3 = \frac{A_1 + cA_2}{1 + \varepsilon}, \quad
\lambda_4 = \frac{A_1 + cA_2 - 2A_3}{1 + \varepsilon}, \tag{C.4}
$$
and λ_3 and λ_4 are degenerate, each corresponding to an (n_2 − 1)-dimensional eigenspace. The eigenvector for the largest eigenvalue, λ_1, is (1, −1, 1, −1, ..., 1, −1)^t. When ε = 0, it follows that λ_1 = λ_3 > λ_2 = λ_4. Then the n_2-dimensional dominant eigenspace is spanned by (1, −1, 0, 0, ..., 0, 0)^t, (0, 0, 1, −1, 0, 0, ..., 0, 0)^t, ..., (0, 0, ..., 0, 0, 1, −1)^t.

Acknowledgments

We thank W. Gerstner for carefully reading and discussing the manuscript. We also thank T. Aihara, M. Watanabe, S. Katori, and H. Fujii for their helpful comments. This study is supported by the Japan Society for the Promotion of Science and also by the Advanced and Innovational Research Program in Life Sciences and a Grant-in-Aid on priority areas (C), Advanced Brain Science Project, from the Ministry of Education, Culture, Sports, Science, and Technology, the Japanese Government.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1178–1183.
Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.
Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Araki, O., & Aihara, K. (2001). Dual information representation with stable firing rates and chaotic spatiotemporal spike patterns in a neural network model. Neural Computation, 13, 2799–2822.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94, 5411–5416.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Câteau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Computation, 15, 597–620.
de Oliveira, S. C., Thiele, A., & Hoffmann, K.-P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. J. Neurosci., 17(23), 9248–9260.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. M. (2002a). Mathematical formulations of Hebbian learning. Biol. Cybern., 87, 404–415.
Gerstner, W., & Kistler, W. M. (2002b). Spiking neuron models. Cambridge: Cambridge University Press.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23(9), 3697–3714.
Horn, D., Levy, N., Meilijson, I., & Ruppin, E. (2000). Distributed synchrony of spiking neurons in a Hebbian cell assembly. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 129–135). Cambridge, MA: MIT Press.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Computation, 12, 385–405.
Krüger, J., & Becker, D. (1991). Recognizing the visual stimulus from neuronal discharges. Trends in Neurosciences, 14(7), 282–286.
Kuhn, A., Aertsen, A., & Rotter, S. (2003). Higher-order statistics of input ensembles and the response of simple model neurons. Neural Computation, 15, 67–101.
Kuhn, A., Rotter, S., & Aertsen, A. (2002). Correlated input spike trains and their effects on the response of the leaky integrate-and-fire neuron. Neurocomputing, 44–46, 121–126.
Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in a cell assembly of spiking neurons. Neural Networks, 14, 815–824.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Masuda, N., & Aihara, K. (2002a). Spatio-temporal spike encoding of a continuous external signal. Neural Computation, 14, 1599–1628.
Masuda, N., & Aihara, K. (2002b). Bridging rate coding and temporal spike coding by effect of noise. Phys. Rev. Lett., 88(24), 248101.
Masuda, N., & Aihara, K. (2003). Duality of rate coding and temporal spike coding in multilayered feedforward networks. Neural Computation, 15, 103–125.
Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86(2), 364–367.
Sakurai, Y. (1999). How do cell assemblies encode information in the brain? Neuroscience and Biobehavioral Reviews, 23, 785–796.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nature Reviews Neuroscience, 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1990). Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. USA, 87, 7200–7204.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 339–350.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3(9), 919–926.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13, 2005–2029.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20, RC50 1–5.
Vaadia, E., Haalman, L., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20(23), 8812–8821.
van Rossum, M. C. W., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neurosci., 22(5), 1956–1966.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Watanabe, M., Aihara, K., & Kondo, S. (1998). A dynamic neural network with temporal coding and functional connectivity. Biol. Cybern., 78, 87–93.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received March 4, 2003; accepted September 8, 2003.
Communicated by Anthony Bell
NOTE
The Shape of Neural Dependence

Rick L. Jenison
[email protected]
Richard A. Reale [email protected] Departments of Psychology and Physiology and the Waisman Center, University of Wisconsin–Madison, Madison, WI 53706, U.S.A.
The product-moment correlation coefficient is often viewed as a natural measure of dependence. However, this equivalence applies only in the context of elliptical distributions, most commonly the multivariate gaussian, where linear correlation indeed sufficiently describes the underlying dependence structure. Should the true probability distributions deviate from those with elliptical contours, linear correlation may convey misleading information on the actual underlying dependencies. It is often the case that probability distributions other than the gaussian distribution are necessary to properly capture the stochastic nature of single neurons, which as a consequence greatly complicates the construction of a flexible model of covariance. We show how arbitrary probability densities can be coupled to allow greater flexibility in the construction of multivariate neural population models.
Neural population coding of events in the world is thought to involve coordinated activity among neurons. In delineating the structure of this dependence among neurons in the brain, it is the time pattern of discharges from individual cells that represents the variables, or marginal dimensions, under study. A common misperception is that any marginal probability distribution can be used to construct a parametric model of this dependence among variables. The most commonly used measure of dependence, the Pearson product-moment correlation coefficient, was developed on the basis of normal marginals and addresses only linear dependence (Yule, 1897; Mari & Kotz, 2001). However, a new and innovative approach, the so-called copula method, provides the ability to couple arbitrary marginal densities (Sklar, 1959; Joe, 1997; Nelsen, 1999). The word copula is a Latin noun that refers to a bond and is used in linguistics to refer to a proposition that links a subject and predicate. In probability theory, it couples marginal distributions to form flexible multivariate distribution functions. The appeal of the copula is that we eliminate the implied reliance on the multivariate gaussian or the assumption that dimensions are independent.

Neural Computation 16, 665–672 (2004)
© 2004 Massachusetts Institute of Technology
666
R. Jenison and R. Reale
Recently, Nirenberg, Carcieri, Jacobs, and Latham (2001) indirectly estimated the relative contribution of neural dependence to mutual information, which accounted for about 10% of the information transmitted. Pola, Thiele, Hoffmann, and Panzeri (2003) derived a method to directly estimate the mutual information on the entire response joint probability distribution. In both of these reports, as well as earlier studies of neural information, the shape of the dependence structure is generally ignored and considered to be absorbed into the joint probability density. In this communication, we show how the conditional response entropy can be factored into separate independent and dependent contributions to transmitted sensory information. These factors take the form of conditional differential entropies, and therefore the structure and impact of the dependence can be computed directly rather than indirectly.

Sklar's theorem (Sklar, 1959) states that any multivariate distribution can be expressed as the copula function C[u_1, u_2, ..., u_N] evaluated at each of the marginal distributions. By virtue of the probability integral transform (Casella & Berger, 1990), each marginal u_i = F_i(x_i) has a uniform distribution on [0, 1], where F_i(x_i) is the cumulative integral of p_i[x_i] for the random variable X_i. The joint probability density follows as

$$
p[x_1, x_2, \ldots, x_N] = \prod_{i=1}^{N} p_i[x_i] \times c[u_1, u_2, \ldots, u_N], \tag{1}
$$

where p_i[x_i] is each marginal density and the coupling is provided by c[u_1, u_2, ..., u_N] = ∂^N C[u_1, u_2, ..., u_N] / (∂u_1 ∂u_2 ⋯ ∂u_N), which is itself a probability density. When the random variables are independent, the copula density c[u_1, u_2, ..., u_N] is identically equal to one. The importance of equation 1 is that the independent portion, reflected as the product of the marginals, can be separated from the function c[u_1, u_2, ..., u_N] describing the dependence structure, or shape. There are several families of copulas, some of which are particularly descriptive of the dependence structure observed among the discharge patterns of sensory neurons. The choice of copula is a form of model selection that can be formalized using information criteria (Soofi, 2000) such as the Akaike information criterion (AIC) (Akaike, 1974). The AIC penalizes the negative log maximum likelihood of the estimated model by the number of parameters in the model [−2 log(maximum likelihood) + 2 (number of parameters)]. A smaller relative AIC represents a better model fit while taking into account the complexity of the model.

We have shown previously the importance of considering the impact of correlated noise on the acuity of sound localization by an ensemble of auditory cortical neurons in the cat (Jenison, 2000). Figure 1A shows a scatter plot of the timing (latency) to the first evoked spike following a sound presented at a particular location, θ, in space for a pair of neurons recorded from separate electrodes in the primary auditory (AI) field of an anesthetized cat.
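Equation 1 and the role of the copula density can be illustrated with the standard bivariate Ali-Mikhail-Haq (AMH) copula, C[u_1, u_2] = u_1 u_2 / (1 − α(1 − u_1)(1 − u_2)); the grid resolution below is an arbitrary choice, and the sketch is ours rather than the note's own computation. Second-order mixed differences of C recover the copula density c[u_1, u_2], which is flat at α = 0 (independence) and strongly non-uniform at α = 0.95.

```python
import numpy as np

def amh_copula(u, v, alpha):
    # standard bivariate Ali-Mikhail-Haq copula C[u, v]
    return u * v / (1.0 - alpha * (1.0 - u) * (1.0 - v))

n = 200                                     # grid cells per axis (arbitrary)
edges = np.linspace(0.0, 1.0, n + 1)
U, V = np.meshgrid(edges, edges, indexing="ij")

for alpha in (0.0, 0.95):
    C = amh_copula(U, V, alpha)
    # probability mass of each grid cell, then density = mass / cell area
    mass = C[1:, 1:] - C[1:, :-1] - C[:-1, 1:] + C[:-1, :-1]
    density = mass * n * n
    print(alpha, mass.sum(), density.min(), density.max())
```

In the note's setting, fitted marginals would then enter through equation 1, e.g. p[x_1, x_2] = p_1[x_1] p_2[x_2] c[F_1(x_1), F_2(x_2)], so the dependence shape is carried entirely by c while the marginals stay arbitrary.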
Figure 1: (A) Joint conditional probability density of first-spike latency recorded from two cortical neurons in response to a given sound source in space. Gray contours correspond to estimated iso-densities based on the AMH copula. The estimated single parameter α controls the shape of the copula, here α = 0.95. The inset shows conditional cross sections of the joint density based on the copula (gray) and estimated multivariate gaussian (black). (B) AMH copula density c[u_1, u_2 | θ] estimated from data shown in A. (C) AMH copula conditional entropy as a function of dependence and neural ensemble size (N). Multiple integrals were evaluated numerically using quasi-Monte Carlo integration.
These two neurons are representative of a common class of field AI cells described by their spatial sensitivity to free-field (actual or virtual) sound sources. In this regard, these AI neurons exhibited broad spatial receptive fields when measured with directionally dependent stimuli, even at low intensity levels of the sound source (Brugge, Reale, & Hind, 1996). Furthermore, their response latency was tightly locked to stimulus onset, and systematic gradients in latency were evident within the receptive field. Models have been successfully developed that capture this systematic response latency variability due to both sound-source direction and sound-source intensity (Jenison, Reale, Hind, & Brugge, 1998; Reale, Jenison, & Brugge, 2003). The joint probability density contours (gray) reflect maximum likelihood fit inverse-gaussian (IG) marginal probability densities (Jenison & Reale, 2003) coupled by the Ali-Mikhail-Haq (AMH) copula (see appendix A). The AMH copula is parameterized by a dependence measure α, which for the general multivariate case ranges between 0 and 1. The fit shown in Figure 1 yields α = 0.95 with an AIC equal to 3642. In comparison, for this data set, the AIC for independent IG marginals is 3728, and the multivariate gaussian AIC is 3766. This figure shows two common characteristics of the joint response of auditory cortical neurons. First, the marginal distributions are positively skewed, unlike the gaussian distribution, and second, the dependence structure is not elliptical, which is also inconsistent with a multivariate gaussian density. The inset shows comparisons of this copula-based joint density and the corresponding multivariate gaussian fit conditioned on several levels of first-spike latencies from neuron 2. The estimated AMH copula density c[u_1, u_2] is shown in Figure 1B, which can be viewed as a function that modulates the product density to form the joint probability density function as defined by equation 1.
In this case, the AMH copula density narrows in the lower tails (near zero) and broadens in the upper tails (near one). This is a characteristic of the AMH copula and relates to our general observations of AI cortical neurons where the strength of association between ensemble neurons is greater for shorter first-spike latencies compared to responses of longer latencies. When a pair of neurons fires in response to a particular location in space, the shorter latencies (which reflect a strong response) are more consistent within the ensemble; longer latencies (weaker responses) tend to be less consistent. This analysis indicates that not only is a multivariate gaussian density no longer a necessary assumption in delineating the structure of dependence among neurons, but employing a multivariate gaussian density with these data would actually provide misleading information, particularly for the behavior of the tail regions that greatly influence estimation. An important outcome of this improvement is that an ideal observer analysis of sound location using responses from AI neurons can now employ parametric probability models using the proper dependence structure and marginals (Jenison, 2000; Jenison & Reale, 2003).
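The Sklar-style construction used here — inverse-gaussian marginals coupled by the AMH copula — can be sketched in a few lines. A minimal illustration in Python: the bivariate AMH density is the standard closed form, while the inverse-gaussian parameters below are made-up placeholders, not the values fitted to the data in Figure 1.

```python
import numpy as np
from scipy import stats

def amh_density(u, v, alpha):
    """Bivariate Ali-Mikhail-Haq copula density c(u, v; alpha)."""
    num = 1 + alpha * (u + v + u * v - 2) - alpha**2 * (u + v - u * v - 1)
    den = (1 - alpha * (1 - u) * (1 - v))**3
    return num / den

def joint_latency_density(x1, x2, marg1, marg2, alpha):
    """Sklar's theorem: joint pdf = product of marginal pdfs times the
    copula density evaluated at the marginal CDF values."""
    u, v = marg1.cdf(x1), marg2.cdf(x2)
    return marg1.pdf(x1) * marg2.pdf(x2) * amh_density(u, v, alpha)

# Illustrative inverse-gaussian marginals for first-spike latency (ms);
# these parameter values are hypothetical, not the fitted ones.
marg1 = stats.invgauss(mu=0.5, loc=10.0, scale=20.0)
marg2 = stats.invgauss(mu=0.4, loc=12.0, scale=25.0)

f = joint_latency_density(25.0, 28.0, marg1, marg2, alpha=0.95)
```

At α = 0 the copula density is identically 1 and the joint density reduces to the independent product density, which is the sense in which the copula "modulates" the product form of equation 1.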
The Shape of Neural Dependence
Shannon information can provide a measure of the information available for localizing a sound source in space transmitted on average by a given ensemble of N neurons. Mutual information between stimulus and response is based on Shannon's entropy, which can be interpreted as the degree to which the total response entropy is reduced by the entropy conditioned on a particular stimulus setting θ. Let this conditional entropy be H[x_1, x_2, …, x_N | θ], where x_i is the response of the ith neuron. Using the copula density, it is straightforward to show that the conditional entropy can be split into two terms: the sum of entropies due to the independent components of the conditional density and the quantity due strictly to the dependence structure defined by the copula (see appendix B). This can be expressed as

H[x_1, x_2, …, x_N | θ] = Σ_{i=1}^{N} H[x_i | θ] + H_c[u_1, u_2, …, u_N | θ],   (2)
where the copula entropy is

H_c[u_1, u_2, …, u_N | θ] = −∫_{[0,1]^N} c[u_1, u_2, …, u_N | θ] log c[u_1, u_2, …, u_N | θ] du.   (3)
It follows as a consequence of equation 2 that the copula entropy must be mathematically equivalent to the negative of the mutual information between neurons, but with the benefit of being computed directly from the dependence structure. Figure 1C shows the reduction in the conditional response entropy due to the AMH copula as a function of dependence and parameterized by ensemble size N. The entropy of the copula is at its maximum at zero dependence, and the copula entropy declines as a function of dependence and ensemble size. So although the decline is rather modest for a pair of neurons, as suggested by Nirenberg et al. (2001), it accelerates here as the ensemble size increases. The importance of this separability is that it is generally complicated, and often intractable, to parametrically model the joint probability density if it is nongaussian. When independence is assumed and the product density is based on only the marginal densities, the joint probability is easy to compute. The spike latency joint probability density can now be constructed using the marginal densities coupled by the separately estimated copula density function, which greatly simplifies the construction. The gaussian assumption is no longer a necessary constraint, and this newly found freedom allows for much greater flexibility in computational modeling studies of the brain. Finally, given the separability, the dependence entropy can be computed directly to assess the impact of the shape of neural dependence on sensory coding.
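The copula entropy of equation 3 can be estimated numerically in the bivariate case. A minimal sketch using midpoint-grid integration in place of the quasi-Monte Carlo scheme mentioned in the Figure 1 caption; the α values are illustrative. H_c is zero at independence and grows more negative (i.e., the conditional response entropy is reduced) as dependence increases, consistent with Figure 1C.

```python
import numpy as np

def amh_density(u, v, alpha):
    """Bivariate AMH copula density c(u, v; alpha)."""
    num = 1 + alpha * (u + v + u * v - 2) - alpha**2 * (u + v - u * v - 1)
    return num / (1 - alpha * (1 - u) * (1 - v))**3

def copula_entropy(alpha, m=800):
    """Midpoint-rule estimate of H_c = -integral of c log c over [0,1]^2."""
    g = (np.arange(m) + 0.5) / m          # midpoints of an m-by-m grid
    U, V = np.meshgrid(g, g)
    c = amh_density(U, V, alpha)
    return -np.mean(c * np.log(c))        # mean = sum times cell area 1/m^2

for a in (0.0, 0.3, 0.6, 0.9):
    print(f"alpha = {a:.1f}  H_c = {copula_entropy(a):+.4f}")
```

Because H_c equals minus the mutual information among the neurons, it is bounded above by zero, with equality exactly at independence (α = 0).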
Appendix A: Ali-Mikhail-Haq (AMH) Copula

The AMH copula distribution (Ali, Mikhail, & Haq, 1978) is an Archimedean copula (Nelsen, 1999) that can be constructed by an additive generator function φ and its inverse φ^{-1}, if it exists, using the following form:

C_Archimedean[u_1, u_2, …, u_N; α] = φ^{-1}[ Σ_{i=1}^{N} φ[u_i; α] ].   (A.1)

The generator function for the AMH copula is φ[u; α] = log[(1 − α(1 − u))/u], which has an inverse φ^{-1}[y; α] = (1 − α)/(e^y − α), and yields

C_AMH[u_1, u_2, …, u_N; α] = (α − 1) / (α − ∏_{i=1}^{N} (1 − α + α u_i)/u_i),   (A.2)

where u_i = F_i(x_i) = ∫_{−∞}^{x_i} p_i(t) dt is the marginal cumulative distribution. α is the measure of dependence, where 0 ≤ α < 1. An Archimedean N-dimensional copula is a proper distribution if and only if the inverse generator function φ^{-1} is completely monotonic (Schweizer & Sklar, 1983).

Appendix B: Copula Differential Entropy

The joint entropy can be factorized into the sum of entropies due to each marginal density and the entropy of the copula (dependence structure):
H[x_1, x_2, …, x_N] = Σ_{i=1}^{N} H[x_i] + H_c[u_1, u_2, …, u_N]   (B.1)

−∫_{ℝ^N} p[x_1, x_2, …, x_N] log p[x_1, x_2, …, x_N] dx
  = −∫_{ℝ^N} { ∏_{i=1}^{N} p_i[x_i] · c[u_1, u_2, …, u_N] } log{ ∏_{i=1}^{N} p_i[x_i] · c[u_1, u_2, …, u_N] } dx
  = −∫_{ℝ^N} { ∏_{i=1}^{N} p_i[x_i] · c[u_1, u_2, …, u_N] } { Σ_{i=1}^{N} log p_i[x_i] + log c[u_1, u_2, …, u_N] } dx.   (B.2)
Integrating both sides of the equation over the unit hypercube,

−∫_{[0,1]^N} ∫_{ℝ^N} p[x_1, …, x_N] log p[x_1, …, x_N] dx du
  = −∫_{[0,1]^N} ∫_{ℝ^N} { ∏_{i=1}^{N} p_i[x_i] · c[u_1, …, u_N] } { Σ_{i=1}^{N} log p_i[x_i] + log c[u_1, …, u_N] } dx du,   (B.3)

−∫_{ℝ^N} p[x_1, …, x_N] log p[x_1, …, x_N] dx
  = −Σ_{i=1}^{N} ∫ p_i[x_i] log p_i[x_i] dx_i − ∫_{[0,1]^N} c[u_1, …, u_N] log c[u_1, …, u_N] du,   (B.4)

∴ H[x_1, x_2, …, x_N] = Σ_{i=1}^{N} H[x_i] + H_c[u_1, u_2, …, u_N].   (B.5)
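The factorization in equations B.1 through B.5 can be verified numerically. A minimal check using a bivariate gaussian density rather than the AMH/inverse-gaussian model of the text, chosen because every term then has a closed form: the entropy of a gaussian copula with correlation ρ is ½ log(1 − ρ²).

```python
import math
import numpy as np
from scipy import stats

rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])

# Closed forms for a bivariate normal with unit marginal variances.
h_joint = 0.5 * math.log((2 * math.pi * math.e)**2 * np.linalg.det(cov))
h_marg = 0.5 * math.log(2 * math.pi * math.e)   # entropy of each marginal
h_cop = 0.5 * math.log(1 - rho**2)              # gaussian copula entropy

# Equation B.1: H[x1, x2] = H[x1] + H[x2] + H_c[u1, u2].
assert abs(h_joint - (2 * h_marg + h_cop)) < 1e-12

# Monte Carlo check of the joint entropy: H is approximately -E[log p(x)].
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
h_mc = -stats.multivariate_normal(cov=cov).logpdf(x).mean()
print(h_joint, h_mc)
```

Note that the copula entropy term is negative whenever ρ ≠ 0, so dependence always lowers the joint entropy below the sum of marginal entropies, as in equation B.5.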
Acknowledgments

This work was supported by National Institutes of Health grant DC-03554.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Auto. Control., 19, 723.
Ali, M. M., Mikhail, N. N., & Haq, M. S. (1978). A class of bivariate distributions including the bivariate logistic. J. Multivariate Anal., 8, 405–412.
Brugge, J. F., Reale, R. A., & Hind, J. E. (1996). The structure of spatial receptive fields of neurons in primary auditory cortex of the cat. J. Neurosci., 16, 4420–4437.
Casella, G., & Berger, R. L. (1990). Statistical inference. North Scituate, MA: Duxbury Press.
Jenison, R. L. (2000). Correlated cortical populations can enhance sound localization performance. J. Acoust. Soc. Am., 107, 414–421.
Jenison, R. L., & Reale, R. A. (2003). Likelihood approaches to sensory coding in auditory cortex. Network: Comp. Neural Sys., 14, 83–102.
Jenison, R. L., Reale, R. A., Hind, J. E., & Brugge, J. F. (1998). Modeling of auditory spatial receptive fields with spherical approximation functions. J. Neurophysiol., 80, 2645–2656.
Joe, H. (1997). Multivariate models and dependence concepts. London: Chapman & Hall.
Mari, D. D., & Kotz, S. (2001). Correlation and dependence. London: Imperial College Press.
Nelsen, R. B. (1999). An introduction to copulas. New York: Springer-Verlag.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Pola, G., Thiele, A., Hoffmann, K. P., & Panzeri, S. (2003). An exact method to quantify information transmitted to different mechanisms of correlational coding. Network: Comp. Neural Sys., 14, 35–60.
Reale, R. A., Jenison, R. L., & Brugge, J. F. (2003). Directional sensitivity of neurons in the primary auditory (AI) cortex: Effects of sound-source intensity level. J. Neurophysiol., 89, 1024–1038.
Schweizer, B., & Sklar, A. (1983). Probabilistic metric spaces. Amsterdam: North-Holland.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8, 229–231.
Soofi, E. S. (2000). Principal information theoretic approaches. J. Am. Stat. Assoc., 95, 1349–1353.
Yule, G. U. (1897). On the theory of correlation. J. Roy. Statist. Soc., 60, 812–854.

Received May 29, 2003; accepted September 16, 2003.
LETTER
Communicated by Bard Ermentrout
On the Phase Reduction and Response Dynamics of Neural Oscillator Populations Eric Brown [email protected]
Jeff Moehlis*
[email protected] Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, U.S.A.
Philip Holmes [email protected] Program in Applied and Computational Mathematics and Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ 08544, U.S.A.
We undertake a probabilistic analysis of the response of repetitively firing neural populations to simple pulselike stimuli. Recalling and extending results from the literature, we compute phase response curves (PRCs) valid near bifurcations to periodic firing for Hindmarsh-Rose, Hodgkin-Huxley, FitzHugh-Nagumo, and Morris-Lecar models, encompassing the four generic (codimension one) bifurcations. Phase density equations are then used to analyze the role of the bifurcation, and the resulting PRC, in responses to stimuli. In particular, we explore the interplay among stimulus duration, baseline firing frequency, and population-level response patterns. We interpret the results in terms of the signal processing measure of gain and discuss further applications and experimentally testable predictions.

1 Introduction

This letter seeks to add to our understanding of how the firing rates of populations of neural oscillators respond to pulselike stimuli representing sensory inputs and to connect this to mechanisms of neural computation and modulation. In particular, we study how responses depend on oscillator type (classified by its bifurcation to periodic firing), baseline firing rate of the population, and duration of the input. As in, e.g., Fetz and Gustaffson (1983) and Herrmann and Gerstner (2001), our results also apply to the interpretation of peri-stimulus time histograms (PSTHs), which represent averages over ensembles of independent neuronal recordings.

* Current address: Department of Mechanical and Environmental Engineering, University of California, Santa Barbara, CA 93106, U.S.A.
Neural Computation 16, 673–715 (2004)
© 2004 Massachusetts Institute of Technology
674
E. Brown, J. Moehlis, and P. Holmes
We are motivated by attempts to understand different responses, in the form of PSTHs of spike rates in the brainstem organ locus coeruleus, of monkeys performing target identification and other tasks (Usher, Cohen, Servan-Schreiber, Rajkowski, & Aston-Jones, 1999; Brown, Moehlis, Holmes, Clayton, Rajkowski, & Aston-Jones, 2003b), but there are many other situations in which populations of spiking neurons are reset by stimuli. For example, the multiple oscillator and beat frequency models of interval timing of Meck et al. (Matell & Meck, 2000) involve cortical oscillators of differing frequencies, and the 40 Hz synchrony reported by Gray and Singer and Eckhorn et al. (see Gray, 2000, and Eckhorn, 1999, for reviews) also suggests the onset of coherent oscillations in visual cortex. For most neuron models, we find that the response of populations to a fixed stimulus current scales inversely with the prestimulus baseline firing rate of the population. While the firing rates of individual neurons also display this inverse relationship (encoded in their f–I curves; Rinzel & Ermentrout, 1998), the scaling of the population response differs from that of individual neurons. This effect suggests a possible role of baseline firing rate in cognitive processing by neural populations: decreasing baseline firing rates (via reduced inputs from other brain areas or via neuromodulators; e.g., Usher et al., 1999; Aston-Jones, Rajkowski, & Cohen, 2000; Aston-Jones, Chen, Zhu, & Oshinsky, 2001) can adjust the "fraction" of an incoming stimulus that is passed on to the next processing module. Recent data from the brainstem nucleus locus coeruleus (LC), for example, reflect this pattern: greater responsivity and better cognitive performance are both correlated with slower baseline firing rates (Aston-Jones, Rajkowski, Kubiak, & Alexinsky, 1994; Usher et al., 1999; Brown et al., 2003b).
We also find that for certain common neuron models, the maximum population response to a step stimulus of fixed strength can occur only (if it occurs at all) after stimulus removal. Moreover, in all cases, there are resonant stimulus durations for which there is no poststimulus response. Thus, the magnitude and timing of maximal population response depend strongly on both neuron type and stimulus duration relative to the baseline period.

This letter is organized as follows. Section 2 discusses phase reduction techniques for ordinary differential equations with attracting limit cycles. In the following section, we recall and compute phase response curves for familiar neuron models near the four codimension one bifurcations to periodic firing, using normal forms and numerical calculations (Ermentrout, 2002). These two sections review part of the broad literature on the topic and provide new results: PRCs valid near degenerate Hopf and homoclinic bifurcations and the scaling of PRCs with the frequency of the neurons from which they are derived. Section 4 then analyzes firing probabilities in response to simple stimuli, enabling us to predict spike histograms, describe their dependence on parameters characterizing the stimuli and neuron type, and emphasize similarities and differences among the responses of different models. These results are summarized in six roman-numbered boldface
Phase Reduction and Response Dynamics
675
statements. Section 5 interprets these results in terms of the gain, or signal amplification, of neural populations. Section 6 closes the letter with comments on further applications and possible experimental tests.

Both phase reduction methods and population modeling have a rich history, including numerous applications in neuroscience. The classical phase coordinate transformation used in this article originated at least by 1949 (Malkin, 1949), with the complementary asymptotic phase ideas expanded in, among others, Coddington and Levinson (1955), Winfree (1974, 2001), and Guckenheimer (1975) and applied in Ermentrout and Kopell (1984, 1990, 1991), Hansel, Mato, and Meunier (1993, 1995), van Vreeswijk, Abbott, and Ermentrout (1994), Ermentrout (1996), Hoppensteadt and Izhikevich (1997), Kuramoto (1997), Kim and Lee (2000), Bressloff and Coombes (2000), Izhikevich (2000b), Brown, Holmes, and Moehlis (2003a), and Lewis and Rinzel (2003); see also the related "spike response method" in Gerstner, van Hemmen, and Cowan (1996) and Gerstner and Kistler (2002) and references therein. Voltage density approaches, primarily undertaken in an integrate-and-fire framework involving reinjection boundary conditions and in some cases involving distributed conductances, are developed and applied in, among others, Stein (1965), Wilson and Cowan (1972), Fetz and Gustaffson (1983), Gerstner (2000), Nykamp and Tranchina (2000), Omurtag, Knight, and Sirovich (2000), Herrmann and Gerstner (2001), Casti et al. (2001), Brunel, Chance, Fourcaud, and Abbott (2001), Fourcaud and Brunel (2002), and Gerstner and Kistler (2002). In particular, density formulations derived from integrate-and-fire models (Fetz & Gustaffson, 1983; Herrmann & Gerstner, 2001) demonstrate the inverse relationship between peak firing rates and baseline frequency (for populations receiving pulsed stimuli) that we extend to other neuron models in this article. The work of Brunel et al.
(2001) and Fourcaud and Brunel (2002) focuses on the transmission of stimuli by noisy integrate-and-fire populations. It explains how components of incoming signals are shifted and attenuated (or amplified) when output as firing rates of the population, depending on the frequency of the signal component and the characteristics of noise in the population. Some of the conclusions of our article (for integrate-and-fire neurons only) could presumably be reconstructed from the Brunel et al. results by decomposing our stepped stimuli into Fourier components; however, simpler methods applicable to our noise-free case allow our different analytical insights into response properties. Experiments on population responses to applied stepped and fluctuating currents have also been performed, for example, by Mainen and Sejnowski (1995) in cortical neurons. Due to noise inherent in their biological preparations, responses to stepped, but not fluctuating, stimuli are gradually damped (cf. also Gerstner, 2000; Gerstner & Kistler, 2002); these effects are studied using a phase density approach by Ritt (2003). The phase density formulation is also used in Kuramoto (1984) and Strogatz (2000), where the emphasis is on coupling effects in populations with distributed frequencies, generally without external stimuli. The approach
676
E. Brown, J. Moehlis, and P. Holmes
closest to ours is that of Tass (1999), who focuses on how pulsed input signals can desynchronize populations of noisy, coupled phase oscillators that have clustered equilibrium states; of particular interest is the critical stimulus duration T_crit for which the maximum desynchronizing effect is achieved. By contrast, this letter focuses on synchronizing responses of independent noiseless oscillators (with uniform stationary distributions) and, using analytical solutions to this simpler problem, stresses the influence of individual neuron properties. Specifically, we contribute a family of simple expressions for time-dependent firing rates in response to pulsed stimuli, derived from different nonlinear oscillator models via phase reductions and the method of characteristics. Our expressions allow us to identify a series of novel relationships between population dynamics during and after stepped stimuli and the frequencies and bifurcation types of the individual neurons making up the population. As already mentioned, we consider only uncoupled (and noiseless) neurons, but we note that our results remain generally valid for weakly coupled systems. In particular, in Brown et al. (2003b), we show that for a noisy neural population with synaptic and electrotonic couplings sufficient to reproduce observed variations in experimental cross-correlograms, the uncoupled limit is adequate for understanding key first-order modulatory effects.

2 Phase Equations for Nonlinear Oscillators with Attracting Limit Cycles

2.1 Phase Reductions. Following, e.g., Ermentrout (1996, 2002), Hoppensteadt and Izhikevich (1997), Guckenheimer (1975), and Winfree (1974, 2001), we first describe a coordinate change to phase variables that will simplify the analysis to come. Our starting point is a general, conductance-based model of a single neuron:

C V̇ = [I_g(V, n) + I_b + I(V, t)],   (2.1)
ṅ = N(V, n); (V, n)^T ∈ ℝ^N.   (2.2)
Here, V is the voltage difference across the membrane, the (N − 1)-dimensional vector n comprises gating variables and I_g(V, n) the associated membrane currents, and C is the cell membrane capacitance. The baseline inward current I_b effectively sets oscillator frequency and will correspond below to a bifurcation parameter. I(V, t) represents synaptic currents from other brain areas due to stimulus presentation; below, we neglect reversal potentials so that I(V, t) = I(t). We write this equation in the general form

ẋ = F(x) + G(x, t); x = (V, n)^T ∈ ℝ^N,   (2.3)
where F(x) is the baseline vector field, G(x, t) is the stimulus effect, and T denotes transpose. In our simplification, G(x, t) = (I(t), 0)^T; in a more general setting, perturbations in the gating equations 2.2 could also be included. We assume that the baseline (G ≡ 0) neural oscillator has a normally hyperbolic (Guckenheimer & Holmes, 1983), attracting limit cycle γ. This persists under small perturbations (Fenichel, 1971), and hereafter we assume that such a limit cycle always exists for each neuron. The objective is to simplify equation 2.3 by defining a scalar phase variable θ(x) ∈ [0, 2π) for all x in some neighborhood U of γ (within its domain of attraction), such that the phase evolution has the simple form dθ(x)/dt = ω for all x ∈ U when G ≡ 0. Here, ω = 2π/T, where T is the period of equation 2.3 with G ≡ 0. From the chain rule, this requires

dθ(x)/dt = (∂θ/∂x)(x) · F(x) + (∂θ/∂x)(x) · G(x, t) = ω + (∂θ/∂x)(x) · G(x, t).   (2.4)
Equation 2.4 defines a first-order partial differential equation (PDE) that the scalar field θ(·) must satisfy. Using the classical techniques of isochrons (Winfree, 1974, 2001; Guckenheimer, 1975; Kuramoto, 1997; cf. Hirsch, Pugh, & Shub, 1977), the unique (up to a translational constant) solution θ(·) to this PDE can be constructed indirectly. Even after θ(·) has been found (see the next subsection), equation 2.4 is not a phase-only (and hence self-contained) description of the oscillator dynamics. However, evaluating the vector field at x^γ(θ), which we define as the intersection of γ and the θ(x) level set (i.e., isochron), we have

dθ(x)/dt = ω + (∂θ/∂x)(x^γ(θ)) · G(x^γ(θ), t) + E,   (2.5)
where E is an error term of O(|G|²), where the scalar |G| bounds G(x, t) over all components as well as over x and t (cf. Kuramoto, 1997). Dropping this error term, we may rewrite equation 2.5 as the one-dimensional phase equation,

dθ/dt = ω + (∂θ/∂x)(θ) · G(θ, t),   (2.6)
which is valid (up to the error term) in the whole neighborhood U of γ.

2.2 Computing the Phase Response Curve. In the case of equations 2.1 and 2.2, the only partial derivative we must compute to fully define equation 2.6 is with respect to voltage, and we define the phase response curve (PRC; Winfree, 2001) as (∂θ/∂V)(θ) ≡ z(θ). Then equation 2.6 becomes

dθ/dt = ω + z(θ) I(t) ≡ v(θ, t),   (2.7)
the population dynamics of which is the subject of this letter. Note that equation 2.7 neglects reversal potential effects for the various synapses that contribute to the net I(t): if these were included, I(t) would be replaced by I(θ, t). Furthermore, if G had nonzero components in more than just the voltage direction, we would need to compute a vector-valued PRC; each component of this could be computed in a similar manner to that below.

2.2.1 Direct Method. We now describe a straightforward and classical way to compute z(θ) that is useful in experimental, numerical, and analytical studies (Winfree, 1974, 2001; Glass & Mackey, 1988; Kuramoto, 1997). By definition,

z(θ) = lim_{ΔV → 0} Δθ/ΔV,   (2.8)
where Δθ = [θ(x^γ + (ΔV, 0)^T) − θ(x^γ)] is the change in θ(x) resulting from a perturbation V → V + ΔV from the base point x^γ on γ (see Figure 1). Since θ̇ = ω everywhere in the neighborhood of γ, the difference Δθ is preserved under the baseline (G = 0) phase flow; thus, it may be measured in the limit as t → ∞, when the perturbed trajectory has collapsed back to the limit cycle γ. That is, z(θ) can be found by comparing the phases of solutions in the infinite-time limit starting on and infinitesimally shifted from base points on γ. This is the idea of asymptotic phase. This method will be used in section 3 to compute PRCs for the normal forms commonly arising in neural models.
Figure 1: The direct method for computing ∂θ/∂V at the point indicated by * is to take the limit of Δθ/ΔV for vanishingly small perturbations ΔV. One can calculate Δθ in the limit t → ∞, as discussed in the text.
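The direct method is straightforward to implement numerically. A minimal sketch for a hypothetical planar oscillator (not one of the neuron models considered in this letter), chosen because its asymptotic phase is known in closed form: for ṙ = r(1 − r²), θ̇ = ω + b(1 − r²), the asymptotic phase is θ − b log r, so the exact PRC in the x ("voltage") direction is z(θ) = −sin θ − b cos θ.

```python
import math

OMEGA, B = 1.0, 0.5

def f(x, y):
    """Planar oscillator with twist parameter B (nonradial isochrons)."""
    s = 1.0 - x * x - y * y
    return x * s - (OMEGA + B * s) * y, y * s + (OMEGA + B * s) * x

def rk4(x, y, t_end, dt=0.005):
    """Fixed-step RK4 integration of the oscillator."""
    for _ in range(int(round(t_end / dt))):
        k1 = f(x, y)
        k2 = f(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
        k3 = f(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
        k4 = f(x + dt * k3[0], y + dt * k3[1])
        x += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        y += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return x, y

def prc_direct(theta0, dv=1e-6, t_end=20.0):
    """z(theta0) = lim dtheta/dV: perturb V at a base point on the cycle,
    integrate both trajectories to large t, and compare final phases."""
    x0, y0 = math.cos(theta0), math.sin(theta0)
    xu, yu = rk4(x0, y0, t_end)           # unperturbed
    xp, yp = rk4(x0 + dv, y0, t_end)      # perturbed in the V direction
    dtheta = math.atan2(yp, xp) - math.atan2(yu, xu)
    dtheta = (dtheta + math.pi) % (2 * math.pi) - math.pi  # wrap to (-pi, pi]
    return dtheta / dv

for th in (0.0, 1.0, 2.5, 4.0):
    exact = -math.sin(th) - B * math.cos(th)
    print(f"theta = {th:.2f}  direct = {prc_direct(th):+.4f}  exact = {exact:+.4f}")
```

Because both trajectories receive the same elapsed time, the wrapped angular difference after the transient has collapsed equals the asymptotic phase difference, exactly as described in the text.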
2.2.2 Other Methods. Another technique for finding (∂θ/∂V)(θ) involves solving the adjoint problem associated with equations 2.1 and 2.2 (Hoppensteadt & Izhikevich, 1997; Ermentrout & Kopell, 1991); this procedure is automated in the program XPP (Ermentrout, 2002) and is equivalent to the direct method discussed above. This equivalence, described in appendix A, is implicit in the calculation of coupling functions presented in Hoppensteadt and Izhikevich (1997) and Ermentrout (2002). The implementation of the adjoint method in XPP is used to compute the PRCs for full neuron models that are compared with normal form predictions later in this article. Since only partial derivatives ∂θ/∂x evaluated on γ enter equation 2.7, and not the value of the phase function θ itself, it is tempting to compute these partial derivatives directly from equation 2.4. However, when viewed as an algebraic equation for the vector field ∂θ/∂x, equation 2.4 yields infinitely many solutions, being only one equation for the N unknown functions ∂θ/∂x_j, j = 1, …, N. Some of these solutions are much easier to construct than the phase response curve computed via the direct or the adjoint method. However, for such a solution, which we write as ∂θ_2/∂x (≠ ∂θ/∂x) to distinguish it from partial derivatives of the asymptotic phase θ, there is not necessarily a corresponding phase variable θ_2 such that dθ_2(x)/dt = ω, x ∈ U (in the absence of stimulus): recall the uniqueness of the solution θ(x) to equation 2.4. (See appendix B for a specific coordinate change from the literature in this context.)
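The adjoint characterization can also be checked directly: the PRC is the periodic solution of ż = −DF(x^γ(t))ᵀ z, normalized so that z(t) · F(x^γ(t)) = ω, and integrating the adjoint equation backward in time makes that solution attracting. A minimal sketch (not XPP's implementation, just the same idea) for a hypothetical planar oscillator with analytically known PRC, ṙ = r(1 − r²), θ̇ = ω + b(1 − r²), whose exact voltage-direction PRC is z_V(θ) = −sin θ − b cos θ:

```python
import math

OMEGA, B = 1.0, 0.5
T = 2 * math.pi / OMEGA          # period of the limit cycle r = 1

def jac(t):
    """Jacobian of the vector field on the cycle x = cos(wt), y = sin(wt)."""
    x, y = math.cos(OMEGA * t), math.sin(OMEGA * t)
    return ((-2*x*x + 2*B*x*y, -2*x*y - OMEGA + 2*B*y*y),
            (-2*x*y + OMEGA - 2*B*x*x, -2*y*y - 2*B*x*y))

def adjoint_rhs(t, z):
    """z' = -J(t)^T z, the adjoint (PRC) equation."""
    J = jac(t)
    return (-(J[0][0]*z[0] + J[1][0]*z[1]), -(J[0][1]*z[0] + J[1][1]*z[1]))

def prc_adjoint(theta, periods=6, n=20000):
    """Integrate the adjoint equation backward over several periods; the
    periodic adjoint solution is attracting in backward time."""
    dt = periods * T / n
    t = theta / OMEGA + periods * T
    z = (1.0, 0.0)               # arbitrary initial condition
    for _ in range(n):           # RK4 with a negative step
        k1 = adjoint_rhs(t, z)
        k2 = adjoint_rhs(t - 0.5*dt, (z[0] - 0.5*dt*k1[0], z[1] - 0.5*dt*k1[1]))
        k3 = adjoint_rhs(t - 0.5*dt, (z[0] - 0.5*dt*k2[0], z[1] - 0.5*dt*k2[1]))
        k4 = adjoint_rhs(t - dt, (z[0] - dt*k3[0], z[1] - dt*k3[1]))
        z = (z[0] - dt*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0])/6,
             z[1] - dt*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1])/6)
        t -= dt
    # Normalize so z . F = OMEGA (on the cycle, F = OMEGA*(-sin, cos)).
    x, y = math.cos(theta), math.sin(theta)
    scale = OMEGA / (z[0] * (-OMEGA * y) + z[1] * (OMEGA * x))
    return z[0] * scale, z[1] * scale

for th in (0.0, 1.5, 3.0):
    zx, zy = prc_adjoint(th)
    print(f"theta = {th:.1f}  adjoint z_V = {zx:+.4f}  "
          f"exact = {-math.sin(th) - B * math.cos(th):+.4f}")
```

The backward integration works because the amplitude (Floquet) mode of the adjoint decays as t decreases, leaving only the neutral periodic mode, which the normalization then fixes uniquely.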
2.3 Validity of the Phase Reduction. We shall always assume that the phase flow θ̇ is nonnegative at the spike point θ_s ≡ 0; otherwise equation 2.7 does not make sense as a neuron model (neurons cannot cross "backwards" through the spike and regain a state from which they can immediately fire again). For oscillators giving PRCs z(θ) with z(θ_s) ≠ 0, this assumption restricts admissible perturbing functions I(t) (or, in the more general case of equation 2.6, G(x, t)) to those satisfying

I(t) z(θ_s) > −ω.   (2.9)
Thus, for z(θ_s) > 0, excitatory input (I(t) > 0) is always admissible, but there is a lower bound on the strength of inhibitory input for which phase reductions hold. In particular, if I(t) contains a noise component, it must be bounded below; this requires trimming the white (diffusive) or Ornstein-Uhlenbeck noise processes commonly used to model variability in synaptic inputs. These problems do not arise for continuous PRCs having z(θ_s) = 0. We note that z(θ_s) = 0 approximately holds for the Hodgkin-Huxley (HH) and Hindmarsh-Rose (HR) neurons to be considered below, and indeed holds for any neuron model with a fast vector field surrounding the spike tip x_s on the limit cycle. In this case, asymptotic phase changes very little in a small neighborhood near x_s, since θ = ωt, and only a short time is spent in the neighborhood. A small perturbation in the V direction therefore
takes trajectories to isochrons with similar values of θ, and so has little effect on asymptotic phase. For the integrate-and-fire systems investigated below, spikes are not explicitly modeled. While this may be viewed as an artificial omission leading to z(θ_s) ≠ 0, the population dynamics of such systems are of interest because they are in rather common use.

3 Phase Response Curves for Specific Models

In this section, we derive or recall analytical approximations to PRCs for multidimensional systems with limit cycles that arise in the four (local and global) codimension one bifurcations (Guckenheimer & Holmes, 1983): these are appropriate to conductance-based models of the form of equations 2.1 and 2.2. We then give PRCs for one-dimensional (linear) integrate-and-fire models. Of these PRC calculations, results for the homoclinic and degenerate Hopf bifurcation are new, while the results for other models, previously derived as referenced in the text, are summarized and recast to display their frequency dependence and for application to population models in what follows.

3.1 Phase Response Curves Near Codimension One Bifurcations to Periodic Firing. Bifurcation theory (Guckenheimer & Holmes, 1983) identifies four codimension one bifurcations which can give birth to a stable limit cycle for generic families of vector fields: a SNIPER bifurcation (saddle-node bifurcation of fixed points on a periodic orbit), a supercritical Hopf bifurcation, a saddle-node bifurcation of limit cycles, and a homoclinic bifurcation (see Figure 2). All four bifurcation types have been identified in specific neuron models as a parameter is varied.
Here, the varied parameter is the baseline inward current I_b; for example, SNIPER bifurcations are found for type I neurons (Ermentrout, 1996) like the Connor model and its two-dimensional Hindmarsh-Rose (HR) reduction (Rose & Hindmarsh, 1989), supercritical Hopf bifurcations may occur for the abstracted FitzHugh-Nagumo (FN) model (Keener & Sneyd, 1998), a saddle-node bifurcation of limit cycles is found for the Hodgkin-Huxley (HH) model (Hodgkin & Huxley, 1952; Rinzel & Miller, 1980), and a homoclinic bifurcation can occur for the Morris-Lecar (ML) model (Rinzel & Ermentrout, 1998). In this section, we calculate or summarize PRCs for limit cycles arising from all four bifurcations. This is accomplished, where possible, through use of one- and two-dimensional normal form equations. Normal forms are obtained through center manifold reduction of equations 2.1 and 2.2 at the bifurcation, followed by a similarity transformation to put the linear part of the equation into Jordan normal form, and finally by successive near-identity nonlinear coordinate transformations to remove as many terms as possible, a process that preserves the qualitative dynamics of the system (Guckenheimer & Holmes, 1983). To obtain the PRC in terms of the original variables, that is, ∂θ/∂V, rather than in terms of the normal form variables
Figure 2: (a) SNIPER bifurcation. Two fixed points die in a saddle-node bifurcation at η = 0, giving a periodic orbit for η > 0, assumed to be stable. (b) Supercritical Hopf bifurcation. A fixed point loses stability as α increases through zero, giving a stable periodic orbit (closed curve). (c) Bautin bifurcation. See the text for details. At α = f²/4c there is a saddle-node bifurcation of periodic orbits. Both a stable (solid closed curve) and unstable (dashed closed curve) periodic orbit exist for f²/4c < α < 0; the unstable periodic orbit dies in a subcritical Hopf bifurcation at α = 0. The fixed point is stable (resp., unstable) for α < 0 (resp., α > 0). (d) Homoclinic bifurcation. A homoclinic orbit exists at μ = 0, giving rise to a stable periodic orbit for μ > 0.
(which we henceforth denote (x, y)) with associated PRCs ∂θ/∂x and ∂θ/∂y, it is necessary to undo these coordinate transformations. However, since the normal form coordinate transformations affect only nonlinear terms, we obtain the simple relationship

∂θ/∂V = ν_x ∂θ/∂x + ν_y ∂θ/∂y + O(x, y),   (3.1)

where

ν_x = (∂x/∂V)|_{x=y=0},  ν_y = (∂y/∂V)|_{x=y=0}.
The remainder term in equation 3.1 is assumed to be small near the bifurcations of relevance and is neglected below. This introduces vanishing error in the Hopf case, in which the bifurcating periodic orbits have arbitrarily small radii; the same is true near SNIPER and homoclinic bifurcations, where periodic orbits spend arbitrarily large fractions of their period near the origin. When using the Bautin normal form, however, we must tacitly assume that the nonzero onset radius of stable bifurcating orbits is small; failure of this assumption for the HH model may contribute to the discrepancy between PRCs derived by analytical and numerical methods (see section 3.3). Before proceeding, a few notes regarding the normal form equations that we will consider are in order. For the SNIPER bifurcation, we consider the normal form for a saddle-node bifurcation of fixed points, which must be properly embedded globally in order to capture the presence of the periodic orbit (the unstable branch of the center manifold must close up and limit on the saddle node; cf. Figure 2a). For the saddle-node bifurcation of periodic orbits, we appeal to the sequence of bifurcations for type II neurons such as the HH model (Hodgkin & Huxley, 1952), namely, a subcritical Hopf bifurcation in which an unstable periodic orbit branch bifurcates from the rest state, turns around, and gains stability in a saddle-node bifurcation of periodic orbits (Rinzel & Miller, 1980). This sequence is captured by the normal form of the Bautin (degenerate Hopf) bifurcation (Kuznetsov, 1998; cf. Guckenheimer & Holmes, 1983, sec. 7.1). Finally, for the homoclinic bifurcation, we consider only the linearized flow near the fixed point involved in the bifurcation. This is not strictly a normal form, and as for the SNIPER bifurcation, a proper global return interpretation is necessary to produce the periodic orbit.
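The Bautin sequence of orbits can be made concrete. A minimal sketch assuming the radial normal form ṙ = αr + f r³ + c r⁵ with f > 0 and c < 0 (illustrative coefficients, not values reduced from the HH model): the periodic-orbit radii solve c r⁴ + f r² + α = 0 and coalesce in a saddle-node of limit cycles at α = f²/4c.

```python
import math

F, C = 1.0, -1.0                 # illustrative coefficients (C < 0 < F)
ALPHA_SN = F**2 / (4 * C)        # saddle-node of periodic orbits, here -0.25

def bautin_radii(alpha, f=F, c=C):
    """Radii of the periodic orbits of r' = alpha*r + f*r^3 + c*r^5.

    Returns (r_unstable, r_stable), with None for an orbit that does
    not exist at this alpha."""
    disc = f * f - 4 * c * alpha
    if disc < 0:                           # alpha < f^2/(4c): no orbits
        return None, None
    r2_inner = (-f + math.sqrt(disc)) / (2 * c)   # smaller root (c < 0)
    r2_outer = (-f - math.sqrt(disc)) / (2 * c)   # larger root
    r_unstable = math.sqrt(r2_inner) if r2_inner > 0 else None
    r_stable = math.sqrt(r2_outer) if r2_outer > 0 else None
    return r_unstable, r_stable

def g_prime(r, alpha, f=F, c=C):
    """d/dr of the radial vector field; its sign gives orbit stability."""
    return alpha + 3 * f * r**2 + 5 * c * r**4

ru, rs = bautin_radii(-0.15)     # f^2/4c < alpha < 0: both orbits coexist
```

For α slightly above f²/4c the unstable inner orbit and stable outer orbit coexist with the stable fixed point; the unstable orbit shrinks to zero radius at the subcritical Hopf point α = 0, reproducing the sequence in Figure 2c.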
Near the SNIPER, Hopf, and Bautin local bifurcations, there is a separation of timescales between dynamics along versus dynamics normal to the one- or two-dimensional attracting center manifold containing (or, in the SNIPER case, consisting of) the periodic orbit. In particular, sufficiently close to the bifurcation point, the time required for perturbed solutions to collapse back onto the manifold is negligible compared with the period of
Phase Reduction and Response Dynamics
683
the orbit. This implies that as the bifurcation is approached, the tangent space of any N − 1 dimensional isochron (computed at its intersection with the periodic orbit) becomes normal to the corresponding tangent space of the center manifold. Thus, sufficiently near these three bifurcations, the only relevant contributions that perturbations make to asymptotic trajectories are through their components along the center manifold, as captured by the above terms ν_x and (additionally for the Hopf and Bautin bifurcations) ν_y. Hence equation 3.1 captures the phase response curve for the full N-dimensional system. For the homoclinic global bifurcation, the same conclusion holds, although for a different reason: in this case, there is no low-dimensional center (i.e., locally slow) manifold. However, because the dynamics that asymptotically determine the PRC are linear for the homoclinic bifurcation (unlike the SNIPER, Hopf, and Bautin cases), a PRC valid for full N-dimensional systems can still be computed analytically, as described below.

We use the direct method of section 2.2.1 to compute PRCs from the normal form equations. This involves linearizing about the stable periodic orbit, which is appropriate because the perturbations ΔV to be considered are vanishingly small. The explicit solution of the normal form equations yields Δθ, and taking limits, we obtain the PRC (cf. equation 2.8). Without loss of generality, the voltage peak (spike) phase is set at θ_s = 0, and coordinates are defined so that phase increases at a constant rate ω in the absence of external inputs, as in section 2.1. Analogs of some of the following results have been previously derived by alternative methods, as noted in the text, and we also note that PRCs for relaxation oscillators have been discussed in Izhikevich (2000b). However, unlike the previous work, here we explicitly compute how the PRCs scale with oscillator frequency.

3.1.1 Saddle Node in a Periodic Orbit (SNIPER).
A SNIPER bifurcation occurs when a saddle-node bifurcation of fixed points takes place on a periodic orbit (see Figure 2a). Following the method of Ermentrout (1996, where details of the calculation may be found), we ignore the direction(s) transverse to the periodic orbit and consider the one-dimensional normal form for a saddle-node bifurcation of fixed points,

ẋ = η + x²,    (3.2)

where x may be thought of as local arc length along the periodic orbit. For η > 0, the solution of equation 3.2 traverses any interval in finite time; as in Ermentrout (1996), the period T of the orbit may be approximated by calculating the total time necessary for the solution of equation 3.2 to go from x = −∞ to x = +∞ and making the solution periodic by resetting x to −∞ every time it fires at x = +∞. This gives T = π/√η, hence ω = 2√η.
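The period scaling just derived is easy to check numerically. The sketch below is illustrative only: it uses η = 0.25 and finite integration bounds standing in for ±∞, and integrates the normal form by forward Euler.

```python
import numpy as np

# Euler integration of the SNIPER normal form xdot = eta + x**2 from a
# large negative to a large positive value of x; the elapsed time
# approximates the period T = pi / sqrt(eta), so omega = 2 * sqrt(eta).
eta = 0.25
x, t, dt = -50.0, 0.0, 1e-4     # +/-50 stands in for +/-infinity
while x < 50.0:
    x += dt * (eta + x * x)
    t += dt

T_numeric = t
T_theory = np.pi / np.sqrt(eta)
omega_numeric = 2 * np.pi / T_numeric
```

Because the passage time is dominated by the slow region near x = 0, the truncation of the infinite interval costs less than a percent here.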
684
E. Brown, J. Moehlis, and P. Holmes
Since equation 3.2 is one-dimensional, Ermentrout (1996) immediately computes

∂θ/∂x = ω ∂t/∂x = ω / (dx/dt),    (3.3)

where dx/dt is evaluated on the solution trajectory of equation 3.2. This gives

∂θ/∂x = (2/ω)[1 − cos θ],    (3.4)

as first derived in Ermentrout (1996), but with explicit ω-dependence displayed here. Considering a voltage perturbation ΔV, we have

z_SN(θ) = ∂θ/∂V = (c_SN/ω)[1 − cos θ],    (3.5)
where c_SN = 2ν_x is a model-dependent constant (see equation 3.1). Note that ∂θ/∂V is nonnegative or nonpositive according to the sign of c_SN. Since in type I neuron models (Ermentrout, 1996) a positive voltage perturbation advances phase (and hence causes the neuron to fire sooner), in the following we will generally assume c_SN to be positive.

3.1.2 Generalized and Supercritical Hopf Bifurcations. The normal form for the (generalized) Hopf bifurcation (Guckenheimer & Holmes, 1983; Kuznetsov, 1998) is

ż = (α + iβ)z + (c + id)|z|²z + (f + ig)|z|⁴z;    (3.6)

in polar coordinates, this is

ṙ = αr + cr³ + fr⁵,    (3.7)
φ̇ = β + dr² + gr⁴.    (3.8)
We study two cases, always treating α as the bifurcation parameter. In the first case, we assume c < 0, yielding a supercritical Hopf bifurcation. For α < 0, there is a stable fixed point at the origin that loses stability as α increases through zero, giving birth to a stable periodic orbit with radius r_po,H = √(−α/c) (see Figure 2b). Crucially, r_po,H = 0 when α = 0, so that only terms of cubic order in equations 3.7 and 3.8 are required to capture (unfold) the supercritical Hopf dynamics. Hence we may set g = f = 0 for a local analysis. In the second case, we assume c > 0, so that equations 3.7 and 3.8 have a subcritical Hopf bifurcation at α = 0, and there is no stable periodic orbit
for any value of α when g = f = 0. Hence, we must reintroduce these terms to capture the relevant dynamics. Assuming additionally that f < 0, for α < 0 there is a stable fixed point at the origin that loses stability in a subcritical Hopf bifurcation at α = 0, giving rise to an unstable periodic orbit as α decreases through zero. The branch of unstable periodic orbits turns around at a saddle-node bifurcation of periodic orbits at α = c²/(4f); for α > c²/(4f), stable periodic solutions exist with radius r_po,B = [(1/(2f))(−c − √(c² − 4αf))]^{1/2} (see Figure 2c). This is the generalized Hopf or Bautin bifurcation (identified by the subscript B).

In either case, the angular speed is constant on the stable periodic orbit; hence, we set the asymptotic phase θ equal to the polar angle φ on the periodic orbit itself. (However, radial level sets of φ extending off the periodic orbit are not isochrons, since φ̇ varies with r.) We calculate the PRC by linearizing about the attracting periodic orbit r_po. Letting r = r_po + r′, we obtain ṙ′ = λr′ + O(r′²), where λ is the transverse Floquet exponent (eigenvalue) for the stable periodic orbit. In the supercritical Hopf bifurcation, λ = λ_H = −2α < 0 and r_po = r_po,H; in the Bautin, λ = λ_B = (1/f)(c² − 4αf + c√(c² − 4αf)) < 0 and r_po = r_po,B. Here and below, we drop terms of O(r′²) because we are concerned with arbitrarily small perturbations (cf. equation 2.8). Solving the linearized radial equation with initial condition r(0) = r₀, we obtain

r(t) = r_po + (r₀ − r_po) e^{λ_j t},    (3.9)

with j = H or B. Next, integrating equation 3.8 yields

φ(t) = ∫ dφ = ∫₀ᵗ [β + d r(s)² + g r(s)⁴] ds,    (3.10)

and taking φ(0) = φ₀, substituting equation 3.9 in 3.10, letting t → ∞, and dropping terms of O(r′²), we obtain the phase θ associated with the initial condition (r₀, φ₀):

θ(t) = φ₀ + (β + d r_po² + g r_po⁴) t − 2 r_po (d + 2g r_po²)(r₀ − r_po)/λ_j.    (3.11)
Here we have again used the fact that the polar angle φ and the phase θ are identical on the periodic orbit.

Suppose that we start with an initial condition (x_i, y_i) on the periodic orbit, with polar coordinates (r_po, φ_i). As t → ∞, the trajectory with this initial condition has asymptotic phase φ_i + (β + d r_po² + g r_po⁴) t. Now consider a perturbation Δx in the x-direction to (x_f, y_f) = (r_po cos φ_i + Δx, r_po sin φ_i). To lowest order in Δx, this corresponds, in polar coordinates, to

(r_f, φ_f) = (r_po + cos φ_i Δx, φ_i − (sin φ_i / r_po) Δx).
Setting (r₀, φ₀) = (r_f, φ_f) in equation 3.11 and subtracting the analogous expression with (r₀, φ₀) = (r_po,j, φ_i), j = H or B, we compute the change in asymptotic phase due to this perturbation,

∂θ/∂x = −[(2d r_po,j + 4g r_po,j³)/λ_j] cos θ − (1/r_po,j) sin θ,    (3.12)

where we have substituted θ for the polar angle φ_i, again using the fact that the two variables take identical values on the periodic orbit. Similarly, we find

∂θ/∂y = −[(2d r_po,j + 4g r_po,j³)/λ_j] sin θ + (1/r_po,j) cos θ.    (3.13)
We now express r_po,j and λ_j in terms of the frequencies of the periodic orbits. In the supercritical Hopf case (recall that we set g = f = 0 here), at the bifurcation point the phase frequency ω is φ̇ = ω_H = β, and from equation 3.8 we have ω − ω_H = d r_po,H², yielding

r_po,H = √(|ω − ω_H|) / √(|d|).    (3.14)

Substituting for r_po,H, we have ω − ω_H = −αd/c, which together with the expression for λ_H gives

λ_H = (2c/d)(ω − ω_H).    (3.15)

In the Bautin case, we find that

ω − ω_SN = −(d/(2f) + gc/(2f²)) √(c² − 4αf) + (g/(4f²))(c² − 4αf),    (3.16)

where ω_SN is the frequency of the periodic orbit at the saddle-node bifurcation (α = c²/(4f)). Thus, from equation 3.16,

√(c² − 4αf) = k|ω − ω_SN| + O(|ω − ω_SN|²),    (3.17)

where k = |2f²/(fd − gc)|, and we may use the expressions for r_po,B and λ_B to compute

r_po,B = √(−c/(2f)) + O(|ω − ω_SN|),    (3.18)

λ_B = (ck/f)|ω − ω_SN| + O(|ω − ω_SN|²).    (3.19)
Next, we substitute these expressions (equations 3.14, 3.15 and 3.18, 3.19) for r_po and λ into equations 3.12 and 3.13. For the supercritical Hopf case, this gives

∂θ/∂x = (1/√(|ω − ω_H|)) (√(|d|)/|c|) [d cos θ + c sin θ],    (3.20)

∂θ/∂y = (1/√(|ω − ω_H|)) (√(|d|)/|c|) [d sin θ − c cos θ].    (3.21)

In the Bautin case, we get

∂θ/∂x = (1/|ω − ω_SN|) [−2d √(−c/(2f)) − 4g (−c/(2f))^{3/2}] (f/(ck)) cos θ + O(1),    (3.22)

∂θ/∂y = (1/|ω − ω_SN|) [−2d √(−c/(2f)) − 4g (−c/(2f))^{3/2}] (f/(ck)) sin θ + O(1),    (3.23)
where we have explicitly written the terms of O(|ω − ω_SN|⁻¹), which dominate near the saddle node of periodic orbits. Note that the only term involving the bifurcation parameter α is the prefactor, so that as this parameter is varied, all other terms in equations 3.22 and 3.23 remain constant.

Equipped with equations 3.20 and 3.21, the PRC for a perturbation in the V-direction near a supercritical Hopf bifurcation is found from equation 3.1 to be

z_H(θ) = ∂θ/∂V = (c_H/√(|ω − ω_H|)) sin(θ − φ_H),    (3.24)

where the constant c_H = (√(|d|)/|c|) √((ν_x c + ν_y d)² + (ν_x d − ν_y c)²) and the phase shift φ_H = tan⁻¹((ν_y c − ν_x d)/(ν_x c + ν_y d)). The form of this PRC was originally presented as equation 2.11 of Ermentrout and Kopell (1984). See that article, as well as section 4 of Ermentrout (1996) and Hoppensteadt and Izhikevich (1997), for earlier, alternative methods and computations for the PRC near the supercritical Hopf bifurcation. For the Bautin bifurcation, we similarly arrive at

z_B(θ) = ∂θ/∂V = (c_B/|ω − ω_SN|) sin(θ − φ_B).    (3.25)

Here c_B = [−2d √(−c/(2f)) − 4g (−c/(2f))^{3/2}] (f/(ck)) √(ν_x² + ν_y²) is a constant (which can be positive or negative depending on d and g), and φ_B = tan⁻¹(ν_y/ν_x) is an ω-independent phase shift.
3.1.3 Homoclinic Bifurcation. Finally, suppose that the neuron model has a parameter μ such that a homoclinic orbit to a hyperbolic saddle point p with real eigenvalues exists at μ = 0. Then there will be a periodic orbit γ for, say, μ > 0, but not for μ < 0. Specifically, we assume a single unstable eigenvalue λ_u smaller in magnitude than all of the stable eigenvalues, λ_u < |λ_s,j|, so that the bifurcating periodic orbit is stable (Guckenheimer & Holmes, 1983; see Figure 2d). If parameters are chosen close to the homoclinic bifurcation, solutions near the periodic orbit spend most of their time near p, where the vector field is dominated by its linearization. This may generically be written in the diagonal form:

ẋ = λ_u x,    (3.26)
ẏ_j = λ_s,j y_j,   j = 1, …, N − 1,    (3.27)

where the x and y_j axes are tangent to the unstable and a stable manifold of p, respectively, and λ_s,j < 0 < λ_u are the corresponding eigenvalues. For simplicity, we assume here that the segments of the axes shown in Figure 3 are actually contained in the respective manifolds; this can always be achieved locally by a smooth coordinate change (Guckenheimer & Holmes, 1983). We define the box B = [0, Δ] × ⋯ × [0, Δ] that encloses γ for the dominant part of its period, but within which equations 3.26 and 3.27 are still a good approximation; Δ is model dependent but fixed for the different periodic orbits occurring as a bifurcation parameter varies within the model. We do not explicitly model γ outside of B, but note that the trajectory is reinjected
Figure 3: The setup for deriving the PRC for oscillations near a homoclinic bifurcation, shown (for simplicity) with N = 2.
after negligible time (compared with that spent in B) at a distance ε from the stable manifold, where ε varies with the bifurcation parameter μ (see Figure 3). Thus, periodic orbits occurring closer to the bifurcation point correspond to lower values of ε and have larger periods. We approximate the period T(ε) as the time that the x coordinate of γ takes to travel from ε to Δ under equation 3.26:

T(ε) = (1/λ_u) ln(Δ/ε).    (3.28)

Notice that the x-coordinate of γ alone determines T(ε), and hence may be thought of as independently measuring the phase of γ through its cycle. We set θ = 0 at x = ε and, assuming instantaneous reinjection, θ = 2π at x = Δ. Then ω = 2π/T(ε), and as in equation 3.3,

∂θ/∂x = ω/(dx/dt) = ω/(λ_u x(θ)) = (ω/(λ_u ε)) exp(−λ_u θ/ω).    (3.29)

In the final equality, we used the solution of equation 3.26, x(t) = ε exp(λ_u t), with the substitution t = θ/ω. Since, as remarked above, motion in the y_j directions does not affect the phase of γ, only components of a perturbation ΔV along the x-axis contribute to the phase response curve; thus, the PRC is z_HC = ∂θ/∂V = ν_x ∂θ/∂x, where ν_x is as defined following equation 3.1. Using equation 3.28, ε = Δ exp(−2π λ_u/ω), which allows us to eliminate ε from equation 3.29:

z_HC(θ) = ∂θ/∂V = c_hc ω exp(2π λ_u/ω) exp(−λ_u θ/ω),    (3.30)

where c_hc = ν_x/(λ_u Δ) is a model-dependent constant. This is an exponentially decaying function of θ with maximum

z_max = c_hc ω exp(2π λ_u/ω)    (3.31)

and minimum

z_min = z_max exp(−2π λ_u/ω) = c_hc ω.    (3.32)
Here and below we assume c_hc > 0. z_HC is discontinuous at the spike point θ_s = 2π, which forces us to take a limit in defining population-averaged firing rates below but does not otherwise affect the following analysis.
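Equation 3.30 can be checked against the direct method applied to the one-dimensional flow ẋ = λ_u x itself. The sketch below uses illustrative values (λ_u = 1, Δ = 1, ε = 10⁻³) and assumes ν_x = 1, so c_hc = 1/(λ_u Δ); it measures the phase advance produced by a small kick Δx at each phase.

```python
import numpy as np

# Direct-method PRC for the linearized homoclinic flow xdot = lam_u * x,
# with reinjection at x = eps and "spike" at x = Delta (nu_x = 1 assumed).
lam_u, Delta, eps, dx = 1.0, 1.0, 1e-3, 1e-8

T = np.log(Delta / eps) / lam_u        # period, eq. 3.28
omega = 2 * np.pi / T

def prc_direct(theta):
    x = eps * np.exp(lam_u * theta / omega)       # state at phase theta
    t_rem = np.log(Delta / x) / lam_u             # time left until spike
    t_kicked = np.log(Delta / (x + dx)) / lam_u   # after a kick of size dx
    return omega * (t_rem - t_kicked) / dx        # phase advance per unit kick

thetas = np.linspace(0.1, 2 * np.pi - 0.1, 50)
z_num = np.array([prc_direct(th) for th in thetas])
# eq. 3.30 with c_hc = 1 / (lam_u * Delta):
z_th = (omega / (lam_u * Delta)) * np.exp(2 * np.pi * lam_u / omega) \
       * np.exp(-lam_u * thetas / omega)
```

The agreement is essentially exact because the flow used for the check is the same linear flow from which equation 3.30 was derived; for a full neuron model, one would instead perturb the conductance-based equations numerically.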
3.2 One-Dimensional Neuron Models. Generalized integrate-and-fire models have the form

V̇ = F(V) + G(V, t),    (3.33)

where V(t) is constrained to lie between a reset voltage V_r and a threshold V_th, and the following reset dynamics are externally imposed: if V(t) crosses V_th from below, a spike occurs, and V(t) is reset to V_r. Here, nothing is lost in transforming to the single phase equation, 2.6; in particular, the error term of equation 2.5 does not apply. In fact, as noted in Ermentrout (1981), the crucial quantity ∂θ/∂V can be found directly from equation 3.33 with G(V, t) ≡ 0:

z(θ) = ∂θ/∂V = ω ∂t/∂V = ω / F(V(θ)),    (3.34)
where we recall that θ is defined such that θ̇ = ω. In the next two subsections, we compute phase response curves for two simple integrate-and-fire models.

3.2.1 Integrate-and-Fire Neuron. We first consider the simplest possible integrate-and-fire (IF) model:

C V̇ = I_b + I(t);   V_r = 0,  V_th = 1,    (3.35)

where I_b is the baseline current, C is membrane capacitance, and G(V, t) = I(t). Hereafter we set C = 1 for the IF model. The angular frequency of a baseline (I(t) = 0) oscillation is ω = 2π I_b, and equation 3.34 gives

z_IF(θ) = ω / F(V(θ)) = ω / I_b ≡ 2π.    (3.36)
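A direct check of equation 3.36 (with an illustrative baseline current and an arbitrary small kick dV): a perturbation dV advances the IF spike time by dV/I_b, and hence the phase by ω dV/I_b = 2π dV, independent of θ and ω.

```python
import math

# Direct check that the IF PRC equals 2*pi at every phase (C = 1).
Ib, dV = 0.7, 1e-6
T = 1.0 / Ib                    # V(t) = Ib * t reaches threshold 1 at t = T
omega = 2 * math.pi * Ib

def prc_if(theta):
    t_now = theta / omega
    V = Ib * t_now                                   # voltage at phase theta
    t_spike_kicked = t_now + (1.0 - (V + dV)) / Ib   # spike time after kick
    return omega * (T - t_spike_kicked) / dV         # phase advance per unit kick

z_vals = [prc_if(th) for th in (0.5, 2.0, 5.0)]
```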
Thus, the IF PRC is constant in θ and frequency independent.

3.2.2 Leaky IF Neuron. Next, we consider the leaky integrate-and-fire (LIF) model:

C V̇ = I_b + g_L (V_L − V) + I(t);   V_r = 0,  V_th = 1 < V_L + I_b/g_L,    (3.37)

where I_b is the baseline current, g_L > 0 and V_L are the leak conductance and reversal potential, C is the capacitance, and G(V, t) = I(t). As above, we also set C = 1 for this model. We assume I_b ≥ g_L (1 − V_L) so that when I(t) = 0, the neuron fires periodically with frequency

ω = 2π [g_L ln((I_b + g_L V_L)/(I_b + g_L V_L − g_L))]⁻¹.    (3.38)
This expression shows how I_b enters as a bifurcation parameter, with I_b = g_L (1 − V_L) corresponding to the bifurcation point at which ω = 0. Solving equation 3.37 for V(t) with initial condition V(0) = V_r = 0, and then using θ = ωt and equation 3.34, gives

z_LIF(θ) = (ω/g_L) (1 − exp(−2π g_L/ω)) exp(g_L θ/ω),    (3.39)

equivalent to formulas previously derived in van Vreeswijk et al. (1994) and Lewis and Rinzel (2003). Thus, the PRC for the LIF model is an exponentially increasing function of θ, with a maximum that decreases with ω:

z_max(ω) = (ω/g_L) (exp(2π g_L/ω) − 1),    (3.40)

and minimum

z_min(ω) = z_max exp(−2π g_L/ω) = (ω/g_L) (1 − e^{−2π g_L/ω}).    (3.41)
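The closed-form LIF PRC can likewise be recovered by the direct method, using the explicit solution V(t) = V_∞(1 − e^{−g_L t}) with V_∞ = (I_b + g_L V_L)/g_L. The parameter values below are illustrative only.

```python
import numpy as np

# Direct-method PRC for the LIF model (C = 1): V' = Ib + gL*(VL - V),
# reset Vr = 0, threshold Vth = 1; compared against eq. 3.39.
Ib, gL, VL, dV = 1.0, 0.5, 0.0, 1e-8
Vinf = (Ib + gL * VL) / gL           # asymptotic voltage; must exceed Vth = 1

T = np.log(Vinf / (Vinf - 1.0)) / gL     # period, consistent with eq. 3.38
omega = 2 * np.pi / T

def t_to_spike(V):
    return np.log((Vinf - V) / (Vinf - 1.0)) / gL

def prc_direct(theta):
    V = Vinf * (1.0 - np.exp(-gL * theta / omega))   # voltage at phase theta
    return omega * (t_to_spike(V) - t_to_spike(V + dV)) / dV

thetas = np.linspace(0.0, 2 * np.pi, 40)
z_num = np.array([prc_direct(th) for th in thetas])
z_th = (omega / gL) * (1 - np.exp(-2 * np.pi * gL / omega)) * np.exp(gL * thetas / omega)
```

Note that the ratio z_max/z_min recovered numerically equals exp(2π g_L/ω), as equations 3.40 and 3.41 require.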
Recall that the PRC near a homoclinic bifurcation is also an exponential function but with opposite slope: this is because both the essential dynamics near a homoclinic bifurcation and the LIF dynamics are linear, while trajectories accelerate following spikes in the homoclinic case and decelerate in the LIF. This is our final analytical PRC calculation; we summarize the results derived above in Table 1 and Figure 4.

Table 1: Phase Response Curves for the Different Neuron Models.

Bifurcation   | z(θ)                                        | z_max                      | z_min
SNIPER        | (c_SN/ω)[1 − cos θ]                         | 2 c_SN/ω                   | 0
Hopf          | (c_H/√|ω − ω_H|) sin(θ − φ_H)               | c_H/√|ω − ω_H|             | −c_H/√|ω − ω_H|
Bautin        | (|c_B|/|ω − ω_SN|) sin(θ − φ_B) + O(1)      | |c_B|/|ω − ω_SN| + O(1)    | −|c_B|/|ω − ω_SN| + O(1)
Homoclinic    | c_hc ω e^{2πλ_u/ω} e^{−λ_u θ/ω}             | c_hc ω e^{2πλ_u/ω}         | c_hc ω
IF            | 2π                                          | 2π                         | 2π
LIF           | (ω/g_L)(1 − e^{−2πg_L/ω}) e^{g_L θ/ω}       | (ω/g_L)(e^{2πg_L/ω} − 1)   | (ω/g_L)(1 − e^{−2πg_L/ω})
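The frequency scalings collected in Table 1 can be made concrete by evaluating z_max(ω) directly; in the sketch below the model constants (c_SN, g_L) are set to 1 purely for illustration. The SNIPER and LIF maxima fall monotonically as ω grows, while the IF PRC is flat.

```python
import numpy as np

# omega-scaling of PRC maxima read off Table 1, with model constants
# set to 1 for illustration.
omegas = np.linspace(0.5, 2.0, 20)
c_sn, gL = 1.0, 1.0

zmax_sniper = 2 * c_sn / omegas                                   # ~ 1/omega
zmax_lif = (omegas / gL) * (np.exp(2 * np.pi * gL / omegas) - 1)  # decreasing
zmax_if = np.full_like(omegas, 2 * np.pi)                         # constant
```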
[Figure 4 panels: z(θ) versus θ for the HR (SNIPER), ML (homoclinic), FN (Hopf), IF, HH (Bautin), and LIF models.]
Figure 4: PRCs for the various neuron models, from the formulas of section 3 and numerically computed using XPP (Ermentrout, 2002), all with θ_s = 0. The relevant bifurcations are noted where applicable. Dot-dashed, dashed, and dotted curves for each model correspond to increasing frequencies, respectively: HR: ω = 0.0102, 0.0201, 0.0316 rad/msec (corresp. 1.62, 3.20, 5.03 Hz); FN: ω = 0.204, 0.212, 0.214 (corresp. 32.5, 33.7, 34.1 Hz); HH: ω = 0.339, 0.355, 0.525 rad/msec (corresp. 54.2, 56.5, 83.6 Hz); ML: ω = 0.0572, 0.0664, 0.0802 rad/msec (corresp. 9.10, 10.6, 12.8 Hz); IF: (any frequency); LIF: ω = 0.419, 0.628, 1.26 rad/msec (corresp. 66.7, 100, 200 Hz). For the LIF model, g_L = 0.110. Normal forms 3.5, 3.24, 3.25, and 3.30 for the PRCs closest to bifurcation are shown solid (scale factors c_i fit by least squares); the IF and LIF PRCs are exact. PRC magnitudes decrease with ω for the HR, HH, ML, and LIF models; are constant for the IF model; and increase with ω for the FN model. The phase shifts φ_H and φ_B are chosen as π (yielding z(θ_s) = 0; see section 2.3). The inset to the ML plot displays the same information on a log scale, demonstrating exponential decay.
[Figure 5 panels: z_max versus ω for the HR (SNIPER), ML (homoclinic), FN (Hopf), IF, HH (Bautin), and LIF models.]
Figure 5: Scaling of z_max with ω for the various neuron models (and hence scaling of the population response f^d_max − f_b with ω; see section 4.2). Asterisks are numerical values from the PRCs of the full HR, FN, HH, and ML models, and curves show predictions of the normal forms 3.5, 3.25, 3.24, and 3.30 with least-squares fits (of PRC maxima) for scale factors. Results for the IF and LIF models are exact.
3.3 Accuracy of the Analytical PRCs. The range of parameters over which the PRCs of the full neuron models are well approximated by the analytical expressions derived above varies from model to model. One overall limitation, noted in Izhikevich (2000a), is that normal form calculations for the Bautin and supercritical Hopf bifurcations ignore the relaxation nature of the dynamics of typical neural oscillators. However, the analytical PRCs (equations 3.5, 3.24, 3.25, and 3.30) are qualitatively, and in many cases quantitatively, correct. Figure 4 compares these formulas with PRCs calculated using XPP (Ermentrout, 2002) for the HR, FN, HH, and ML models near the relevant bifurcations (PRCs for the integrate-and-fire models are exact). The companion Figure 5 demonstrates the scaling of PRC maxima with baseline frequency, which is also correctly predicted by the normal form analysis. Frequencies ω were varied by changing the bifurcation parameter: baseline inward current I_b. Here and elsewhere, the neural models are as given in Rose and Hindmarsh (1989), Murray (2002), Hodgkin and Huxley (1952), and Rinzel and Ermentrout (1989). All parameter values used here are reproduced along with the equations in appendix C. Finally, looking forward to the next section, we note that the analytical PRCs derived here will correctly predict key qualitative aspects of population responses to stimuli.

4 Probabilistic Analysis of Firing Rates

4.1 A Phase Density Equation. We now describe how time-dependent firing rates in response to external stimuli emerge from averages of oscillator population dynamics with appropriate initial conditions. Let ρ(θ, t) denote the probability density of solutions of equation 2.7. Thus, ρ(θ, t)dθ is the probability that a neuron's phase in an arbitrary trial lies in the interval [θ, θ + dθ] at time t. This density evolves via the advection equation:

∂ρ(θ, t)/∂t = −(∂/∂θ) [v(θ, t) ρ(θ, t)].    (4.1)
Boundary conditions are periodic in the probability flux: v(0, t) ρ(0, t) = lim_{ψ→2π} v(ψ, t) ρ(ψ, t), which reduces to ρ(0, t) = ρ(2π, t) for smooth phase response curves z. A related phase density approach is used in Tass (1999) and Ritt (2003), and we first derived the solution in Brown et al. (2003b). In the presence of noise, there is an additional diffusion term in equation 4.1 (Stein, 1965; Tass, 1999; Brown et al., 2003b). Multiple trials in which stimuli are not keyed to oscillator states may be modeled by averaging solutions of the linear PDE equation 4.1 over suitably distributed initial conditions; since (unmodeled) noise and variable and/or drifting frequencies tend to distribute phases uniformly in the absence of stimuli, we set ρ₀ ≡ 1/2π. Histograms of firing times may then be extracted by noting that firing probabilities for arbitrary cells at time t are equal to the passage rate of the probability density through the spike phase, that is, the
probability flux

f(t) = lim_{ψ→θ_s⁻} v(ψ, t) ρ(ψ, t) = lim_{ψ→θ_s⁻} [ω + z(ψ) I(t)] ρ(ψ, t).    (4.2)

The limit from below allows for discontinuities in z(θ) (as in the homoclinic and LIF PRCs), since the relevant quantity is flux across the spike threshold from lower values of V and hence from lower values of θ. If the PRC z(θ) and hence ρ(θ, t) are continuous at θ_s, equation 4.2 simply becomes f(t) = [ω + z(θ_s) I(t)] ρ(θ_s, t). We emphasize that expression 4.2 equally describes the average firing rate of an entire uncoupled population on a single trial or the average firing rate of single neurons drawn from such a population over many sequential trials, as in Herrmann and Gerstner (2001), or a combination of both.

4.2 Patterns of Firing Probabilities and Conditions for Refractory Periods. Equation 4.1 can be explicitly solved for piecewise constant stimuli of duration d = t₂ − t₁: I(t) = Ī for t₁ ≤ t ≤ t₂ and I(t) = 0 otherwise. (Here and elsewhere, we assume Ī > 0 unless explicitly noted.) Specifically, the method of characteristics (Whitham, 1974; pp. 97–100 of Evans, 1998) yields

ρ(θ, t) = ρ₀(Θ_{θ,t}(0)) exp(−∫₀ᵗ (∂v/∂θ)(Θ_{θ,t}(t′), t′) dt′)
        = (1/2π) exp(−Ī ∫_{t₁}^{t̃₂} z′[Θ_{θ,t}(s)] ds),    (4.3)

where t ≥ t₁, t̃₂ = min(t, t₂), and we take the initial condition ρ₀ = ρ(θ, 0) = 1/2π. Here, Θ_{θ,t}(s) lies on the characteristic curve given by

dΘ_{θ,t}(s)/ds = v(Θ_{θ,t}(s), s),    (4.4)

with end point condition Θ_{θ,t}(t) = θ. When Θ_{θ,t}(s) coincides with a discontinuity in z, the integrands in equation 4.3 are not defined, and we must appeal to the continuity of probability flux or, equivalently, to the following change of variables.

We now simplify expression 4.3. Using the fact that v(Θ_{θ,t}(s), s) = ω + Ī z(Θ_{θ,t}(s)) for t₁ ≤ s ≤ t₂, and changing variables from s to Θ_{θ,t}(s),

∫_{t₁}^{t̃₂} z′[Θ_{θ,t}(s)] ds = ∫_{Θ_{θ,t}(t₁)}^{Θ_{θ,t}(t̃₂)} z′[Θ_{θ,t}(s)] dΘ_{θ,t}(s) / (ω + Ī z(Θ_{θ,t}(s)))
                              = (1/Ī) ln[(ω + Ī z(Θ_{θ,t}(t̃₂)))/(ω + Ī z(Θ_{θ,t}(t₁)))],    (4.5)
so that

ρ(θ, t) = (1/2π) [(ω + Ī z(Θ_{θ,t}(t₁)))/(ω + Ī z(Θ_{θ,t}(t̃₂)))].    (4.6)
This expression is valid everywhere it is defined. To obtain the terms in equation 4.6, we integrate equation 4.4 backward in time from the final condition at s = t until s = t₁ or s = t̃₂; this may be done analytically for the normal form PRCs of section 3 or numerically for PRCs from full neuron models. The integration yields the PRC-independent expression

Θ_{θ,t}(t̃₂) = θ − ω(t − t̃₂)    (4.7)

for all neuron models, while Θ_{θ,t}(t₁) is model dependent via the PRC. Note that while the stimulus is on (i.e., t₁ ≤ t ≤ t₂), t̃₂ = t, so that Θ_{θ,t}(t̃₂) = θ. After the stimulus turns off, v(θ, t) is independent of θ, and ρ is constant along curves of constant θ − ωt. Thus, for t > t₂, ρ(θ, t) is simply a traveling wave rotating with frequency ω, with ρ(θ, t₂) determining the phase density. From the definition 4.2, we have:

f(t) = lim_{ψ→θ_s} [(ω + z(ψ) I(t))/2π] [(ω + Ī z(Θ_{ψ,t}(t₁)))/(ω + Ī z(Θ_{ψ,t}(t̃₂)))].    (4.8)

Figure 6 shows examples of f(t) for the various neuron models, computed via equation 4.8 with both numerically and analytically derived PRCs z, as well as via numerical simulations of the full neuron models. The phase reduction 4.8 gives qualitative, and in some cases precise, matches to the full numerical data. We recall that the accuracy of phase reductions from full neuron models improves with weaker stimuli Ī and that the analytical PRCs better approximate their numerical counterparts as the bifurcation point is approached (i.e., as I_b is varied). Note that if lim_{ψ→θ_s} z(ψ) = 0, I(t) does not directly enter equation 4.8, so f(t) depends only on variations in ρ resulting from the stimulus. However, (I) if lim_{ψ→θ_s} z(ψ) ≠ 0, the firing probability f(t) jumps at stimulus onset and offset. See Figure 6, and recall that we set θ_s = 0. This is our first main result.

Some comments on the limit in equation 4.8 are appropriate. Since for all neuron models we always assume that v(θ) is positive and bounded and is defined except at isolated point(s), Θ_{ψ,t}(s) is a continuous function of ψ, s, and t. Nevertheless, as Θ_{ψ,t}(t₁) and Θ_{ψ,t}(t̃₂) pass through θ_s as t advances, discontinuities in z(·) give discontinuities in f(t), but the limit in equation 4.8 ensures that f(t) is always defined. As remarked above, if the PRC z(·) is a continuous function, then lim_{ψ→θ_s} z(ψ) = z(θ_s) and taking the limit is unnecessary.
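The construction in equations 4.4 through 4.8 is straightforward to implement. The sketch below uses illustrative parameters and simple backward Euler steps for the characteristics; it evaluates ρ(θ, t) during a step stimulus for the SNIPER PRC z(θ) = (c/ω)(1 − cos θ) and confirms that the density remains normalized.

```python
import numpy as np

# Equations 4.4 and 4.6 for a SNIPER PRC under a step stimulus Ibar on
# [t1, t2]; characteristics are integrated backward by Euler steps.
omega, c, Ibar, t1, t2 = 1.0, 1.0, 0.3, 0.0, 5.0

def z(theta):
    return (c / omega) * (1.0 - np.cos(theta))

def v(theta, t):
    return omega + z(theta) * (Ibar if t1 <= t <= t2 else 0.0)

def backward_theta(theta, t, s_stop, dt=1e-3):
    """Solve d(Theta)/ds = v(Theta, s) backward from Theta(t) = theta."""
    s, th = t, theta
    while s > s_stop:
        h = min(dt, s - s_stop)
        th -= h * v(th, s)
        s -= h
    return th

def rho(theta, t):
    """Equation 4.6 with rho_0 = 1/(2*pi)."""
    t2t = min(t, t2)
    th1 = backward_theta(theta, t, t1)
    th2 = backward_theta(theta, t, t2t)
    return (omega + Ibar * z(th1)) / (omega + Ibar * z(th2)) / (2 * np.pi)

# total probability is conserved at a time during the stimulus:
thetas = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
mass = np.mean([rho(th, 3.0) for th in thetas]) * 2 * np.pi
# eq. 4.8 at theta_s = 0, where z(0) = 0 for the SNIPER PRC:
f_during = omega * rho(0.0, 3.0)
```

Changing Ī or the stimulus duration moves the peaks of ρ around the circle but, by construction of equation 4.6, never changes the total mass.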
Figure 6: (a–f) Phase density ρ(θ, t) in gray scale (darker corresponding to higher values) (top) and firing probability f(t) in msec⁻¹ (bottom) for stimuli of length (3/2)P (indicated by black horizontal bars), from equations 4.6 and 4.8 via the method of characteristics. Dashed curves indicate f(t) from the normal form PRCs of equations 3.5, 3.24, 3.25, 3.30, 3.36, and 3.39; solid curves from numerical PRCs computed via XPP. Baseline frequencies and values of Ī for the HR, FN, HH, ML, IF, and LIF models are (0.0201, 0.212, 0.429, 0.08, 0.628, 0.628) rad/msec (corresp. 3.20, 33.7, 68.3, 12.7, 100, 100 Hz) and (0.1, 0.0015, 0.25, 0.0005, 0.05, 0.05) μA/cm², respectively. The vertical bars are PSTHs, numerically computed using the full conductance-based equations (see appendix C) with 10,000 initial conditions, with I_b set to match frequencies of the corresponding phase models. Initial conditions were generated by evolving the full equations for a (uniformly distributed) random fraction of their period, from a fixed starting point. Note that f(t) jumps discontinuously at stimulus onset and offset for the IF and LIF models, since for these models z(θ_s) ≠ 0 (point I in text). Also, during the stimulus, f(t) does not dip below the baseline value ω/2π for the HR, IF, and LIF models, because z_min ≈ 0 in these cases (point V).
While the stimulus is on, solutions to equation 4.4 are periodic with period

P = ∫₀^{2π} dθ / (ω + Ī z(θ))    (4.9)
Figure 7: (a–d) Firing probabilities f(t) for the HH and HR models, with stimulus characteristics chosen to illustrate the points in the text. Dashed and solid curves and vertical bars denote data obtained as in Figure 6. (a) A stimulus (Ī = 0.04 μA/cm²) of length exactly P = 232.50 msec (indicated by the horizontal black bar) for the HR model (ω = 0.0201 rad/msec) leaves no trace (point II). (b) A stimulus (Ī = 0.25 μA/cm²) of duration d_max = 11.46 msec for the HH model (ω = 0.429 rad/msec) yields maximum response after the stimulus has switched off (because z_min < 0), but for the HR model (d) (ω = 0.0102 rad/msec) with stimulus duration d_max = 152.01 msec, the peak in f(t) is achieved at t₂ (because z_min ≈ 0) (points III, IVa). Plots c and d illustrate point VI: the stimulus in c is identical to that of d, but the slower HR population in d (ω = 0.0102 versus 0.0201 rad/msec) displays the greatest response.
(independent of the end point condition). Thus, equation 4.6 implies that ρ(θ, t) must also be P-periodic, so that the distribution returns to ρ(θ, t₁) ≡ 1/2π every P time units: that is, ρ(θ, t₁ + kP) ≡ 1/2π for integers k. If the stimulus is turned off after duration d = t₂ − t₁ = kP, this flat density therefore persists (recall that ρ evolves as a traveling wave), giving our second result: (II) for stimulus durations that are multiples of P, poststimulus firing probabilities f(t) return to the constant value ω/2π. This is illustrated in Figure 7a, corresponds to the absence of poststimulus refractory periods and ringing, and is related to the black holes discussed in Tass (1999). Figures 6 and 7 also illustrate the periodic regimes both during and after the stimulus.

When the stimulus duration d is not a multiple of P (and provided z(θ) is not constant), ρ(θ, t₂) has at least one peak exceeding 1/2π and at least one valley less than 1/2π (see the phase density plots of Figure 6). Let the largest and smallest possible ρ values be ρ_max and ρ_min, respectively. Equation 4.6
then gives

ρ_max = (1/2π) [(ω + Ī z_max)/(ω + Ī z_min)],   ρ_min = (1/2π) [(ω + Ī z_min)/(ω + Ī z_max)],    (4.10)
where $z_{\min} \equiv z(\theta_{\min})$ and $z_{\max} \equiv z(\theta_{\max})$ are the global extrema of the PRC; note the relationship $\rho_{\min}\,\rho_{\max} = 1/4\pi^2$. Recalling that $\Theta_{\theta,t}(\tilde t_2) = \theta$ during the stimulus, comparing equations 4.10 and 4.6 shows that $\rho_{\max}$ occurs at $\theta_{\min}$ and $\rho_{\min}$ at $\theta_{\max}$. When it exists, the stimulus duration $d_{\max}$ (resp., $d_{\min}$) for which a distribution with peak $\rho_{\max}$ (resp., valley $\rho_{\min}$) occurs is obtained, essentially, by requiring (ignoring the limits required for discontinuous PRCs) that a characteristic curve passes through $\theta_{\max}$ (resp., $\theta_{\min}$) at $t_1$ and through $\theta_{\min}$ (resp., $\theta_{\max}$) at time $t_2$. Thus, (III) for stimulus durations $d_{\max}$ (resp., $d_{\min}$), poststimulus firing probabilities $f(t)$ exhibit their maximal deviation above (resp., below) the baseline rate $\frac{\omega}{2\pi}$. These deviations may or may not be exceeded during the stimulus itself. (See Figure 7 for examples and Figure 6 for the evolution of the phase density during a prolonged stimulus; in particular, note that while $d_{\max}$ is not strictly defined for the LIF model, shorter stimuli (of arbitrarily small duration) always give higher peaks.)

We now determine whether maximal peaks and minimal valleys in firing rates occur during or after the stimulus for the various neuron types. Again using $\Theta_{\psi,t}(\tilde t_2) = \psi$ during the stimulus, equation 4.8 yields

$$f^d(t) = \lim_{\psi\to\theta_s} \frac{\omega + \bar Iz(\psi)}{2\pi}\left[\frac{\omega + \bar Iz(\Theta_{\psi,t}(t_1))}{\omega + \bar Iz(\psi)}\right] = \lim_{\psi\to\theta_s} \frac{1}{2\pi}\left[\omega + \bar Iz(\Theta_{\psi,t}(t_1))\right], \qquad t_1 < t \le t_2. \tag{4.11}$$
The superscript on $f^d(t)$ denotes "during" the stimulus, emphasizing that this expression is valid only for $t_1 < t \le t_2$. After the stimulus has turned off, a different special case of equation 4.8 is valid:

$$f^a(t) = \lim_{\psi\to\theta_s} \frac{\omega}{2\pi}\left[\frac{\omega + \bar Iz(\Theta_{\psi,t}(t_1))}{\omega + \bar Iz(\Theta_{\psi,t}(t_2))}\right], \qquad t > t_2. \tag{4.12}$$

Here the superscript on $f^a(t)$ denotes "after" the stimulus. We now use these expressions to write the maximum and minimum possible firing rates during and after the stimulus:

$$f^d_{\max} = \frac{1}{2\pi}\left[\omega + \bar Iz_{\max}\right] \tag{4.13}$$

$$f^a_{\max} = \frac{\omega}{2\pi}\left[\frac{\omega + \bar Iz_{\max}}{\omega + \bar Iz_{\min}}\right] \tag{4.14}$$

$$f^d_{\min} = \frac{1}{2\pi}\left[\omega + \bar Iz_{\min}\right] \tag{4.15}$$

$$f^a_{\min} = \frac{\omega}{2\pi}\left[\frac{\omega + \bar Iz_{\min}}{\omega + \bar Iz_{\max}}\right]. \tag{4.16}$$

From equations 4.13 through 4.16, we have

$$f^d_{\max} - f^a_{\max} = \frac{1}{2\pi}\left[\frac{\omega + \bar Iz_{\max}}{\omega + \bar Iz_{\min}}\right]\bar Iz_{\min}, \tag{4.17}$$

$$f^d_{\min} - f^a_{\min} = \frac{1}{2\pi}\left[\frac{\omega + \bar Iz_{\min}}{\omega + \bar Iz_{\max}}\right]\bar Iz_{\max}. \tag{4.18}$$
Since we restrict to the case where $v(\theta,t) > 0$ (i.e., there are no fixed points for the phase flow), the terms in the brackets of the preceding equations are always positive. This implies, for $\bar I > 0$,

$$f^a_{\max} \ge f^d_{\max} \text{ if and only if } z_{\min} \le 0; \tag{4.19}$$

$$f^a_{\max} \le f^d_{\max} \text{ if and only if } z_{\min} \ge 0; \tag{4.20}$$

$$f^a_{\min} \le f^d_{\min} \text{ if and only if } z_{\max} \ge 0; \tag{4.21}$$

$$f^a_{\min} \ge f^d_{\min} \text{ if and only if } z_{\max} \le 0; \tag{4.22}$$
where the "equals" cases of the inequalities require $z_{\max} = 0$ or $z_{\min} = 0$. In other words, (IVa) for the specific stimulus durations that elicit maximal peaks in firing rates, these maximal peaks occur during the stimulus if $z_{\min} \ge 0$ but after the stimulus switches off if $z_{\min} \le 0$; (IVb) for the specific (possibly different) stimulus durations that elicit minimal firing rate dips, these minimal dips occur during the stimulus if $z_{\max} \le 0$ but after the stimulus switches off if $z_{\max} \ge 0$. We recall that $z_{\min} < 0$ is a defining condition for type II neurons (Ermentrout, 1996). The poststimulus maximum (resp., minimum) firing rates are obtained as the peak (resp., valley) of the distribution $\rho(\theta,t)$ passes through $\theta_s$. As Figure 7b shows, the delay from stimulus offset can be significant for typical neuron models. Defining the baseline rate valid for $t < t_1$,

$$f^b(t) \equiv \frac{\omega}{2\pi}, \tag{4.23}$$

equation 4.15 shows that $f^d_{\min} \ge f^b$ if and only if $z_{\min} \ge 0$. Thus, (V) if $z_{\min} \ge 0$, the firing rate does not dip below baseline values until (possibly) after the stimulus switches off. Table 2 summarizes the above results for the neuron models studied here.
Table 2: Predictions Using the Numerical PRCs of Figure 4.

Neuron Model | Response "Jumps" With Stimulus? (point I) | Maximum Response After Stimulus and Depressed Firing During Stimulus? (points IV and V)
HR  | No  | No
HH  | No  | Yes
FN  | Yes | Yes
ML  | Yes | No
IF  | Yes | No
LIF | Yes | No

Note: The conclusions follow from the limiting value of $z(\theta_s)$ (point I in text) and the value of the PRC minimum $z_{\min}$ (points IVa and V).
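Points IVa and IVb reduce to arithmetic comparisons of equations 4.13 through 4.16. A minimal numerical sketch (the values of $\omega$, $\bar I$, and the PRC extrema below are illustrative choices of ours, not taken from the article's models):

```python
import math

def rates(omega, Ibar, zmax, zmin):
    """Extremal firing rates during/after a step stimulus, eqs. 4.13-4.16."""
    fd_max = (omega + Ibar*zmax) / (2*math.pi)
    fa_max = omega/(2*math.pi) * (omega + Ibar*zmax)/(omega + Ibar*zmin)
    fd_min = (omega + Ibar*zmin) / (2*math.pi)
    fa_min = omega/(2*math.pi) * (omega + Ibar*zmin)/(omega + Ibar*zmax)
    return fd_max, fa_max, fd_min, fa_min

# Type II-like PRC (zmin < 0): maximal peak occurs AFTER stimulus offset (eq. 4.19).
fd_max, fa_max, fd_min, fa_min = rates(omega=1.0, Ibar=0.1, zmax=2.0, zmin=-0.5)
assert fa_max >= fd_max and fa_min <= fd_min

# Type I-like PRC (zmin >= 0): maximal peak occurs DURING the stimulus (eq. 4.20).
fd_max2, fa_max2, _, _ = rates(omega=1.0, Ibar=0.1, zmax=2.0, zmin=0.3)
assert fa_max2 <= fd_max2
```

The sign of the PRC minimum alone decides which regime holds, exactly as in points IVa and V.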
We conclude this section by noting that Fourier transformation of the analog of equation 4.1 in the presence of noise shows that $f(t)$ decays at exponential or faster rates due to noise and averaging over distributions of neuron frequencies (cf. Tass, 1999; Brown et al., 2003b). For mildly noisy or heterogeneous systems, results I through V remain qualitatively similar but are smeared; for example, $\rho(\theta,t)$ is no longer time periodic during or after the stimulus but approaches a generally nonuniform equilibrium state via damped oscillations.

4.3 Frequency Scaling of Response Magnitudes. We now determine how the maximum and minimum deviations from baseline firing rates depend on the baseline (prestimulus) firing rate of the neural population. Following the discussion of the previous section, we separately compute the scaling of the maximal (minimal) responses that are possible during the stimulus and the scaling of those that are possible after the stimulus switches off. Equations 4.13 through 4.16 and 4.23 yield

$$f^d_{\max} - f^b = \frac{1}{2\pi}\,\bar Iz_{\max} \tag{4.24}$$

$$f^d_{\min} - f^b = \frac{1}{2\pi}\,\bar Iz_{\min} \tag{4.25}$$

$$f^a_{\max} - f^b = \frac{\omega}{2\pi}\left[\frac{\bar I(z_{\max} - z_{\min})}{\omega + \bar Iz_{\min}}\right] \tag{4.26}$$

$$f^a_{\min} - f^b = \frac{\omega}{2\pi}\left[\frac{\bar I(z_{\min} - z_{\max})}{\omega + \bar Iz_{\max}}\right]. \tag{4.27}$$

These expressions provide one set of measures of the sensitivity of the population-level response at different baseline firing rates. Additionally, taking ratios with the prestimulus firing rate (e.g., finding $(f^d_{\max} - f^b)/f^b$) determines the size of deviations relative to baseline activity. We use the information summarized in Table 1 to compile these measures for all neuron models in Tables 3 through 6. Note that in these tables, "moving away from the bifurcation" means varying parameters so that the frequency varies away from its value at the onset of firing, namely, $\omega = 0$ for the SNIPER and homoclinic bifurcations and the IF and LIF models, $\omega_H$ for the supercritical Hopf bifurcation, and $\omega_{SN}$ for the Bautin bifurcation. The scaling of $f^d_{\max} - f^b$, as an example, is confirmed by Figure 5. In summary, (VI) different neural models and bifurcations imply different scalings of maximal response magnitude with frequency. Most measures of population firing rate responses increase for frequencies closer to the bifurcation point (see Tables 3–6). If these models are parameterized so that frequency increases as the bifurcation parameter $I_b$ increases through the bifurcation point, this means that populations at lower frequencies tend to display greater responses (see Figure 5 for examples). This effect is explored in the next section.
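For the near-SNIPER case, equation 4.24 together with $\bar Iz_{\max} = 2c_{sn}\bar I/\omega$ (cf. equation 5.4) makes the inverse-frequency scaling in Table 3 explicit. A quick numerical check (parameter values are illustrative choices of ours):

```python
import math

def sniper_dev_during(omega, Ibar, c_sn):
    """f^d_max - f^b for a near-SNIPER population: eq. 4.24 with z_max = 2*c_sn/omega."""
    return Ibar * (2.0*c_sn/omega) / (2.0*math.pi)

lo = sniper_dev_during(omega=0.5, Ibar=0.05, c_sn=1.0)
hi = sniper_dev_during(omega=2.0, Ibar=0.05, c_sn=1.0)
assert lo > hi                     # lower baseline frequency -> larger response
assert abs(lo/hi - 4.0) < 1e-12   # ~1/omega scaling: 4x frequency -> 1/4 response
```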
Table 3: Scaling of Deviations in Firing Rate During Stimulus, $f^d_{\max} - f^b$, for the Different Neuron Models.

Bifurcation | $f^d_{\max} - f^b$ | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect Moving Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by $f^b$)
SNIPER | $\frac{1}{2\pi}\left[\frac{2\bar Ic_{sn}}{\omega}\right]$ | $\sim \frac{1}{\omega}$ | Weaker (weaker)
Hopf | $\frac{1}{2\pi}\left[\frac{\bar Ic_H}{\sqrt{|\omega-\omega_H|}}\right]$ | $\sim \frac{1}{\sqrt{|\omega-\omega_H|}}$ | Weaker (weaker)
Bautin | $\frac{1}{2\pi}\left[\frac{\bar I|c_B|}{|\omega-\omega_{SN}|}\right]$ | $\sim \frac{1}{|\omega-\omega_{SN}|}$ | Weaker (weaker)
Homoclinic | $\frac{1}{2\pi}\,\bar Ic_{hc}\,\omega\,e^{2\pi\lambda_u/\omega}$ | $\sim \omega\exp(k/\omega)$ | Weaker (weaker)
IF | $\frac{\bar I}{2\pi}$ | const. | Constant (weaker)
LIF | $\frac{1}{2\pi}\,\frac{\bar I\omega}{g_L}\left(e^{2\pi g_L/\omega} - 1\right)$ | $\sim \omega\exp(k/\omega)$ | Weaker (weaker)

Note: The positive constant k differs from case to case.
Table 4: Scaling of Deviations in Firing Rate During Stimulus, $f^d_{\min} - f^b$, for the Different Neuron Models.

Bifurcation | $f^d_{\min} - f^b$ | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect Moving Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by $f^b$)
SNIPER | $0$ | Constant | Constant (constant)
Hopf | $-\frac{1}{2\pi}\left[\frac{\bar Ic_H}{\sqrt{|\omega-\omega_H|}}\right]$ | $\sim -\frac{1}{\sqrt{|\omega-\omega_H|}}$ | Weaker (weaker)
Bautin | $-\frac{1}{2\pi}\left[\frac{\bar I|c_B|}{|\omega-\omega_{SN}|}\right]$ | $\sim -\frac{1}{|\omega-\omega_{SN}|}$ | Weaker (weaker)
Homoclinic | $\frac{1}{2\pi}\,\bar Ic_{hc}\,\omega$ | $\sim \omega$ | Stronger (constant)
IF | $\frac{\bar I}{2\pi}$ | Constant | Constant (weaker)
LIF | $\frac{1}{2\pi}\,\frac{\bar I\omega}{g_L}\left(1 - e^{-2\pi g_L/\omega}\right)$ | $\sim \omega$ | Stronger (constant)
Table 5: Scaling of Deviations in Firing Rate After Stimulus, $f^a_{\max} - f^b$, for the Different Neuron Models.

Bifurcation | $f^a_{\max} - f^b$ | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect Moving Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by $f^b$)
SNIPER | $\frac{1}{2\pi}\left[\frac{2\bar Ic_{sn}}{\omega}\right]$ | $\sim \frac{1}{\omega}$ | Weaker (weaker)
Hopf | $\frac{1}{2\pi}\left[\frac{2\bar Ic_H\,\omega}{\omega\sqrt{|\omega-\omega_H|} - \bar Ic_H}\right]$ | $\sim \frac{1}{\sqrt{|\omega-\omega_H|}}$ | Weaker (weaker)
Bautin | $\frac{1}{2\pi}\left[\frac{2\bar I|c_B|\,\omega}{\omega|\omega-\omega_{SN}| - \bar I|c_B|}\right]$ | $\sim \frac{1}{|\omega-\omega_{SN}|}$ | Weaker (weaker)
Homoclinic | $\frac{1}{2\pi}\,\frac{\bar Ic_{hc}\,\omega\left(e^{2\pi\lambda_u/\omega} - 1\right)}{1 + \bar Ic_{hc}}$ | $\sim \omega\exp(k/\omega)$ | Weaker (weaker)
IF | $0$ | Constant | Constant (constant)
LIF | $\frac{\omega}{2\pi}\,\frac{\bar I\left(1 - e^{-2\pi g_L/\omega}\right)\left(e^{2\pi g_L/\omega} - 1\right)}{g_L + \bar I\left(1 - e^{-2\pi g_L/\omega}\right)}$ | $\sim \omega\exp(k/\omega)$ | Weaker (weaker)

Note: The positive constant k differs from case to case.
Table 6: Scaling of Deviations in Firing Rate After Stimulus, $f^a_{\min} - f^b$, for the Different Neuron Models.

Bifurcation | $f^a_{\min} - f^b$ | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect Moving Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by $f^b$)
SNIPER | $-\frac{1}{2\pi}\left[\frac{2\bar Ic_{sn}}{\omega + 2c_{sn}\bar I/\omega}\right]$ | $\sim -\omega$ | Stronger (constant)
Hopf | $-\frac{1}{2\pi}\left[\frac{2\bar Ic_H\,\omega}{\omega\sqrt{|\omega-\omega_H|} + \bar Ic_H}\right]$ | $\sim -\frac{1}{\sqrt{|\omega-\omega_H|}}$ | Weaker (weaker)
Bautin | $-\frac{1}{2\pi}\left[\frac{2\bar I|c_B|\,\omega}{\omega|\omega-\omega_{SN}| + \bar I|c_B|}\right]$ | $\sim -\frac{1}{|\omega-\omega_{SN}|}$ | Weaker (weaker)
Homoclinic | $\frac{1}{2\pi}\,\frac{\bar Ic_{hc}\,\omega\left(e^{-2\pi\lambda_u/\omega} - 1\right)}{e^{-2\pi\lambda_u/\omega} + \bar Ic_{hc}}$ | $\sim -\omega$ | Stronger (constant)
IF | $0$ | Constant | Constant (constant)
LIF | $\frac{\omega}{2\pi}\,\frac{\bar I\left(e^{2\pi g_L/\omega} - 1\right)\left(e^{-2\pi g_L/\omega} - 1\right)}{g_L + \bar I\left(e^{2\pi g_L/\omega} - 1\right)}$ | $\sim -\omega$ | Stronger (constant)
5 Gain of Oscillator Populations

In attempts to understand neural information processing, it is useful to understand how input signals are modified by transmission through various populations of spiking cells in different brain organs. The general way to treat this problem is via transfer functions (Servan-Schreiber, Printz, & Cohen, 1990; Gerstner & Kistler, 2002). Here we interpret the results of the previous section in terms of the amplification, or attenuation, of step function input stimuli by the neural population. We consider both extremal and average values of the firing rate $f(t)$ during stepped stimuli of varying strengths and illustrate for neurons near a SNIPER bifurcation. We will use the word gain to describe the sensitivity of the resulting input-output relationship: systems with higher gain have a greater output range for a specific set of input strengths. The average firing rate during the stimulus is

$$\langle f^d\rangle \equiv \frac{1}{P}, \tag{5.1}$$
where P is the period of an individual oscillator during the stimulus (see equation 4.9), and $\langle\cdot\rangle$ is the average over one such period. For the special case of a population near a SNIPER bifurcation, $P_{SN} = \frac{2\pi}{\sqrt{\omega^2 + 2c_{sn}\bar I}}$, so that

$$\langle f^d_{SN}\rangle = \frac{\sqrt{\omega^2 + 2c_{sn}\bar I}}{2\pi}. \tag{5.2}$$
These expressions describe the standard f–I curve typically studied for single neurons (Rinzel & Ermentrout, 1998). The instantaneous responses of neurons are in some cases of greater interest than averages such as equations 5.1 and 5.2. To derive the extremal (i.e., maximally above or below baseline) firing rates, we appeal to expressions 4.11 and 4.12, which are valid for both positive and negative values of $\bar I$ as long as $v(\theta,t)$ remains nonnegative. (However, the subsequent formulas of section 4.2 require modification: max and min must be appropriately interchanged when dealing with negative $\bar I$.) Thus, the extremal value of $f^d(t)$ is (cf. equation 4.13)

$$f^{d,\mathrm{ext}} = \frac{1}{2\pi}\left[\omega + \bar Iz_{\max}\right], \tag{5.3}$$

in general, and in particular for the SNIPER bifurcation:

$$f^{d,\mathrm{ext}}_{SN} = \frac{1}{2\pi}\left[\omega + \frac{2c_{sn}\bar I}{\omega}\right]. \tag{5.4}$$
In Figure 8, we plot $f^{d,\mathrm{ext}}_{SN}$ as a function of both baseline firing rate and stimulus strength $\bar I$, where the latter takes both positive and negative values. For (here, negative) stimulus values and frequencies sufficient to cause the minimum of $v(\theta)$ to dip below zero, fixed points appear in the phase model, giving firing rates $f^d(t) = \langle f^d_{SN}\rangle = f^{d,\mathrm{ext}}_{SN} = 0$. Notice the increased sensitivity of extremal firing rates to changes in stimulus strength at low baseline frequencies. This "increased gain" is also shown in Figure 9a, which plots slices through Figure 8 for two different baseline frequencies. However, there is no analogous effect for the average firing rates of equation 5.2, which follow the standard frequency-current relationships for individual neurons (see Figure 9b). Note that there is always a crossing point between firing rate curves for near-SNIPER populations with high and low baseline frequencies (see Figure 9a). Above this crossing point, stimuli are more greatly amplified by the low-frequency population; below the crossing point, they are more greatly amplified by the high-frequency population. This is analogous to increasing the slope (= gain) of a sigmoidal response function as in Servan-Schreiber et al. (1990), the gain increase in Figure 1 of that article being analogous to a decrease of $\omega$. Thus, if signal discrimination depends on extremal firing rates, the effects of gain modulation on signal and noise discrimination of Servan-Schreiber et al. (1990) could be produced by changes in baseline rate.
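The contrasting gain behavior of the average and extremal rates follows directly from equations 5.2 and 5.4. A sketch with illustrative parameter values of our own choosing ($c_{sn} = 1$, a small range of stimulus strengths):

```python
import math

def f_avg_sniper(omega, Ibar, c_sn=1.0):
    """Average firing rate during stimulus for a near-SNIPER population (eq. 5.2)."""
    return math.sqrt(omega**2 + 2.0*c_sn*Ibar) / (2.0*math.pi)

def f_ext_sniper(omega, Ibar, c_sn=1.0):
    """Extremal (instantaneous) firing rate during stimulus (eq. 5.4)."""
    return (omega + 2.0*c_sn*Ibar/omega) / (2.0*math.pi)

# Gain = output range over a fixed range of stimulus strengths.
I_lo, I_hi = 0.01, 0.1
def gain(f, omega):
    return f(omega, I_hi) - f(omega, I_lo)

# Extremal-rate gain is larger at the lower baseline frequency,
# and more strongly so than the average-rate gain.
assert gain(f_ext_sniper, 0.5) > gain(f_ext_sniper, 2.0)
assert gain(f_ext_sniper, 0.5)/gain(f_ext_sniper, 2.0) > \
       gain(f_avg_sniper, 0.5)/gain(f_avg_sniper, 2.0)
```

The extremal-rate gain scales exactly as $1/\omega$ (the $\omega$ term in equation 5.4 cancels in the difference), which is the "increased gain" effect visible in Figures 8 and 9a.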
Figure 8: Maximum and minimum firing rate $f^{d,\mathrm{ext}}_{SN}$ of a population of stimulated HR neurons in Hz, as a function of baseline frequency (Hz) and applied current strength $\bar I$ (μA/cm²).
6 Discussion

We now provide further comments on how the mechanisms studied in this article could be applied and tested. As discussed in section 5, and with regard to the locus coeruleus (LC) in section 1, baseline frequency-dependent variations in the sensitivity of neural populations to external stimuli could be used to adjust gain in information processing. The effect could be to engage the processing units relevant to specific tasks and, as in Servan-Schreiber et al. (1990) and Usher et al. (1999), to additionally sensitize these units to salient stimuli. See Brown et al. (2003b) for details of the LC application.

We recall that section 4.2 described the different types of poststimulus ringing of firing rates $f(t)$ that occur for the various neuron models. This phase-resetting effect has long been studied in theoretical and experimental neuroscience (e.g., Winfree, 2001; Tass, 1999; Makeig et al., 2002). As we show here (see equation 4.19 and Figure 7), for neuron models having a phase response curve $z(\theta)$ that takes negative values, the greatest deviations from baseline firing rates can occur significantly after stimulus end. Subpopulations of such neurons could be used in detecting termination of sensory stimuli. Elevated firing rates $f(t)$ that remain (or are enhanced)
Figure 9: Maximum and minimum firing rate of a population of stimulated HR neurons (a) and average firing rate (b) in Hz, as a function of applied current strength for two different baseline frequencies: 1.3 Hz (dot-dashed line) and the higher-frequency 2.9 Hz (dotted line). The increased gain effect at lower baseline frequencies is evident for maximum and minimum, but not average, firing rates (see the text).
after the stimulus ends are an example of persistent neural activity, a general phenomenon implicated in short-term memory, interval timing, and other functions. However, physiological evidence suggests that some of the persistent activity observed in vivo results from desynchronized, not phase-clustered, neural groups.

Finally, stimulus-induced ringing of population firing rates (which occurs at the natural baseline frequency of the neuron population; see equation 4.6) could play a role in generating alpha-wave patterns ("alpha-ringing"); the possible relevance of this effect is well known and is a topic of current debate in the EEG community (Makeig et al., 2002; Bogacz, Yeung, & Holroyd, 2002).

The results presented here are experimentally testable. As noted in section 1, the predictions for the average firing rate $f(t)$ are equally valid for multichannel recordings from a (weakly coupled) population and for sequences of single-unit recordings from members of such a population. Thus, the $f(t)$ predictions of this article can be compared with PSTHs formed from both types of data. The scaling of response magnitudes predicted in section 4.3 could be tested in any experiment in which baseline neural firing rates are modulated pharmacologically while stereotyped stimuli are presented. This is essentially what is done in many experiments on the effects of different neuromodulators, neurotransmitters, and other agents. For example, direct application of the neuropeptide corticotropin releasing factor (CRF) has been found to increase LC baseline activity and simultaneously decrease responses to sensory stimuli (Moore & Bloom, 1979) in some, but
not all, protocols. Many other examples of such modulatory effects of neurotransmitters or exogenous inputs exist for neurons in other brain areas (Aston-Jones et al., 2001). However, a general difficulty is that these substances may change many parameters in neurons besides the bifurcation parameter $I_b$ that is the focus of this article, making it difficult to determine what mechanism leads to changes in averaged response. Furthermore, the presence of noise tends to diminish the scaling results reported here (cf. Herrmann & Gerstner, 2001; Brown et al., 2003b), and while it seems that coupling can in some circumstances amplify the scaling (Brown et al., 2003b), we are still working to clarify this effect.

We close by mentioning another experimental test of the predictions presented here, suggested by John Rinzel. First, one could determine what pharmacological manipulations would cause a given in vitro neuron to transition from periodic firing near a SNIPER bifurcation to periodic firing near a Bautin bifurcation. Then one could measure how trial-averaged responses to stereotyped stimuli vary as this manipulation is performed. In particular, this article predicts that maximal responses should occur during the stimulus in SNIPER firing, but after the stimulus switches off following a manipulation to Bautin firing.

Appendix A: The Adjoint Method

Consider an infinitesimal perturbation $\Delta x$ to the trajectory $x^\gamma(t) \in \gamma$ at time $t = 0$. Let $x(t)$ be the trajectory evolving from this perturbed initial condition. Defining $\Delta x(t)$ via $x(t) = x^\gamma(t) + \Delta x(t)$,

$$\frac{d\Delta x(t)}{dt} = DF(x^\gamma(t))\,\Delta x(t) + O(\|\Delta x\|^2), \qquad \Delta x(0) = \Delta x. \tag{A.1}$$
For the phase shift defined as $\Delta\theta = \theta(x(t)) - \theta(x^\gamma(t))$, we have

$$\Delta\theta = \langle \nabla_{x^\gamma(t)}\theta,\, \Delta x(t)\rangle + O(\|\Delta x\|^2), \tag{A.2}$$
where $\langle\cdot,\cdot\rangle$ denotes the standard inner product (written as a dot product in the main text), and $\nabla_{x^\gamma(t)}\theta$ is the gradient of $\theta$ evaluated at $x^\gamma(t)$. We recall from above that $\Delta\theta$ is independent of time (after the perturbation at $t = 0$), so that taking the time derivative of equation A.2 yields, to lowest order in $\|\Delta x\|$,

$$\left\langle \frac{d\nabla_{x^\gamma(t)}\theta}{dt},\, \Delta x(t)\right\rangle = -\left\langle \nabla_{x^\gamma(t)}\theta,\, \frac{d\Delta x(t)}{dt}\right\rangle = -\langle \nabla_{x^\gamma(t)}\theta,\, DF(x^\gamma(t))\,\Delta x(t)\rangle = -\langle DF^T(x^\gamma(t))\,\nabla_{x^\gamma(t)}\theta,\, \Delta x(t)\rangle. \tag{A.3}$$
Here the matrix $DF^T(x^\gamma(t))$ is the transpose (i.e., adjoint) of the (real) matrix $DF(x^\gamma(t))$. Since the above equalities hold for arbitrary perturbations $\Delta x(t)$, we have

$$\frac{d\nabla_{x^\gamma(t)}\theta}{dt} = -DF^T(x^\gamma(t))\,\nabla_{x^\gamma(t)}\theta. \tag{A.4}$$
Finally, recall from equation 2.4 that

$$\frac{d\theta}{dt} = \nabla_x\theta \cdot \frac{dx}{dt} = \nabla_x\theta \cdot F(x) = \omega, \tag{A.5}$$
which in particular must hold at $t = 0$. Thus, as in Hoppensteadt and Izhikevich (1997), Ermentrout (2002), and Ermentrout and Kopell (1991), we must solve equation A.4 subject to the condition

$$\nabla_{x^\gamma(0)}\theta \cdot F(x^\gamma(0)) = \omega. \tag{A.6}$$
Since $\nabla_{x^\gamma(t)}\theta$ evolves in $\mathbb{R}^N$, equation A.6 supplies only one of the N required initial conditions; the rest arise from requiring that the solution $\nabla_{x^\gamma(t)}\theta$ to equation A.4 be T-periodic (Hoppensteadt & Izhikevich, 1997; Ermentrout, 2002; Ermentrout & Kopell, 1991). Note that equations A.4 and A.6 correspond to equations 9.16 and 9.17 of Hoppensteadt and Izhikevich (1997), with the identification $\nabla_x\theta \to Q$ and a slightly different parameterization. Indeed, this is the adjoint problem that XPP solves to numerically find the PRC $Q_{XPP}$. The relationship is

$$\nabla_x\theta = \omega\, Q_{XPP}. \tag{A.7}$$
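A standard cross-check on the adjoint computation is the "direct method": perturb the on-cycle state at a known phase, integrate forward until the trajectory relaxes back to the cycle, and measure the asymptotic phase shift per unit perturbation. The sketch below (our own toy example, not one of the article's models) applies it to a planar oscillator whose isochrons are radial, so that on the cycle $\nabla_x\theta = (-\sin\theta, \cos\theta)$ and an x-perturbation should recover $z_x(\theta) = -\sin\theta$:

```python
import math

# Toy oscillator: r' = r(1 - r^2), phi' = OMEGA, written in Cartesian coordinates.
# Because phi' is independent of r, its isochrons are radial lines and the
# asymptotic phase is simply theta(x, y) = atan2(y, x).
OMEGA = 1.0

def F(x):
    r2 = x[0]**2 + x[1]**2
    return (x[0]*(1.0 - r2) - OMEGA*x[1],
            x[1]*(1.0 - r2) + OMEGA*x[0])

def rk4_step(x, dt):
    k1 = F(x)
    k2 = F((x[0] + 0.5*dt*k1[0], x[1] + 0.5*dt*k1[1]))
    k3 = F((x[0] + 0.5*dt*k2[0], x[1] + 0.5*dt*k2[1]))
    k4 = F((x[0] + dt*k3[0], x[1] + dt*k3[1]))
    return (x[0] + dt*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0])/6.0,
            x[1] + dt*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1])/6.0)

def prc_x(theta, eps=1e-4, t_relax=20.0, dt=1e-3):
    """Direct-method PRC for an x-perturbation of size eps applied at phase theta."""
    a = (math.cos(theta), math.sin(theta))   # unperturbed on-cycle point
    b = (a[0] + eps, a[1])                   # perturbed point
    for _ in range(int(t_relax/dt)):         # let both relax back to the cycle
        a = rk4_step(a, dt)
        b = rk4_step(b, dt)
    d = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])
    d = (d + math.pi) % (2.0*math.pi) - math.pi   # wrap phase difference
    return d / eps
```

For this example, `prc_x(math.pi/2)` returns approximately −1 and `prc_x(0.0)` approximately 0, matching $-\sin\theta$; for a model without known isochrons, the same routine provides a numerical check on the T-periodic solution of equation A.4.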
Appendix B: The Strong Attraction Method

The "strong attraction limit" of a coordinate change to the phase variable $\theta_2$ discussed in Ermentrout and Kopell (1990) and Hoppensteadt and Izhikevich (1997) effectively sets

$$\frac{\partial\theta_2}{\partial x}(x) = \frac{F(x)}{\|F(x)\|^2}\,\omega, \tag{B.1}$$

which clearly satisfies equation 2.4 but implicitly imposes $N - 1$ additional constraints. In particular, level sets of $\theta_2$ are always orthogonal to $\gamma$, which is not generally the case for isochrons. Furthermore, equation B.1 requires that $\frac{F(x)}{\|F(x)\|^2}\,\omega$ be the gradient of the scalar function $\theta_2$, which is possible only if it is curl free in a neighborhood of $\gamma$. Since it is proportional to the unit-normalized vector field, which exhibits the attracting limit cycle, $\frac{F(x)}{\|F(x)\|^2}\,\omega$ will never meet this requirement, so the phase variable $\theta_2$ cannot be extended to a neighborhood of $\gamma$. More practically, $\frac{\partial\theta}{\partial x}(x^\gamma)$ and $\frac{\partial\theta_2}{\partial x}(x^\gamma)$ can also give qualitatively different phase dynamics, with the $\theta$ dynamics representing the original full equations more accurately. See Brown et al. (2003a) for an example involving the stability of phase-locked states in coupled Hodgkin-Huxley systems.
Appendix C: Equations for the Neural Models

Rose-Hindmarsh Equations

$$\dot V = [I_b - g_{Na}\,m_\infty(V)^3(-3(q - Bb_\infty(V)) + 0.85)(V - V_{Na}) - g_K q(V - V_K) - g_L(V - V_L)]/C$$
$$\dot q = (q_\infty(V) - q)/\tau_q(V)$$
$$q_\infty(V) = n_\infty(V)^4 + Bb_\infty(V), \qquad b_\infty(V) = \left(1/(1 + \exp(\gamma_b(V + 53.3)))\right)^4,$$
$$m_\infty(V) = \alpha_m(V)/(\alpha_m(V) + \beta_m(V)), \qquad n_\infty(V) = \alpha_n(V)/(\alpha_n(V) + \beta_n(V)),$$
$$\tau_q(V) = (\tau_b(V) + \tau_n(V))/2, \qquad \tau_n(V) = T_n/(\alpha_n(V) + \beta_n(V)),$$
$$\tau_b(V) = T_b(1.24 + 2.678/(1 + \exp((V + 50)/16.027))),$$
$$\alpha_n(V) = 0.01(V + 45.7)/(1 - \exp(-(V + 45.7)/10)),$$
$$\alpha_m(V) = 0.1(V + 29.7)/(1 - \exp(-(V + 29.7)/10)),$$
$$\beta_n(V) = 0.125\exp(-(V + 55.7)/80),$$
$$\beta_m(V) = 4\exp(-(V + 54.7)/18).$$

Parameters: $V_{Na} = 55$ mV; $V_K = -72$ mV; $V_L = -17$ mV; $g_{Na} = 120$ mS/cm²; $g_K = 20$ mS/cm²; $g_L = 0.3$ mS/cm²; $g_A = 47.7$ mS/cm²; $C = 1$ μF/cm²; $I_b = 5$ μA/cm²; $\gamma_b = 0.069$ mV⁻¹; $T_b = 1$ msec; $T_n = 0.52$ msec; $B = 0.21\,g_A/g_K$.

Fitzhugh-Nagumo Equations

$$\dot V = [-w - V(V - 1)(V - a) + I_b]/C$$
$$\dot w = \varepsilon(V - g_a w)$$

Parameters: $g_a = 1$; $\varepsilon = 0.05$; $a = 0.1$ mV; $C = 1$ μF/cm².
Hodgkin-Huxley Equations

$$dV/dt = (1/C)\left(I - g_{Na}\,h\,(V - V_{Na})\,m^3 - g_K (V - V_K)\,n^4 - g_L(V - V_L)\right)$$
$$dm/dt = a_m(V)(1 - m) - b_m(V)m$$
$$dh/dt = a_h(V)(1 - h) - b_h(V)h$$
$$dn/dt = a_n(V)(1 - n) - b_n(V)n$$
$$a_m(V) = 0.1(V + 40)/(1 - \exp(-(V + 40)/10))$$
$$b_m(V) = 4\exp(-(V + 65)/18)$$
$$a_h(V) = 0.07\exp(-(V + 65)/20)$$
$$b_h(V) = 1/(1 + \exp(-(V + 35)/10))$$
$$a_n(V) = 0.01(V + 55)/(1 - \exp(-(V + 55)/10))$$
$$b_n(V) = 0.125\exp(-(V + 65)/80)$$

Parameters: $V_{Na} = 50$ mV; $V_K = -77$ mV; $V_L = -54.4$ mV; $g_{Na} = 120$ mS/cm²; $g_K = 36$ mS/cm²; $g_L = 0.3$ mS/cm²; $C = 1$ μF/cm².

Morris-Lecar Equations

$$\dot V = [g_{Ca}\,m_\infty(V)(V_{Ca} - V) + g_K w(V_K - V) + g_L(V_L - V) + I_b]/C$$
$$\dot w = \phi(w_\infty(V) - w)/\tau_w(V)$$
$$m_\infty(V) = 0.5(1 + \tanh((V - V_1)/V_2))$$
$$w_\infty(V) = 0.5(1 + \tanh((V - V_3)/V_4))$$
$$\tau_w(V) = 1/\cosh((V - V_3)/(2V_4))$$

Parameters: $\phi = 0.23$; $g_L = 2$ mS/cm²; $g_{Ca} = 4$ mS/cm²; $g_K = 8$ mS/cm²; $C = 20$ μF/cm²; $V_K = -84$ mV; $V_L = -60$ mV; $V_{Ca} = 120$ mV; $V_1 = -1.2$ mV; $V_2 = 18$ mV; $V_3 = 12$ mV; $V_4 = 17.4$ mV.

Acknowledgments

We thank Jonathan Cohen, Tim Lewis, and the anonymous referees for helpful comments. This work was partially supported by DoE grant DE-FG02-95ER25238 and PHS grants MH58480 and MH62196 (Cognitive and Neural Mechanisms of Conflict and Control, Silvio M. Conte Center). E.B. was supported under an NSF Graduate Fellowship and a Burroughs-Wellcome Training grant, 1001782. J.M. was supported by an NSF Postdoctoral Research Fellowship.
References

Aston-Jones, G., Chen, S., Zhu, Y., & Oshinsky, M. (2001). A neural circuit for circadian regulation of arousal. Nature Neurosci., 4, 732–738.
Aston-Jones, G., Rajkowski, J., & Cohen, J. (2000). Locus coeruleus and regulation of behavioral flexibility and attention. Prog. Brain Res., 126, 165–182.
Aston-Jones, G., Rajkowski, J., Kubiak, P., & Alexinsky, T. (1994). Locus coeruleus neurons in the monkey are selectively activated by attended stimuli in a vigilance task. J. Neurosci., 14, 4467–4480.
Bogacz, R., Yeung, N., & Holroyd, C. (2002). Detection of phase resetting in the electroencephalogram: An evaluation of methods. Abstract, Society for Neuroscience, Washington, DC.
Bressloff, P., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Comp., 12, 91–129.
Brown, E., Holmes, P., & Moehlis, J. (2003a). Globally coupled oscillator networks. In E. Kaplan, J. Marsden, & K. Sreenivasan (Eds.), Problems and perspectives in nonlinear science: A celebratory volume in honor of Lawrence Sirovich (pp. 183–215). New York: Springer-Verlag.
Brown, E., Moehlis, J., Holmes, P., Clayton, E., Rajkowski, J., & Aston-Jones, G. (2003b). The influence of spike rate and stimulus duration on noradrenergic neurons. Unpublished manuscript, Program in Applied and Computational Mathematics, Princeton University.
Brunel, N., Chance, F., Fourcaud, N., & Abbott, L. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Casti, A., Omurtag, A., Sornborger, A., Kaplan, E., Knight, B., Sirovich, L., & Victor, J. (2002). A population study of integrate-and-fire-or-burst neurons. Neural Comp., 14, 957–986.
Coddington, E., & Levinson, N. (1955). Theory of ordinary differential equations. New York: McGraw-Hill.
Eckhorn, R. (1999). Neural mechanisms of scene segmentation: Recordings from the visual cortex suggest basic circuits for linking field models. IEEE Trans. Neural Networks, 10, 464–479.
Ermentrout, G. (1981). n : m phase locking of weakly coupled oscillators. J. Math. Biol., 12, 327–342.
Ermentrout, G. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Ermentrout, G. (2002). Simulating, analyzing, and animating dynamical systems: A guide to XPPAUT for researchers and students. Philadelphia: SIAM.
Ermentrout, G., & Kopell, N. (1984). Frequency plateaus in a chain of weakly coupled oscillators, I. SIAM J. Math. Anal., 15, 215–237.
Ermentrout, G., & Kopell, N. (1990). Oscillator death in systems of coupled neural oscillators. SIAM J. Appl. Math., 50, 125–146.
Ermentrout, G., & Kopell, N. (1991). Multiple pulse interactions and averaging in coupled neural oscillators. J. Math. Biol., 29, 195–217.
Evans, L. (1998). Partial differential equations. Providence, RI: American Mathematical Society.
Fenichel, N. (1971). Persistence and smoothness of invariant manifolds for flows. Ind. Univ. Math. J., 21, 193–225.
Fetz, E., & Gustaffson, B. (1983). Relation between shapes of post-synaptic potentials and changes in firing probability of cat motoneurones. J. Physiol., 341, 387–410.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comp., 14, 2057–2110.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, synchronous states, and locking. Neural Comp., 12, 43–89.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gerstner, W., van Hemmen, L., & Cowan, J. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676.
Glass, L., & Mackey, M. (1988). From clocks to chaos. Princeton, NJ: Princeton Paperbacks.
Gray, C. (2000). The temporal correlation hypothesis of visual feature integration: Still alive and well. Neuron, 24, 31–47.
Guckenheimer, J. (1975). Isochrons and phaseless sets. J. Math. Biol., 1, 259–273.
Guckenheimer, J., & Holmes, P. (1983). Nonlinear oscillations, dynamical systems and bifurcations of vector fields. New York: Springer-Verlag.
Hansel, D., Mato, G., & Meunier, C. (1993). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 25(5), 367–372.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Herrmann, A., & Gerstner, W. (2001). Noise and the PSTH response to current transients: I. General theory and application to the integrate-and-fire neuron. J. Comp. Neurosci., 11, 135–151.
Hirsch, M., Pugh, C., & Shub, M. (1977). Invariant manifolds. New York: Springer-Verlag.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Izhikevich, E. (2000a). Neural excitability, spiking and bursting. Int. J. Bif. Chaos, 10, 1171–1266.
Izhikevich, E. (2000b). Phase equations for relaxation oscillators. SIAM J. Appl. Math., 60, 1789–1804.
Keener, J., & Sneyd, J. (1998). Mathematical physiology. New York: Springer-Verlag.
Kim, S., & Lee, S. (2000). Phase dynamics in the biological neural networks. Physica D, 288, 380–396.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag.
Kuramoto, Y. (1997). Phase- and center-manifold reductions for large populations of coupled oscillators with application to non-locally coupled systems. Int. J. Bif. Chaos, 7, 789–805.
Kuznetsov, Y. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Lewis, T., & Rinzel, J. (2003). Dynamics of spiking neurons connected by both inhibitory and electrical coupling. J. Comp. Neurosci., 14, 283–309.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Makeig, S., Westerfield, M., Jung, T.-P., Enghoff, S., Townsend, J., Courchesne, E., & Sejnowski, T. (2002). Dynamic brain sources of visual evoked responses. Science, 295, 690–694.
Malkin, I. (1949). Methods of Poincaré and Liapunov in the theory of nonlinear oscillations. Moscow: Gostexizdat.
Matell, M., & Meck, W. (2000). Neurophysiological mechanisms of interval timing behavior. BioEssays, 22, 94–103.
Moore, R., & Bloom, F. (1979). Central catecholamine neuron systems: Anatomy and physiology of the norepinephrine and epinephrine systems. Ann. Rev. Neurosci., 2, 113–168.
Murray, J. (2002). Mathematical biology I: An introduction (3rd ed.). New York: Springer-Verlag.
Nykamp, D., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and application to orientation tuning. J. Comp. Neurosci., 8, 19–50.
Omurtag, A., Knight, B., & Sirovich, L. (2000). On the simulation of large populations of neurons. J. Comp. Neurosci., 8, 51–63.
Rinzel, J., & Ermentrout, G. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 251–291). Cambridge, MA: MIT Press.
Rinzel, J., & Miller, R. (1980). Numerical calculations of stable and unstable periodic solutions to the Hodgkin-Huxley equations. Math. Biosci., 49, 27–59.
Ritt, J. (2003). A probabilistic analysis of forced oscillators, with application to neuronal response reliability. Unpublished doctoral dissertation, Boston University.
Rose, R., & Hindmarsh, J. (1989). The assembly of ionic currents in a thalamic neuron I. The three-dimensional model. Proc. R. Soc. Lond. B, 237, 267–288.
Servan-Schreiber, D., Printz, H., & Cohen, J. (1990). A network model of catecholamine effects: Gain, signal-to-noise ratio, and behavior. Science, 249, 892–895.
Stein, R. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Strogatz, S. (2000). From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators. Physica D, 143, 1–20.
Tass, P. (1999). Phase resetting in medicine and biology. New York: Springer-Verlag.
Usher, M., Cohen, J., Servan-Schreiber, D., Rajkowsky, J., & Aston-Jones, G. (1999). The role of locus coeruleus in the regulation of cognitive performance. Science, 283, 549–554.
van Vreeswijk, C., Abbott, L., & Ermentrout, G. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Whitham, G. (1974). Linear and nonlinear waves. New York: Wiley.
Wilson, H., & Cowan, J. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Winfree, A. (1974). Patterns of phase compromise in biological cycles. J. Math. Biol., 1, 73–95.
Winfree, A. (2001). The geometry of biological time (2nd ed.). New York: Springer-Verlag.

Received May 29, 2003; accepted September 15, 2003.
LETTER
Communicated by Michael Berry
Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity

José M. Amigó
[email protected]
Centro de Investigación Operativa, Universidad Miguel Hernández, 03202 Elche, Spain
Janusz Szczepański
[email protected]
Institute of Fundamental Technological Research, Świętokrzyska 21, 00-049 Warsaw, Poland, and Centre of Trust and Certification "Centrast" Co., Poland
Elek Wajnryb [email protected] Institute of Fundamental Technological Research, Swietokrzyska 21, 00-049 Warsaw, Poland
Maria V. Sanchez-Vives [email protected] Instituto de Neurociencias, Universidad Miguel Hern´andez-CSIC, 03550 San Juan de Alicante, Spain
Normalized Lempel-Ziv complexity, which measures the generation rate of new patterns along a digital sequence, is closely related to such important source properties as entropy and compression ratio, but, in contrast to these, it is a property of individual sequences. In this article, we propose to exploit this concept to estimate (or, at least, to bound from below) the entropy of neural discharges (spike trains). The main advantages of this method include fast convergence of the estimator (as supported by numerical simulation) and the fact that there is no need to know the probability law of the process generating the signal. Furthermore, we present numerical and experimental comparisons of the new method against the standard method based on word frequencies, providing evidence that this new approach is an alternative entropy estimator for binned spike trains.

Neural Computation 16, 717–736 (2004)
© 2004 Massachusetts Institute of Technology

1 Introduction

Figure 1: Binary representation of a typical spike train with a given time precision Δτ. By dividing the time axis into discrete bins, a spike train (represented here as a sequence of impulses) can be represented as a sequence of binary digits, where 1 denotes at least one spike in the given bin and 0 denotes no spike.

Neurons respond to external or internal stimuli by firing sequences of discrete, identical action potentials called spike trains (see Figure 1). It is still an open question whether the information about the stimuli is carried by the spike timing, spike patterns, number of spikes, or a different mechanism (perhaps including some of the previous ones), but independent of whatever the answer might be, the average amount of information a neuron is transmitting about the stimuli is quantitatively measured in bits per symbol or bits per second by the mutual information, which is obtained by subtracting the noise entropy rate (due to the variability of the responses to a fixed stimulus) from the (Shannon) entropy rate of the entire response ensemble. (See Shannon, 1948, and Cover & Thomas, 1991, for the general theory of information, and Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1998, Borst & Theunissen, 1999, and Paninski, 2003, for more elaborate and practical discussions in relation to computational neuroscience.) In this article, we explore the application of Lempel-Ziv complexity (Lempel & Ziv, 1976) to the measurement of the entropy rate of neurological signals (not to be mistaken for the complexity measure of the same name used for data compression, which was proposed later) and compare both the numerical and experimental results to those obtained with the by now classic technique of Strong, Koberle, de Ruyter van Steveninck, and Bialek (1998). The comparison turns out to be very satisfactory: the complexity-based approach performs better for short sequences, for which undersampling becomes critical. We also address the question of the convergence of the normalized Lempel-Ziv complexity to the entropy rate. The application of Lempel-Ziv complexity to the estimation of mutual information will be studied in a forthcoming article.

2 Signal Codification and Entropy

Before applying the concepts and methods of discrete information theory to a spike train, it has to be transformed into a digital signal or "word," in the parlance of information theory, that is, a sequence of finitely many symbols or "letters." The transformation of a spike train into a word is called the codification of the signal and the procedure, the (en)coding. Codification can be carried out in a variety of ways. Hereafter we will consider only the temporal or binary time-bin coding (see Figure 1), which has been the most popular encoding since MacKay and McCulloch (1952) used it in their seminal article. Let the first spike of a train under observation occur at time t₀ and the last one at t₀ + T. The time interval [t₀, t₀ + T] is then split into n bins [t_{i−1}, t_i] (1 ≤ i ≤ n, with t_n = t₀ + T) of the same length Δτ = T/n. If each bin is now coded by 0 or 1 according to whether it contains no spikes (0) or at least one spike (1), the result is a binary message or word of length n. The bin length Δτ can be interpreted as the time resolution with which the spike train is being observed.

Once a spike train has been codified into a message, this message can be viewed as emitted by an information source S that we call, for obvious reasons, a neuronal source (Amigó, Szczepański, Wajnryb, & Sanchez-Vives, 2003). In order to calculate the entropy (or, rather, the entropy rate) of S, we follow Strong et al. (1998) and Reinagel and Reid (2000). First, a representative ensemble of neuronal responses to a given group of stimuli is recorded.
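The binary time-bin coding just described is easy to implement. The sketch below is ours, not the authors' code; the function name and the example spike times are illustrative assumptions. It clamps the final spike into the last bin so that the word has exactly n = T/Δτ letters.

```python
def bin_spike_train(spike_times, dt):
    """Binary time-bin coding of a spike train with resolution dt:
    1 if a bin contains at least one spike, 0 otherwise."""
    t0, t_end = spike_times[0], spike_times[-1]
    n = max(1, round((t_end - t0) / dt))    # number of bins, n = T / dt
    word = [0] * n
    for t in spike_times:
        i = min(int((t - t0) / dt), n - 1)  # clamp the last spike into bin n-1
        word[i] = 1
    return "".join(map(str, word))

# Hypothetical spike times (seconds), observed at 10 ms resolution:
print(bin_spike_train([0.012, 0.015, 0.048, 0.050, 0.081], 0.01))  # 1001001
```

Coarsening the resolution dt merges neighboring spikes into the same bin, which is exactly the information loss the choice of Δτ controls.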
Second, one examines segments of the resulting binned spike trains in (overlapping) windows of duration L ≤ T, each segment becoming after codification a binary word of length l = L/Δτ. Let p̃ᵢ be the normalized count of the ith word in the ensemble of words of length l in a set of observations. Then, the corresponding estimate of the neuronal source entropy rate (in bits per second) is

    H(l, Δτ) = −(1/(l Δτ)) Σᵢ p̃ᵢ log₂ p̃ᵢ,   (2.1)

where the sum is over all the words of length l. As made clear by the notation, such an estimate depends on both l and the time resolution Δτ. The true entropy rate H(S) is reached in the limit of vanishing time bins and infinitely long words; mathematically,

    H(S) = lim_{Δτ→0, l→∞} H(l, Δτ).
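The estimator of equation 2.1 amounts to counting the length-l blocks that occur in overlapping windows and plugging their relative frequencies p̃ᵢ into the entropy formula. A minimal sketch follows (the function name is ours):

```python
from collections import Counter
from math import log2

def entropy_rate_estimate(word, l, dt):
    """Word-frequency estimate H(l, dt) of the entropy rate, in bits
    per second (equation 2.1), from overlapping length-l blocks."""
    counts = Counter(word[i:i + l] for i in range(len(word) - l + 1))
    total = sum(counts.values())
    h_word = sum(-(c / total) * log2(c / total) for c in counts.values())
    return h_word / (l * dt)   # bits per word -> bits per second

print(entropy_rate_estimate("01010101", 1, 1.0))  # 1.0 bit/s: 0 and 1 equally likely
print(entropy_rate_estimate("00000000", 1, 1.0))  # 0.0 bit/s: a constant word
```

For real spike trains the counts come from an ensemble of responses rather than a single word, but the frequency-to-entropy step is the same.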
We want to stress again that the entropy is a property of sources and is therefore difficult to evaluate (Strong et al., 1998). In fact, the knowledge of the probability distribution involved in its calculation requires, in principle, an extensive sampling that usually cannot be performed, not to mention the reproducibility of the test conditions. In contrast, the complexity as originally formulated by Lempel and Ziv (1976) is a property of individual sequences that can be used to estimate the entropy or, more generally, to bound it from below. We will sometimes call this concept LZ-76 complexity to distinguish it from LZ-78 complexity, a different definition of complexity also due to Lempel and Ziv (Ziv & Lempel, 1978), which is the basis of the typical lossless compression algorithms (e.g., WinZip) in common use in modern computing and other information technologies. Unless otherwise stated, we henceforth refer always to the LZ-76 complexity.

3 Lempel-Ziv Complexity

We define Lempel-Ziv complexity recursively. Given the word x₁ⁿ := x₁x₂···xₙ of length n (xᵢ ∈ {0, 1}, 1 ≤ i ≤ n), a block of length l (1 ≤ l ≤ n) is just a segment of x₁ⁿ of length l, that is, a subsequence of l consecutive letters, say x_{i+1}^{i+l} := x_{i+1}x_{i+2}···x_{i+l} (0 ≤ i ≤ n − l). In particular, letters are blocks of length 1, and blocks of higher length are obtained by juxtaposition of blocks of lower length. Set B₁ = x₁¹ = x₁, and suppose that

    B₁B₂···B_k = x₁^{n_k},

where B₁B₂···B_k denotes the juxtaposition of the blocks B₁, B₂ = x₂^{n₂}, …, B_k = x_{n_{k−1}+1}^{n_k}, and n_{k−1} + 1 ≤ n_k < n (with n₀ = 0 and n₁ = 1). Define

    B_{k+1} := x_{n_k+1}^{n_{k+1}}   (n_k + 1 ≤ n_{k+1} ≤ n)

to be the block of minimal length such that it does not occur in the sequence x₁^{n_{k+1}−1}. Proceeding in this way, we obtain a decomposition of x₁ⁿ into "minimal" blocks, say,

    x₁ⁿ = B₁B₂···B_p,   (3.1)

in which only the last block B_p can occasionally appear twice. The complexity C(x₁ⁿ) of x₁ⁿ is then defined as the number of blocks in the (clearly unique) decomposition 3.1: C(x₁ⁿ) := p. The procedure is illustrated by the following example. The decomposition of the binary word x₁²⁰ = 01011010001101110010 into minimal blocks of new patterns is

    0 | 1 | 011 | 0100 | 011011 | 1001 | 0,
Estimating the Entropy Rate of Spike Trains
721
where the vertical lines separate the blocks. Therefore, the complexity of x₁²⁰ is 7. Let us mention in passing that to define LZ-78 complexity, one compares the current block x_{n_k+1}···x_{n_k+l_k}, l_k = 1, 2, …, with the previous minimal blocks B₁, …, B_k (instead of looking for the same pattern in the whole segment x₁···x_{n_k+l_k−1}, as in LZ-76) and sets B_{k+1} = x_{n_k+1}^{n_k+l_k} as soon as it differs from all of them. For the word x₁²⁰ of the example above, one now gets the block decomposition 0 | 1 | 01 | 10 | 100 | 011 | 0111 | 00 | 10, and hence its LZ-78 complexity is 9.

The generation rate of new patterns along x₁ⁿ, a binary word of length n, is measured by the normalized complexity c(x₁ⁿ), which is defined by

    c(x₁ⁿ) = C(x₁ⁿ)/(n/log₂ n) = (p/n) log₂ n.
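Both block counts can be computed directly from the definitions. The following sketch is our own illustration, written for clarity rather than speed; it reproduces the example word, which LZ-76 parses into 7 blocks and LZ-78 into 9.

```python
def lz76_complexity(x):
    """LZ-76 block count C(x): each new block is the shortest segment
    that does not occur earlier in the sequence read so far
    (only the last block may repeat)."""
    n, i, c = len(x), 0, 0
    while i < n:
        l = 1
        # extend the candidate block while it already occurs in x[:i+l-1]
        while i + l <= n and x[i:i + l] in x[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c

def lz78_complexity(x):
    """LZ-78 block count: each new block is the shortest segment that
    differs from all previous minimal blocks."""
    blocks, n, i, c = set(), len(x), 0, 0
    while i < n:
        l = 1
        while i + l < n and x[i:i + l] in blocks:
            l += 1
        blocks.add(x[i:i + l])
        c += 1
        i += l
    return c

word = "01011010001101110010"
print(lz76_complexity(word))  # 7: 0|1|011|0100|011011|1001|0
print(lz78_complexity(word))  # 9: 0|1|01|10|100|011|0111|00|10
```

The normalized complexity of this word is then c = C · log₂(n)/n = 7 · log₂(20)/20 ≈ 1.51.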
Sequences that are not complex (e.g., periodic or quasi-periodic) have a very small normalized complexity. At the opposite end are the random sequences. Although the normalized complexity can take values higher than 1, the normalized complexity of sequences generated by random sources is about 1 with very high probability.

4 Relation Between Entropy and LZ-76 Complexity

To state the relation between the normalized complexity c(x₁ⁿ, Δτ) of a binned spike train x₁ⁿ and the entropy rate per second,

    H(Δτ) := lim_{l→∞} H(l, Δτ) = −lim_{l→∞} (1/(l Δτ)) Σᵢ p̃ᵢ log₂ p̃ᵢ,   (4.1)

of the neuronal source S that has produced the binned spike train x₁ⁿ, we still need two definitions. S is said to be stationary if the probability of any block x_i^{i+l−1} of length l ≥ 1 does not depend on its position i ≥ 1, which means that the statistical properties of the (in principle, arbitrarily long) words generated by S do not change with time. In general, the entropy fails to exist for nonstationary sources. A stationary source is called ergodic if ensemble averages and time averages coincide almost surely, that is, if one can calculate expected values over the word ensemble using the relative frequencies of sufficiently long substrings from a typical word. Hence, ergodicity can be tested in practical cases by sampling typical words; every such word should produce (about) the same averages. As a rule, ergodicity is often encountered in nature for stationary processes. Maybe for this reason, ergodicity is in general tacitly assumed, without further justification, whenever entropy is estimated using word relative frequencies from a single sequence.
It can be proved (Ziv, 1978) that if S is stationary, then

    limsup_{n→∞} c(x₁ⁿ, Δτ)/Δτ ≤ H(Δτ) on average,   (4.2)

and, moreover, if S is ergodic, then

    limsup_{n→∞} c(x₁ⁿ, Δτ)/Δτ = H(Δτ) almost surely.   (4.3)
Therefore, equations 4.2 and 4.3 provide ways to bound from below and to estimate, respectively, the entropy of a neuronal source (with the corresponding properties) via the Lempel-Ziv complexity of a sample of spike trains and of a typical (i.e., randomly chosen) spike train produced by it. Both equations 4.1 and 4.3 involve limits to infinity and hence are difficult to implement in practice in order to estimate H(Δτ), "l, n → ∞" meaning that l and n are to be taken as large as possible. But the nature of the difficulty is different. Whereas large l in H(l, Δτ) leads inevitably to undersampling of the words of length l, requiring some extrapolation technique to be put in place (see section 5), large n or even small n in c(x₁ⁿ, Δτ)/Δτ can lead to good approximations of H(Δτ) if the convergence in equation 4.3 is sufficiently fast. This basic difference hints at the possibility, substantiated by numerical evidence with short sequences, that the entropy rate can be estimated by means of the normalized complexity in situations where the standard estimator performs poorly.

5 Undersampling and Stationarity

When estimating the entropy rate of experimental time series, two main difficulties arise (independent of the method used), which have to do with the finiteness of the real signals and with their stationarity. First, the definition of H(Δτ) (see equation 4.1) requires words of increasing length l, whereas real spike trains are necessarily finite. Now, increasing l when counting different words out of spike trains of finite length depletes the word statistics and therefore renders the estimations of the relative frequencies p̃ᵢ of the words of length l less and less reliable. Also as a result of this statistical depletion (or undersampling), H(l, Δτ) gets artificially smaller than H(Δτ) for sufficiently large l, while H(1, Δτ) ≥ H(2, Δτ) ≥ ··· ≥ H(Δτ) should hold.
Furthermore, stationarity is an assumption that cannot be taken for granted in biological systems because of phenomena such as adaptation and synaptic plasticity, and it should be checked on a case-by-case basis. Usually one assumes that "short" registers are sufficiently stationary for practical purposes, since the time variability of their statistical properties cannot become important in the short term. In case of doubt, one should cut the experimental registers even shorter to ensure stationarity. But, again, this assumption is at odds with the limit l → ∞ in equation 4.1. Therefore, for one reason or the other, one usually faces short sequences in practice when it comes to evaluating their entropy rate. The by now standard technique for doing so was proposed in Strong et al. (1998) and consists of extrapolating the linear trend (if any) of the graph of H(l, Δτ) versus 1/l up to the vertical axis, taking the intercept as H(Δτ).

As a real and practical alternative, we propose next a complexity-based approach to estimate the entropy rate of binned spike trains. Indeed, equation 4.3 provides a simple way to estimate the entropy of an ergodic neuronal source via the normalized complexity rate of a typical binned spike train x₁^L of total length L produced by it, namely,

    c(x₁^l, Δτ)/Δτ ≈ H(Δτ),

for all l ≤ L such that c(x₁^l, Δτ)/Δτ has already converged to H(Δτ). But as already noted, while the performance of the classical entropy rate estimator H(l, Δτ) depends critically on the size of l as compared to L because of undersampling, the applicability of the normalized complexity rate relies rather on the convergence speed of c(x₁^l, Δτ) as l → L, that is, as the pattern count continues. If the convergence is fast, one can also assess the entropy rate of data series too short for other methods to deliver. In particular, one expects this to be the case when the source entropy is small (i.e., when the source is far from random), because then the pattern count is low. This is important because stationarity often demands that experimental spike trains be short. Numerical simulations with two-state Markov processes support our claim that the normalized complexity rate converges fast to H(Δτ). The next section is devoted to these simulations.

6 Numerical Simulations

The entropy estimation of signals of finite and, especially, short length is a challenging problem.
In order to tackle this problem for the estimators in which we are interested, namely, the normalized Lempel-Ziv complexity and the usual estimate based on the word relative frequencies (see equation 2.1), we consider two-state (numbered 0 and 1) Markov processes in discrete time. The conditional probabilities for such processes are completely defined in terms of the transition probabilities from state 0 to state 1, p₁₀, and from state 1 to state 0, p₀₁. The whole Markov transition probability matrix P_{n+1,n}(x | y), where x, y = 0, 1, is given by

    P_{n+1,n}(x | y) = ( 1 − p₁₀    p₀₁     )
                       ( p₁₀        1 − p₀₁ ).   (6.1)
The equation for the probability evolution (master equation) then reads

    ( P_{n+1}(0) )   ( 1 − p₁₀    p₀₁     ) ( P_n(0) )
    ( P_{n+1}(1) ) = ( p₁₀        1 − p₀₁ ) ( P_n(1) ),   (6.2)

and it has the stationary solution

    ( P_eq(0) )   ( p₀₁/(p₀₁ + p₁₀) )
    ( P_eq(1) ) = ( p₁₀/(p₀₁ + p₁₀) ).   (6.3)
The entropy rate H of such a process (or, equivalently, Markov source) generating an infinitely long binary sequence is given by Cover and Thomas (1991):

    H = P_eq(0)(−p₁₀ log₂ p₁₀ − (1 − p₁₀) log₂(1 − p₁₀))
      + P_eq(1)(−p₀₁ log₂ p₀₁ − (1 − p₀₁) log₂(1 − p₀₁)).   (6.4)
On the other hand, the entropy rate estimate obtained as in equation 2.1 from the statistics of l-bit-long words in an infinitely long binary sequence generated by a Markov source can be written as the convex sum

    H(l) = ((l − 1)/l) H + (1/l) H_eq,   (6.5)

(l = 1, 2, …), where H is the source entropy rate, equation 6.4, and

    H_eq = −P_eq(0) log₂ P_eq(0) − P_eq(1) log₂ P_eq(1).   (6.6)

Notice from equation 6.5 that

    H(l = 1) = H_eq,   lim_{l→∞} H(l) = H   (6.7)

(as it should be by definition) and, in general, H ≤ H(l) ≤ H_eq. In the particular case of a Markov source with

    p₀₁ + p₁₀ = 1,   (6.8)

both extremal values H_eq and H coincide (= −p₀₁ log₂ p₀₁ − p₁₀ log₂ p₁₀) and, hence,

    H = H(l) = H_eq   (6.9)

for words of any length l. In particular, H = H(l = 1), which shows that if p₀₁ + p₁₀ = 1, one can estimate the entropy rate with arbitrary precision just by sampling a sufficiently large number of single bits. But this is rather an exceptional situation.
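Equations 6.3 through 6.6 are straightforward to evaluate numerically. The following sketch (function names ours) reproduces the exact entropy rates of the two Markov sources simulated in this section:

```python
from math import log2

def h2(p):
    """Binary entropy in bits; h2(0) = h2(1) = 0 by convention."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def markov_entropy_rate(p10, p01):
    """Entropy rate H of the two-state Markov source, equation 6.4."""
    peq0 = p01 / (p01 + p10)   # stationary probability of state 0, eq. 6.3
    peq1 = p10 / (p01 + p10)
    return peq0 * h2(p10) + peq1 * h2(p01)

def markov_block_estimate(p10, p01, l):
    """Value H(l) of the word-frequency estimator, the convex sum 6.5."""
    heq = h2(p01 / (p01 + p10))          # equation 6.6 (two states)
    h = markov_entropy_rate(p10, p01)
    return (l - 1) / l * h + heq / l

print(round(markov_entropy_rate(0.1, 0.8), 3))    # 0.497 bits per symbol
print(round(markov_entropy_rate(0.05, 0.05), 3))  # 0.286 bits per symbol
```

As a check of equation 6.7, markov_block_estimate(p10, p01, 1) equals H_eq, and in the exceptional case p₀₁ + p₁₀ = 1 of equation 6.8, H(l) is independent of l.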
Figure 2: Convergence rate with increasing word length of the normalized Lempel-Ziv complexity to the true entropy rate of the Markov source that has generated the sequence. Here, p₁₀ = 0.1 and p₀₁ = 0.8. The corresponding curve for the normalized LZ-78 complexity is shown for comparison.
In general, p₀₁ + p₁₀ ≠ 1, and since H := lim_{l→∞} H(l), the reliable estimation of H through H(l) requires a large sample of long words, while experimental data series are usually too short to satisfy this need. Moreover, given a finite binary sequence of length n, it can be shown that H(l) decays as

    H_asym(l) = log₂(n − l + 1)/l   (6.10)

for sufficiently long words, so that lim_{l→n} H(l) = lim_{l→n} H_asym(l) = 0. This is the undersampling scenario one faces in practice because of the finite length of real data series.

We present in Figures 2 through 6 the results of our numerical simulations pertaining to the estimation of the entropy H via the normalized Lempel-Ziv complexity and by direct sampling of word probabilities. We have performed many such simulations, but only two representative cases will be discussed here. Specifically, Figures 2 and 3 exhibit the convergence rate of the normalized Lempel-Ziv complexity to H for Markov sources with transition probabilities p₁₀ = 0.1, p₀₁ = 0.8 in Figure 2 and p₁₀ = p₀₁ = 0.05 in Figure 3.
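This kind of simulation can be reproduced in miniature: draw a binary sequence from the two-state Markov source and compare its normalized LZ-76 complexity with the exact rate of equation 6.4. The sketch below is ours (the seed, initial state, and sequence length are arbitrary choices, not the paper's settings); the LZ-76 routine is restated so the block is self-contained.

```python
import random
from math import log2

def lz76_complexity(x):
    """LZ-76 block count: each block is the shortest segment not seen earlier."""
    n, i, c = len(x), 0, 0
    while i < n:
        l = 1
        while i + l <= n and x[i:i + l] in x[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c

def simulate_markov(p10, p01, n, rng):
    """Binary word of length n from the two-state Markov source (6.1)."""
    bits, state = [], 0
    for _ in range(n):
        bits.append(str(state))
        flip = p10 if state == 0 else p01   # probability of leaving the state
        if rng.random() < flip:
            state = 1 - state
    return "".join(bits)

word = simulate_markov(0.1, 0.8, 4000, random.Random(1))
c_norm = lz76_complexity(word) * log2(len(word)) / len(word)
print(c_norm)  # should lie near the exact rate H = 0.497 bits per symbol
```

Repeating the computation on prefixes of increasing length traces out convergence curves of the kind shown in Figures 2 and 3.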
Figure 3: Convergence rate with increasing word length of the normalized Lempel-Ziv complexity to the true entropy rate of the Markov source that has generated the sequence. Here, p₁₀ = p₀₁ = 0.05. The corresponding curve for the normalized LZ-78 complexity is also shown for comparison.
The exact values of the entropy rate for these transition probabilities are, according to equation 6.4, H = 0.497 and H = 0.286 bits per symbol, respectively. We also present the corresponding curves of the normalized LZ-78 complexity to show that the former (LZ-76) converges much faster than the latter. Figures 4 and 5 compare both approaches for the extreme case of a binary sequence of only 200 bits generated by two Markov processes with the same transition probabilities as before. Whereas the standard method errs from the exact values of the entropy rate by as much as 22% (the first case) and 24% (the second case), the worst relative error of the complexity-based estimation is only 8%. These and similar simulations support the utility of the complexity-based estimations for short sequences. Figure 6 shows that for sequences of 4000 bits, the standard estimator performs flawlessly, as expected. The relative error is about 1% for both Markov processes. The corresponding estimations from the normalized complexity can be read in Figures 2 and 3, and the relative error there is also 1%.

We conclude after the preceding benchmarking with two-state Markov processes that the normalized complexity can be applied successfully in a variety of situations, and its performance, at least in the controlled environment of numerical simulation, is at least as good as that of the standard method. We turn next to the neural experimental data.

Figure 4: A close-up of the LZ-76 complexity curves of Figures 2 and 3 in the interval 50 ≤ n ≤ 200. It can be seen that the normalized complexity rate is already close to the true entropy rate, the relative error being less than 14% throughout and 8% at n = 200 in either case.

Figure 5: The standard extrapolation technique applied to the same cases as in Figure 4 (sequence length = 200 bits) misses the true entropy rate value by a relative error of 22% (first Markov process) and 24% (second Markov process).

Figure 6: The standard extrapolation technique applied to the same cases as in Figures 2 and 3 (sequence length = 4000 bits). Now the relative error is about 1%, the same as when the entropy rate is read from the corresponding normalized complexity rate curve in Figure 2 or 3.

7 Experimental Work and Results

In this section, we continue the comparison of the two methods, this time with experimental data. We have included for this purpose recordings of different discharge patterns obtained both in vivo and in vitro and induced by a variety of stimuli. The experimental data were obtained from primary cortex recordings both in vivo and in brain slice preparations (in vitro). Intracellular recordings in vivo were obtained from anesthetized adult cats (see Sanchez-Vives, Nowak, & McCormick, 2000a, for details). For the preparation of slices, ferrets 2 to 4 months old of either sex were used (see Sanchez-Vives, Nowak, & McCormick, 2000b, for details). Action potentials were detected with a window discriminator, and the time of their occurrence was collected with a 10 μsec resolution. The resulting time series were used to analyze the neuronal spiking. The stimuli were of three kinds:
1. Intracellular periodic current injection. Intracellular sinusoidal currents were injected in vivo. The frequency of the waveform was 2 Hz, and the intensity ranged between 0.2 and 1.5 nA. The recorded cell ensemble had 8 samples (spike train lengths between 15.56 sec and 47.64 sec).

2. Visual stimulation with sinusoidal drifting gratings. The visual stimulus consisted of a sinusoidal drifting grating presented in a circular patch of 3 to 5 degrees diameter, centered on the receptive field center (in vivo). Only simple cells (classified as in Skottun et al., 1991) were included in this study. In this case, 8 samples were analyzed (spike train lengths between 15.87 sec and 23.62 sec).

3. Intracellular random current injection. Random currents ranging between −1.5 and +1.5 nA with different degrees of correlation were injected during the intracellular recordings from cortical brain slices (in vitro). The ensemble consisted of 20 samples (spike train lengths between 16.32 sec and 35.47 sec).

Figures 7 and 8 present the complexity rates c(x₁ⁿ, Δτ)/Δτ versus the coding frequency 1/Δτ of a randomly chosen spike train x₁ⁿ belonging, respectively, to the in vivo (periodic current injection and visual stimulation) and in vitro (random current injection) experimental cases just explained. Notice that we have further split the responses to random stimuli into two subsets according to whether the autocorrelation function of the stimuli decays slowly or fast. Only spike trains for which this distinction was clear were considered for analysis. Figures 9 to 12 show the graphs 1/l ↦ H(l, Δτ) with Δτ = 0.01, 0.005, 0.0033 sec, used to estimate the corresponding entropy rate H(Δτ) by the standard technique of Strong et al. (1998) for the same spike trains as before. Tables 1 to 4 summarize the entropy rate estimations (in bits per second) obtained by the standard and complexity-based methods applied to the same spike train in each of the four previous experimental cases. The complexity-based estimations are the readings for 1/Δτ = 100, 200, 300 Hz of the computer program that calculates the complexity curves in Figures 7 and 8. The values listed as standard estimations were numerically extrapolated from the linear trends of the corresponding graphs, Figures 9 to 12.

Figure 7: Normalized complexity rate c(Δτ)/Δτ versus coding frequency 1/Δτ for the two in vivo experimental cases considered in the text: intracellular periodic current injection and visual stimulation with sinusoidal drifting gratings. The values for 1/Δτ = 100, 200, and 300 Hz are presented in Tables 1 and 2. The spike train is "typical" in the sense that it was chosen randomly.

Figure 8: Normalized complexity rate c(Δτ)/Δτ versus coding frequency 1/Δτ for the two in vitro experimental cases considered in the text: intracellular random current injection with stimuli whose autocorrelation functions decay fast or slowly. The spike trains are typical. The values for 1/Δτ = 100, 200, and 300 Hz are presented in Tables 3 and 4.
Figure 9: The standard estimate of the entropy rate in the in vivo periodic current injection case for the corresponding spike train of Figure 7, obtained by extrapolating the linear trend of H(l, Δτ) versus 1/T (T = lΔτ, the window length in time) to infinitely long windows. The time bins were Δτ = 0.01, 0.005, and 0.0033 sec (coding frequency = 100, 200, and 300 Hz, respectively). The corresponding estimations are presented in Table 1.
Table 1: H(Δτ) for Periodic Current Injection (in vivo).

    Coding Frequency    Standard    Complexity
    100 Hz              41.38       42.93
    200 Hz              59.20       60.40
    300 Hz              68.42       67.00

Table 2: H(Δτ) for Visual Stimulation.

    Coding Frequency    Standard    Complexity
    100 Hz              30.30       32.78
    200 Hz              47.85       50.14
    300 Hz              62.55       62.11
Figure 10: The standard estimate of the entropy rate in the in vivo visual stimulation with sinusoidal drifting gratings case for the corresponding spike train of Figure 7. The time bins were Δτ = 0.01, 0.005, and 0.0033 sec (coding frequency = 100, 200, and 300 Hz, respectively). The corresponding estimations are presented in Table 2.
Table 1 compares the entropy rate estimations obtained by the two methods for a typical neuron response to periodic current injection (in vivo). The agreement is remarkable in this case. The relative deviation of the estimations decreases from 8% for 1/Δτ = 100 Hz to below 1% for 1/Δτ = 300 Hz. Analogously, Table 2 shows the estimations obtained by both methods from a typical neuron response to visual stimulation with sinusoidal drifting gratings. Here the relative deviation falls slightly from 3% for 1/Δτ = 100 Hz to 2% for 1/Δτ = 200 and 300 Hz. Table 3 shows the results obtained for a spike train elicited by random current injection (in vitro); this spike train belongs to the samples with a slowly decaying autocorrelation function. The relative deviation of the estimations jumps from 2% for 1/Δτ = 100 and 200 Hz to roughly 5% for 1/Δτ = 300 Hz. We finish the comparison between the standard and the complexity-based entropy rate estimators with Table 4, which summarizes the results for a spike train with a fast-decaying autocorrelation function elicited by random current injection. This is the case where the agreement is worst, the relative deviation of the estimations fluctuating around the 15% level.
Figure 11: The standard estimate of the entropy rate in the in vitro random current injection with slowly decaying autocorrelation function case for the corresponding spike train of Figure 8. The time bins were Δτ = 0.01, 0.005, and 0.0033 sec (coding frequency = 100, 200, and 300 Hz, respectively). The corresponding estimations are presented in Table 3.
Table 3: H(Δτ) for Random Current Injection (Slow Decay).

    Coding Frequency    Standard    Complexity
    100 Hz              52.38       53.38
    200 Hz              68.69       67.23
    300 Hz              78.00       74.70

Table 4: H(Δτ) for Random Current Injection (Fast Decay).

    Coding Frequency    Standard    Complexity
    100 Hz              22.31       19.00
    200 Hz              27.75       24.39
    300 Hz              31.05       26.03
Figure 12: The standard estimate of the entropy rate in the in vitro random current injection with fast-decaying autocorrelation function case for the corresponding spike train of Figure 8. The time resolutions were Δτ = 0.01, 0.005, and 0.0033 sec (coding frequency = 100, 200, and 300 Hz, respectively). The corresponding estimations are presented in Table 4.
Summing up, the results are very satisfactory in both in vivo cases and in the in vitro case of random injection with a slowly decaying autocorrelation function. For in vitro random injection with a fast-decaying autocorrelation function, we find a 15% discrepancy, the normalized complexity rate always lying below the standard entropy rate estimations. We conjecture that this discrepancy is due to nonstationarity.

8 Conclusions

We propose the normalized Lempel-Ziv complexity (LZ-76, to be more precise) as a reliable estimator of the entropy of neuronal sources. The exact relation between the two concepts is given in equations 4.2 and 4.3. Our numerical simulations of two-state Markov processes with different transition probabilities show that the normalized complexity converges fast to the entropy; in section 6, we reported on two such simulations. A further advantage of the complexity approach is that there is no need to know the probability law of the process generating the signal, since the normalized
complexity is a property of individual sequences. More importantly, the numerical evidence gathered over time with Markov processes and partially reported in this article strengthens the validity of the complexity approach. Indeed, the entropy rate estimations of artificial registers by means of the complexity rate compare very favorably with those made with the extrapolation technique of Strong et al. (1998) and, moreover, outperform this latter method for some sequences too short for it to deliver an accurate estimation but long enough for the normalized complexity rate to be close to the entropy rate. In sum, we conclude that Lempel-Ziv complexity certainly belongs in the toolbox of computational neuroscience as a relevant measure of entropy.

Acknowledgments

This work was partially supported by the Polish-Spanish Scientific Cooperation Program (PAS-CSIC), grant 20/2001-2002, and by the Human Frontiers Science Program Organization and MCYT (BFI2002-03643) to M. V. Sanchez-Vives. I. Vajda (Czech Academy of Sciences) and M. Slater (University College London) read the draft and made valuable comments. Part of the experimental work included here was performed at D. A. McCormick's laboratory at Yale University. L. G. Nowak collaborated in the in vivo experiments, and T. Michalek assisted in the numerical post-processing of the data. We also thank the referees very much for their constructive criticism.

References

Amigó, J. M., Szczepański, J., Wajnryb, E., & Sanchez-Vives, M. V. (2003). On the number of states of the neuronal sources. BioSystems, 68, 57–66.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neurosci., 2, 947–957.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Lempel, A., & Ziv, J. (1976). On the complexity of an individual sequence. IEEE Trans. Inform. Theory, IT-22, 75–88.
MacKay, D., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Comp., 15, 1191–1253.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1998). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Sanchez-Vives, M. V., Nowak, L. G., & McCormick, D. A. (2000a). Membrane mechanisms underlying contrast adaptation in cat area 17 in vivo. J. Neurosci., 20, 4267–4285.
J. M. Amigó, J. Szczepański, E. Wajnryb, and M. Sanchez-Vives
Sanchez-Vives, M. V., Nowak, L. G., & McCormick, D. A. (2000b). Cellular mechanisms of long-lasting adaptation in visual cortical neurons in vitro. J. Neurosci., 20, 4286–4299.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. Journal, 27, 379–423, 623–656.
Skottun, B. C., DeValois, R. L., Grosof, D. H., Movshon, J. A., Albrecht, D. G., & Bonds, A. B. (1991). Classifying simple and complex cells on the basis of response modulation. Vision Res., 31, 1079–1086.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Ziv, J. (1978). Coding theorems for individual sequences. IEEE Trans. Inform. Theory, IT-24, 405–412.
Ziv, J., & Lempel, A. (1978). Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, IT-24, 530–536.

Received September 12, 2002; accepted September 5, 2003.
LETTER
Communicated by Ad Aertsen
Investigation of Possible Neural Architectures Underlying Information-Geometric Measures

Masami Tatsuno
[email protected]
ARL Division of Neural Systems, Memory and Aging, The University of Arizona, Tucson, AZ 85724-5115, U.S.A., and Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Saitama 351-0198, Japan
Masato Okada
[email protected]
Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Saitama 351-0198, Japan, and Intelligent Cooperation and Control, PRESTO, JST, c/o RIKEN BSI, Saitama 351-0198, Japan
A novel analytical method based on information geometry was recently proposed, and this method may provide useful insights into the statistical interactions within neural groups. The link between information-geometric measures and the structure of neural interactions has not yet been elucidated, however, because of the ill-posed nature of the problem. Here, possible neural architectures underlying information-geometric measures are investigated using an isolated pair and an isolated triplet of model neurons. By assuming the existence of equilibrium states, we derive analytically the relationship between the information-geometric parameters and these simple neural architectures. For symmetric networks, the first- and second-order information-geometric parameters represent, respectively, the external input and the underlying connections between the neurons, provided that the number of neurons used in the parameter estimation in the log-linear model and the number of neurons in the network are the same. For asymmetric networks, however, these parameters depend on both the intrinsic connections and the external inputs to each neuron. In addition, we derive the relation between the information-geometric parameter corresponding to the two-neuron interaction and a conventional cross-correlation measure. We also show that the information-geometric parameters vary depending on the number of neurons assumed for parameter estimation in the log-linear model. This finding suggests a need to examine the information-geometric method carefully. A possible criterion for choosing an appropriate orthogonal coordinate is also discussed. This article points out the importance of a model-based approach and sheds light on the possible neural structure underlying the application of information geometry to neural network analysis. Neural Computation 16, 737–765 (2004)
© 2004 Massachusetts Institute of Technology
1 Introduction

The elucidation of information representation in neural firing patterns has been a main focus in neuroscience. Recent developments in experimental technology have enabled the simultaneous recording of multineuron activity, and a large amount of data has been accumulated (Wilson & McNaughton, 1993; Riehle, Grün, Diesmann, & Aertsen, 1997; Chapin, Moxon, Markowitz, & Nicolelis, 1999; Hampson, Simeral, & Deadwyler, 1999; Kudrimoti, Barnes, & McNaughton, 1999; Laubach, Wessberg, & Nicolelis, 2000; Wessberg et al., 2000; Hoffman & McNaughton, 2002). Powerful analytical methods will be required to understand how information is represented within these large populations of neurons. To this end, there have been several theoretical attempts (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Gerstein, 1998; Koch & Segev, 1998; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Panzeri & Schultz, 2001), including those applying the statistical analytical methods known as JPSTH (Aertsen, Gerstein, Habib, & Palm, 1989) and unitary event analysis (Grün, 1996; Grün & Diesmann, 2000; Grün, Diesmann, & Aertsen, 2002a, 2002b). More recently, a novel analytical method based on information geometry (Amari, 1985; Amari & Nagaoka, 2000) was proposed (Amari, 2001) and was extended to make it applicable to real neuron data (Nakahara et al., 2001; Nakahara & Amari, 2002a, 2002b). The potential advantages of the information-geometric measure can be summarized as follows. (1) The orthogonal coordinates spanned by the mean firing rate (and the mean coincident firing rate in the case of networks of more than two neurons) and information-geometric parameters can be systematically constructed. (2) Information in the joint spike trains can be parsed into independent contributions carried by the mean firing rate (and the mean coincident firing rates in the case of three or more neurons) and the information-geometric parameters.
(3) A biologically plausible null hypothesis can be used for statistical testing. In other words, any neural firing, not only independent or noncorrelated neural firing, can be taken as the null hypothesis. This property is useful because in real experiments, one may observe correlated neural activity in the baseline recording when no external stimulus has been explicitly presented to the animal. In this case, statistical tests based on the null hypothesis of independent or noncorrelated firing are inappropriate because the preexisting, intrinsic correlations are the rule rather than the exception. Instead, the neural activity in the baseline recording should be taken as the null hypothesis. The information-geometric measure is able to handle this situation. (4) Systematic extension to any number of neuronal interactions is possible in a straightforward manner by adding extra terms to the log-linear model. In this way, the information-geometric measure has been proposed as a powerful analytical tool for multiple-spike data.

Despite these potential advantages of the information-geometric approach, the possible relationships between the information-geometric measures and the underlying neural architectures have not yet been elucidated because more than one neural architecture could produce the same information-geometric parameters. In other words, determination of the underlying neural architectures can be considered an ill-posed inverse problem, such as reconstruction of an original three-dimensional image from a two-dimensional image on a retina, whereas extracting the neural interaction through information geometry can be considered a forward problem. Furthermore, as we will elucidate below, a single information-geometric parameter for a two-neuron interaction cannot be derived for asymmetric, bidirectional connections between two neurons. To investigate this difficult problem, we considered an isolated pair and an isolated triplet of model neurons that obey stochastic dynamic equations. Under the assumption that an equilibrium state exists, we elucidated analytically the possible neural architectures underlying the information-geometric parameters.

The outline of this letter is as follows. In section 2, we provide a brief introduction to the information-geometric approach with an emphasis on two-neuron and three-neuron networks. In section 3, we introduce the model neuron and define its stochastic dynamics. Section 4 provides an analytical relationship between the underlying neural architecture parameters and the corresponding information-geometric parameters for the two-neuron network. We also consider the traditional correlation functions in this section. In section 5, we investigate the three-neuron network and derive an analytical expression of the information-geometric parameters. In addition, we clarify the difference between the two-neuron information-geometric parameters and the three-neuron information-geometric parameters. We then summarize and discuss our results in section 6. The preliminary results of this study have been reported elsewhere (Tatsuno & Okada, 2002, 2003).
2 Information-Geometric Measure

Let us briefly introduce the analytical method based on information geometry (Amari, 1985; Amari & Nagaoka, 2000). Utilizing the orthogonality of the natural and expectation parameters in the exponential family of distributions, Amari (2001) proposed a novel method for analyzing a population of firing neurons. Techniques for hypothesis testing of the neural interaction and calculation of the mutual information were also formulated (Nakahara et al., 2001; Nakahara & Amari, 2002a, 2002b). In this analytical method, the spike train is represented by binary random variables, so that expansion of the probability distribution of neural firing using the log-linear model can be performed exactly.

To illustrate the method, let us consider an example of a two-neuron system. Let $X_1$ and $X_2$ be two binary variables whose joint probability is given by $p_{ij} = \mathrm{Prob}\{x_1 = i;\, x_2 = j\} \geq 0$, $i, j = 0, 1$, with the constraint $p_{00} + p_{01} + p_{10} + p_{11} = 1$. Let us then expand $\log p(x_1, x_2)$ in a polynomial of $x_1$ and $x_2$ as follows:

$$\log p(x_1, x_2) = \theta_1^{(2)} x_1 + \theta_2^{(2)} x_2 + \theta_{12}^{(2)} x_1 x_2 - \psi^{(2)}. \qquad (2.1)$$
This is called the log-linear model, and it is an exact expansion since $x_i$ takes the binary values 0 or 1. The $\theta^{(2)}$ coefficients and $\psi^{(2)}$ are given by

$$\theta_1^{(2)} = \log\frac{p_{10}}{p_{00}}, \qquad \theta_2^{(2)} = \log\frac{p_{01}}{p_{00}}, \qquad \theta_{12}^{(2)} = \log\frac{p_{11}\, p_{00}}{p_{01}\, p_{10}}, \qquad \psi^{(2)} = -\log p_{00}. \qquad (2.2)$$
Here, the $\theta^{(2)}$'s are the natural parameters in the exponential family and are called information-geometric parameters. $\theta_{12}^{(2)}$, for example, represents the two-neuron interaction between $X_1$ and $X_2$ and is considered to be orthogonal to the mean firing rate parameters $(\langle x_1\rangle, \langle x_2\rangle)$, as shown below. There are several coordinate systems that can be used to represent two-neuron firing, such as the coordinate system of the mean firing rates and the mean coincident firing rate $(\langle x_1\rangle, \langle x_2\rangle, \langle x_1 x_2\rangle)$, and the coordinate system of the natural parameters $(\theta_1^{(2)}, \theta_2^{(2)}, \theta_{12}^{(2)})$. In the information-geometric framework, however, the mixed coordinate system $(\langle x_1\rangle, \langle x_2\rangle, \theta_{12}^{(2)})$ is used because $\theta_{12}^{(2)}$ is considered to be orthogonal to $\langle x_1\rangle$ and $\langle x_2\rangle$. Here, the orthogonality is defined by the fact that a small change of $\log p(x_1, x_2)$ in the direction of $\langle x_i\rangle$ is not correlated with a small change of $\log p(x_1, x_2)$ in the direction of $\theta_{12}^{(2)}$. The small change can be represented by $\frac{\partial}{\partial \langle x_i\rangle} \log p(x_1, x_2; \langle x_1\rangle, \langle x_2\rangle, \theta_{12}^{(2)})$ in the $\langle x_i\rangle$ direction, and by $\frac{\partial}{\partial \theta_{12}^{(2)}} \log p(x_1, x_2; \langle x_1\rangle, \langle x_2\rangle, \theta_{12}^{(2)})$ in the $\theta_{12}^{(2)}$ direction. Then, if

$$E\left[\frac{\partial}{\partial \langle x_i\rangle} \log p(x_1, x_2; \langle x_1\rangle, \langle x_2\rangle, \theta_{12}^{(2)})\; \frac{\partial}{\partial \theta_{12}^{(2)}} \log p(x_1, x_2; \langle x_1\rangle, \langle x_2\rangle, \theta_{12}^{(2)})\right] = 0 \qquad (2.3)$$
holds, where $E$ is the expectation with respect to $p(x_1, x_2; \langle x_1\rangle, \langle x_2\rangle, \theta_{12}^{(2)})$, these directions are said to be orthogonal to each other. And indeed, it is easy to show that $\theta_{12}^{(2)}$ and $\langle x_i\rangle$ are mutually orthogonal. In the three-neuron system, the probability $p(x_1, x_2, x_3)$ is expanded as

$$\log p(x_1, x_2, x_3) = \theta_1^{(3)} x_1 + \theta_2^{(3)} x_2 + \theta_3^{(3)} x_3 + \theta_{12}^{(3)} x_1 x_2 + \theta_{23}^{(3)} x_2 x_3 + \theta_{13}^{(3)} x_1 x_3 + \theta_{123}^{(3)} x_1 x_2 x_3 - \psi^{(3)}. \qquad (2.4)$$
The $\theta^{(3)}$ coefficients and $\psi^{(3)}$ are given by

$$\theta_1^{(3)} = \log\frac{p_{100}}{p_{000}}, \qquad \theta_2^{(3)} = \log\frac{p_{010}}{p_{000}}, \qquad \theta_3^{(3)} = \log\frac{p_{001}}{p_{000}},$$
$$\theta_{12}^{(3)} = \log\frac{p_{110}\, p_{000}}{p_{100}\, p_{010}}, \qquad \theta_{23}^{(3)} = \log\frac{p_{011}\, p_{000}}{p_{010}\, p_{001}}, \qquad \theta_{13}^{(3)} = \log\frac{p_{101}\, p_{000}}{p_{100}\, p_{001}},$$
$$\theta_{123}^{(3)} = \log\frac{p_{111}\, p_{100}\, p_{010}\, p_{001}}{p_{110}\, p_{101}\, p_{011}\, p_{000}}, \qquad \psi^{(3)} = -\log p_{000}. \qquad (2.5)$$
Again, the $\theta^{(3)}$'s are the natural parameters in the exponential family; $\theta_{123}^{(3)}$, for example, represents the three-neuron interaction of $X_1$, $X_2$, and $X_3$. There are two orthogonal coordinate systems that can be used for a three-neuron network. One is $(\langle x_1\rangle, \langle x_2\rangle, \langle x_3\rangle, \langle x_1 x_2\rangle, \langle x_2 x_3\rangle, \langle x_1 x_3\rangle; \theta_{123}^{(3)})$. The other is $(\langle x_1\rangle, \langle x_2\rangle, \langle x_3\rangle; \theta_{12}^{(3)}, \theta_{23}^{(3)}, \theta_{13}^{(3)}, \theta_{123}^{(3)})$. In the former, $(\langle x_1\rangle, \langle x_2\rangle, \langle x_3\rangle, \langle x_1 x_2\rangle, \langle x_2 x_3\rangle, \langle x_1 x_3\rangle)$ and $\theta_{123}^{(3)}$ are orthogonal to each other, and in the latter, $(\langle x_1\rangle, \langle x_2\rangle, \langle x_3\rangle)$ and $(\theta_{12}^{(3)}, \theta_{23}^{(3)}, \theta_{13}^{(3)}, \theta_{123}^{(3)})$ are mutually orthogonal. Furthermore, the extension to a general $N$-neuron system is possible by expanding $p(x_1, x_2, \ldots, x_N)$ as

$$\log p(x_1, x_2, \ldots, x_N) = \sum_i \theta_i^{(N)} x_i + \sum_{i<j} \theta_{ij}^{(N)} x_i x_j + \sum_{i<j<k} \theta_{ijk}^{(N)} x_i x_j x_k + \cdots + \theta_{1,\ldots,N}^{(N)} x_1 \cdots x_N - \psi^{(N)}, \qquad (2.6)$$
where $\theta_{12,\ldots,k}^{(N)}$ represents the $k$th-order neuronal interactions. As was seen for the two- and three-neuron networks, $\theta_{12,\ldots,k}^{(N)}$ is orthogonal to $\langle x_i\rangle, \langle x_i x_j\rangle, \ldots, \langle x_1 \cdots x_{k-1}\rangle$, so that hierarchically organized orthogonal coordinate systems can be constructed. By using this hierarchical relationship, the method based on information geometry can independently separate different orders of interaction; therefore, it should provide a useful tool for spike train analysis.
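The mapping from joint firing probabilities to information-geometric parameters in equation 2.2 is straightforward to implement directly. The following sketch (plain NumPy; the function name is ours, not from any published toolbox) computes the two-neuron $\theta$'s and illustrates, in the simplest case, why $\theta_{12}^{(2)}$ isolates the pure interaction: any product (independent) distribution yields $\theta_{12}^{(2)} = 0$.

```python
import numpy as np

def theta_two_neuron(p):
    """Two-neuron information-geometric parameters (equation 2.2).

    p[i, j] = Prob{x1 = i, x2 = j}; p must sum to 1 and be strictly positive.
    Returns (theta1, theta2, theta12, psi).
    """
    theta1 = np.log(p[1, 0] / p[0, 0])
    theta2 = np.log(p[0, 1] / p[0, 0])
    theta12 = np.log(p[1, 1] * p[0, 0] / (p[0, 1] * p[1, 0]))
    psi = -np.log(p[0, 0])
    return theta1, theta2, theta12, psi

# Two independent neurons firing with probabilities a and b:
a, b = 0.3, 0.6
p = np.outer([1 - a, a], [1 - b, b])   # p[i, j] = P(x1 = i) P(x2 = j)
theta1, theta2, theta12, psi = theta_two_neuron(p)
# Independence gives theta12 = 0, and theta_i reduces to the firing log-odds.
```

For this product distribution, $\theta_1^{(2)} = \log(a/(1-a))$ and $\theta_{12}^{(2)} = 0$; a dependent distribution would move only the interaction term while the mixed coordinates $(\langle x_1\rangle, \langle x_2\rangle)$ can be held fixed.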
3 Model

We used an isolated pair and an isolated triplet of stochastic model neurons in this study because this simple setting allowed us to investigate the system with mathematical clarity.

Following Ginzburg and Sompolinsky (1994), the state of each neuron at time $t$ takes one of two states, denoted by $S_i(t) = 0, 1$, where $i = 1, 2$ for the two-neuron network and $i = 1, 2, 3$ for the three-neuron network. The total input to the $i$th neuron at time $t$ is given by

$$u_i(t) = J_{ij} S_j(t) + h_i \quad \text{(two-neuron network)}, \qquad u_i(t) = J_{ij} S_j(t) + J_{ik} S_k(t) + h_i \quad \text{(three-neuron network)}, \qquad (3.1)$$
where $J_{ij}$ denotes the synaptic connection from the $j$th presynaptic neuron to the $i$th postsynaptic one, and $h_i$ represents the external input to the $i$th neuron. The neurons are assumed to have a stochastic nature due to exposure to local noise. The neuron dynamics is determined by the transition rate between the binary states. This transition rate $w$ is written as a function of the total input and the stochastic noise,

$$w(S_i = 0 \to S_i = 1) = \frac{1}{\tau_0}\, g(u_i), \qquad w(S_i = 1 \to S_i = 0) = \frac{1}{\tau_0}\, (1 - g(u_i)),$$
$$w(S_i = 0 \to S_i = 0) = 1 - w(0 \to 1), \qquad w(S_i = 1 \to S_i = 1) = 1 - w(1 \to 0), \qquad (3.2)$$
where $\tau_0$ is a microscopic characteristic time, and $g(u_i)$ is a monotonically increasing, differentiable function such as $g(u_i) = (1 + \tanh(u_i))/2$. For the two-neuron network, the probability of finding the system in a state $\{S^{(2)}\} = (S_1, S_2)$ at time $t$ can be described by the following master equation,

$$\frac{d}{dt} P(S_1, S_2, t) = -w(S_1 \to (1 - S_1))\, P(S_1, S_2, t) - w(S_2 \to (1 - S_2))\, P(S_1, S_2, t) + w((1 - S_1) \to S_1)\, P((1 - S_1), S_2, t) + w((1 - S_2) \to S_2)\, P(S_1, (1 - S_2), t), \qquad (3.3)$$
while the probability of finding the system in a state $\{S^{(3)}\} = (S_1, S_2, S_3)$ in the three-neuron network can be given by

$$\frac{d}{dt} P(S_1, S_2, S_3, t) = -w(S_1 \to (1 - S_1))\, P(S_1, S_2, S_3, t) - w(S_2 \to (1 - S_2))\, P(S_1, S_2, S_3, t) - w(S_3 \to (1 - S_3))\, P(S_1, S_2, S_3, t) + w((1 - S_1) \to S_1)\, P((1 - S_1), S_2, S_3, t) + w((1 - S_2) \to S_2)\, P(S_1, (1 - S_2), S_3, t) + w((1 - S_3) \to S_3)\, P(S_1, S_2, (1 - S_3), t). \qquad (3.4)$$
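Because the master equations 3.3 and 3.4 are linear in the $2^N$ state probabilities, the equilibrium distribution can be obtained numerically as the null vector of a transition-rate matrix. The sketch below (our own construction, assuming $g(u) = (1 + \tanh(u))/2$ and setting $\tau_0 = 1$, which only rescales time) builds that matrix for the three-neuron network; with all connections and inputs at zero it recovers the uniform distribution over the eight states.

```python
import itertools
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

def stationary_distribution(J, h):
    """Equilibrium of the three-neuron master equation (equation 3.4).

    J[i][j] is the connection from neuron j to neuron i (J[i][i] = 0),
    h[i] the external input. States are binary tuples (S1, S2, S3).
    """
    states = list(itertools.product([0, 1], repeat=3))
    Q = np.zeros((8, 8))             # Q[b, a] = rate from state a to state b
    for a, s in enumerate(states):
        for i in range(3):
            u = sum(J[i][j] * s[j] for j in range(3)) + h[i]
            rate = g(u) if s[i] == 0 else 1.0 - g(u)   # equation 3.2, tau0 = 1
            t = list(s)
            t[i] = 1 - t[i]                            # single-neuron flip
            b = states.index(tuple(t))
            Q[b, a] += rate
            Q[a, a] -= rate
    # Solve Q p = 0 subject to sum(p) = 1 (least squares on the stacked system).
    A = np.vstack([Q, np.ones(8)])
    rhs = np.zeros(9)
    rhs[-1] = 1.0
    p = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return states, p

# An arbitrary asymmetric example (illustrative values only):
J = [[0.0, 0.5, 0.2], [0.3, 0.0, 0.1], [0.0, 0.4, 0.0]]
h = [0.1, -0.2, 0.0]
states, p = stationary_distribution(J, h)
```

This brute-force route is used below only as a check on the closed-form results; the analytical expressions of sections 4 and 5 are what the letter is actually after.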
4 Two-Neuron Network

In this section, we derive the analytical expression for the two-neuron information-geometric parameters and the correlation function in terms of the neural architecture parameters $J_{21}$, $J_{12}$, $h_1$, and $h_2$.

4.1 Mean Firing Rate and Mean Coincident Firing Rate. Following Ginzburg and Sompolinsky (1994), the evolution of the mean firing rate of the $i$th neuron is given by

$$\frac{d}{dt} \langle S_i(t)\rangle = \sum_{\{S^{(2)}\}} S_i\, \frac{d}{dt} P(S_1, S_2, t). \qquad (4.1)$$
Substituting equation 3.3 into equation 4.1 yields

$$\tau_0\, \frac{d}{dt} \langle S_i(t)\rangle = -\langle S_i(t)\rangle + \langle g(u_i(t))\rangle. \qquad (4.2)$$
For the evolution of the mean coincident firing rate, a similar calculation yields

$$\tau_0\, \frac{d}{dt} \langle S_i(t) S_j(t)\rangle = -2\langle S_i(t) S_j(t)\rangle + \langle S_i(t)\, g(u_j(t))\rangle + \langle S_j(t)\, g(u_i(t))\rangle. \qquad (4.3)$$
At the equilibrium limit, the mean firing rate obeys

$$\langle S_i\rangle = \langle g(u_i)\rangle = \langle g(J_{ij} S_j + h_i)\rangle. \qquad (4.4)$$
By taking into account the relationship

$$g(J_{ij} S_j + h_i) = S_j\, g(J_{ij} + h_i) + (1 - S_j)\, g(h_i), \qquad (4.5)$$
we obtain an analytical expression of the mean firing rate in terms of the neural architecture parameters,

$$\langle S_1\rangle = \frac{g(h_1) + \Delta g_1\, g(h_2)}{1 - \Delta g_1\, \Delta g_2}, \qquad (4.6)$$
$$\langle S_2\rangle = \frac{g(h_2) + \Delta g_2\, g(h_1)}{1 - \Delta g_1\, \Delta g_2}, \qquad (4.7)$$

where $\Delta g_1 = g(J_{12} + h_1) - g(h_1)$ and $\Delta g_2 = g(J_{21} + h_2) - g(h_2)$, respectively. Through a simple calculation, the mean coincident firing rate in the equilibrium state is also obtained as

$$\langle S_1 S_2\rangle = \frac{1}{2}\, \{\langle S_1\rangle\, g(J_{21} + h_2) + \langle S_2\rangle\, g(J_{12} + h_1)\}. \qquad (4.8)$$
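Equations 4.6 through 4.8 can be checked against a brute-force solution of the master equation 3.3. In the sketch below (assuming $g(u) = (1 + \tanh(u))/2$; all function names are ours), the closed-form moments agree with the stationary distribution of the full four-state system:

```python
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

def analytic_moments(J21, J12, h1, h2):
    """Equilibrium <S1>, <S2>, <S1 S2> from equations 4.6-4.8."""
    dg1 = g(J12 + h1) - g(h1)                      # Delta g_1
    dg2 = g(J21 + h2) - g(h2)                      # Delta g_2
    den = 1.0 - dg1 * dg2
    m1 = (g(h1) + dg1 * g(h2)) / den               # equation 4.6
    m2 = (g(h2) + dg2 * g(h1)) / den               # equation 4.7
    m12 = 0.5 * (m1 * g(J21 + h2) + m2 * g(J12 + h1))   # equation 4.8
    return m1, m2, m12

def brute_force_moments(J21, J12, h1, h2):
    """Same moments from the stationary distribution of equation 3.3."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    Q = np.zeros((4, 4))
    for a, (s1, s2) in enumerate(states):
        # Single-neuron flips with the rates of equation 3.2 (tau0 = 1).
        for s, u, target in [(s1, J12 * s2 + h1, (1 - s1, s2)),
                             (s2, J21 * s1 + h2, (s1, 1 - s2))]:
            rate = g(u) if s == 0 else 1.0 - g(u)
            b = states.index(target)
            Q[b, a] += rate
            Q[a, a] -= rate
    A = np.vstack([Q, np.ones(4)])
    rhs = np.zeros(5)
    rhs[-1] = 1.0
    p = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return p[2] + p[3], p[1] + p[3], p[3]          # <S1>, <S2>, <S1 S2>

params = (1.5, 1.0, 0.5, 0.0)                      # (J21, J12, h1, h2)
```

The agreement is exact (up to floating point) because the factorization in equation 4.5 is an identity for binary variables, not an approximation.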
4.2 Information-Geometric Measure and Equal-Time Correlation Function. The correlation function between the activity of two neurons is defined as

$$C_{ij}(t, t + \tau) \equiv \langle \delta S_i(t)\, \delta S_j(t + \tau)\rangle, \qquad \tau \geq 0, \qquad (4.9)$$
where $\delta S_i(t) \equiv S_i(t) - \langle S_i(t)\rangle$. Here, note that the restriction $\tau \geq 0$ arises from the derivation of the time evolution of $C_{ij}(t, t + \tau)$, which can be done by estimating the probability at a time $t$ and at a later time $t + \tau$. After obtaining the analytical form of $C_{ij}(t, t + \tau)$, we can use the well-known relation $C_{ij}(t, t + \tau) = C_{ji}(t, t - \tau)$ for negative $\tau$. Following Ginzburg and Sompolinsky (1994), the evolution of the equal-time ($\tau = 0$) correlation function is given as

$$\tau_0\, \frac{d}{dt} C_{ij}(t, t) = -2 C_{ij}(t, t) + \langle \delta S_i(t)\, \delta g(u_j(t))\rangle + \langle \delta S_j(t)\, \delta g(u_i(t))\rangle, \qquad (4.10)$$
where $\delta g(u(t)) \equiv g(u(t)) - \langle g(u(t))\rangle$. The equal-time correlation function at the equilibrium limit, denoted as $C_{ij}(0)$, is then given as

$$C_{ij}(0) = \frac{1}{2}\, \{\langle \delta S_i\, \delta g(u_j)\rangle + \langle \delta S_j\, \delta g(u_i)\rangle\}. \qquad (4.11)$$
This equation is further reduced to

$$C_{ii}(0) = \langle S_i\rangle\, (1 - \langle S_i\rangle), \qquad C_{ij}(0) = \frac{1}{2}\, \{C_{ii}(0)\, \Delta g_j + C_{jj}(0)\, \Delta g_i\}, \qquad (i \neq j). \qquad (4.12)$$
Together with equations 4.6 through 4.8, the above equations are the analytical expressions of the correlation function through the neural architecture parameters $J_{21}$, $J_{12}$, $h_1$, and $h_2$. For the two-neuron information-geometric measure, the relationship between the probability of each event $p_{ij}$ and the mean firing rates $\langle S_1\rangle$ and $\langle S_2\rangle$, and the mean coincident firing rate $\langle S_1 S_2\rangle$ is given by

$$p_{00} = 1 - \langle S_1\rangle - \langle S_2\rangle + \langle S_1 S_2\rangle, \qquad p_{01} = \langle S_2\rangle - \langle S_1 S_2\rangle,$$
$$p_{10} = \langle S_1\rangle - \langle S_1 S_2\rangle, \qquad p_{11} = \langle S_1 S_2\rangle. \qquad (4.13)$$
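As a consistency check, the reduced form of $C_{12}(0)$ in equation 4.12 agrees with the direct definition $C_{12}(0) = \langle S_1 S_2\rangle - \langle S_1\rangle\langle S_2\rangle$ once the moments of equations 4.6 through 4.8 are substituted. A minimal sketch ($g(u) = (1 + \tanh(u))/2$ assumed; the parameter values are illustrative only):

```python
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

J21, J12, h1, h2 = 1.5, 1.0, 0.5, 0.0
dg1, dg2 = g(J12 + h1) - g(h1), g(J21 + h2) - g(h2)
den = 1.0 - dg1 * dg2
m1 = (g(h1) + dg1 * g(h2)) / den                    # <S1>, equation 4.6
m2 = (g(h2) + dg2 * g(h1)) / den                    # <S2>, equation 4.7
m12 = 0.5 * (m1 * g(J21 + h2) + m2 * g(J12 + h1))   # <S1 S2>, equation 4.8

c11 = m1 * (1.0 - m1)                               # C_11(0), equation 4.12
c22 = m2 * (1.0 - m2)
c12_reduced = 0.5 * (c11 * dg2 + c22 * dg1)         # reduced form, equation 4.12
c12_direct = m12 - m1 * m2                          # direct covariance definition
```

The two routes coincide identically, which is a useful sanity check when translating the formulas into code.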
The equal-time two-neuron information-geometric parameters $\theta^{(2)}(0)$'s are then rewritten as

$$\theta_1^{(2)}(0) = \log\frac{\langle S_1\rangle - \langle S_1 S_2\rangle}{1 - \langle S_1\rangle - \langle S_2\rangle + \langle S_1 S_2\rangle}, \qquad \theta_2^{(2)}(0) = \log\frac{\langle S_2\rangle - \langle S_1 S_2\rangle}{1 - \langle S_1\rangle - \langle S_2\rangle + \langle S_1 S_2\rangle},$$
$$\theta_{12}^{(2)}(0) = \log\frac{\langle S_1 S_2\rangle\, (1 - \langle S_1\rangle - \langle S_2\rangle + \langle S_1 S_2\rangle)}{(\langle S_1\rangle - \langle S_1 S_2\rangle)\, (\langle S_2\rangle - \langle S_1 S_2\rangle)}, \qquad (4.14)$$
where the mean firing rates and the mean coincident firing rate are given by equations 4.6 through 4.8. These equations are the analytical expressions of the two-neuron information-geometric parameters $\theta^{(2)}(0)$'s in terms of the neural architecture parameters $J_{21}$, $J_{12}$, $h_1$, and $h_2$. In the case of symmetric connections ($J_{21} = J_{12} = J$), the above equations are significantly reduced to the simplified expressions

$$\theta_1^{(2)}(0) = 2 h_1, \qquad \theta_2^{(2)}(0) = 2 h_2, \qquad \theta_{12}^{(2)}(0) = 2 J. \qquad (4.15)$$
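Equation 4.15 is easy to confirm numerically: for a symmetric network with $g(u) = (1 + \tanh(u))/2$, plugging the equilibrium moments (equations 4.6 through 4.8) into equation 4.14 returns exactly $(2h_1, 2h_2, 2J)$. A sketch (function name and parameter values are ours):

```python
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

def theta_zero_lag(J21, J12, h1, h2):
    """theta^(2)(0) from the architecture via equations 4.6-4.8 and 4.14."""
    dg1, dg2 = g(J12 + h1) - g(h1), g(J21 + h2) - g(h2)
    den = 1.0 - dg1 * dg2
    m1 = (g(h1) + dg1 * g(h2)) / den
    m2 = (g(h2) + dg2 * g(h1)) / den
    m12 = 0.5 * (m1 * g(J21 + h2) + m2 * g(J12 + h1))
    p00 = 1.0 - m1 - m2 + m12                  # equation 4.13
    t1 = np.log((m1 - m12) / p00)              # equation 4.14
    t2 = np.log((m2 - m12) / p00)
    t12 = np.log(m12 * p00 / ((m1 - m12) * (m2 - m12)))
    return t1, t2, t12

# Symmetric connections: equation 4.15 predicts (2 h1, 2 h2, 2 J).
J, h1, h2 = 0.5, 0.3, -0.2
t1, t2, t12 = theta_zero_lag(J, J, h1, h2)
```

With symmetric $J$ the dynamics satisfies detailed balance with respect to a Boltzmann distribution, which is why the reduction is exact rather than approximate.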
Equation 4.15 shows that $\theta_1^{(2)}(0)$ and $\theta_2^{(2)}(0)$ reflect only the external inputs $h_1$ and $h_2$, respectively, and $\theta_{12}^{(2)}(0)$ reflects only the symmetric connections. Therefore, an observed change in $\theta_{12}^{(2)}(0)$, for example, indicates that the synaptic connection strength between the two neurons has been modified. In the case of asymmetric connections, however, this nice relation no longer holds. The information-geometric parameters $\theta^{(2)}(0)$'s depend on both the external inputs and the bidirectional synaptic connections in the complicated manner described above.

The relationship between the information-geometric parameter $\theta_{12}^{(2)}(0)$ and the correlation function $C_{ij}(0)$ is also derived as

$$\theta_{12}^{(2)}(0) = \log(C_{12}(0) + \langle S_1\rangle\langle S_2\rangle) + \log(C_{12}(0) + 1 - \langle S_1\rangle - \langle S_2\rangle + \langle S_1\rangle\langle S_2\rangle) - \log(-C_{12}(0) + \langle S_1\rangle - \langle S_1\rangle\langle S_2\rangle) - \log(-C_{12}(0) + \langle S_2\rangle - \langle S_1\rangle\langle S_2\rangle). \qquad (4.16)$$

We can easily show that between $\theta_{12}^{(2)}(0)$ and $C_{ij}(0)$, there is a nonlinear but monotonically increasing relationship.

To illustrate the dependence of $\theta_1^{(2)}(0)$, $\theta_2^{(2)}(0)$, and $\theta_{12}^{(2)}(0)$ on the underlying asymmetric connections, here we show an example where only $J_{21} = \Delta J$ is modified while all other neural architecture parameters, $J_{12}$, $h_1$, and $h_2$, are kept at zero. In this case, the theoretical values of the $\theta^{(2)}(0)$'s are
Figure 1: Dependence of the two-neuron information-geometric parameters $\theta_1^{(2)}(0)$, $\theta_2^{(2)}(0)$, and $\theta_{12}^{(2)}(0)$ on the unidirectional connection ($J_{21} = \Delta J$), where $\Delta J$ is modified between 0 and 10. All the other neural architecture parameters ($J_{12}$, $h_1$, and $h_2$) were set to zero. The solid, dashed, and dotted lines represent $\theta_1^{(2)}(0)$, $\theta_2^{(2)}(0)$, and $\theta_{12}^{(2)}(0)$, respectively. This example demonstrates that the asymmetric connection can modify all the two-neuron information-geometric parameters.
obtained as

$$\theta_1^{(2)}(0) = \log\frac{4 - 3\tanh(\Delta J)}{4 - \tanh(\Delta J)}, \qquad \theta_2^{(2)}(0) = \log\frac{4 + \tanh(\Delta J)}{4 - \tanh(\Delta J)},$$
$$\theta_{12}^{(2)}(0) = \log\frac{(4 + 3\tanh(\Delta J))\,(4 - \tanh(\Delta J))}{(4 - 3\tanh(\Delta J))\,(4 + \tanh(\Delta J))}, \qquad (4.17)$$
and the general shape is shown in Figure 1. By increasing $\Delta J$, we modify the values of all the information-geometric parameters, $\theta_1^{(2)}(0)$, $\theta_2^{(2)}(0)$, and $\theta_{12}^{(2)}(0)$. This indicates that a change in the asymmetric connection affects all the information-geometric parameters.

To illustrate the dependence of $\theta_{12}^{(2)}(0)$ on the external inputs, $h_1$ and $h_2$ are simultaneously modified for both the symmetric and asymmetric networks.
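The closed forms of equation 4.17 can be cross-checked against the general route of equations 4.6 through 4.8 and 4.14. A sketch ($g(u) = (1 + \tanh(u))/2$ assumed; both function names are ours):

```python
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

def theta_general(J21, J12, h1, h2):
    """theta^(2)(0) via the general route: equations 4.6-4.8 and 4.14."""
    dg1, dg2 = g(J12 + h1) - g(h1), g(J21 + h2) - g(h2)
    den = 1.0 - dg1 * dg2
    m1 = (g(h1) + dg1 * g(h2)) / den
    m2 = (g(h2) + dg2 * g(h1)) / den
    m12 = 0.5 * (m1 * g(J21 + h2) + m2 * g(J12 + h1))
    p00 = 1.0 - m1 - m2 + m12
    return (np.log((m1 - m12) / p00),
            np.log((m2 - m12) / p00),
            np.log(m12 * p00 / ((m1 - m12) * (m2 - m12))))

def theta_unidirectional(dJ):
    """Closed forms of equation 4.17 (J21 = dJ, all other parameters zero)."""
    t = np.tanh(dJ)
    return (np.log((4.0 - 3.0 * t) / (4.0 - t)),
            np.log((4.0 + t) / (4.0 - t)),
            np.log((4.0 + 3.0 * t) * (4.0 - t) / ((4.0 - 3.0 * t) * (4.0 + t))))

# The two routes agree along the sweep used for Figure 1.
for dJ in (0.5, 2.0, 5.0):
    assert np.allclose(theta_general(dJ, 0.0, 0.0, 0.0),
                       theta_unidirectional(dJ), atol=1e-10)
```

Sweeping `dJ` from 0 to 10 in `theta_unidirectional` reproduces the qualitative shape of Figure 1: all three parameters move together as the single asymmetric connection grows.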
Figure 2: Dependence of the two-neuron information-geometric parameter $\theta_{12}^{(2)}(0)$ on the external input ($h_1 = h_2 = \Delta h$). $\Delta h$ was modified between 0 and 2. The dashed line and dotted line were obtained from the symmetric connections $(J_{21}, J_{12}) = (0.5, 0.5)$ and $(J_{21}, J_{12}) = (0, 0)$, respectively. In these cases, $\theta_{12}^{(2)}(0)$ was not affected by $\Delta h$. Therefore, it can correctly depict the connection strength. On the other hand, if the connection is asymmetric, $\theta_{12}^{(2)}(0)$ is affected by the change in $\Delta h$. This is indicated by the solid curve obtained from $(J_{21}, J_{12}) = (5, 0)$.
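The behavior shown in Figure 2 can be reproduced directly from the formulas of this section: sweeping $\Delta h$ through equations 4.6 through 4.8 and 4.14, $\theta_{12}^{(2)}(0)$ stays pinned at $2J$ for a symmetric network but drifts with the input when the connection is asymmetric. A sketch ($g(u) = (1 + \tanh(u))/2$ assumed; the function name is ours):

```python
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

def theta12_zero_lag(J21, J12, h1, h2):
    """theta_12^(2)(0) via equations 4.6-4.8 and 4.14."""
    dg1, dg2 = g(J12 + h1) - g(h1), g(J21 + h2) - g(h2)
    den = 1.0 - dg1 * dg2
    m1 = (g(h1) + dg1 * g(h2)) / den
    m2 = (g(h2) + dg2 * g(h1)) / den
    m12 = 0.5 * (m1 * g(J21 + h2) + m2 * g(J12 + h1))
    p00 = 1.0 - m1 - m2 + m12
    return np.log(m12 * p00 / ((m1 - m12) * (m2 - m12)))

dh_values = np.linspace(0.0, 2.0, 9)
sym = np.array([theta12_zero_lag(0.5, 0.5, x, x) for x in dh_values])
asym = np.array([theta12_zero_lag(5.0, 0.0, x, x) for x in dh_values])
# sym stays at 2J = 1.0 for every input level; asym varies with dh.
```

This is the computational content of the symmetric/asymmetric distinction: only in the symmetric case can $\theta_{12}^{(2)}(0)$ be read off as a pure connection-strength measure.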
For simplicity, the setting $h_1 = h_2 = \Delta h$ is used. The result is shown in Figure 2. The connection strengths $(J_{21}, J_{12}) = (0.5, 0.5)$ and $(J_{21}, J_{12}) = (0, 0)$ were used as examples of a symmetric connection, and $(J_{21}, J_{12}) = (5, 0)$ was used as an example of an asymmetric connection. As indicated by theory, for the asymmetric connections, $\theta_{12}^{(2)}(0)$ was modified by the external inputs, while $\theta_{12}^{(2)}(0)$ for the symmetric connections was independent of the external inputs.

4.3 Information-Geometric Measure and Correlation Function with Time Delay. For the time-delayed correlation function, following Ginzburg and Sompolinsky (1994), the time evolution is given as

$$\tau_0\, \frac{d}{d\tau} C_{ij}(t, t + \tau) = -C_{ij}(t, t + \tau) + \langle \delta S_i(t)\, \delta g(u_j(t + \tau))\rangle, \qquad \tau \geq 0. \qquad (4.18)$$
The equilibrium limit, denoted by $C_{ij}(\tau)$, is obtained by taking the $t \to \infty$ limit,

$$\tau_0\, \frac{d}{d\tau} C_{ij}(\tau) = -C_{ij}(\tau) + \langle \delta S_i(t)\, \delta g(u_j(t + \tau))\rangle, \qquad (4.19)$$
which needs to be solved together with the initial condition $C_{ij}(0)$. This equation is further reduced to

$$\tau_0\, \frac{d}{d\tau} C_{ii}(\tau) = -C_{ii}(\tau) + \Delta g_i\, C_{ij}(\tau), \qquad \tau_0\, \frac{d}{d\tau} C_{ij}(\tau) = -C_{ij}(\tau) + \Delta g_j\, C_{ii}(\tau), \qquad (i \neq j), \qquad (4.20)$$
and the general solutions are obtained as

$$C_{ii}(\tau) = \frac{1}{2}\left\{\left(C_{ii}(0) + \sqrt{\frac{\Delta g_i}{\Delta g_j}}\, C_{ij}(0)\right) \exp\left[\left(-1 + \sqrt{\Delta g_i\, \Delta g_j}\right) \frac{\tau}{\tau_0}\right] + \left(C_{ii}(0) - \sqrt{\frac{\Delta g_i}{\Delta g_j}}\, C_{ij}(0)\right) \exp\left[\left(-1 - \sqrt{\Delta g_i\, \Delta g_j}\right) \frac{\tau}{\tau_0}\right]\right\}, \qquad (4.21)$$

$$C_{ij}(\tau) = \frac{1}{2} \sqrt{\frac{\Delta g_j}{\Delta g_i}} \left\{\left(C_{ii}(0) + \sqrt{\frac{\Delta g_i}{\Delta g_j}}\, C_{ij}(0)\right) \exp\left[\left(-1 + \sqrt{\Delta g_i\, \Delta g_j}\right) \frac{\tau}{\tau_0}\right] - \left(C_{ii}(0) - \sqrt{\frac{\Delta g_i}{\Delta g_j}}\, C_{ij}(0)\right) \exp\left[\left(-1 - \sqrt{\Delta g_i\, \Delta g_j}\right) \frac{\tau}{\tau_0}\right]\right\}, \qquad (i \neq j). \qquad (4.22)$$
Since $\Delta g_i$, $\Delta g_j$, $C_{ii}(0)$, and $C_{ij}(0)$ are written using the neural architecture parameters $J_{21}$, $J_{12}$, $h_1$, and $h_2$, these equations provide analytical expressions of the time-delayed correlation function in terms of these neural architecture parameters. Here, note that $C_{ij}(\tau) = C_{ji}(-\tau)$. Using $C_{ij}(\tau)$, the time-delayed two-neuron information-geometric parameter $\theta_{12}^{(2)}(\tau)$ is derived as

$$\theta_{12}^{(2)}(\tau) = \log(C_{12}(\tau) + \langle S_1\rangle\langle S_2\rangle) + \log(C_{12}(\tau) + 1 - \langle S_1\rangle - \langle S_2\rangle + \langle S_1\rangle\langle S_2\rangle) - \log(-C_{12}(\tau) + \langle S_1\rangle - \langle S_1\rangle\langle S_2\rangle) - \log(-C_{12}(\tau) + \langle S_2\rangle - \langle S_1\rangle\langle S_2\rangle). \qquad (4.23)$$
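Equations 4.21 through 4.23 translate directly into a short routine. The sketch below (assuming $g(u) = (1 + \tanh(u))/2$ and the Figure 4 parameter set; both Δg's must be nonzero for the square roots to be defined) computes $C_{12}(\tau)$ and $\theta_{12}^{(2)}(\tau)$ for $\tau \geq 0$:

```python
import numpy as np

def g(u):
    return 0.5 * (1.0 + np.tanh(u))

J21, J12, h1, h2, tau0 = 1.5, 1.0, 0.5, 0.0, 10.0
dg1, dg2 = g(J12 + h1) - g(h1), g(J21 + h2) - g(h2)
den = 1.0 - dg1 * dg2
m1 = (g(h1) + dg1 * g(h2)) / den                    # equations 4.6-4.7
m2 = (g(h2) + dg2 * g(h1)) / den
m12 = 0.5 * (m1 * g(J21 + h2) + m2 * g(J12 + h1))   # equation 4.8
c11, c22 = m1 * (1.0 - m1), m2 * (1.0 - m2)         # equation 4.12
c12_0 = 0.5 * (c11 * dg2 + c22 * dg1)

def c12(tau):
    """Time-delayed covariance C_12(tau) for tau >= 0 (equation 4.22)."""
    r = np.sqrt(dg1 / dg2)
    lam_slow = (-1.0 + np.sqrt(dg1 * dg2)) / tau0   # dominant decay mode
    lam_fast = (-1.0 - np.sqrt(dg1 * dg2)) / tau0
    a = c11 + r * c12_0
    b = c11 - r * c12_0
    return 0.5 / r * (a * np.exp(lam_slow * tau) - b * np.exp(lam_fast * tau))

def theta12(tau):
    """Time-delayed parameter theta_12^(2)(tau) (equation 4.23)."""
    c = c12(tau)
    return (np.log(c + m1 * m2)
            + np.log(c + 1.0 - m1 - m2 + m1 * m2)
            - np.log(-c + m1 - m1 * m2)
            - np.log(-c + m2 - m1 * m2))
```

At $\tau = 0$ this reproduces equation 4.16, and at large delays $C_{12}(\tau) \to 0$ so $\theta_{12}^{(2)}(\tau) \to 0$, consistent with the decaying curves in Figures 3 and 4.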
Since $C_{12}(\tau)$, $\langle S_1\rangle$, and $\langle S_2\rangle$ are written using the neural architecture parameters $J_{21}$, $J_{12}$, $h_1$, and $h_2$, equation 4.23 is the analytical expression of the time-delayed two-neuron information-geometric parameter $\theta_{12}^{(2)}(\tau)$ in terms of these neural architecture parameters.

To illustrate how the two-neuron information-geometric parameters behave in terms of the delay time $\tau$, $\theta_{12}^{(2)}(\tau)$ corresponding to the two sets of neural architecture parameters, $(J_{21}, J_{12}, h_1, h_2) = (0.717, 0.717, -0.549, 0.255)$ and $(J_{21}, J_{12}, h_1, h_2) = (5, 0, 0, 0)$, are plotted in Figure 3. With these neural architecture parameters, the same information-geometric parameters at time lag zero, $(\theta_1^{(2)}(0), \theta_2^{(2)}(0), \theta_{12}^{(2)}(0)) = (-1.10, 0.51, 1.43)$, were generated. This example demonstrates that the equal-time information-geometric parameters, $(\theta_1^{(2)}(0), \theta_2^{(2)}(0), \theta_{12}^{(2)}(0))$, are mapped from multiple neural architecture parameters $(J_{21}, J_{12}, h_1, h_2)$, and therefore the underlying neural architectures cannot be uniquely determined when using only the information at time lag zero. If we take into account the time-delayed information-geometric parameter $\theta_{12}^{(2)}(\tau)$, however, this ill-posed nature is reduced, and we will have a good estimate of the underlying neural architectures.

We also investigated the relationship between the information-geometric parameter $\theta_{12}^{(2)}(\tau)$ and the correlation function $C_{12}(\tau)$ as a function of the delay time $\tau$. Figure 4 illustrates a typical example where both measures were calculated from the same neural architecture parameters $(J_{21}, J_{12}, h_1, h_2) = (1.5, 1.0, 0.5, 0)$. For comparison, the value of $\theta_{12}^{(2)}(\tau)$ was normalized to have the same peak value as $C_{12}(\tau)$. As Figure 4 shows clearly, the peak value of both measures was obtained at the same $\tau$, and this is consistent with the fact that they have a monotonically increasing relationship.
In other words, both measures depict a similar feature of neural firing, even though they are connected by a nonlinear relationship (given by equation 4.23). It is also important to note, however, that even if both measures depict similar correlated firing features, the information-geometric measure offers several additional benefits: the ability to generate orthogonal coordinate systems, the decomposition of mutual information into independent contributions, provision of a biologically plausible statistical test, and the opportunity to extend it in a straightforward way to a network of more than two neurons.
Figure 3: Multiple neural architectures generate the same two-neuron information-geometric parameters at time lag zero. The equal-time two-neuron information-geometric parameters $(\theta_1^{(2)}(0), \theta_2^{(2)}(0), \theta_{12}^{(2)}(0)) = (-1.10, 0.51, 1.43)$ were generated from the two neural architecture parameter sets $(J_{21}, J_{12}, h_1, h_2) = (5, 0, 0, 0)$ and $(J_{21}, J_{12}, h_1, h_2) = (0.717, 0.717, -0.549, 0.255)$, which are represented by the solid and dashed curves, respectively. The symmetric network has a symmetric form of $\theta_{12}^{(2)}(\tau)$ with respect to $\tau = 0$, whereas the asymmetric one shows an asymmetric shape reflecting the connection strength. This is a typical example of an ill-posed inverse problem because it is impossible to determine the underlying neural architecture parameters from the equal-time information-geometric parameter alone. $\tau_0 = 10$ was used in this example.
5 Three-Neuron Network

In this section, we derive the analytical expression of the three-neuron information-geometric parameters in terms of the neural architecture parameters $J_{12}$, $J_{21}$, $J_{23}$, $J_{32}$, $J_{13}$, $J_{31}$, $h_1$, $h_2$, and $h_3$.

5.1 Mean Firing Rate and Mean Coincident Firing Rate of Three Neurons. To calculate the three-neuron information-geometric parameters, we need to derive the analytical expression of the mean firing rate $\langle S_i\rangle$, the two-neuron mean coincident firing rate $\langle S_i S_j\rangle$, and the three-neuron mean coincident firing rate $\langle S_1 S_2 S_3\rangle$.
Figure 4: Relationship between the two-neuron information-geometric parameter $\theta_{12}^{(2)}(\tau)$ and the correlation function $C_{12}(\tau)$. Both the two-neuron information-geometric parameter (dotted line) and the correlation function (solid line) were calculated from the neural architecture parameters $(J_{21}, J_{12}, h_1, h_2) = (1.5, 1.0, 0.5, 0)$. To enable direct comparison of the shape, the information-geometric parameter was normalized to have the same peak value as the correlation function. $\tau_0 = 10$ was used in the calculation.
Since we have already calculated the evolution of $\langle S_i\rangle$ and $\langle S_i S_j\rangle$ in equations 4.2 and 4.3, we now calculate the evolution of the three-neuron mean coincident firing rate. We start with

$$\frac{d}{dt} \langle S_1(t) S_2(t) S_3(t)\rangle = \sum_{\{S^{(3)}\}} S_1 S_2 S_3\, \frac{d}{dt} P(S_1, S_2, S_3, t), \qquad (5.1)$$
and plugging equation 3.4 into equation 5.1 yields

$$\tau_0\, \frac{d}{dt} \langle S_1(t) S_2(t) S_3(t)\rangle = -3\langle S_1(t) S_2(t) S_3(t)\rangle + \langle S_1(t) S_2(t)\, g(u_3(t))\rangle + \langle S_2(t) S_3(t)\, g(u_1(t))\rangle + \langle S_1(t) S_3(t)\, g(u_2(t))\rangle. \qquad (5.2)$$
This equation describes the evolution of the three-neuron mean coincident firing rate.
At the equilibrium limit, the mean firing rate of each neuron obeys

$$\langle S_i\rangle = \langle g(u_i)\rangle = \langle g(J_{ij} S_j + J_{ik} S_k + h_i)\rangle. \qquad (5.3)$$
By taking into account the relationship

$$g(J_{ij} S_j + J_{ik} S_k + h_i) = S_j S_k\, g(J_{ij} + J_{ik} + h_i) + S_j (1 - S_k)\, g(J_{ij} + h_i) + (1 - S_j) S_k\, g(J_{ik} + h_i) + (1 - S_j)(1 - S_k)\, g(h_i), \qquad (5.4)$$
equation 5.3 is rewritten as

$$\langle S_i\rangle = \langle S_j S_k\rangle\, \{g(J_{ij} + J_{ik} + h_i) - g(J_{ij} + h_i) - g(J_{ik} + h_i) + g(h_i)\} + \langle S_j\rangle\, \{g(J_{ij} + h_i) - g(h_i)\} + \langle S_k\rangle\, \{g(J_{ik} + h_i) - g(h_i)\} + g(h_i). \qquad (5.5)$$
For the two-neuron mean coincident firing rate $\langle S_i S_j\rangle$, a similar calculation yields

$$\langle S_i S_j\rangle = \frac{1}{2}\, \big\{\langle S_j S_k\rangle\, \{g(J_{ij} + J_{ik} + h_i) - g(J_{ij} + h_i)\} + \langle S_j\rangle\, g(J_{ij} + h_i) + \langle S_i S_k\rangle\, \{g(J_{ji} + J_{jk} + h_j) - g(J_{ji} + h_j)\} + \langle S_i\rangle\, g(J_{ji} + h_j)\big\}. \qquad (5.6)$$
Equations 5.5 and 5.6 are six sets of linear equations of hS1 i, hS2 i, hS3 i, hS1 S2 i, hS2 S3 i, and hS1 S3 i. Therefore, by simultaneously solving equations 5.5 and 5.6, we obtain the analytical expression of hS1 i, hS2 i, hS3 i, hS1 S2 i, hS2 S3 i, and hS1 S3 i in terms of the neural architecture parameters J12 , J21 , J23 , J32 , J13 , J31 , h1 , h2 , and h3 . For the three-neuron mean coincident ring rate hS1 S2 S3 i, we obtain the following relationship from equation 5.2: hS1 S2 S3 i D
1 fhS1 S2 ig.J31 C J32 C h3 / 3 C hS2 S3 ig.J12 C J13 C h1 /
C hS1 S3 ig.J21 C J23 C h2 /g:
(5.7)
Since the analytical expression of hS1 S2 i, hS2 S3 i, and hS1 S3 i has already been obtained from equations 5.5 and 5.6, we can also express hS1 S2 S3 i in terms of the neural architecture parameters.
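The six equations can also be solved numerically. The sketch below (an illustration, not part of the original derivation) iterates equations 5.5 and 5.6 to their fixed point, evaluates equation 5.7, and cross-checks the result against direct enumeration of the Boltzmann-Gibbs distribution of equation 5.10, which is the equilibrium for symmetric connections; g(u) = 1/(1 + e^{-2u}) is assumed, consistent with that equilibrium, and the iteration is assumed to converge for the parameter values chosen here.

```python
import itertools, math

def g(u):
    # assumed equilibrium firing function, consistent with equation 5.10
    return 1.0 / (1.0 + math.exp(-2.0 * u))

# a symmetric example, so the result can be cross-checked against equation 5.10
J = {(1, 2): 0.5, (2, 1): 0.5, (2, 3): 0.3, (3, 2): 0.3, (1, 3): 0.3, (3, 1): 0.3}
h = {1: 0.3, 2: -0.2, 3: 0.1}

def step(m):
    # one application of equations 5.5 and 5.6
    mean = {1: m[0], 2: m[1], 3: m[2]}
    pair = {(1, 2): m[3], (2, 1): m[3], (2, 3): m[4],
            (3, 2): m[4], (1, 3): m[5], (3, 1): m[5]}
    def eq55(i, j, k):
        return (pair[j, k] * (g(J[i, j] + J[i, k] + h[i]) - g(J[i, j] + h[i])
                              - g(J[i, k] + h[i]) + g(h[i]))
                + mean[j] * (g(J[i, j] + h[i]) - g(h[i]))
                + mean[k] * (g(J[i, k] + h[i]) - g(h[i])) + g(h[i]))
    def eq56(i, j, k):
        return 0.5 * (pair[j, k] * (g(J[i, j] + J[i, k] + h[i]) - g(J[i, j] + h[i]))
                      + mean[j] * g(J[i, j] + h[i])
                      + pair[i, k] * (g(J[j, i] + J[j, k] + h[j]) - g(J[j, i] + h[j]))
                      + mean[i] * g(J[j, i] + h[j]))
    return (eq55(1, 2, 3), eq55(2, 1, 3), eq55(3, 1, 2),
            eq56(1, 2, 3), eq56(2, 3, 1), eq56(1, 3, 2))

m = (0.5,) * 6
for _ in range(3000):                       # fixed-point iteration of the affine map
    m = step(m)
m1, m2, m3, m12, m23, m13 = m

# equation 5.7 for the triple coincident firing rate
m123 = (m12 * g(J[3, 1] + J[3, 2] + h[3]) + m23 * g(J[1, 2] + J[1, 3] + h[1])
        + m13 * g(J[2, 1] + J[2, 3] + h[2])) / 3.0

# cross-check against direct enumeration of the Boltzmann-Gibbs distribution
w = {}
for s in itertools.product([0, 1], repeat=3):
    e = 2 * sum(h[i + 1] * s[i] for i in range(3))
    e += 2 * (J[1, 2] * s[0] * s[1] + J[2, 3] * s[1] * s[2] + J[1, 3] * s[0] * s[2])
    w[s] = math.exp(e)
Z = sum(w.values())
p = {s: ws / Z for s, ws in w.items()}
ex = lambda f: sum(f(s) * p[s] for s in p)
assert abs(m1 - ex(lambda s: s[0])) < 1e-8
assert abs(m12 - ex(lambda s: s[0] * s[1])) < 1e-8
assert abs(m123 - ex(lambda s: s[0] * s[1] * s[2])) < 1e-8
```

For asymmetric connections, the same fixed-point iteration applies, but no Boltzmann cross-check is available, since detailed balance then fails.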
Neural Architectures Underlying Information-Geometric Measures
753
To calculate the three-neuron information-geometric parameters, we use the following relationship between the probability p_{ijk} of each event and ⟨S_i⟩, ⟨S_iS_j⟩, and ⟨S_1S_2S_3⟩:
\[
\begin{aligned}
p_{000} &= 1 - \langle S_1\rangle - \langle S_2\rangle - \langle S_3\rangle + \langle S_1S_2\rangle + \langle S_2S_3\rangle + \langle S_1S_3\rangle - \langle S_1S_2S_3\rangle,\\
p_{100} &= \langle S_1\rangle - \langle S_1S_2\rangle - \langle S_1S_3\rangle + \langle S_1S_2S_3\rangle,\\
p_{010} &= \langle S_2\rangle - \langle S_1S_2\rangle - \langle S_2S_3\rangle + \langle S_1S_2S_3\rangle,\\
p_{001} &= \langle S_3\rangle - \langle S_1S_3\rangle - \langle S_2S_3\rangle + \langle S_1S_2S_3\rangle,\\
p_{110} &= \langle S_1S_2\rangle - \langle S_1S_2S_3\rangle, \qquad p_{011} = \langle S_2S_3\rangle - \langle S_1S_2S_3\rangle,\\
p_{101} &= \langle S_1S_3\rangle - \langle S_1S_2S_3\rangle, \qquad p_{111} = \langle S_1S_2S_3\rangle.
\end{aligned} \qquad (5.8)
\]
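These relations are simply inclusion-exclusion over the eight joint events and can be verified numerically. The sketch below (an illustrative check; an arbitrary Boltzmann-type distribution stands in for any p) recovers the p_{ijk} from the moments:

```python
import itertools, math

# any distribution over three binary neurons works; a Boltzmann-Gibbs form is used
J12, J23, J13 = 0.5, 0.3, -0.2
h1, h2, h3 = 0.3, -0.2, 0.1
w = {}
for s in itertools.product([0, 1], repeat=3):
    e = 2 * (h1 * s[0] + h2 * s[1] + h3 * s[2])
    e += 2 * (J12 * s[0] * s[1] + J23 * s[1] * s[2] + J13 * s[0] * s[2])
    w[s] = math.exp(e)
Z = sum(w.values())
p = {s: ws / Z for s, ws in w.items()}

mom = lambda f: sum(f(s) * p[s] for s in p)      # moment <f(S)>
S1, S2, S3 = mom(lambda s: s[0]), mom(lambda s: s[1]), mom(lambda s: s[2])
S12 = mom(lambda s: s[0] * s[1])
S23 = mom(lambda s: s[1] * s[2])
S13 = mom(lambda s: s[0] * s[2])
S123 = mom(lambda s: s[0] * s[1] * s[2])

# equation 5.8: event probabilities recovered from the moments
q = {(0, 0, 0): 1 - S1 - S2 - S3 + S12 + S23 + S13 - S123,
     (1, 0, 0): S1 - S12 - S13 + S123,
     (0, 1, 0): S2 - S12 - S23 + S123,
     (0, 0, 1): S3 - S13 - S23 + S123,
     (1, 1, 0): S12 - S123,
     (0, 1, 1): S23 - S123,
     (1, 0, 1): S13 - S123,
     (1, 1, 1): S123}
for s in p:
    assert abs(p[s] - q[s]) < 1e-12
```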
Plugging these equations into equation 2.5 yields the three-neuron information-geometric parameters expressed in terms of ⟨S_i⟩, ⟨S_iS_j⟩, and ⟨S_1S_2S_3⟩. And by using equations 5.5 through 5.7, the three-neuron information-geometric parameters can be written explicitly in terms of the neural architecture parameters. In the following, we introduce two typical network structures that we will use to investigate how the three-neuron information-geometric parameters are affected by the underlying neural architecture parameters.

5.2 Symmetrically Connected Network. The first example is the case where three neurons are connected through symmetric connections J_ij = J_ji. This is a natural extension of the symmetrically connected two-neuron network. In this case, similar to the two-neuron network, the relation between the three-neuron information-geometric parameters and the underlying neural architecture parameters simplifies dramatically:
\[
\theta_1^{(3)} = 2h_1, \quad \theta_2^{(3)} = 2h_2, \quad \theta_3^{(3)} = 2h_3, \quad
\theta_{12}^{(3)} = 2J_{12}, \quad \theta_{23}^{(3)} = 2J_{23}, \quad \theta_{13}^{(3)} = 2J_{13}, \quad
\theta_{123}^{(3)} = 0. \qquad (5.9)
\]
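The relations above can be confirmed directly from the equilibrium Boltzmann-Gibbs distribution of equation 5.10. In the sketch below (an illustration; the log-ratio expressions are the standard log-linear coordinates of the three-neuron model, and the parameter values are arbitrary), the parameters computed from the joint probabilities match 2h_i, 2J_ij, and 0 exactly:

```python
import itertools, math

J12, J23, J13 = 0.5, 0.3, -0.2          # symmetric couplings, J_ij = J_ji
h1, h2, h3 = 0.3, -0.2, 0.1
w = {}
for s in itertools.product([0, 1], repeat=3):
    e = 2 * (h1 * s[0] + h2 * s[1] + h3 * s[2])
    e += 2 * (J12 * s[0] * s[1] + J23 * s[1] * s[2] + J13 * s[0] * s[2])
    w[s] = math.exp(e)
Z = sum(w.values())
p = {s: ws / Z for s, ws in w.items()}

log = math.log
theta1 = log(p[1, 0, 0] / p[0, 0, 0])
theta12 = log(p[1, 1, 0] * p[0, 0, 0] / (p[1, 0, 0] * p[0, 1, 0]))
theta123 = log(p[1, 1, 1] * p[1, 0, 0] * p[0, 1, 0] * p[0, 0, 1]
               / (p[1, 1, 0] * p[1, 0, 1] * p[0, 1, 1] * p[0, 0, 0]))

assert abs(theta1 - 2 * h1) < 1e-12     # theta_1^(3)   = 2 h_1
assert abs(theta12 - 2 * J12) < 1e-12   # theta_12^(3)  = 2 J_12
assert abs(theta123) < 1e-12            # theta_123^(3) = 0
```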
These equations clearly demonstrate that the first-order θ_i^(3) represents the external input h_i, the second-order θ_ij^(3) represents the symmetric connection J_ij, and the third-order θ_123^(3) represents the triple-neuron interaction, which is absent in this example. Therefore, any change observed in θ_i^(3) indicates that the external input has changed. Similarly, a change in θ_ij^(3) indicates that a change in the synaptic connection J_ij has been induced. Furthermore, a simple calculation also shows that the symmetrically connected N-neuron
network satisfies θ_i^(N) = 2h_i, θ_ij^(N) = 2J_ij, and θ_ijk^(N) = ⋯ = θ_{1⋯N}^(N) = 0, which is consistent with previous work (Amari, Kurata, & Nagaoka, 1992).
Let us further consider the case where we observe only two of the three neurons. This is the simplest example of what we encounter in real experiments: in electrophysiological experiments, we record a small number of neurons from an N-neuron network, where N is usually unknown. Here, we assume that only two neurons are observed from the symmetrically connected three-neuron network. In contrast to an isolated pair of model neurons, we now have a hidden third neuron with symmetric connections to neurons 1 and 2. To calculate the two-neuron information-geometric parameters θ_1^(2), θ_2^(2), and θ_12^(2), recall that the probability p(S_1, S_2, S_3) at the equilibrium limit is given by the following Boltzmann-Gibbs distribution:
\[
p(S_1,S_2,S_3) = \frac{1}{Z}\exp\Bigl[\,2\sum_{i=1}^{3} h_i S_i + 2\sum_{i=1}^{3}\sum_{j>i} J_{ij} S_i S_j\Bigr], \qquad (5.10)
\]
where the partition function Z is given by
\[
Z = \sum_{\{S^{(3)}\}} \exp\Bigl[\,2\sum_{i=1}^{3} h_i S_i + 2\sum_{i=1}^{3}\sum_{j>i} J_{ij} S_i S_j\Bigr]. \qquad (5.11)
\]
If we observe only the two neurons S_1 and S_2, the corresponding probability p(S_1, S_2) is obtained by marginalizing over S_3:
\[
p(S_1,S_2) = \sum_{S_3=0}^{1} p(S_1,S_2,S_3)
= \frac{1}{Z}\bigl\{\exp[2(h_1S_1 + h_2S_2 + J_{12}S_1S_2)] + \exp[2\{(h_1+J_{13})S_1 + (h_2+J_{23})S_2 + J_{12}S_1S_2 + h_3\}]\bigr\}. \qquad (5.12)
\]
Thus, the probability of each event is calculated as
\[
\begin{aligned}
p_{00} &= \frac{1}{Z}\bigl(1 + \exp[2h_3]\bigr),\\
p_{01} &= \frac{1}{Z}\bigl(\exp[2h_2] + \exp[2h_2 + 2h_3 + 2J_{23}]\bigr),\\
p_{10} &= \frac{1}{Z}\bigl(\exp[2h_1] + \exp[2h_1 + 2h_3 + 2J_{13}]\bigr),\\
p_{11} &= \frac{1}{Z}\bigl(\exp[2h_1 + 2h_2 + 2J_{12}] + \exp[2h_1 + 2h_2 + 2h_3 + 2J_{12} + 2J_{23} + 2J_{13}]\bigr).
\end{aligned} \qquad (5.13)
\]
Neural Architectures Underlying Information-Geometric Measures
755
Plugging equation 5.13 into equation 2.2 yields
\[
\begin{aligned}
\theta_1^{(2)} &= 2h_1 + \log\frac{1+\exp[2(h_3+J_{13})]}{1+\exp[2h_3]},\\
\theta_2^{(2)} &= 2h_2 + \log\frac{1+\exp[2(h_3+J_{23})]}{1+\exp[2h_3]},\\
\theta_{12}^{(2)} &= 2J_{12} + \log\frac{(1+\exp[2h_3])\,(1+\exp[2(h_3+J_{13}+J_{23})])}{(1+\exp[2(h_3+J_{13})])\,(1+\exp[2(h_3+J_{23})])}.
\end{aligned} \qquad (5.14)
\]
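A quick numerical check of these closed forms (a sketch with arbitrary parameter values; the equilibrium distribution of equation 5.10 is used directly, so g itself is not needed):

```python
import itertools, math

J12, J23, J13 = 0.5, 0.4, 0.7           # symmetric couplings; neuron 3 is hidden
h1, h2, h3 = 0.3, -0.2, 0.1
w = {}
for s in itertools.product([0, 1], repeat=3):
    e = 2 * (h1 * s[0] + h2 * s[1] + h3 * s[2])
    e += 2 * (J12 * s[0] * s[1] + J23 * s[1] * s[2] + J13 * s[0] * s[2])
    w[s] = math.exp(e)
Z = sum(w.values())
p3 = {s: ws / Z for s, ws in w.items()}

# marginalize out the hidden neuron 3 (equation 5.12)
p = {(a, b): p3[a, b, 0] + p3[a, b, 1] for a in (0, 1) for b in (0, 1)}

log, exp = math.log, math.exp
t1 = log(p[1, 0] / p[0, 0])
t2 = log(p[0, 1] / p[0, 0])
t12 = log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))

# closed forms of equation 5.14
assert abs(t1 - 2 * h1 - log((1 + exp(2 * (h3 + J13))) / (1 + exp(2 * h3)))) < 1e-12
assert abs(t2 - 2 * h2 - log((1 + exp(2 * (h3 + J23))) / (1 + exp(2 * h3)))) < 1e-12
bias = log((1 + exp(2 * h3)) * (1 + exp(2 * (h3 + J13 + J23)))
           / ((1 + exp(2 * (h3 + J13))) * (1 + exp(2 * (h3 + J23)))))
assert abs(t12 - 2 * J12 - bias) < 1e-12
assert abs(t12 - 2 * J12) > 1e-3        # the hidden neuron biases theta_12^(2)
```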
These are the analytical expressions of the two-neuron information-geometric parameters in terms of the underlying neural architecture parameters. The second term on the right-hand side of each equation reflects the influence of the third neuron, which is not observed in this example. Therefore, θ_1^(2), θ_2^(2), and θ_12^(2) do not represent the external inputs and the connection, respectively, whereas θ_1^(3), θ_2^(3), and θ_12^(3) do. This example is instructive because, even if we restrict ourselves to symmetric connections, the two-neuron and three-neuron information-geometric parameters differ. Therefore, extra caution is necessary when the information-geometric parameters are calculated from neural data in which the total number of neurons in the network is unknown.

5.3 Asymmetrically Connected Network with Common Input. In the previous section, we showed that the three-neuron information-geometric parameters can successfully extract the external inputs and synaptic connections if the neurons are connected symmetrically. If the connections become asymmetric, though, such a simple relation no longer holds. Indeed, simultaneously solving equations 5.5 through 5.7 for general asymmetric connections leads to very complicated solutions for the three-neuron information-geometric parameters, and a nonzero θ_123^(3) can arise without a direct triple-neuron connection. To illustrate how the information-geometric parameters are influenced by asymmetric connections, we examine the case where neurons 1 and 2 receive a common input from neuron 3. In other words, neurons 1 and 2 are downstream neurons that receive common input from the upstream neuron 3.
For simplicity, we introduce asymmetry only between neurons 1 and 3 and between neurons 2 and 3; that is, we assume that the downstream neurons 1 and 2 have a symmetric connection with each other (J_12 = J_21) and receive a common input from upstream neuron 3 (J_13 = J_23), but there is no connection back to the upstream neuron 3 from downstream neuron 1 or 2 (J_31 = J_32 = 0). We then compare two typical situations, observing either the pair of downstream neurons 1 and 2 or the pair of downstream neuron 1 and upstream neuron 3. Since observing downstream neuron 2 and upstream neuron 3 is qualitatively similar to observing neurons 1 and 3, it is not treated explicitly here. We also consider the observation of all three neurons, to illustrate the difference between the two-neuron and three-neuron information-geometric parameters. For the pair of downstream neurons 1 and 2, the two-neuron information-geometric parameters (θ_1^(2), θ_2^(2), θ_12^(2)) and the three-neuron information-geometric parameters (θ_1^(3), θ_2^(3), θ_12^(3), θ_123^(3)) are investigated by modifying the common input (J_13, J_23) and the external input to neuron 3 (h_3). Similarly, for the pair of downstream neuron 1 and upstream neuron 3, the changes in (θ_1^(2), θ_3^(2), θ_13^(2)) and (θ_1^(3), θ_3^(3), θ_13^(3), θ_123^(3)) are investigated.

Figures 5 and 6 illustrate the case where the common inputs (J_13, J_23) are modified while the external input to neuron 3 (h_3) is fixed. The neural architecture parameters were set as (J_12, J_21, J_23, J_32, J_13, J_31) = (0.5, 0.5, ΔJ, 0, ΔJ, 0) and (h_1, h_2, h_3) = (0.3, −0.2, 0). Figure 5 shows the results for the pair of downstream neurons 1 and 2. For ΔJ = 0, we have θ_1^(2) = θ_1^(3) = 2h_1, θ_2^(2) = θ_2^(3) = 2h_2, θ_12^(2) = θ_12^(3) = 2J_12, and θ_123^(3) = 0, because the network reduces to symmetrically connected model neurons from the perspective of both the two-neuron and three-neuron calculations. For ΔJ > 0, however, all the information-geometric parameters show nonlinear changes in their values, although the three-neuron information-geometric parameters appear more robust than the two-neuron ones. Note also that a nonzero θ_123^(3) can arise for ΔJ > 0, even though there is no direct three-neuron connection. Figure 6 shows the outcome when downstream neuron 1 and upstream neuron 3 are observed.
In this case, bear in mind that the two neurons are asymmetrically connected by a unidirectional connection from neuron 3 to neuron 1. For ΔJ = 0, from the two-neuron perspective, the network decomposes into two isolated model neurons, with neuron 1 connected to a hidden neuron 2; from the three-neuron perspective, it can be considered a symmetrically connected network. Therefore, we have θ_1^(3) = 2h_1, θ_3^(2) = θ_3^(3) = 0, θ_13^(2) = θ_13^(3) = 0, and θ_123^(3) = 0. The two-neuron parameter θ_1^(2) is not equal to 2h_1, however, because neuron 1 is influenced by the hidden neuron 2. Note that the two-neuron information-geometric parameters of the pair of downstream neuron 1 and upstream neuron 3 (see Figure 6) differ from those of the pair of downstream neurons 1 and 2 (see Figure 5); for instance, the pairwise interactions were calculated as
\[
\theta_{12}^{(2)} = \log\frac{\langle S_1S_2\rangle\,(1-\langle S_1\rangle-\langle S_2\rangle+\langle S_1S_2\rangle)}{(\langle S_1\rangle-\langle S_1S_2\rangle)(\langle S_2\rangle-\langle S_1S_2\rangle)} \quad\text{(neurons 1 and 2)},
\]
\[
\theta_{13}^{(2)} = \log\frac{\langle S_1S_3\rangle\,(1-\langle S_1\rangle-\langle S_3\rangle+\langle S_1S_3\rangle)}{(\langle S_1\rangle-\langle S_1S_3\rangle)(\langle S_3\rangle-\langle S_1S_3\rangle)} \quad\text{(neurons 1 and 3)}. \qquad (5.15)
\]
Figure 5: Dependence of the two-neuron and three-neuron information-geometric parameters on the common input (J_13 = J_23 = ΔJ). The pair of neurons 1 and 2 was used for the two-neuron information-geometric calculation. ΔJ was varied between 0 and 1.5. The neural architecture parameters were set as (J_12, J_21, J_23, J_32, J_13, J_31) = (0.5, 0.5, ΔJ, 0, ΔJ, 0) and (h_1, h_2, h_3) = (0.3, −0.2, 0). The black and gray solid curves represent θ_1^(2) and θ_1^(3), respectively; both start from 0.6 at ΔJ = 0, which equals 2h_1. The black and gray dashed curves represent θ_2^(2) and θ_2^(3), respectively; they start from −0.4 at ΔJ = 0, which equals 2h_2. The black and gray dash-dotted curves represent the two-neuron interactions θ_12^(2) and θ_12^(3), respectively; they start from 1 at ΔJ = 0, which equals 2J_12 = 2J_21 = 1. In all six curves, the three-neuron information-geometric parameters appear more robust than the two-neuron ones. The solid curve starting at 0 for ΔJ = 0 represents the three-neuron interaction θ_123^(3); for ΔJ > 0, this value gradually increases even though there is no direct three-neuron connection.
For ΔJ > 0, all the information-geometric parameters were modified by the common input (ΔJ), but the effect was especially apparent on the θ_3's and θ_13's. In Figures 7 and 8, we modified the external input to neuron 3 (h_3) while the common inputs (J_13, J_23) were fixed. The parameters used in this example were (J_12, J_21, J_23, J_32, J_13, J_31) = (0.5, 0.5, 0.3, 0, 0.3, 0) and (h_1, h_2, h_3) = (0.3, −0.2, Δh), so that the connection between neuron 1 and neuron 2 was symmetric (J_12 = J_21 = 0.5) and the common inputs were fixed at the same
Figure 6: Dependence of the two-neuron and three-neuron information-geometric parameters on the common input (J_13 = J_23 = ΔJ). The pair of neurons 1 and 3 was used for the two-neuron information-geometric calculation. ΔJ was varied between 0 and 1.5. The neural architecture parameters were set as (J_12, J_21, J_23, J_32, J_13, J_31) = (0.5, 0.5, ΔJ, 0, ΔJ, 0) and (h_1, h_2, h_3) = (0.3, −0.2, 0). The black and gray solid curves represent θ_1^(2) and θ_1^(3), respectively. In this example, θ_1^(2) does not equal 2h_1 even for ΔJ = 0, because of the effect of the hidden neuron 2. The black and gray dashed curves represent θ_3^(2) and θ_3^(3), respectively; they start from 0 at ΔJ = 0, which equals 2h_3 = 0. The black and gray dash-dotted curves represent the two-neuron interactions θ_13^(2) and θ_13^(3), respectively; they start from 0 at ΔJ = 0, which equals 2J_13 = 2J_23 = 0. Both the θ_3's and the θ_13's change markedly with ΔJ. The solid curve starting at 0 for ΔJ = 0 represents the three-neuron interaction θ_123^(3), which is identical to the one in Figure 5.
value (J_13 = J_23 = 0.3). Figure 7 shows the results when downstream neurons 1 and 2 were observed, and reveals that all the information-geometric parameters were affected by Δh. However, while Δh had an apparent effect on θ_1^(2), θ_1^(3), θ_2^(2), and θ_2^(3), it had very little effect on θ_12^(2), θ_12^(3), or θ_123^(3). Figure 8 shows the results when downstream neuron 1 and upstream neuron 3 were observed. In this situation, only θ_3^(2) and θ_3^(3) were modified, in an almost linear fashion, while the other information-geometric parameters showed very small changes. This indicates that θ_3^(2) and θ_3^(3) effectively extract the effect of Δh. Note also that θ_1^(2) of the pair consisting of down-
Figure 7: Dependence of the two-neuron and three-neuron information-geometric parameters on the external input to the third neuron (h_3 = Δh). The pair of neurons 1 and 2 was used for the two-neuron information-geometric calculation. Δh was varied between 0 and 1.5. The black and gray solid curves represent θ_1^(2) and θ_1^(3), respectively, which increase in value with Δh. The black and gray dashed curves represent θ_2^(2) and θ_2^(3), respectively, and are also modified by Δh. The black and gray dash-dotted curves represent the two-neuron interactions θ_12^(2) and θ_12^(3), respectively, and the solid curve starting from 0 is the three-neuron interaction θ_123^(3). All three of these measures appear stable against changes in Δh.
stream neuron 1 and upstream neuron 3 (see Figure 8) differs from θ_1^(2) of the pair of downstream neurons 1 and 2 (see Figure 7). From Figures 5 to 8, we can infer that if the common inputs are modified, change will be observed in all of the information-geometric parameters. On the other hand, if the change appears mainly in the first-order information-geometric parameters, the external input to the upstream neuron could be the cause of the change. Furthermore, this simple example illustrates that the resulting information-geometric parameters behave very differently depending on the choice of recorded neurons. Therefore, by collecting typical examples of the behavior of the information-geometric parameters to use as templates, we may be able to roughly estimate the underlying neural architecture parameters.
Figure 8: Dependence of the two-neuron and three-neuron information-geometric parameters on the external input to the third neuron (h_3 = Δh). The pair of neurons 1 and 3 was used for the two-neuron information-geometric calculation. Δh was varied between 0 and 1.5. The black and gray solid curves represent θ_1^(2) and θ_1^(3), respectively, and show little change with Δh. The black and gray dashed curves represent θ_3^(2) and θ_3^(3), respectively, and were modified almost linearly with Δh. The black and gray dash-dotted curves represent the two-neuron interactions θ_13^(2) and θ_13^(3), respectively, which were very stable against Δh. The solid curve starting from 0 is the three-neuron interaction θ_123^(3), which is identical to the one in Figure 7.
6 Discussion

In this study, the relationship between the information-geometric parameters and possible neural architectures was elucidated using a pair and a triplet of stochastic model neurons. For symmetric networks (J_ij = J_ji), there exists a simple relationship in which θ_ij^(2) and θ_ij^(3) depend only on the underlying connections, and θ_i^(2) and θ_i^(3) depend only on the external input h_i. Note, however, that if the total number of neurons in the network is unknown, this simple relation will not always hold. Asymmetric networks, on the other hand, have a complicated relationship in which all information-geometric parameters depend on all neural architecture parameters. This illustrates the ill-posed nature of the problem: multiple
neural architectures can generate the same information-geometric parameters, as in the two-neuron example (see Figure 3). If the time-delayed information-geometric parameters are taken into account, however, the underlying neural architecture parameters can be estimated even for asymmetric networks. In addition, the nonlinear but monotonically increasing relationship between θ_12^(2) and C_ij ensures that both correlation measures capture similar firing features.

The fact that the two-neuron information-geometric parameters can differ from the three-neuron information-geometric parameters also sheds light on the choice of orthogonal coordinate systems. Consider an experiment in which we simultaneously observe two neurons from an N-neuron network, where N is unknown to the experimenter. In this situation, we can use the orthogonal coordinates (⟨S_1⟩, ⟨S_2⟩; θ_12^(2)) calculated from the two-neuron information-geometric measures. Now imagine that, through some technical improvement, one becomes able to record three neurons at the same time. At this stage, we could choose either (⟨S_1⟩, ⟨S_2⟩, ⟨S_3⟩; θ_12^(3), θ_23^(3), θ_13^(3); θ_123^(3)) or (⟨S_1⟩, ⟨S_2⟩, ⟨S_3⟩, ⟨S_1S_2⟩, ⟨S_2S_3⟩, ⟨S_1S_3⟩; θ_123^(3)) as the orthogonal coordinates. Which set of orthogonal coordinates is more appropriate? Recall that even in a symmetrically connected network, (θ_i^(2), θ_ij^(2)) differs from (θ_i^(3), θ_ij^(3)). In other words, all the information-geometric parameters can change depending on the number of neurons assumed in the parameter estimation of the log-linear model. Therefore, when we increase the number of recorded neurons from two to three, it is better to choose (⟨S_1⟩, ⟨S_2⟩, ⟨S_3⟩, ⟨S_1S_2⟩, ⟨S_2S_3⟩, ⟨S_1S_3⟩; θ_123^(3)), because the ⟨S_i⟩'s and ⟨S_iS_j⟩'s are independent of the order of the information-geometric calculation.
Furthermore, since this consideration applies to a network of any number of neurons, it is appropriate to use orthogonal coordinates in which only the highest order of the information-geometric parameters is included.

It is also instructive to revisit Ito and Tsuji's (2000) model within the framework of this information-geometric approach. In their article, they investigated the normalization problem of the JPSTH using a simple probabilistic model of unidirectionally interacting point processes and concluded that it is impossible to find a correct normalization factor for the JPSTH without a priori knowledge of the underlying model parameters. This statement is, in principle, consistent with our findings in this study: for networks with asymmetric connections, even the information-geometric parameter θ_12^(2) does not represent the connection between the two neurons. What Ito and Tsuji failed to point out in their article, however, and what is critically important, is that these correlation measures are applicable only to networks with symmetric connections. The bidirectional connections of asymmetric networks cannot be expressed by a single correlation parameter. For this reason, applying the information-geometric measure directly to Ito and Tsuji's model does not improve the situation, because their model has
asymmetric interactions. For example, they introduce a unidirectional positive interaction α as follows: if neuron 1 has a spike at time t and neuron 2 does not have a spike at time t + τ in the original spike trains, the firing probability of neuron 2 at time t + τ is increased by the amount α. This unidirectional interaction can be rewritten as
\[
\alpha = \frac{g(J_{21}+h_2) - g(h_2)}{1 - g(h_2)} \qquad (6.1)
\]
in terms of the neural architecture parameters. To modify the network to make it symmetric, let us introduce a reverse interaction:
\[
\beta = \frac{g(J_{12}+h_1) - g(h_1)}{1 - g(h_1)}. \qquad (6.2)
\]
This β represents the interaction whereby, if neuron 2 has a spike at time t and neuron 1 does not have a spike at time t + τ in the original spike trains, the firing probability of neuron 1 at time t + τ is increased by the amount β. In general, however, a symmetric interaction (α = β) does not imply a symmetric connection (J_21 = J_12), because the external inputs h_1 and h_2 can differ. The connection is symmetric only in the case that α = β and h_1 = h_2; in that case, the information-geometric parameter θ_12^(2) can be used to extract the connection strength correctly.

In this study, we treated the model in the asymptotic limit of stationary firing. In a real experiment, however, one often observes dynamic and nonstationary changes in neural firing. To analyze such spike data statistically, the spike trains are often divided into several behavioral epochs, and statistical analysis is applied under the assumption that the spikes in each epoch are stationary. This is done mainly because one cannot obtain a sufficient number of samples for statistical testing of nonstationary processes. In addition, under some experimental conditions, the neuronal state may relax asymptotically to one stationary attractor and stay there for the duration of the behavioral epoch; the external stimulus or the inputs from other brain regions then move the neural state to the next stable attractor. In this view, our approach can be applied to these stationary attractors and provides a first step toward estimation of the neural architectures underlying the information-geometric measures. Furthermore, our approach can be extended to the analysis of nonstationary neural activity, since we have already derived the dynamical equations for ⟨S_i⟩, ⟨S_iS_j⟩, and ⟨S_1S_2S_3⟩ (see equations 4.2, 4.3, and 5.2).
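The point above, that a symmetric interaction (α = β) does not force a symmetric connection when h_1 ≠ h_2, is easy to verify numerically. The sketch below (an illustration; g(u) = 1/(1 + e^{-2u}) is assumed, consistent with the Boltzmann-Gibbs equilibrium used earlier) solves equation 6.2 for the J_12 that matches a given α:

```python
import math

def g(u):
    # assumed firing function, consistent with the Boltzmann-Gibbs equilibrium
    return 1.0 / (1.0 + math.exp(-2.0 * u))

def g_inv(p):
    return 0.5 * math.log(p / (1.0 - p))

J21, h1, h2 = 0.8, 0.5, -0.3                   # note h1 != h2
alpha = (g(J21 + h2) - g(h2)) / (1.0 - g(h2))  # equation 6.1

# choose J12 so that the reverse interaction beta (equation 6.2) equals alpha
target = alpha * (1.0 - g(h1)) + g(h1)         # required value of g(J12 + h1)
J12 = g_inv(target) - h1
beta = (g(J12 + h1) - g(h1)) / (1.0 - g(h1))

assert abs(alpha - beta) < 1e-12               # symmetric interaction...
assert abs(J12 - J21) > 1e-3                   # ...but an asymmetric connection
```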
Solving these equations will shed further light on the relation between nonstationary changes in the information-geometric measures and the underlying neural architectures.

Finally, our model approach can be extended in a number of directions. One extension is to investigate the time-delayed θ_12^(2)(τ) and C_ij(τ) more
systematically to obtain a better estimation of J_21 and J_12. Another direction is to extend the model approach to general N-neuron systems, since the information-geometric measure can be applied to any number of neural interactions. We can also extend our approach to other kinds of model neurons, such as an integrate-and-fire model or the Hodgkin and Huxley (1952) model, though the mathematical treatment will be more difficult. In any case, further development of good correlation measures for spike firing and further refinement of the model approach will be necessary to elucidate how information is represented and processed in the brain.

Acknowledgments

We thank S. Amari and H. Nakahara for their helpful discussions. M.T. is also grateful to B. McNaughton for his informative discussions and critical reading of the manuscript and to S. Cowen and T. Ellmore for their useful comments. M.T. was supported by a JSPS Overseas Research Fellowship. This work was partially supported by Grant-in-Aid for Scientific Research on Priority Areas No. 14084212 and Grant-in-Aid for Scientific Research (C) No. 14580438.

References

Aertsen, A. M. H. J., Gerstein, G. L., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61, 900–918.
Amari, S. (1985). Differential geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47, 1701–1711.
Amari, S., Kurata, K., & Nagaoka, H. (1992). Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 3(2), 260–271.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. L. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience, 2(7), 664–670.
Gerstein, G. L. (1998). Correlation-based analysis methods for neural ensemble data. In M. A. L. Nicolelis (Ed.), Methods for neural ensemble recordings (pp. 157–177). Boca Raton, FL: CRC Press.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Physical Review E, 50, 3171–3191.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance, and interpretation. Thun, Frankfurt am Main: Verlag Harri Deutsch.
Grün, S., & Diesmann, M. (2000). Evaluation of higher-order coincidences in multiple parallel processes. Society for Neuroscience Abstracts, 26, 828.3.
Nature Neuroscience, 2(7), 664–670. Gerstein, G. L. (1998). Correlation-based analysis methods for neural ensemble data. In M. A. L. Nicolelis (Ed.), Methods for neural ensemble recordings (pp. 157–177). Boca Raton, FL: CRC Press. Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Physical Review E, 50, 3171–3191. Grun, ¨ S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, signicance, and interpretation. Thun, Frankfurt on Main: Verlag Harri Deutsch. Grun, ¨ S., & Diesmann, M. (2000). Evaluation of higher-order coincidences in multiple parallel processes. Society for Neuroscience Abstracts, 26, 828.3.
Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (2002a). "Unitary events" in multiple single-neuron spiking activity: I. Detection and significance. Neural Computation, 14, 43–80.
Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (2002b). "Unitary events" in multiple single-neuron spiking activity: II. Nonstationary data. Neural Computation, 14, 81–120.
Hampson, R. E., Simeral, J. D., & Deadwyler, S. A. (1999). Distribution of spatial and nonspatial information in dorsal hippocampus. Nature, 402, 610–614.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117(4), 500–544.
Hoffman, K. L., & McNaughton, B. L. (2002). Coordinated reactivation of distributed memory traces in primate neocortex. Science, 297, 2070–2073.
Ito, H., & Tsuji, S. (2000). Model dependence in quantification of spike interdependence by joint peri-stimulus time histogram. Neural Computation, 12, 195–217.
Koch, C., & Segev, I. (Eds.). (1998). Methods in neuronal modeling. Cambridge, MA: MIT Press.
Kudrimoti, H. S., Barnes, C. A., & McNaughton, B. L. (1999). Reactivation of hippocampal cell assemblies: Effects of behavioral state, experience, and EEG dynamics. Journal of Neuroscience, 19, 4090–4101.
Laubach, M., Wessberg, J., & Nicolelis, M. A. L. (2000). Cortical ensemble activity increasingly predicts behavior outcomes during learning of a motor task. Nature, 405, 567–571.
Nakahara, H., & Amari, S. (2002a). Information-geometric decomposition in spike analysis. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Nakahara, H., & Amari, S. (2002b). Information-geometric measure for neural spikes. Neural Computation, 14, 2269–2316.
Nakahara, H., Amari, S., Tatsuno, M., Kang, S., Kobayashi, K., Anderson, K. C., Miller, E. K., & Poggio, T. (2001). Information geometric measures for spike firing.
Society for Neuroscience Abstracts, 27, 821.46.
Panzeri, S., & Schultz, S. R. (2001). A unified approach to the study of temporal, correlational, and rate coding. Neural Computation, 13, 1311–1349.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes. Cambridge, MA: MIT Press.
Tatsuno, M., & Okada, M. (2002). Possible neural mechanisms underlying information-geometric measure parameters. Society for Neuroscience Abstracts, 28, 675.15.
Tatsuno, M., & Okada, M. (2003). How does the information-geometric measure depend on underlying neural mechanisms? Neurocomputing, 52–54, 649–654.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., & Nicolelis, M. A. L. (2000).
Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408, 361–365.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.

Received January 27, 2003; accepted September 5, 2003.
Communicated by Guy Lebanon
LETTER
Robustifying AdaBoost by Adding the Naive Error Rate

Takashi Takenouchi
[email protected]
Department of Statistical Science, Graduate University of Advanced Studies, Tokyo, Japan

Shinto Eguchi
[email protected]
Institute of Statistical Mathematics, Japan, and Department of Statistical Science, Graduate University of Advanced Studies, Tokyo, Japan
AdaBoost can be derived by sequential minimization of the exponential loss function. It implements the learning process by exponentially reweighting examples according to classification results. However, the weights are often too sharply tuned, so that AdaBoost suffers from nonrobustness and overlearning. We propose a new boosting method that is a slight modification of AdaBoost. The loss function is defined by a mixture of the exponential loss and the naive error loss functions. As a result, the proposed method incorporates an effect of forgetfulness into AdaBoost. The statistical significance of our method is discussed, and simulations are presented for confirmation.

1 Introduction

In this article, we investigate a classification problem using a boosting method. The boosting method aims to construct a strong learning machine by combining weak hypotheses (Schapire, 1990). One typical boosting method is AdaBoost (Freund & Schapire, 1997; Schapire, 1999). The key concept here is to change the weight distribution over the training data set in each step of the learning process. Let y denote the class label, which is to be predicted based on a feature vector represented by x ∈ R^p. A weak hypothesis f(x) with values ±1 is in a given class F. An indicator function I is defined by
\[
I(A) = \begin{cases} 1, & A \text{ is true},\\ 0, & \text{otherwise}. \end{cases} \qquad (1.1)
\]
For a given training data set {(x_i, y_i)}_{i=1}^N and a discriminant function F(x), the exponential loss is defined by
\[
L_{\exp}(F) = \sum_{i=1}^{N} \exp\bigl(-y_i F(x_i)\bigr). \qquad (1.2)
\]
Neural Computation 16, 767–787 (2004)  © 2004 Massachusetts Institute of Technology
768
T. Takenouchi and S. Eguchi
We introduce AdaBoost by the sequential minimization of equation 1.2 as follows:

1. Start with weights $w_1(i) = \frac{1}{N}$ $(i = 1, \ldots, N)$, $F_0(x) = 0$.
2. For $m = 1, \ldots, M$:
   a. Find
   $$f_m(x) = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \varepsilon_m(f), \tag{1.3}$$
   where $\varepsilon_m(f) = \sum_{i=1}^N w_m(i) I(f(x_i) \neq y_i)$.
   b. Calculate
   $$\alpha_m = \frac{1}{2} \log \frac{1 - \varepsilon_m(f_m)}{\varepsilon_m(f_m)}. \tag{1.4}$$
   c. Update the weight distribution as
   $$w_{m+1}(i) = \frac{\exp(-F_m(x_i) y_i)}{Z_{m+1}}, \tag{1.5}$$
   where $F_m(x) = \sum_{s=1}^m \alpha_s f_s(x)$ and $Z_{m+1} = \sum_{i=1}^N \exp(-F_m(x_i) y_i)$.
3. Output the discriminant function $\mathrm{sgn}\left(\sum_{m=1}^M \alpha_m f_m(x)\right)$.
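As a concrete illustration (ours, not from the paper), the algorithm above can be sketched in Python with decision stumps as the weak class $\mathcal{F}$. The names `adaboost`, `predict`, and `stump_predict` are hypothetical helpers, and the stump search is a brute-force scan for clarity:

```python
import numpy as np

def stump_predict(X, j, t, s):
    # A decision stump with values in {-1, +1}: s if x_j > t, else -s.
    return s * np.where(X[:, j] > t, 1, -1)

def adaboost(X, y, M=10):
    """Steps 1-3 above: sequential minimization of the exponential loss."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                     # step 1: w_1(i) = 1/N
    alphas, stumps = [], []
    for _ in range(M):
        # step 2a: stump minimizing the weighted error rate eps_m(f)
        best_err, best = np.inf, None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    err = w[stump_predict(X, j, t, s) != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, t, s)
        err = np.clip(best_err, 1e-10, 1 - 1e-10)   # guard the log-odds
        alpha = 0.5 * np.log((1 - err) / err)       # step 2b: log-odds
        pred = stump_predict(X, *best)
        w = w * np.exp(-alpha * pred * y)           # step 2c: reweighting
        w = w / w.sum()
        alphas.append(alpha)
        stumps.append(best)
    return alphas, stumps

def predict(alphas, stumps, X):
    # step 3: sign of the weighted vote F_M(x)
    F = sum(a * stump_predict(X, j, t, s)
            for a, (j, t, s) in zip(alphas, stumps))
    return np.sign(F)
```

On separable data a single stump already attains zero training error, and the exponential reweighting in step 2c then leaves the weights unchanged up to normalization.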
In step 2b, $\alpha_m$ is a log-odds ratio of the weighted error rate $\varepsilon_m(f_m)$ and is the coefficient of $f_m(x)$ in the output function. When $\varepsilon_m(f_m)$ is small, $\alpha_m$ takes a large value, and when $\varepsilon_m(f_m)$ is near $\frac{1}{2}$, $\alpha_m$ takes a small value. In step 2c, we can rewrite the update rule, equation 1.5, as
$$w_{m+1}(i) \propto \begin{cases} w_m(i) e^{\alpha_m} & \text{if } f_m(x_i) \neq y_i, \\ w_m(i) e^{-\alpha_m} & \text{otherwise}, \end{cases} \tag{1.6}$$
where $w_m(i) = Z_m^{-1} \exp(-F_{m-1}(x_i) y_i)$. In other words, the weight $w_m(i)$ goes up by $e^{\alpha_m}$ when the weak hypothesis $f_m(x)$ fails to classify the training example $(x_i, y_i)$ correctly and goes down by $e^{-\alpha_m}$ otherwise. In a subsequent discussion, we propose alternative updating methods in place of equation 1.6. In addition, the update rule, equation 1.6, has the following property:
$$\varepsilon_{m+1}(f_m) = \frac{1}{2} \quad (m = 1, \ldots, M - 1). \tag{1.7}$$
This implies that the hypothesis $f_m(x)$ optimized with respect to the weight distribution $w_m(i)$ has no predictive accuracy, that is, it is a random guess, under the updated distribution $w_{m+1}(i)$.
Robustifying AdaBoost
While AdaBoost is a very simple and powerful method, it is sensitive to outliers. Robust statistics emerged in the 1960s in the statistical community (Huber, 1981; Hampel, Rousseeuw, Ronchetti, & Stahel, 1986). This concept is based on a geometric interpretation in which the data space is unbounded and continuous. An outlier can be defined as residing far from the bulk of a given data set. A procedure is said to be robust if it is relatively unaffected by outliers, and an influence function is introduced to quantify the effect of an outlier on the procedure. However, the space of labels is limited to $\{-1, 1\}$ in our context, so we cannot use the usual definition of robustness described above. Outlying occurs simply as a transposition between $-1$ and $1$, in which the label $y$ is changed to $-y$. Therefore, we have to take a probabilistic approach rather than a geometric approach. Typically, the distribution of the label $y$ given $x$, $p(y|x)$, is contaminated as
$$(1 - \eta) p(y|x) + \eta p(-y|x), \tag{1.8}$$
with probability $\eta$ (Copas, 1988). In a subsequent discussion, we will give another form of $\eta$ depending on $x$.

We focus on a simple example in order to explore the weakness of AdaBoost under noisy conditions. We generate a data set of 200 examples, which is completely separable by a linear separator, and change the label of one instance only. Figure 1 gives the data set and the classification result provided by AdaBoost. Since AdaBoost exponentially increases the weight for the outlier, some weak hypotheses with a decision boundary near the outlier are selected in the learning steps. Consequently, the output is greatly affected by the outlier, as shown in Figure 1, and leads to poor performance in the sense of generalization error.

In this article, we propose a new boosting algorithm called $\eta$-Boost. We define the naive loss function as
$$L_{\mathrm{naive}}(F) = -\sum_{i=1}^N y_i F(x_i). \tag{1.9}$$
We define the loss function of $\eta$-Boost by a simple modification of equation 1.2 as follows:
$$L_\eta(F) = (1 - \eta) L_{\exp}(F) + \eta L_{\mathrm{naive}}(F), \tag{1.10}$$
where $0 \le \eta < 1$. If we apply $\eta$-Boost to the data set given above, $\eta$-Boost arrives at the same result as that for the separable case, shown in Figure 2. The key in $\eta$-Boost is to consider the uniform weights given by the naive error loss function. We will show that $\eta$-Boost moderates the overweighting of outliers by using a uniform weight distribution.

This letter is organized as follows. Section 2 introduces the algorithm of $\eta$-Boost from equation 1.10 and presents a comparison with AdaBoost.
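For concreteness, the mixture loss in equation 1.10 can be written directly in Python (a minimal sketch; the function names are ours, not the paper's):

```python
import numpy as np

def loss_exp(F, y):
    # Exponential loss, equation 1.2
    return np.sum(np.exp(-y * F))

def loss_naive(F, y):
    # Naive error loss, equation 1.9
    return -np.sum(y * F)

def loss_eta(F, y, eta):
    # Mixture loss of eta-Boost, equation 1.10, with 0 <= eta < 1
    return (1 - eta) * loss_exp(F, y) + eta * loss_naive(F, y)
```

For a misclassified example the exponential term grows like $e^{|F|}$, while the naive term grows only linearly, so increasing $\eta$ damps the influence of examples with large negative margin.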
Figure 1: Data set, Bayes rule, and discriminant function constructed by AdaBoost. Weak hypotheses were stumps. The data set had 200 examples generated by $p(x, y) = \pi_y \phi(x; \mu_y, I)$, where $\pi_1 = 0.65$, $\pi_{-1} = 0.35$, $\mu_1 = (-4.0, 0.0)^T$, $\mu_{-1} = (3.0, 0)^T$. The probability density function $\phi(x; \mu_y, I)$ was a two-dimensional normal distribution having mean vector $\mu_y$ and variance matrix $I$. The Bayes rule is $x_1 = -\frac{1}{2} + \frac{1}{7} \log \frac{\pi_1}{\pi_{-1}}$. The outlier was placed at $(-7.3, 0.0)^T$.
In section 3, we discuss statistical properties and the model associated with $\eta$-Boost. We show that $\eta$-Boost is more robust than AdaBoost under the contamination model, equation 1.8. In section 4, simple computer simulations are presented, and in section 5, we discuss the performance of $\eta$-Boost.

Figure 2: Data set and discriminant function constructed by $\eta$-Boost, $\eta = 0.1$.

2 $\eta$-Boost

2.1 Algorithm of $\eta$-Boost. Let us define $\eta$-Boost as a simple modification of AdaBoost. Assuming $0 \le \eta < 1$, the algorithm is as follows:

1. Start with weights $w^*_1(i) = \frac{1}{N}$ $(i = 1, \ldots, N)$, $F_0(x) = 0$.
2. For $m = 1, \ldots, M$:
   a. Find
   $$f_m(x) = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \varepsilon^*_m(f), \tag{2.1}$$
   where $\varepsilon^*_m(f) = \sum_{i=1}^N w^*_m(i) I(f(x_i) \neq y_i)$.
   b. Calculate
   $$\alpha^*_m = \log \frac{\sqrt{1 - \varepsilon_m(f_m) + (\eta K_m)^2} + \eta K_m}{\sqrt{\varepsilon_m(f_m)}}, \tag{2.2}$$
   where $\varepsilon_m(f)$ is defined as in AdaBoost and
   $$K_m = \frac{1 - 2\varepsilon_1(f_m)}{2\sqrt{\varepsilon_m(f_m)}} \left( \frac{(1 - \eta) Z_m}{N} \right)^{-1}. \tag{2.3}$$
   c. Update the weight distribution as
   $$w^*_{m+1}(i) = \frac{(1 - \eta) \exp(-F_m(x_i) y_i) + \eta}{Z^*_{m+1}}, \tag{2.4}$$
   where $F_m(x) = \sum_{s=1}^m \alpha^*_s f_s(x)$ and $Z^*_{m+1} = \sum_{i=1}^N \left\{ (1 - \eta) \exp(-F_m(x_i) y_i) + \eta \right\}$.
3. Output the discriminant function $\mathrm{sgn}\left(\sum_{m=1}^M \alpha^*_m f_m(x)\right)$.
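The closed-form coefficient in step 2b and the weight update in step 2c can be sketched as follows. This is our own minimal illustration (`eta_boost_step` is a hypothetical name); for $\eta = 0$ it reduces to the AdaBoost log-odds:

```python
import numpy as np

def eta_boost_step(F, y, f_pred, eta):
    """One eta-Boost update. F holds the current values F_{m-1}(x_i),
    f_pred the values f_m(x_i) of the chosen weak hypothesis, 0 <= eta < 1."""
    N = len(y)
    e = np.exp(-F * y)
    Z = e.sum()                        # AdaBoost normalizer Z_m
    w_ada = e / Z                      # AdaBoost weights w_m(i)
    eps_m = w_ada[f_pred != y].sum()   # weighted error rate eps_m(f_m)
    eps_1 = np.mean(f_pred != y)       # error rate under uniform weights
    K = (1 - 2 * eps_1) / (2 * np.sqrt(eps_m)) * N / ((1 - eta) * Z)
    # equation 2.2: closed-form coefficient alpha*_m
    alpha = np.log((np.sqrt(1 - eps_m + (eta * K) ** 2) + eta * K)
                   / np.sqrt(eps_m))
    F_new = F + alpha * f_pred
    # equation 2.4: exponential part mixed with a uniform part
    w = (1 - eta) * np.exp(-F_new * y) + eta
    return alpha, F_new, w / w.sum()
```

Note how the coefficient is adjusted by $\eta K_m$: a hypothesis that is accurate under the uniform weights ($\varepsilon_1(f_m) < \frac{1}{2}$, hence $K_m > 0$) receives a larger coefficient than the plain log-odds.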
If we set $\eta = 0$, this algorithm reduces to AdaBoost. In step 2b, we calculate $\alpha^*_m$, which is the coefficient of $f_m(x)$ in the output discriminant function $F_m(x)$. Here, $\varepsilon_m(f_m)$ is the weighted error rate evaluated by $w_m(i)$, the weight defined in AdaBoost, and $\varepsilon_1(f_m)$ is the weighted error rate under the uniform weight $w_1(i) = \frac{1}{N}$. If we set $\eta = 0$, $\alpha^*_m$ is the log-odds of $\varepsilon_m(f_m)$, just as in AdaBoost. The $\eta$-Boost adjusts the log-odds by $\eta K_m$. If $\varepsilon_1(f_m) \ge \frac{1}{2}$, so that the chosen weak hypothesis $f_m(x)$ has very low performance under the uniform weight $w^*_1(i)$, then $K_m \le 0$ and $\eta$-Boost decreases $\alpha^*_m$ relative to the log-odds. On the other hand, if $\varepsilon_1(f_m) \le \frac{1}{2}$, so that $f_m(x)$ is an efficient hypothesis under $w^*_1(i)$, then $\eta$-Boost increases $\alpha^*_m$.

In step 2c, we update the weight distribution. We can rewrite the update rule as follows:
$$w^*_m(i) = (1 - \delta_m) w_m(i) + \delta_m w_1(i), \tag{2.5}$$
where
$$\delta_m = \frac{\eta Z_1}{(1 - \eta) Z_m + \eta Z_1}. \tag{2.6}$$
Therefore, we can interpret this update rule as moderating the exponential update of weights in AdaBoost using the uniform weight $w_1(i)$ (see Figure 3). By using this weight in step 2a, $\eta$-Boost tends not to select a weak hypothesis that is specialized for outliers, because this weighting takes into account the error rate evaluated at the uniform weight.

2.2 Derivation. We can derive $\eta$-Boost by the sequential minimization of the loss function, equation 1.10 (see Eguchi & Copas, 2001). We now assume that we have the $m$th solution of this minimization problem, $F_m(x)$,
$$F_m(x) = \sum_{s=1}^m \alpha^*_s f_s(x). \tag{2.7}$$
Let us consider an update from $F_m(x)$ to $F_m(x) + \alpha f(x)$ to minimize the loss function, where $f(x)$ is a new weak hypothesis and $\alpha$ is its weight. We first optimize the weak hypothesis $f(x)$. It is difficult to obtain the exact solution except in the AdaBoost case. Here, we adopt an argument similar to that in Hastie, Tibshirani, and Friedman (2001) and Mason, Baxter, Bartlett, and Frean (1999). By expanding $L_\eta(F + \alpha f)$ in $\alpha f(x)$, we obtain
$$L_\eta(F_m + \alpha f) \approx L_\eta(F_m) + \left. \frac{\partial L_\eta(F_m + \alpha f)}{\partial \alpha} \right|_{\alpha = 0} \alpha. \tag{2.8}$$
Figure 3: Weight functions of the uniform weight, AdaBoost, and $\eta$-Boost with respect to $F(x)y$, where $F(x)$ is a discriminant function.
Next, we minimize
$$\left. \frac{\partial L_\eta(F_m + \alpha f)}{\partial \alpha} \right|_{\alpha = 0} = -\sum_{i=1}^N \left\{ (1 - \eta) e^{-F_m(x_i) y_i} + \eta \right\} f(x_i) y_i. \tag{2.9}$$
If we set
$$w^*_{m+1}(i) \propto (1 - \eta) e^{-F_m(x_i) y_i} + \eta, \tag{2.10}$$
this minimization is equivalent to the minimization of $\varepsilon^*_{m+1}(f)$, which is defined in $\eta$-Boost, and we denote its solution by $f_{m+1}(x)$. Next, we optimize $\alpha$. Since the weak hypothesis $f_{m+1}(x)$ is fixed, we can solve
$$\begin{aligned} \min_\alpha L_\eta(F_m + \alpha f_{m+1}) &= \min_\alpha \sum_{i=1}^N \left\{ (1 - \eta) e^{-(F_m(x_i) + \alpha f_{m+1}(x_i)) y_i} - \eta (F_m(x_i) + \alpha f_{m+1}(x_i)) y_i \right\} \\ &= \min_\alpha \Bigg\{ e^{\alpha} (1 - \eta) Z_{m+1} \sum_{i=1}^N I(f_{m+1}(x_i) \neq y_i) w_{m+1}(i) + e^{-\alpha} (1 - \eta) Z_{m+1} \sum_{i=1}^N I(f_{m+1}(x_i) = y_i) w_{m+1}(i) \\ &\qquad - \alpha \eta N \sum_{i=1}^N f_{m+1}(x_i) y_i w_1(i) - \eta \sum_{i=1}^N F_m(x_i) y_i \Bigg\}. \end{aligned} \tag{2.11}$$
Setting
$$\frac{\partial}{\partial \alpha} L_\eta(F_m + \alpha f_{m+1}) = e^{-\alpha} (1 - \eta) Z_{m+1} \varepsilon_{m+1}(f_{m+1}) \left\{ \left( e^{\alpha} - \frac{\eta K_{m+1}}{\sqrt{\varepsilon_{m+1}(f_{m+1})}} \right)^2 - \frac{1 - \varepsilon_{m+1}(f_{m+1}) + (\eta K_{m+1})^2}{\varepsilon_{m+1}(f_{m+1})} \right\} = 0, \tag{2.12}$$
we obtain equation 2.2 in $\eta$-Boost. By repeating the above procedure, we can derive $\eta$-Boost.

Also, in $\eta$-Boost, we observe that
$$\varepsilon^*_{m+1}(f_m) = \frac{1}{2}. \tag{2.13}$$
Since $\alpha^*_m$ is the solution of equation 2.11, $\alpha^*_m$ satisfies
$$\begin{aligned} \left. \frac{\partial}{\partial \alpha} L_\eta(F_{m-1} + \alpha f_m) \right|_{\alpha = \alpha^*_m} &= -\sum_{i=1}^N \left\{ (1 - \eta) e^{-F_{m-1}(x_i) y_i - \alpha^*_m f_m(x_i) y_i} + \eta \right\} f_m(x_i) y_i \\ &= -\sum_{i=1}^N \left\{ (1 - \eta) e^{-F_m(x_i) y_i} + \eta \right\} f_m(x_i) y_i \\ &\propto 2 \varepsilon^*_{m+1}(f_m) - 1 = 0, \end{aligned}$$
which proves equation 2.13.

Next, we discuss the properties of the loss function. We seek a minimizer of the abstract loss for a discriminant function $F(x)$. The abstract loss is
$$E\left[ (1 - \eta) e^{-F(X) Y} - \eta F(X) Y \right], \tag{2.14}$$
where $E$ denotes the expectation with respect to the joint distribution $g(x, y)$ of $(X, Y)$. It is sufficient to minimize the loss function conditional on
$x$ (Friedman, Hastie, & Tibshirani, 2000). Setting
$$\left. \frac{\partial E\left[ (1 - \eta) e^{-\beta Y} - \eta \beta Y \mid X = x \right]}{\partial \beta} \right|_{\beta = F^*_\eta(x)} = g(1|x) \left( -(1 - \eta) e^{-F^*_\eta(x)} - \eta \right) + g(-1|x) \left( (1 - \eta) e^{F^*_\eta(x)} + \eta \right) = 0, \tag{2.15}$$
we obtain
$$\log \frac{g(1|x)}{g(-1|x)} = \log \frac{(1 - \eta) e^{F^*_\eta(x)} + \eta}{(1 - \eta) e^{-F^*_\eta(x)} + \eta}, \tag{2.16}$$
or equivalently
$$g(y|x) = \frac{(1 - \eta) e^{F^*_\eta(x) y} + \eta}{(1 - \eta) \left( e^{F^*_\eta(x)} + e^{-F^*_\eta(x)} \right) + 2\eta}, \tag{2.17}$$
where $F^*_\eta(x)$ is the minimizer of the abstract loss,
$$F^*_\eta(x) = \mathop{\mathrm{argmin}}_F E\left[ (1 - \eta) e^{-F(x) Y} - \eta F(x) Y \mid X = x \right]. \tag{2.18}$$
We observe that
$$\log \frac{g(1|x)}{g(-1|x)} = 0 \iff F^*_\eta(x) = 0, \tag{2.19}$$
so the Bayes rule and $F^*_\eta(x)$ have the same class boundary. Note that $F^*_\eta(x)$ differs from the Bayes rule over the entire region except at the class boundary. In section 3, we discuss the properties of $F^*_\eta(x)$.

2.3 Connection Between $\eta$-Boost and $\eta$-Divergence. Let $\mathcal{X} = \{x_1, \ldots, x_N\}$ be the set of feature vectors in the training data set and $\mathcal{Y} = \{-1, 1\}$ be the label set. We denote by $\mathcal{P}$ the set of nonnegative measures on $\mathcal{Y}$ conditioned on $\mathcal{X}$,
$$\mathcal{P} = \left\{ p(y|x) \; ; \; \sum_{y \in \mathcal{Y}} p(y|x) < \infty, \text{ for each } x \in \mathcal{X} \right\}. \tag{2.20}$$
We denote the empirical distribution of the training data set as
$$\tilde p(x, y) = \frac{1}{N} \sum_{i=1}^N \delta(x, x_i) \delta(y, y_i)$$
and the marginal distribution of $x$ as
$$\tilde p(x) = \frac{1}{N} \sum_{i=1}^N \delta(x, x_i).$$
We define the extended conditional Kullback-Leibler divergence on $\mathcal{P} \times \mathcal{P}$ as
$$D(p, q) = \sum_{x \in \mathcal{X}} \tilde p(x) \sum_{y \in \mathcal{Y}} \left\{ p(y|x) \log \frac{p(y|x)}{q(y|x)} - p(y|x) + q(y|x) \right\}, \tag{2.21}$$
which is also referred to as the I-divergence by Csiszár (1984). If $p(y|x)$ and $q(y|x)$ are both probability distributions, then $D(p, q)$ reduces to the usual conditional version of the Kullback-Leibler divergence. Alternatively, the extended conditional $\eta$-divergence on $\mathcal{P} \times \mathcal{P}$ (Murata, Eguchi, Takenouchi, & Kanamori, 2002) is defined as
$$D_\eta(p, q) = \sum_{x \in \mathcal{X}} \tilde p(x) \sum_{y \in \mathcal{Y}} \left\{ (p(y|x) - \eta) \log \frac{p(y|x) - \eta}{q(y|x) - \eta} - p(y|x) + q(y|x) \right\}. \tag{2.22}$$
We observe that
$$D(p - \eta, q - \eta) = D_\eta(p, q). \tag{2.23}$$
If $\eta = 0$, $D_\eta(p, q)$ reduces to $D(p, q)$. With equation 2.7, which gives the $m$th discriminant function in $\eta$-Boost, we consider the following model, in which $q_m(y|x) - \eta$ is the extended exponential model:
$$q_m(y|x) = (1 - \eta) \exp\left( \frac{1}{2} F_m(x)(y - \tilde y) \right) + \eta, \tag{2.24}$$
where $\tilde y = E_{\tilde p}[y|x]$. We assume that each $x_i \in \mathcal{X}$ has a unique label $y_i$, so that $E_{\tilde p}[y|x_i] = y_i$ (see Lebanon & Lafferty, 2001). The weight $w^*_m(i)$ of $\eta$-Boost corresponds to $q_{m-1}(-y_i|x_i)$, from equation 2.10. We observe that
$$D_\eta(\tilde p, q_m) - D_\eta(\tilde p, q_{m+1}) = \sum_{x \in \mathcal{X}} \tilde p(x) \sum_{y \in \mathcal{Y}} \left\{ (\tilde p(y|x) - \eta) \frac{1}{2} \alpha^*_{m+1} f_{m+1}(x)(y - \tilde y) + (1 - \eta) \left( e^{\frac{1}{2} F_m(x)(y - \tilde y)} - e^{\frac{1}{2} F_{m+1}(x)(y - \tilde y)} \right) \right\}$$
$$= \frac{1}{N} \sum_{i=1}^N \left\{ \eta \alpha^*_{m+1} f_{m+1}(x_i) y_i + (1 - \eta) \left( e^{-F_m(x_i) y_i} - e^{-F_{m+1}(x_i) y_i} \right) \right\} = \frac{1}{N} \left( L_\eta(F_m) - L_\eta(F_{m+1}) \right).$$
Then the sequential minimization of the $\eta$-divergence between the model, equation 2.24, and $\tilde p(y|x)$ is equivalent to the sequential minimization of equation 1.10. In addition, we observe that
$$\begin{aligned} D_\eta(\tilde p, q_m) - D_\eta(\tilde p, q_{m+1}) - D_\eta(q_{m+1}, q_m) &= \sum_{x \in \mathcal{X}} \tilde p(x) \sum_{y \in \mathcal{Y}} \left( \tilde p(y|x) - q_{m+1}(y|x) \right) \frac{1}{2} \alpha^*_{m+1} f_{m+1}(x)(y - \tilde y) \\ &= \frac{1}{N} \sum_{i=1}^N q_{m+1}(-y_i|x_i) \alpha^*_{m+1} f_{m+1}(x_i) y_i \\ &\propto \alpha^*_{m+1} \sum_{i=1}^N w^*_{m+2}(i) f_{m+1}(x_i) y_i \\ &= 0. \end{aligned} \tag{2.25}$$
This follows from equation 2.13. Therefore, if we assume $D_\eta(q_{m+1}, q_m) > 0$, we have
$$D_\eta(\tilde p, q_m) > D_\eta(\tilde p, q_{m+1}). \tag{2.26}$$
Thus, we can say that the model $q_m(y|x)$ approaches the empirical distribution in the sense of $D_\eta$ approximation.

3 Properties of $\eta$-Boost

3.1 Model Associated with $\eta$-Boost. We now fix a vector $f(x)$ of weak hypotheses $(f_1(x), \ldots, f_m(x))$ and a parameter $\alpha_0 \in \mathbb{R}^m$. Let $g_{\eta_0}(x, y)$ be a probability density function of $(X, Y)$, and assume that
$$g_{\eta_0}(x, y) = g(x) g_{\eta_0}(y|x) = g(x) \frac{(1 - \eta_0) e^{\alpha_0 \cdot f(x) y} + \eta_0}{(1 - \eta_0) \left( e^{\alpha_0 \cdot f(x)} + e^{-\alpha_0 \cdot f(x)} \right) + 2\eta_0}, \tag{3.1}$$
where $0 \le \eta_0 < 1$ and $\alpha \cdot f(x)$ denotes the inner product. This model is identical to equation 2.17 with $\eta = \eta_0$ and $F^*_\eta(x) = \alpha_0 \cdot f(x)$. Since $\eta_0$ is basically unknown, we optimize the parameter $\alpha$ by the loss function generated by $(1 - \eta_1) e^{-z} - \eta_1 z$ for some fixed $\eta_1$, $0 \le \eta_1 < 1$. Let
$$\alpha(\eta_1) = \mathop{\mathrm{argmin}}_\alpha E\left[ (1 - \eta_1) e^{-\alpha \cdot f(X) Y} - \eta_1 \alpha \cdot f(X) Y \right], \tag{3.2}$$
where $E$ denotes the expectation with respect to the joint distribution $g_{\eta_0}(x, y)$ of $(X, Y)$. We note that $\alpha(\eta_1)$ satisfies the equilibrium condition
$$E\left[ f(X) Y \left\{ -(1 - \eta_1) e^{-\alpha(\eta_1) \cdot f(X) Y} - \eta_1 \right\} \right] = 0. \tag{3.3}$$
We now consider the abstract error rate for the discriminant function $\alpha(\eta_1) \cdot f(x)$,
$$L(\alpha(\eta_1) \cdot f) = E\left[ I(\alpha(\eta_1) \cdot f(X) Y < 0) \right]. \tag{3.4}$$
In practical situations, we cannot determine the exact value of this abstract error $L$; however, we can usually estimate it by a validation technique based on the given training data. We then obtain the following theorem.

Theorem 1. For any fixed $\eta_1$, $0 \le \eta_1 < 1$,
$$L(\alpha(\eta_1) \cdot f) \ge L(\alpha(\eta_0) \cdot f) = L(\alpha_0 \cdot f), \tag{3.5}$$
where $\alpha_0$ is the parameter of the probability density function $g_{\eta_0}(x, y)$ defined in equation 3.1.

Proof. First, we show that $\alpha(\eta_0) = \alpha_0$. If we substitute $\alpha_0$ for $\alpha(\eta_0)$ in equation 3.3, we have
$$\int \left\{ \frac{(1 - \eta_0) e^{\alpha_0 \cdot f(x)} + \eta_0}{(1 - \eta_0)(e^{\alpha_0 \cdot f(x)} + e^{-\alpha_0 \cdot f(x)}) + 2\eta_0} \left( -(1 - \eta_0) e^{-\alpha_0 \cdot f(x)} - \eta_0 \right) - \frac{(1 - \eta_0) e^{-\alpha_0 \cdot f(x)} + \eta_0}{(1 - \eta_0)(e^{\alpha_0 \cdot f(x)} + e^{-\alpha_0 \cdot f(x)}) + 2\eta_0} \left( -(1 - \eta_0) e^{\alpha_0 \cdot f(x)} - \eta_0 \right) \right\} f(x) g(x) \, dx = 0,$$
since the terms in the braces cancel. Thus, $\alpha_0$ is a solution of equation 3.3. Since the loss function generated by $(1 - \eta_1) e^{-z} - \eta_1 z$ is convex in $z$, equation 3.2 has a unique solution. Thus, we obtain $\alpha(\eta_0) = \alpha_0$.

Next, we consider the Bayes rule
$$\lambda_{\eta_0}(x) = \log \frac{g_{\eta_0}(1|x)}{g_{\eta_0}(-1|x)}. \tag{3.6}$$
Noting that $\alpha(\eta_0) \cdot f(x) > 0 \iff \lambda_{\eta_0}(x) > 0$, we have
$$L(\alpha(\eta_0) \cdot f) = E\left[ I(\alpha(\eta_0) \cdot f(X) Y < 0) \right] = L(\lambda_{\eta_0}). \tag{3.7}$$
Since the Bayes rule minimizes the abstract error rate, equation 3.4 (see McLachlan, 1992), we can prove that $L(\alpha(\eta_1) \cdot f) \ge L(\alpha(\eta_0) \cdot f)$ for all $0 \le \eta_1 < 1$.
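As a numerical sanity check of the proof (our own illustration, not from the paper, with a one-dimensional $f(x) = x$, $\alpha_0 = 1$, $\eta_0 = 0.1$, and $g(x)$ uniform on $[-3, 3]$), the equilibrium condition 3.3 is satisfied at $\alpha = \alpha_0$ exactly when $\eta_1 = \eta_0$:

```python
import numpy as np

def equilibrium(alpha, eta1, alpha0=1.0, eta0=0.1):
    """E[f(X)Y{-(1-eta1)exp(-alpha f(X)Y) - eta1}] under model 3.1,
    with f(x) = x and g(x) uniform on [-3, 3], by a Riemann sum."""
    x = np.linspace(-3.0, 3.0, 4001)
    dx = x[1] - x[0]
    gx = np.full_like(x, 1.0 / 6.0)    # uniform density on [-3, 3]
    D = (1 - eta0) * (np.exp(alpha0 * x) + np.exp(-alpha0 * x)) + 2 * eta0
    total = np.zeros_like(x)
    for y in (1.0, -1.0):
        gyx = ((1 - eta0) * np.exp(alpha0 * x * y) + eta0) / D
        total += gyx * x * y * (-(1 - eta1) * np.exp(-alpha * x * y) - eta1)
    return float(np.sum(total * gx) * dx)
```

At $\eta_1 = \eta_0$ the integrand cancels pointwise, as in the proof; for $\eta_1 \neq \eta_0$, the condition does not hold at $\alpha_0$ and the minimizer $\alpha(\eta_1)$ moves away from $\alpha_0$.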
This theorem shows that the abstract error rate, equation 3.4, attains its minimum when we use the loss function generated by $(1 - \eta_0) e^{-z} - \eta_0 z$. Hence, estimating $\eta_0$ using this property is beneficial. In practice, we evaluate the abstract error rate by 10-fold cross-validation, as described in a subsequent discussion.

3.2 Interpretation of Our Model. We now discuss the statistical interpretation of the model, equation 3.1. Consider the degree of deviation of $g_{\eta_0}(y|x)$ from the logistic model,
$$g_0(y|x) = \frac{e^{\alpha_0 \cdot f(x) y}}{e^{-\alpha_0 \cdot f(x)} + e^{\alpha_0 \cdot f(x)}}. \tag{3.8}$$
See Eguchi and Copas (2002) regarding properties of the logistic model. We observe that
$$g_{\eta_0}(y|x) = \left( 1 - \epsilon(\eta_0, \alpha_0 \cdot f(x)) \right) g_0(y|x) + \epsilon(\eta_0, \alpha_0 \cdot f(x)) \, g_0(-y|x), \tag{3.9}$$
where
$$\epsilon(\eta_0, z) = \frac{\eta_0}{(1 - \eta_0)(e^{z} + e^{-z}) + 2\eta_0}. \tag{3.10}$$
Thus, we can interpret the model, equation 3.1, as a contamination model in which transpositions of the label $y$ occur with probability $\epsilon(\eta_0, \alpha_0 \cdot f(x))$. Note that $\epsilon(\eta_0, \alpha_0 \cdot f(x))$ depends on $\eta_0$, $\alpha_0$, and $x$. It is maximized at $\alpha_0 \cdot f(x) = 0$, where its maximum value is $\frac{\eta_0}{2}$. Therefore, contamination occurs primarily near the boundary, and the probability of contamination decreases exponentially as $x$ moves away from the boundary (see Figure 4).

3.3 Normalized Model. In section 2.3, we discussed the unnormalized set $\mathcal{P}$. Here, we treat the subset $\Delta$ of $\mathcal{P}$, the set of normalized conditional probability distributions,
$$\Delta = \left\{ p(y|x) \in \mathcal{P} \; ; \; \sum_{y \in \mathcal{Y}} p(y|x) = 1, \text{ for each } x \in \mathcal{X} \right\}. \tag{3.11}$$
We now consider the minimization problem of $D_\eta(\tilde p, q)$, where $q(y|x) - \eta$ is proportional to the logistic model, equation 3.8, such that
$$\frac{q(y|x) - \eta}{1 - 2\eta} = g_0(y|x). \tag{3.12}$$
Figure 4: Graph showing $\epsilon(\eta_0, z)$ with respect to $\eta_0$ and $z$.
This model can be rewritten as
$$q(y|x) = (1 - \eta) g_0(y|x) + \eta g_0(-y|x). \tag{3.13}$$
This is a contamination model in which the label $y$ is transposed with the constant probability $\eta$ (Copas, 1988). In addition, the above minimization problem is equivalent to the minimization of the loss function generated by $(1 - 2\eta) \log(1 + e^{-2z}) - \eta z$.

4 Simulation Study

In this section, we describe a number of simulation studies using synthetic data sets and a real data set. One simulation was intended to demonstrate theorem 1, showing that $\eta$-Boost is more robust than AdaBoost under the contaminated model, equation 3.1. A second simulation dealt with a case in which the model was not described by equation 3.1 but in which the data set was affected by some noise. Throughout this section, we implemented AdaBoost, $\eta$-Boost, LogitBoost (Friedman et al., 2000), MadaBoost (Domingo & Watanabe, 2000), and AdaBoost$_{reg}$ (Rätsch, Onoda, & Müller, 2001).

4.1 Case 1. We examined a case in which the training data set was generated by model 3.1. Weak hypotheses were stumps (Friedman et al., 2000).
Figure 5: Training data set generated by equation 4.1. Circled examples were contaminated. In this figure, 31 examples were contaminated.
Let $g(x)$ be the probability density function of the two-dimensional uniform distribution on $(-\pi, \pi) \times (-\pi, \pi)$ and assume
$$g_{\eta_0}(y|x) = \frac{(1 - \eta_0) e^{(x_2 - 3\sin(x_1)) y} + \eta_0}{(1 - \eta_0) \left( e^{(x_2 - 3\sin(x_1))} + e^{-(x_2 - 3\sin(x_1))} \right) + 2\eta_0}. \tag{4.1}$$
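A sampler for this data set can be sketched as follows (our own illustration; `sample_case1` is a hypothetical name, not from the paper's code):

```python
import numpy as np

def sample_case1(n, eta0, rng=None):
    """Draw (x, y) with x uniform on (-pi, pi)^2 and y drawn from
    g_eta0(y|x) in equation 4.1."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-np.pi, np.pi, size=(n, 2))
    z = x[:, 1] - 3 * np.sin(x[:, 0])
    # P(y = 1 | x) from equation 4.1
    p1 = ((1 - eta0) * np.exp(z) + eta0) / (
        (1 - eta0) * (np.exp(z) + np.exp(-z)) + 2 * eta0)
    y = np.where(rng.random(n) < p1, 1, -1)
    return x, y
```

With $\eta_0 = 0$ this is a clean logistic model with boundary $x_2 = 3\sin(x_1)$; as $\eta_0$ grows, labels near the boundary are flipped more often, in accordance with equation 3.10.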
We generated 200 observations of training data and 2000 observations of test data, which followed equation 4.1. Figure 5 shows a training data set. We fixed $\eta_0 = 0.5$. Examples were observed to be particularly contaminated near the boundary. In Figure 6, the training error rate, the test error rate, and the 10-fold cross-validation error rate of $\eta$-Boost at $M = 50$ are shown for each $\eta$. The training error rate was minimized at $\eta = 0$, but the test error rate and the 10-fold cross-validation error rate were both minimized near $\eta = 0.5$. If we select $\eta$ based on the 10-fold cross-validation error rate, $\eta$-Boost is more robust than AdaBoost. Figure 7 shows that the decision boundary constructed by AdaBoost was affected by contaminated examples. In contrast, the decision boundary constructed by $\eta$-Boost appeared robust to contaminated examples, as shown in Figure 8.

Figure 6: Test error, 10-fold cross-validation error, and training error at $M = 50$ with respect to $\eta$. The circled point is the minimum value of each curve.

Table 1 provides the test error rates and training error rates of several methods. The value of $\eta$ in $\eta$-Boost was determined by 10-fold cross-validation, as were the tuning parameters $C, p$ in AdaBoost$_{reg}$. The $\eta$-Boost outperformed the other methods in terms of test error, as predicted by theorem 1.

4.2 Case 2. We next investigated the case of a data set that was not generated by model 3.1 but was affected by some type of noise, so that AdaBoost does not work well. The class $\mathcal{F}$ of weak hypotheses and the distribution of $x$ were the same as in case 1, and we made the following assumption:
$$g'_{\eta_0}(y|x) = (1 - \eta_0) \frac{e^{(x_2 - 3\sin(x_1)) y}}{e^{(x_2 - 3\sin(x_1))} + e^{-(x_2 - 3\sin(x_1))}} + \eta_0 \frac{e^{-(x_2 - 3\sin(x_1)) y}}{e^{(x_2 - 3\sin(x_1))} + e^{-(x_2 - 3\sin(x_1))}}, \tag{4.2}$$
where $\eta_0$ was fixed as 0.1. This is a contamination model in which the transposition of the label $y$ occurs with probability $\eta_0$ uniformly throughout the feature space. We generated 200 observations of training data and 2000 observations of test data from this model. Table 2 shows the results. The $\eta$-Boost is superior to the other methods, except AdaBoost$_{reg}$, in terms of test error, although the simulation setting, equation 4.2, was not as well suited to $\eta$-Boost as that of equation 4.1.

Figure 7: The discriminant function constructed by AdaBoost.

4.3 Breast Cancer Wisconsin. We used a real data set from the UC-Irvine machine learning archive, Breast Cancer Wisconsin. This data set consisted of 683 observations. A feature vector $x$ consisted of nine variables, coded on a scale of 1 to 10, and the number of classes was two. We split the data into 500 training examples and 183 test examples. Table 3 shows that AdaBoost$_{reg}$ was superior to the other methods in terms of test error.

We also considered a case in which some examples were deliberately given reversed labels. This artificial mislabeling was randomly performed only for examples near the decision boundary, which was obtained by one of the above methods (for example, AdaBoost) based on the real data set; the choice of method does not affect the result. Approximately 10% of the example labels were exchanged.
Figure 8: The discriminant function constructed by $\eta$-Boost, where $\eta$ was selected by the 10-fold cross-validation error.

Table 1: Test Error Rates and Training Error Rates for the Synthetic Data Set.

                 Test Error   Training Error
  AdaBoost         0.2180        0.1224
  η-Boost          0.1934        0.1410
  LogitBoost       0.2352        0.1488
  MadaBoost        0.2433        0.1588
  AdaBoost_reg     0.1972        0.0930
Table 2: Test Error Rates and Training Error Rates for the Synthetic Data Set.

                 Test Error   Training Error
  AdaBoost         0.2461        0.1609
  η-Boost          0.2439        0.1991
  LogitBoost       0.2566        0.1740
  MadaBoost        0.2565        0.1753
  AdaBoost_reg     0.2256        0.1145
Table 3: Test Error Rates and Training Error Rates for the Breast Cancer Wisconsin Data Set.

                 Test Error   Training Error
  AdaBoost         0.0494        0.0168
  η-Boost          0.0508        0.0386
  LogitBoost       0.0549        0.0230
  MadaBoost        0.0562        0.0222
  AdaBoost_reg     0.0285        0.0475
Table 4: Test Error Rates and Training Error Rates for the Breast Cancer Wisconsin Data Set with Synthetic Noise.

                 Test Error   Training Error
  AdaBoost         0.0578        0.0646
  η-Boost          0.0527        0.0773
  LogitBoost       0.0631        0.0712
  MadaBoost        0.0666        0.0754
  AdaBoost_reg     0.0539        0.0714
The results of this test are shown in Table 4. The $\eta$-Boost outperformed the other methods in terms of test error.

4.4 Comparison. Let us now discuss the relation between $\eta$-Boost and the previously proposed boosting algorithms LogitBoost, MadaBoost, and AdaBoost$_{reg}$. LogitBoost can be viewed as the normalized version of AdaBoost (see Lebanon & Lafferty, 2001), in which the loss function is
$$L_{\mathrm{Logit}}(F) = \sum_{i=1}^N \log\left( 1 + e^{-2 F(x_i) y_i} \right).$$
The loss function of MadaBoost is
$$L_{\mathrm{Mada}}(F) = \sum_{i=1}^N U(F(x_i) y_i),$$
where
$$U(z) = \begin{cases} \frac{1}{2} - z & (z < 0), \\ \frac{1}{2} e^{-2z} & (z \ge 0). \end{cases}$$
The loss function of AdaBoost$_{reg}$ is regularized as follows:
$$L_{\mathrm{reg}}(F) = \sum_{i=1}^N \exp\left( -F(x_i) y_i - C \mu(F(x_i), y_i; p) \right),$$
where $C, p$ are tuning parameters and $\mu(F(x_i), y_i; p)$ is a nonnegative regularization quantity, which takes a high value if an example is difficult to learn. If $C = 0$, AdaBoost$_{reg}$ reduces to AdaBoost. (See Rätsch et al., 2001, for a detailed discussion.)

In the above simulation results, $\eta$-Boost and AdaBoost$_{reg}$ were observed to have good robustness performance on all types of noisy data sets. The $\eta$-Boost has an advantage in computational simplicity compared with AdaBoost$_{reg}$. The algorithm of AdaBoost$_{reg}$ requires numerical optimization for $\alpha$, the coefficient of a weak hypothesis, as do LogitBoost and MadaBoost. Additionally, the algorithm has two tuning parameters, $C$ and $p$, and $C$ takes a wide range of values. Thus, AdaBoost$_{reg}$ has a high computational cost in terms of time. In contrast, the algorithm of $\eta$-Boost has the closed form of $\alpha$ given in equation 2.2, and the only tuning parameter is $\eta$, which is bounded between 0 and 1.

The loss functions of LogitBoost and MadaBoost are saturated in the sense of growing only linearly with $F(x)y$. Such losses are said to be B-robust (Hampel et al., 1986) in the feature space $x$, and MadaBoost is the most B-robust (Murata et al., 2002). But in the classification context, it is more important and effective to consider robustness against outliers in the label space $y$. This consideration leads to the contamination modeling associated with the loss function of $\eta$-Boost. We observed that $\eta$-Boost is more effective than LogitBoost and MadaBoost for several noisy data sets.

5 Conclusion

We have derived $\eta$-Boost by adding a simple modification to AdaBoost. The algorithm is equivalent to the sequential minimization of a loss function that is a mixture of the exponential loss function and the naive error loss function. In addition, we can interpret this minimization problem as the sequential minimization of the $\eta$-divergence between the empirical distribution and the extended exponential model.
The key feature of $\eta$-Boost is the consideration of the uniform weight distribution and the naive error rate. With respect to the weight distribution over the training data set, $\eta$-Boost moderates the exponential update rule of AdaBoost by using the uniform weight distribution. In addition, the coefficient of a weak hypothesis $f(x)$ in the output function is tuned by the unweighted error rate of $f(x)$. These operations make the $\eta$-Boost algorithm robust. We have given a statistical interpretation of $\eta$-Boost, demonstrating that $\eta$-Boost is especially robust when dealing with a contaminated logistic model that has more contamination near the class boundary. We compared $\eta$-Boost numerically with several other methods and confirmed this by computer simulation. For the real data set, we compared $\eta$-Boost with several other methods and confirmed that $\eta$-Boost attains better robustness performance on noisy data sets.
References

Copas, J. (1988). Binary regression models for contaminated data. J. Royal Statist. Soc. B, 50, 225–265.
Csiszár, I. (1984). Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probability, 12, 768–793.
Domingo, C., & Watanabe, O. (2000). MadaBoost: A modification of AdaBoost. In Proceedings of the 13th Conference on Computational Learning Theory. San Mateo, CA: Morgan Kaufmann.
Eguchi, S., & Copas, J. (2001). Recent developments in discriminant analysis from an information geometric point of view. J. Korean Statist. Soc., 30, 247–264.
Eguchi, S., & Copas, J. (2002). A class of logistic type discriminant functions. Biometrika, 89, 1–22.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55, 119–139.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Ann. Statist., 28, 337–407.
Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Lebanon, G., & Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent in function space. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley.
Murata, N., Eguchi, S., Takenouchi, T., & Kanamori, T. (2002). Information geometry of U-boost and Bregman divergence (ISM Research Memorandum 860). Tokyo: Institute of Statistical Mathematics.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5, 197–227.
Schapire, R. (1999). Theoretical views of boosting. In Proceedings of the 4th European Conference on Computational Learning Theory.

Received January 2, 2003; accepted June 27, 2003.
Communicated by Peter Bartlett
LETTER
Boosting with Noisy Data: Some Views from Statistical Theory
Wenxin Jiang
[email protected]
Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.
This letter is a comprehensive account of some recent findings about AdaBoost in the presence of noisy data when approached from the perspective of statistical theory. We start from the basic assumption of weak hypotheses used in AdaBoost and study its validity and implications for generalization error. We recommend studying the generalization error and comparing it to the optimal Bayes error when data are noisy. Analytic examples are provided to show that running the unmodified AdaBoost forever will lead to overfitting. On the other hand, there exist regularized versions of AdaBoost that are consistent, in the sense that the resulting prediction will approximately attain the optimal performance in the limit of large training samples.
1 Introduction

During the past decade, especially after the invention of the popular AdaBoost algorithm (Freund & Schapire, 1997), boosting techniques have attracted much attention from researchers in computer science, machine learning, and statistics. These techniques are characterized by using a sequential linear combination of functions in a base hypothesis space for the purpose of classification or regression. At each time or step $t$, the linear combination incorporates one more term to optimize an objective function. Examples include AdaBoost (Freund & Schapire, 1997) for classification and Matching Pursuit (Mallat & Zhang, 1993) for regression.

As Schapire (1999) stated in an overview article, "The most basic theoretical property of AdaBoost concerns its ability to reduce training error." The training error decreases exponentially fast, subject to an assumption of "weak base hypotheses," which we will discuss here. Furthermore, AdaBoost can also reduce the generalization error. In many cases, the generalization error continues to decrease after hundreds of steps and even after the training error becomes zero. There have been some explanations of this phenomenon based on upper bounds using the theory of margin (Schapire, Freund, Bartlett, & Lee, 1998; Koltchinskii & Panchenko, 2002) or top (Breiman, 1997a).

Neural Computation 16, 789–810 (2004)
© 2004 Massachusetts Institute of Technology
However, there are some limitations to these previous results. First, regarding the results on training error, the validity and implications of the assumption of weak base hypotheses are not clear. For general sequential optimization algorithms, the assumption of weak base hypotheses requires that the percentage improvement of the objective function be bounded away from zero at each time $t$, so as to guarantee an exponential rate of overall improvement. For AdaBoost, this is equivalent to requiring that the error $\hat\epsilon_t$ on a weighted data set be bounded away from 0.5, the value corresponding to random guessing. Schapire et al. (1998) are uncertain whether the "weak edge" $(\hat\epsilon_t - 0.5)$ will eventually deteriorate to zero. They state that "characterizing the conditions under which the increase is slow is an open problem." The first half of this article addresses this open problem.

Second, the previous results on generalization error make no comparison to the optimal standard: the Bayes error. Consider the case of noisy data. Here $P(Z = 1|X) \notin \{0, 1\}$, where $X$ is the predictor or input (e.g., patient characteristics such as age, smoking status, family history) and $Z$ is the response or label (e.g., whether the patient has breast cancer). No prediction rule for $Z$ given $X$ can be perfect here, since two patients can have the same $X$ yet different $Z$'s. This is in contrast to traditional machine learning, where $X$ completely determines $Z$. The best possible generalization error for the noisy case is the so-called Bayes error (e.g., Devroye, Györfi, & Lugosi, 1996):
$$L^* \equiv E_X \min\{P(Z = 1|X), P(Z = -1|X)\} = \inf \text{(Gen. error)},$$
which is also a measure of the intrinsic level of noise and plays a role similar to that of the error variance in regression problems. Therefore, when data are noisy, it is more appropriate to study the difference (Gen. error) − (Bayes error) rather than (Gen. error) itself.
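As a tiny illustration (our own, not from the letter), the Bayes error of a discrete noisy distribution can be computed directly from its definition:

```python
import numpy as np

def bayes_error(p1_given_x, p_x):
    """L* = E_X min{P(Z=1|X), P(Z=-1|X)} for a finite input space.
    p1_given_x[i] = P(Z=1 | X=x_i); p_x[i] = P(X=x_i)."""
    return float(np.sum(p_x * np.minimum(p1_given_x, 1.0 - p1_given_x)))

# Hypothetical example: three input values with noisy labels.
p_x = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.9, 0.5, 0.2])   # none of these is 0 or 1, so data are noisy
L_star = bayes_error(p1, p_x)    # 0.5*0.1 + 0.3*0.5 + 0.2*0.2 = 0.24
```

Even the best possible classifier errs with probability $L^* = 0.24$ here, which is why a generalization error bound that ignores $L^*$ cannot be $o_n(1)$.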
Many previous works on AdaBoost bound the generalization error with sample quantities related to, for example, margin (Schapire et al., 1998; Koltchinskii & Panchenko, 2002) or top (Breiman, 1997a), without comparing to the Bayes error. However, when data are noisy (with a nonzero Bayes error), the generalization error (or any of its upper bounds) cannot be $o_n(1)$. (Here, $n$ denotes the sample size and $o_n(1)$ denotes a small term that converges to zero as $n$ increases.) One can only hope for consistency, that is, the difference being small: (Generalization error) − (Bayes error) = $o_n(1)$. The second half of this article compares the generalization error to the optimal Bayes error. We will consider the following questions:

• Under what conditions will the assumption of weak base hypotheses be valid? Is the assumption restrictive or widely valid? What are some implications of the assumption for the generalization error?

• Is it good to run AdaBoost forever? Is $\lim_{t\to\infty}$ (generalization error) good, or is it suboptimal when compared to the optimal Bayes error?
• How good is the best prediction generated during the process of AdaBoost? Is $\inf_t$ (generalization error) good, or is it suboptimal when compared to the optimal Bayes error?

• Is regularization unnecessary for AdaBoost, or is it potentially helpful?

A careful study of these questions is reported in this article, which summarizes the results from our less accessible technical reports and extended abstracts (Jiang, 1999, 2000a, 2000b, 2000c; Jiang, in press). Due to limitations on programming expertise, we have opted for an analytic approach from statistical theory. Therefore, the conclusions are often obtained in situations that are friendly to analytic treatment and have been focused on AdaBoost. The context and conditions for each result are carefully reported. Despite these limitations, our preliminary study of these posed questions suggests a coherent picture of the behavior of AdaBoost when treating noisy data and emphasizes the importance of regularization. There are numerous works on other interesting aspects of boosting with general cost functions, to name a few: Lugosi and Vayatis (2002) and Mannor, Meir, and Zhang (2002) on consistency of regularized boosting algorithms; Zhang (in press), Lin (2001), and Bartlett, Jordan, and McAuliffe (2003) on cost functions and risk bounds; Zhang and Yu (2003) on convergence analysis; Blanchard, Lugosi, and Vayatis (2003) on rates of convergence; Rosset, Zhu, and Hastie (2002) on boosting and maximum margin; and Efron, Hastie, Johnstone, and Tibshirani (2002) on least angle regression. The rest of this article is organized as follows. Section 3 serves as a prelude, with a brief summary of some findings about Matching Pursuit (MP) (Mallat & Zhang, 1993), which is recognized as a regression-boosting algorithm by Friedman, Hastie, and Tibshirani (2000).
These results have led us to suspect that AdaBoost for classification may possess similar properties when handling noisy data; this forms the main topic of the article and will be considered in detail. We begin by introducing the formalisms for regression, classification, and boosting.

2 Regression, Classification, and Boosting

In statistical learning, we have a set of data $S = (X_i, Y_i)_1^n$, where $X_1^n$ are predictors, which are assumed here (only for convenience) to lie in $[0, 1]$ and take distinct values. We allow the responses $Y_1^n$ to be random (conditional on the $X_i$'s) to accommodate potential noise in the data. We use lowercase letters to denote sample realizations of random objects; for example, $x_1^n$ denotes a set of values for $X_1^n$, which is called a set of design points. The sample size is $n$. The $Y_i$'s are real for regression problems and are $\{0, 1\}$ valued for classification problems, where the transform $Z_i = 2Y_i - 1$, valued in $\{-1, +1\}$ (the labels), is often used. In learning, we usually have a hypothesis space of real regression functions $H_r$ or a hypothesis space of $\{\pm 1\}$ valued classification functions $H_c$ to fit
the data. Here, a hypothesis space $H_{r,c}$ is a set of functions $f : [0,1] \to \mathbb{R}$ or $f : [0,1] \to \{\pm 1\}$, respectively. A relatively simple hypothesis space, called the base hypothesis space or base system $H_{r,c}$, can be made more complex by linear combinations of $t$ members, giving the $t$-combined system or $t$-combined hypothesis space denoted $\mathrm{lin}_t(H_{r,c})$. Formally, $\mathrm{lin}_t(H) = \{\sum_{s=1}^t \alpha_s f_s : (\alpha_s, f_s) \in \mathbb{R} \times H\}$. A sign-valued function $\hat f = \hat f(\cdot|S)$ from $H_c$ can be selected to form a prediction of future labels given a future predictor, based on a training sample $S = (X_i, Z_i)_1^n$. The training error is defined to be $n^{-1} \sum_{i=1}^n I[Z_i \ne \hat f(X_i|S)]$. The generalization error is defined to be $P[Z \ne \hat f(X|S)]$, the probability of misclassifying a future predictor-response pair $(X, Z)$. It is assumed that $(X_i, Z_i)_1^n$ and $(X, Z)$ are independent and identically distributed (i.i.d.). We briefly note that the corresponding concepts in the regression case can also be defined based on the squared error.

In boosting, a cost function $C(F|S)$ is used, which depends on the sample $S$ and is a functional of $F$, where $F \in \mathrm{lin}_t(H_{r,c})$ for some $t$. A boosting algorithm acting on a base system $H_{r,c}$ minimizes $C(F|S)$ with respect to $F$ sequentially and adaptively, in a way similar to the following. (We here assume complete minimizations in step 2a for simplicity. Approximate minimizations can also be allowed; see, e.g., Jiang, 1999.)

1. Let $\hat F_0 = 0$.

2. For all $t = 1, 2, \ldots$:

   a. Let $(\hat\alpha_t, \hat f_t) = \arg\min_{(\alpha, f) \in \mathbb{R} \times H_{r,c}} C(\hat F_{t-1} + \alpha f \mid S)$.

   b. Let $\hat F_t = \hat F_{t-1} + \hat\alpha_t \hat f_t$.
For the regression case or the classification case, respectively, at step (or time) $t$, $\hat F_t(x)$ or $\mathrm{sgn}(\hat F_t(x))$ forms the prediction for a future response $Y$ or $Z$ when the predictor takes value $x$, for all $x \in [0, 1]$. This framework of boosting with a general cost function is similar to the ones used in Friedman et al. (2000) and Mason, Baxter, Bartlett, and Frean (1999). Some examples we will consider include LSBoost.Reg, the least-squares boosting algorithm for regression, where the square cost $C(F|S) = n^{-1} \sum_{i=1}^n \{Y_i - F(X_i)\}^2$ is used, and AdaBoost, the adaptive boosting algorithm for classification by Freund and Schapire (1997), where the exponential cost $C(F|S) = n^{-1} \sum_{i=1}^n e^{-Z_i F(X_i)}$ is used. In AdaBoost, with the minimization partially completed over the coefficient of the linear combination, step 2a is equivalent to the more familiar formulation 2a' of training on a reweighted data set:

2a'. Find some $f = \hat f_t \in H_c$ that minimizes
\[
\Delta_t(f) \equiv \inf_{\alpha \in \mathbb{R}} \left\{ C(\hat F_{t-1} + \alpha f \mid S) / C(\hat F_{t-1} \mid S) \right\} = 2\sqrt{\frac{1}{4} - \left\{\frac{1}{2} - \epsilon_t(f)\right\}^2}.
\]
Then set $\hat\alpha_t = 0.5 \ln\{(1 - \hat\epsilon_t)/\hat\epsilon_t\}$. (When $H_c$ is negation closed, the minimization step is equivalent to finding $\hat f_t$ to minimize the weighted training error $\epsilon_t(f)$.)

Here we denote the weighted training errors $\epsilon_t(f) = \sum_{j=1}^n w_j^{(t)} I\{f(x_j) \ne z_j\}$ and $\hat\epsilon_t = \epsilon_t(\hat f_t)$, with weights $w_j^{(t)} = e^{-\hat F_{t-1}(x_j) z_j} / \sum_{k=1}^n e^{-\hat F_{t-1}(x_k) z_k}$.
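This reweighting loop can be sketched with a stump base space over mid-data points (the helper `adaboost_stumps`, the toy labels, and the round budget are ours, not the article's; stumps here play the role of $\mathrm{SGN}_M$ and its negations):

```python
import numpy as np

def adaboost_stumps(x, z, T):
    """Minimal AdaBoost with the exponential cost and a negation-closed
    stump base space {±sgn(· - a)}, with a ranging over the mid-data points
    (midpoint standardization). x: 1-d array; z: labels in {-1, +1}."""
    n = len(x)
    xs = np.sort(x)
    mids = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    # enumerate the base space once: each row is one hypothesis evaluated on x
    stumps = np.vstack([s * np.where(x > a, 1.0, -1.0)
                        for a in mids for s in (+1, -1)])
    miss = (stumps != z).astype(float)        # indicators I{f(x_j) != z_j}
    F = np.zeros(n)
    for _ in range(T):
        m = -F * z
        w = np.exp(m - m.max())
        w /= w.sum()                          # weights w_j^{(t)}
        eps = miss @ w                        # weighted errors eps_t(f) over H
        k = int(np.argmin(eps))               # step 2a': minimize eps_t(f)
        e = float(np.clip(eps[k], 1e-12, 1 - 1e-12))
        F = F + 0.5 * np.log((1 - e) / e) * stumps[k]   # alpha_t * f_t
        if np.all(np.sign(F) == z):           # training error already zero
            break
    return F

rng = np.random.default_rng(2)
x = rng.uniform(size=20)
z = np.where((x > 0.3) & (x < 0.7), 1.0, -1.0)   # no single stump fits this
F = adaboost_stumps(x, z, T=4000)
print(np.mean(np.sign(F) != z))   # 0.0: a perfect fit, as the training-error theory predicts
```

The budget $T = 4000$ exceeds $2n^2\log(n+1) \approx 2436$ for $n = 20$, so the exponential training-error bound of section 5 guarantees a perfect fit here.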
It is noted that LSBoost.Reg is essentially the same as the matching pursuit (MP) algorithm of Mallat and Zhang (1993), used in signal processing for sequential combination of waveforms, which was later recognized by Friedman et al. (2000) as an analog of AdaBoost for least-squares regression. This analogy is helpful for studying some theoretical properties of AdaBoost: properties of LSBoost.Reg are often easier to derive due to the use of the square cost, which can be instructive for the study of AdaBoost. Other approaches to regression boosting include methods that involve discretization (see, e.g., Freund & Schapire, 1997; Ridgeway, Madigan, & Richardson, 1999), work based on a boosting theorem for noiseless regression (Avnimelech & Intrator, 1999), and methods using splines for regularized base learning (Bühlmann & Yu, in press).

3 Prelude: What We Found from LSBoost.Reg

We first briefly summarize some results from Jiang (1999, 2000c) on regression boosting. They suggest that AdaBoost may possess similar properties, which are described in more detail later:

• The assumption of weak base hypotheses is valid, and the percentage reduction of the square cost can be bounded away from 0 whatever the responses $y_1^n$ are, if and only if the base hypothesis space $H$ is complete on $x_1^n$ (i.e., $H(x_1^n) \equiv \{f(x_1^n) : f \in H\}$ spans $\mathbb{R}^n$).
• This implies a perfect fit on the training set with probability one at $t = \infty$, which has a poor generalization error for nonparametric regression with fixed $x_1^n$. Therefore, boosting forever is asymptotically suboptimal.

• When $H(x_1^n)$ is an orthonormal basis of $\mathbb{R}^n$.
Now we consider AdaBoost for classification and provide more details in coming to similar conclusions.

4 Weak Hypotheses for AdaBoost

The following result, proved in the appendix, provides a way of guaranteeing the weak base hypotheses based on a characterization of the base hypothesis space, which may be called the completeness on the set of design points. The subsequent lemma shows that the completeness condition is widely valid.

Theorem 1 (A Characterization for Weak Hypotheses). The weighted training error, when optimized over the base hypothesis space $H$, is bounded away from 0.5 by a finite amount $\delta$, uniformly for all choices of weights and labels, if and only if $H$ is complete on $x_1^n$.

More formally, denote $\mathcal{P}^n = \{w_1^n \in [0,1]^n : \sum_{i=1}^n w_i = 1\}$. Then we have:

i.
\[
\inf_{(w_1^n, z_1^n) \in \mathcal{P}^n \times \{\pm 1\}^n} \; \sup_{f \in H} \left| \sum_{i=1}^n w_i I\{z_i \ne f(x_i)\} - 0.5 \right| \ge \delta
\]
for some $\delta > 0$, if and only if $H(x_1^n) \equiv \{f(x_1^n) : f \in H\}$ spans $\mathbb{R}^n$.
ii. When $H$ is negation closed, so that $f \in H$ whenever $-f \in H$, we have
\[
\sup_{(w_1^n, z_1^n) \in \mathcal{P}^n \times \{\pm 1\}^n} \; \inf_{f \in H} \sum_{i=1}^n w_i I\{z_i \ne f(x_i)\} \le 1/2 - \delta
\]
for some $\delta > 0$, if and only if $H(x_1^n) \equiv \{f(x_1^n) : f \in H\}$ spans $\mathbb{R}^n$.

iii. the set of disks with any radius $r \ge 0$: $\mathrm{DSK}(r) \equiv \{I(\cdot \in [a-r, a+r]) - I(\cdot \notin [a-r, a+r]) : a \in \mathbb{R}\}$.
but widely valid instead, and the weak edge $(\hat\epsilon_t - 0.5)$ will not deteriorate to 0 for almost all data realizations, even for $H$ as simple as the stumps. This to a large extent answers the open problem of Schapire et al. (1998) affirmatively. (The case with ties of $x$-values is treated in Jiang, 1999.)

• On the other hand, the wide validity of this assumption, over almost all possible data realizations with arbitrary labels, has the following implications: A perfect fit will be generated on the training set at $t = \infty$ (and for all sufficiently large $t$). This is because, due to the theory of Freund and Schapire (1997) and Schapire (1999), the weak edges can accumulate and the training error decreases exponentially. The generalization error will be comparable to the error of the NN (nearest neighbor) rule for large $t$. Such a performance is suboptimal compared to the optimal Bayes error when data are noisy. This point will be discussed in more detail later.

4.1 Other Works on Weak Hypotheses. Originally, the weak edges required in the theory of training error reduction were guaranteed by a base learning algorithm that is weak PAC (probably approximately correct; see Freund & Schapire, 1997). Soon it was realized (Breiman, 1998) that a weak PAC learning algorithm does not exist for noisy data (when $L^* > 0$). Although no base learning algorithm can be weak PAC according to the original formulation when treating noisy data, our work shows that the weak edges $(\hat\epsilon_t - 0.5)$ on the weighted training sets are still guaranteed to be bounded away from zero in most common situations, which suffices for obtaining an exponential rate of training error reduction. This, however, implies a suboptimal generalization error when running AdaBoost forever. This implication of the assumption of weak hypotheses for the generalization error is obtained by studying the assumption in a way that accounts for all possible realizations of the random labels.
Previously, this assumption has been studied by Freund (1995), Freund and Schapire (1996), and Breiman (1997a, 1997b), for example, but the characterization depends on a given realization of the pattern or labels. In comparison, our approach clarifies that the assumption holds for most common base hypothesis spaces and that it holds for all possible realizations of the random labels, which leads to the implications regarding the generalization error. On the other hand, since our formalism protects against all possible labels, the rate we found for the training error reduction may be slower than the actual reduction that one experiences for a given data realization. Recently, Mannor and Meir (2002) independently showed the existence of weak learners such that the weak edge satisfies $\hat\epsilon_t < 1/2 - \delta$ for all $t$, for some $\delta > 0$. They also showed the existence of effective weak learners where $\delta \succeq \log n / \sqrt{n}$, by assuming a finite gap between data points with opposite
labels. In contrast to their existence results, our work has provided a more complete characterization based on properties of the base hypothesis space $H$. For example, the base hypothesis space of delta functions or disks in lemma 1 typically guarantees weak hypotheses but is not covered by the existence result of Mannor and Meir (2002). Also, we note that effective weak learners do not exist for noisy data (with a nonzero Bayes error $L^* > 0$). Otherwise, by boosting long enough and choosing a large sample, the bound from Schapire et al. (1998) would become smaller than the Bayes error, which is impossible. We therefore suspect that the finite gap assumption used in Mannor and Meir (2002) may not be reasonable for situations of noisy data. The negative result on the existence of effective weak learners in noisy situations may be summarized in the following proposition (proved in the appendix), which suggests that with high probability for large data, the weak edge satisfies $|\hat\epsilon_t - 0.5| \preceq \log n / \sqrt{n}$ infinitely often as $t$ increases. This phenomenon was also noticed empirically by Breiman (private communication).

Proposition 1. Suppose the base hypothesis space $H \supset \mathrm{SGN}$ and is negation closed with VC dimension $VC(H) < \infty$. Suppose $X_1^n$ are distinct with probability one and the Bayes error $L^* > 0$. Then the weighted training errors $\hat\epsilon_t$ of AdaBoost satisfy the following:
\[
\lim_{n \to \infty} P_S\left[ \liminf_{t \to \infty} |\hat\epsilon_t - 0.5| \le \left(\frac{8}{L^*}\right) \frac{\log\{n/VC(H)\}}{\sqrt{n/VC(H)}} \right] = 1.
\]
5 Boosting Forever

Below we will show why running the unmodified AdaBoost forever (or for a sufficiently long time) will lead to a suboptimal generalization error.

5.1 Boosting with Fixed Predictors. We first consider the case with fixed predictors. Consider $x_1^n$ being nonstochastic and without ties. Define the generalization error as the average of the probabilities of misclassifying the new labels $Z_i^{new}$ at the design points, that is, $L_n^t = n^{-1} \sum_{i=1}^n P[\hat Z^t(x_i|S) \ne Z_i^{new}]$, where $\hat Z^t(x_i|S) = \mathrm{sgn}(\hat F_t(x_i))$ is the prediction obtained from AdaBoost at time $t$ based on sample $S$. Assume that $P(Z_i^{new} = 1|x_i) = P(Z_i = 1|x_i) = \mu_i$, where $(\mu_i)_1^n$ is called the signal. The minimal possible generalization error is the Bayes error $L^* = n^{-1} \sum_{i=1}^n \min(\mu_i, 1 - \mu_i)$, which also measures the level of noise. The following result is proved in the appendix:

Theorem 2 (Limit Inconsistency, Fixed x). Suppose $H \supset \mathrm{SGN}$. Then for all $t \ge 2n^2 \log(n+1)$, the prediction $\hat Z^t$ is the same as the NN rule, and we have $L_n^t = L^* + R$ (the NN limit),
where the overfit
\[
R = n^{-1} \sum_{i=1}^n 2|\mu_i - 0.5| \min(\mu_i, 1 - \mu_i) \in [0, \min(0.125, L^*)].
\]
Remark 2.

• The NN limit is generally suboptimal or not consistent. The amount of overfit $R$ can be nonzero for noisy data ($L^* > 0$).

• $R$ is small when data are generated with small noise (small $L^*$).

• $R$ can become as large as 0.125, when $\mu_i = 0.25$ or 0.75, but it cannot be larger than 0.125.

• The time needed to reach the NN limit may be long ($\sim n^2 \log n$).

5.2 Boosting with Random Predictors on [0, 1]. Now we consider random continuous predictors on $[0, 1]$. We consider $X_1^n \sim_{iid} \mathrm{Unif}[0, 1]$ (therefore, we can assume the sample realizations $x_1^n$ to be distinct: with probability one, this will happen). Assume that $(X_i, Z_i)_1^n$ and $(X, Z)$ are i.i.d., and denote $P(Z = 1|X = \cdot) = \mu(\cdot)$, the signal. We consider a base hypothesis space $H = \mathrm{SGN}$, or any set (PWC) of piecewise constant sign-valued functions with jump locations free to vary. The boosted prediction at any $t$ is badly nonunique; there exist uncountably many solutions. This is because moving the jump locations of $f \in H$ without crossing any of $x_1^n$ leaves the cost function unchanged. To reduce the number of solutions, one can use the midpoint standardization: allow jumps only at the mid-data points from $M = \{0, \frac{x_{(1)} + x_{(2)}}{2}, \ldots, \frac{x_{(n-1)} + x_{(n)}}{2}, 1\}$. Here $x_{(j)}$ is the $j$th smallest of $x_1^n$. For example, after midpoint standardization, the set of sign functions SGN becomes $\mathrm{SGN}_M = \{\mathrm{sgn}(\cdot - a) : a \in M\}$. Define the generalization error to be $L_n^t = P[\hat Z^t(X|S) \ne Z]$ for the prediction $\hat Z^t$ obtained from $t$ steps of AdaBoost with midpoint standardization; the optimal Bayes error is $L^* = E \min\{\mu(X), 1 - \mu(X)\}$. The following result is proved in the appendix:

Theorem 3 (Limit Inconsistency, Random X on [0, 1]). With midpoint standardization, for all $t \ge 2n^2 \log(n+1)$, the boosted prediction $\hat Z^t$ becomes the NN rule with probability one, and therefore
\[
L_n^t = L^* + R + o_n(1),
\]
where the limiting overfit $R = E[2|\mu(X) - 0.5| \min\{\mu(X), 1 - \mu(X)\}] \in [0, \min(0.125, L^*)]$.
Remark 3.

• The NN limit $L^* + R$ is generally suboptimal or not consistent.

• The limiting overfit $R$ can be nonzero for noisy data ($L^* > 0$).

• $R$ is small when the noise level is low (with a small $L^*$).

• $R = 0.125$ when $\mu(X) \in \{0.25, 0.75\}$, but it cannot be larger than that.

• The time needed to reach the NN limit may be long ($\sim n^2 \log n$).
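The overfit term $R$ of theorems 2 and 3 has a simple closed form that can be checked numerically; a small sketch (the function name and the toy signal values are ours, not the article's):

```python
import numpy as np

def nn_limit_terms(mu):
    """Bayes error L* and the overfit R of the NN limit (theorems 2 and 3),
    for signal values mu[i] = P(Z = 1 | x_i) at equally weighted points."""
    mu = np.asarray(mu, dtype=float)
    Lstar = float(np.mean(np.minimum(mu, 1.0 - mu)))
    R = float(np.mean(2.0 * np.abs(mu - 0.5) * np.minimum(mu, 1.0 - mu)))
    return Lstar, R

mu = np.array([0.25, 0.75, 0.25, 0.75])
Lstar, R = nn_limit_terms(mu)
print(Lstar, R)   # 0.25 0.125: R attains its maximum at mu = 0.25 or 0.75

# The NN limit matches the classical nearest-neighbor error, since pointwise
# 2*mu*(1-mu) = min(mu, 1-mu) + 2*|mu - 0.5|*min(mu, 1-mu).
assert np.isclose(Lstar + R, float(np.mean(2.0 * mu * (1.0 - mu))))
assert 0.0 <= R <= min(0.125, Lstar)
```

The identity in the last check is why the limit is exactly the asymptotic NN error $E\,2\mu(1-\mu)$, decomposed into the Bayes floor plus the overfit.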
What about boosting forever with a higher-dimensional random continuous predictor $X$ with $\dim(X) > 1$? We do not have theoretical results on this so far. However, we conjecture similarly suboptimal generalization errors for noisy cases. The results on training error reduction imply that the fit will be perfect for almost all sample realizations and will agree with the NN rule at all the data points as well as in some of their neighborhoods. The limiting prediction presumably cannot perform much better than the NN rule. Recent empirical studies also confirm that even for high-dimensional data, boosting forever is still suboptimal when compared to predictions obtained earlier in the boosting process (Grove & Schuurmans, 1998; Friedman et al., 2000; Mason et al., 1999).

6 Boosting in the Process: Consistency

It can be shown that a prediction generated during the process of AdaBoost is not only better than running AdaBoost forever but is also consistent, in the sense of having a generalization error close to the optimal Bayes error for large sample size $n$. The proof is more technically involved and is contained in Jiang (in press).

Theorem 4 (Process Consistency for AdaBoost). Under general regularity conditions:
\[
\inf_t L_n^t = L^* + o_n(1).
\]
Remark 4.

• Here $X$ can be random, and $\dim(X)$ can be larger than 1.

• The proof uses the results of Breiman (2000) on the population version of AdaBoost ($n = \infty$). Breiman provides some conditions and shows that with infinite sample size ($n = \infty$), running AdaBoost forever generates an optimal prediction, which attains the Bayes error $L^*$.

• The major conditions on the base hypothesis space $H$ include compactness, a finite VC dimension, and that the linear span of $H$ is complete in
$L_2(P_X)$, where $P_X$ is the probability measure of $X$. Also, the base learning (step 2a' in section 2) is assumed to generate a consistent sequence that achieves approximate minimization over $H$ in some sense as the sample size $n$ increases.

• So even if boosting forever is generally suboptimal or inconsistent, somewhere in the process AdaBoost can generate a better, consistent prediction.

• It is then recommended that regularization by truncation is a reasonable subject for future study. However, the optimal time $t$ is not known from our nonconstructive proof technique. We used a $t_n$ that increases very slowly to infinity with $n$. This suffices to prove consistency but is probably associated with a very slow rate of convergence. In other words, in some situations, a very large sample size $n$ may be needed for AdaBoost at the optimal time to achieve a good performance (close to the optimal Bayes error).

7 Some Regularization Schemes

There are other regularization methods that lead to consistency, are easier to study than truncation, and may have better rates of convergence than what can be obtained from simply truncating AdaBoost. We will discuss two of the many possibilities of regularization in detail. We first summarize a study of AdaBoost with a regularization step called quantization (Jiang, 1999, 2000c). The quantization method is probably not useful in practice, but it makes analytical results easy to derive and shows the existence of regularized variants of AdaBoost that can prevent overfitting for noisy data. For a family of smooth signals, the quantized algorithm generates consistent classification rules. In fact, the rate of convergence (as sample size $n$ increases) of the generalization error to the Bayes error is minimax optimal for this family. Another regularization scheme that we will discuss is based on averaging AdaBoost predictions from subsamples, which may be more practically useful.
This was motivated by a similar method of "bag boosting" introduced and studied numerically by Krieger, Long, and Wyner (2001). We are able to derive analytic results only in the one-dimensional case ($\dim(X) = 1$), where we show consistency and a near-optimal convergence rate (within a $\log n$ factor of the minimax rate) for a family of smooth signals. However, experiments in higher dimensions using similar ideas have also shown significant improvement over AdaBoost when treating noisy data (Krieger et al., 2001). There are many different ways of regularizing AdaBoost to avoid overfit. For example, Mannor et al. (2002) (via regulating step sizes) and Lugosi and Vayatis (2002) (via regulating the sum of weights in some sense) provided universally consistent regularization methods that do not overfit for any signal and have studied other cost functions in addition to the exponential one used in AdaBoost. Other methods of regularization include shrinkage and randomization (Friedman, 2001, 2002), complexity penalty (Mason et al., 1999), and different approaches to bagging (Breiman, 1996, 1999; Bühlmann & Yu, 2002). The convergence rates of different regularization schemes generally depend on which problems and which types of signals are considered. Different regularization methods may suit different problems. In fact, AdaBoost without regularization may handle certain noiseless problems better than the regularized variants (Jiang, 2000b). Clearly, further study of regularization schemes is needed, with attention paid to the practical side (ease of implementation and computational burden), to theoretical studies of consistency and convergence rates for different families of signals, and to numerical assessment of finite sample performance.
7.1 Quantization. Suppose $X_1^n \sim_{iid} \mathrm{Unif}([0,1]^d)$. Partition $[0,1]^d$ into cells $\cup_{k=1}^m A_k$, and choose representative design points $\xi_k \in A_k$ for $k = 1, \ldots, m$, where $m \approx 1/h^d$ and the volume $\mathrm{vol}(A_k) \approx h^d$. First, construct a quantized data set $S^{qa} = (\xi_{k(i)}, z_i)_1^n$ out of the original $S = (x_i, z_i)_1^n$, where $\xi_{k(i)}$ is the design point of the cell that contains $x_i$. Then run AdaBoost on $S^{qa}$, and define $\hat Z_{qa}^t(x) = \hat Z^t(\xi_k)$ if $x \in A_k$. Denote $N(\pm|k) = \sum_{i=1}^n I(x_i \in A_k, z_i = \pm 1)$, the number of positive or negative labels in cell $A_k$. The generalization error for the quantized rule is $L_n^t = P[\hat Z_{qa}^t \ne Z]$; the Bayes error is $L^* = E \min\{P(Z = 1|X), P(Z = -1|X)\}$ as before. Then the following results can be obtained (result ii is obtained by adapting known results on quantization; see, e.g., Devroye et al., 1996), with the proof given in the appendix:

Theorem 5 (Quantization and Limit Consistency). Suppose $H(\xi_1^m) = \{f(\xi_1^m) : f \in H\}$ spans $\mathbb{R}^m$. Then:

i. For all sufficiently large $t$, for any $x$ in an unbalanced cell $A_k$ with $N(+|k) \ne N(-|k)$, we have $\hat Z_{qa}^t(x) = \mathrm{sgn}\{N(+|k) - N(-|k)\}$, which coincides with the histogram rule.

ii. Suppose the signal $P(Z = 1|\cdot)$ is smooth (being Lipschitz on $[0,1]^d$). When $h \sim n^{-1/(d+2)}$, we have
\[
\lim_{t \to \infty} L_n^t = L^* + O(n^{-1/(d+2)}).
\]
Remark 5.

• With quantization, one can run boosting forever and still have asymptotic optimality and consistency, even with noisy data ($L^* > 0$), for smooth signals.
• In addition to consistency, the rate $n^{-1/(d+2)}$ of $\lim_{t\to\infty} L_n^t - L^*$ is minimax optimal for protecting against all smooth signals (see Yang, 1999, for $d = 1$).

• With noiseless labels ($L^* = 0$), however, the unmodified AdaBoost at $t = \infty$ may be consistent and may even possess a better convergence rate than $n^{-1/(d+2)}$ in some situations (Jiang, 2000b).
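Result i above says that, in the limit, quantized AdaBoost is just a per-cell majority vote. A sketch of that histogram rule with the bandwidth $h \sim n^{-1/(d+2)}$ of result ii, for $d = 1$ (the helper name and the toy signal are ours, not the article's):

```python
import numpy as np

def histogram_rule(x, z, h):
    """Histogram classifier on [0, 1): the rule that quantized AdaBoost reaches
    in the limit (theorem 5i). x: sample in [0, 1); z: labels in {-1, +1}."""
    m = max(1, int(round(1.0 / h)))               # m ~ 1/h cells of width h
    cells = np.minimum((x * m).astype(int), m - 1)
    votes = np.zeros(m)
    np.add.at(votes, cells, z)                    # N(+|k) - N(-|k) per cell
    def predict(xq):
        k = np.minimum((np.asarray(xq) * m).astype(int), m - 1)
        return np.where(votes[k] >= 0, 1.0, -1.0)
    return predict

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(size=n)
mu = np.where(x > 0.5, 0.9, 0.1)                  # noisy signal: L* = 0.1
z = np.where(rng.uniform(size=n) < mu, 1.0, -1.0)
predict = histogram_rule(x, z, n ** (-1.0 / 3.0))  # h ~ n^{-1/(d+2)} with d = 1
xq = rng.uniform(size=5000)
zq = np.where(rng.uniform(size=5000) < np.where(xq > 0.5, 0.9, 0.1), 1.0, -1.0)
err = float(np.mean(predict(xq) != zq))
print(err)   # near the Bayes error L* = 0.1, not near 0
```

Averaging labels within cells is what removes the NN-style overfit: the per-cell vote estimates the sign of $2\mu - 1$ instead of memorizing individual noisy labels.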
7.2 Averaging Subsample Predictions. The context is similar to that of section 5.2. We are considering $X \sim \mathrm{Unif}[0, 1]$. The base hypothesis space is $\mathrm{SGN}_M$ (or, more generally, $\mathrm{PWC}_M$), which contains piecewise constant hypotheses having splits chosen from the mid-data points. The regularization scheme involves averaging subsample predictions as follows:

i. Divide the training sample $(X_i, Z_i)_1^n$ randomly into $K$ subsamples of size $m$. (For convenience, we assume $n = Km$. A careful treatment of the remainder sample in the general case, using a weighted average, leads to the same asymptotic results.)

ii. For subsample $k = 1, \ldots, K$, run AdaBoost for $t$ steps, and define the resulting prediction rule (at any $x \in [0, 1]$) as $\hat z_k^{(t)}(x)$.

iii. Compute the average of these predictions, $\bar z_K^{(t)} = K^{-1} \sum_{k=1}^K \hat z_k^{(t)}$, and use $\hat Z_K^t(x) \equiv \mathrm{sgn}(\bar z_K^{(t)}(x))$ to predict the value of the unknown label $Z$ for a future observation with $X = x$.

It turns out that the average prediction $\hat\zeta_K^t(x) \equiv \bar z_K^{(t)}(x)$ estimates the (conditional) mean function $\zeta(x) = E(Z|X = x)$, and the corresponding classification rule $\hat Z_K^t(x) = \mathrm{sgn}\{\hat\zeta_K^t(x)\}$ achieves the optimal Bayes error $L^*$ at a near-optimal rate. To formulate the results, denote by $L_K^t$ the generalization error of the rule $\hat Z_K^t$, and by $\mathrm{MSE}_K^t = E\{\hat\zeta_K^t(X) - \zeta(X)\}^2$ the mean square error. Denote $D = \sup_{x \in [0,1]} |\zeta'(x)|$, or alternatively assume $\zeta$ is Lipschitz, so that $|\zeta(u) - \zeta(v)| \le D|u - v|$ for all $u, v \in [0, 1]$. Then we have the following result, which is proved in the appendix:

Theorem 6 (Averaging and Consistency). For $K = (n/D)^{2/3}$ and $t \ge 2(nD^2)^{2/3} \log\{(nD^2)^{1/3} + 1\}$, we have
\[
L_K^t - L^* \le \sqrt{\mathrm{MSE}_K^t}, \qquad \mathrm{MSE}_K^t \le (1/3)(n/D)^{-1/3} \log(n/D) \{1 + o_n(1)\}.
\]
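Steps i through iii can be sketched compactly. Since theorems 2 and 3 identify the $t \to \infty$ limit of (midpoint-standardized) AdaBoost with the NN rule, the sketch substitutes each subsample's fully boosted fit with its 1-NN rule; the helper names, the rough choice $K \approx n^{2/3}$, and the toy signal are ours, not the article's:

```python
import numpy as np

def nn_predict(xtr, ztr, xq):
    """1-NN rule: a stand-in for the t -> infinity limit of AdaBoost
    on a subsample (theorems 2 and 3)."""
    idx = np.abs(xq[:, None] - xtr[None, :]).argmin(axis=1)
    return ztr[idx]

def averaged_subsample_rule(x, z, K, xq, seed=0):
    """Steps i-iii of section 7.2: split into K subsamples, fit each to its
    boosting limit (here its NN rule), average the predictions, take the sign."""
    perm = np.random.default_rng(seed).permutation(len(x))
    zbar = np.zeros(len(xq))
    for chunk in np.array_split(perm, K):
        zbar += nn_predict(x[chunk], z[chunk], xq)
    return np.where(zbar >= 0, 1.0, -1.0)

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(size=n)
mu = np.where(x > 0.5, 0.8, 0.2)        # noisy signal; Bayes error L* = 0.2
z = np.where(rng.uniform(size=n) < mu, 1.0, -1.0)
xq = rng.uniform(size=2000)
zq = np.where(rng.uniform(size=2000) < np.where(xq > 0.5, 0.8, 0.2), 1.0, -1.0)

err_nn = float(np.mean(nn_predict(x, z, xq) != zq))   # ~ L* + R = 0.32
err_avg = float(np.mean(averaged_subsample_rule(x, z, K=int(n ** (2 / 3)), xq=xq) != zq))
print(err_nn, err_avg)   # averaging lands near L* = 0.2; plain NN overfits toward 0.32
```

Each subsample fit is as noisy as before, but the sign of the average tracks $\mathrm{sgn}\{\zeta(x)\}$, which is exactly the Bayes rule.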
Remark 6.

• With averaging of subsample results, overfitting is prevented. One can run boosting forever and still have asymptotic optimality and consistency, even with noisy data ($L^* > 0$), for smooth mean functions.

• In addition to consistency, the rate $n^{-1/3} \log n$ of $L_K^t - L^*$ is within a $\log n$ factor of the minimax optimal rate ($n^{-1/3}$) for protecting against all smooth mean functions in the Lipschitz family (see Yang, 1999).
• Correspondingly, the average prediction $\hat\zeta_K^t(x) \equiv \bar z_K^{(t)}(x)$ also estimates the mean function $\zeta(x) = E(Z|X = x)$ consistently, with mean square error $\mathrm{MSE}_K^t$ converging to zero at a near-optimal rate. This can also be used to produce a reliable estimate of "the signal" $\mu(x)$, the probability of a positive label, by noting that $\mu(x) \equiv P(Z = 1|X = x) = \{1 + \zeta(x)\}/2$.

• The implementation of this regularization scheme is simple and modular, since it involves only manipulation of outcomes from the standard AdaBoost algorithm. Although all the results are derived for one-dimensional $X$, we suspect that this algorithm, with a suitable choice of the number of subsamples and the number of boosting rounds, will also work well in higher dimensions. A similar "bag-boosting" algorithm, where the subsamples can overlap, has shown good numerical performance in higher dimensions in both noisy and noiseless situations (Krieger et al., 2001), which motivated the current work.

8 Conclusion: What We Found for AdaBoost

In summary, similar to the case of regression boosting briefly described in section 3, we found the following:

• Weak hypotheses can be guaranteed based on a characterization of the base hypothesis space ($H(x_1^n)$ spanning $\mathbb{R}^n$).
• The assumption of weak hypotheses is usually valid, and the weak edges are bounded away from 0 as $t$ increases, even for base hypothesis spaces as simple as the stumps.

• This implies a perfect fit on the training set with probability one for all sufficiently large times, and a generalization error comparable to the NN error, which is suboptimal for noisy data. Therefore, "boosting forever" can overfit for noisy data.

• Under general regularity conditions, $\inf_t$ (Generalization error) = (Bayes error) + $o_n(1)$. Therefore, even if boosting forever can be suboptimal, boosting in the process can generate consistent predictions.

• For noisy data, regularization methods can do better than boosting forever without regularization.

Therefore, although AdaBoost seems to be resistant to overfitting, there is still room for improvement achievable by regularization, especially when one is treating noisy data. The emerging literature on regularization of boosting is not unnecessary, but should be encouraged instead: for example, shrinkage and randomization (Friedman, 2001, 2002), complexity
penalty (Mason et al., 1999), bagging (Breiman, 1996, 1999; Bühlmann & Yu, 2002; Krieger et al., 2001), regulating step sizes (Mannor et al., 2002), and regulating the effective sum of weights (Lugosi & Vayatis, 2002). An important future step for research on boosting algorithms will be to study and compare the efficacy of different regularization methods in various situations. This includes theoretical studies of consistency and convergence rates of the regularization methods when used to learn different families of probability signals, numerical studies of finite sample performance, and practical considerations of implementability and computational burden.

Appendix: Technical Proofs

Proof of Theorem 1. We first prove statement i. Note that $0.5 - \sum_{i=1}^n w_i I\{z_i \ne f(x_i)\} = 0.5 \sum_{i=1}^n w_i z_i f(x_i)$. For the "if" part of statement i, suppose $H(x_1^n)$ spans $\mathbb{R}^n$ but
\[
\inf_{(w_1^n, z_1^n) \in \mathcal{P}^n \times \{\pm 1\}^n} \; \sup_{f \in H} \left| \sum_{i=1}^n w_i z_i f(x_i) \right| = 0.
\]
Then there would exist $(w_1^n, z_1^n) \in \mathcal{P}^n \times \{\pm 1\}^n$ such that $\sum_{i=1}^n w_i z_i f(x_i) = 0$ for all $f \in H$. Then all $w_i z_i = 0$ (since $H(x_1^n)$ spans $\mathbb{R}^n$), and therefore $0 = \sum_{i=1}^n |w_i z_i| = \sum_{i=1}^n w_i = 1$, which would be absurd. For the "only if" part of statement i, suppose (*) $\inf_{(w_1^n, z_1^n) \in \mathcal{P}^n \times \{\pm 1\}^n} \sup_{f \in H} |\sum_{i=1}^n w_i z_i f(x_i)| > 0$ but $H(x_1^n)$ does not span
Proof of Proposition 1. Denote $\eta = (8/L^*)\{n/VC(H)\}^{-1/2} \log\{n/VC(H)\}$. Apply the last equation (together with their choice of the auxiliary parameter $N$) of the proof of theorem 2, as well as theorem 5, of Schapire et al. (1998). Take their tuning parameter $\theta = 0.5\eta$. Then some algebra leads to the following: For any $1 > \delta > 0$, with probability over $S$ at least $1 - \delta$, we have
\[
L^* \overset{(a)}{\le} P_{Z,X|S}[Z \hat F_t(X) \le 0] \le \prod_{s=1}^t g(\hat\epsilon_s, \theta) + 0.5 L^* \{1 + o_n(1)_\delta\},
\tag{A.1}
\]
where $g(\hat\epsilon_s, \theta) = 2\hat\epsilon_s^{(1-\theta)/2}(1 - \hat\epsilon_s)^{(1+\theta)/2}$, and $o_n(1)_\delta$ is a term that depends on $\delta$ and converges to zero as $n \to \infty$ for any $\delta$. Note that in step (a), due to the negation closure of $H$, $H \supset \mathrm{SGN}$, and the fact that $X_1^n$ are distinct with probability one, we know from our theorem 1 and lemma 1 that $\hat\epsilon_t < 1/2$ for all $t$, and therefore $\sum_{s=1}^t \hat\alpha_s > 0$; hence
\[
P_{Z,X|S}[Z \hat F_t(X) \le 0] = P_{Z,X|S}\left[ Z \, \frac{\sum_{s=1}^t \hat\alpha_s \hat f_s(X)}{\sum_{s=1}^t \hat\alpha_s} \le 0 \right]
\]
with probability one over $S$. This has allowed us to apply theorems 2 and 5 of Schapire et al. (1998), where the linear combination in AdaBoost is normalized by the denominator $\sum_{s=1}^t \hat\alpha_s$ (see their equation 10). Under such events of random data $S$, which have probability at least $1 - \delta$, suppose that as $t$ increases, $|\hat\epsilon_t - 0.5| > 2\theta$ for all but a finite number of times. Then the product $\prod_{s=1}^t g(\hat\epsilon_s, \theta)$ would converge to zero, since for all $s$ after a finite time, the factors $g(\hat\epsilon_s, \theta)$ are bounded below one. Note that $g(\hat\epsilon_s, \theta) < \max\{g(0.5 - 2\theta, \theta), g(0.5 + 2\theta, \theta)\} < 1$ whenever $|\hat\epsilon_s - 0.5| > 2\theta$. (We take $n$ large enough that $0.5 \pm 2\theta$ lie between 0 and 1.) Therefore, taking $\lim_{t\to\infty}$ on both sides of equation A.1, we would obtain $L^* \le 0 + 0.5 L^* \{1 + o_n(1)_\delta\}$, which cannot be true if the data size $n$ is greater than some $n(\delta)$ such that the term $o_n(1)_\delta$ is less than 0.99. So we have to conclude instead that $|\hat\epsilon_t - 0.5| \le 2\theta$ infinitely often, with probability at least $1 - \delta$, for all $n > n(\delta)$.
$$\hat L_n^t \equiv n^{-1}\sum_{i=1}^{n} I[z_i \ne \mathrm{sgn}\,\hat F_t(x_i)] \le \prod_{s=1}^{t}\sqrt{1-4(\hat\epsilon_s-0.5)^2},$$
which will decrease exponentially fast if the weak edge $|\hat\epsilon_t - 0.5|$ is bounded away from zero. Applying Lemma 2 (below) to the weak edge implies that $\hat L_n^t \le (1-n^{-2})^{t/2}$, which will be $\le (n+1)^{-1}$ for all $t \ge \tau(n) \equiv 2n^2\log(n+1)$, and will in fact be exactly 0, since the training error takes values in $\{k/n\}_{k=0}^{n}$. Then for all $t \ge \tau(n)$, the prediction will be perfect on the data $S$, as long as $x_1^n$ are distinct in $(0,1)$, which happens with probability one in the contexts of both Theorems 2 and 3. This implies that the limiting prediction is equivalent to the NN rule in both contexts. (In the context of Theorem 3, note that with the midpoint standardization, a piecewise-constant rule that fits $S$ perfectly is almost surely the same as the NN rule. This is because, near any $x$-location of a data point, say $x_i$, the predicted label remains the same as the corresponding observed label $z_i$ before reaching the nearest mid-data points in $M$, where all candidate jumps are located.) The derivation of the generalization error of the NN rule and how it compares to the optimal Bayes error is well known (see, e.g., Devroye et al., 1996).
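As a numeric sanity check (a sketch, not from the paper), the two bounds above combine in a few lines: when every weak edge is at least $1/(2n)$, each factor of the Freund–Schapire product is at most $\sqrt{1-n^{-2}}$, and the resulting bound $(1-n^{-2})^{t/2}$ indeed falls below $(n+1)^{-1}$ once $t \ge \tau(n) = 2n^2\log(n+1)$:

```python
import math

def training_error_bound(n: int, t: int) -> float:
    """Bound (1 - n**-2)**(t/2) on the AdaBoost training error after t
    rounds, valid when every weak edge satisfies |eps_s - 0.5| >= 1/(2n),
    since then each factor sqrt(1 - 4*(eps_s - 0.5)**2) <= sqrt(1 - n**-2)."""
    return (1.0 - n ** -2) ** (t / 2.0)

def tau(n: int) -> int:
    """tau(n) = 2 * n**2 * log(n + 1), rounded up to a whole round count."""
    return math.ceil(2 * n ** 2 * math.log(n + 1))

# The bound is below 1/(n+1) at t = tau(n), forcing zero training error.
for n in (5, 50, 500):
    assert training_error_bound(n, tau(n)) <= 1.0 / (n + 1)
```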
Boosting with Noisy Data
Lemma 2. Suppose $H \supset \mathrm{SGN}_M$ and $x_1^n \in (0,1)$ are distinct. Then the weak edge satisfies $|\hat\epsilon_t - 0.5| \ge 1/(2n)$ for all $t$. (A similar result has been proved independently by Breiman, private communication.)

Proof for Lemma 2. When $H \supset \mathrm{SGN}_M$ and $x_1^n$ are distinct and located in $(0,1)$, the following condition is satisfied, which suffices to prove the lemma: $H$ and $x_1^n$ are such that for each $x_i$, $i = 1,\dots,n$, there exist two functions $f_\pm \in H$ that take different values ($\pm 1$, respectively) at $x_i$ but agree at all other locations $x_j$ ($j \ne i$). We can just take $f_\pm$ to be two step functions with jumps located immediately to the left and to the right of $x_i$, respectively, from the set of mid-data points $M$. This condition ensures that

$$\sup_{f\in H}\left|\sum_{i=1}^{n} w_i I\{z_i \ne f(x_i)\} - 0.5\right| \ge (2n)^{-1} \tag{A.2}$$

for all weights $w_1^n \in \mathcal P_n$. To prove this equation, note that in order for the $n$ weights to sum to one, one weight, say $w_1$ without loss of generality, has to be $\ge 1/n$. Then note that $0.5 - \sum_{i=1}^{n} w_i I\{z_i \ne f(x_i)\} = 0.5\sum_{i=1}^{n} w_i z_i f(x_i)$, and therefore $2\left|\sum_{i=1}^{n} w_i I\{z_i \ne f(x_i)\} - 0.5\right| = |w_1 z_1 f(x_1) + \Delta(f)|$, where $\Delta(f) = \sum_{i=2}^{n} w_i z_i f(x_i)$. The previous condition ensures that there exist two functions $f_\pm \in H$ such that $f_\pm(x_1) = \pm 1$ but $\Delta(f_+) = \Delta(f_-)$ (call this common value $\Delta$). Pick $f \in \{f_+, f_-\}$ such that $f(x_1) = z_1\,\mathrm{sgn}(\Delta)$. Then we have

$$2\left|\sum_{i=1}^{n} w_i I\{z_i \ne f(x_i)\} - 0.5\right| = |w_1 z_1 z_1\,\mathrm{sgn}(\Delta) + \Delta| = w_1 + |\Delta| \ge w_1 \ge 1/n,$$

thereby proving equation A.2. This equation, when applied to $(w_j)_{j=1}^{n} = (w_j^{(t)})_{j=1}^{n}$ (for any $t$), implies that

$$|\hat\epsilon_t - 0.5| = |\epsilon_t(\hat f_t) - 0.5| = \sup_{f\in H}\left|\sum_{i=1}^{n} w_i^{(t)} I\{z_i \ne f(x_i)\} - 0.5\right| \ge 1/(2n).$$
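Inequality A.2 can also be checked by brute force on a small instance (an illustrative sketch, not part of the paper): enumerating all step classifiers with jumps at the mid-data points, plus the two constant classifiers, so that the class is closed under negation, the best weighted edge never drops below $1/(2n)$, whatever the weights and labels:

```python
import random

def best_edge(x, z, w):
    """sup over step classifiers f(u) = s if u > c else -s, with cuts c at
    the mid-data points (plus c = 0 and c = 1, which give the two constant
    classifiers), of |sum_i w_i 1{z_i != f(x_i)} - 0.5|."""
    xs = sorted(x)
    cuts = [0.0, 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    best = 0.0
    for c in cuts:
        for s in (-1, 1):
            err = sum(wi for xi, zi, wi in zip(x, z, w)
                      if zi != (s if xi > c else -s))
            best = max(best, abs(err - 0.5))
    return best

random.seed(0)
n = 8
x = [random.random() for _ in range(n)]          # distinct points in (0, 1)
z = [random.choice([-1, 1]) for _ in range(n)]   # arbitrary labels
w = [random.random() for _ in range(n)]
w = [wi / sum(w) for wi in w]                    # arbitrary normalized weights
assert best_edge(x, z, w) >= 1.0 / (2 * n)       # the bound of Lemma 2
```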
Proof of Theorem 5. For statement i, consider running AdaBoost on the quantized data set. The criterion function $C(\hat F_t)$ used in AdaBoost is bounded below by zero and is nonincreasing due to the sequential minimization. Therefore, $C(\hat F_t)$ converges. Note that (Breiman, 2000)

$$C(\hat F_{t-1}|S^{qa})^2 - C(\hat F_t|S^{qa})^2 = \sup_{f\in H}\left\{n^{-1}\sum_{i=1}^{n} e^{-\hat F_{t-1}(\xi_{k(i)})z_i} z_i f(\xi_{k(i)})\right\}^2 \ge 0.$$
So the right-hand side $\sup_{f\in H}\Delta_t(f)^2 \to 0$ as $t$ increases. Here we denote $\Delta_t(f) \equiv n^{-1}\sum_{i=1}^{n} e^{-\hat F_{t-1}(\xi_{k(i)})z_i} z_i f(\xi_{k(i)})$, which is equal to $\sum_{k=1}^{m} f(\xi_k)\omega_k^{t-1}$, where

$$\omega_k^{t-1} = n^{-1}\sum_{i:\,x_i\in A_k} e^{-\hat F_{t-1}(\xi_k)z_i} z_i = n^{-1}\{N(+|k)\,e^{-\hat F_{t-1}(\xi_k)} - N(-|k)\,e^{\hat F_{t-1}(\xi_k)}\}.$$

Define $\omega^{t-1} = (\omega_k^{t-1})_{k=1}^{m}$, a column vector. Pick $m$ functions $\{f_l\}_{l=1}^{m}$ in $H$ such that the matrix $\Phi = [f_l(\xi_k)]_{(l,k)=(1,1)}^{(m,m)}$ is nonsingular, which is possible by the completeness of $H(\xi_1^m)$. Then $\Delta_t(f_l) \to 0$ for $l = 1,\dots,m$ implies $\Phi\omega^{t-1} \to 0$ and therefore $\omega^{t-1} \to 0$. Then for any $k = 1,\dots,m$, $\hat F_{t-1}(\xi_k) \to 0.5\ln\{N(+|k)/N(-|k)\}$ if $N(\pm|k)$ are not both zero. Then, when $N(+|k) \ne N(-|k)$, $\hat Z_{t-1}(\xi_k) = \mathrm{sgn}\circ\hat F_{t-1}(\xi_k) = \mathrm{sgn}\{N(+|k) - N(-|k)\}$ for all sufficiently large $t$. Then the quantized rule is $\hat Z_t^{qa}(x) = \mathrm{sgn}\{N(+|k) - N(-|k)\}$ whenever $x \in A_k$, which coincides with the histogram rule when $N(+|k) \ne N(-|k)$.

For statement ii, note that

$$(L^* \le)\; L_n^t = P[\hat Z_t^{qa}(X) \ne Z] \stackrel{(a)}{=} \sum_{k=1}^{m}|A_k|\,P[\hat Z_t(\xi_k) \ne Z \mid X \in A_k]$$
$$\stackrel{(b)}{=} \sum_{k=1}^{m}|A_k|\,P[\hat Z_t(\xi_k) \ne Z,\, e_k \mid X \in A_k] + \sum_{k=1}^{m}|A_k|\,P[\hat Z_t(\xi_k) \ne Z,\, e_k^c \mid X \in A_k]$$
$$\le \sum_{k=1}^{m}|A_k|\,P[\hat Z_t(\xi_k) \ne Z,\, e_k \mid X \in A_k] + \sum_{k=1}^{m}|A_k|\,P[e_k^c \mid X \in A_k]$$
$$\stackrel{(c)}{=} \sum_{k=1}^{m}|A_k|\,E\{I[\hat Z_t(\xi_k) \ne Z,\, e_k] \mid X \in A_k\} + \sum_{k=1}^{m}|A_k|\,P[e_k^c].$$
In step (a), $|A_k| = P[X \in A_k]$; in step (b), $e_k$ and $e_k^c$ represent the events $[N(+|k) \ne N(-|k)]$ and $[N(+|k) = N(-|k)]$, respectively; in step (c), we used the independence of the events $e_k^c$ and $[X \in A_k]$. Then take $\lim_{t\to\infty}$ of both sides of the above expression for $L_n^t$, and use the result in statement i that under the event $e_k$, for all sufficiently large $t$, $\hat Z_t(\xi_k)$ becomes the same as the histogram rule $\hat Z_{hist}(\xi_k) = \mathrm{sgn}\{N(+|k) - N(-|k)\}$. Then, by the dominated convergence theorem, $E\{I[\hat Z_t(\xi_k) \ne Z,\, e_k] \mid X \in A_k\} \to E\{I[\hat Z_{hist}(\xi_k) \ne Z,\, e_k] \mid X \in A_k\}$.
So we obtain

$$(L^* \le)\; \lim_{t\to\infty} L_n^t \le \sum_{k=1}^{m}|A_k|\,E\{I[\hat Z_{hist}(\xi_k) \ne Z,\, e_k] \mid X \in A_k\} + \sum_{k=1}^{m}|A_k|\,P[e_k^c]$$
$$\le \sum_{k=1}^{m}|A_k|\,E\{I[\hat Z_{hist}(\xi_k) \ne Z] \mid X \in A_k\} + \sum_{k=1}^{m}|A_k|\,P[e_k^c]$$
$$= P[\hat Z_{hist}(X) \ne Z] + \sum_{k=1}^{m}|A_k|\,P[e_k^c].$$

Note that the first term on the right-hand side is $L^* + O(n^{-1/(d+2)})$ due to Devroye et al. (1996), and $\sum_{k=1}^{m}|A_k|\,P[e_k^c] = O(n^{-1/(d+2)})$ by utilizing the result on the probability of a balanced cell $P[e_k^c]$ in Devroye et al. (1996, Lemma A.3). Hence, $L^* \le \lim_{t\to\infty} L_n^t \le L^* + O(n^{-1/(d+2)})$, and therefore $\lim_{t\to\infty} L_n^t = L^* + O(n^{-1/(d+2)})$.
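For concreteness, the limiting histogram rule $\mathrm{sgn}\{N(+|k) - N(-|k)\}$ can be sketched as follows (an illustrative snippet, not the paper's code; on a tie $N(+|k) = N(-|k)$, the event $e_k^c$, the prediction is arbitrary and is set to $+1$ here):

```python
def histogram_rule(x, z, m):
    """Histogram classifier on [0, 1) with m equal-width cells A_k:
    in cell k, predict sgn{N(+|k) - N(-|k)}, the majority label."""
    counts = [[0, 0] for _ in range(m)]              # [N(+|k), N(-|k)]
    for xi, zi in zip(x, z):
        k = min(int(xi * m), m - 1)
        counts[k][0 if zi == 1 else 1] += 1

    def predict(u):
        n_plus, n_minus = counts[min(int(u * m), m - 1)]
        return 1 if n_plus >= n_minus else -1        # tie -> +1 (arbitrary)
    return predict

rule = histogram_rule([0.1, 0.2, 0.3, 0.8], [-1, -1, 1, 1], m=2)
assert rule(0.25) == -1 and rule(0.9) == 1           # majority in each cell
```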
Note that the rst term on P the right-hand side is L¤ C O.n¡1=.dC2/ / due to ¡1=.dC2/ by utilizing the c Devroye et al. (1996), and m /, kD1 jAk jP[ek ] D O.n c result on the probability of a balanced cell P[ek ] in Devroye et al. (1996, Lemma A.3). Hence, L¤ · limt!1 Ltn · L¤ C O.n¡1=.dC2/ /, and therefore limt!1 Ltn D L¤ C O.n¡1=.dC2/ /. Proof of Theorem 6. For the time step t considered in the theorem, the AdaBoost prediction in each subsample is the NN prediction with probP ability one, according to theorem 3. Therefore ³OKt .x/ D K¡1 KkD1 zO .t/ D k .x/ P K¡1 KkD1 Z.XkNN .x// and the averaged prediction is ZO tK D sgn.³OKt /, where is the NN to a location x 2 [0; 1] in the kth subsample, and Z.XkNN .x// XNN k .x/ is the corresponding value of the label Z. (We can assume all these quantities to be uniquely dened since with probability one, the sample X-values have no ties and the NN to x is unique for almost all x locations.) Applying corollary 6.2 of Devroye et al. (1996), we have LtK ¡ L¤ · q R1 MSEt . Note that MSE t D .V C B2 /dx where V D Varf³O t .x/g and K
K
0
K
B D jE³OKt .x/¡³ .x/j. We will show that V · 1=K and B · .D=m/ logf2e.m=D/g where Km D n. Then MSEtK · 1=K C .D=m/2 log2 f2e.m=D/g. Letting K take the value in the theorem produces the upper bounds of MSEtK and LtK ¡ L¤ . Proof of V · 1=K. For each x 2 [0; 1], since the Z.XkNN .x//’s are sign valued P and i.i.d., we have V D VarfK¡1 KkD1 Z.XNN D K¡1 VarfZ.XNN 1 .x//g · k .x//g ¡1 . K Proof of B · .D=m/ logf2e.m=D/g. have B D jEK¡1 D
K X
For each x 2 [0; 1] and each ² > 0, we
Z.XkNN .x// ¡ ³ .x/j D jEZ.XNN 1 .x// ¡ ³ .x/j
kD1 NN jE³ .X1 .x//
¡ ³ .x/j · Ej³ .XNN 1 .x// ¡ ³ .x/j
808
W. Jiang NN D Ej³ .XNN 1 .x// ¡ ³ .x/jI[jX1 .x/ ¡ xj · ²] .i/
NN C Ej³ .XNN 1 .x// ¡ ³ .x/jI[jX1 .x/ ¡ xj > ²]
$$\stackrel{(i)}{\le} D\epsilon + 2P[|X_1^{NN}(x) - x| > \epsilon].$$

In step (i), we used the Lipschitz property of $\zeta$ and the fact that $\zeta$ takes values in $[-1,1]$. Then note that the event $[|X_1^{NN}(x) - x| > \epsilon]$ implies that all $m$ $X$-values in independent draws from $\mathrm{Unif}[0,1]$ are at least $\epsilon$ away from $x$, which has probability at most $(1-\epsilon)^m$, for any $x \in [0,1]$. Then we have $B \le D\epsilon + 2(1-\epsilon)^m \le D\epsilon + 2e^{-m\epsilon}$. Now taking $\epsilon = (1/m)\log(2m/D)$ leads to the proof.

Acknowledgments

I thank Martin Tanner and Leo Breiman for encouraging me to work on the topic of boosting, and Leo Breiman for very useful e-mail exchanges. This work is supported in part by NSF grant DMS-01-02636.

References

Avnimelech, R., & Intrator, N. (1999). Boosting regression estimators. Neural Computation, 11, 491–513.
Bartlett, P., Jordan, M., & McAuliffe, J. (2003). Convexity, classification, and risk bounds (Tech. Rep. No. 638). Berkeley: Department of Statistics, University of California.
Blanchard, G., Lugosi, G., & Vayatis, N. (2003). On the rate of convergence of regularized boosting methods. Unpublished manuscript. Available on-line: http://www.proba.jussieu.fr/pageperso/vayatis/download/statboost.pdf.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (1997a). Prediction games and arcing classifiers (Tech. Rep. No. 504). Berkeley: Department of Statistics, University of California.
Breiman, L. (1997b). Arcing the edge (Tech. Rep. No. 486). Berkeley: Department of Statistics, University of California.
Breiman, L. (1998). Arcing classifiers (with discussion). Annals of Statistics, 26, 801–849.
Breiman, L. (1999). Using adaptive bagging to debias regressions (Tech. Rep. No. 547). Berkeley: Department of Statistics, University of California.
Breiman, L. (2000). Some infinity theory for predictor ensembles (Tech. Rep. No. 579). Berkeley: Department of Statistics, University of California.
Bühlmann, P., & Yu, B. (2002). Analyzing bagging. Annals of Statistics, 30, 927–961.
Bühlmann, P., & Yu, B. (in press). Boosting with the L2-loss: Regression and classification. Journal of the American Statistical Association.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2002). Least angle regression (Tech. Rep.). Stanford: Department of Statistics, Stanford University.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121, 256–285.
Freund, Y., & Schapire, R. E. (1996). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (pp. 325–332). New York: ACM Press.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–407.
Grove, A. J., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98). Menlo Park, CA: AAAI Press.
Jiang, W. (1999). On weak base hypotheses and their implications for boosting regression and classification (Tech. Rep.). Evanston, IL: Department of Statistics, Northwestern University. Available on-line: http://neyman.stats.nwu.edu/jiang/boost/boost.largetime2.ps.
Jiang, W. (2000a). Some results on weakly accurate base learners for boosting regression and classification. In J. Kittler & F. Roli (Eds.), Lecture notes in computer science: Multiple classifier systems (pp. 87–96). New York: Springer. Available on-line: http://neyman.stats.nwu.edu/jiang/boost/boost.mcs.ps.
Jiang, W. (2000b). Does boosting overfit: Views from an exact solution (Tech. Rep.). Evanston, IL: Department of Statistics, Northwestern University. Available on-line: http://neyman.stats.nwu.edu/jiang/boost/boost.onedim.ps.
Jiang, W. (2000c). Is regularization unnecessary for boosting? In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, 2001 (pp. 57–64). San Mateo, CA: Morgan Kaufmann. Available on-line: http://neyman.stats.nwu.edu/jiang/boost/boost.or.ps.
Jiang, W. (in press). Process consistency for AdaBoost. Annals of Statistics. Available on-line: http://neyman.stats.nwu.edu/jiang/boost/boost.process.ps.
Koltchinskii, V. I., & Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 1–50.
Krieger, A., Long, C., & Wyner, A. (2001). Boosting noisy data. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 274–281). San Mateo, CA: Morgan Kaufmann.
Lin, Y. (2001). A note on margin-based loss functions in classification (Tech. Rep. No. 1044r). Madison: Department of Statistics, University of Wisconsin.
Lugosi, G., & Vayatis, N. (2002). A consistent strategy for boosting algorithms. In J. Kivinen & R. H. Sloan (Eds.), COLT 2002, Lecture Notes in Artificial Intelligence 2375 (pp. 303–319). Berlin: Springer-Verlag.
Mallat, S., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397–3415.
Mannor, S., & Meir, R. (2002). On the existence of linear weak learners and applications. Machine Learning, 48, 219–251.
Mannor, S., Meir, R., & Zhang, T. (2002). The consistency of greedy algorithms for classification. In J. Kivinen & R. H. Sloan (Eds.), COLT 2002, Lecture Notes in Artificial Intelligence 2375 (pp. 319–333). Berlin: Springer-Verlag.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent in function space (Tech. Rep.). Canberra: Department of Systems Engineering, Australian National University.
Ridgeway, G., Madigan, D., & Richardson, T. (1999). Boosting methodology for regression problems. In The Seventh International Workshop on Artificial Intelligence and Statistics (Uncertainty '99) (pp. 152–161). San Mateo, CA: Morgan Kaufmann.
Rosset, S., Zhu, J., & Hastie, T. (2002). Boosting as a regularized path to a maximum margin classifier (Tech. Rep.). Stanford, CA: Department of Statistics, Stanford University.
Schapire, R. E. (1999). Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference, EuroCOLT'99 (pp. 1–10). Berlin: Springer-Verlag.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651–1686.
Yang, Y. (1999). Minimax nonparametric classification—Part I: Rates of convergence. IEEE Transactions on Information Theory, 45, 2271–2284.
Zhang, T. (in press). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics.
Zhang, T., & Yu, B. (2003). Boosting with early stopping: Convergence and consistency (Tech. Rep. No. 635). Berkeley: Department of Statistics, University of California, Berkeley.

Received November 18, 2002; accepted July 10, 2003.
LETTER
Communicated by Peter Bartlett
Different Paradigms for Choosing Sequential Reweighting Algorithms

Gilles Blanchard
blanchar@first.fhg.de
Département de Mathématiques, Université Paris-Sud, 91405 Orsay Cedex, France; and Fraunhofer FIRST, Kekuléstr. 7, 12489 Berlin, Germany
Analyses of the success of ensemble methods in classification have pointed out the important role played by the margin distribution function on the training and test sets. While it is acknowledged that one should generally try to achieve high margins on the training set, the more precise shape of the empirical margin distribution function one should favor in practice is subject to different approaches. We first present two concurrent philosophies for choosing the empirical margin profile: the minimax margin paradigm and the mean and variance paradigm. The best-known representative of the first paradigm is the AdaBoost algorithm, and this philosophy has been shown by several other authors to be closely related to the principle of the support vector machine. We show that the second paradigm is very close in spirit to Fisher's linear discriminant (in a feature space). We construct two boosting-type algorithms, very similar in their form, dedicated to one or the other philosophy. We consequently derive by interpolation a very simple family of iterative reweighting algorithms that can be understood as different trade-offs between the two paradigms, and we argue from experiments that this allows for a suitable adaptivity to different classification problems, particularly in the presence of noise or excessive complexity of the base classifiers.

1 Introduction

1.1 Motivation and Previous Work. In classification problems, recent work has given important attention to various procedures that build several different classifiers of the same type (such as decision trees or radial basis function networks), usually obtained by perturbations of the training set, and then combine these to form an aggregated classifier, usually by some (perhaps weighted) voting process. The popularity of this type of method comes from their reported good performance in very different practical problems (see, e.g., Dietterich, 2000; Amit & Geman, 1997; Viola & Jones, 2001, and the review of Meir & Rätsch, 2003).
Usually aggregated classifiers outperform significantly, and sometimes in a spectacular way, single classifiers of the same type.

Neural Computation 16, 811–836 (2004)
© 2004 Massachusetts Institute of Technology
The theoretical approaches trying to explain the success of these methods have focused on an important quantity called the margin (representing, for a given example, the proportion of votes in favor of the true class versus the wrong one). In this regard, Schapire, Freund, Bartlett, and Lee (1998) should be considered a milestone, giving a qualitative (albeit not really practical) bound on the generalization error in terms of quantities involving the empirical cumulative distribution function of the margins and the complexity of the base space. However, while most authors seem to emphasize that the margins on the training set should be used in some way to guide the construction of the aggregated classifier, it is not necessarily clear precisely what criterion should be used. This article investigates this issue. Its goal is twofold:

1. Identify within a common framework two concurrent existing paradigms (or philosophies) concerning the choice of this criterion and their relation to classical statistical principles. We emphasize that these criteria favor different shapes of the empirical margin distribution. This allows for a unifying point of view.

2. Derive a boosting-type (i.e., based on iterative reweighting) algorithm able to interpolate between these two paradigms and therefore adapt to different situations where one or the other (or some intermediate choice) is more suitable.

Sections 2 and 3 present the two paradigms. The first paradigm (presented in section 2) is linked to the large-margin-classifiers philosophy, and it is usually argued that it is the underlying principle of the popular AdaBoost algorithm. The second paradigm (presented in section 3) may seem a little less familiar. It is based on analyses put forward by Breiman (2001) and Amit and Blanchard (2001), and we show its similarity to Fisher's linear discriminant. (When considered in a linear feature space, the large margin paradigm and the Fisher discriminant paradigm have been compared by Mika, 2002.)
For each of these two paradigms, we put forward an iterative reweighting ensemble algorithm dedicated to finding a good empirical margin distribution according to the associated paradigm. Strikingly, these two algorithms share a very similar form: the shape of the reweighting function (as a function of the current margin) is the same in the two algorithms (it is formed of a constant piece and a linear piece); only the choice of the two parameters differs (the abscissa of the change point and the height of the constant part). In section 4, we put forward an algorithm that we understand as a heuristic interpolation between these two paradigms, allowing us to sample a family of candidate empirical margin distributions among which one is finally picked by cross-validation. This algorithm is tested on benchmark data sets with two types of base classifiers, classification trees and radial basis function (RBF) networks, and compared to other popular ensemble
methods: Random Forests (Breiman, 2001) and AdaBoost-Reg, a regularized AdaBoost method (Rätsch, Onoda, & Müller, 2001). We emphasize that this algorithm is simple and cheap from a computational point of view (apart from the computational burden of the weak learner itself, which is variable since the weak learner can be chosen arbitrarily), and that by its very nature, it can be applied to any weak learner. These two features (simplicity and applicability to any weak learner) are at the root of the success of AdaBoost, and we argue that the algorithm proposed here retains them while exhibiting improved results in cases where AdaBoost would be overfitting. The experimental results also show clearly improved performance with respect to Random Forests when RBF networks are used as base classifiers. Finally, the algorithm is on par with AdaBoost-Reg while being noticeably simpler.

The approach in this article is mainly heuristic. Our belief is that while theoretical bounds on the generalization error are an extremely important tool for giving general guidelines and a qualitative appraisal of the behavior of algorithms, they are, as far as statistical learning is concerned, seldom accurate enough to give a quantitatively and directly usable tool for model selection. A study of theoretical bounds related to the problems studied here was presented elsewhere (see Blanchard, 2001, 2003, improving the results of Schapire et al., 1998, and Breiman, 1998b), but we believe that in practical applications it is also always necessary to adopt heuristic guidelines (what we call paradigms), the latter of course based on sound arguments. In this regard, we think in particular that the theoretical bounds of Schapire et al. (1998) are in a sense agnostic as to what shape of empirical margin distribution should be preferred in practice, although the authors seem to favor what we call here the minimax paradigm.
Our suggestion here is finally to try to maintain an agnostic attitude and find a means (preferably as easy as possible from an algorithmic point of view) to sample several different good candidate margin profiles and pick among them by cross-validation, as argued in section 4.

1.2 General Framework and Notations. We will consider a two-class classification problem where $X \in \mathcal X$ is the observed variable (in all the numerical simulations presented here, $\mathcal X$ will be a real vector space of finite dimension) and $Y \in \{-1,1\}$ is the class of the observation. The set of classifiers $\mathcal F$ is the set of (measurable) functions $f$ from $\mathcal X$ to $\{-1,1\}$. We assume that couples $(X,Y)$ are drawn according to some underlying probability distribution $P$, and for any $f \in \mathcal F$, the generalization error of $f$ is defined by
$$\mathcal E(f) = E_P[\mathbf 1\{f(X) \ne Y\}].$$

The training set is a finite set of observations $(X_i, Y_i)_{i=1,\dots,N}$ used to build classifiers. We suppose that we have at hand an automated learning algorithm $\mathcal W$, which, given the training set and a set of nonnegative weights
$(\omega_i)_{i=1,\dots,N}$, outputs a classifier $f$. This algorithm is generally called the weak learner and is considered a black box on which we have no influence. It is usually assumed that the outputs of the weak learner belong to some subset $\mathcal H \subset \mathcal F$ of the set of all possible classifiers and that $\mathcal H$ is of small "size" in terms of complexity (e.g., it has finite VC dimension). The set $\mathcal H$ is called the set of base classifiers. The weak learner $\mathcal W$ can be any classical learning algorithm,¹ like a neural net, a decision tree, or a linear discriminant. Ideally, the algorithm $\mathcal W$ can be thought of as a procedure minimizing the weighted training error over the base classifier space $\mathcal H$. This is often a reasonable approximation, although almost never exactly true in practice.

1.3 Voting Methods and Sequential Reweighting Schemes. Aggregation, voting, and ensemble methods are different names for a general family of procedures whose goal is to improve on the weak learner's generalization error by combining several of the base classifiers. Informally, such a combination is obtained by building (according to some protocol to be made precise) a finite set of base classifiers $f_1,\dots,f_T$, along with an associated family of positive coefficients $(\alpha_i)_{i=1,\dots,T}$. The aggregated function $F_\alpha : \mathcal X \to \mathbb R$ is then defined by

$$F_\alpha(x) = \sum_{i=1}^{T}\alpha_i f_i(x),$$

and the associated classification function $\bar F_\alpha$ is obtained by taking the sign of $F_\alpha$ (if $F_\alpha(x) = 0$, the class is determined arbitrarily), which corresponds to a (weighted) majority vote among the considered classifiers. We also define the normalized aggregated function $G_\alpha : \mathcal X \to [-1,1]$ by

$$G_\alpha(x) = \left(\sum_{i=1}^{T}\alpha_i\right)^{-1} F_\alpha(x).$$
To simplify notations, we will conceive of $\alpha$ as a positive measure with finite support on the set $\mathcal F$, thus implying that $\alpha$ alone entirely determines the aggregated functions above. We will refer to $\alpha$ as a combination (of classifiers) and denote $|\alpha| = \sum_{f\in\mathcal F}\alpha_f$. Among ensemble methods, a wide subfamily is given by sequential reweighting schemes (SRS; also called ARCing methods by Breiman, 1998a). Sequential reweighting schemes build aggregated classifiers according to the following general principle: call the weak learner sequentially, and iteratively update the weights $(\omega_i)$ at each iteration, depending on past errors.

¹ In the case that the original algorithm is not well suited to dealing with weighted data, one generally takes the weights into account by first randomly resampling from the training set using the prescribed weights. We will consider this as part of the weak learning procedure.
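The general SRS principle just stated can be sketched as a small driver loop (a minimal illustration under our own naming, not an algorithm from the letter); the weak learner and the two update rules, for the coefficient $\alpha_t$ and for the weights, are supplied as black boxes:

```python
def srs(X, Y, weak_learner, coefficient, reweight, T):
    """Generic sequential reweighting scheme: at each step, call the weak
    learner on the current weights, record (alpha_t, f_t), then let the
    supplied rule compute the next weights from the past errors."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        f = weak_learner(X, Y, w)
        eps = sum(wi for xi, yi, wi in zip(X, Y, w) if f(xi) != yi)
        ensemble.append((coefficient(eps), f))
        w = reweight(w, f, eps)
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def F(ensemble, x):
    """Aggregated function F_alpha(x) = sum_t alpha_t f_t(x)."""
    return sum(a * f(x) for a, f in ensemble)
```

For instance, AdaBoost is recovered (up to normalization conventions) by taking `coefficient(eps) = 0.5 * log((1 - eps) / eps)` and multiplying each weight by `exp(alpha_t)` or `exp(-alpha_t)` according to whether the example is misclassified.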
The coefficient $\alpha_t$ of classifier $f_t$ at iteration $t$ is also computed on the spot, depending on past errors.

1.4 Margins. The literature devoted to ensemble methods has given particular attention to the study of margins (see, e.g., Schapire et al., 1998; Koltchinskii & Panchenko, 2002; Breiman, 1998b; Smola, Bartlett, Schölkopf, & Schuurmans, 2000; Blanchard, 2003; Meir & Rätsch, 2003), the main object of interest in this letter. The (normalized) margin function $M_\alpha : \mathcal X \times \{-1,1\} \to [-1,1]$ associated with a combination $\alpha$ is defined as

$$M_\alpha(x,y) = yG_\alpha(x) = y|\alpha|^{-1}\sum_{f\in\mathcal F}\alpha_f f(x).$$

Basically, for a given example $(x,y)$, the margin function tells us with what confidence the "vote" obtained with combination $\alpha$ is in favor of the correct class. The example is misclassified iff the margin is nonpositive; for a positive but close-to-zero margin, the example is correctly classified, but it is a close call. It is intuitively clear that when building a combination of base classifiers belonging to $\mathcal H$, one would be happy to have margins as high as possible on the training examples: this should bring more stability to the decision function. Theoretical bounds based on margins give qualitative support to this intuition. However, if $\mathcal H$ is of limited complexity (which surely is necessary to prevent overfitting), it is not possible to increase the margins on all examples simultaneously, as we expect that no base classifier achieves perfect classification. Thus, increasing the margin for some training examples will also mean decreasing the margin of some others. The main problem is what kind of trade-off one is willing to make in this situation. Our focus here concerns the choice of classifier combination based on different philosophies about the shape the margin distribution of the training set should have.

2 The Minimax Margin Philosophy

2.1 The AdaBoost Algorithm and the Support Vector Machine. The best-known SRS, which, thanks to its excellent practical performance, has renewed the interest in voting methods, is probably the AdaBoost algorithm (Freund & Schapire, 1996a). We will not describe this algorithm in detail here, but will review a few important facts (coming mainly from Schapire et al., 1998) needed to understand its behavior. Let $\alpha^t$ denote the combination obtained up to iteration $t$. It can be shown that the AdaBoost algorithm is such that the weight $\omega_{i,t+1}$ of example $i$ at step $t+1$ is proportional to $\exp(-|\alpha^t|M_{\alpha^t}(X_i,Y_i))$. We write it this way to emphasize that the unnormalized margin is used inside the exponential. It means that iteration $t+1$ will mainly concentrate on those examples that have the lowest margins and
maybe almost "forget" the examples that have a relatively higher margin, all the more as $t$ becomes larger, since $|\alpha^t|$ grows with $t$ (see also the related analysis of Onoda, Rätsch, & Müller, 1998). This remark, along with other arguments, has led Schapire et al. (1998) to claim that AdaBoost iteratively tries to find a combination $\alpha$ for which the minimum margin on the training set is maximal, that is, approximately solves the optimization problem

$$\max_\alpha \min_i M_\alpha(X_i, Y_i). \tag{2.1}$$
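The exponential reweighting and the minimax objective 2.1 can be sketched in a few lines (illustrative code with our own naming, not from the letter), with a combination represented as a list of $(\alpha_f, f)$ pairs:

```python
import math

def margins(combination, X, Y):
    """Normalized margins M_alpha(x, y) = y * F_alpha(x) / |alpha|."""
    total = sum(a for a, _ in combination)           # |alpha|
    return [y * sum(a * f(x) for a, f in combination) / total
            for x, y in zip(X, Y)]

def adaboost_weights(combination, X, Y):
    """AdaBoost weights, proportional to exp(-|alpha| * M_alpha(X_i, Y_i)):
    the lower an example's margin, the larger its weight."""
    total = sum(a for a, _ in combination)
    u = [math.exp(-total * m) for m in margins(combination, X, Y)]
    s = sum(u)
    return [ui / s for ui in u]

def minimum_margin(combination, X, Y):
    """The inner quantity of problem 2.1: min_i M_alpha(X_i, Y_i)."""
    return min(margins(combination, X, Y))
```

On a toy combination of two stumps, the lowest-margin example does receive the largest AdaBoost weight, which is the concentration effect described above.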
We call this way of choosing the combination $\alpha$ the minimax margin philosophy. Schapire et al. (1998; see also Rätsch, Mika, Schölkopf, & Müller, 2002) note also that there is a close connection between this philosophy and the principle underlying the (hard margin) support vector machine (SVM). For a hard margin SVM, a linear classifier is chosen so as to maximize the minimum geometrical margin over the training examples (see Figure 1). Thus, the general philosophy for choosing the classifier is comparable in these two methods.² This analogy has also led other authors to adapt the "soft margin" SVM principle to the boosting case to avoid overfitting (Rätsch et al., 2001).
Figure 1: A pictorial idealization of the minimax margin paradigm for a support vector machine.
² One of the main differences between the two is that the SVM margin is defined with a Euclidean metric, whereas the boosting margin is measured in terms of an $L^1$ metric.
2.2 Theory of Games and Blackwell's Strategy. The minimax margin principle also has a natural interpretation in the framework of the theory of games. Consider a "game" where the first player chooses an example $(X_i,Y_i)$ of the training set, the second player chooses a classifier $f$ from the set $\mathcal H$, and the second player wins (and the first player loses) 1 unit if the example is correctly classified and loses 1 unit otherwise; that is, player 2 wins $f(X_i)Y_i$. Then the equilibrium mixture strategy $\alpha^*$ of player 2 exactly corresponds to the solution of equation 2.1, and the value of the game is the minimax margin. This point of view has been one of the initial motivations for boosting methods (Freund & Schapire, 1996b; Breiman, 1998b). It has several interesting consequences, like the following fact, which is an easy consequence of the minimax theorem: if, for any set of weights $(\omega_i)$ on the training sample, there exists a classifier $f \in \mathcal H$ with weighted error strictly less than 0.5, then there exists an aggregated classifier with zero training error. If $\gamma$ denotes the value of the game, then it can be shown (Schapire et al., 1998) that the AdaBoost algorithm will asymptotically produce a combination $\alpha$ whose minimum margin on the training set is at least $\gamma/2$ (as a consequence, if $\gamma > 0$, AdaBoost will asymptotically produce a classifier with zero training error). However, it is, to our knowledge, unknown whether the AdaBoost algorithm asymptotically attains the minimax margin $\gamma$. If we want to follow the minimax margin philosophy to the end, we would like to have an algorithm for which we are sure that the minimax margin will be asymptotically attained. Various iterative methods have been proposed by different authors to this end (Breiman, 1998b; Rätsch & Warmuth, 2001). However, there also exists an algorithm known in game theory as Blackwell's strategy, which is guaranteed to attain the minimax margin under certain hypotheses (see Blackwell, 1956, and some recent related results in Hart & Mas-Colell, 2001). This algorithm is also an SRS, which we describe in Figure 2. The algorithm is extremely simple to describe: a "regret" vector $R_t$ is kept along the way, which tracks, for each example, the difference between the number of times it has been misclassified and the sum of the weighted errors. For the next iteration, the weights are taken proportional to the positive part of $R_t$. After reaching a fixed number of iterations, we take the uniform average of all the base classifiers thus obtained. For this procedure, the following theorem (essentially from Hart & Mas-Colell, 2001) holds in the case where $\mathcal H$ is finite:

Theorem 1. Let $\gamma$ denote the minimax margin and $\epsilon_t$ denote the weighted error at step $t$. Assume that $\mathcal H$ is finite and that the weak learning algorithm $\mathcal W$ outputs the classifier $f \in \mathcal H$ having the least weighted error when it is called. Then the following assertions hold true:

(i) For all $i = 1,\dots,N$:

$$\limsup_{t\to\infty}\frac{1}{t}R_{i,t} \le 0.$$
818
G. Blanchard
1. For i D 1; : : : ; N , initialize weights !i;1 D 1=N and the “regret vector” Ri;1 D 0.
2. Iteration t: Call the weak learner W with the weights .!i;t /, resulting in classi er ft . Let "t denote the weighted error of ft . 3. For i D 1; : : : ; N , update regret vector the following way: Ri;tC1 D Ri;t C 1f ft .Xi / 6D Yi g ¡ "t .
4. Update weights the following way: for i D 1; : : : ; N, !i;tC1 / .Ri;tC1 /C , where .¢/C is the positive part; then renormalize so that the weights sum to 1. 5. If t < Tmax , proceed to next iteration and point 2. If iteration Tmax is reached, take for the aggregated classi er the simple uniform average (vote) of the ft ’s, t D 1; : : : ; T . Figure 2: Blackwell’s strategy applied in our setting.
.ii/
8i D 1; : : : ; N
lim inf t!1
t 1X fk .Xi /Yi ¸ ° : t kD1
Proof. Point i: see Hart and Mas-Colell (2001), sec. 3. Note that this point remains true if the weak learner W satisfies the weaker hypothesis that it always outputs a classifier having weighted error strictly less than 0.5. Point ii is then straightforward (see the proof in the appendix).

Note that (1/t) Σ_{k=1}^{t} f_k(X_i) Y_i is exactly the margin of the aggregated classifier at step t on example i. Hence, under the conditions of the theorem, the algorithm will converge to a solution of the minimax problem, equation 2.1. Blackwell's strategy may seem of little interest, since there exist other and probably faster methods to reach the minimax margin. However, besides the advantage of its simplicity, it has the property (which will be of interest to us) that the weights at step t are proportional to a two-parameter piecewise linear function Φ_{a,b} of the margin of the current combination α_t; more precisely,

Φ_{a,b}(x) = (x − a)_− + b,   (2.2)

where (·)_− denotes the negative part ((x)_− = −x if x < 0 and (x)_− = 0 otherwise). The choice of parameters at step t leading to Blackwell's strategy is a_t = 1 − (2/(t−1)) Σ_{k=1}^{t−1} ε_k and b_t = 0 (where ε_k is the weighted error of classifier f_k). This is easily seen from the description of the algorithm in Figure 2, using the fact that 1{f(x) ≠ y} = (1 − f(x)y)/2 for a binary classifier function f (taking values in {−1, 1}).
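Blackwell's strategy of Figure 2 is short enough to implement directly. The following Python sketch uses a toy one-dimensional training set and a brute-force weak learner that returns the decision stump with least weighted error; the data set and the stump class are our own illustrative assumptions, not the paper's experimental setup. No single stump classifies this sample perfectly, but the minimax margin over this stump class is positive (it equals 1/3 here), so by theorem 1 the uniform vote should asymptotically attain it.

```python
# Blackwell's strategy (Figure 2) with a brute-force decision-stump weak learner.
# Toy data set and hypothesis class are illustrative assumptions.

def make_stumps(xs):
    """All threshold stumps sign(x - theta) and sign(theta - x), incl. constants."""
    xs_sorted = sorted(xs)
    thetas = ([xs_sorted[0] - 1]
              + [(a + b) / 2 for a, b in zip(xs_sorted, xs_sorted[1:])]
              + [xs_sorted[-1] + 1])
    H = []
    for th in thetas:
        H.append(lambda x, th=th: 1 if x > th else -1)
        H.append(lambda x, th=th: -1 if x > th else 1)
    return H

def weighted_error(h, X, Y, w):
    return sum(wi for xi, yi, wi in zip(X, Y, w) if h(xi) != yi)

def blackwell(X, Y, H, T):
    N = len(X)
    w = [1.0 / N] * N
    R = [0.0] * N                         # the "regret" vector of Figure 2
    ensemble = []
    for _ in range(T):
        f = min(H, key=lambda h: weighted_error(h, X, Y, w))  # weak learner W
        eps = weighted_error(f, X, Y, w)
        ensemble.append(f)
        for i in range(N):
            R[i] += (1.0 if f(X[i]) != Y[i] else 0.0) - eps
        pos = [max(r, 0.0) for r in R]    # weights prop. to positive part of R
        s = sum(pos)
        w = [p / s for p in pos] if s > 0 else [1.0 / N] * N
    return ensemble

X = [0.0, 1.0, 2.0, 3.0]
Y = [1, 1, -1, 1]                          # no single stump is perfect here
ensemble = blackwell(X, Y, make_stumps(X), T=500)
margins = [y * sum(h(x) for h in ensemble) / len(ensemble) for x, y in zip(X, Y)]
vote = lambda x: 1 if sum(h(x) for h in ensemble) >= 0 else -1
print(min(margins))                        # climbs toward the minimax margin (1/3)
```

After 500 iterations, the minimum training margin has climbed toward 1/3 and the uniform vote makes no training errors, in line with point (ii) of the theorem.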
Different Paradigms for Choosing Sequential Reweighting Algorithms
3 The Mean-and-Variance Philosophy

3.1 Chebyshev's Inequality and Fisher's Linear Discriminant. Another way of understanding the performance of voting methods, proposed by Amit and Geman (1997), Amit and Blanchard (2001), and Breiman (2001), relies on looking at the first two moments of the margin function and a simple application of Chebyshev's inequality. The main idea of this analysis can be summed up in the following inequality:

Theorem 2. Assume that the combination α is such that E[M_α(X, Y)] > 0. Then the following inequality holds:

E(F_α) ≤ Var[M_α(X, Y)] / E[M_α(X, Y)]².   (3.1)

Proof. The proof is immediate:
E(F_α) ≤ P[M_α(X, Y) ≤ 0]
       = P[M_α(X, Y) − E[M_α(X, Y)] ≤ −E[M_α(X, Y)]]
       ≤ P[(M_α(X, Y) − E[M_α(X, Y)])² ≥ E[M_α(X, Y)]²]
       ≤ Var[M_α(X, Y)] / E[M_α(X, Y)]²,

where the last line results from Markov's inequality.

Note that there exist various refinements and variants of this inequality. Amit and Geman (1997) and Amit and Blanchard (2001) derive a slightly different analysis for multiclass problems, and it is shown there that the variance factor can mainly be interpreted as the average covariance, conditional on class, of two base classifiers in the aggregate. In Devroye, Györfi, and Lugosi (1996), a somewhat tighter bound based on the same quantities can be found.

What is the significance of this inequality? On the one hand, it is certainly true that Chebyshev's inequality is very coarse and that the above bound cannot be very tight. On the other hand, one can argue that the average and variance of the margin function are two elementary statistical quantities that should be estimable from their empirical counterparts on the training set without excessive error (for a theoretical study supporting this point, see Blanchard, 2001). It is therefore more relevant to understand this inequality as an indication of what we should try to do in practice: find some compromise between the mean and variance of the margin function (we want high mean but low variance), based on their empirical estimates. Furthermore, another argument supporting the interest of this approach comes from the following useful comparison with Fisher's discriminant. It is indeed interesting to note that while the minimax margin philosophy
was comparable to an SVM classifier in a linear classification framework, the mean-and-variance philosophy is comparable to the principle underlying Fisher's linear discriminant classifier. The following simple theorem makes the link apparent by showing that when we apply the same line of reasoning (based on Chebyshev's inequality) to a Euclidean classification setup, we naturally find Fisher's discriminant function.

Theorem 3. Consider a classification problem in a Euclidean space F; let m_1 and m_{−1} be the average values of classes 1 and −1. For a ∈ F, let f_a be the linear classifier defined by f_a(x) = sign(⟨a, x − (m_1 + m_{−1})/2⟩). Define the margin function M_a(x, y) = y⟨a, x − (m_1 + m_{−1})/2⟩. Then the following inequality holds for any a such that ⟨a, m_1⟩ > ⟨a, m_{−1}⟩:

E(f_a) ≤ Var[M_a] / E[M_a]² = 4 (p_1 s_1² + p_{−1} s_{−1}²) / ⟨a, m_1 − m_{−1}⟩²,   (3.2)

where p_i = P(Y = i) and s_i² = Var[⟨a, X⟩ | Y = i], i = −1, 1.

Proof. The first inequality results from Chebyshev's inequality, as in theorem 2. The second equality is just a short calculation: for i = −1, 1, we note that E[M_a | Y = i] = ⟨a, (m_1 − m_{−1})/2⟩, and thus E[M_a] = p_1 E[M_a | Y = 1] + p_{−1} E[M_a | Y = −1] = ⟨a, (m_1 − m_{−1})/2⟩, and Var[M_a] = E[Var[M_a | Y]] + Var[E[M_a | Y]] = p_1 s_1² + p_{−1} s_{−1}² + 0.

Thus, a very similar mean-variance analysis performed in a linear classification setup leads us to inequality 3.2, where the right-hand side is exactly the inverse of Fisher's discriminant function, used precisely to choose the optimal direction a of the linear classifier (see Figure 3). Note that Fisher's linear discriminant is traditionally justified by considering the idealized situation where the classes have gaussian distributions with identical covariance matrices, but this is rarely the case in practice; hence, in more generic situations it can be seen as essentially based on the mean-variance approach. The point of this philosophy is to be suitable in situations where the two classes form two overlapping clusters, that is, when there exist noisy regions, whereas the minimax philosophy is more suited for situations where
Figure 3: A pictorial idealization of the mean-and-variance paradigm for Fisher's linear discriminant. Compare with Figure 1 (the data points are the same).
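To make the link concrete, here is a small numeric sketch (our own toy data, not from the paper): we compute the Fisher direction a = S_w⁻¹(m_1 − m_{−1}), where S_w is the pooled within-class covariance, which maximizes the inverse of the right-hand side of inequality 3.2, and we check the resulting variance/mean² bound on the margins numerically.

```python
# Fisher's linear discriminant in 2-D: a = S_w^{-1}(m1 - m_{-1}) minimizes the
# variance/mean^2 ratio of the margin (right-hand side of inequality 3.2).
# Toy data; everything here is an illustrative assumption.

def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def cov2(pts, m):
    n = len(pts)
    sxx = sum((p[0] - m[0]) ** 2 for p in pts) / n
    syy = sum((p[1] - m[1]) ** 2 for p in pts) / n
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts) / n
    return ((sxx, sxy), (sxy, syy))

pos = [(2.0, 0.0), (4.0, 0.0), (3.0, 2.0), (3.0, -2.0)]      # class +1
neg = [(-2.0, 0.0), (-4.0, 0.0), (-3.0, 2.0), (-3.0, -2.0)]  # class -1
m1, m0 = mean2(pos), mean2(neg)
C1, C0 = cov2(pos, m1), cov2(neg, m0)
Sw = [[(C1[i][j] + C0[i][j]) / 2 for j in range(2)] for i in range(2)]
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
d = (m1[0] - m0[0], m1[1] - m0[1])
a = ((Sw[1][1] * d[0] - Sw[0][1] * d[1]) / det,              # a = Sw^{-1} d
     (-Sw[1][0] * d[0] + Sw[0][0] * d[1]) / det)
mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)

def margin(p, y):                                            # M_a(x, y) of theorem 3
    return y * (a[0] * (p[0] - mid[0]) + a[1] * (p[1] - mid[1]))

M = [margin(p, 1) for p in pos] + [margin(p, -1) for p in neg]
mu = sum(M) / len(M)
var = sum((m - mu) ** 2 for m in M) / len(M)
err = sum(1 for m in M if m <= 0) / len(M)
print(a, err, var / mu ** 2)   # empirical error is bounded by var / mu^2
```

On this symmetric toy set the within-class scatter is axis-aligned, so the Fisher direction is simply the axis joining the class means, and the empirical error (zero here) indeed sits below the variance/mean² bound.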
the two classes are well separated, although maybe by an irregular boundary. A related analysis appears in Mika (2002), comparing the principles of SVMs and the Fisher discriminant in a linear feature space.

3.2 The Sequential Lower-Variance Minimization. In this section, we discuss how to implement in practice the method suggested by the mean-and-variance philosophy. First, if we believe that Chebyshev's inequality and the resulting inequality 3.1 are a good indication of what should be done in practice, then an immediate remark is that we can replace in this inequality the variance of the margin by what we call its lower variance:

Var_−[X] = E[((X − E[X])_−)²],

where (·)_− denotes the negative part (i.e., (x)_− = −x when x < 0 and (x)_− = 0 otherwise). This makes sense since we want to take into account only the lower deviations of the margin in order to bound the probability that it becomes negative. The second point is that we would like to find an SRS (see section 1.3) corresponding to the mean-and-variance philosophy. It has been noted by several authors (Breiman, 1998b; Frean & Downs, 1998; Friedman, Hastie, & Tibshirani, 2000, among others) that the AdaBoost algorithm can be interpreted as a gradient descent with an exponential cost function (this has
led to proposing other SRSs based on different cost functions). We propose to follow this principle as applied to a cost function that reflects our philosophy. A first naive choice would be the right-hand side of equation 3.1. Two arguments lead to choosing another solution. First, we expect inequality 3.1 to be too coarse to capture in all generality the optimal trade-off between mean and variance. We therefore suggest introducing a one-parameter family of cost functions corresponding to different possible trade-offs (the parameter will be selected by cross-validation). Second, our first experiments showed that a cost function defined as a ratio between mean and variance yielded quite unstable results, and we prefer an additive cost function using these two quantities. Finally, we choose the following cost function, depending on the parameter β > 0:

C_β(G_α) = |α|⁻¹ (√(Var_−[M_α]) − β E[M_α])   (3.3)

(the square root is for homogeneity). The normalization by |α|⁻¹ is necessary if we want a scale-invariant cost (the scale does not change the final classification function). Note that this is very similar to a cost function that we suggested earlier (Amit & Blanchard, 2001).

What SRS will we derive from this cost function if we apply a gradient descent to it? First, we note that this cost function is convex on the simplex {|α| = 1} (see the appendix for a proof), so that gradient descent is a legitimate method to minimize it. We mainly need to compute the gradient of the cost along the direction corresponding to adding a new classifier to the mixture. Denote by G_0 = G_α the normalized aggregated function corresponding to our current mixture α; consider adding classifier f to the mixture with a small weight, resulting in the normalized aggregated function G_t = (1 − t)G_0 + tf for some small t. We thus want to compute ∂C_β(G_t)/∂t at t = 0 and minimize this quantity as a function of f. We then have the following property (see the proof in the appendix):

Theorem 4. Denote E_f(x, y) = 1{f(x) ≠ y} and w(x, y) = (M_α − E[M_α])_−. Then the classifier f minimizes ∂C_β(G_t)/∂t at t = 0 iff it minimizes the quantity

J(α, f) = E[w(X, Y) E_f(X, Y)] + (β √(E[w(X, Y)²]) − E[w(X, Y)]) E[E_f(X, Y)].   (3.4)
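Concretely, the lower variance, the cost of equation 3.3, and the example weights w(x, y) + C implied by equation 3.4 can all be computed from a vector of empirical margins. A minimal sketch (the function names and the sample margins are ours):

```python
# Lower variance, the cost of eq. 3.3 (with |alpha| = 1), and the SLVM example
# weights of eq. 3.4, computed from empirical margins.  Illustrative sketch.
import math

def lower_variance(m):
    mu = sum(m) / len(m)
    return sum(min(x - mu, 0.0) ** 2 for x in m) / len(m)   # E[((X - E X)_-)^2]

def cost(margins, beta):                                    # eq. 3.3, |alpha| = 1
    mu = sum(margins) / len(margins)
    return math.sqrt(lower_variance(margins)) - beta * mu

def slvm_weights(margins, beta):
    mu = sum(margins) / len(margins)
    w = [max(mu - m, 0.0) for m in margins]                 # w = (M - E[M])_-
    mean_w = sum(x for x in w) / len(w)
    mean_w2 = sum(x * x for x in w) / len(w)
    C = beta * math.sqrt(mean_w2) - mean_w                  # constant term of eq. 3.4
    raw = [wi + C for wi in w]                              # weight of example i
    z = sum(raw)
    return [r / z for r in raw]

margins = [0.9, 0.6, 0.1, -0.4]
ws = slvm_weights(margins, beta=2.0)
print(ws)   # low-margin examples receive the largest weights
```

With β ≥ 1 the weights are nonnegative (Jensen's inequality, as noted below), and shifting all margins upward leaves the lower variance unchanged but decreases the cost, as expected.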
Keep in mind that the goal of the algorithm is to minimize the empirical cost function, that is, equation 3.3 with the variance and expectation operators taken under the empirical distribution. Similarly, in the above definition of J(α, f), the expectation is to be understood under the empirical distribution. The quantity J(α, f) can be interpreted as a weighted error: it can be put in the form J(α, f) = E[(w(X, Y) + C) E_f(X, Y)], with a constant C depending on α but not on (X, Y). Therefore, minimizing J(α, f) with
respect to f amounts to minimizing the weighted error of f when example (X, Y) is given a weight proportional to w(x, y) + C. It can actually happen, if β is small, that some weights are negative because of the negative part in the constant term C. This is not a problem in practice, since we can then flip the label of the corresponding example and take the absolute value of the weight for the next training set. However, if one chooses β ≥ 1, then the weights are always nonnegative by Jensen's inequality. As a consequence, the sequential lower-variance minimization algorithm goes as follows: at each step, call the weak learner with the weights corresponding to equation 3.4, and add the output f to the current combination with coefficient 1. Note that in the AdaBoost algorithm, the coefficient given to the classifier is itself the result of an optimization in the direction of f; but since this is not necessary to perform a simple gradient descent, we prefer the simpler uniform average here.

4 An Algorithmic Interpolation of the Two Philosophies

4.1 The Interpolated Algorithm. Now that we have explored the two paradigms for the margin distribution, we notice that the two algorithms built in the previous sections (Blackwell's strategy applied to classification, and sequential lower-variance minimization, SLVM) have the interesting common point that the weights given to the training examples at each step t are in both cases of the form ω_{t,i} ∝ Φ_{a_t,b_t}(M_{α_{t−1}}(X_i, Y_i)), where α_{t−1} is the current combination at the beginning of step t and Φ_{a,b}(x) = (x − a)_− + b (see Figure 4). Moreover, both methods use only a simple vote among the base classifiers built (which corresponds to a coefficient of 1 given to each base classifier in the combination).

For Blackwell's strategy, b_t = 0 and a_t = 1 − (2/(t−1)) Σ_{k=1}^{t−1} ε_k, where ε_k is the weighted error of base classifier f_k at step k; in other words, a_t is the average of the mean weighted margins of the base classifiers forming the combination up to the present step. For the SLVM, a_t = E[M_{α_{t−1}}] is the (empirical) average margin of the current combination, in other words, the average of the mean (unweighted) margins of the base classifiers built up to the present step. If e_k denotes the (unweighted) empirical error of base classifier f_k, this is also equivalent to a_t = 1 − (2/(t−1)) Σ_{k=1}^{t−1} e_k; notice that, compared with Blackwell's algorithm, we have just replaced the weighted error by the unweighted error in a_t. Additionally, b_t = β √(E[w(X, Y)²]) − E[w(X, Y)] (where w is defined as in theorem 4) is a quantity depending on the parameter β.

As noted earlier, the minimax philosophy is more suited to low-noise problems and the mean-variance philosophy to cases where there are noisy regions. This is in accordance with experimental results reporting that AdaBoost's performance can become extremely poor in the presence of noise on the labeling of training examples (Dietterich, 2000) and more generally
Figure 4: The weighting function Φ_{a,b} (decreasing with slope −1 for x < a, constant equal to b for x ≥ a).
that AdaBoost can actually exhibit overfitting (Grove & Schuurmans, 1998; Rätsch et al., 2001). Therefore, we would like to have an algorithm that is able to behave according to the one or the other philosophy, depending on the situation. More precisely, we want to derive a parameterized family of SRSs able to interpolate between the two. In view of the similar form taken by the reweighting in the two above algorithms, we propose the following simple solution: reweight the examples using a function Φ_{a_t,b_t} of the margins, where (a_t, b_t) are given by a weighted mean with coefficients (1 − δ, δ) of the corresponding values for our two initial algorithms. Recalling the notation w(x, y) = (M_{α_{t−1}}(x, y) − E[M_{α_{t−1}}])_−, this gives rise to the following choice for some δ ∈ [0, 1]:
a_t = 1 − (2/(t−1)) Σ_{k=1}^{t−1} ((1 − δ) ε_k + δ e_k),
b_t = δ (β √(E[w(X, Y)²]) − E[w(X, Y)]).   (4.1)
The resulting algorithm is summed up in Figure 5 (where it is slightly extended to also accept values of δ greater than 1).

4.2 Experiments. In our tests of the algorithm, we fixed β = 2 in the interpolated algorithm, in order to reduce to a single tuning constant δ > 0 and also to ensure that the weights are always nonnegative. Note that the parameter β could also be chosen by cross-validation in a reasonable range; in our experience, however, this resulted in little difference.
Figure 5: The interpolated algorithm, depending on constants β > 0 and δ > 0.

1. For i = 1, …, N, initialize the weights ω_{i,1} = 1/N and the vectors R_{i,1} = 0, S_{i,1} = 0. Put δ* = min(δ, 1).
2. Iteration t: call the weak learner W with the weights (ω_{i,t}), resulting in classifier f_t. Let e_t denote the empirical unweighted error of f_t and ε_t denote its weighted error.
3. For i = 1, …, N, update the vectors R, S as follows:
   R_{i,t+1} = R_{i,t} + 1{f_t(X_i) ≠ Y_i} − ε_t;
   S_{i,t+1} = S_{i,t} + 1{f_t(X_i) ≠ Y_i} − e_t.
4. Put V_{t+1} = β √((1/N) Σ_{i=1}^{N} ((S_{i,t+1})_+)²) − (1/N) Σ_{i=1}^{N} (S_{i,t+1})_+. Update the weights as follows: for i = 1, …, N,
   ω_{i,t+1} = ((1 − δ*) R_{i,t+1} + δ* S_{i,t+1})_+ + δ V_{t+1};
   then renormalize so that the weights sum to 1.
5. If t < T_max, proceed to the next iteration at point 2. If iteration T_max is reached, take for the aggregated classifier the simple uniform average (vote) of the f_t, t = 1, …, T_max.
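Figure 5 translates almost line for line into code. The sketch below is our own minimal transcription, with δ* = min(δ, 1), a brute-force decision-stump weak learner, and a toy data set (all of these are illustrative assumptions, not the paper's benchmark setup); δ = 0 recovers Blackwell's strategy and δ = 1 the SLVM weighting.

```python
# Minimal transcription of the interpolated algorithm (Figure 5) with a
# brute-force stump weak learner on a toy 1-D data set (our own assumptions).
import math

def stumps(xs):
    xs_sorted = sorted(xs)
    ts = ([xs_sorted[0] - 1]
          + [(a + b) / 2 for a, b in zip(xs_sorted, xs_sorted[1:])]
          + [xs_sorted[-1] + 1])
    return [(lambda x, t=t, s=s: s if x > t else -s) for t in ts for s in (1, -1)]

def interpolated(X, Y, H, T, beta=2.0, delta=0.5):
    N = len(X)
    w = [1.0 / N] * N
    R, S = [0.0] * N, [0.0] * N       # weighted / unweighted regret vectors
    d_star = min(delta, 1.0)          # delta = 0: Blackwell; delta = 1: SLVM
    ensemble = []
    for _ in range(T):
        f = min(H, key=lambda h: sum(wi for x, y, wi in zip(X, Y, w) if h(x) != y))
        ensemble.append(f)
        mist = [1.0 if f(x) != y else 0.0 for x, y in zip(X, Y)]
        eps = sum(wi * m for wi, m in zip(w, mist))   # weighted error
        e = sum(mist) / N                             # unweighted error
        R = [r + m - eps for r, m in zip(R, mist)]
        S = [s + m - e for s, m in zip(S, mist)]
        Sp = [max(s, 0.0) for s in S]
        V = beta * math.sqrt(sum(x * x for x in Sp) / N) - sum(Sp) / N
        raw = [max((1 - d_star) * r + d_star * s, 0.0) + delta * V
               for r, s in zip(R, S)]
        z = sum(raw)
        w = [x / z for x in raw] if z > 0 else [1.0 / N] * N
    return lambda x: 1 if sum(h(x) for h in ensemble) >= 0 else -1

X = [0.0, 1.0, 2.0, 3.0]
Y = [1, 1, -1, 1]
F0 = interpolated(X, Y, stumps(X), T=500, delta=0.0)   # pure Blackwell endpoint
F5 = interpolated(X, Y, stumps(X), T=500, delta=0.5)   # an interpolated point
```

With β = 2 the correction term V is nonnegative (Jensen's inequality), so all weights stay nonnegative, matching the choice made in the experiments below.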
We performed two series of experiments: the first with classification trees of limited depth and the second with RBF networks. The first series aims at investigating the qualitative properties of the algorithm and comparing it to AdaBoost, in particular in the presence of (labeling) noise. For this first series, we chose classification trees and stumps because they have been used as examples in other works on ensemble methods and because they are fast to compute, allowing us to perform large sets of experiments. On the other hand, since classification trees are not excellent classifiers taken individually, the goal is not to achieve record performance. In particular, it appears that the Random Forest (RF) algorithm often has better performance than the ensemble of trees obtained with our algorithm; however, RF is completely dedicated to classification trees: contrary to AdaBoost and other reweighting schemes, it is not divided clearly into a weak learner and an ensemble scheme. In particular, it is difficult to state exactly what the "weak learner" would be for RF, since the randomization is part of the tree-building procedure itself (a random subset of features is selected at each node, and the "ensemble" part merely consists in repeating the base procedure with a bootstrap). This makes it difficult to make a fair comparison with other ensemble schemes that use a "black box" weak learner. In any case, the fact that AdaBoost and the interpolated algorithm presented here can be applied to any weak learner proves a decisive advantage: in the
second series of experiments, using small-size RBF networks as base classifiers (which are generally better classifiers than single classification trees), the test error rates of the reweighting schemes are very clearly better than RF for an important majority of the data sets.

4.2.1 Experiments with Classification Trees and Stumps. For this first set of experiments, we tested the algorithm on seven benchmark data sets, six of which have been used by Rätsch et al. (2001) (footnote 3), and the last one (Breast-c; footnote 4) coming from the UCI repository. Our main point in this series of experiments is to compare the behavior of the interpolated algorithm with plain AdaBoost for an arbitrary weak learner, so we preferred a fast, if not very accurate, weak learner. We tried two procedures: decision trees of depth 1 (stumps) and depth 3, split using the Gini criterion, with no pruning and a coarse stopping rule. For each of the data sets, we tried the learning algorithms with labeling noise levels of 0%, 5%, 10%, and 20%, respectively; these noise levels correspond to the proportion of training examples whose labels are flipped before learning (the labels of the test samples used to estimate the error remain unchanged). For the interpolated algorithm, the constant δ was determined at each round by a five-fold cross-validation on the training set, choosing among an arbitrary set of nine values: {0, .05, .1, .15, .2, .5, .75, 1, 1.5} (this could probably be improved). For each of these situations, the error rates were estimated with 100 different training and test sets (the same training and test sets are used with the different algorithms). Finally, for each of the SRSs used, we performed T_max = 500 iterations at each round.

First, we show in Figure 6 different empirical margin "profiles" (cumulative distribution functions) on the training set, obtained with different values of the parameter on the Waveform data set.
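A margin "profile" of this kind is simply the empirical c.d.f. of the margins of the voted combination; a small helper (our own sketch, not the paper's code) makes this explicit:

```python
# Empirical margin profile (c.d.f.) of a uniform vote, as plotted in Figure 6.
def margin_profile(ensemble, X, Y):
    T = len(ensemble)
    margins = sorted(y * sum(h(x) for h in ensemble) / T for x, y in zip(X, Y))
    n = len(margins)
    return [(m, (i + 1) / n) for i, m in enumerate(margins)]  # (margin, fraction <= margin)

# Tiny usage example with constant "classifiers" voting on two examples:
prof = margin_profile([lambda x: 1, lambda x: 1, lambda x: -1], [0, 1], [1, -1])
```

The value of the profile at margin 0 is exactly the (training or test) error, which is how Figure 6 is read in the discussion below.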
It is interesting to note that these profiles follow basically what we should expect from the construction of the algorithm: for low δ, the distribution has an almost steep jump near what should be the minimax margin; for higher values of δ, the distribution of the margins is more spread out but also has a higher mean. In the figure, it is noticeable that the AdaBoost algorithm does a better job than Blackwell's strategy (corresponding to δ = 0) at pulling the minimum training margin higher; this is probably because the AdaBoost algorithm, being based on an exponential cost function, converges very quickly (this nice property of AdaBoost has been pointed out several times in the literature). By comparison, Blackwell's strategy uses only a linear function for its weighting and therefore needs more iterations to converge, and the 500 iterations performed here are probably not enough to reach the asymptotic regime (although the classification rates are very close). On the
3. Made available on-line at http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.
4. Available on-line at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Figure 6: Margin profiles (top: on the training set; bottom: on the test set) obtained on the Waveform data set, with depth 3 decision trees, with AdaBoost and the interpolated algorithm for different values of δ. (Each panel plots the empirical c.d.f. of the margins against the margin value, for AdaBoost and for δ = 0, 0.1, 0.2, 0.6, 2.)
other hand, Figure 6 shows that the proposed algorithm achieves precisely its initial purpose, which was to be able to sample different candidate ensemble classifiers achieving qualitatively different trade-offs in the shape (profile) of the margin cumulative distribution function (c.d.f.), the trade-off being basically between the proportion of examples having a high margin and the average margin of all the examples. The test set margin distributions, also reported in Figure 6, show that the qualitative differences appearing on the training set produce test profiles following the same qualitative patterns, so that varying the parameter indeed makes a difference for test sample classification. (In the example shown in the figure, AdaBoost actually offers the best solution, as is seen from the test c.d.f. at the point 0, which represents the test error. This matches the results shown in Table 2.)

The classification errors obtained on the different data sets are given in Tables 1 and 2. The interpolated algorithm does noticeably better than AdaBoost in general. It is interesting to note that the classification rates are generally worse (for both algorithms) using depth 3 trees than using stumps (except for the Waveform and Banana data sets). This indicates that the depth 3 trees must be overfitting in most of the cases, which has important consequences for the performance of the aggregated classifier. The interpolated algorithm outperforms AdaBoost most clearly for depth 3 trees, on the one hand, and for the higher labeling noise levels, on the other, which indicates that the interpolated algorithm is much more resistant to noise and overfitting. The German and Diabetes data sets are particularly instructive: when we compare the classification errors obtained with stumps and depth 3 trees, we see clearly that AdaBoost's performance is severely degraded, while the interpolated algorithm suffers only a small increase in classification errors.

The AdaBoost algorithm outperforms the interpolated algorithm with stumps in some cases on the data sets Waveform, German, and FSolar, but when we look at the numbers, the two algorithms actually appear mostly on par (only in three cases is the difference significant in the sense of a 95% t-test). One can sum up the results by saying that, over all the cases considered, there are 32 wins of the interpolated algorithm versus AdaBoost, with 3 losses and 21 ties. As far as classification trees are concerned, it should be noted that for a majority of these data sets, the RF algorithm actually outperforms the interpolated algorithm (compare with Table 3). However, as pointed out earlier, it is not obvious how to identify a "weak learner" in the RF algorithm that could be used by the other ensemble algorithms for a fair comparison. Moreover, the reweighting schemes can be applied to other weak learners, yielding in most cases better final classification rates than RF, as shown in the next section.

4.2.2 Experiments with RBF Networks. For this second set of experiments, we used RBF networks as weak learners. In contrast to decision trees, RBF
Table 1: Results with Stumps as Base Classifiers (100 Training and Test Sets, 500 Iterations of the Ensemble Algorithms).

Labeling noise 0%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   3.9 ± 2.1     3.7 ± 1.8  =
Banana          27.8 ± 1.6    27.5 ± 1.7 =
German          24.0 ± 2.3    24.3 ± 2.2 =
FSolar          33.0 ± 2.0    33.1 ± 2.0 =
Heart           21.7 ± 4.0    18.6 ± 3.9 +
Diabetes        24.7 ± 1.7    24.7 ± 1.8 =
Waveform        12.5 ± 0.6    12.4 ± 0.7 =

Labeling noise 5%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   5.1 ± 2.4     4.1 ± 2.2  +
Banana          28.2 ± 1.8    27.9 ± 1.8 =
German          25.0 ± 2.3    25.0 ± 2.2 =
FSolar          33.3 ± 1.6    33.5 ± 2.0 =
Heart           23.8 ± 3.8    19.2 ± 4.3 +
Diabetes        25.8 ± 1.9    25.0 ± 1.9 +
Waveform        15.4 ± 1.1    14.0 ± 1.2 +

Labeling noise 10%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   5.5 ± 2.4     4.4 ± 2.0  +
Banana          29.0 ± 2.2    28.5 ± 2.1 =
German          25.9 ± 2.4    25.9 ± 2.5 =
FSolar          33.6 ± 1.9    33.9 ± 2.3 =
Heart           26.1 ± 4.8    21.1 ± 4.6 +
Diabetes        26.1 ± 2.1    24.9 ± 1.8 +
Waveform        17.7 ± 1.3    14.9 ± 1.5 +

Labeling noise 20%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   7.3 ± 3.2     5.1 ± 2.5  +
Banana          30.4 ± 2.8    29.8 ± 2.8 =
German          27.9 ± 2.7    27.6 ± 2.8 =
FSolar          34.0 ± 2.4    34.9 ± 2.6 −
Heart           29.9 ± 5.2    25.4 ± 5.0 +
Diabetes        28.8 ± 2.6    26.6 ± 2.9 +
Waveform        22.5 ± 2.0    17.6 ± 2.0 +

Notes: The sign after the interpolated algorithm results indicates the result of a 95% two-sided t-test of equality with the AdaBoost results: + or − indicates that the equality is rejected; = indicates that it is not.
Table 2: Results with (Nonpruned) Depth 3 Tree Classifiers.

Labeling noise 0%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   3.2 ± 1.9     3.7 ± 2.0  =
Banana          13.9 ± 0.7    13.9 ± 1.2 =
German          25.1 ± 2.3    24.3 ± 2.1 +
FSolar          34.6 ± 1.8    33.8 ± 1.9 +
Heart           22.2 ± 4.2    21.3 ± 4.4 =
Diabetes        27.3 ± 2.0    25.0 ± 1.9 +
Waveform        11.7 ± 0.6    12.3 ± 1.0 −

Labeling noise 5%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   5.1 ± 2.4     4.4 ± 2.6  +
Banana          16.6 ± 1.4    15.8 ± 1.5 +
German          27.6 ± 2.3    25.6 ± 2.4 +
FSolar          34.6 ± 1.7    34.1 ± 2.1 =
Heart           24.3 ± 4.4    23.0 ± 4.8 +
Diabetes        28.6 ± 2.2    25.6 ± 1.7 +
Waveform        12.9 ± 0.8    13.5 ± 1.0 −

Labeling noise 10%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   6.8 ± 2.9     4.4 ± 2.3  +
Banana          19.6 ± 1.6    17.9 ± 1.8 +
German          29.2 ± 2.7    26.0 ± 2.4 +
FSolar          35.2 ± 2.0    34.8 ± 2.2 =
Heart           26.2 ± 4.3    24.1 ± 4.7 +
Diabetes        30.0 ± 2.3    25.5 ± 2.3 +
Waveform        14.7 ± 1.0    14.7 ± 1.1 =

Labeling noise 20%:
Data Set        AdaBoost      Interpolated Algorithm
Breast Cancer   9.9 ± 4.2     5.5 ± 2.9  +
Banana          25.8 ± 2.1    23.1 ± 2.2 +
German          33.0 ± 2.8    27.9 ± 2.7 +
FSolar          35.6 ± 2.2    35.8 ± 2.3 =
Heart           31.6 ± 5.3    28.0 ± 5.7 +
Diabetes        34.0 ± 2.9    27.9 ± 3.5 +
Waveform        20.2 ± 2.0    18.4 ± 2.1 +

Note: The same conditions as in Table 1 apply here.
Table 3: Performance Comparison Chart Between Random Forest, AdaBoost, AdaBoost-Reg, and the Interpolated Algorithm (Applied to RBF Nets).

Data Set            AdaBoost     Random       AdaBoost-Reg  Interpolated  Versus  Versus
                    RBF-net      Forest       RBF-net       RBF-net       ABReg   RF
Banana              12.0 ± 0.6   12.5 ± 0.8   11.0 ± 0.6    10.7 ± 0.5    +       +
German              26.6 ± 2.5   23.0 ± 2.1   24.4 ± 2.3    23.9 ± 2.3    =       −
FSolar              35.5 ± 1.8   34.2 ± 1.7   33.1 ± 1.9    34.7 ± 1.7    −       −
Heart               20.5 ± 3.0   18.8 ± 4.0   16.9 ± 3.9    17.6 ± 2.9    =       +
Diabetes            26.9 ± 2.2   24.7 ± 1.7   23.9 ± 1.9    23.7 ± 2.0    =       +
Waveform            11.0 ± 0.6   11.1 ± 0.7   9.8 ± 0.4     9.9 ± 0.4     =       +
Titanic             22.6 ± 1.2   23.0 ± 2.2   22.4 ± 1.1    23.6 ± 2.1    −       −
Breast Cancer (2)   30.6 ± 4.8   26.6 ± 4.4   26.6 ± 4.7    27.3 ± 4.6    =       =
Ringnorm            5.4 ± 2.3    3.6 ± 0.3    1.6 ± 0.1     1.8 ± 0.2     −       +
Twonorm             4.7 ± 1.6    3.5 ± 0.5    3.5 ± 0.4     2.9 ± 0.3     +       +

Notes: The last two columns show the results of 95% confidence two-sided t-tests of the interpolated algorithm versus AdaBoost-Reg and Random Forest, respectively. Same conditions as in Table 1.
networks are quite accurate weak learners; they are also slower to train, so in this case we performed a more limited set of experiments, considering only data without labeling noise. The goal of this series of experiments is also to provide a performance comparison with the AdaBoost-Reg algorithm (Rätsch et al., 2001), which is an alternative regularized version of AdaBoost based on different principles. For a fair comparison, we followed the protocol used in Rätsch et al. (2001); more precisely, we used the data and the RBF parameters provided on Rätsch's repository. The interpolation parameter was estimated by cross-validation for the first five realizations of each data set, and the median of these five values was used for the other realizations (which is exactly the protocol followed by Rätsch et al., 2001; this is rendered necessary to limit the computation time needed). We also used a larger number of benchmark data sets compared with the first series of experiments (note that the Breast-Cancer data set here is not the same as in the previous series).

The results for this experiment are reported in Table 3. It is first interesting to note that the best ensemble method for RBF networks is never AdaBoost. This shows clearly that AdaBoost largely overfits when the base classifiers are too complex. Individually, the interpolated algorithm applied to RBF networks is almost always better than AdaBoost (only one loss), outperforms RF in a clear majority of cases (six victories, three losses, one tie), and is on par with AdaBoost-Reg (five ties, three losses, two victories). (An algorithm wins against another whenever the corresponding two-sided t-test for equality of the mean error rates is rejected.) The advantage of the interpolated algorithm with respect to the latter is that it is computationally simpler (in particular, AdaBoost-Reg involves a line search at each step to determine the coefficient of the classifier).

The raw results suggest that AdaBoost-Reg may have a slight edge over the interpolated algorithm, but a down-to-earth view of the results, taking into account the confidence intervals on test classification, leads to the conclusion that the two algorithms have essentially equivalent performance.

5 Conclusion

Numerous theoretical works in the past few years providing bounds on classification error rates of ensemble methods have supported the intuition that one should generally try to obtain high margins on the training set. Once this principle is posed, there remain some partially heuristic choices to be made regarding the trade-offs and the desired shape of the margin distribution. We have pointed out two philosophies that can be considered as different strategies to achieve this goal, and we showed or recalled that they are linked with standard statistical learning procedures used in more classical frameworks (support vector machines and Fisher's linear discriminant), thus aiming to provide a unifying point of view over different types of ensemble methods.
Based on this analysis, and observing the remarkably similar form of two specific algorithms designed to achieve the aims of each paradigm, we built a very simple family of algorithms corresponding to interpolated parameter values between the two initial methods. In our opinion, the main virtue of this algorithm is to propose a simple method able to sample a variety of good candidate empirical margin profiles. The different types of profiles correspond basically to different trade-offs between the proportion of examples having a high enough margin and the average margin of the examples. Figure 6 shows clearly on an example how this goal is attained. It is then quite simple to pick among the sampled profiles by cross-validation to choose the trade-off best suited to the data. This proved, on benchmark data sets, to be an efficient method to improve over the performance of AdaBoost, more noticeably in noisy situations or when the base classifiers tend to be too complex. This algorithm outperforms RF when used with good base classifiers (small-size RBF networks) and exhibits the same level of performance as the Regularized AdaBoost method, while being noticeably simpler.

Appendix

Proof of Theorem 1. Point ii is a direct consequence of point i. By the minimax theorem, the minimax margin γ is such that for any collection of weights (ω_i), there exists a classifier f ∈ H having average weighted margin higher than γ or, in other words, weighted error less than (1 − γ)/2. Since W finds a classifier with minimum weighted error, we must have ε_t ≤ (1 − γ)/2 for all t. Now, we have, for all i = 1, …, N,

$$R_{i,t} = \sum_{k=1}^{t} \Bigl( \mathbf{1}\{f_k(X_i)Y_i \le 0\} - \varepsilon_k \Bigr) = \sum_{k=1}^{t} \Bigl( \tfrac{1}{2}\bigl(1 - f_k(X_i)Y_i\bigr) - \varepsilon_k \Bigr),$$

and point i hence implies that for any δ > 0, for t big enough, we have for all i,

$$\frac{1}{t}\sum_{k=1}^{t} f_k(X_i)Y_i \;\ge\; 1 - \frac{2}{t}\sum_{k=1}^{t}\varepsilon_k - \delta \;\ge\; \gamma - \delta,$$
which proves point ii.

Proof of the convexity of the cost function C_β(G_α). We want to prove that C_β(G_α), given by equation 3.3, is convex on the simplex S = {|α| = 1}. Since the margin function M_α is a linear function of |α|, this amounts to proving that the function $\sqrt{\mathrm{Var}_-[M_\alpha]}$ is convex. Let α₁, α₂ ∈ S, η ∈ [0, 1],
and consider

$$\begin{aligned} A &= \mathrm{Var}_-[M_{\eta\alpha_1 + (1-\eta)\alpha_2}] \\ &= E\bigl[\bigl(\eta(M_{\alpha_1} - E[M_{\alpha_1}]) + (1-\eta)(M_{\alpha_2} - E[M_{\alpha_2}])\bigr)_-^2\bigr] \\ &\le E\bigl[\bigl(\eta(M_{\alpha_1} - E[M_{\alpha_1}])_- + (1-\eta)(M_{\alpha_2} - E[M_{\alpha_2}])_-\bigr)^2\bigr] \\ &= \eta^2 E[(M_{\alpha_1} - E[M_{\alpha_1}])_-^2] + (1-\eta)^2 E[(M_{\alpha_2} - E[M_{\alpha_2}])_-^2] \\ &\quad + 2\eta(1-\eta)\, E[(M_{\alpha_1} - E[M_{\alpha_1}])_-\, (M_{\alpha_2} - E[M_{\alpha_2}])_-], \end{aligned}$$

where at the third line we have used the convexity of the negative part $(\cdot)_-$; and

$$\begin{aligned} B &= \bigl(\eta\, \mathrm{Var}_-[M_{\alpha_1}]^{1/2} + (1-\eta)\, \mathrm{Var}_-[M_{\alpha_2}]^{1/2}\bigr)^2 \\ &= \eta^2 E[(M_{\alpha_1} - E[M_{\alpha_1}])_-^2] + (1-\eta)^2 E[(M_{\alpha_2} - E[M_{\alpha_2}])_-^2] \\ &\quad + 2\eta(1-\eta)\, \bigl(E[(M_{\alpha_1} - E[M_{\alpha_1}])_-^2]\bigr)^{1/2} \bigl(E[(M_{\alpha_2} - E[M_{\alpha_2}])_-^2]\bigr)^{1/2}; \end{aligned}$$

we have A ≤ B by the Cauchy–Schwarz inequality, hence the result.

Proof of Theorem 4. Let us start by recalling the following simple fact: if h(x) is a differentiable function on ℝ, then it is easy to check that

$$\left.\frac{d}{dx}(h(x))_-^2\right|_{x = x_0} = -2\,(h(x_0))_-\, h'(x_0).$$
Let us assume without loss of generality that |α| = 1, and put M_f(x, y) = f(x)y (so that, since f(x) ∈ {−1, 1}, we have M_f(x, y) = 1 − 2E_f(x, y)); then

$$\mathrm{Var}_-[(1-t)M_\alpha + tM_f] = E\bigl[\bigl(M_\alpha - E[M_\alpha] + t(M_f - M_\alpha - E[M_f - M_\alpha])\bigr)_-^2\bigr],$$

so that, using the first remark,

$$\begin{aligned} \left.\frac{\partial}{\partial t}\mathrm{Var}_-[(1-t)M_\alpha + tM_f]\right|_{t=0} &= -2\,E\bigl[(M_\alpha - E[M_\alpha])_-\,(M_f - M_\alpha - E[M_f - M_\alpha])\bigr] \\ &= 4\,E[w(X,Y)E_f(X,Y)] - 4\,E[w(X,Y)]\,E[E_f] + C_\alpha, \end{aligned}$$

where C_α is independent of f; similarly,

$$\left.\frac{\partial}{\partial t}E[(1-t)M_\alpha + tM_f]\right|_{t=0} = E[M_f - M_\alpha] = -2\,E[E_f] + C'_\alpha,$$
so that finally

$$\left.\frac{\partial}{\partial t}C_\beta(G_t)\right|_{t=0} = 2\left(\frac{E[w(X,Y)E_f(X,Y)] - E[w(X,Y)]\,E[E_f]}{\sqrt{\mathrm{Var}_-[M_\alpha]}} + \beta\,E[E_f]\right) + C''_\alpha.$$
The result follows by translation and multiplication by constants independent of f.

Acknowledgments

This work finds some of its roots in a joint research project with Y. Amit, whom I thank. I also thank K.-R. Müller, G. Rätsch, and S. Mika for stimulating discussions about this work. I finished this letter while I was a guest at Fraunhofer FIRST, Berlin.
Received November 6, 2002; accepted September 5, 2003.
LETTER
Communicated by Robert Hecht-Nielsen
Robust Formulations for Training Multilayer Perceptrons

Tommi Kärkkäinen
[email protected]
Erkki Heikkola
[email protected]
Department of Mathematical Information Technology, University of Jyväskylä, FIN-40014 University of Jyväskylä, Finland
The connection between robust statistical estimates and nonsmooth optimization is established. Based on the resulting family of optimization problems, robust learning problem formulations with regularization-based control on the model complexity of the multilayer perceptron network are described and analyzed. Numerical experiments for simulated regression problems are conducted, and new strategies for determining the regularization coefficient are proposed and evaluated.

1 Introduction

The multilayered perceptron (MLP) is the most commonly used neural network for nonlinear regression approximation. The simplest model of data in regression is to assume that the given targets are generated by

$$y_i = \phi(x_i) + \varepsilon_i, \tag{1.1}$$
where φ(x) is the unknown stationary function and the ε_i's are sampled from an underlying noise process. Kärkkäinen (2002; see also Bishop, 1995) showed that for any MLP with a linear final layer and special regularization (weight decay without final-layer bias), the usual least-mean-squares (LMS) learning problem formulation corresponds to the gaussian assumption for the noise statistics in equation 1.1. In statistics, relaxation of this assumption underlies the so-called robust procedures (see, e.g., Huber, 1981; Rousseeuw & Leroy, 1987; Rao, 1988; Hettmansperger & McKean, 1998; Oja, 1999). In the neural networks literature, there have been some attempts to combine robust statistical procedures with learning problem formulations and training algorithms for MLP networks (see, e.g., Kosko, 1992; Chen & Jain, 1994; Liano, 1996). Moreover, the single-layer (linear) perceptron (SLP) for classification and robust regression approximation has been extensively studied in a special algorithmic setting by Raudys (1998a, 1998b, 2000). However, the general combination of robustness and the MLP, without focusing on special algorithms or architectures, has, as far as we know, not been considered on a solid basis.

Neural Computation 16, 837–862 (2004)
© 2004 Massachusetts Institute of Technology
Based on initial settings other than equation 1.1, there exist many techniques for studying the learning properties of feedforward neural networks, especially probably approximately correct learnability and the Vapnik-Chervonenkis (VC) dimension (see, e.g., Mitchell, 1997). The above-mentioned result concerning the LMS learning problem means that for every local solution of the learning problem, the output is equal to the sample mean of the given outputs. Hence, every locally optimal MLP provides an unbiased estimate of the true mean. The main theoretical observation of this article is the generalization of this result in accordance with the two most common robust estimates of location: the median and the spatial median. This yields generalized unbiasedness, so that we can then concentrate on controlling the variance (i.e., the complexity) of the MLP. For this purpose, we apply the special quadratic weight decay, which penalizes large values of weights using only a single hyperparameter and, being strictly convex, also improves the local properties of the learning problem. Surprisingly, in addition to numerous practical and theoretical studies, the analysis of the fat-shattering dimension (a scale-sensitive version of the VC dimension more appropriate for studying neural networks) also supports the utilization of such an approach (Bartlett, 1998).

The main emphasis here is to describe, analyze, and test robust learning problem formulations for the MLP in a batch mode. For numerical comparisons, we need to use black-box training algorithms for solving the optimization problems based on these formulations. An essential concept then is the convergence of an algorithm, which depends on the regularity of the optimization problem (Nocedal & Wright, 1999). Hence, a rigorous treatment of the robust MLP requires us to establish a link between the norms behind the robust statistics and the regularity of such problems (Clarke, 1983; Mäkelä & Neittaanmäki, 1992). As far as we know, this fundamental relation has not been explicitly established in other works. Another basis for this work is to treat the MLP transformation in a layer-wise form (Hagan & Menhaj, 1994; Kärkkäinen, 2002). This allows us to derive the optimality system in a compact form, which can be used in an efficient computer implementation of the proposed techniques. Together with the given new heuristics for controlling model complexity, the proposed approach allows a rigorous derivation of an MLP for real applications with different noise characteristics within the training data.

In section 2, we establish, discuss, and illustrate the connection between robust statistics and nonsmooth optimization. We also present the layer-wise architecture and the family of learning problem formulations for training an MLP. In section 3, we compute the optimality conditions for network learning and derive and discuss some of their consequences. In section 4, we report results of numerical experiments comparing the different formulations and introduce two novel techniques for determining the complexity of an MLP model. Finally, in section 5, we briefly make some conclusions.
2 Preliminaries

Throughout the article, we denote by (v)_i the ith component of a vector v ∈ ℝⁿ. Without parentheses, v_i represents one element in the set of vectors {v_i}. The l_q-norm of a vector v is given by

$$\|v\|_q = \left( \sum_{i=1}^{n} |(v)_i|^q \right)^{1/q}, \qquad 1 \le q < \infty. \tag{2.1}$$
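As a concrete reading of equation 2.1, the l_q-norm is a one-liner (an illustrative NumPy sketch, not part of the original article):

```python
import numpy as np

def lq_norm(v, q):
    """l_q-norm of equation 2.1: (sum_i |(v)_i|^q)^(1/q), for 1 <= q < infinity."""
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.abs(v) ** q) ** (1.0 / q))
```

For q = 2 this is the Euclidean norm, and for q = 1 the sum of absolute values, the two cases used throughout the article.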
2.1 Nonsmooth Optimization and Robust Statistics. In this section, we establish the connection between nonsmooth optimization and robust statistics. More details and further references on nonsmooth optimization can be found in Clarke (1983) and Mäkelä and Neittaanmäki (1992). Robust statistics are treated in Huber (1981), Rousseeuw and Leroy (1987), Rao (1988), Hettmansperger and McKean (1998), and Oja (1999).

Nonsmooth optimization concentrates on functionals and optimization problems that cannot be described by using the classic (C¹) differential calculus. We consider the following unconstrained optimization problem,

$$\min_{u \in \mathbb{R}^n} J(u), \tag{2.2}$$

where J : ℝⁿ → ℝ is a given Lipschitz continuous cost function.

Definition 1. The subdifferential ∂J (according to Clarke, 1983) of J at u ∈ ℝⁿ is defined by

$$\partial J(u) = \{\xi \in \mathbb{R}^n \mid J^{\circ}(u; d) \ge \xi^T d \ \ \forall d \in \mathbb{R}^n\}, \tag{2.3}$$

where $J^{\circ}(u; d) = \limsup_{v \to u,\, t \downarrow 0} \frac{J(v + td) - J(v)}{t}$ is the generalized directional derivative, which coincides with the usual directional derivative J'(u; d) when it exists. Notice that ∂J(u) defines a nonempty, convex, and compact set.

Theorem 1. Every local minimizer u* ∈ ℝⁿ for problem 2.2 is substationary, that is, it satisfies

$$0 \in \partial J(u^*). \tag{2.4}$$

Moreover, if J is convex, then the necessary optimality condition in equation 2.4 is also sufficient.

To summarize, in nonsmooth optimization, the generalization of the directional derivative is the set-valued subdifferential, and, correspondingly,
the generalization of the smooth, local indication of an extremum point, ∇J(u*) = 0, is the existence of a substationary point, 0 ∈ ∂J(u*).

Let us illustrate the above definitions with the example f(u) = |u| for u ∈ ℝ. The subdifferential of f(u) is given by

$$\partial f(u) = \operatorname{sign}(u) = \begin{cases} -1, & \text{for } u < 0, \\ [-1, 1], & \text{for } u = 0, \\ 1, & \text{for } u > 0. \end{cases} \tag{2.5}$$

As can be seen, the subdifferential (i.e., the generalized sign function) coincides with the usual derivative in the well-defined case u ≠ 0, and contains the whole set [−1, 1], with end points given by the left and right converging directional derivatives f'(u_i) for u_i → 0. Moreover, u* = 0 is the unique minimizer of |u|, because 0 ∈ ∂f(u) only for u* = 0.

Next we turn our attention to robust statistics. Let {x₁, …, x_N} be a sample of a multivariate random variable x ∈ ℝⁿ. Consider the following family of optimization problems:

$$\min_{u \in \mathbb{R}^n} J_q^{\alpha}(u), \qquad \text{for } J_q^{\alpha}(u) = \frac{1}{\alpha} \sum_{i=1}^{N} \|u - x_i\|_q^{\alpha}. \tag{2.6}$$

We restrict ourselves to the following combinations (cf. Rao, 1988):

q = α = 2 (average). In this case, problem 2.6 is the quadratic least-squares problem, and the gradient of $J_q^{\alpha}$ is given by $\nabla J_2^2(u) = \sum_{i=1}^{N} (u - x_i)$. From ∇J₂²(u*) = 0, we obtain the unique solution a = u* of problem 2.6 in the form

$$a = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{2.7}$$

which is the marginal mean (average) of the given sample.

q = α = 1 (median). This choice leads to the minimization of a sum of l₁-norms, which is a nonsmooth optimization problem. The subdifferential of J₁¹(u) reads as

$$\partial J_1^1(u) = \sum_{i=1}^{N} \xi_i, \qquad \text{where } (\xi_i)_j = \operatorname{sign}((u - x_i)_j). \tag{2.8}$$

In practice, the median is realized by the marginal middle values of the feature-wise ordered sample set. Hence, the median is unique for odd N, whereas for N even, all points in the closed interval between the two middle values satisfy equation 2.8 (see, e.g., Kärkkäinen & Heikkola, 2002).
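The contrast between these two estimators can be seen numerically. The snippet below (an illustrative sketch with a hypothetical sample, not from the article) evaluates the closed-form minimizers of the two scalar instances of problem 2.6:

```python
import numpy as np

# Hypothetical univariate sample with one gross outlier.
x = np.array([0.9, 1.0, 1.1, 1.0, 0.95, 100.0])

# q = alpha = 2: the minimizer of sum_i |u - x_i|^2 is the average (eq. 2.7).
a = float(x.mean())

# q = alpha = 1: the minimizer of sum_i |u - x_i| is the median,
# realized by the middle values of the ordered sample (eq. 2.8).
m = float(np.median(x))

# The single outlier drags the average far from the bulk of the data,
# while the median stays near 1: this is the robustness discussed in the text.
```

Here the average lands above 17 while the median remains 1.0, illustrating why the choice q = α = 1 is called robust.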
q = 2α = 2 (spatial median). By equation 2.3, the subgradient of J₂¹(u) is characterized by the condition

$$\partial J_2^1(u) = \sum_{i=1}^{N} \xi_i, \quad \text{with } \begin{cases} (\xi_i)_j = \dfrac{(u - x_i)_j}{\|u - x_i\|_2}, & \text{for } \|u - x_i\|_2 \ne 0, \\ \|\xi_i\|_2 \le 1, & \text{for } \|u - x_i\|_2 = 0. \end{cases} \tag{2.9}$$

Thus, in this case, the solution of problem 2.6 is realized by the so-called spatial median s, which satisfies 0 ∈ ∂J₂¹(s). Milasevic and Ducharme (1987) show that if the sample {x₁, …, x_N} belongs to a Euclidean space and is not concentrated on a line, the spatial median s is unique. This result is generalized to strictly convex Banach spaces in Kemperman (1987).

2.1.1 Comparison of Different Estimators. In a statistical context, robustness refers to the insensitivity of estimators toward outliers, that is, observations that do not follow the characteristic distribution of the rest of the data. The sensitivity of the average a toward observations lying far from the origin (representing the mean-value estimator) is illustrated in Figure 1 (left), where the gradient field ∇f(x) = (x₁, x₂) of the 2D function ‖x‖₂² is given. As we can see, the size of the gradient vector increases when moving away from the origin, so that those points are weighted more heavily at the equilibrium ∇‖x‖₂² = 0. This readily explains why the (symmetric) gaussian distribution with enough samples is the intrinsic assumption behind the least-squares estimate a. On the other hand, an estimator with equal weight for all samples is obtained by dividing the gradient (not the samples!) by its length, and then we get precisely the spatial median s, which is illustrated through the gradient field of the function ‖x‖₂ in Figure 1 (right). As stated, for example, in Hettmansperger and McKean (1998), and also evident from equation 2.9, the
Figure 1: Scaled (scale 0.4) gradient fields of ‖x‖₂² (left) and ‖x‖₂ (right).
Figure 2: Scaled (scale 0.4) gradient field of f(x) = ‖x‖₁.
corresponding estimating function J₂¹ depends only on the directions and not on the magnitudes of u − x_i, i = 1, …, N, which significantly decreases both the sensitivity toward outliers and the requirements concerning the necessary amount of data. Finally, in Figure 2, the gradient field of the function ‖x‖₁ is depicted, where the insensitivity with respect to distance, but also the lack of rotational invariance (due to the different contour lines of the unit ball in the 1-norm), are clearly visible.

2.2 MLP in a Layer-Wise Form. A compact description of the action of the multilayered perceptron neural network is given by Hagan and Menhaj (1994) and Kärkkäinen (2002):

$$o^0 = x, \qquad o^l = \mathcal{F}^l(W^l \hat{o}^{(l-1)}) \quad \text{for } l = 1, \ldots, L. \tag{2.10}$$
Here, the superscript l corresponds to the layer number (starting from zero for the input), and the circumflex indicates the normal extension of a vector by unity. $\mathcal{F}^l(\cdot)$ denotes the usual componentwise activation on the lth level, which can be represented by using a diagonal function matrix $\mathcal{F} = \mathcal{F}(\cdot) = \mathrm{Diag}\{f_i(\cdot)\}_{i=1}^{m}$, supplied with the natural definition of the matrix-vector product $y = \mathcal{F}(v) \equiv (y)_i = f_i((v)_i)$. Notice, though, that the following analysis generalizes in a straightforward manner to the case of an activation with a nondiagonal function matrix (Kärkkäinen, 2002). The dimensions of the weight matrices are given by $\dim(W^l) = n_l \times (n_{l-1} + 1)$, l = 1, …, L, where n₀ is the length of an input vector x, n_L the length of the output
vector o^L, and n_l, 0 < l < L, determine the sizes (numbers of neurons) of the hidden layers. Due to the special bias weights in the first column, the column numbering for each weight matrix starts from zero.

Instead of precisely equation 2.10, we consider an MLP architecture containing only a linear transformation in the final layer: $o_i^L = \mathcal{N}(\{W^l\})(x_i) = W^L \hat{o}_i^{(L-1)}$. With given training data $\{x_i, y_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^{n_0}$ and $y_i \in \mathbb{R}^{n_L}$, the unknown weight matrices $\{W^l\}_{l=1}^{L}$ are determined as the solution of the optimization problem

$$\min_{\{W^l\}_{l=1}^{L}} \mathcal{L}_{q,\beta}^{\alpha}(\{W^l\}), \tag{2.11}$$

where the cost functional is of the general form

$$\mathcal{L}_{q,\beta}^{\alpha}(\{W^l\}) = \frac{1}{\alpha N} \sum_{i=1}^{N} \|\mathcal{N}(\{W^l\})(x_i) - y_i\|_q^{\alpha} + \frac{\beta}{2} \sum_{l=1}^{L} \sum_{(i,j) \in \mathcal{I}^l} |W_{i,j}^l|^2 \tag{2.12}$$

for β ≥ 0. Here, the index set $\mathcal{I}^l$ contains all indices of the unknown weight matrices except the ones corresponding to the bias vector of W^L, as suggested by the test results in Kärkkäinen (2002); see also Bishop (1995). All features in the training data {x_i, y_i} are preprocessed to the range [−1, 1] of the k-tanh functions

$$t_k(a) = \frac{2}{1 + \exp(-2^k a)} - 1 \quad \text{for } k \in \mathbb{N}, \tag{2.13}$$

which are used in the activation. In this way, we balance the scaling of the unknowns (components of weight matrices at different layers) in problem 2.11 (Kärkkäinen, 2002).

It is well known (as suggested by equation 1.1) that for a successful application of the MLP, one needs to avoid overfitting by taking into account both the model complexity and the errors in the data. In equation 2.12, choosing the single hyperparameter, the weight-decay coefficient β, to be positive favors smaller weights, thus further balancing their scale for any iterative training algorithm. In addition, this pushes the linear part of each neuron toward the linear region of the k-tanh activation functions. However, because in our approach we let k in equation 2.13 run from 1 to n_l in each layer, thus diminishing the linear region, this kind of penalization does not directly reduce the MLP to a linear transformation. Function 2.12 is also related to Bayesian statistics with compatible choices of the sample data and prior distributions (see, e.g., Rögnvaldsson, 1998). Moreover, we consider the same three combinations for the parameters q and α as in the previous section: q = α = 2, q = α = 1, and q = 2α = 2. Hence, we conclude that the considered family of learning problem formulations for the MLP results from a compound (though special) application of robust and Bayesian statistics.
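Equations 2.10 through 2.13 translate into a short forward-pass and cost-evaluation sketch. The following NumPy code is illustrative only (the tiny network, its weights, and the helper names are hypothetical, and the per-unit assignment of k in the activation is our reading of the text, not the authors' implementation):

```python
import numpy as np

def k_tanh(a, k):
    # k-tanh activation of eq. 2.13: t_k(a) = 2 / (1 + exp(-2^k a)) - 1;
    # t_1 coincides with the ordinary tanh.
    return 2.0 / (1.0 + np.exp(-(2.0 ** k) * a)) - 1.0

def extend(v):
    # Normal extension of a vector by unity (the circumflex in eq. 2.10).
    return np.concatenate(([1.0], v))

def mlp_output(x, weights):
    # o^0 = x, o^l = F^l(W^l o-hat^(l-1)) for the hidden layers, and a linear
    # final layer; hidden unit i of a layer uses t_{i+1}, since k runs from
    # 1 to n_l in each layer (assumption following the text).
    o = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        z = W @ extend(o)
        o = np.array([k_tanh(z[i], i + 1) for i in range(len(z))])
    return weights[-1] @ extend(o)

def cost(weights, X, Y, q, alpha, beta):
    # Cost functional of eq. 2.12: data-fit term plus quadratic weight decay
    # over all weights except the final-layer bias column (the index sets I^l).
    N = len(X)
    fit = sum(np.sum(np.abs(mlp_output(x, weights) - y) ** q) ** (alpha / q)
              for x, y in zip(X, Y)) / (alpha * N)
    decay = sum(np.sum(W ** 2) for W in weights[:-1]) + np.sum(weights[-1][:, 1:] ** 2)
    return float(fit + 0.5 * beta * decay)

# Hypothetical 1-2-1 network: W^1 is n1 x (n0 + 1), W^2 is n2 x (n1 + 1),
# with the bias column first in each matrix.
W1 = np.array([[0.1, 1.0], [-0.1, 0.5]])
W2 = np.array([[0.2, 0.7, -0.3]])
val = cost([W1, W2], [np.array([0.4])], [np.array([0.5])], q=2, alpha=2, beta=0.1)
```

The three combinations of (q, α) from section 2.1 are obtained simply by changing the last two keyword arguments.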
For solving the optimization problems in equation 2.11, we use generalizations of gradient-based methods for nonsmooth problems known as bundle methods (Mäkelä & Neittaanmäki, 1992). We recall from Kärkkäinen (2002) that for α = 1, the assumptions of convergence for ordinary training algorithms like gradient descent (in batch mode, Lipschitz continuity of the gradient; for on-line stochastic iteration, C² continuity), CG (Lipschitz continuity of the gradient), and especially quasi-Newton methods (C² continuity) are violated (Haykin, 1994; Nocedal & Wright, 1999). As documented, for example, for the SLP in Raudys (2000), for the MLP in Saito and Nakano (2000), and for image restoration with similar functionals in Kärkkäinen, Majava, and Mäkelä (2001), this yields nonconvergence of ordinary training algorithms when the cost function does not fulfill the required smoothness assumptions. Furthermore, results in Kärkkäinen et al. (2001) indicate that simple smoothing techniques, like replacing a norm ‖v‖₂ for v ∈ ℝ² by $\sqrt{v_1^2 + v_2^2 + \varepsilon}$ for ε > 0, are not sufficient to restore the convergence of ordinary optimization methods.

3 Sensitivity Analysis and Its Consequences

Next we apply a useful technique, also presented in Kärkkäinen (2002), to derive the optimality conditions for the network training problem 2.11. From now on, for any vector v ∈ ℝⁿ, the notation sign[v] means a componentwise application of the sign function in equation 2.5, and the abbreviated notation v/‖v‖₂ actually refers to the existence of a vector ξ such that

$$(\xi)_i = \frac{(v)_i}{\|v\|_2}, \ \text{for } \|v\|_2 \ne 0; \qquad \|\xi\|_2 \le 1, \ \text{for } \|v\|_2 = 0. \tag{3.1}$$
For simplicity, we assume that the activation functions in all function matrices $\mathcal{F}(\cdot)$ are differentiable, although the analysis below can be extended, and the given algorithms applied, also to nondifferentiable activation functions. Note that the use of nonsmooth activation functions (the step function or a/(1 + |a|); e.g., Prechelt, 1998) makes the learning problem nonsmooth even for q = α = 2, and therefore ordinary gradient-based optimization algorithms cannot be used for its solution.

3.1 MLP with One Hidden Layer. For clarity, we start with an MLP with only one hidden layer. Then any local solution (W^{1*}, W^{2*}) of the minimization problem 2.11 is characterized by the conditions

$$\begin{bmatrix} O \\ O \end{bmatrix} \in \partial_{(W^1, W^2)} \mathcal{L}_{q,\beta}^{\alpha}(W^{1*}, W^{2*}) = \begin{bmatrix} \partial_{W^1} \mathcal{L}_{q,\beta}^{\alpha}(W^{1*}, W^{2*}) \\ \partial_{W^2} \mathcal{L}_{q,\beta}^{\alpha}(W^{1*}, W^{2*}) \end{bmatrix}. \tag{3.2}$$
Here, $\partial_{W^l} \mathcal{L}_{q,\beta}^{\alpha}$, l = 1, 2, are subdifferentials presented in the same matrix form as the unknown weight matrices. We begin the derivation with some lemmata. The proofs are omitted here, because they follow exactly the same lines as the proofs of the corresponding lemmata in Kärkkäinen (2002), using the already introduced results on subdifferential calculus in section 2.1. We also assume that, except for the unknown variable defined in each cost function, the other fixed quantities (matrices and vectors) are given with appropriate dimensions. Furthermore, to compress the presentation, we introduce the following notation:

$$\xi_q^{\alpha}(v) = \begin{cases} v, & \text{for } q = \alpha = 2, \\ \operatorname{sign}[v], & \text{for } q = \alpha = 1, \\ \dfrac{v}{\|v\|_2}, & \text{for } q = 2\alpha = 2. \end{cases}$$
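This notation transcribes directly into code. The sketch below is illustrative (the function name is ours, and returning the zero vector at v = 0 is one admissible choice of subgradient element, per equation 3.1):

```python
import numpy as np

def xi(v, q, alpha):
    """The mapping xi^alpha_q(v) used in the optimality conditions:
    v itself (q = alpha = 2), sign[v] componentwise (q = alpha = 1),
    or v / ||v||_2 (q = 2*alpha = 2)."""
    v = np.asarray(v, dtype=float)
    if (q, alpha) == (2, 2):
        return v
    if (q, alpha) == (1, 1):
        return np.sign(v)          # any value in [-1, 1] is admissible at 0
    if (q, alpha) == (2, 1):       # spatial-median case q = 2*alpha = 2
        n = np.linalg.norm(v)
        return v / n if n > 0 else np.zeros_like(v)  # 0 is admissible at v = 0
    raise ValueError("unsupported (q, alpha) combination")
```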
Lemma 1. For the functional $J(W) = \frac{1}{\alpha}\|Wv - y\|_q^{\alpha}$, the matrix of subdifferentials is of the form

$$\partial_W J(W) = \xi_q^{\alpha}(Wv - y)\, v^T.$$

Lemma 2. For the functional $J(u) = \frac{1}{\alpha}\|W\mathcal{F}(u) - y\|_q^{\alpha}$, the subgradient reads as

$$\partial_u J(u) = (W\mathcal{F}'(u))^T \xi_q^{\alpha}(W\mathcal{F}(u) - y) = \mathrm{Diag}\{\mathcal{F}'(u)\}\, W^T \xi_q^{\alpha}(W\mathcal{F}(u) - y).$$

Lemma 3. For the functional $J(W) = \frac{1}{\alpha}\|\bar{W}\mathcal{F}(Wv) - y\|_q^{\alpha}$, the matrix of subdifferentials is of the form

$$\partial_W J(W) = \mathrm{Diag}\{\mathcal{F}'(Wv)\}\, \bar{W}^T \xi_q^{\alpha}(\bar{W}\mathcal{F}(Wv) - y)\, v^T.$$
Now we are ready to state the actual results for the perceptron with one hidden layer. In what follows, we denote by $W_1^2$ the submatrix $(W^2)_{i,j}$, i = 1, …, n₂, j = 1, …, n₁, which is obtained from W² by removing the first column $W_0^2$ containing the bias nodes. Furthermore, the error in the ith output is denoted by $e_i = W^2 \hat{\mathcal{F}}(W^1 \hat{x}_i) - y_i$.

Theorem 2. The matrices of subdifferentials $\partial_{W^2} \mathcal{L}_{q,\beta}^{\alpha}(W^1, W^2) \subset \mathbb{R}^{n_2 \times (n_1 + 1)}$ and $\partial_{W^1} \mathcal{L}_{q,\beta}^{\alpha}(W^1, W^2) \subset \mathbb{R}^{n_1 \times (n_0 + 1)}$ are of the form

$$\partial_{W^2} \mathcal{L}_{q,\beta}^{\alpha}(W^1, W^2) = \frac{1}{N} \sum_{i=1}^{N} \xi_q^{\alpha}(e_i)\, [\hat{\mathcal{F}}(W^1 \hat{x}_i)]^T + \beta\, [0 \;\; W_1^2], \tag{3.3}$$

$$\partial_{W^1} \mathcal{L}_{q,\beta}^{\alpha}(W^1, W^2) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Diag}\{\mathcal{F}'(W^1 \hat{x}_i)\}\, (W_1^2)^T \xi_q^{\alpha}(e_i)\, \hat{x}_i^T + \beta\, W^1. \tag{3.4}$$
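In the smooth case q = α = 2, the subdifferentials 3.3 and 3.4 reduce to ordinary gradients, which can be verified against finite differences. The check below is an illustrative NumPy sketch on a hypothetical 2-3-1 network, using plain tanh for the hidden activation to keep it short (not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2, N = 2, 3, 1, 5
W1 = rng.normal(size=(n1, n0 + 1)) * 0.5   # bias column first
W2 = rng.normal(size=(n2, n1 + 1)) * 0.5
X = rng.normal(size=(N, n0))
Y = rng.normal(size=(N, n2))
beta = 0.1

def ext(v):
    # Extension of a vector by unity (the hat notation).
    return np.concatenate(([1.0], v))

def loss(W1, W2):
    # L^2_{2,beta}: mean of 0.5*||e_i||^2 plus quadratic decay penalizing
    # all of W1 and the non-bias part of W2 (the index sets I^l).
    s = 0.0
    for x, y in zip(X, Y):
        e = W2 @ ext(np.tanh(W1 @ ext(x))) - y
        s += 0.5 * np.dot(e, e)
    return s / N + 0.5 * beta * (np.sum(W1 ** 2) + np.sum(W2[:, 1:] ** 2))

def grads(W1, W2):
    # Gradients per eqs. 3.3 and 3.4; for q = alpha = 2, xi(e) = e,
    # and tanh' = 1 - tanh^2 plays the role of F'.
    g1 = beta * W1.copy()
    g2 = np.hstack([np.zeros((n2, 1)), beta * W2[:, 1:]])
    for x, y in zip(X, Y):
        h = np.tanh(W1 @ ext(x))
        e = W2 @ ext(h) - y
        g2 += np.outer(e, ext(h)) / N
        g1 += (np.diag(1 - h ** 2) @ W2[:, 1:].T @ e)[:, None] @ ext(x)[None, :] / N
    return g1, g2

g1, g2 = grads(W1, W2)
eps = 1e-6
E = np.zeros_like(W1); E[0, 1] = eps
fd = (loss(W1 + E, W2) - loss(W1 - E, W2)) / (2 * eps)  # central difference
```

The central-difference quotient `fd` agrees with the corresponding entry of the analytic gradient `g1`, which is the kind of consistency that the compact optimality system is meant to deliver in an implementation.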
3.2 MLP with Several Hidden Layers. Next, we generalize the previous analysis to the case of several hidden layers.

Lemma 4. For the functional $J(W) = \frac{1}{\alpha}\|\bar{W}\bar{\mathcal{F}}(\tilde{W}\tilde{\mathcal{F}}(Wv)) - y\|_q^{\alpha}$, the matrix of subdifferentials is of the form

$$\partial_W J(W) = \mathrm{Diag}\{\tilde{\mathcal{F}}'(Wv)\}\, \tilde{W}^T\, \mathrm{Diag}\{\bar{\mathcal{F}}'(\tilde{W}\tilde{\mathcal{F}}(Wv))\}\, \bar{W}^T \xi_q^{\alpha}(e)\, v^T,$$

where $e = \bar{W}\bar{\mathcal{F}}(\tilde{W}\tilde{\mathcal{F}}(Wv)) - y$.

Theorem 3. The matrices of subdifferentials $\partial_{W^l} \mathcal{L}_{q,\beta}^{\alpha}(\{W^l\})$, l = L, …, 1, read as

$$\partial_{W^l} \mathcal{L}_{q,\beta}^{\alpha}(\{W^l\}) = \frac{1}{N} \sum_{i=1}^{N} \xi_i^l\, [\hat{o}_i^{(l-1)}]^T + \beta\, \tilde{W}^l, \tag{3.5}$$

where

$$\xi_i^L = \xi_q^{\alpha}(e_i), \qquad \xi_i^l = \mathrm{Diag}\{(\mathcal{F}^l)'(W^l \hat{o}_i^{(l-1)})\}\, (W_1^{(l+1)})^T \xi_i^{(l+1)}. \tag{3.6}$$
Furthermore, $\tilde{W}^l = [0 \;\; W_1^L]$ for l = L, and coincides with the whole matrix $W^l$ for 1 ≤ l < L.

The compact presentation of the optimality system in theorem 3 can be readily exploited in the implementation, which consists of a few basic linear-algebraic operations realizing equations 3.5 and 3.6. Moreover, the following result, concerning every local minimizer of problem 2.11 satisfying $O \in \partial_{W^l} \mathcal{L}_{q,\beta}^{\alpha}(\{W^{l*}\})$, holds.

Corollary 1. For a locally optimal MLP network $\{W^{l*}\}$ satisfying the conditions in theorem 3:

$$\begin{cases} \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} e_i^* = 0, & \text{for } q = \alpha = 2, \\[4pt] 0 \in \displaystyle\sum_{i=1}^{N} \operatorname{sign}[e_i^*], & \text{for } q = \alpha = 1, \\[4pt] 0 \in \displaystyle\sum_{i=1}^{N} \dfrac{e_i^*}{\|e_i^*\|_2}, & \text{for } q = 2\alpha = 2, \end{cases}$$

for all β ≥ 0.
Proof.
The optimality condition

$$O \in \partial_{W^L} \mathcal{L}_{q,\beta}^{\alpha}(\{W^{l*}\}) = \frac{1}{N} \sum_{i=1}^{N} \xi_i^{L*}\, [\hat{o}_i^{(L-1)*}]^T + \beta\, [0 \;\; W_1^{L*}]$$

(with the abbreviation $\hat{o}_i^{(L-1)} = \hat{o}_i^{(L-1)*}$) in theorem 3 can be written in the nonextended form as

$$O \in \frac{1}{N} \sum_{i=1}^{N} \xi_i^{L*}\, [1 \;\; (o_i^{(L-1)})^T] + \beta\, [0 \;\; W_1^{L*}].$$

By taking the transpose on the right-hand side, we obtain

$$\begin{bmatrix} O \\ O \end{bmatrix} \in \frac{1}{N} \sum_{i=1}^{N} \begin{bmatrix} 1 \\ o_i^{(L-1)} \end{bmatrix} (\xi_i^{L*})^T + \begin{bmatrix} 0^T \\ \beta\, (W_1^{L*})^T \end{bmatrix} = \begin{bmatrix} \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} (\xi_i^{L*})^T \\[6pt] \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} o_i^{(L-1)} (\xi_i^{L*})^T + \beta\, (W_1^{L*})^T \end{bmatrix}.$$
Finally, using the definitions in equation 3.5 for $\xi_i^{L*}$ in the first row shows the result.

The result of corollary 1 shows that, by means of the error distribution, the local optimality conditions for the three learning problem formulations coincide with the conditions satisfied by the three statistical estimates in equations 2.7 through 2.9. Hence, we draw the following conclusions:

• We are able to generate robust MLPs using the two nonsmooth norms for fitting. This also suggests that other good properties of robust estimates, like the smaller amount of learning data needed, could be a further benefit when training a network.

• The result of corollary 1 quantifies precisely the fault tolerance of neural networks with respect to erroneous data according to equation 1.1.

• The proof of corollary 1 reveals that, by means of the regression model, the role of the MLP is suppressed: we only needed a linear final layer with a separate bias. Therefore, these results are actually valid for all kinds of regression approximators with these two properties.

4 Numerical Experiments

4.1 Univariate Single-Valued Regression. In the first test setting, we study the use of the MLP network in the reconstruction of a given single-valued function of one variable, which is disturbed by random noise. We train the network by solving the optimization problem 2.11 with both α = q = 2 and α = q = 1, and we compare the results given by these two
approaches. In this case, nL D 1; and thus the functionals L 12;¯ and L 11;¯ are identical. The minimizations are performed by the proximity control bundle method, which is applicable to both the smooth functional L 22;¯ and the nonsmooth functional L 11;¯ (M¨akel a¨ & Neittaanm a¨ ki, 1992). 4.1.1 Denition of the Test Problem. We consider the reconstruction of the function f .x/ D sin.x/ in the interval x 2 [0; 2¼]: The input vectors of the training data are chosen to be N uniformly spaced values xi from the interval [0; 2¼ ] given by xi D .i ¡ 1/ 2¼=.N ¡ 1/: The samples of function values involve two types of random noise: low-amplitude normally distributed noise affects the values over the whole interval [0; 2¼ ]; and at some isolated points, the values are disturbed by high-amplitude uniformly distributed noise (outliers). Hence, we choose yi D sin.xi / C ± "i C ³ ´i ;
(4.1)
where $\varepsilon_i \sim N(0,1)$ for $i \in O^C$ and $\eta_i \sim U(-1,1)$ for $i \in O$. Here, $U(-1,1)$ denotes the uniform distribution on $(-1,1)$, and $O$ is an index set of outliers such that $O \subset \{1, 2, \ldots, N\}$; $O^C$ denotes the complement of $O$.

We use the MLP with one hidden layer (i.e., $L = 2$) considered in section 3.1. The activation is performed with the $k$-tanh functions 2.13, so that $\mathcal{F}(\cdot) = \mathrm{Diag}\{t_i(\cdot)\}_{i=1}^{n_1}$. The input and output dimensions are both equal to one ($n_0 = n_2 = 1$), and we use the values 5, 10, and 20 for the dimension $n_1$ of the hidden layer. The size $N$ of the training data is 30 or 120 (the case $N = 60$ is given in Kärkkäinen & Heikkola, 2002), and, correspondingly, the index set $O$ is chosen to contain 3 or 10 randomly selected indices between 1 and $N$. The amplitudes of the normally and uniformly distributed noise are $\delta = 0.3$ and $\zeta = 2$, respectively. The training data set $\{x_i, y_i\}$ in the case $N = 30$, created according to the definitions above (before scaling to the range of the activation functions), is illustrated in Figure 3.

4.1.2 Comparison of Formulations. Our goal is to estimate the approximation capability of the MLP networks corresponding to the minimization of the two functionals $\mathcal{L}^1_{1,\beta}$ and $\mathcal{L}^2_{2,\beta}$ with different values of the parameters $N$, $n_1$, and $\beta$. For this purpose, we define a validation set of input values by $\hat{x}_i = (i-1)\,2\pi/(N_t - 1)$, $N_t = 257$, which does not coincide with the input values $x_i$ of the training data. The difference between the MLP approximation and the exact function $f$ is then calculated by using the norm

$$\mathrm{err}(W_1, W_2) = \frac{1}{N_t} \sum_{i=1}^{N_t} \left| W_2 \hat{\mathcal{F}}(W_1 \hat{x}_i) - f(\hat{x}_i) \right|. \tag{4.2}$$
Let us emphasize that the choice of error measure is not based on favoring $\mathcal{L}^1_{1,\beta}$ but on the fact that this form weights small and large deviations from the exact function equally.
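As a concrete illustration, the univariate test setup of equations 4.1 and 4.2 can be sketched in a few lines of NumPy. The function names and the fixed seed are our own illustrative choices, and the sketch evaluates an arbitrary predictor rather than the trained MLP of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_data(N, n_outliers, delta=0.3, zeta=2.0):
    """Noisy samples of sin(x) on [0, 2*pi] as in equation 4.1."""
    x = np.arange(N) * 2 * np.pi / (N - 1)             # uniformly spaced inputs
    y = np.sin(x) + delta * rng.standard_normal(N)     # low-amplitude Gaussian noise
    O = rng.choice(N, size=n_outliers, replace=False)  # index set of outliers
    y[O] += zeta * rng.uniform(-1.0, 1.0, size=n_outliers)  # high-amplitude outliers
    return x, y

def mean_abs_error(predict, Nt=257):
    """Validation error of equation 4.2 on a grid of Nt points."""
    xv = np.arange(Nt) * 2 * np.pi / (Nt - 1)
    return np.mean(np.abs(predict(xv) - np.sin(xv)))

x, y = make_training_data(N=30, n_outliers=3)
err = mean_abs_error(np.sin)   # the exact function has zero validation error
```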
Robust Formulations for Training Multilayer Perceptrons
849
Figure 3: The training data set $\{x_i, y_i\}$ in the case $N = 30$ and the exact graph of the function $f$. The circled markers also involve uniformly distributed noise.
We performed a series of tests with the three different values for $N$ and $n_1$. For fixed $N$ and $n_1$, the value of the regularization parameter $\beta$ was varied in the interval $[0, 1]$, and for each $\beta$, we repeated the optimization algorithm 100 times with randomly created initial values for the weight matrices $\{W_1, W_2\}$ and computed the minimum and average values of the errors 4.2. The optimization problems to be solved for training the MLP are nonconvex, and thus there is a large number of local minima (all satisfying corollary 1) in the error surface to be explored by random initialization.

We monitored the value of the regularization term $\frac{\beta}{2} \sum_{l=1}^{L} \sum_{(i,j) \in I_l} |W^l_{i,j}|^2$ of the functional $\mathcal{L}^\alpha_{q,\beta}$ corresponding to the MLP with minimum error in equation 4.2 in order to find an effective way to choose the parameter $\beta$. This computation was motivated by the fact that our previous studies in image restoration, with similar functions to be optimized, have shown a strong correlation between the reconstruction error and the value of the regularization term (Kärkkäinen & Majava, 2000). In addition to the well-known cross-validation techniques, simpler heuristics for this purpose have been proposed and tested with the backpropagation algorithm (e.g., in Rögnvaldsson, 1998).
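The monitored quantity can be computed directly from the weight matrices. This is a minimal sketch assuming, for illustration, that every entry of every layer is penalized (in the text, the index sets $I_l$ determine which entries enter the sum):

```python
import numpy as np

def regularization_term(weights, beta):
    """The penalty (beta/2) * sum_l sum_{(i,j)} |W^l_{i,j}|^2, here taken
    over all entries of all layer matrices (an illustrative assumption)."""
    return 0.5 * beta * sum(np.sum(W ** 2) for W in weights)

W1 = np.ones((5, 2))   # toy layer weights, not trained values
W2 = np.ones((1, 6))
val = regularization_term([W1, W2], beta=0.1)   # 0.05 * (10 + 6) = 0.8
```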
Table 1: Optimal Values $\beta^*$ of the Regularization Parameter for Different Values of $N$ and $n_1$ with the Functionals $\mathcal{L}^1_{1,\beta}$ and $\mathcal{L}^2_{2,\beta}$.

                  $\mathcal{L}^1_{1,\beta}$          $\mathcal{L}^2_{2,\beta}$
   N     n1     beta*      err*^a        beta*      err*^a
   30     5     2.0e-2     1.2e-1        7.7e-3     1.6e-1
         10     4.1e-2     1.2e-1        2.0e-2     1.6e-1
         20     1.0e-1     1.3e-1        4.1e-2     1.6e-1
  120     5     1.9e-3     4.3e-2        1.2e-4     5.7e-2
         10     1.3e-2     4.7e-2        6.4e-4     5.9e-2
         20     3.1e-2     4.8e-2        2.6e-3     6.2e-2

^a The minimum error obtained with the value $\beta^*$ over 100 tests.
The computational results are illustrated in Figures 8 through 15 in the appendix. Each figure includes three graphs corresponding to the three dimensions $n_1$ of the hidden layer. For a given functional and fixed value of $N$, the graphs represent either the average value of the errors in the 100 tests or the value of the regularization term. In each case, the norm 4.2 attains its minimum value at a certain $\beta$, and these optimal points and minimal values of error are listed in Table 1.

We see that in each test case, the MLP network based on the minimization of the functional $\mathcal{L}^1_{1,\beta^*}$ leads to a better approximation of the exact function $f$ than the MLP based on $\mathcal{L}^2_{2,\beta}$. We can also draw the natural conclusion that the error is reduced by increasing $N$. The figures show that with a larger dimension of the hidden layer, the overall values of the error become smaller, but the minimum value remains essentially the same. Moreover, with a larger value of $n_1$, the error of the MLP approximation becomes less sensitive to the choice of the regularization parameter. However, there is a remarkable difference in the behavior of the two learning problem formulations for $n_1 = 20$ (i.e., with high representation capability of the MLP): when $\beta$ grows from $\beta^*$, the average error for $\mathcal{L}^1_{1,\beta}$ essentially stays at the same level, whereas for $\mathcal{L}^2_{2,\beta}$ there is an approximately linear increase. In addition, for $\mathcal{L}^2_{2,\beta}$, small deviations from the optimal regularization parameter $\beta^*$ may lead to a large increase in the error. Interestingly, this suggests that the well-known approach in statistics of integrating over nuisance parameters such as $\beta$ would here yield a poorer result (especially for $\mathcal{L}^2_{2,\beta}$) than choosing an appropriate single value.

From the graphs of the regularization term, we conclude that strong oscillation indicates that the value of $\beta$ is smaller than the optimal value $\beta^*$; in other words, the MLP is too complex, with unnecessary variance.
However, otherwise the reconstruction error and the regularization term
are not clearly correlated, and thereby our first approach does not contain enough information to choose the parameter $\beta$ exactly. For $q = \alpha = 1$ and $n_1 = 20$, there seems to be some similarity in the graphs for different values of $N$ to indicate $\beta^*$, although this visual information is difficult to quantify precisely.

4.2 Bivariate Vector-Valued Regression. In the second set of experiments, we consider the reconstruction of a vector-valued function from noisy data. We again use the MLP network with one hidden layer and $k$-tanh activation and train the network by minimizing the functional $\mathcal{L}^\alpha_{q,\beta}$ with the three choices of $q$ and $\alpha$. The test function is formed as a sum of a global term, which affects the function values over the whole domain, and a local term, which is nonzero only in a small part of the domain. It is well known that the MLP is efficient in approximating the global behavior of a function, but due to its structure, it tends to ignore local variations. Another commonly used neural architecture is the radial basis function network (RBFN) (Broomhead & Lowe, 1988), which builds approximations with local basis functions. Therefore, when properly focused, an RBFN can catch the local term but gives poor approximations to the global term. A simple idea to combine the advantages of locally and globally approximating networks is to augment the input of the MLP by the squares of the input values. Flake (1998) refers to such MLP networks as SQUARE-MLP (square unit augmented, radially extended, multilayer perceptron; see also Sarajedini & Hecht-Nielsen, 1992). Such networks retain the ability to form global representations, but they can also form local approximations with a single hidden node.

4.2.1 Definition of the Test Problem. We define the vector-valued function $\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2$, $\mathbf{f}(x,y) = (f_1(x,y), f_2(x,y))$, as

$$f_1(x,y) = 4 \exp\left( -\frac{(x-x_0)^2 + (y-y_0)^2}{0.7} \right) - \frac{2}{1 + \exp(-6x)} + 1, \tag{4.3}$$

and $f_2(x,y) = f_1(y,-x)$ with $x_0 = y_0 = 2.5$. The function 4.3 is an example of a Hill-Plateau surface (Sarle, 1997), which is a sum of a local gaussian and a global tanh function. The approximation of such a function is known to be difficult for both the MLP and the RBFN.

We attempt to reconstruct the function $\mathbf{f}$ in $\Pi = [-5,5] \times [-5,5]$. The input vectors of the training data are obtained by first constructing a uniform grid in $\Pi$ with the grid points given by

$$(x_i, y_j) = ((i-1)h - 5, \; (j-1)h - 5), \quad i,j = 1, \ldots, n_g, \tag{4.4}$$
where $h = 10/n_g$. For the tests, we choose $n_g = 21$ as in Flake (1998). These coordinate values are then prescaled to the range $[-1,1]$ of the activation functions, and they are included in the input vector together with the squares of the scaled coordinates.

As in the previous section, all output vectors involve low-amplitude, normally distributed noise, while some isolated outputs are also disturbed by high-amplitude outliers. More precisely, the output corresponding to the input $\mathbf{x}_{i,j} = (x_i, y_j, x_i^2, y_j^2)^T$ is of the form $\mathbf{y}_{i,j} = (y_1^{i,j}, y_2^{i,j})^T$ with

$$y_k^{i,j} = f_k(x_i, y_j) + \delta\,\varepsilon_{i,j} + \zeta\,\eta_{i,j}, \tag{4.5}$$

where $\varepsilon_{i,j} \sim N(0,1)$ for $(i,j) \in O^C$ and $\eta_{i,j} \sim U(-1,1)$ for $(i,j) \in O$. In the tests, the index set $O$ is chosen to include approximately $0.05\,n_g^2$ randomly selected elements, $\delta = 0.1$, and $\zeta = 2$. The training data created according to the definitions above (before prescaling to the range of the activation function) are illustrated in Figure 4.

4.2.2 Comparison of Formulations. We performed tests by using the values 5, 10, and 20 for $n_1$ with the three different training formulations in equations 2.11, initially without regularization (i.e., $\beta = 0$). The error of the MLP approximations was measured using a uniform $49 \times 49$ validation grid over $\Pi$. Let us denote the input vectors of the validation data by $\hat{\mathbf{x}}_{i,j}$ and the corresponding grid points by $(\hat{x}_i, \hat{y}_j)$. Then the error is calculated by

$$\mathrm{err}(W_1, W_2) = \frac{1}{49^2} \sum_{i=1}^{49} \sum_{j=1}^{49} \left\| W_2 \hat{\mathcal{F}}(W_1 \hat{\mathbf{x}}_{i,j}) - \mathbf{f}(\hat{x}_i, \hat{y}_j) \right\|_2. \tag{4.6}$$
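A sketch of the bivariate test problem of equations 4.3 through 4.5, including the square-augmented (SQUARE-MLP) inputs. Prescaling to the activation range is omitted; the uniform outlier draw is shared across the two output components as the notation of equation 4.5 suggests, while the Gaussian noise is drawn per component, which is our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = y0 = 2.5

def f1(x, y):
    """Hill-Plateau surface of equation 4.3: local Gaussian plus global sigmoid."""
    return (4 * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / 0.7)
            - 2 / (1 + np.exp(-6 * x)) + 1)

def f2(x, y):
    return f1(y, -x)

def training_grid(ng=21):
    """Uniform grid on [-5, 5]^2 (equation 4.4) with square-augmented inputs."""
    h = 10.0 / ng
    coords = np.arange(ng) * h - 5.0
    X, Y = np.meshgrid(coords, coords, indexing="ij")
    inputs = np.stack([X, Y, X ** 2, Y ** 2], axis=-1)   # (x, y, x^2, y^2)
    targets = np.stack([f1(X, Y), f2(X, Y)], axis=-1)
    return inputs, targets

def add_noise(targets, delta=0.1, zeta=2.0, outlier_frac=0.05):
    """Gaussian noise everywhere, uniform outliers at ~5% of grid points (eq. 4.5)."""
    noisy = targets + delta * rng.standard_normal(targets.shape)
    n = targets.shape[0] * targets.shape[1]
    flat = rng.choice(n, size=int(outlier_frac * n), replace=False)
    i, j = np.unravel_index(flat, targets.shape[:2])
    noisy[i, j, :] += zeta * rng.uniform(-1, 1, size=(len(flat), 1))
    return noisy

inputs, targets = training_grid()
noisy = add_noise(targets)
```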
With fixed $n_1$, we again repeated the optimization algorithm 100 times with random initialization and computed the minimum and average values of the errors (see equation 4.6).
Figure 4: The components of the output vectors in the training data ($f_1$ left, $f_2$ right).
Table 2: Minimum and Average Errors with the Functionals $\mathcal{L}^2_{2,\beta}$, $\mathcal{L}^1_{1,\beta}$, and $\mathcal{L}^1_{2,\beta}$.

          $\mathcal{L}^2_{2,\beta}$        $\mathcal{L}^1_{1,\beta}$        $\mathcal{L}^1_{2,\beta}$
  n1      err*      err          err*      err          err*      err
   5      9.6e-2    2.8e-1       3.7e-2    1.7e-1       3.4e-2    1.6e-1
  10      1.2e-1    2.4e-1       4.0e-2    1.5e-1       3.6e-2    1.4e-1
  20      1.1e-1    2.3e-1       4.7e-2    1.4e-1       4.4e-2    1.4e-1
The results are collected in Table 2. We conclude that the functionals $\mathcal{L}^1_{1,0}$ and $\mathcal{L}^1_{2,0}$ are more accurate and clearly more robust with respect to the initial guess than the smooth functional $\mathcal{L}^2_{2,0}$. In all cases, the smallest minimum error is achieved with the choice $q = 2\alpha = 2$, while the dimension $n_1$ does not have a strong effect on the accuracy. The best reconstruction, given by the functional $\mathcal{L}^1_{2,0}$, is illustrated in Figure 5.

4.2.3 Determination of the Regularization Parameter. In the previous section, the regularization parameter $\beta$ was equal to zero. However, as already pointed out by the results in section 4.1, the value of $\beta$ has a strong effect on the accuracy of the results, and it is thereby important to be able to choose its value correctly. Next, we describe another strategy for choosing the value of $\beta$ and estimate the quality of the resulting MLP network. The dimension of the hidden layer is fixed to $n_1 = 20$, and we use the choice $q = 2\alpha = 2$. The learning data are exactly the same as in the previous tests and are first divided into two disjoint parts $C_1$ and $C_2$ of equal size. More precisely, for $N = 441$, the randomly chosen sets $C_1$ and $C_2$ contain $N_1 = 221$ and $N_2 = 220$ elements, respectively. Only the input-output pairs $(\mathbf{x}_{i,j}, \mathbf{y}_{i,j})$ in the set $C_1$ are used in the functional 2.12, while the set $C_2$ is
Figure 5: The components of the best reconstruction, obtained by minimizing $\mathcal{L}^1_{2,\beta}$ with $n_1 = 5$ ($f_1$ left, $f_2$ right).
reserved for testing. This choice originates from the relaxed requirements concerning the necessary amount of data for robust training. We define the two errors

$$\mathrm{err}_k(W_1, W_2) = \frac{1}{N_k} \sum_{(\mathbf{x}_{i,j}, \mathbf{y}_{i,j}) \in C_k} \left\| W_2 \hat{\mathcal{F}}(W_1 \hat{\mathbf{x}}_{i,j}) - \mathbf{y}_{i,j} \right\|_2, \quad k = 1, 2, \tag{4.7}$$
while $\mathrm{err}_3$ refers to the already defined validation error in equation 4.6. We search for an optimal nonzero $\beta$ in the interval $[10^{-9}, 10^{-6}]$, which can be determined by monitoring the value of the regularization term as described for the univariate test. The interval is covered with a predefined set of values $l \cdot 10^{-s}$, $l = 1, 2, 4, 6, 8$; $s = 9, 8, 7$, and for each fixed $\beta$, the optimization algorithm is repeated 50 times with random initialization. We compute the errors $\mathrm{err}_1$ and $\mathrm{err}_2$ in all 50 tests and choose the MLP network with the smallest $\mathrm{err}_2$ to be the best one. For this MLP, we also compute the error $\mathrm{err}_3$.

The results of the first stage of our strategy are illustrated by the graphs in Figure 6. We see that all three curves have similar behavior, which suggests that the errors $\mathrm{err}_1$ and $\mathrm{err}_2$, which can be computed without knowing the

Figure 6: Graphs of the three errors $\mathrm{err}_k$ as functions of $\beta$.
exact function $\mathbf{f}$, contain the same information as the true error $\mathrm{err}_3$. Based on $\mathrm{err}_1$ and $\mathrm{err}_2$, we now limit the complexity of the MLP by choosing $\beta = 10^{-8}$.

After choosing the value of the regularization parameter, we proceed to the second stage of our strategy, which is the determination of the final MLP network. Because the model complexity of the MLP is now fixed along with $\beta$, we are able to use all of the data in the learning problem. We perform the minimization 100 times and choose the final MLP to be the one that corresponds to the smallest value of the local minima of $\mathcal{L}^1_{2,\beta}$ (i.e., the best candidate for the global minimum). We evaluated the quality of the obtained MLP by computing the corresponding $\mathrm{err}_3$, which was approximately 0.05. By comparing this value to the graph of Figure 6, we see that the approximation is improved from the first stage, and we obtain a very good overall error level. We also computed $\mathrm{err}_3$ in each of the 100 tests and compared these values to the local minima of $\mathcal{L}^1_{2,\beta}$. To study the correlation of these values, and thus the validity of the final choice, we sorted the 100 tests in ascending order according to $\mathcal{L}^1_{2,\beta}$. The results of this procedure, together with the corresponding values of $\mathrm{err}_3$, are given in Figure 7. The increase of $\mathrm{err}_3$ from its smallest value stresses the importance of the final choice, which recovered almost the best alternative among the 100 candidates.
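The two-stage strategy above can be summarized in code. Here `train_mlp` and `err` are placeholders for the actual bundle-method optimizer and the errors of equation 4.7; the toy stand-ins at the end exist only to make the sketch executable and carry no meaning:

```python
import numpy as np

def select_beta_and_train(data, betas, train_mlp, err, n_restarts=50, rng=None):
    """Two-stage strategy: pick beta on a held-out half, then retrain on all data.
    train_mlp(train_set, beta) returns a trained net; err(net, data_set) scores it."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))
    C1, C2 = data[idx[: len(data) // 2]], data[idx[len(data) // 2:]]  # disjoint halves

    best = {}
    for beta in betas:                       # stage 1: choose beta via err2 on C2
        nets = [train_mlp(C1, beta) for _ in range(n_restarts)]
        best[beta] = min(err(net, C2) for net in nets)
    beta_star = min(best, key=best.get)

    # stage 2: with beta fixed, use all data and keep the best local minimum
    final = min((train_mlp(data, beta_star) for _ in range(2 * n_restarts)),
                key=lambda net: err(net, data))
    return beta_star, final

# Toy stand-ins (illustrative only): "training" just returns beta, and the
# error is smallest when beta equals 1e-8.
beta_star, net = select_beta_and_train(
    np.arange(10), [1e-9, 1e-8, 1e-7],
    train_mlp=lambda ds, beta: beta,
    err=lambda net, ds: abs(net - 1e-8),
    n_restarts=3)
```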
Figure 7: The minimal value of $\mathcal{L}^1_{2,\beta}$ and $\mathrm{err}_3$ in the final 100 tests.
5 Conclusion

We considered robust learning problem formulations for the MLP network with regularization-based pruning. The MLP transformation was presented in a layer-wise form, which yielded a compact representation of the optimality systems and allowed a straightforward analysis and computer implementation of the proposed techniques. The different learning problem formulations were tested numerically to study the effect of noise with outliers. We also proposed and tested two novel strategies for the blind determination of the regularization parameter and thus the generality of the MLP. Altogether, the combination of robust training, square unit augmentation of the input, and effective control of model complexity yielded very promising computational results with simulated data.

Appendix: Error and Regularization Graphs
Figure 8: Average error with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 30$.
Figure 9: Regularization term with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 30$.

Figure 10: Average error with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 30$.
Figure 11: Regularization term with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 30$.
Figure 12: Average error with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 120$.
Figure 13: Regularization term with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 120$.
Figure 14: Average error with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 120$.
Figure 15: Regularization term with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 120$.
Acknowledgments

We thank Hannu Oja and Docent Marko M. Mäkelä for their help during the course of this research. We also gratefully acknowledge the comments and suggestions made by the anonymous referees, which improved the presentation of the results. This work was financially supported by the Academy of Finland, grant 49006, and by the National Technology Agency of Finland, grant 2371/31/02 and the InBCT project.

References

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44(2), 525–536.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2(3), 321–355.
Chen, D. S., & Jain, R. C. (1994). A robust backpropagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5(3), 467–479.
Clarke, F. H. (1983). Optimization and nonsmooth analysis. New York: Wiley.
Flake, G. W. (1998). Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 145–163). Berlin: Springer-Verlag.
Hagan, M., & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989–993.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hettmansperger, T. P., & McKean, J. W. (1998). Robust nonparametric statistical methods. London: Edward Arnold.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Kärkkäinen, T. (2002). MLP-network in a layer-wise form with applications to weight decay. Neural Computation, 14(6), 1451–1480.
Kärkkäinen, T., & Heikkola, E. (2002). Robust MLP (Rep. No. C 1). Jyväskylä: University of Jyväskylä, Department of Mathematical Information Technology.
Kärkkäinen, T., & Majava, K. (2000). Determination of regularization parameter in monotone active set method for image restoration. In P. Neittaanmäki, T. Tiihonen, & P. Tarvainen (Eds.), Proceedings of the Third European Conference on Numerical Mathematics and Advanced Applications (pp. 641–648). Singapore: World Scientific.
Kärkkäinen, T., Majava, K., & Mäkelä, M. M. (2001). Comparison of formulations and solution methods for image restoration problems. Inverse Problems, 17(6), 1977–1995.
Kemperman, J. (1987). The median of a finite measure on a Banach space. In Y. Dodge (Ed.), Statistical data analysis based on the L1-norm and related methods (pp. 217–230). Amsterdam: North-Holland.
Kosko, B. (1992). Neural networks and fuzzy systems: A dynamical systems approach to machine intelligence. Englewood Cliffs, NJ: Prentice Hall.
Liano, K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7(1), 246–250.
Mäkelä, M. M., & Neittaanmäki, P. (1992). Nonsmooth optimization. River Edge, NJ: World Scientific.
Milasevic, P., & Ducharme, G. (1987). Uniqueness of the spatial median. Ann. Statist., 15(3), 1332–1333.
Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer-Verlag.
Oja, H. (1999). Affine invariant multivariate sign and rank tests and corresponding estimates: A review. Scand. J. Statist., 26(3), 319–343.
Prechelt, L. (1998). Early stopping, but when? In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 55–70). Berlin: Springer-Verlag.
Rao, C. R. (1988). Methodology based on the L1-norm in statistical inference. Sankhyā Ser. A, 50(3), 289–313.
Raudys, Š. (1998a). Evolution and generalization of a single neurone: I. Single-layer perceptron as seven statistical classifiers. Neural Networks, 11(2), 283–296.
Raudys, Š. (1998b). Evolution and generalization of a single neurone: II. Complexity of statistical classifiers and sample size considerations. Neural Networks, 11(2), 297–313.
Raudys, Š. (2000). Evolution and generalization of a single neurone: III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks, 13(4–5), 507–523.
Rögnvaldsson, T. S. (1998). A simple trick for estimating the weight decay parameter. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 71–92). Berlin: Springer-Verlag.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.
Saito, K., & Nakano, R. (2000). Second-order learning algorithm with squared penalty term. Neural Computation, 12(3), 709–729.
Sarajedini, A., & Hecht-Nielsen, R. (1992). The best of both worlds: Casasent networks integrate multilayer perceptrons and radial basis functions. In International Joint Conference on Neural Networks, 1992 (Vol. 3, pp. 905–910). Piscataway, NJ: IEEE.
Sarle, W. (1997). The comp.ai.neural-nets frequently asked questions list. Available on-line at: http://www.faqs.org/faqs/ai-faq/neural-nets/ (part2/section-14.html).
Received January 29, 2003; accepted August 20, 2003.
LETTER
Communicated by Daniela Pucci de Farias
An Extended Projection Neural Network for Constrained Optimization

Youshen Xia
Department of Applied Mathematics, Nanjing University of Posts and Telecommunications, China
Recently, a projection neural network has been shown to be a promising computational model for solving variational inequality problems with box constraints. This letter presents an extended projection neural network for solving monotone variational inequality problems with linear and nonlinear constraints. In particular, the proposed neural network includes the projection neural network as a special case. Compared with the modified projection-type methods for solving constrained monotone variational inequality problems, the proposed neural network has a lower complexity and is suitable for parallel implementation. Furthermore, the proposed neural network is theoretically proven to be exponentially convergent to an exact solution without a Lipschitz condition. Illustrative examples show that the extended projection neural network can be used to solve constrained monotone variational inequality problems.
1 Introduction

Constrained optimization problems arise in a wide variety of scientific and engineering applications, including signal processing, system identification, filter design, regression analysis, and robot control (Bazaraa, Sherali, & Shetty, 1993). In many practical applications, such as robot motion control, the related optimization problems have a time-varying parameter characteristic (Sciavicco & Siciliano, 2000) and thus have to be solved in real time. For such applications, conventional solution methods for static optimization may not be effective, owing to the time-varying nature of the optimization problems and the requirement on computational time.

Neural networks are composed of massively connected simple neurons. Having structures similar to their biological counterparts, artificial neural networks are representational and computational models that process information in a parallel distributed fashion (Hopfield, 1982). Feedforward neural networks and recurrent neural networks are two major classes of neural network models. Feedforward neural networks, such as the popular multilayer perceptron (Rumelhart, Hinton, & Williams, 1986), are usually used as representational models trained using a learning algorithm based on a set of sampled input-output data.

Neural Computation 16, 863–883 (2004)  © 2004 Massachusetts Institute of Technology

It has been proven that multilayer feedforward neural networks are universal approximators (Hornik, Stinchcombe, & White, 1989). Recurrent neural networks, as dynamical systems, are usually used as computational models for solving computationally intensive problems (Golden, 1996). Because of the inherent nature of parallel and distributed information processing in neural networks, neural networks are promising computational models for real-time applications. Typical examples of recurrent neural network applications are NP-complete combinatorial optimization problems, minor component analysis, associative memory, and real-time optimization problems (Peterson & Soderberg, 1989; Ohlasson, Peterson, & Soderberg, 1993; Abe, Kawakami, & Hirasawa, 1992; Oja, 1992; Sevrani & Abe, 2000; Cichocki & Unbehauen, 1993).

The analog circuit research for solving linear programming problems perhaps stemmed from Pyne's work (Pyne, 1956) a half-century ago. Tank and Hopfield (1986) first developed a linear programming neural network suitable for applications that require on-line optimization. Their seminal work has inspired many researchers to investigate alternative and improved neural networks for solving linear and nonlinear programming problems. For example, Kennedy and Chua (1988) proposed an extended neural network for solving nonlinear programming problems. Because this network contains a penalty parameter, it generates approximate solutions only and has an implementation problem when the penalty parameter is very large (Lillo, Loh, & Hui, 1993). To avoid using penalty parameters, significant work has been carried out in recent years. Rodríguez-Vázquez, Domínguez-Castro, Rueda, Huertas, and Sánchez-Sinencio (1990) proposed a switched-capacitor neural network for solving a class of optimization problems. This network is suitable only for the case in which the optimal solution lies in the feasible region; otherwise, the network may have no equilibrium point.
Zhang, Zhu, and Zou (1992) developed a second-order neural network for solving a class of nonlinear optimization problems with equality constraints. Although it has a fast convergence rate, this network requires computing a time-varying inverse matrix and thus has a high model complexity for implementation. Bouzerdoum and Pattison (1993) presented a neural network for solving quadratic optimization problems with bounded variables only. Xia (1996a) proposed one primal-dual neural network for programming problems, Xia (1996b) proposed another primal-dual network for solving linear and quadratic programming problems, and Xia and Wang (2001) proposed a dual neural network for solving strictly convex quadratic programming problems. Recently, Xia, Leung, and Wang (2002) developed a projection neural network with the following dynamical equation,
$$\frac{dx}{dt} = \lambda \{ P_X(x - F(x)) - x \}, \tag{1.1}$$
where $F(x)$ is a differentiable vector-valued function from $\mathbb{R}^n$ into $\mathbb{R}^n$, $X = \{x \in \mathbb{R}^n \mid l \le x \le h\}$, and $P_X: \mathbb{R}^n \to X$ is a projection operator defined componentwise by

$$[P_X(x)]_i = \begin{cases} l_i & x_i < l_i \\ x_i & l_i \le x_i \le h_i \\ h_i & x_i > h_i, \end{cases}$$

where $l = [l_1, \ldots, l_n]^T$ and $h = [h_1, \ldots, h_n]^T$. This network has a low model complexity, and its equilibrium point coincides with the solution of the following variational inequality problem with box constraints:

$$(x - x^*)^T F(x^*) \ge 0, \quad x \in X. \tag{1.2}$$
Xia et al. (2002) showed that the projection neural network in equation 1.1 is globally convergent to an equilibrium point when the Jacobian matrix $\nabla F(x)$ of $F$ is symmetric and positive semidefinite, and is globally asymptotically stable at the equilibrium point when $\nabla F(x)$ is positive definite. Since it can be seen from the optimization literature (Kinderlehrer & Stampacchia, 1980) that many constrained optimization problems, such as linear and nonlinear complementarity problems, can be converted into problem 1.2, the projection neural network can be applied directly to solve monotone nonlinear complementarity problems and a class of constrained optimization problems.

In this article, we are concerned with the following variational inequality problem with linear and nonlinear constraints,

$$(x - x^*)^T F(x^*) \ge 0, \quad x \in \Omega, \tag{1.3}$$
where $\Omega = \{x \in \mathbb{R}^n \mid g(x) \le 0, \; Ax = b, \; x \in X\}$, $A \in \mathbb{R}^{r \times n}$, $b \in \mathbb{R}^r$, and $g(x) = [g_1(x), \ldots, g_m(x)]^T$ is an $m$-dimensional vector-valued continuous function of $n$ variables. The functions $g_1, \ldots, g_m$ are assumed to be convex and twice differentiable, and the existence of a solution of problem 1.3 is also assumed. Since the constraint set $\Omega$ is more complex than $X$, problem 1.3 is a significant extension of problem 1.2. Moreover, problem 1.3 has been viewed as a natural framework for unifying the treatment of constrained variational inequality problems and related optimization problems (Bertsekas, 1989; Harker & Pang, 1990; Ferris & Pang, 1997; Dupuis & Nagurney, 1993). The objective of this article is to present an extended projection neural network to solve problem 1.3. The proposed neural network is thus a significant extension of the projection neural network for solving constrained variational inequality problems, from box constraints to linear and nonlinear constraints. Compared with the modified projection-type numerical method for solving problem 1.3 (Solodov & Tseng, 1996),
which requires a varying step length, the proposed neural network not only has a lower complexity but also is suitable for parallel implementation. Furthermore, the proposed neural network is theoretically proven to be exponentially convergent to the solution of problem 1.3 under only a strict monotonicity condition on $F$. As a result, the proposed neural network can be directly used to solve constrained monotone variational inequality problems and related optimization problems, including nonlinear convex programming problems. Numerical examples also show that the proposed neural network has a faster convergence rate than the modified projection-type method.

2 An Extended Projection Neural Network

In this section, problem 1.3 is equivalently reformulated as a set of nonlinear equations. An extended projection neural network model for solving problem 1.3 is then presented, and a comparison is given.

2.1 Problem Reformulation. It can be seen from the optimization literature (Marcotte, 1991; Bertsekas, 1989) that the Karush-Kuhn-Tucker (KKT) condition for problem 1.3 can be written as

$$y \ge 0, \quad g(x) \le 0, \quad l \le x \le h, \quad Ax = b, \quad y^T g(x) = 0,$$

and

$$(F(x) + \nabla g(x) y - A^T z)_i \begin{cases} \ge 0 & \text{if } x_i = l_i \\ = 0 & \text{if } l_i < x_i < h_i \\ \le 0 & \text{if } x_i = h_i, \end{cases}$$

where $y \in \mathbb{R}^m$, $z \in \mathbb{R}^r$, $\nabla g(x) = (\nabla g_1(x), \ldots, \nabla g_m(x))$, and $\nabla g_i(x)$ is the gradient of $g_i$ for $i = 1, \ldots, m$. According to the projection theorem (Kinderlehrer & Stampacchia, 1980), the above KKT condition can be equivalently represented as

$$\begin{cases} x = P_X(x - (F(x) + \nabla g(x) y - A^T z)) \\ y = (y + g(x))^+ \\ Ax = b, \end{cases} \tag{2.1}$$

where $(y)^+ = [(y_1)^+, \ldots, (y_m)^+]^T$ with $(y_i)^+ = \max\{0, y_i\}$. Thus, the following lemma holds.

Lemma 1. Assume that $\nabla F(x)$ is positive semidefinite. Then $x^*$ is a solution to problem 1.3 if and only if there exist $y^* \in \mathbb{R}^m$ and $z^* \in \mathbb{R}^r$ such that $(x^*, y^*, z^*)$ satisfies equation 2.1.
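The equation $y = (y + g(x))^+$ in system 2.1 encodes exactly the complementarity part of the KKT condition ($y \ge 0$, $g(x) \le 0$, $y^T g(x) = 0$), which a short numerical check illustrates; the example values are our own:

```python
import numpy as np

def is_fixed_point(y, g_x, tol=1e-10):
    """Check the fixed-point equation y = (y + g(x))^+ componentwise."""
    return bool(np.all(np.abs(y - np.maximum(0.0, y + g_x)) <= tol))

def satisfies_complementarity(y, g_x, tol=1e-10):
    """Check y >= 0, g(x) <= 0, and y^T g(x) = 0."""
    return bool(np.all(y >= -tol) and np.all(g_x <= tol)
                and abs(float(y @ g_x)) <= tol)

# An active constraint (g = 0, y > 0) and an inactive one (g < 0, y = 0):
y_ok, g_ok = np.array([1.5, 0.0]), np.array([0.0, -2.0])
# A violating pair (y > 0 but g < 0, so y^T g != 0):
y_bad, g_bad = np.array([1.0]), np.array([-1.0])
```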
2.2 Neural Dynamical Equations. To solve an optimization problem by neural computation, the key is to cast the problem into the architecture of a recurrent neural network whose stable state represents the solution to the optimization problem. Based on the equivalent formulation 2.1, we propose an extended projection neural network for solving problem 1.3 as follows.

State equation:

$$\frac{du}{dt} = \lambda \begin{pmatrix} P_X(x - (F(x) + \nabla g(x) y - A^T z)) - x \\ (y + g(x))^+ - y \\ -Ax + b \end{pmatrix}. \tag{2.2}$$

Output equation: $w(t) = x(t)$, where $u = (x, y, z) \in \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^r$ is a state vector, $w$ is an output vector, and $\lambda > 0$ is a scaling parameter. The state equation and the output equation can be implemented by a recurrent neural network with a simple structure, as shown in Figure 1, where $\lambda = 1$. The circuit realizing the proposed neural network consists of $n + m + r$ integrators, $m + n$ piecewise activation functions for $P_X(x)$ and $(y)^+$, $n$ processors for $F(x)$, $m$ processors for $g(x)$, $m$ processors for $\nabla g(x)$, $2mn$ connection weights for $A$ and $A^T$, and some summers. Therefore, the network complexity depends only on that of the mappings $F(x)$ and $g(x)$.

2.3 A Comparison. The extended projection neural network is a significant extension of the existing projection neural network. One major difference between the projection neural network in equation 1.1 and the extended projection neural network in equation 2.2 lies essentially in the treatment of different constraints. The projection neural network in equation 1.1 is used to solve problem 1.2 with box constraints, while the extended projection neural network in equation 2.2 is used to solve problem 1.3 with linear and nonlinear constraints. In the special case that $\Omega = X$, $g(x) = 0$, $A = O$, and $b = 0$, the proposed neural network in equation 2.2 reduces to the projection neural network in equation 1.1. In addition, when the projection neural network is applied directly to problem 1.3, it is expressed as

$$\frac{dx}{dt} = \lambda \{ P_\Omega(x - F(x)) - x \},$$

where $P_\Omega(u)$ is given by

$$P_\Omega(x) = \arg\min_{v \in \Omega} \|x - v\|,$$
where k ¢ k denotes l2 norm. Since the set Ä is not a box or sphere, PÄ .u/ cannot be expressed explicitly. As a result, the above projection neural network
Y. Xia
Figure 1: A block diagram of the extended projection neural network in equation 2.2.
cannot be used to solve problem 1.3. In contrast, the extended projection neural network can be used to solve nonlinear convex programming problems. In fact, consider the case that $F(x) = \nabla f(x)$, where $f(x)$ is a twice differentiable convex function and $F(x)$ is its gradient. It then follows from the optimization literature (Kinderlehrer & Stampacchia, 1980) that problem 1.3 becomes the well-known general nonlinear programming problem

$$\text{minimize } f(x) \quad \text{subject to } g(x) \le 0, \; Ax = b, \; x \in X, \tag{2.3}$$

where $g(x)$, $A$, and $b$ are defined in section 2. The extended projection neural network in equation 2.2 then becomes

$$\frac{d}{dt}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \lambda \begin{pmatrix} -x + P_X(x - (\nabla f(x) + \nabla g(x)y - A^T z)) \\ -y + (y + g(x))^+ \\ -Ax + b \end{pmatrix}, \tag{2.4}$$

where $x \in R^n$, $y \in R^m$, and $z \in R^r$. Compared with the Kennedy-Chua neural network for solving equation 2.3, the neural network in equation 2.4 has no penalty parameter and can globally converge to an exact optimal solution
An Extended Projection Neural Network
of equation 2.3. Moreover, unlike the switched-capacitor neural networks for solving equation 2.3, the neural network in equation 2.4 allows the optimal solution of equation 2.3 to lie on the boundary of the feasible region.

Finally, we compare the proposed neural network with existing numerical optimization methods. Note that when $F(x) = \nabla f(x)$ does not hold, numerical methods for linear and nonlinear programming, such as the gradient method and the penalty function method, cannot be used to solve problem 1.3. From the viewpoint of both computational implementation and required conditions, the modified projection-type method (Solodov & Tseng, 1996) is regarded as the best existing numerical method for constrained monotone variational inequalities. We therefore compare the proposed neural network with the modified projection-type method in terms of algorithm complexity and convergence conditions. The modified projection-type method is defined as

$$u^{(k+1)} = u^{(k)} - \beta h(u^{(k)})\{w^{(k)} + \alpha G(P_{\bar\Omega}(w^{(k)})) - P_{\bar\Omega}(w^{(k)})\},$$

where $\alpha > 0$ and $\beta > 0$ are scaling parameters, $w = u - \alpha G(u)$, $u = (x, y, z) \in R^{n+m+r}$, $P_{\bar\Omega}(u) = [P_X(x), (y)^+, z]^T$ with $P_X(x)$ and $(y)^+$ defined in equation 2.1, and

$$G(u) = \begin{pmatrix} F(x) + \nabla g(x)y - A^T z \\ -g(x) \\ Ax - b \end{pmatrix}, \qquad h(u) = \frac{\|u - P_{\bar\Omega}(w)\|^2}{\|w + \alpha G(P_{\bar\Omega}(w)) - P_{\bar\Omega}(w)\|^2}.$$

On one hand, the modified projection-type method is not well suited to parallel implementation because of the varying step length $h(u)$: it must compute the nonlinear terms $P_{\bar\Omega}(w)$, $G(P_{\bar\Omega}(w))$, and $h(u)$ at each iteration, whereas the proposed neural network requires only $P_{\bar\Omega}(w)$ per iteration and thus has lower complexity. On the other hand, the global convergence of the modified projection-type method is guaranteed only when the Lipschitz condition

$$(u - v)^T (G(u) - G(v)) \le \beta \|u - v\|^2, \quad \forall u, v \in R^{n+m+r},$$

and $\alpha < 1/\beta$ are satisfied. The proposed neural network will be shown to converge exponentially to an exact solution of problem 1.3 without requiring the Lipschitz condition or $\alpha < 1/\beta$. It is worth pointing out that when $g(x)$ is nonlinear, the Lipschitz condition is not satisfied even if $F(x)$ and $g(x)$ are both Lipschitz continuous. As a result, the global convergence of the projection-type method cannot be guaranteed in such cases.

3 Exponential Convergence

In this section, we study the exponential convergence of the extended projection neural network. Before going further, we give two definitions and a lemma.
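Before proceeding to the analysis, the modified projection-type iteration described in section 2.3 can be made concrete in code. The following is a toy illustration on a box-constrained variational inequality, where the projection reduces to a componentwise clip; the test mapping $G(u) = u - 1$, the box $[0, 2]^3$, and the parameter values are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def modified_projection_solve(G, proj, u0, alpha, beta, iters):
    # Modified projection-type method (Solodov & Tseng, 1996) as written above:
    #   w_k     = u_k - alpha * G(u_k)
    #   u_{k+1} = u_k - beta * h(u_k) * (w_k + alpha*G(P(w_k)) - P(w_k)),
    # with varying step length h(u) = ||u - P(w)||^2 / ||w + alpha*G(P(w)) - P(w)||^2.
    u = np.asarray(u0, dtype=float)
    for _ in range(iters):
        w = u - alpha * G(u)
        Pw = proj(w)
        r = w + alpha * G(Pw) - Pw
        h = np.dot(u - Pw, u - Pw) / np.dot(r, r)
        u = u - beta * h * r
    return u

# Toy monotone VI: G(u) = u - 1 on the box [0, 2]^3; the solution is u* = (1, 1, 1).
G = lambda u: u - 1.0
proj = lambda v: np.clip(v, 0.0, 2.0)  # box projection, explicit here
u = modified_projection_solve(G, proj, [1.8, 0.2, 1.5], alpha=0.5, beta=0.5, iters=40)
```

Note the two evaluations of $G$ and the data-dependent step length $h(u)$ per iteration, which is what makes this method awkward to parallelize compared with the single projection per step of the neural network in equation 2.2.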
Definition 1. The Jacobian matrix $\nabla F(x)$ of $F$ is said to be positive semidefinite on $X$ if

$$v^T \nabla F(x) v \ge 0, \quad \forall x \in X, \; v \in R^n.$$

$\nabla F(x)$ is positive definite if the strict inequality holds whenever $v \ne 0$, and uniformly positive definite if there is $\delta > 0$ such that

$$v^T \nabla F(x) v \ge \delta \|v\|_2^2, \quad \forall x \in X, \; v \in R^n,$$

where $\|\cdot\|_2$ denotes the $l_2$ norm.

Definition 2. Let $x^*$ be a solution of problem 1.3. The extended projection neural network is said to be exponentially convergent to $x^*$ if its output trajectory $w(t)$ satisfies

$$\|w(t) - x^*\|_2 \le c_0 e^{-\eta(t - t_0)}, \quad \forall t \ge t_0,$$

where $c_0$ is a positive constant dependent on the initial point and $\eta$ is a positive constant independent of the initial point.

Lemma 2. (i) The state equation 2.2 has at least one equilibrium point. Moreover, if $(x^*, y^*, z^*)$ is an equilibrium point of equation 2.2, then $x^*$ is a solution of problem 1.3. (ii) For any initial point, there exists a unique continuous solution of equation 2.2. (iii) Let $u(t) = (x(t), y(t), z(t))$ be the state trajectory of equation 2.2 with initial point $u(t_0) = (x(t_0), y(t_0), z(t_0))$. If $x(t_0) \in X$ and $y(t_0) \ge 0$, then $x(t) \in X$ and $y(t) \ge 0$.

Proof. (i) Since problem 1.3 has a solution, we see by lemma 1 that there exist $y^*$ and $z^*$ such that $(x^*, y^*, z^*)$ is an equilibrium point of equation 2.2. Conversely, if $(x^*, y^*, z^*)$ is an equilibrium point of equation 2.2, then by lemma 1, $x^*$ is a solution of problem 1.3. (ii) Note that $F(x)$, $\nabla g(x)$, $P_X(x)$, and $(y)^+$ are locally Lipschitz continuous. Then both $P_X(x - (F(x) + \nabla g(x)y - A^T z))$ and $(y + g(x))^+$ are locally Lipschitz continuous, and hence so is the right-hand side of equation 2.2. By the local existence theorem for ordinary differential equations (Slotine & Li, 1991), there exists a unique continuous solution $u(t)$ of equation 2.2 on $[t_0, T)$. Moreover, it will be seen in theorem 1 that the solution $u(t)$ is bounded, and thus $T = +\infty$. (iii) Since
$$\begin{cases} dx/dt + x = P_X(x - (F(x) + \nabla g(x)y - A^T z)) \\ dy/dt + y = (y + g(x))^+, \end{cases}$$

we have

$$\begin{cases} \int_{t_0}^{t} (dx/dt + x) e^s \, ds = \int_{t_0}^{t} e^s P_X(x - (F(x) + \nabla g(x)y - A^T z)) \, ds \\ \int_{t_0}^{t} (dy/dt + y) e^s \, ds = \int_{t_0}^{t} e^s (y + g(x))^+ \, ds. \end{cases}$$

It follows that

$$\begin{cases} x(t) = e^{-(t - t_0)} x(t_0) + e^{-t} \int_{t_0}^{t} e^s P_X(x - (F(x) + \nabla g(x)y - A^T z)) \, ds \\ y(t) = e^{-(t - t_0)} y(t_0) + e^{-t} \int_{t_0}^{t} e^s (y + g(x))^+ \, ds. \end{cases}$$

Since $(y + g(x))^+ \ge 0$ and $y(t_0) \ge 0$, we have $y(t) \ge 0$. Furthermore, by the integral mean value theorem, the first equation above can be written as

$$x(t) = e^{-(t - t_0)} x(t_0) + e^{-t} P_X(z_0) \int_{t_0}^{t} e^s \, ds = e^{-(t - t_0)} x(t_0) + (1 - e^{-(t - t_0)}) P_X(z_0),$$

where $z_0 = x(s_0) - (F(x(s_0)) + \nabla g(x(s_0)) y(s_0) - A^T z(s_0))$ and $t_0 \le s_0 \le t$. Since $P_X(z_0) \in X$ and $x(t_0) \in X$, and $x(t)$ is a convex combination of these two points, $x(t) \in X$.

We now establish the main result on the exponential convergence of the extended projection neural network.

Theorem 1. Assume that $\nabla F(x)$ is positive definite on $\Omega$. Then the extended projection neural network with any initial point $u(t_0) = (x(t_0), y(t_0), z(t_0)) \in X \times R^m_+ \times R^r$ is stable in the sense of Lyapunov and converges exponentially to $x^*$.

Proof.
Define the set of equilibrium points of equation 2.2 by

$$E_0 = \{u^* = (x^*, y^*, z^*) \mid T(u^*) = 0\},$$

where

$$T(u) = \begin{pmatrix} P_X(x - (F(x) + \nabla g(x)y - A^T z)) - x \\ (y + g(x))^+ - y \\ -Ax + b \end{pmatrix}.$$

We first show that $x^*$ is unique. Assume that $u_1 = (x_1, y_1, z_1) \in E_0$ and $u_2 = (x_2, y_2, z_2) \in E_0$. From lemma 1, it follows that $x_1$ and $x_2$ are solutions to problem 1.3. Then

$$(x - x_1)^T F(x_1) \ge 0, \; \forall x \in \Omega \quad \text{and} \quad (x - x_2)^T F(x_2) \ge 0, \; \forall x \in \Omega.$$

Thus,

$$\begin{cases} (x_2 - x_1)^T F(x_1) \ge 0 \\ (x_1 - x_2)^T F(x_2) \ge 0. \end{cases}$$

Adding these gives

$$(x_1 - x_2)^T (F(x_2) - F(x_1)) \ge 0.$$

Since $\nabla F(x)$ is positive definite, $F(x)$ is strictly monotone, so this inequality can hold only if $x_1 = x_2$.

We now show that the output trajectory of the proposed projection neural network converges exponentially to $x^*$. Denote the state trajectory of equation 2.2 by $u(t)$ with initial point $u(t_0)$. By lemma 2, $x(t) \in X$ and $y(t) \ge 0$. Consider the following Lyapunov function,

$$V(u) = -G(u)^T T(u) - \frac{1}{2}\|T(u)\|_2^2 + \frac{1}{2}\|u - u^*\|_2^2,$$

where $u^* = (x^*, y^*, z^*)$ is an equilibrium point of equation 2.2 and

$$G(u) = \begin{pmatrix} F(x) + \nabla g(x)y - A^T z \\ -g(x) \\ Ax - b \end{pmatrix}.$$

Following the analysis in Xia et al. (2002), in the projection inequality

$$(v - P_{\Omega_0}(v))^T (P_{\Omega_0}(v) - u) \ge 0, \quad v \in R^{n+m+r}, \; u \in \Omega_0,$$

where $\Omega_0 = X \times R^m_+ \times R^r$ and $P_{\Omega_0}(u) = [P_X(x), (y)^+, z]^T$, we let $v = u - G(u)$. Then

$$(u - G(u) - P_{\Omega_0}(u - G(u)))^T (P_{\Omega_0}(u - G(u)) - u) \ge 0;$$

that is,

$$-G(u)^T \{P_{\Omega_0}(u - G(u)) - u\} \ge \|P_{\Omega_0}(u - G(u)) - u\|_2^2.$$

Note that $T(u) = P_{\Omega_0}(u - G(u)) - u$. It follows that

$$-G(u)^T T(u) \ge \|T(u)\|_2^2$$
and

$$V(u) \ge \frac{1}{2}\|u - u^*\|_2^2.$$

Similarly, we have

$$\frac{dV(u)}{dt} \le -G(u)^T (u - u^*) - T(u)^T \nabla G(u) T(u),$$

where

$$\nabla G(u) = \begin{pmatrix} \nabla F(x) + \sum_{i=1}^{m} \nabla^2 g_i(x) y_i & \nabla g(x) & -A^T \\ -\nabla g(x)^T & O_1 & O_2 \\ A & O_3 & O_4 \end{pmatrix},$$

and $O_1 \in R^{m \times m}$, $O_2 \in R^{m \times r}$, $O_3 \in R^{r \times m}$, and $O_4 \in R^{r \times r}$ are zero matrices. Note that $y_i(t) \ge 0$ and $\nabla^2 g_i(x)$ is positive semidefinite. Then $\nabla G(u)$ is positive semidefinite and $T(u)^T \nabla G(u) T(u) \ge 0$. It follows that

$$\frac{dV(u)}{dt} \le -G(u)^T (u - u^*),$$

and hence

$$V(u(t)) \le V(u(t_0)) - \int_{t_0}^{t} G(u(s))^T (u(s) - u^*)\, ds$$

and

$$\|u(t) - u^*\|_2^2 \le 2V(u(t_0)) - 2\int_{t_0}^{t} G(u(s))^T (u(s) - u^*)\, ds.$$

Note that $u^*$ satisfies

$$(u - u^*)^T G(u^*) \ge 0, \quad \forall u \in X \times R^m_+ \times R^r.$$

Then

$$G(u)^T (u - u^*) \ge (G(u) - G(u^*))^T (u - u^*).$$

Thus,

$$\|u(t) - u^*\|_2^2 \le 2V(u(t_0)) - 2\int_{t_0}^{t} (G(u(s)) - G(u^*))^T (u(s) - u^*)\, ds.$$
On one hand, using

$$(G(u) - G(u^*))^T (u - u^*) = \int_0^1 (u - u^*)^T \nabla G(u^* + s(u - u^*))(u - u^*)\, ds = \int_0^1 (u - u^*)^T \nabla G(\hat u)(u - u^*)\, ds,$$

where $\hat u = u^* + s(u - u^*)$, we get

$$(G(u) - G(u^*))^T (u - u^*) = \int_0^1 (x - x^*)^T \Big(\nabla F(\hat x) + \sum_{i=1}^{m} \nabla^2 g_i(\hat x)\hat y_i\Big)(x - x^*)\, ds + \int_0^1 (x - x^*)^T \nabla g(\hat x)(y - y^*)\, ds - \int_0^1 (y - y^*)^T \nabla g(\hat x)^T (x - x^*)\, ds$$
$$- \int_0^1 (x - x^*)^T A^T (z - z^*)\, ds + \int_0^1 (z - z^*)^T A (x - x^*)\, ds = \int_0^1 (x - x^*)^T \Big(\nabla F(\hat x) + \sum_{i=1}^{m} \nabla^2 g_i(\hat x)\hat y_i\Big)(x - x^*)\, ds,$$

where $\hat u = (\hat x, \hat y, \hat z)$. On the other hand, note that $dV(u)/dt \le 0$ implies

$$u(t) \subset S = \{u \in X \times R^m_+ \times R^r \mid V(u) \le V(u(t_0))\},$$

and $V(u_k) \to +\infty$ whenever $\|u_k\|_2 \to +\infty$. Hence $\{u(t) = (x(t), y(t), z(t))\}$ and $S$ are bounded. Let $S_1 = \{x \mid x = x^* + s(x(t) - x^*),\; 0 \le s \le 1,\; t \ge t_0\}$. Then $S_1 \subset S$. Since $\nabla F(x)$ is positive definite on $S$,

$$\hat v^T \nabla F(x)\hat v > 0, \quad \forall x \in S,\; \|\hat v\|_2 = 1.$$

Since $f(x) = \hat v^T \nabla F(x)\hat v$ is continuous on the bounded set $S$, there exists $\delta > 0$ such that

$$v^T \nabla F(x)v \ge \delta\|v\|_2^2, \quad \forall x \in S,\; \forall v \in R^n.$$

Note that $x^* + s(x(t) - x^*) \in S_1$. Then for all $t \ge t_0$,

$$(x(t) - x^*)^T \nabla F(x^* + s(x(t) - x^*))(x(t) - x^*) \ge \delta\|x(t) - x^*\|_2^2.$$

It follows that

$$\|x(t) - x^*\|_2^2 \le 2V(u(t_0)) - \beta_1 \int_{t_0}^{t} \|x(s) - x^*\|_2^2\, ds,$$

where $\beta_1 = 2\lambda\delta$. By the Bellman-Gronwall inequality (Slotine & Li, 1991), we obtain

$$\|w(t) - x^*\|_2 = \|x(t) - x^*\|_2 \le \sqrt{2V(u(t_0))}\, e^{-\beta_1(t - t_0)/2}, \quad \forall t \ge t_0.$$
Therefore, the extended projection neural network converges exponentially to $x^*$.

Remark 1. According to theorem 1, the output trajectory of the projection neural network can converge to a solution with any given precision $\epsilon > 0$ within a finite time. In fact, requiring

$$\|w(t) - x^*\|_2 \le \sqrt{2V(u(t_0))}\, e^{-\beta_1(t - t_0)/2} < \epsilon$$

amounts to

$$e^{\beta_1(t - t_0)/2} > \sqrt{2V(u(t_0))}/\epsilon,$$

and then

$$t - t_0 > \frac{2}{\beta_1}\ln\!\big(\sqrt{2V(u(t_0))}/\epsilon\big).$$

Thus, $\|w(t) - x^*\|_2 < \epsilon$, provided that

$$t \ge t_0 + \frac{2}{\beta_1}\ln\!\big(\sqrt{2V(u(t_0))}/\epsilon\big).$$
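As a purely illustrative numeric reading of this bound, suppose (hypothetical values of ours, not from the paper) that $V(u(t_0)) = 10$, $\beta_1 = 2\lambda\delta = 2$, and the target precision is $\epsilon = 10^{-4}$:

```python
import math

# Hypothetical values for illustration only.
V0 = 10.0      # V(u(t0))
beta1 = 2.0    # beta1 = 2 * lambda * delta
eps = 1e-4     # target precision epsilon

# Time after t0 by which ||w(t) - x*||_2 < eps is guaranteed by the bound above.
t_bound = (2.0 / beta1) * math.log(math.sqrt(2.0 * V0) / eps)
print(t_bound)
```

So the trajectory is guaranteed to be within $10^{-4}$ of the solution roughly 10.7 time units after $t_0$; increasing $\lambda$ (and hence $\beta_1$) shrinks this bound proportionally.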
Furthermore, if the scaling parameter $\lambda$ is chosen large enough in advance, the output trajectory $w(t)$ can reach the solution of problem 1.3 within a finite time. In fact, assume the initial point satisfies $x(t_0) \ne x^*$. Then there exist $\tau > 0$ and $\mu > 0$ such that

$$\int_{t_0}^{t_0 + \tau} \|x(s) - x^*\|_2^2\, ds \ge \mu\tau > 0.$$

It follows that for any $t \ge t_0 + \tau$,

$$\|w(t) - x^*\|_2^2 \le 2V(u(t_0)) - \beta_1 \int_{t_0}^{t} \|x(s) - x^*\|_2^2\, ds \le 2V(u(t_0)) - \beta_1\mu\tau.$$

Taking $\lambda = V(u(t_0))/(\mu\tau\delta)$, we have

$$\|w(t) - x^*\|_2^2 \le 2V(u(t_0)) - 2V(u(t_0)) = 0.$$

Therefore, $w(t) = x^*$ when $t \ge t_0 + \tau$.

Remark 2. It should be noted that because $\nabla G(u)$ is in general only positive semidefinite, the existing asymptotic stability result (Xia et al., 2002) cannot be used to ascertain the convergence of the extended projection neural network. Moreover, unlike the existing exponential stability result for the projection neural network in equation 1.1, theorem 1 does not require $\nabla G(u)$ to be bounded and uniformly positive definite. Furthermore, compared with the existing convergence results for the modified projection-type method (Solodov & Tseng, 1996), our convergence result does not require the additional condition that $G(u)$ be Lipschitz continuous. Since nonlinearity of $g(x)$ usually implies that $G$ is not Lipschitz continuous, the global convergence of the modified projection-type method cannot be guaranteed in such cases. The illustrative examples below also show that the modified projection-type method may diverge if the scaling parameters $\alpha$ and $\beta$ are not small enough.

4 Simulation Examples

To demonstrate the effectiveness and efficiency of the proposed neural network, we implemented it in MATLAB to solve a nonlinear programming problem and two constrained variational inequality problems. The ordinary differential equation solver used is ode23s. We compared the performance of this implementation with the gradient method, the penalty function method, and the modified projection-type method, respectively.

4.1 Example 1. Consider the following convex programming problem:

$$\text{minimize } f(x) = (x_1 - 2)^2 + (x_2 - 1)^2 \quad \text{subject to } \begin{cases} 0.25x_1^2 + x_2^2 \le 1 \\ x_1 - 2x_2 = -1. \end{cases} \tag{4.1}$$
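As a quick cross-check of problem 4.1 (a sketch of ours assuming SciPy's SLSQP solver is available; it is not part of the paper's MATLAB experiments), a general-purpose solver recovers the same optimum:

```python
import numpy as np
from scipy.optimize import minimize

res = minimize(
    lambda x: (x[0] - 2.0)**2 + (x[1] - 1.0)**2,   # objective f(x)
    x0=[0.0, 0.0],
    method="SLSQP",
    constraints=[
        # inequality written as fun(x) >= 0, i.e. 0.25*x1^2 + x2^2 <= 1
        {"type": "ineq", "fun": lambda x: 1.0 - 0.25 * x[0]**2 - x[1]**2},
        # equality: x1 - 2*x2 = -1
        {"type": "eq", "fun": lambda x: x[0] - 2.0 * x[1] + 1.0},
    ],
)
x_star = np.array([0.5 * (np.sqrt(7.0) - 1.0), 0.25 * (np.sqrt(7.0) + 1.0)])
print(res.x, x_star)
```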
This problem has a unique optimal solution $x^* = (0.5(\sqrt{7} - 1),\; 0.25(\sqrt{7} + 1))$. It can be seen that $\nabla f(x) = [2(x_1 - 2), 2(x_2 - 1)]^T$, $g(x) = 0.25x_1^2 + x_2^2 - 1$, $\nabla g(x) = [0.5x_1, 2x_2]^T$, $A = [1, -2]$, $b = -1$, and

$$\nabla^2 f(x) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

is positive definite. When applied to problem 4.1, the extended projection neural network in equation 2.4 becomes

$$\frac{d}{dt}\begin{pmatrix} x_1 \\ x_2 \\ y_1 \\ z_1 \end{pmatrix} = \lambda \begin{pmatrix} -2(x_1 - 2) - 0.5x_1 y_1 + z_1 \\ -2(x_2 - 1) - 2x_2 y_1 - 2z_1 \\ (y_1 + 0.25x_1^2 + x_2^2 - 1)^+ - y_1 \\ -x_1 + 2x_2 - 1 \end{pmatrix}. \tag{4.2}$$

All simulation results show that the neural network in equation 4.2 converges to $x^*$ within a finite time from any initial point. For example, Figure 2 displays the trajectories of $x(t) = (x_1(t), x_2(t))$ from 10 different initial points. For comparison, we also solved this example with the gradient method and the penalty function method. Let $t = 10$ and $\lambda = 3$, and let the penalty parameter be 1000. The computational results are listed in Table 1, where accuracy is defined by the $l_2$-norm error $\|x - x^*\|_2$. From Table 1, we see that the proposed neural network gives a better solution than the gradient method and the penalty function method.
Figure 2: The transient behavior of $(x_1(t), x_2(t))$ based on equation 4.2 with 10 different initial points in example 1.
4.2 Example 2. Consider the linearly constrained variational inequality 1.3, where

$$F(x) = \begin{pmatrix} 5x_1 + (x_1 + 2)^2 + x_2 + x_3 + 10 \\ 5x_1 + 3x_2^2 + 10x_2 + 3x_3 + 10 \\ 10(x_1 + 2)^2 + 8x_2^2 + 4x_3 + 3x_3^2 \end{pmatrix}$$

and $\Omega = \{x \in R^3 \mid x_1 + x_2 + x_3 \ge 4,\; x \ge 0\}$. Then $X = \{x \in R^3 \mid x \ge 0\}$, $A = O$, $b = 0$, $g(x) = -x_1 - x_2 - x_3 + 4$, and $\nabla g(x) = [-1, -1, -1]^T$. This problem has a unique solution $x^* = [2.5, 1.5, 0]^T$. When applied to this problem, the extended projection neural network in equation 2.2 becomes

$$\frac{d}{dt}\begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} (x - (F(x) + \nabla g(x)y))^+ - x \\ (y + g(x))^+ - y \end{pmatrix}, \tag{4.3}$$

where $x \in R^3$ and $y \in R$. All simulation results show that the neural network in equation 4.3 is globally convergent to $x^*$ from any initial point within a finite time. For example, let $\lambda = 40$. Figure 3 displays the convergence behavior of the $l_2$-norm error $\|x(t) - x^*\|_2$ based on equation 4.3 with 20 random initial points. Next, we compare the proposed method with the modified projection-type method. Table 2 displays results obtained by the proposed method, and Table 3 displays results obtained by the modified projection-type method, where accuracy is defined by the $l_2$-norm error. From Tables 2 and 3, we see that the proposed method not only gives a better solution but also converges faster than the modified projection-type method. Moreover, Table 3 shows that the modified projection-type method may diverge if the scaling parameters $\alpha$ and $\beta$ are not small enough.

4.3 Example 3. Consider the nonlinearly constrained variational inequality 1.3, where $\Omega = \{x \in R^{10} \mid g(x) \le 0,\; x \in X\}$, $X = \{x \in R^{10} \mid x \ge 0\}$, $g(x) = [g_1(x), \ldots, g_8(x)]^T$,

$$F(x) = [2x_1 + x_2 - 14,\; 2x_2 + x_1 - 16,\; 2(x_3 - 10),\; 8(x_4 - 5),\; 2(x_5 - 3),\; 4(x_6 - 1),\; 10x_7,\; 14(x_8 - 11),\; 4(x_9 - 10),\; 2(x_{10} - 7)]^T,$$

and

$$\begin{cases} g_1(x) = 3(x_1 - 2)^2 + 4(x_2 - 3)^2 + 2x_3^2 - 7x_4 - 120 \\ g_2(x) = 5x_1^2 + (x_3 - 6)^2 + 8x_2 - 2x_4 - 40 \\ g_3(x) = (x_1 - 8)^2/2 + 2(x_2 - 4)^2 + 3x_5^2 - x_6 - 30 \\ g_4(x) = x_1^2 + 2(x_2 - 2)^2 - 2x_1 x_2 + 14x_5 - 6x_6 \\ g_5(x) = 4x_1 + 5x_2 - 3x_7 + 9x_8 - 105 \\ g_6(x) = 10x_1 - 8x_2 - 17x_7 + 2x_8 \\ g_7(x) = 12(x_9 - 8)^2 - 3x_1 + 6x_2 - 7x_{10} \\ g_8(x) = -8x_1 + 2x_2 + 5x_9 - 2x_{10} - 12. \end{cases}$$
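As a numerical sanity check of this example (a sketch of ours in Python, assuming NumPy and SciPy rather than the paper's MATLAB setup), the network of equation 2.2 specialized to this problem — here $A = O$, $b = 0$, and $P_X$ is the projection onto the nonnegative orthant — can be integrated directly; `x_star` below is the solution quoted in the text.

```python
import numpy as np
from scipy.integrate import solve_ivp

def F(x):
    return np.array([
        2*x[0] + x[1] - 14, 2*x[1] + x[0] - 16, 2*(x[2] - 10), 8*(x[3] - 5),
        2*(x[4] - 3), 4*(x[5] - 1), 10*x[6], 14*(x[7] - 11), 4*(x[8] - 10),
        2*(x[9] - 7),
    ])

def g(x):
    return np.array([
        3*(x[0] - 2)**2 + 4*(x[1] - 3)**2 + 2*x[2]**2 - 7*x[3] - 120,
        5*x[0]**2 + (x[2] - 6)**2 + 8*x[1] - 2*x[3] - 40,
        (x[0] - 8)**2 / 2 + 2*(x[1] - 4)**2 + 3*x[4]**2 - x[5] - 30,
        x[0]**2 + 2*(x[1] - 2)**2 - 2*x[0]*x[1] + 14*x[4] - 6*x[5],
        4*x[0] + 5*x[1] - 3*x[6] + 9*x[7] - 105,
        10*x[0] - 8*x[1] - 17*x[6] + 2*x[7],
        12*(x[8] - 8)**2 - 3*x[0] + 6*x[1] - 7*x[9],
        -8*x[0] + 2*x[1] + 5*x[8] - 2*x[9] - 12,
    ])

def nabla_g(x):
    # 10 x 8 matrix whose columns are the gradients of g_1, ..., g_8.
    J = np.zeros((8, 10))
    J[0, [0, 1, 2, 3]] = [6*(x[0] - 2), 8*(x[1] - 3), 4*x[2], -7]
    J[1, [0, 1, 2, 3]] = [10*x[0], 8, 2*(x[2] - 6), -2]
    J[2, [0, 1, 4, 5]] = [x[0] - 8, 4*(x[1] - 4), 6*x[4], -1]
    J[3, [0, 1, 4, 5]] = [2*x[0] - 2*x[1], 4*(x[1] - 2) - 2*x[0], 14, -6]
    J[4, [0, 1, 6, 7]] = [4, 5, -3, 9]
    J[5, [0, 1, 6, 7]] = [10, -8, -17, 2]
    J[6, [0, 1, 8, 9]] = [-3, 6, 24*(x[8] - 8), -7]
    J[7, [0, 1, 8, 9]] = [-8, 2, 5, -2]
    return J.T

def rhs(t, u, lam=8.0):
    x, y = u[:10], u[10:]
    dx = np.maximum(x - (F(x) + nabla_g(x) @ y), 0.0) - x  # P_X: projection onto x >= 0
    dy = np.maximum(y + g(x), 0.0) - y
    return lam * np.concatenate([dx, dy])

sol = solve_ivp(rhs, (0.0, 4.0), np.ones(18), method="LSODA", rtol=1e-8, atol=1e-8)
x_star = np.array([2.172, 2.364, 8.774, 5.096, 0.991, 1.431, 1.321, 9.829, 8.280, 8.376])
print(np.linalg.norm(sol.y[:10, -1] - x_star))
```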
Figure 3: Convergence behavior of the norm error $\|x(t) - x^*\|_2$ based on equation 4.3 with 20 random initial points in example 2.
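The same kind of numerical check applies to example 2: the sketch below (ours, assuming NumPy and SciPy) integrates equation 4.3 from one fixed initial point and verifies convergence to $x^* = [2.5, 1.5, 0]^T$; the paper instead reports 20 random initial points in Figure 3.

```python
import numpy as np
from scipy.integrate import solve_ivp

lam = 40.0  # scaling parameter, as used for Figure 3

def F(x):
    return np.array([
        5*x[0] + (x[0] + 2)**2 + x[1] + x[2] + 10,
        5*x[0] + 3*x[1]**2 + 10*x[1] + 3*x[2] + 10,
        10*(x[0] + 2)**2 + 8*x[1]**2 + 4*x[2] + 3*x[2]**2,
    ])

def rhs(t, u):
    x, y = u[:3], u[3]
    g = -x[0] - x[1] - x[2] + 4.0                      # single inequality constraint
    grad_g = np.array([-1.0, -1.0, -1.0])
    dx = np.maximum(x - (F(x) + grad_g * y), 0.0) - x  # P_X: projection onto x >= 0
    dy = max(y + g, 0.0) - y
    return lam * np.append(dx, dy)

sol = solve_ivp(rhs, (0.0, 3.0), [1.0, 1.0, 1.0, 0.0],
                method="LSODA", rtol=1e-8, atol=1e-10)
print(sol.y[:3, -1])
```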
This problem has a unique solution (Charalambous, 1977):

$$x^* = [2.172, 2.364, 8.774, 5.096, 0.991, 1.431, 1.321, 9.829, 8.280, 8.376]^T.$$

We use the proposed neural network in equation 2.2 to solve this problem. All simulation results show that the trajectory of equation 2.2 converges to $u^* = (x^*, y^*)$ from any initial point. For example, let $\lambda = 8$. Figure 4 shows the transient behavior of $x(t)$ based on equation 2.2 with five random initial points. Next, we compare the proposed method with the modified projection-type method. Table 4 displays results obtained by the proposed method, and Table 5 displays results obtained by the modified projection-type method. From Tables 4 and 5, we see that the proposed method not only gives a better solution but also converges faster than the modified projection-type method. Moreover, Table 5 shows that the modified projection-type method may diverge if the scaling parameters $\alpha$ and $\beta$ are not small enough.

5 Conclusions

We have presented an extended projection neural network for solving nonlinearly constrained variational inequality problems. In contrast to the ex-
Figure 4: Transient behavior of $x(t)$ (components $x_1(t)$ through $x_{10}(t)$) based on equation 2.2 with five random initial points in example 3.
Table 1: Results for the Gradient Method, the Penalty Function Method, and the Proposed Method in Example 1.

Method            Gradient Method     Penalty Method      Proposed Method
Initial point     (-2, -2)            (-2, -2)            (-2, -2, -2, -2)
Iterations        29,813              28,296              82
CPU time (sec)    23.94               23.67               0.11
Solution          (2.0019, 1.0015)    (0.8252, 0.9120)    (0.8229, 0.9114)
l2-norm error     1.1825              0.0024              6.4 x 10^-6
Table 2: Results for the Proposed Method on the Linearly Constrained Variational Inequality in Example 2.

Initial Point    Parameters           Iterations    CPU Time (sec)    Accuracy
-(2, 2, 2, 2)    λ = 4.5, t = 15      522           2.33              0.0028
(2, 2, 2, 2)     λ = 4.5, t = 15      506           2.22              0.0032
Table 3: Results for the Projection-Type Method on the Linearly Constrained Variational Inequality in Example 2.

Initial Point    Parameters              Iterations    CPU Time (sec)    Accuracy
-(2, 2, 2, 2)    β = 0.5, α = 0.035      3723          7.02              0.004
-(2, 2, 2, 2)    β = 0.1, α = 0.05       13,432        25.2              0.004
-(2, 2, 2, 2)    β = 0.1, α = 0.06       +∞            +∞                +∞
-(2, 2, 2, 2)    β = 0.5, α = 0.04       +∞            +∞                +∞
(2, 2, 2, 2)     β = 1, α = 0.04         1649          3.07              0.004
(2, 2, 2, 2)     β = 0.7, α = 0.05       1922          3.59              0.004
(2, 2, 2, 2)     β = 0.1, α = 0.055      +∞            +∞                +∞
(2, 2, 2, 2)     β = 1, α = 0.05         +∞            +∞                +∞
Table 4: Results for the Proposed Method on the Nonlinearly Constrained Variational Inequality in Example 3, where e = [1, ..., 1]^T ∈ R^18.

Initial Point    Parameters        Iterations    CPU Time (sec)    Accuracy
e                λ = 4, t = 2      235           3.01              0.033
-e               λ = 4, t = 2      319           4.05              0.034
isting projection neural network, the extended projection neural network can be directly used to solve monotone variational inequality problems with linear and nonlinear constraints. Compared with the modified projection-type numerical method, the proposed neural network has lower complexity and is suitable for parallel implementation. More important, unlike the existing projection neural network and the modified projection-type method, the extended projection neural network is theoretically proven to be exponentially convergent to an exact solution of problem 1.3 under a strictly monotone condition on the mapping F. As a result, the proposed neural network not only converges exponentially but also can be directly
Table 5: Results for the Projection-Type Method on the Nonlinearly Constrained Variational Inequality in Example 3, where e = [1, ..., 1]^T ∈ R^18.

Initial Point    Parameters               Iterations    CPU Time (sec)    Accuracy
e                β = 0.1, α = 0.005       2194          14.45             0.1
e                β = 0.1, α = 0.005       14,339        71.4              0.08
e                β = 0.1, α = 0.05        +∞            +∞                +∞
e                β = 0.9, α = 0.0005      28,678        142.7             0.098
-e               β = 0.9, α = 0.0005      13,410        200               0.084
-e               β = 0.9, α = 0.0005      3559          17.71             0.1
-e               β = 0.9, α = 0.006       +∞            +∞                +∞
-e               β = 0.1, α = 0.05        +∞            +∞                +∞
used to solve constrained monotone variational inequality problems and related optimization problems, including nonlinear convex programming problems. Simulation results further show that the extended projection neural network can obtain a better solution and has a faster convergence rate than the gradient method, the penalty function method, and the modified projection-type method. Further investigations will be aimed at engineering applications of the extended projection neural network to control and signal processing.

References

Abe, S., Kawakami, J., & Hirasawa, K. (1992). Solving inequality constrained combinatorial optimization problems by the Hopfield neural networks. Neural Networks, 5, 663–670.
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming—Theory and algorithms (2nd ed.). New York: Wiley.
Bertsekas, D. P. (1989). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice Hall.
Bouzerdoum, A., & Pattison, T. R. (1993). Neural network for quadratic optimization with bound constraints. IEEE Transactions on Neural Networks, 4, 293–304.
Charalambous, C. (1977). Nonlinear least p-th optimization and nonlinear programming. Mathematical Programming, 12, 195–225.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.
Dupuis, P., & Nagurney, A. (1993). Dynamical systems and variational inequalities. Annals of Operations Research, 44, 19–42.
Ferris, M. C., & Pang, J. S. (1997). Engineering and economic applications of complementarity problems. SIAM Review, 39, 669–713.
Golden, R. M. (1996). Mathematical methods for neural network analysis and design. Cambridge, MA: MIT Press.
Harker, P. T., & Pang, J. S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms, and applications. Mathematical Programming, 48, 161–220.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Kennedy, M. P., & Chua, L. O. (1988). Neural networks for nonlinear programming. IEEE Transactions on Circuits and Systems, 35, 554–562.
Kinderlehrer, D., & Stampacchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press.
Lillo, W. E., Loh, M. H., & Hui, S. (1993). On solving constrained optimization problems with neural networks: A penalty method approach. IEEE Transactions on Neural Networks, 4, 931–940.
Marcotte, P. (1991). Application of Khobotov's algorithm to variational inequalities and network equilibrium problems. Information Systems and Operational Research, 29, 258–270.
Ohlsson, M., Peterson, C., & Soderberg, B. (1993). Neural networks for optimization problems with inequality constraints: The knapsack problem. Neural Computation, 5, 331–339.
Oja, E. (1992). Principal components, minor components and linear neural networks. Neural Networks, 5, 927–935.
Peterson, C., & Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3–22.
Pyne, I. B. (1956). Linear programming on an electronic analogue computer. Transactions of the American Institute of Electrical Engineers, 75, 139–143.
Rodríguez-Vázquez, A., Domínguez-Castro, R., Rueda, A., Huertas, J. L., & Sánchez-Sinencio, E. (1990). Nonlinear switched-capacitor "neural networks" for optimization problems. IEEE Transactions on Circuits and Systems, 37, 384–397.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–366). Cambridge, MA: MIT Press.
Sciavicco, L., & Siciliano, B. (2000). Modelling and control of robot manipulators. New York: Springer-Verlag.
Sevrani, F., & Abe, K. (2000). On the synthesis of brain-state-in-a-box neural models with application to associative memory. Neural Computation, 12, 451–472.
Slotine, J. J., & Li, W. (1991). Applied nonlinear control. Englewood Cliffs, NJ: Prentice Hall.
Solodov, M. V., & Tseng, P. (1996). Modified projection-type methods for monotone variational inequalities. SIAM Journal on Control and Optimization, 34, 1814–1830.
Tank, D. W., & Hopfield, J. J. (1986). Simple "neural" optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, 33, 533–541.
Xia, Y. S. (1996a). A new neural network for solving linear programming problems and its applications. IEEE Transactions on Neural Networks, 7, 525–529.
Xia, Y. S. (1996b). A new neural network for solving linear and quadratic programming problems. IEEE Transactions on Neural Networks, 7, 1544–1547.
Xia, Y. S., Leung, H., & Wang, J. (2002). A projection neural network and its application to constrained optimization problems. IEEE Transactions on Circuits and Systems—Part I, 49, 447–458.
Xia, Y. S., & Wang, J. (2001). A dual neural network for kinematic control of redundant robot manipulators. IEEE Transactions on Systems, Man and Cybernetics—Part B, 31, 147–154.
Zhang, S., Zhu, X., & Zou, L.-H. (1992). Second-order neural networks for constrained optimization. IEEE Transactions on Neural Networks, 3, 1021–1024.

Received April 28, 2003; accepted October 6, 2003.
LETTER
Communicated by Laurence Abbott
Spike-Timing-Dependent Plasticity: The Relationship to Rate-Based Learning for Models with Weight Dynamics Determined by a Stable Fixed Point

Anthony N. Burkitt
[email protected]

Hamish Meffin
[email protected]

David B. Grayden
[email protected]

The Bionic Ear Institute, East Melbourne, Victoria 3002, Australia
Experimental evidence indicates that synaptic modification depends on the timing relationship between the presynaptic inputs and the output spikes that they generate. In this letter, results are presented for models of spike-timing-dependent plasticity (STDP) whose weight dynamics is determined by a stable fixed point. Four classes of STDP are identified on the basis of the time extent of their input-output interactions. The effect on the potentiation of synapses with different rates of input is investigated to elucidate the relationship of STDP with classical studies of long-term potentiation and depression and rate-based Hebbian learning. The selective potentiation of higher-rate synaptic inputs is found only for models where the time extent of the input-output interactions is input restricted (i.e., restricted to time domains delimited by adjacent synaptic inputs) and that have a time-asymmetric learning window with a longer time constant for depression than for potentiation. The analysis provides an account of learning dynamics determined by an input-selective stable fixed point. The effect of suppressive interspike interactions on STDP is also analyzed and shown to modify the synaptic dynamics.

Neural Computation 16, 885–940 (2004)  © 2004 Massachusetts Institute of Technology

1 Introduction

Experimental evidence indicates that synaptic modification can depend on the timing of presynaptic inputs (excitatory postsynaptic potentials, EPSPs) and postsynaptic spikes (action potentials, APs) (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998). The relationship of such spike-timing-dependent plasticity (STDP) to classical studies of long-term potentiation (LTP) and long-term depression (LTD), as well as to previously studied models of Hebbian plasticity based on spiking rates (Bienenstock, Cooper, & Munro, 1982), is not well understood. Issues concerning competition between synapses and the stability of the learning algorithms have been a particular focus of attention (Miller & MacKay, 1994; Miller, 1996; Song, Miller, & Abbott, 2000; Abbott & Nelson, 2000; Rao & Sejnowski, 2001). The emergence of inhomogeneous distributions of synaptic strengths typically requires a competitive mechanism that causes some synapses to increase their strength and others to decrease theirs, while at the same time the synaptic modification should lead to a stable distribution of synaptic conductances that prevents their unconstrained growth. The majority of studies of STDP have assumed that changes in synaptic strengths are additive, that is, they do not scale with synaptic strength (Gerstner & van Hemmen, 1996; Abbott & Blum, 1996; Kempter, Gerstner, & van Hemmen, 1999, 2001; Roberts, 1999; Song et al., 2000; van Hemmen, 2001; Câteau & Fukai, 2003). These models typically exhibit strong competition between synapses and have a positive feedback instability, so that an upper bound on the synaptic strength must be imposed as a hard constraint to enforce stability. The learning dynamics (i.e., the evolution of the synaptic weights) of these additive STDP models is determined by an unstable fixed point (Kempter et al., 1999), which leads to the conductances displaying a bimodal distribution concentrated around zero and the upper bound. Consequently, the strong competition in these models leads to distributions of synaptic strengths that do not necessarily reflect patterns of activity in the input. Moreover, such a bimodal distribution does not reflect the experimentally observed unimodal distribution of synaptic strengths (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998).
However, these STDP models based on an unstable xed point of the learning dynamics have been successfully combined with delay selection during development to provide an account of both the acuity of sound localization (Gerstner & van Hemmen, 1996) and the development of temporal feature maps in the avian auditory system (Leibold, Kempter, & van Hemmen, 2002; Burkitt & van Hemmen, in press). A recent analysis involving detailed comparison between STDP and experimental results supports the hypothesis that potentiation and depression have different weight dependence (Bi & Poo, 1998), and in particular that potentiation is approximately additive and depression is predominantly multiplicative (van Rossum, Bi, & Turrigiano, 2000), that is, the potentiating increments are independent of the value of the synaptic strength, whereas the depressing decrements are proportional to the synaptic strength. Such models of temporally asymmetric Hebbian plasticity have been studied recently by a number of authors (Kistler & van Hemmen, 2000; van Rossum et al., 2000; Rubin, Lee, & Sompolinsky, 2001; G utig, ¨ Aharanov, Rotter, & Sompolinsky, 2003). A modeling study of STDP with multiplicative depression by van Rossum et al. (2000) in the presence of uncorrelated Poisson trains of spikes produced an equilibrium distribution of synaptic weights that compared favorably with that observed in cultured cortical pyramidal neurons (Turrigiano et al., 1998). Moreover, they found that spike-timing correlations
Spike-Timing-Dependent Plasticity
887
among a subset of the synaptic inputs increased the corresponding synaptic weights while leaving the weights of the uncorrelated inputs unchanged. This appears to indicate that there is no inherent competition in the STDP of such models (van Rossum et al., 2000). By contrast, in models where both the potentiation and depression are additive, competition tends to drive the weights to either zero or their maximally allowed value (Kempter et al., 1999; Song et al., 2000). The lack of competition in the model of STDP with additive potentiation and multiplicative depression may be overcome by incorporating activity-dependent scaling of the synaptic weights, in which the synaptic strengths are observed to change in a manner that depends on their long-term average spiking rate (Turrigiano et al., 1998). This effect occurs on a much slower timescale than STDP but is nevertheless capable of introducing competition between synapses. One somewhat surprising outcome of the study by van Rossum et al. (2000) was that their model does not recover rate-based Hebbian learning, since an increase in the rate of synaptic inputs does not have any effect on the synaptic strength in their model. Consequently, it would appear that this version of STDP is unable to potentiate weights that have a higher rate of synaptic input; rather, only synaptic inputs that have spike-timing correlations with other inputs will be potentiated. This result is at odds with the classical studies of LTP and LTD, which indicate that the rate of synaptic inputs plays a crucial role in synaptic modification (Lomo, 1971; Bliss & Lomo, 1973). The conventional methodology adopted for the study of LTP and LTD has been based on spike rate and has ignored the timing of individual spikes.
The concept of selective synaptic modification based on correlations between mean spiking rates forms the foundation for much of the extensive work on rate-based Hebbian learning in neural systems dating back many years (Sejnowski, 1977). Consequently there has been considerable recent interest in establishing the relationship between rate-based and spike-based learning; for recent reviews of the relationship between LTP, LTD, synaptic scaling, and STDP, see Sejnowski (1999), Abbott and Nelson (2000), and Bi and Poo (2001). However, a number of fundamental questions remain about the relationship between these Hebbian rate-based models of neural plasticity and STDP (Roberts & Bell, 2002). Most of the STDP models studied have resulted in a dynamics of the weights that is controlled by an unstable fixed point, which leads to the stability problems and bimodal distribution of weights discussed earlier. Consequently, it has not been possible to relate STDP to rate-based models in which the learning is determined by a stable fixed point, since existing models of STDP with such stable learning dynamics show an absence of selective potentiation of synapses with higher spiking rate inputs (van Rossum et al., 2000). One of the aims of this study is to show that STDP (together with synaptic scaling) could be capable of producing stable fixed-point dynamics of learning, as well as the selective potentiation of higher input-rate synapses that Hebbian rate-based models incorporate.
888
A. Burkitt, H. Meffin, and D. Grayden
A recent focus of attention in STDP has been the time extent of the EPSP-AP interactions. At very low output spiking rates, an input will typically fall only within the STDP time window of a single output spike, while at higher output spiking rates, an incoming EPSP may fall within the STDP time window of a number of output APs. This leads to questions regarding how the plasticity signals from these multiple EPSP-AP interactions are to be integrated. These issues have been addressed in recent studies on thick tufted layer 5 pyramidal neurons in rat visual cortex, in which experimental measurements of STDP were compared with a number of models of EPSP-AP interactions (Sjöström, Turrigiano, & Nelson, 2001). They particularly analyzed three models in which an input arrives in the STDP time window of more than one output spike and the timing extent of EPSP-AP interactions was limited in various ways. The results indicate a complex dependence of LTP and LTD on the output spiking rate of the neurons and provide support for the hypothesis that the extent of the EPSP-AP interaction is restricted when the input EPSPs fall within more than one STDP time window of an output AP. In this study, we identify and systematically analyze models representing four classes of EPSP-AP interaction in order to elucidate how the different time extent of EPSP-AP interactions gives rise to plasticity rules that depend on input and output rates in a way that is consistent with classical LTP and LTD results. We examine the role, upon the potentiation of synapses with different rates of synaptic input, of (1) the time extent of input-output (EPSP-AP) interactions and (2) the time asymmetry of the learning window. The model of STDP is presented in section 2.1, in which the STDP time window and the time constants associated with potentiation (τ_P) and depression (τ_D) are defined. Experimental results indicate that these time constants are different (Bi & Poo, 1998).
A framework for describing the STDP evolution of the synaptic weights, based on the nonlinear Langevin equation, is presented in section 2.1. This approach naturally relates the description of the evolution of the mean weights (Kempter et al., 1999) and the Fokker-Planck approach (van Kampen, 1992; van Rossum et al., 2000). In order to illustrate how a stable fixed-point model of STDP can form part of a larger neural learning paradigm of working memory, the notion of background and foreground synapses, corresponding to non-input-selective (spontaneously active) and input-selective synapses, is introduced. In this study we use a leaky integrate-and-fire neuronal model with synaptic conductances (Burkitt, 2001), described in section 2.2. Four different classes of EPSP-AP interaction, based on the time extent of the interaction, are presented in section 3. The results for each class of model are presented in sections 3.1, 3.2, 3.3, and 3.4. These results, discussed in section 4, indicate that the selective potentiation of synapses with higher-rate input requires both that the EPSP-AP interactions are limited in a way that depends on the input rate and that the STDP time window is time-asymmetric. The results also show how spike-timing-based models of neural plasticity can be integrated with established rate-based models (Sejnowski, 1977; Bienenstock et al., 1982). In section 4.1 the effect of suppressive interspike interactions on STDP (Froemke & Dan, 2002) is analyzed, and in section 4.2 the relationship of STDP with activity-dependent synaptic scaling is discussed. Section 4.3 contains a discussion of the width of the weight distribution, and section 4.4 discusses a number of the assumptions and approximations made in this study. The key mathematical details of the analysis are presented in the appendix.

2 The Model

The two components of the model are the weight changes caused by spike-timing-dependent plasticity and the description of the output spikes in terms of the synaptic inputs. Both of these components, outlined in the two subsections below, are described in terms of stochastic processes that arise because of the random nature of the times of the synaptic inputs (described in terms of a Poisson process).

2.1 Spike-Timing-Dependent Plasticity. We consider a neuron with N_E excitatory synaptic inputs. STDP is implemented by making the following changes to the (excitatory) weights w_i when a synaptic input on the ith fiber arrives at time t_in and an output spike is generated at time t_out: w_i → w_i + Δw_i, with

    Δw_i = α_D(w_i) F(t_out − t_in)   for t_out < t_in,
    Δw_i = α_P(w_i) F(t_out − t_in)   for t_out ≥ t_in,   (2.1)

where α_P(w_i) and α_D(w_i) are weight-dependent potentiation and depression functions, respectively. The STDP time window function F(Δt) is given by

    F(Δt) = −e^{Δt/τ_D}    for Δt < 0,
    F(Δt) = e^{−Δt/τ_P}    for Δt ≥ 0,   (2.2)

where τ_P and τ_D are the time constants of potentiation and depression, respectively. Figure 1 illustrates the changes in weights given by this form of the STDP time window. Bi and Poo (1998) measured values of τ_P = 17 ± 9 ms and τ_D = 34 ± 13 ms in cultured hippocampal neurons.
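As a concrete illustration, the pairing rule of equations 2.1 and 2.2 can be sketched in a few lines of Python. The function names, and passing α_P and α_D as callables, are illustrative assumptions for this sketch, not code from the paper:

```python
import math

# STDP time constants measured by Bi & Poo (1998), in ms
TAU_P, TAU_D = 17.0, 34.0

def stdp_window(dt):
    """STDP time window F(Δt) of equation 2.2, with Δt = t_out - t_in in ms.
    Positive (potentiating) for Δt >= 0, negative (depressing) for Δt < 0."""
    if dt >= 0:
        return math.exp(-dt / TAU_P)
    return -math.exp(dt / TAU_D)

def weight_update(w, t_out, t_in, alpha_P, alpha_D):
    """Weight change Δw_i of equation 2.1 for a single input-output pairing."""
    dt = t_out - t_in
    if dt >= 0:
        return alpha_P(w) * stdp_window(dt)   # potentiation: F > 0
    return alpha_D(w) * stdp_window(dt)       # depression: F < 0
```

For van Rossum et al. plasticity (Table 1), for example, one would pass alpha_P = lambda w: cP and alpha_D = lambda w: cD * w.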
There are a number of STDP models that have been studied, and Table 1 defines the weight-dependent potentiation and depression functions α_P(w), α_D(w) for the models discussed here. "Van Rossum et al. plasticity" has additive potentiation and multiplicative depression (van Rossum et al., 2000). For "Gütig et al. plasticity" (Gütig et al., 2003), the case μ = 0 corresponds to purely "additive plasticity" and the case μ = 1 to "multiplicative plasticity," but with the addition of an explicit upper bound w_max on the weight, which for all nonzero μ and sufficiently small learning rate prevents the weight from leaving the range [0, w_max].
Figure 1: The STDP time window F(Δt), equation 2.2: The change in synaptic strength is a function of the time difference between inputs and outputs (Δt = t_out − t_in). Synaptic inputs (EPSPs) that arrive before an output (AP) is generated give a positive contribution (potentiation) to the change in weight, whereas inputs that arrive after an output give a negative contribution (depression). The time constants here are taken to be τ_P = 17 ms and τ_D = 34 ms. Units on the vertical axis depend on the values of α_P(w_i), α_D(w_i).
STDP describes the extent to which a synapse is potentiated or depressed solely in terms of the relative times of the individual input EPSPs and the output APs. In this study, a number of different models are examined in which the time extent of allowed EPSP-AP interactions is restricted in various ways. We identify four classes of models: (I) no restriction on EPSP-AP interactions, (II) input-restricted models in which the extent of the EPSP-AP interactions is limited to an interspike interval (ISI) of the input EPSPs, (III) output-restricted models in which the extent of the EPSP-AP interactions is limited to an ISI of the output APs, and (IV) input- and output-restricted models in which the extent of the EPSP-AP interactions is limited in a more complex way that depends on both the input and output ISIs. The details of these models are given in section 3. In this analysis, we assume that the input EPSP times have a Poisson distribution.

Table 1: Models of STDP, giving the form of their weight-dependent potentiation and depression functions.

    Model                           α_D(w)       α_P(w)
    Additive plasticity             c_D          c_P
    Multiplicative plasticity       c_D w        c_P w
    van Rossum et al. plasticity    c_D w        c_P
    Gütig et al. plasticity         c_D w^μ      c_P (w_max − w)^μ

Note: The coefficients c_P and c_D are constants and 0 ≤ μ ≤ 1.

The evolution of the individual weights under STDP is described by the nonlinear Langevin equation (van Kampen, 1992; Risken, 1996),

    dw_i(t)/dt = A(w_i, {w_j}) + C(w_i, {w_j}) ξ(t),   (2.3)
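The plasticity rules of Table 1 can be collected in a small factory returning the (α_P, α_D) pair for each model. The default values for c_P, c_D, w_max, and μ below are illustrative assumptions, not parameter values from the paper:

```python
def make_model(name, cP=0.01, cD=0.01, w_max=1.0, mu=0.5):
    """Return (alpha_P, alpha_D) for the named STDP rule of Table 1."""
    if name == "additive":
        return (lambda w: cP, lambda w: cD)
    if name == "multiplicative":
        return (lambda w: cP * w, lambda w: cD * w)
    if name == "van_rossum":                 # additive LTP, multiplicative LTD
        return (lambda w: cP, lambda w: cD * w)
    if name == "gutig":                      # mu=0: additive, mu=1: multiplicative
        return (lambda w: cP * (w_max - w) ** mu, lambda w: cD * w ** mu)
    raise ValueError("unknown model: " + name)
```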
where ξ(t) is Gaussian white noise with zero mean and unit variance. The functions A(w_i, {w_j}) and C(w_i, {w_j}) depend not only on the individual weight w_i but also on all of the other weights {w_j} through the implicit dependence of the output spiking rate on all weights (for notational convenience, we henceforth suppress the dependence of these functions on the {w_j}). The underlying dynamics is described by the Poisson process governing the incoming EPSPs and the spike-generating dynamics of the leaky integrate-and-fire neuron with synaptic conductances, both of which combine to produce the distribution of relative times between the individual EPSPs and the APs that is necessary for the calculation of the weight changes, equation 2.1. By the central limit theorem, averaging over the EPSP-AP interaction time distribution for many independent small-amplitude weight increments gives rise to the effective white noise in the above Langevin equation and enables the coefficients A(w) and C(w) to be calculated. The validity criteria for this approximation are analyzed in more detail below in relation to the equivalent Fokker-Planck formalism (Itô interpretation). The coefficients A(w_i) and C(w_i) may vary for different weights w_i if their associated synaptic inputs have a different temporal distribution. The mean weight w̄_i evolves according to

    dw̄_i(t)/dt = A(w̄_i).   (2.4)
This equation gives the effective rate-based description of the change in the mean synaptic weight for any particular model of STDP; it thereby relates STDP to rate-based models of the synaptic weight dynamics and forms the basis for the analysis of the weight dynamics (Kempter et al., 1999). The "drift" function A(w_i) is calculated as

    A(w_i) = ∫ dr r Q(w_i; r),   (2.5)
where Q(w_i; r) is the probability of the weight w_i making an independent jump of size r. For STDP models, the jumps are given by equation 2.1, and A(w_i) is the sum of the potentiation and depression components, A(w_i) = A_P(w_i) − A_D(w_i). The potentiating contributions A_P(w_i) arise when the time
difference between an output AP and an input EPSP is positive, and the depressing contributions A_D(w_i) arise when the time difference between an output AP and an input EPSP is negative. Consequently, A(w_i) can be expressed as an integral over the changes in the synaptic strengths weighted by the probability distribution P(Δt) of the input-output time differences:

    A_D(w_i) = α_D(w_i) a_D({w_j}) = −α_D(w_i) ∫_{−∞}^{0} dΔt F(Δt) P(Δt),
    A_P(w_i) = α_P(w_i) a_P({w_j}) = α_P(w_i) ∫_{0}^{∞} dΔt F(Δt) P(Δt),   (2.6)
where a_D({w_j}) and a_P({w_j}) are the time integrals on the right side of the equations; they typically depend on the output spiking distribution, which gives them an implicit dependence on all the weights {w_j} (henceforth suppressed for notational convenience). The difficulty in evaluating A(w_i) lies principally in the explicit determination of the distribution P(Δt) of input and output times, since the output spike times depend on the particular neural model adopted. Details of the neural model used here are presented in section 2.2, and the calculations of A(w_i) are presented in section 3 and appendix B. The stationary mean of the weight distribution, w̄_i, is given by the fixed-point equation A(w̄_i) = A_P(w̄_i) − A_D(w̄_i) = 0, which may be written as

    α_D(w̄_i)/α_P(w̄_i) = a_P/a_D = ∫_{0}^{∞} dΔt F(Δt) P(Δt) / ( −∫_{−∞}^{0} dΔt F(Δt) P(Δt) ).   (2.7)
In this analysis, we consider only models in which the learning dynamics of each weight is governed by a stable fixed point (i.e., small fluctuations in the mean weight are suppressed). The stability condition for this fixed point in the case where all the weights are homogeneous (i.e., there is no selective stimulation of particular synapses) is

    α′_D(w̄)/α_D(w̄) + a′_D/a_D > α′_P(w̄)/α_P(w̄) + a′_P/a_P,   (2.8)

where w̄ is the fixed point of the mean of the weight distribution and the prime denotes the derivative with respect to the weight. In the more general case, the stability analysis involves the calculation of the Jacobian of the drift function, ∂_j A(w_i); stability requires that the real parts of all eigenvalues of this Jacobian (evaluated at the fixed point) be negative. One such model of STDP with a stable fixed point is van Rossum et al. plasticity (see Table 1), for which the drift and diffusion functions have an explicitly linear and quadratic dependence, respectively, on the weight: A(w_i) = A_0 − A_1 w_i and B(w_i) = B_0 − B_1 w_i + B_2 w_i² (note the sign of the linear terms: with these
definitions all A_n, B_n > 0 for this model), where the coefficients of the drift function are given by

    A_0 = c_P a_P = c_P ∫_{0}^{∞} dΔt e^{−Δt/τ_P} P(Δt),
    A_1 = c_D a_D = c_D ∫_{−∞}^{0} dΔt e^{Δt/τ_D} P(Δt),   (2.9)
with c_D, c_P > 0. The fixed-point solution of equation 2.7 is

    w̄_i = A_0/A_1 = c_P a_P / (c_D a_D).   (2.10)
Although we are mainly interested in the mean weights in this study, it is also of interest to know the distribution of the weights. The probability density of each weight, P(w_i, t), is described by the Fokker-Planck equation (van Kampen, 1992; Risken, 1996),

    ∂P(w_i, t)/∂t = −∂/∂w_i [A(w_i) P(w_i, t)] + (1/2) ∂²/∂w_i² [B(w_i) P(w_i, t)].   (2.11)
The diffusion term, B(w_i), is calculated analogously to the drift term, equation 2.5:

    B(w_i) = ∫ dr r² Q(w_i; r).   (2.12)

The complete weight distribution that STDP generates is given by the equilibrium solution of the Fokker-Planck equation (van Rossum et al., 2000),

    P(w) = (N/B(w)) exp[ ∫^w dw′ 2A(w′)/B(w′) ],   (2.13)

where N is a normalization factor. An estimate of the variance of the weight distribution is given by expanding the exponent in the above expression to second order about the fixed point w̄_i defined in equation 2.7:

    σ_i² ≈ −B(w̄_i) / (2A′(w̄_i)).   (2.14)
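Equation 2.13 can be evaluated numerically for any drift and diffusion functions; a minimal trapezoidal sketch (the grid and the sample A, B used below are illustrative, not the paper's parameter values):

```python
import math

def equilibrium_density(A, B, w_grid):
    """Equilibrium solution of the Fokker-Planck equation (equation 2.13):
    P(w) proportional to exp(integral of 2A/B up to w) / B(w),
    normalized on the supplied uniform grid."""
    dens, integral = [], 0.0
    w_prev = w_grid[0]
    for w in w_grid:
        wm = 0.5 * (w_prev + w)          # midpoint rule for the exponent
        integral += 2.0 * A(wm) / B(wm) * (w - w_prev)
        dens.append(math.exp(integral) / B(w))
        w_prev = w
    dw = w_grid[1] - w_grid[0]
    norm = sum(dens) * dw
    return [d / norm for d in dens]
```

With linear drift A(w) = 0.5 − w and constant diffusion B(w) = 0.01, the density is Gaussian, peaked at the fixed point w̄ = 0.5 with variance B/(2A_1) = 0.005, in agreement with equation 2.14.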
In order for the Fokker-Planck formalism to provide a valid description of the weight dynamics, two conditions must be fulfilled (van Kampen, 1992): (1) the drift function must be a slowly varying function of the weight (i.e., A′(w)/λ ≪ 1, where λ is the rate of independent weight steps r in equation 2.5), and (2) the mean square weight step ⟨r²⟩ must be much less than
the variance of the weight distribution (i.e., ⟨r²⟩ ≪ σ_i²). Using equation 2.14 and ⟨r²⟩ ≈ B(w̄_i)/λ, we find that these two conditions are essentially equivalent, and we show in section 3 that they are fulfilled for the models studied here. For van Rossum et al. plasticity, Table 1, with the above linear and quadratic forms of A(w) and B(w), respectively, the equilibrium weight distribution is given by

    P(w) = N exp{ [(2A_0 − A_1 B_1/B_2)/√(B_0 B_2 − B_1²/4)] arctan[ B_2 (w − B_1/(2B_2)) / √(B_0 B_2 − B_1²/4) ] } / (B_0 − B_1 w + B_2 w²)^{1 + A_1/B_2}.   (2.15)
In this case, it is possible to give a more accurate analytical expression than equation 2.14 for the variance of the weight distribution, valid when A(w) is linear and B(w) is quadratic, derived from a self-consistency relation (Burkitt, 2001; see appendix A for details):

    σ_i² = (B_0 − B_1 A_0/A_1 + B_2 A_0²/A_1²) / (2A_1 − B_2).   (2.16)
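A small sketch comparing the two variance estimates (the coefficient values used in the check are arbitrary illustrations, not fitted values from the paper):

```python
def variance_gaussian(A0, A1, B0, B1, B2):
    """Equation 2.14 for linear drift A(w) = A0 - A1*w and quadratic
    diffusion B(w) = B0 - B1*w + B2*w**2: -B(wbar)/(2A'(wbar)) = B(wbar)/(2*A1)."""
    wbar = A0 / A1
    return (B0 - B1 * wbar + B2 * wbar ** 2) / (2.0 * A1)

def variance_self_consistent(A0, A1, B0, B1, B2):
    """Self-consistent variance of equation 2.16."""
    return (B0 - B1 * A0 / A1 + B2 * (A0 / A1) ** 2) / (2.0 * A1 - B2)
```

The two numerators agree (both equal B(w̄)); equation 2.16 only corrects the denominator (2A_1 − B_2 instead of 2A_1), so it exceeds the small-fluctuation estimate whenever B_2 > 0.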
Alternatively, the variance of the distribution can be evaluated numerically directly from the Fokker-Planck distribution, equation 2.13. Some of the issues associated with the width of the weight distribution are discussed in section 4.3. One of the central questions regarding any synaptic modification rule is what its role is in a wider learning paradigm. In order to illustrate the relationship between Hebbian rate-based models with stable learning dynamics that is input selective (i.e., the fixed point is dependent on the synaptic inputs) and STDP with a similar stable fixed-point learning dynamics, a simple paradigm is considered. We analyze the weights of a single neuron in which the majority of the synapses receive only a background level of synaptic inputs (corresponding to spontaneous activity), and a small subset of the synapses receive a higher rate of inputs (corresponding to inputs from "activated" neurons). These inputs and synapses are referred to as background and foreground, respectively, and consequently the higher-rate inputs are selective to the foreground synapses. For simplicity, we assume that all the background inputs and synapses are statistically identical, and likewise for the foreground inputs and synapses. The equivalence of the foreground inputs represents a considerable simplification that removes the spatial component of activation, but it allows the central features of the learning dynamics of the STDP models analyzed here to be illustrated. Hence the incoming EPSPs are divided into two sets: a larger set of size N_B with input rates given by the background (spontaneous) spiking rate, taken here to be λ_B = 5 Hz, and a smaller set of size N_F = N_E − N_B with an elevated input rate λ_F. There are correspondingly two sets of weights on
the neuron, described as background and foreground (w_B and w_F, respectively), that also correspond to unstimulated and stimulated synapses, respectively, in studies of classical LTP and LTD. Consequently the equations obeyed by the means of the background and foreground weights, w̄_B and w̄_F, respectively, are given by equation 2.7 (with the subscript i replaced by B, F). While most of the analysis will be concerned with the mean values of these distributions, we will also discuss the full distribution in later sections. It is straightforward to show that the stability condition of the fixed point of this system of two weights corresponds to the requirement that the trace of the Jacobian ∂_X A(w_Y) (where X, Y = B, F), evaluated at the fixed point, be negative and the determinant be positive. The relationship of this simple learning paradigm with Hebbian rate-based models of learning is discussed in section 4. The results apply more generally to any spatial distribution of rates across inputs. In this analysis, we will be concerned with the effect of a large number of inputs whose spike times are uncorrelated. Consequently we neglect the spike-triggering effect, namely, the increase in the probability of a spike being generated due to a single synaptic input (this is equivalent to assuming that each individual synaptic input is small compared to the sum of all the synaptic inputs). The spike-triggering effect scales with the presynaptic firing rate (Kempter et al., 1999) and may be significant when there are correlations between the presynaptic inputs (van Rossum et al., 2000; Svirskis & Rinzel, 2000; Stroeve & Gielen, 2001). In particular, in the case where the weight dynamics is not determined by a stable fixed point, the spike-triggering effect could be greatly amplified by the instability (Gerstner, 2001; Kempter et al., 2001). For neurons with a large number of temporally uncorrelated small-amplitude inputs, this term is very small, and it will be neglected in this study.
In addition to STDP, cortical neurons also exhibit activity-dependent scaling of the synaptic weights (i.e., dependent on the long-time average output spiking rate of the postsynaptic neurons), which is experimentally observed to occur on a much slower timescale than STDP (Turrigiano et al., 1998). This activity-dependent synaptic scaling (ADSS) has the effect of introducing a competitive mechanism between the weights (van Rossum et al., 2000). Because of its slower timescale, we do not model ADSS directly in this study, although it nevertheless plays an important role in understanding the overall impact of STDP on neural functioning, as discussed in section 4.2. The parameters associated with ADSS can therefore be treated as approximately constant over the timescale relevant to STDP.

2.2 Leaky Integrate-and-Fire Neuronal Model with Synaptic Conductances. A central ingredient of STDP calculations is the determination of the output spike times, for which an explicit neural model is necessary. In this study, a one-compartment leaky integrate-and-fire neuron with synaptic conductances is used. The membrane potential V(t) in this model is the
integrated activity of its excitatory and inhibitory synaptic inputs, and it decays in time with a characteristic time constant (Tuckwell, 1979),

    dV = −[(V − v_0)/τ] dt + (V_E − V) Σ_{i=1}^{N_E} g_{E,i} dP_{E,i} + (V_I − V) Σ_{j=1}^{N_I} g_{I,j} dP_{I,j}.   (2.17)
The first term describes the passive leak of the membrane, with resting potential v_0 and passive membrane time constant τ. The second and third terms represent the synaptic contribution due to cortical background and foreground activity from N_E excitatory (dP_{E,i}) and N_I inhibitory (dP_{I,j}) neurons, respectively. The inputs dP_{E,i} and dP_{I,j} are trains of delta functions distributed as independent, temporally homogeneous Poisson processes with constant intensities (spiking rates) λ_{E,i} and λ_{I,j}, respectively. V_E and V_I are the (constant) reversal potentials (V_I ≤ v_0 ≤ V(t) ≤ V_th < V_E). When the membrane potential reaches the threshold V_th, an output spike (action potential) is generated, and the membrane potential is reset to its resting value v_0. The parameters g_{E,i} and g_{I,j} represent the integrated conductances over the time course of the synaptic event (with unit weights) divided by the neural capacitance; they are nonnegative and dimensionless. Since we are interested only in the plasticity of the excitatory synapses in this study, g_{I,j} = g_I is taken to be the same for all inhibitory synapses. In keeping with widespread convention about synaptic strengths, we also use the notation w_i = g_{E,i} and refer to the excitatory synaptic conductances as weights. In the absence of spike generation, the membrane potential approaches an equilibrium value, μ_Q, about which it fluctuates with variance σ_Q². The membrane potential approaches μ_Q with a time constant given by the effective membrane time constant τ_Q, which determines the decay of the membrane potential and is typically significantly shorter than the passive membrane time constant τ due to the activation of the synaptic conductances. The values of μ_Q, σ_Q, and τ_Q are (Hanson & Tuckwell, 1983; Burkitt, 2001)

    1/τ_Q = 1/τ + r_{10},   μ_Q = τ_Q (v_0/τ + r_{11}),
    σ_Q² = (μ_Q² r_{20} − 2 μ_Q r_{21} + r_{22}) / (2/τ_Q − r_{20}),
    r_{mn} = Σ_{i=1}^{N_E} V_E^n g_{E,i}^m λ_{E,i} + N_I V_I^n g_I^m λ_I.   (2.18)
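The moments in equation 2.18 are straightforward to compute; a sketch in SI units, using the reversal and resting potentials quoted in this section (the conductance and rate values used in the check below are illustrative assumptions, not the paper's parameter set):

```python
def membrane_stats(gE, lamE, gI, lamI, NI, tau=0.020,
                   VE=0.0, VI=-0.080, v0=-0.070):
    """Effective time constant, equilibrium mean, and variance of the
    membrane potential (equation 2.18).  gE/lamE are per-synapse lists of
    excitatory conductances (dimensionless) and rates (Hz); inhibition is
    homogeneous with conductance gI and rate lamI across NI synapses."""
    def r(m, n):
        return sum(VE ** n * g ** m * lam for g, lam in zip(gE, lamE)) \
               + NI * VI ** n * gI ** m * lamI
    tauQ = 1.0 / (1.0 / tau + r(1, 0))
    muQ = tauQ * (v0 / tau + r(1, 1))
    sigQ2 = (muQ ** 2 * r(2, 0) - 2.0 * muQ * r(2, 1) + r(2, 2)) \
            / (2.0 / tauQ - r(2, 0))
    return tauQ, muQ, sigQ2
```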
The output spiking rate is given as a function of μ_Q, σ_Q, and τ_Q by

    λ_out^{−1} = τ_A + ∫_0^∞ t f_θ(t) dt
               = τ_A + (τ_Q/σ_Q) √(π/2) ∫_{v_0}^{V_th} du exp[(u − μ_Q)²/(2σ_Q²)] [1 + erf((u − μ_Q)/(√2 σ_Q))],   (2.19)
where τ_A is the absolute refractory period (neglected in the analysis here). This is the so-called Siegert formula (Siegert, 1951; Ricciardi, 1977; Tuckwell, 1988; Amit & Tsodyks, 1991), but with the membrane time constant τ replaced by the effective time constant τ_Q. Note that once the parameters V_th and v_0 have been chosen, this formula gives the mean spiking rate as a function of just the three variables μ_Q, σ_Q, and τ_Q. In the calculations of the drift function A(w_i) described in the following sections, the output spike times are completely determined by the interspike interval (ISI) distribution f_θ(t). The Laplace transform of the ISI distribution, f_θ^L(s), plays an important role because of the assumed Poisson time distribution of the synaptic input. Because the spiking mechanism of the neural model is integrate-and-fire, the ISI distribution is the first passage time density of the membrane potential and is given by (see the appendix of Burkitt, Meffin, & Grayden, 2003)

    f_θ^L(s) = p^L(s) / q^L(s),   (2.20)
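The Siegert formula of equation 2.19 can be checked numerically with a simple trapezoidal rule; a sketch in SI units (the μ_Q, σ_Q, τ_Q values used in the check are illustrative assumptions):

```python
import math

def siegert_rate(muQ, sigQ, tauQ, v0=-0.070, Vth=-0.055, tauA=0.0, n=2000):
    """Mean output rate of equation 2.19: the Siegert formula with the
    effective time constant tauQ, evaluated by trapezoidal integration."""
    du = (Vth - v0) / n
    total = 0.0
    for i in range(n + 1):
        u = v0 + i * du
        z = (u - muQ) / sigQ
        f = math.exp(0.5 * z * z) * (1.0 + math.erf(z / math.sqrt(2.0)))
        total += 0.5 * f if i in (0, n) else f   # trapezoid end-point weights
    integral = total * du
    return 1.0 / (tauA + (tauQ / sigQ) * math.sqrt(math.pi / 2.0) * integral)
```

As expected, the rate increases monotonically with the equilibrium potential μ_Q and decreases when the threshold V_th is raised.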
where the numerator and denominator are

    p^L(s) = ∫_0^1 dx [τ_Q x^{τ_Q s − 1} / √(2π σ_Q² (1 − x²))] exp[−(y_th − y_r x)²/(1 − x²)],
    q^L(s) = ∫_0^1 dx [τ_Q x^{τ_Q s − 1} / √(2π σ_Q² (1 − x²))] exp[−y_th² (1 − x)²/(1 − x²)],   (2.21)

with y_th = (V_th − μ_Q)/(√2 σ_Q) and y_r = (v_0 − μ_Q)/(√2 σ_Q). The potentials were chosen throughout this study to be V_E = 0 mV, V_I = −80 mV, v_0 = −70 mV, and V_th = −55 mV, and the passive membrane time constant was τ = 20 ms. The numbers of background and foreground synapses are taken to be N_B = 1500 and N_F = 100, respectively, and the number of inhibitory synapses is N_I = 400.

3 Results

Results for four different classes of EPSP-AP interaction, based on the time extent of the interaction, are presented in the four sections below.

3.1 Model I: Unrestricted Time Extent of EPSP-AP Interactions. We first consider the model in which the synaptic inputs contribute to all the STDP time windows in which they fall (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter et al., 1999; Song et al., 2000; Senn, Markram, & Tsodyks, 2001). In this model, a synaptic input gives potentiating contributions for all the subsequent spikes and depressing contributions for all the preceding spikes that lie within the STDP time window. The drift function,
A(w_i), is given by (see section B.1 in appendix B for details)

    A(w_i) = λ_out λ_i [τ_P α_P(w_i) − τ_D α_D(w_i)],   (3.1)
where λ_i is the rate of synaptic inputs to synapse i with weight w_i. In this expression for the drift function, the multiplicative factors λ_out and λ_i result from the cumulative nature of the changes to the weights induced by each output AP and input EPSP. The factors τ_P and τ_D represent the integral of the STDP window function F(Δt), equation 2.2, and hence incorporate the finite timescale of the EPSP-AP interactions represented by the STDP time window. The mean of the weight distribution for w_i is determined by the fixed-point equation A(w_i) = 0. Since the resulting expression is independent of both the input and output spiking rates, the mean weight is identical for all synapses, so that the means of both the background and foreground weight distributions, w̄_B and w̄_F, are given by

    α_D(w̄_B)/α_P(w̄_B) = α_D(w̄_F)/α_P(w̄_F) = τ_P/τ_D.   (3.2)
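The implicit fixed-point condition of equation 3.2 can be solved by bisection whenever the ratio α_D/α_P is monotonically increasing in w. A sketch (the plasticity functions used in the check are the van Rossum and Gütig rules of Table 1 with illustrative c_P = c_D):

```python
def model1_fixed_point(alpha_D, alpha_P, tauP=0.017, tauD=0.034,
                       lo=1e-9, hi=1.0, tol=1e-10):
    """Solve alpha_D(w)/alpha_P(w) = tauP/tauD (equation 3.2) by bisection,
    assuming the ratio is increasing in w on [lo, hi]."""
    target = tauP / tauD
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if alpha_D(mid) / alpha_P(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For van Rossum et al. plasticity with c_P = c_D the ratio is simply w, so the fixed point is τ_P/τ_D = 0.5, independent of the input rate, as the text states.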
The actual fixed-point value for w̄_F = w̄_B is found by solving this implicit expression. Consequently, for this class of STDP model, in which the incoming EPSPs interact with all APs, all mean weights are identical; in particular, there is no selective potentiation of the higher input rate foreground synapses relative to the lower input rate background synapses. Moreover, this equality between the means of both weight distributions is unaffected by any difference between the two STDP time constants τ_P and τ_D. In the particular case of van Rossum et al. plasticity, Table 1, the mean weight is given by w̄_i = w̄_B = w̄_F = τ_P c_P/(τ_D c_D). The diffusion function B(w) cannot be calculated in the same straightforward way as the drift function A(w), since the successive changes in the weights are not independent (i.e., successive jumps are correlated), as discussed further in section 4.3. The calculation of B(w) with such correlations is outside the scope of this investigation.

3.2 Model II: Input-Restricted EPSP-AP Interactions. We consider now a model in which the time extent of the EPSP-AP interactions is restricted so that each output AP interacts only with its nearest synaptic inputs. Consequently, each output AP in this model has a potentiation component that arises by interaction with the most recent input EPSP and a depression component by interaction with the first subsequent EPSP (i.e., if additional EPSPs fall within the STDP time window of an output spike, these interactions are neglected; van Rossum et al., 2000). The EPSP-AP interactions included in this model are illustrated in Figure 2, which shows that APs interact only with the EPSPs that are within the EPSP ISI in which they are generated. Consequently, each ISI of the input EPSPs can be considered as an independent event.
Figure 2: Model II: Input-restricted EPSP-AP interactions (N = 1). Diagrammatic representation of the EPSP-AP interactions included (time goes from left to right). Each AP interacts only with the last and the next EPSP. Action potentials (APs) are represented by vertical arrows above the time line, and EPSPs are represented by vertical arrows below the time line. The brackets indicate the EPSP-AP interactions included in this model: brackets above the time line represent contributions that increase the weight (potentiation), and brackets below the time line represent contributions that decrease the weight (depression).
The drift function A(w) and diffusion function B(w) are evaluated using equation 2.5 and make use of the Poisson time distribution of the synaptic inputs. This is done by integrating over a single ISI of the EPSPs, that is, truncating the time window of integration of the EPSP-AP interaction (see section B.2 for details). The drift function for this model is given by

    A(w_i) = λ_out λ_i [ α_P(w_i)/(λ_i + 1/τ_P) − α_D(w_i)/(λ_i + 1/τ_D) ].   (3.3)
The drift function contains the multiplicative factors $\lambda_{\rm out}$ and $\lambda_i$, which are common to all models (see equation 3.1 and the following discussion). The restrictions imposed on the EPSP-AP interactions (i.e., truncated to extend over only the ISI of the inputs) introduce the terms within the square brackets, which determine the fixed point of the weight. When averaged over the distribution of input-output time differences, the input-restricted interactions effectively result in a modification of the STDP window by the multiplicative factor $e^{-\lambda_i t}$, that is, a suppressive factor corresponding to the Poisson rate of the input EPSPs. Consequently, the fixed-point equation for $\bar{w}_i$ now depends on the input rate $\lambda_i$ so that for the mean background and
900
A. Burkitt, H. Meffin, and D. Grayden
foreground weights, we obtain
$$\frac{\alpha_D(\bar{w}_X)}{\alpha_P(\bar{w}_X)} = \frac{\left(\lambda_X + \frac{1}{\tau_D}\right)}{\left(\lambda_X + \frac{1}{\tau_P}\right)}, \qquad X = B, F. \qquad (3.4)$$
As for model I, the fixed-point equations for the background and foreground weights are identical if the two STDP time constants are equal, resulting in no selective potentiation. However, if the STDP time constants are different, then the potentiation is input selective. This is illustrated in the case of van Rossum et al. plasticity, Table 1, for which the drift function is given by $A(w_i) = A_0 - A_1 w_i$ with
$$A_0 = \frac{\lambda_{\rm out}\lambda_i c_P}{\left(\lambda_i + \frac{1}{\tau_P}\right)}, \qquad A_1 = \frac{\lambda_{\rm out}\lambda_i c_D}{\left(\lambda_i + \frac{1}{\tau_D}\right)}. \qquad (3.5)$$
The validity of the Fokker-Planck formalism, as discussed in section 2.1, requires that $A_0(\bar{w}_i)/\lambda_i \ll 1$, which in this case is given by the condition $2 c_D \lambda_{\rm out}/(\lambda_i + \frac{1}{\tau_D}) \ll 1$, and is satisfied for sufficiently small $c_D$ (in this study, $c_D = 0.01$). The means of the background and foreground weight distributions evolve toward their respective stable fixed points, equation 2.10,
$$\bar{w}_X = \frac{c_P\left(\lambda_X + \frac{1}{\tau_D}\right)}{c_D\left(\lambda_X + \frac{1}{\tau_P}\right)}, \qquad X = B, F. \qquad (3.6)$$
Now for $\tau_D > \tau_P$ (as observed experimentally) and $\lambda_F > \lambda_B$, it is clear from this equation that $\bar{w}_F > \bar{w}_B$. Consequently, by limiting the extent of EPSP-AP interactions in this model, the weight distributions for the foreground synapses with higher input rate $\lambda_F$ become potentiated relative to the unchanged weights of the low input rate $\lambda_B$ background synapses, in contrast to previous studies (van Rossum et al., 2000).

Figure 3 illustrates the effect of restricted EPSP-AP interactions and different time constants for potentiation and depression on the mean of the foreground weights. $R_\tau = \tau_D/\tau_P$ is the ratio of the STDP time constants, and the plot shows the change of the mean weight $\bar{w}_F$ as a function of the spiking rate of the foreground inputs $\lambda_F$. (The background inputs have a rate of $\lambda_B = 5$ Hz and the average equilibrium background weight, which is fixed at $\bar{w}_B = 0.0017$ for all values of $R_\tau$ by suitably choosing $c_P$, is independent of $\lambda_F$. This is also the initial value of $\bar{w}_F$.) For the case where the time constants are equal, $R_\tau = 1$, there is no potentiation, as observed previously (van Rossum et al., 2000). For larger values of the ratio of the STDP time constants, $R_\tau > 1$, the results show that the mean foreground weight increases as the input rate of the foreground weights $\lambda_F$ increases, and this effect becomes more pronounced for larger values of $R_\tau$. The dashed line shows the case where the ratio of STDP time constants is reversed, namely, where $\tau_D < \tau_P$, which results in the weights of
Figure 3: Model II, with $N = 1$ and van Rossum et al. plasticity, equation 3.6. The percentage change of the mean weight $\bar{w}_F$ plotted as a function of the spiking rate of the foreground inputs $\lambda_F$ (the background inputs have a spiking rate of $\lambda_B = 5$ Hz). $R_\tau = \tau_D/\tau_P$ is the ratio of the STDP time constants, given by fixing $\tau_P = 17$ ms and varying $\tau_D$. Parameter values are $c_D = 0.01$, and $c_P$ is chosen so that $\bar{w}_B = 0.0017$ for each $R_\tau$. All other parameter values are as given in the text. The dashed line corresponds to a value of $R_\tau < 1$ in which the foreground set of weights falls below the background weights. Solid symbols represent the results for the average weight of the foreground synapses obtained by numerical simulation after 1000 APs.
the higher-input synapses becoming smaller than the background weights. The solid symbols in the figure represent the results of numerical simulation (averages taken after 1000 output spikes, in order for the weights to reach their steady-state distribution). Note that the values of the mean weights $\bar{w}_B$, $\bar{w}_F$, given by equation 3.6, are independent of the output spiking rate.

3.2.1 Model II: Generalization to N Input-Restricted EPSP-AP Interactions. The above results can be extended to the case in which the time extent of the EPSP-AP interactions is restricted to the $N$ nearest synaptic inputs to each output AP. The calculation proceeds as described in section B.2.1: the integral over input and output spike times, equation B.4, is generalized to equation B.14 by replacing the single input integral with $N$ input integrals for the successive inputs in exactly the fashion described for model I. Consequently, the model is $N$ input restricted, so that each AP interacts with
only the $N$ closest EPSPs. The drift function is given by
$$A(w_i) = \lambda_{\rm out}\lambda_i \left\{ \alpha_P(w_i)\,\tau_P \left[ 1 - \left( \frac{\lambda_i \tau_P}{1 + \lambda_i \tau_P} \right)^N \right] - \alpha_D(w_i)\,\tau_D \left[ 1 - \left( \frac{\lambda_i \tau_D}{1 + \lambda_i \tau_D} \right)^N \right] \right\}. \qquad (3.7)$$
Consequently, the fixed-point equations of the background and foreground weight distributions are
$$\frac{\alpha_D(\bar{w}_X)}{\alpha_P(\bar{w}_X)} = \frac{\tau_P \left[ 1 - \left( \frac{\lambda_X \tau_P}{1 + \lambda_X \tau_P} \right)^N \right]}{\tau_D \left[ 1 - \left( \frac{\lambda_X \tau_D}{1 + \lambda_X \tau_D} \right)^N \right]}, \qquad X = B, F. \qquad (3.8)$$
Clearly, as $N \to \infty$, the results reduce to those for model I, equation 3.2, with unrestricted time extent of the EPSP-AP interactions, as we would expect. For all other values of $N$, the same conclusions hold as for the $N = 1$ case: the foreground synapses with higher input rate $\lambda_F$ become potentiated relative to the background synapses with lower input rate $\lambda_B$ if the STDP time constants are different. This is illustrated for the case of van Rossum et al. plasticity in Figure 4, which shows the change of the mean weight as a function of the input spiking rate for various values of $N$. Note that as $N$ increases, the weight change becomes smaller and smaller, eventually reproducing the non-input-selective behavior of model I.

3.2.2 Model II: N = 1/2 Input-Restricted Model of EPSP-AP Interactions. A further variation of an input-restricted model is one in which each AP can interact with only the single closest EPSP; that is, an AP gives rise to either potentiation or depression depending on whether the preceding or the subsequent EPSP was closer in time. Consequently, the interactions are restricted to half of the ISI of the input EPSPs. We denote this as $N = 1/2$, since only half the EPSP-AP interactions of the $N = 1$ model are included. This is illustrated in Figure 5, which shows how each AP interacts only with its closest EPSP. The time extent of the EPSP-AP interactions in this model is restricted to domains delimited by adjacent synaptic inputs, as for the $N = 1$ case of model II. The drift function is given by (see section B.2.2 for details)
$$A(w_i) = \lambda_{\rm out}\lambda_i \left[ \frac{\alpha_P(w_i)}{\left(2\lambda_i + \frac{1}{\tau_P}\right)} - \frac{\alpha_D(w_i)}{\left(2\lambda_i + \frac{1}{\tau_D}\right)} \right]. \qquad (3.9)$$
Figure 4: Model II, various $N$ (see equation 3.8), for van Rossum et al. plasticity. The percentage change of the mean weight $\bar{w}_F$ plotted as a function of the spiking rate of the foreground inputs $\lambda_F$ (the background inputs have a spiking rate of $\lambda_B = 5$ Hz). Different lines show the results for different values of $N$, the number of nearest synaptic inputs with which each output AP interacts. The value $N = 1/2$ corresponds to the case of interaction between the AP and the single closest EPSP, as described in section 3.2.2. The time constants are taken as $\tau_D = 34$ ms, $\tau_P = 17$ ms (i.e., $R_\tau = 2$). Other parameter values are as for Figure 3.
Consequently, the means of the background and foreground weight distributions for van Rossum et al. plasticity, Table 1, are
$$\bar{w}_X = \frac{c_P\left(2\lambda_X + \frac{1}{\tau_D}\right)}{c_D\left(2\lambda_X + \frac{1}{\tau_P}\right)}, \qquad X = B, F. \qquad (3.10)$$
Similar to the previous input-restricted models, in this model the foreground synapses with higher input rate $\lambda_F$ become potentiated relative to the background synapses with lower input rate $\lambda_B$, as illustrated by the $N = 1/2$ line in Figure 4.

3.3 Model III: Output-Restricted EPSP-AP Interactions. In this model, the time extent of the EPSP-AP interaction is restricted to domains delimited by adjacent output spikes; each EPSP is paired only with the last and the next AP. Consequently, each EPSP contributes exactly one potentiating
Figure 5: Model II, $N = 1/2$ input-restricted model of EPSP-AP interactions. Each AP interacts only with the single closest EPSP. Diagrammatic representation of the EPSP-AP interactions included in this model (see the caption of Figure 2).
term (by interaction with the first subsequent AP) and exactly one depressing term (by interaction with the immediately preceding AP). This pattern of EPSP-AP interaction is illustrated in Figure 6, which shows how each EPSP interacts only with its last and next AP. This model has been studied previously in the case of additive STDP (Câteau & Fukai, 2003; Izhikevich & Desai, 2003). The physiological rationale for this model of EPSP-AP interactions is that an output AP almost completely resets the internal state of a neuron, thereby reducing the effect of more temporally distant interactions. The drift function, given by Câteau and Fukai (2003), is (see section B.3 for details)
$$A(w_i) = \lambda_{\rm out}\lambda_i \left\{ \alpha_P(w_i)\,\tau_P \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_P} \right) \right] - \alpha_D(w_i)\,\tau_D \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right] \right\}. \qquad (3.11)$$

The terms containing $f_{\theta L}(\cdot)$ reflect the output spiking rate, and in the limit of small spiking rates, this becomes equivalent to Poisson-distributed outputs:
$$f_{\theta L}(s) \stackrel{\rm Poisson}{\Longrightarrow} \frac{\lambda_{\rm out}}{s + \lambda_{\rm out}}. \qquad (3.12)$$
Figure 6: Model III: Output-restricted EPSP-AP interactions ($M = 1$). Diagrammatic representation of the EPSP-AP interactions included in this model (see the caption to Figure 2). Each EPSP contributes only to the STDP time window of its last and next AP.
The drift function in this limit becomes
$$A(w_i) \stackrel{\rm Poisson}{\Longrightarrow} \lambda_{\rm out}\lambda_i \left[ \frac{\alpha_P(w_i)}{\left(\lambda_{\rm out} + \frac{1}{\tau_P}\right)} - \frac{\alpha_D(w_i)}{\left(\lambda_{\rm out} + \frac{1}{\tau_D}\right)} \right] \qquad (3.13)$$

(cf. the analogous expression for $A(w_i)$, equation 3.3, with Poisson inputs). The means of the background and foreground weight distributions are given by the solution of the fixed-point equation:
$$\frac{\alpha_D(\bar{w}_X)}{\alpha_P(\bar{w}_X)} = \frac{\tau_P \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_P} \right) \right]}{\tau_D \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right]} \stackrel{\rm Poisson}{\Longrightarrow} \frac{\left(\lambda_{\rm out} + \frac{1}{\tau_D}\right)}{\left(\lambda_{\rm out} + \frac{1}{\tau_P}\right)}. \qquad (3.14)$$
As for models I and II, the drift function contains the multiplicative factors $\lambda_{\rm out}$ and $\lambda_i$ (see equations 3.1 and 3.3 and the following discussion). In the output-restricted model, the EPSP-AP interactions are truncated to extend over only the ISI of the outputs, intuitively resulting in a similar multiplicative factor that attenuates the STDP time window, as seen for model II. As in the case of model I, the fixed-point equation for the mean weight has no explicit dependence on the input rate. Consequently, the mean weights of the higher input rate foreground synapses, $\bar{w}_F$, and the low input rate background synapses, $\bar{w}_B$, are identical. However, unlike model I, in this model the mean weights do depend on the output spiking rate, that
is, the mean weights will change in exactly the same fashion as the output spiking rate changes. This is illustrated in the case of van Rossum et al. plasticity, Table 1, for which the mean weights are given by
$$\bar{w}_B = \bar{w}_F = \frac{c_P\,\tau_P \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_P} \right) \right]}{c_D\,\tau_D \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right]}. \qquad (3.15)$$
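Because $f_{\theta L}$ depends on the output spiking rate, which itself depends on the weights, equation 3.15 must be solved self-consistently. The sketch below assumes the Poisson-output limit, equation 3.12, in place of the integrate-and-fire first-passage density, and a hypothetical linear output-rate model `lam_out = GAIN * w` as a stand-in for the paper's rate relation (`GAIN` is purely illustrative); the fixed point is then found by direct iteration.

```python
# Self-consistent solution of equation 3.15 in the Poisson-output
# limit (equation 3.12), van Rossum et al. plasticity.  The linear
# output-rate model lam_out = GAIN * w is a hypothetical stand-in
# for the integrate-and-fire rate relation.  Rates in ms^-1.

def f_theta_poisson(s, lam_out):
    """Laplace transform of the output ISI density, Poisson limit (eq. 3.12)."""
    return lam_out / (s + lam_out)

def mean_weight_model3_m1(lam_out, tau_p=17.0, tau_d=34.0,
                          c_p=3.0e-5, c_d=0.01):
    """Equation 3.15 evaluated at a given output rate."""
    pot = tau_p * (1.0 - f_theta_poisson(1.0 / tau_p, lam_out))
    dep = tau_d * (1.0 - f_theta_poisson(1.0 / tau_d, lam_out))
    return c_p * pot / (c_d * dep)

GAIN = 6.0                           # hypothetical gain (ms^-1 per unit weight)
w = 0.0017                           # initial guess
for _ in range(200):                 # fixed-point iteration
    w_new = mean_weight_model3_m1(GAIN * w)
    if abs(w_new - w) < 1e-12:
        break
    w = w_new
print(w, GAIN * w)                   # converged weight and output rate
```

In this Poisson limit, `mean_weight_model3_m1` simplifies to $c_P(\lambda_{\rm out} + 1/\tau_D)/(c_D(\lambda_{\rm out} + 1/\tau_P))$, the fixed point of equation 3.13; the iteration converges quickly because the map is strongly contracting at these parameter values.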
In order to evaluate the mean weights, we solved equation 3.11 iteratively (by iterating to the stable fixed point), since $f_{\theta L}(s)$ depends on $\bar{w}_B$ and $\bar{w}_F$ (Câteau & Fukai, 2003). (This was not necessary for model II, because $\bar{w}_B$ and $\bar{w}_F$ depended on only their respective input rates.) Figure 7 shows the change of the mean background and foreground weights (which are equal: $\bar{w}_B = \bar{w}_F$) as a function of the foreground input rate $\lambda_F$. Again the validity of the Fokker-Planck formalism, as discussed in section 2.1, requires that
Figure 7: Model III, with $M = 1$ and van Rossum et al. plasticity, equation 3.15. The percentage increase of the mean background and foreground weights, $\bar{w}_B = \bar{w}_F$, plotted as a function of the spiking rate of the foreground inputs $\lambda_F$ (the background inputs have a spiking rate of $\lambda_B = 5$ Hz). The plot is for STDP time constants $\tau_P = 17$ ms and $\tau_D = 34$ ms (i.e., $R_\tau = 2$). Parameter values are $c_D = 0.01$ and $c_P = 3.0 \times 10^{-5}$, so that $\bar{w}_B = \bar{w}_F = 0.0017$ at $\lambda_F = 5$ Hz. All other parameter values are as given in the text. Solid symbols represent the results for the average weight of the synapses obtained by numerical simulation after 10,000 APs.
$A_0(\bar{w}_i)/\lambda_{\rm out} \ll 1$. In this case, the condition corresponds to $2 c_D \tau_D \lambda_i \left[ 1 - f_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right] \ll 1$, which is again satisfied for sufficiently small $c_D$.

3.3.1 Model III: Generalization to M Output-Restricted EPSP-AP Interactions. The above results can be extended to the case in which the time extent of the EPSP-AP interactions is restricted to the $M$ nearest AP outputs to each input EPSP. The calculation proceeds as described in section B.3.1: the integral over input and output spike times, equation B.18, is generalized to equation B.25 by replacing the single output integral with $M$ output integrals for the successive AP outputs in the fashion described for model II. Consequently, the model is $M$ output restricted, so that each EPSP interacts with only the $M$ closest APs. The drift function is given by
$$A(w_i) = \lambda_{\rm out}\lambda_i \left\{ \alpha_P(w_i)\,\tau_P \left[ 1 - f^M_{\theta L}\!\left( \frac{1}{\tau_P} \right) \right] - \alpha_D(w_i)\,\tau_D \left[ 1 - f^M_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right] \right\}. \qquad (3.16)$$
The fixed-point equations of the background and foreground weight distributions are
$$\frac{\alpha_D(\bar{w}_B)}{\alpha_P(\bar{w}_B)} = \frac{\alpha_D(\bar{w}_F)}{\alpha_P(\bar{w}_F)} = \frac{\tau_P \left[ 1 - f^M_{\theta L}\!\left( \frac{1}{\tau_P} \right) \right]}{\tau_D \left[ 1 - f^M_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right]}. \qquad (3.17)$$
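With the Poisson-output limit, equation 3.12, substituted for $f_{\theta L}$, the $M$-dependence of this fixed point can be sketched directly for van Rossum et al. plasticity. Parameter values are illustrative, and the output rate is held fixed here rather than solved self-consistently:

```python
# Fixed point of equation 3.17 for van Rossum et al. plasticity,
# assuming the Poisson-output limit f_theta_L(s) = lam_out/(s + lam_out).
# The output rate is treated as a fixed input; rates in ms^-1.

def mean_weight_model3(lam_out, m, tau_p=17.0, tau_d=34.0,
                       c_p=3.0e-5, c_d=0.01):
    """Mean weight for the M-output-restricted model, equation 3.17."""
    f_p = lam_out / (1.0 / tau_p + lam_out)
    f_d = lam_out / (1.0 / tau_d + lam_out)
    return c_p * tau_p * (1.0 - f_p ** m) / (c_d * tau_d * (1.0 - f_d ** m))

for m in (1, 2, 10, 1000):
    print(m, mean_weight_model3(0.010, m))   # lam_out = 10 Hz (0.010 ms^-1)
# As M -> infinity, f**M -> 0 and the weight tends to the model I
# value tau_P*c_P/(tau_D*c_D), losing its output-rate dependence.
```

The weight depends on `lam_out` for finite $M$ but never on the input rate, which is the absence of input selectivity noted below.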
Clearly, as $M \to \infty$, the results again reduce to those for model I, equation 3.2, with unrestricted time extent of the EPSP-AP interactions, as we would expect. For all other values of $M$, the same conclusions hold as for the $M = 1$ case: the fixed-point weight depends on the output rate, and there is no input selectivity in the synapses, which all become equally potentiated as the input rate increases if the STDP time constants are different.

3.4 Model IV: Input- and Output-Restricted EPSP-AP Interactions. This model incorporates the restrictions of the time extent of EPSP-AP interactions contained in both models II and III: only the closest EPSP-AP interactions are included (i.e., EPSP-AP interactions that have no other intervening EPSPs or APs). The time extent of the EPSP-AP interactions in this model is consequently dependent on both the input and output spiking rates, and the pattern of EPSP-AP interactions that results is illustrated in Figure 8. This model is motivated by one of the models studied by Sjöström et al. (2001). Their rationale for this model of EPSP-AP interactions is that the amount of depression or potentiation could saturate with a single EPSP-AP interaction, making any further such interactions involving the same AP or EPSP superfluous. (Note that in Sjöström et al., 2001, the STDP window
Figure 8: Model IV: Input- and output-restricted EPSP-AP interactions ($N = M = 1$). Diagrammatic representation of the EPSP-AP interactions included in this model (see the caption to Figure 2). Each EPSP contributes only to the STDP time window of its last and next AP, with the additional condition that there is no intervening EPSP or AP (i.e., only the closest EPSP-AP interactions are included).
function was taken to be a step function rather than the exponential function used here and illustrated in Figure 1, so that a single input anywhere within the time window leads to saturation in their formulation.) The drift function for this model is given by (see section B.4 for details)
$$A(w_i) = \lambda_{\rm out}\lambda_i \left\{ \alpha_P(w_i)\, \frac{\left[ 1 - f_{\theta L}\!\left( \lambda_i + \frac{1}{\tau_P} \right) \right]}{\left( \lambda_i + \frac{1}{\tau_P} \right)} - \alpha_D(w_i)\, \frac{\left[ 1 - f_{\theta L}\!\left( \lambda_i + \frac{1}{\tau_D} \right) \right]}{\left( \lambda_i + \frac{1}{\tau_D} \right)} \right\}. \qquad (3.18)$$
The means of the background and foreground weight distributions are given by the solution of the fixed-point equation
$$\frac{\alpha_D(\bar{w}_X)}{\alpha_P(\bar{w}_X)} = \frac{\left[ 1 - f_{\theta L}\!\left( \lambda_X + \frac{1}{\tau_P} \right) \right] \left( \lambda_X + \frac{1}{\tau_D} \right)}{\left[ 1 - f_{\theta L}\!\left( \lambda_X + \frac{1}{\tau_D} \right) \right] \left( \lambda_X + \frac{1}{\tau_P} \right)}, \qquad X = B, F. \qquad (3.19)$$
In the case of van Rossum et al. plasticity, Table 1, the mean weights are given by
$$\bar{w}_X = \frac{c_P \left[ 1 - f_{\theta L}\!\left( \lambda_X + \frac{1}{\tau_P} \right) \right] \left( \lambda_X + \frac{1}{\tau_D} \right)}{c_D \left[ 1 - f_{\theta L}\!\left( \lambda_X + \frac{1}{\tau_D} \right) \right] \left( \lambda_X + \frac{1}{\tau_P} \right)}, \qquad X = B, F. \qquad (3.20)$$

In the case where the STDP time constants are equal, there is no selective potentiation of the foreground synapses (i.e., $\bar{w}_F = \bar{w}_B$). When the time constant for depression is larger than for potentiation, $\tau_D > \tau_P$, the foreground synapses are potentiated relative to the background synapses, as observed in the input-restricted case of model II. However, in addition, both the background and foreground weights also depend on the output spiking rate, as observed in the output-restricted case of model III. Consequently, this input- and output-restricted model displays features contained in models II and III: the foreground synapses are potentiated relative to the background synapses, and the background and foreground weights change with the output spiking rate. This is illustrated in Figure 9, which shows the change of the mean background (solid diamond symbols) and foreground (solid square symbols) weights $\bar{w}_B$ and $\bar{w}_F$ for $R_\tau = 2$.

3.4.1 Model IV: Generalization to N Input-, M Output-Restricted EPSP-AP Interactions. The above results can be extended to the case in which the time extent of the EPSP-AP interactions is restricted to the $N$ nearest EPSP inputs to each output AP and the $M$ nearest AP outputs to each input EPSP. The calculation proceeds using the methods developed for each of the models separately. The drift function is given by
$$A(w_i) = \lambda_{\rm out}\lambda_i \left[ \alpha_P(w_i)\,\tau_P\, \frac{(-1)^{N-1} \lambda_i^N}{(N-1)!} \frac{d^{N-1}}{d\lambda_i^{N-1}} \left\{ \left( \frac{1}{\lambda_i} - \frac{1}{\lambda_i + \frac{1}{\tau_P}} \right) \left[ 1 - f^M_{\theta L}\!\left( \lambda_i + \frac{1}{\tau_P} \right) \right] \right\} \right.$$
$$\left. {} - \alpha_D(w_i)\,\tau_D\, \frac{(-1)^{N-1} \lambda_i^N}{(N-1)!} \frac{d^{N-1}}{d\lambda_i^{N-1}} \left\{ \left( \frac{1}{\lambda_i} - \frac{1}{\lambda_i + \frac{1}{\tau_D}} \right) \left[ 1 - f^M_{\theta L}\!\left( \lambda_i + \frac{1}{\tau_D} \right) \right] \right\} \right]. \qquad (3.21)$$
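The combined behavior of equation 3.20 can be checked numerically. The sketch below again substitutes the Poisson-output limit, equation 3.12, for the integrate-and-fire first-passage density $f_{\theta L}$, and treats the output rate as a fixed input; constants follow Figure 9 and the rates are illustrative.

```python
# Mean weight for model IV (equation 3.20) with van Rossum et al.
# plasticity, assuming the Poisson-output limit
# f_theta_L(s) = lam_out/(s + lam_out).  Rates in ms^-1.

def f_theta_poisson(s, lam_out):
    """Laplace transform of the output ISI density, Poisson limit (eq. 3.12)."""
    return lam_out / (s + lam_out)

def mean_weight_model4(lam_x, lam_out, tau_p=17.0, tau_d=34.0,
                       c_p=2.75e-5, c_d=0.01):
    """Equation 3.20 with Poisson-output f_theta_L."""
    pot = (1.0 - f_theta_poisson(lam_x + 1.0 / tau_p, lam_out)) * (lam_x + 1.0 / tau_d)
    dep = (1.0 - f_theta_poisson(lam_x + 1.0 / tau_d, lam_out)) * (lam_x + 1.0 / tau_p)
    return c_p * pot / (c_d * dep)

lam_b, lam_f, lam_out = 0.005, 0.050, 0.010   # 5 Hz, 50 Hz, 10 Hz
w_b = mean_weight_model4(lam_b, lam_out)
w_f = mean_weight_model4(lam_f, lam_out)
print(w_b, w_f)                          # input selectivity: w_f > w_b
print(mean_weight_model4(lam_b, 0.020))  # output-rate dependence
```

With $\tau_D > \tau_P$ the foreground weight exceeds the background weight (the model II feature), and changing `lam_out` shifts both weights (the model III feature), as stated in the text.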
The fixed-point equations can be read off in exactly the same way as for the other models. Clearly, as $M \to \infty$ for fixed $N$, the results reduce to those for model II, equation 3.7, while as $N \to \infty$ for fixed $M$, the results reduce to those for model III, equation 3.16, and as $N, M \to \infty$, the results reduce to those for model I, equation 3.1, as expected. For finite $N, M$, the model shows behaviors evident in both input-restricted and output-restricted models, as discussed above.
Figure 9: Model IV, with $N = M = 1$ and van Rossum et al. plasticity, equation 3.20. The percentage increase of the mean background and foreground weights $\bar{w}_B$ and $\bar{w}_F$ plotted as a function of the spiking rate of the foreground inputs $\lambda_F$ (the background inputs have a spiking rate of $\lambda_B = 5$ Hz). The plot is for STDP time constants $\tau_P = 17$ ms and $\tau_D = 34$ ms (i.e., $R_\tau = 2$). Parameter values are $c_D = 0.01$ and $c_P = 2.75 \times 10^{-5}$, giving $\bar{w}_B = \bar{w}_F = 0.0017$ at $\lambda_F = 5$ Hz. All other parameter values are as given in the text. Solid diamonds (squares) represent the results for the average weight of the foreground (background) synapses obtained by numerical simulation after 10,000 APs.
4 Discussion

The results for the four models of stable fixed-point STDP indicate how the time extent of EPSP-AP interactions and the time asymmetry of the learning window affect the distribution of synaptic strengths for synapses with different fixed rates of input. The effective rate-based description of the dynamics of the mean of a synaptic weight $\bar{w}_i$ (obtained by averaging over times that are long compared to both the STDP time window and the ISIs of the inputs and outputs) is given by equation 2.4, and the fixed-point value of the mean weight is determined by equation 2.7. In order to gain an understanding of the effect of different rates of synaptic input and to relate the STDP models investigated here to both classical LTP and LTD experiments and models of rate-based Hebbian learning, the inputs are divided into a background set of inputs (with a spontaneous rate of 5 Hz) and a foreground set of inputs (with a higher rate). The question that we
have addressed is the extent to which this stable fixed-point model of STDP is capable of generating synaptic weights that are input selective (i.e., whether the weights of foreground synapses are potentiated relative to background synapses at the fixed point) and therefore analogous to rate-based Hebbian learning models with learning dynamics determined by a stable fixed point.

Four classes of model are identified, based on the time extent of the allowed EPSP-AP interactions and defined in terms of how this extent is limited by the input ISIs or the output ISIs. The effect of both the restriction on the time extent of EPSP-AP interactions and the time asymmetry of the STDP window (i.e., different time constants for potentiation and depression) can be summarized as follows: (1) if the STDP time constants are equal, there is no selective potentiation of synapses with higher rates, and (2) if the time constant for depression is larger than for potentiation ($R_\tau = \tau_D/\tau_P > 1$), then there is selective potentiation of synapses with higher input rates only if the time extent of the EPSP-AP interactions displays some form of "input" restriction (such as in models II and IV).

In model I, in which all EPSPs arriving within the STDP time window contribute to the synaptic modification, there is no selective synaptic enhancement of the foreground synapses over the background synapses, even when the time constants of potentiation and depression ($\tau_P$, $\tau_D$) are different. This somewhat surprising result is a consequence of the balance of the potentiation and depression contributions: higher rates of synaptic input increase both contributions, but their ratio (which determines the average synaptic strength, equation 2.7) remains unchanged. When the extent of the EPSP-AP interaction is input restricted, as in models II and IV, and $\tau_D > \tau_P$, the foreground synapses are potentiated relative to the background synapses.
The amount of selective potentiation depends on both the rate of the foreground synaptic inputs and the ratio of the time constants of potentiation and depression ($R_\tau$), as described by equations 3.4, 3.6, and 3.10 and illustrated in Figures 3 and 4. However, when the extent of EPSP-AP interaction is output restricted, as in model III, there is no selective potentiation of the higher rate foreground synapses (but there is some dependence of both background and foreground weights on the output spiking rate), as described by equation 3.14 and illustrated in Figure 7. Model IV, which is input and output restricted in the time extent of its EPSP-AP interactions, displays both behaviors: both background and foreground synapses are potentiated, but the foreground synapses are selectively potentiated relative to the background synapses, as described by equation 3.19. Consequently, the results presented here indicate that the potentiation of foreground synapses in a classical LTP and LTD paradigm involves some form of input restriction on the time extent of the EPSP-AP interactions.

These input-restricted models of STDP are potentially significant in understanding the relationship between STDP and rate-based Hebbian learning. Hebbian rate-based learning, in which the synaptic modification is based on correlations between input and output rates, has been widely
used as a model of neural information processing, including memory (associative and working memory), sensory perception, motor control, and classical conditioning. Most previously studied models of STDP have been shown to have learning dynamics controlled by an unstable fixed point, which leads to instabilities and results in a bimodal distribution of weights. The existence of models of STDP with stable fixed-point behavior allows the possibility that such spike-timing-based models provide the underlying mechanisms of the corresponding rate-based models.

Consider, for example, the Hopfield model of associative memory (Hopfield, 1982), which has subsequently been considerably elaborated as a model of working memory (Amit & Tsodyks, 1991; Amit & Brunel, 1997; Amit & Mongillo, 2003). In these models, there is an interplay between the dynamics of the neurons (on a faster timescale) and the dynamics of the synaptic weights (on a slower timescale). The models contain a background state that is a spontaneously active network state, representing a global attractor of the neuronal dynamics, and a set of attractor states of the neuronal dynamics, corresponding to specific memories. This learning paradigm requires that these input-selective attractor states of the network dynamics (the "memories") are generated by the learning dynamics of the weights. The results here demonstrate that STDP is potentially capable of generating such stable input-selective fixed points. However, such a learning paradigm would require a consideration of other synaptic plasticity mechanisms in order that the weight distribution remain stable once the selective input ceases. The exploration of this double dynamics and the associated additional mechanisms required to stabilize the learned patterns clearly offers many interesting challenges.
The calculations presented here for the four types of input-output time relationships are exactly the same for widely used STDP models with learning dynamics determined by an unstable fixed point. The drift functions for these models are exactly those presented here: the integrals over the input-output time difference probability distribution, $a_D$, $a_P$ in equation 2.6, remain the same, and it is only the weight-dependent potentiation and depression functions, $\alpha_D(w_i)$, $\alpha_P(w_i)$, that differ. However, for these widely used STDP models, the learning dynamics is not changed in any fundamental way: both the value of the unstable fixed point and the rate of change of the weights may be altered, but the dynamics will still produce the bimodal distribution of weights characteristic of these STDP models.

4.1 Effect of Suppressive Interspike Interactions. Recent work has shown that in addition to spike-timing-dependent plasticity, there is some evidence of an interspike suppressive interaction effect (Froemke & Dan, 2002), in which the efficacy of a spike in synaptic modification is suppressed by the preceding spike in the same neuron (i.e., either the presynaptic or the postsynaptic neuron). The resulting change in weight of the synapse connecting the $i$th presynaptic neuron with the $j$th postsynaptic neuron is
modeled as
$$\Delta w_{ij} = \epsilon_i^{\rm pre}\, \epsilon_j^{\rm post}\, F(\Delta t_{ij}), \qquad (4.1)$$
where $\epsilon_i^{\rm pre}$, $\epsilon_j^{\rm post}$ are the efficacies of the presynaptic and postsynaptic neurons, respectively, and $\Delta t_{ij} = t_j^{\rm post} - t_i^{\rm pre}$ (the indices $i$ and $j$ will henceforth be dropped). Both of these efficacies depend on the times of their respective spikes, $\epsilon^{\rm pre}(t_k^{\rm pre} - t_{k-1}^{\rm pre})$ and $\epsilon^{\rm post}(t_k^{\rm post} - t_{k-1}^{\rm post})$, respectively, where $t_k$ is the time of the $k$th spike (input or output, resp.). The particular form of these suppressive interaction functions chosen in Froemke and Dan (2002) is
$$\epsilon^q(t_k^q - t_{k-1}^q) = 1 - \exp\left[ -\frac{(t_k^q - t_{k-1}^q)}{\tau_S^q} \right], \qquad q = \{{\rm pre}, {\rm post}\}, \qquad (4.2)$$
where $\tau_S^q$ is the suppression time constant. The values for the suppression time constants found by best fit to their experimental data from pyramidal neurons in layer 2/3 of rat visual cortical slices are approximately $\tau_S^{\rm pre} = 34$ ms and $\tau_S^{\rm post} = 75$ ms (Froemke & Dan, 2002).

The introduction of the efficacies produces a dependence of the change in weight on the presence of other spikes in both the presynaptic and postsynaptic neurons. If the ISI preceding a spike is very small, then the spike will contribute relatively little to the change in synaptic strength, whereas if the preceding ISI is large, then the change in weight associated with the spike will not be attenuated. This suppression could be mediated by either inhibitory synaptic interactions in the local cortical circuitry or mechanisms in the pre- and postsynaptic neurons, and a number of possible mechanisms were proposed (Froemke & Dan, 2002). Observation of paired-pulse depression (Tsodyks & Markram, 1997; Varela et al., 1997) suggested that suppression between presynaptic spikes during synaptic modification may be due to either short-term depression of transmitter release (Zucker, 1999) or desensitization or saturation of postsynaptic glutamate receptors (Zorumski & Thio, 1992). Suppression between postsynaptic spikes during synaptic modification may be due to Ca$^{2+}$-dependent conductances (Stuart & Sakmann, 1994) or other postsynaptic mechanisms that are involved in activity-dependent synaptic modification (Malenka & Nicoll, 1999).

Including this suppressive interspike interaction in the STDP model I with unrestricted EPSP-AP interactions using the formalism in appendix 5 modifies the corresponding drift function, equation 3.1:
$$A(w_i) = \lambda_{\rm out}\lambda_i\, E^{\rm pre} E^{\rm post} \left[ \tau_P\, \alpha_P(w_i) - \tau_D\, \alpha_D(w_i) \right], \qquad (4.3)$$
where
$$E^{\rm pre} = \lambda_i\, \epsilon^{\rm pre}_L(\lambda_i), \qquad (4.4)$$
$$E^{\rm post} = g_{\theta L}(0) = \int_0^\infty dt\, f_\theta(t)\, \epsilon^{\rm post}(t), \qquad (4.5)$$
where $\epsilon^{\rm pre}_L(s)$ is the Laplace transform of $\epsilon^{\rm pre}(t)$, and $g_{\theta L}(s)$ is the Laplace transform of $g_\theta(t) \equiv f_\theta(t)\, \epsilon^{\rm post}(t)$. Consequently, the means of the background and foreground distributions are identical to those without the suppressive interspike interaction, equation 3.2. The effect of the suppressive interspike interaction on the drift function is an overall multiplicative factor $E^{\rm pre} E^{\rm post}$ that is the same for the potentiation and depression contributions. This factor is simply the product of the two independent terms corresponding to the presynaptic and postsynaptic contributions. This results in a change in the dynamics that affects the rate at which the fixed point is reached but produces no change in the value of the fixed-point weights. For the choice of suppressive interaction functions chosen in Froemke and Dan (2002), the factors are $E^{\rm pre} = (1 + \tau_S^{\rm pre} \lambda_i)^{-1}$ and $E^{\rm post} = [1 - f_{\theta L}(1/\tau_S^{\rm post})]$.

For the $N$-input-restricted case, model II, the corresponding drift function, equation 3.7, becomes
$$A(w_i) = \lambda_{\rm out}\lambda_i\, g_{\theta L}(0) \left\{ \alpha_P(w_i)\,\tau_P\, \lambda_i \epsilon^{\rm pre}_L(\lambda_i) \left[ 1 - \left( \frac{\lambda_i \tau_P}{1 + \lambda_i \tau_P} \right)^N \right] \right.$$
$$\left. {} - \alpha_D(w_i)\,\tau_D \left[ \lambda_i \epsilon^{\rm pre}_L(\lambda_i) - \left( \lambda_i + \frac{1}{\tau_D} \right) \epsilon^{\rm pre}_L\!\left( \lambda_i + \frac{1}{\tau_D} \right) \left( \frac{\lambda_i \tau_D}{1 + \lambda_i \tau_D} \right)^N \right] \right\}. \qquad (4.6)$$

For the input-restricted model, only the presynaptic interspike suppressive interactions described by $\epsilon^{\rm pre}(\Delta t)$ have an effect on the stable fixed-point weight. The postsynaptic interspike suppressive interactions $\epsilon^{\rm post}(\Delta t)$ have an overall multiplicative effect on the drift function, which will affect the rate at which the fixed point is reached but will not affect the value of the fixed-point weight. The magnitude of the change in the fixed-point weight will depend on the precise form of the suppressive interaction function $\epsilon^{\rm pre}(\Delta t)$ as well as the STDP time constants $\tau_D$, $\tau_P$. In the case of van Rossum et al.
plasticity using the interspike suppressive interaction function of Froemke and Dan (2002), we observed that the suppressive interspike interactions tended to reduce the selective potentiation of the higher-rate input synapses and that this effect is largest for higher rates of input.

For the $M$-output-restricted case, model III, the drift function, equation 3.16, becomes
$$A(w_i) = \lambda_{\rm out}\lambda_i^2\, \epsilon^{\rm pre}_L(\lambda_i) \left\{ \alpha_P(w_i)\,\tau_P \left[ g_{\theta L}(0) - g_{\theta L}\!\left( \frac{1}{\tau_P} \right) f^{M-1}_{\theta L}\!\left( \frac{1}{\tau_P} \right) \right] - \alpha_D(w_i)\,\tau_D\, g_{\theta L}(0) \left[ 1 - f^M_{\theta L}\!\left( \frac{1}{\tau_D} \right) \right] \right\}. \qquad (4.7)$$

For the output-restricted model, only the postsynaptic interspike suppressive interactions $\epsilon^{\rm post}(\Delta t)$ have an effect on the stable fixed-point weight.
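The closed-form factors $E^{\rm pre} = (1 + \tau_S^{\rm pre}\lambda_i)^{-1}$ and $E^{\rm post} = [1 - f_{\theta L}(1/\tau_S^{\rm post})]$ quoted above can be checked numerically for the Froemke-Dan efficacy. In the Poisson-output limit, equation 3.12, both factors reduce to the same exponentially weighted average of $\epsilon(t)$, so one quadrature routine covers both; the rates below are illustrative, the time constants are the Froemke-Dan fits.

```python
import math

# Numerical check of the suppression factors E_pre (equation 4.4)
# and E_post (equation 4.5) for the Froemke-Dan efficacy
# epsilon(t) = 1 - exp(-t/tau_S), assuming Poisson outputs
# f_theta(t) = lam_out * exp(-lam_out * t).  Closed forms:
# E_pre = 1/(1 + tau_S_pre*lam_in),
# E_post = 1 - lam_out/(1/tau_S_post + lam_out).  Rates in ms^-1.

def efficacy_factor(rate, tau_s, t_max=3000.0, dt=0.05):
    """Trapezoid approximation of int_0^inf rate*e^(-rate*t)*(1 - e^(-t/tau_s)) dt."""
    total, prev = 0.0, 0.0           # integrand vanishes at t = 0
    steps = int(t_max / dt)
    for k in range(1, steps + 1):
        t = k * dt
        cur = rate * math.exp(-rate * t) * (1.0 - math.exp(-t / tau_s))
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total

TAU_S_PRE, TAU_S_POST = 34.0, 75.0   # Froemke-Dan suppression constants (ms)
LAM_IN, LAM_OUT = 0.020, 0.010       # 20 Hz input, 10 Hz output

e_pre_closed = 1.0 / (1.0 + TAU_S_PRE * LAM_IN)
e_post_closed = 1.0 - LAM_OUT / (1.0 / TAU_S_POST + LAM_OUT)
print(e_pre_closed, efficacy_factor(LAM_IN, TAU_S_PRE))
print(e_post_closed, efficacy_factor(LAM_OUT, TAU_S_POST))
```

Both factors lie strictly between 0 and 1, confirming that suppression slows the approach to the fixed point without reversing the sign of the drift.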
Spike-Timing-Dependent Plasticity
915
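As a quick sanity check on the factor E^pre quoted above, the suppressive interaction function of Froemke and Dan (2002) can be taken as ε^pre(Δt) = 1 − exp(−Δt/τ_S) (their spike-efficacy model; this specific functional form is an assumption here), for which λ_i ε_L^pre(λ_i) reduces to (1 + τ_S λ_i)^{-1}. A minimal numerical sketch with illustrative parameter values:

```python
import math

# Numerical check of E^pre = (1 + tau_S * lam_i)^(-1), assuming the
# suppression function eps^pre(dt) = 1 - exp(-dt/tau_S) (an assumed form).
# E^pre = lam_i * eps_L^pre(lam_i), i.e. the Laplace transform of eps^pre
# evaluated at the input rate lam_i.

lam_i = 20.0      # input rate, Hz (illustrative)
tau_S = 0.034     # suppression time constant, s (illustrative)

# lam_i * integral_0^inf exp(-lam_i t) (1 - exp(-t/tau_S)) dt, by
# midpoint-rule quadrature (the integrand is negligible beyond T = 2 s)
n, T = 200000, 2.0
dt = T / n
E_pre_numeric = sum(
    lam_i * math.exp(-lam_i * (k + 0.5) * dt)
    * (1 - math.exp(-(k + 0.5) * dt / tau_S)) * dt
    for k in range(n)
)

E_pre_closed = 1.0 / (1.0 + tau_S * lam_i)
print(E_pre_numeric, E_pre_closed)   # the two values agree
```

The quadrature and the closed form agree to high precision, confirming that the suppression enters model I only through this single rate-dependent factor.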
The presynaptic interspike suppressive interactions ε^pre(Δt) produce a multiplicative effect on the drift function that will change the rate at which the fixed point is reached but will not affect the value of the fixed-point weight. The magnitude of the fixed-point weight now depends on the precise form of the postsynaptic suppressive interaction function ε^post(Δt) as well as the STDP time constants τ_D, τ_P. The full expression for the N-input, M-output-restricted case, model IV, of the drift function, equation 3.21, is given by equation B.29. This expression essentially produces effects on the drift function and fixed-point weights that are a combination of those discussed in relation to the input- and output-restricted models above: the pre- and postsynaptic suppressive interaction functions will modify both the STDP fixed point of the weight and the rate at which the learning dynamics drives the weight to the fixed point. Consequently, suppressive interspike interactions will have an effect on the learning dynamics of STDP (i.e., a change of the corresponding drift function), but they cause a modification of the fixed-point weights only when the STDP is either input or output restricted. The expressions given above for the modified drift functions, equations 4.3, 4.6, 4.7, and B.29, are given in terms of arbitrary suppressive interaction functions, ε^pre(Δt), ε^post(Δt), and there is clearly a need for more experimental data on the form of these functions.

4.2 Activity-Dependent Synaptic Scaling (ADSS) and Metaplasticity. The results presented here for STDP have shown the relationship of input-selective spike-based learning with conventional Hebbian rate-based learning.
However, a complete understanding of the role of STDP in learning requires consideration of issues of metaplasticity (Abraham & Bear, 1996; Abbott & Nelson, 2000), that is, how synaptic plasticity is modulated and controlled by effects such as prior synaptic activity, membrane voltage (Fusi, 2003), and calcium concentration (Shouval, Castellani, Blais, Yeung, & Cooper, 2002). There are two main issues that need to be addressed in relation to the broader question of the role of STDP in specific learning schemes. First, our paradigm of input-selective spike-based learning relies on the existence of a stable fixed point of the background spiking rate (i.e., spontaneous activity) against which the input-selective weights are potentiated. It is clear that STDP alone cannot generate such a state, since a nonspiking network state would never evolve into a spiking state. Consequently, some form of activity-dependent scaling of the weights is necessary, and we present below a unified framework of STDP and ADSS. Second, once a set of input-selective foreground weights has been generated, there must be some metaplasticity mechanism to ensure that they remain stable and do not return to their unpotentiated (background) value when the selective input is turned off. While the full investigation of such issues is outside the scope of this study, some remarks concerning ADSS and its relationship to STDP are appropriate here, particularly since ADSS introduces competition between synapses.

916
A. Burkitt, H. Meffin, and D. Grayden

Activity-dependent scaling of the synaptic weights (i.e., dependent on the long-time-average output spiking rate of the postsynaptic neurons) has been experimentally observed in neocortical (Turrigiano et al., 1998), hippocampal (Lissin et al., 1998), and spinal cord neurons (O'Brien et al., 1998), and occurs on a much slower timescale than STDP. Recent experimental data suggest that there are two independent signals that regulate ADSS in cortical pyramidal neurons: (1) low levels of brain-derived neurotrophic factor (BDNF) cause excitatory synapses to scale up in strength (Desai, Rutherford, & Turrigiano, 1999), and (2) sustained depolarization of a cortical neuron over an extended time causes excitatory synapses to scale down in strength (Leslie, Nelson, & Turrigiano, 2001). The first, BDNF-mediated mechanism appears to be responsible for increasing synaptic strength when the output spiking rate is low (but there is no evidence that high levels of BDNF reduce synaptic weights in cortical neurons). The second, depolarization-mediated mechanism is responsible for reducing the synaptic strength of neurons with high output spiking rates, since the output spiking rate is strongly dependent on the mean membrane depolarization. Together, these mechanisms bring about a global multiplicative rescaling of the weights; all weights are scaled up or down by the same multiplicative factor (Turrigiano et al., 1998). Both of these ADSS mechanisms appear to function on a long timescale (from a couple of hours up to two days) that is orders of magnitude slower than the STDP mechanism involving presynaptic and postsynaptic spikes considered here.
Nevertheless, it is possible to incorporate ADSS and STDP into a single framework by including terms in the weight update, equation 2.1, that are independent of the presynaptic and postsynaptic activity (α⁰(w), corresponding to the BDNF mechanism), depend only on postsynaptic spikes (α^post(w), corresponding to the sustained depolarization mechanism outlined above), and depend only on presynaptic spikes (α^pre(w)). The generalization of equation 2.4 for the time evolution of the mean weight is (Kempter et al., 1999; van Hemmen, 2001)

$$\frac{d\bar{w}(t)}{dt} = \alpha^{0}(\bar{w}) + \alpha^{\rm post}(\bar{w})\,\lambda_{\rm out} + \alpha^{\rm pre}(\bar{w})\,\lambda_{\rm in} + A^{\rm pre,post}(\bar{w}), \tag{4.8}$$
where A^pre,post(w) is the bilinear (Hebbian) term considered in this article. In such an expanded STDP model, the multiplicative ADSS scaling equation (van Rossum et al., 2000) is equivalent to the following form for the coefficients,

$$\alpha^{0}(w) = \beta w \lambda_B, \qquad \alpha^{\rm post}(w) = -\beta w, \qquad \alpha^{\rm pre}(w) = 0, \tag{4.9}$$

where β > 0 is a constant determining the strength and timescale of the synaptic scaling, and λ_B is the (background) spontaneous spiking rate (i.e.,
the desired postsynaptic activity). In the absence of structured synaptic input (i.e., A^pre,post(w) = 0), this has the effect of scaling the synaptic weights in such a way that the output spiking rate evolves toward λ_B. There are two important effects of ADSS. First, the joint effect of the two mechanisms above (corresponding to α⁰(w) and α^post(w)) is to establish a homeostasis, whereby networks of neurons adjust their synaptic weights up or down as a function of the long-time average of their spiking activity. This maintains the neurons in a state that allows them to be responsive to the bilinear terms in STDP that contain information about structure in the synaptic input (Song et al., 2000). The second important function of ADSS is that it introduces a competitive mechanism between the synaptic weights (van Rossum et al., 2000). Competition has been extensively studied both in rate-based Hebbian learning (Miller & MacKay, 1994) and in additive STDP (Kistler & van Hemmen, 2000; Song et al., 2000; Kempter et al., 2001; Song & Abbott, 2001), where it arises naturally as a result of STDP. However, in stable fixed-point models of STDP, equation 2.1, and with equal STDP time constants (τ_P = τ_D), there is little or no competition between synapses (van Rossum et al., 2000): spike-time-correlated inputs give rise to larger weights but do not depress the weights of the uncorrelated inputs, as discussed above. ADSS acts as a competitive mechanism by rescaling all weights to lower values, thereby effectively depressing the weights of the potentiated inputs relative to those of the unpotentiated inputs. This same mechanism will function in the cases studied here, where the STDP time constants are different and the foreground weights are potentiated relative to the background weights (i.e., models II and IV).
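The homeostatic effect of the two ADSS mechanisms can be sketched by integrating the mean-weight equation with the coefficients of equation 4.9 and no structured input (A^pre,post(w) = 0). The input-output transfer function lam_out(w) below is a hypothetical linear rate model introduced only for illustration, and all parameter values are assumed:

```python
# Sketch of the mean-weight equation 4.8 with the multiplicative ADSS
# coefficients of equation 4.9 and no structured input (A^{pre,post} = 0):
#     dw/dt = beta*w*lam_B - beta*w*lam_out(w).
# The transfer function lam_out(w) is a stand-in linear rate model,
# NOT taken from the article; all parameter values are illustrative.

beta = 0.05    # strength/timescale of synaptic scaling (assumed)
lam_B = 5.0    # background (target) output spiking rate, Hz
lam_in = 10.0  # input spiking rate, Hz (assumed)
gain = 2.0     # hypothetical gain of the linear rate model

def lam_out(w):
    # stand-in input-output transfer function
    return gain * w * lam_in

def drift(w):
    # alpha^0(w) + alpha^post(w)*lam_out(w), with alpha^pre(w) = 0
    return beta * w * lam_B - beta * w * lam_out(w)

w, dt = 1.0, 0.01
for _ in range(20000):
    w += dt * drift(w)

# At the fixed point lam_out(w*) = lam_B, i.e. w* = lam_B/(gain*lam_in) = 0.25
print(w, lam_out(w))
```

Whatever the initial weight, the scaling terms drive the output rate to the background value λ_B, illustrating the homeostasis described above.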
Consequently, although there is no explicit competition between synapses in the stable fixed-point STDP investigated here, ADSS nevertheless provides such a competitive mechanism. Because ADSS occurs on a much slower timescale than STDP, it has not been included in our study, but its effect is simply to rescale the weights by a multiplicative factor after the changes to the weights that are generated by STDP.

4.3 Width of the Weight Distribution. A feature of spike-based neural plasticity that is not present in rate-based models is a description of the width of the weight distribution. The Fokker-Planck analysis, which is based on the stochastic nature of the weight increments that result from the input and output spike times, enables the calculation of the complete weight distribution, equation 2.13. The calculation of the drift function A(w) is straightforward, since it is only necessary to sum every individual contribution to the change in synaptic weight. However, the calculation of the diffusion function, B(w), requires that we identify those components of potentiation and depression that are dependent and those that are independent. (Previous studies using the Fokker-Planck formalism have typically obtained an approximation for the diffusion function, B(w), by assuming that each plasticity event is independent and that the output APs are independent; van Rossum et al., 2000.)
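For intuition, a weight distribution of this type can be sketched numerically using the standard stationary solution of a one-dimensional Fokker-Planck equation, p(w) ∝ B(w)^{-1} exp(2∫^w A(u)/B(u) du), with a linear drift and quadratic diffusion of the van Rossum et al. form; the coefficient values below are illustrative, not those of the article:

```python
import numpy as np

# Sketch of a stationary Fokker-Planck weight distribution:
#     p(w) proportional to (1/B(w)) * exp( 2 * int^w A(u)/B(u) du ),
# with linear drift A(w) = A0 - A1*w and quadratic diffusion
# B(w) = B0 - B1*w + B2*w**2, as for van Rossum et al. plasticity.
# The coefficient values are illustrative, not taken from the article.

A0, A1 = 0.5, 1.0
B0, B1, B2 = 0.02, 0.01, 0.02

w = np.linspace(0.0, 2.0, 4001)
dw = w[1] - w[0]
A = A0 - A1 * w
B = B0 - B1 * w + B2 * w**2          # strictly positive for these values

# cumulative (left-point) integral of 2A/B, then the unnormalized density
phi = np.concatenate(([0.0], np.cumsum(2.0 * A / B * dw)[:-1]))
p = np.exp(phi - phi.max()) / B
p /= p.sum() * dw                     # normalize to unit area

mean = (w * p).sum() * dw
var = ((w - mean) ** 2 * p).sum() * dw
print(mean, var)                      # mean near A0/A1, variance near 0.01
```

The resulting density peaks near the drift fixed point A0/A1, with a width set by the diffusion coefficients, which is the quantity analyzed in this section.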
A full calculation of the diffusion function, B(w), is beyond the scope of this article for model I (in which there is no restriction on the time extent of the input-output interactions), since the changes in weight caused by successive inputs or outputs, or both, are correlated. However, it is possible to calculate the diffusion function, B(w), for particular cases of the other models, which have a restricted time extent of the input-output interaction: for model II with N = 1 and 1/2, the contributions from the ISIs of the synaptic inputs are independent (they have a Poisson distribution, and the EPSP-AP interactions are all within one ISI), whereas for model III with M = 1, the contributions from the ISIs of the output spikes are independent, and for model IV with M, N = 1, the contributions from both the input and output ISIs can be treated as independent. The width of the weight distributions can be evaluated numerically directly from the Fokker-Planck distribution, equation 2.13, or in the case of van Rossum et al. plasticity, it can be obtained from the self-consistency relation, equation 2.16. An illustration of the resulting widths of the weight distributions is given in Figure 10 for model IV with M, N = 1 and van Rossum et al. plasticity, which shows a plot of σ_F versus λ_F in the case R_τ = 2.0 (other parameters are the same as in Figure 9). The solid line is the analytic solution given by equation 2.16, with values for the coefficients given by equations 3.18 and B.28, and the dashed line is the analytic solution obtained by taking the Poisson limit for the output spike distribution, equation 3.12. The solid symbols represent the result of numerical simulations. The value of σ_F depends on the output spiking rate, λ_out, which varies from 5 Hz (at λ_F = 5 Hz) to 147 Hz (at λ_F = 100 Hz).
Notice that the Poisson distribution results at low output spiking rates (i.e., low values of λ_F) converge with the results obtained using the first passage time of the leaky integrate-and-fire model with synaptic conductances. This is unsurprising, since at low output spiking rates, we expect that the output spiking distribution is Poisson-like. There are three principal effects contained in the calculation of B(w) in equation B.12: (1) there is a nonzero correlation between the potentiation and depression contributions (corresponding to the terms R₃, R₄, R₅ in equation B.13 and to the nonzero linear coefficient B₁ for van Rossum et al. plasticity), (2) there are terms that represent the correlations within the potentiation and depression contributions, respectively (corresponding to the terms R₁, R₂ in equation B.13 and to the additional terms in the coefficients B₀ and B₂ in van Rossum et al. plasticity), and (3) B(w) depends on the first passage time density f_θ(t) (through its Laplace transform f_{θL}(s)), and consequently the coefficients reflect the regularity of the output APs. This third effect becomes less important at low output spiking rates, where we find that approximating the output spiking rate by a Poisson distribution (with the same spiking rate) gives good agreement with numerical simulations, as can be seen for model IV in Figure 10 by the convergence of the plots as λ_F → λ_B = 5 Hz. Representative distributions for the background and foreground weights are illustrated in Figure 11 for model II with N = 1 and van Rossum et al.
Figure 10: Model IV with N = M = 1 and van Rossum et al. plasticity. The standard deviation σ_F plotted as a function of the spiking rate of the foreground inputs λ_F. Parameter values are as for Figure 9. The solid line corresponds to the analytic results of equation 2.16 and equation B.28, and the dashed line corresponds to the values obtained using a Poisson distribution of output spikes with the same output spiking rate. Solid symbols represent the results for the average weight of the foreground synapses obtained by numerical simulation after 10,000 APs.
plasticity, in which the histograms give the two distributions obtained from numerical simulation. The parameters used are N_B = 1500, N_I = 400, N_F = 100, λ_B = 5 Hz, λ_F = 50 Hz, g_I = 0.002, c_P = 0.0000315, c_D = 0.01, τ_P = 17 ms, and τ_D = 34 ms. These parameter values result in an output spiking rate of λ_out = 35.2 Hz. The histogram on the left (unshaded) represents the background synapses, and the histogram on the right (black) represents the foreground synapses that have a higher input rate. The distributions obtained from the Fokker-Planck equation, equation 2.13, with the analytic expressions for the coefficients A_i, B_i, equations 3.5 and B.13, are given by the solid lines, which provide a good fit to the numerical simulation data. The dot-dashed curve (plotted here only for the foreground weights), which has the largest variance of the three curves plotted for the foreground weight distribution, is the "naive" Fokker-Planck distribution obtained by assuming that each individual contribution to the change in weight is independent (i.e., neglecting correlations between the potentiation and depression components, which is equivalent to neglecting the coefficient B₁ in van Rossum et al. plasticity). The corresponding naive diffusion function B(w) is given by

$$B(w) = \lambda_{\rm out}\lambda_X \left[ \frac{\alpha_P^2(w)}{\lambda_X + \frac{2}{\tau_P}} + \frac{\alpha_D^2(w)}{\lambda_X + \frac{2}{\tau_D}} \right]. \tag{4.10}$$

Figure 11: Distribution of weights for model II with N = 1 and van Rossum et al. plasticity. Histograms show results of numerical simulation (averaged over 10 independent weight configurations) of the distributions of the excitatory conductances g_E (weights) for the N_B = 1500 background synapses (larger unshaded peak on left) and N_F = 100 foreground synapses (smaller shaded peak on right). Input rates for background and foreground synapses are λ_B = 5 Hz and λ_F = 25 Hz. The solid lines are the (scaled) Fokker-Planck probability distribution, equation 2.13, with A(w) and B(w) given by equations 3.5 and B.13. The dashed lines (mostly obscured for the foreground synapses) are the Fokker-Planck distributions when the output spike distribution is approximated by a Poisson distribution with the same output spiking rate of 35.2 Hz. The dot-dashed line (only for the foreground weights) is the naive Fokker-Planck distribution, equation 4.10. Other parameter values are as given in section 2.2 and the text.
The effect of neglecting these correlation terms is that the distribution of weights is broader than that found in numerical simulations, as illustrated for model II in the dot-dashed line of Figure 11. The dashed lines on the plot (mostly obscured for the foreground synapses and with the intermediate
variance of the three curves representing the background synapses) are the distributions obtained by approximating the first passage time density of the output spikes by a Poisson distribution, equation 3.12. By comparing the actual weight distribution with that obtained using a Poisson distribution for the output APs, it is possible to gain some insight into the effect of the spiking dynamics. The spiking dynamics is contained in equation B.13 (the solid lines in Figure 11) but is neglected when a Poisson distribution for the output APs is used (dashed lines). Consequently, because the integrate-and-fire spiking dynamics causes the output spikes to be considerably more regular (i.e., smaller coefficient of variation of the output spike distribution) than the equivalent Poisson distribution, there is a resulting difference between the distributions, observed here in the solid and dashed lines in Figure 11.

4.4 Discussion of Assumptions and Approximations. A number of assumptions and approximations are made in this study. First, we have chosen a model of STDP whose weight dynamics is determined by a stable fixed point. An example of such a model is that investigated by van Rossum et al. (2000), with additive potentiation and multiplicative depression. While there is some experimental support for such weight-dependent STDP (Bi & Poo, 1998; van Rossum et al., 2000), this is still a matter of experimental investigation, and it may be present only in a particular class of neurons or neural systems. The results are given here for general weight-dependent potentiation and depression functions α_D(w), α_P(w), and the stability condition for the fixed point is given in terms of these functions (see equation 2.8 and the following text). It is clearly of considerable interest to determine more accurately the form of these functions from experimental studies.
Second, the STDP time window F(Δt) used in this study, equation 2.2, is modeled as two exponentials, with independent time constants for the potentiation and depression components. Such a model is both consistent with the experimental data (Bi & Poo, 1998) and relatively straightforward to handle mathematically. However, it is possible that different types of neurons have different shapes of STDP window, as proposed by Bi and Poo (2001) and examined in the case of additive STDP by Câteau and Fukai (2003). The effect of different STDP time windows will be to change the numerical value of the right side of equation 2.7. Consequently, the details of the STDP time window will affect the actual value of the stable fixed-point mean weights but will not materially affect the conclusions of this study. Note that an important feature of STDP as formulated in this and other studies is that the weight-dependent and time-dependent components (corresponding to α(w) and F(Δt) in equation 2.1) are independent. Third, a Poisson distribution for the input spike trains is used. Although this is unlikely to be critical for the results presented here, the methods of analysis may not necessarily be applicable to networks, since the input spike-timing distribution for a neuron in a network could differ significantly from the Poisson distribution. Consequently, while our results show that an
input-selective stable fixed point of the learning dynamics exists, it remains for future studies to elucidate the extent to which such STDP fixed points determine the organization of the synaptic connectivity in recurrent networks under the combined influence of the network dynamics and learning dynamics. It should also be noted that studies of LTP and LTD typically use sequences of spikes that are highly temporally structured, in contrast to the Poisson distribution of synaptic inputs studied here. Fourth, learning of the inhibitory conductances (Holmgren & Zilberter, 2001) is not included, since the timing dependence of the synaptic modification of inhibitory synapses is not yet well established. However, it is straightforward to extend the formalism presented here to include inhibitory synapses. In addition, the spike-triggering effect is not included in the analysis here. This effect is small in the situation where the neuron receives a large number of small-amplitude synaptic inputs that have no spike-time correlation, and it will not have a significant effect on the results presented here. Likewise, we have considered only the case in which the input spikes do not have any spike-time correlations. When the weight dynamics is not determined by a stable fixed point, such spike correlations can cause the spike-triggering effect to be greatly amplified by the instability (Gerstner, 2001; Kempter et al., 2001). The effect on the models studied here of such correlations is the subject of a separate report. Also, the expressions have been derived in the diffusion approximation, which neglects terms of higher order in the weight step amplitude (i.e., the Fokker-Planck equation used here neglects higher-order terms in the Kramers-Moyal expansion), and this will give rise to the slight difference observed between the analytical and numerical results.
5 Conclusions

In this study, we provide an analysis of spike-timing-based models of neural plasticity that examines their relationship with classical studies of LTP and LTD and with rate-based Hebbian models of learning whose weight dynamics is determined by a stable fixed point. The analysis is carried out using a leaky integrate-and-fire neuronal model with synaptic conductances. The evolution of the synaptic weights is based on the Langevin equation, which provides a description of the mean weights (Kempter et al., 1999) and is closely related to the Fokker-Planck approach (Kempter et al., 1999; van Rossum et al., 2000). We have examined the effect of two biologically important refinements of the timing dependence of synaptic plasticity. First, we analyze the effect of limiting the time extent of the EPSP-AP interactions, whose importance was highlighted in recent experimental studies (Sjöström et al., 2001). Second, we analyze the effect of time asymmetry of the STDP time window, in which the time constant associated with synaptic depression τ_D is larger than that associated with potentiation τ_P (Bi & Poo, 1998).
One motivation for these studies was earlier results on a model of STDP with a stable fixed point of the learning dynamics (van Rossum et al., 2000) that failed to show input rate selectivity; that is, it was found that higher-rate uncorrelated synaptic inputs are not selectively potentiated relative to lower-rate inputs. The results here indicate that for the models whose weight dynamics is determined by a stable fixed point, there is no selective potentiation of synapses that have higher rates of uncorrelated input if the STDP time constants for potentiation and depression are equal. Selective potentiation of synapses with higher input rates is possible only for models in which the time extent of the EPSP-AP interactions is input restricted and the time constant for depression is larger than for potentiation (τ_D > τ_P). The effect of suppressive interspike interactions is analyzed for each of the input-output restricted models, and the results show that such interactions can modify the learning dynamics, but that the values of the STDP fixed-point weights are modified only for input- or output-restricted STDP models. The corresponding drift functions for all models studied here are given for arbitrary forms of the presynaptic and postsynaptic suppression functions. For particular cases of the three classes of models examined here involving some form of time-restricted input-output interaction (models II, III, and IV), we are able to calculate the full distribution of the excitatory conductances (i.e., the function B(w)) by identifying the independent components of potentiation and depression. The width of the distribution was found to depend on both the first passage time density of the output APs (B(w) is a function of f_θ(t)) and the correlation between potentiation and depression contributions (e.g., a nonzero value of B₁ in the case of van Rossum et al. plasticity).
In order to gain an understanding of the role of STDP in the wider context of learning and network activity, we have investigated how it may be related to Hebbian rate-based models of learning with stable fixed points of the weight dynamics that are input selective; that is, the (foreground) synapses associated with higher-rate inputs become potentiated relative to the remaining (background) synapses, whose only input is spontaneous activity. The input-restricted models of STDP are the only ones found to have the input-selective stable fixed-point behavior. A more comprehensive description of models of STDP whose weight dynamics is determined by a stable fixed point would require the analysis of correlated synaptic input, the spike-triggering effect, and multiple patterns. These issues, together with a more comprehensive analysis of suppressive interspike interactions and an examination of the relationship between STDP and models of working memory, are the subject of a separate investigation. It would also be of interest to model the evolution of inhibitory weights, based on the experimental results for their plasticity (Rutherford, Nelson, & Turrigiano, 1998; Holmgren & Zilberter, 2001). In addition to the stable fixed points of the weight evolution equation, equation 2.7, examined here, a more thorough study of the dynamics of weight evolution could be carried out, examining the trajectories of the weights and the time constant of
approach to their stable fixed-point values. These and other questions are currently under investigation. This study highlights the importance of experimental data that will provide accurate quantitative measures of a number of the key variables in the timing dependence of synaptic modification. (1) The results here have been formulated for arbitrary weight-dependent potentiation and depression functions, α_P(w), α_D(w), and the question of whether the weight dynamics is determined by a stable or unstable fixed point depends on the weight dependence of these functions, which could vary across different neurons or neural systems. (2) The implications of various restrictions in the time extent of the EPSP-AP interaction have been explored here, and our results show that such restrictions have significant implications for the learning behavior. (3) The analysis of the suppressive interspike interactions has been formulated in a fashion that allows it to be applied to any functional form of the presynaptic or postsynaptic suppression (ε^pre and ε^post, respectively), which require accurate experimental determination. In addition, experimental data could shed light on interesting questions concerning the weight dynamics, such as the timescale required for a weight distribution to reach its fixed point.

Appendix A: Self-Consistent Calculation of Variance

In the situation where the drift function, A(w), is linear and the diffusion function, B(w), is quadratic, it is possible to find the variance of the distribution of the weights in the following self-consistent way. Consider the Langevin equation 2.3 and use the following properties of the gaussian noise,

$$E\{\xi(\Delta t)\} = 0, \qquad E\{\xi^2(\Delta t)\} = \Delta t. \tag{A.1}$$
The time-dependent functions η_W(t; w₀) and Γ_W(t; w₀) are the following expectation values of the mean and variance:

$$\eta_W(t; w_0) = E\{w(t);\ w(0) = w_0\}$$
$$\Gamma_W(t; w_0) = E\{w^2(t);\ w(0) = w_0\} - [E\{w(t);\ w(0) = w_0\}]^2.$$

The functions η_W(t; w₀) and Γ_W(t; w₀) are evaluated in a self-consistent way by considering how they change from time t to t + Δt (neglecting terms of higher order in Δt):

$$E\{w(t+\Delta t);\ w(0)=w_0\} = \int_{-\infty}^{\infty} dw\ p_W(w, t \mid w_0, 0)\,[w + A(w)\Delta t]$$
$$E\{w^2(t+\Delta t);\ w(0)=w_0\} = \int_{-\infty}^{\infty} dw\ p_W(w, t \mid w_0, 0)\,[w^2 + 2wA(w)\Delta t + B(w)\Delta t].$$
It is possible to evaluate these expressions when A(w) ≡ A₀ − A₁w is linear and B(w) ≡ B₀ − B₁w + B₂w² is quadratic, as for van Rossum et al. plasticity. The expressions for η_W(t; w₀) and Γ_W(t; w₀) are obtained by evaluating the resulting gaussian integrals in the above expressions and taking the limit Δt → 0:

$$\frac{d\eta_W(t; w_0)}{dt} = A_0 - A_1\,\eta_W(t; w_0)$$
$$\frac{d\Gamma_W(t; w_0)}{dt} = (B_2 - 2A_1)\,\Gamma_W(t; w_0) + B_0 - B_1\,\eta_W(t; w_0) + B_2\,\eta_W^2(t; w_0).$$

The solutions are straightforwardly found using Laplace transforms,

$$\eta_W(t; w_0) = \frac{A_0}{A_1}\,(1 - e^{-A_1 t})$$
$$\Gamma_W(t; w_0) = \sigma_W^2\,[1 - e^{-(2A_1 - B_2)t}] + \Gamma_1\,[e^{-A_1 t} - e^{-(2A_1 - B_2)t}] + \Gamma_2\,[e^{-2A_1 t} - e^{-(2A_1 - B_2)t}],$$

where

$$\sigma_W^2 = \frac{B_0 - B_1 A_0/A_1 + B_2 A_0^2/A_1^2}{2A_1 - B_2}, \qquad \Gamma_1 = \frac{B_1 A_0/A_1 - 2B_2 A_0^2/A_1^2}{A_1 - B_2}, \qquad \Gamma_2 = -\frac{A_0^2}{A_1^2}.$$
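The stationary limit of this result can be checked by direct simulation of the Langevin equation with a linear drift and quadratic diffusion; the coefficient values below are illustrative, not taken from the article:

```python
import numpy as np

# Monte Carlo check of the self-consistent variance: simulate the Langevin
# equation dw = A(w) dt + sqrt(B(w)) dW with linear A(w) = A0 - A1*w and
# quadratic B(w) = B0 - B1*w + B2*w**2, then compare the long-time variance
# across paths with
#     sigma_W^2 = (B0 - B1*A0/A1 + B2*A0**2/A1**2) / (2*A1 - B2).
# Coefficient values are illustrative.

rng = np.random.default_rng(0)
A0, A1 = 0.5, 1.0
B0, B1, B2 = 0.02, 0.01, 0.02

sigma2 = (B0 - B1 * A0 / A1 + B2 * A0**2 / A1**2) / (2 * A1 - B2)

dt, n_steps, n_paths = 1e-3, 20000, 4000
w = np.full(n_paths, A0 / A1)       # start each path at the drift fixed point
for _ in range(n_steps):
    A = A0 - A1 * w
    B = np.maximum(B0 - B1 * w + B2 * w**2, 0.0)   # guard against negative B
    w += A * dt + np.sqrt(B * dt) * rng.standard_normal(n_paths)

print(w.var(), sigma2)              # agreement to within a few percent
```

Because the moment equations close exactly for linear drift and quadratic diffusion, the sampled variance matches the analytic expression up to Monte Carlo error.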
The variance of the distribution of weights as t → ∞ is given by σ_W². Note that 2A₁ > B₂ for all biologically plausible parameters. This expression for σ_W² is positive definite for all parameters investigated.

Appendix B: Calculation of A(w) and B(w)

B.1 Model I. The contribution to the change in synaptic weight due to a single output spike (AP) is calculated by integrating over the distribution of input times according to

$$\iint\gamma = \lambda_{\rm out} \int_0^\infty dT_{P1}\, p_{\rm in}(T_{P1}) \int_{T_{P1}}^\infty dT_{P2}\, p_{\rm in}(T_{P2} - T_{P1}) \cdots \int_{T_{P(k-1)}}^\infty dT_{Pk}\, p_{\rm in}(T_{Pk} - T_{P(k-1)}) \cdots$$
$$\times \int_0^\infty dT_{D1}\, p_{\rm in}(T_{D1}) \int_{T_{D1}}^\infty dT_{D2}\, p_{\rm in}(T_{D2} - T_{D1}) \cdots \int_{T_{D(k-1)}}^\infty dT_{Dk}\, p_{\rm in}(T_{Dk} - T_{D(k-1)}) \cdots, \tag{B.1}$$
where the output spike is taken as occurring at time t = 0 and T_Pk (T_Dk) is the arrival time of the kth input EPSP before (following) the spike. The input EPSPs are taken to be Poisson distributed, so that their probability density (and Laplace transform, denoted by the subscript L) is

$$p_{\rm in}(t) = \lambda_{\rm in} \exp(-\lambda_{\rm in} t), \qquad p_{\rm in,L}(s) = \frac{\lambda_{\rm in}}{s + \lambda_{\rm in}}. \tag{B.2}$$

The expression for A(w) is given by

$$A(w) \equiv \lambda_{\rm in} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-T_{Pn}/\tau_P} - \alpha_D(w) \sum_{n=1}^{\infty} e^{-T_{Dn}/\tau_D} \right]. \tag{B.3}$$
The resulting expression is given in equation 3.1. The above calculation has been carried out from the perspective of a single output spike. It is equally possible to perform the calculation from the perspective of a single input EPSP, and it is straightforward to show that the resulting expression for A(w) is identical. The calculation for B(w), however, is much more problematic, since the contributions arising from different APs (or equivalently, from individual EPSPs) are not independent. This can also be seen by calculating the naive expressions obtained from the perspective of an input EPSP and from an output spike, which are found to differ. The full calculation would involve the simultaneous averaging over the time distributions of both the input EPSPs and the output spikes.

B.2 Model II. The contribution to the change in synaptic weight due to a single incoming EPSP is calculated by considering a single ISI of the input EPSPs. Since the input EPSPs are Poisson distributed and the APs interact only with the EPSPs over one input ISI, the contribution from each input ISI can be considered and summed separately. We define the following (normalized) integral over the distribution of input and output times,

$$\iint\gamma \equiv \int_0^\infty dT^{\rm in}\, p_{\rm in}(T^{\rm in}) \int_0^\infty dT_1\, p_1(T_1) \int_{T_1}^\infty dT_2\, f_\theta(T_2 - T_1) \cdots \int_{T_{k-1}}^\infty dT_k\, f_\theta(T_k - T_{k-1}) \cdots, \tag{B.4}$$
where the previous input is taken to occur at time t = 0, T^in is the arrival time of the subsequent input (which we take to be Poisson distributed), and T_k is the time of the kth output spike following the previous input EPSP. Note that this integral does not enforce the APs to be within an input ISI, but this bound will be imposed in the definition of A(w) and B(w) below. The expression for p_in(t) is given in equation B.2. f_θ(t) is the first passage time density of the output spikes, whose Laplace transform is given by
equation 2.20, and $p_1(t)$ is the probability density of the first spike following the input EPSP at $t = 0$,
$$p_1(t) = \lambda_{out} \int_t^\infty dt'\, f_\theta(t'), \qquad p_{1L}(s) = \frac{\lambda_{out}\left[1 - f_{\theta L}(s)\right]}{s}. \tag{B.5}$$
The expressions for the drift function, $A(w)$, and diffusion function, $B(w)$, are obtained using the Fokker-Planck approach, equation 2.5 (van Kampen, 1992; Risken, 1996),
$$A(w) \equiv \lambda_{in}\,\gamma\!\left[\alpha_P(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_P}\, H(T^{in} - T_n) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-(T^{in} - T_n)/\tau_D}\, H(T^{in} - T_n)\right] \tag{B.6}$$
$$B(w) \equiv \lambda_{in}\,\gamma\!\left[\alpha_P(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_P}\, H(T^{in} - T_n) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-(T^{in} - T_n)/\tau_D}\, H(T^{in} - T_n)\right]^2.$$
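The Laplace-transform identity in equation B.5 can be checked numerically for a concrete choice of first-passage density. The exponential $f_\theta$ used below is an illustrative assumption only (the model's actual $f_\theta$ is given by equation 2.20):

```python
import numpy as np

# Numerical check of equation B.5: p1L(s) = lam_out * (1 - f_thetaL(s)) / s.
# Assumption for illustration: f_theta(t) = mu * exp(-mu * t), so that
# f_thetaL(s) = mu / (mu + s) and p1(t) = lam_out * exp(-mu * t).
mu, lam_out, s = 2.0, 1.5, 0.7

dt = 1e-4
t = np.arange(0.0, 60.0, dt)
p1 = lam_out * np.exp(-mu * t)                     # p1(t) = lam_out * tail of f_theta

p1L_numeric = np.sum(p1 * np.exp(-s * t)) * dt     # Laplace transform by quadrature
p1L_formula = lam_out * (1.0 - mu / (mu + s)) / s  # right-hand side of B.5
print(p1L_numeric, p1L_formula)
```

Both evaluations agree to the accuracy of the quadrature, as they must, since for this $f_\theta$ the tail integral and the transform are available in closed form.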
Each integral represents a single ISI of the input EPSPs, and the factor $\lambda_{in}$ is the rate of the input ISIs. The Heaviside step functions $H(t)$ ensure that only the contributions from APs within the ISI are included in the respective potentiation and depression sums. The contributions are calculated using equation B.4: $S_{P,n}$ and $S_{D,n}$ are the contributions to the potentiation and depression, respectively, due to the $n$th spike,
$$S_{Pn}(\tau_P) \equiv \gamma\!\left[e^{-T_n/\tau_P}\, H(T^{in} - T_n)\right] = \frac{\lambda_{out}\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)\right] f_{\theta L}^{\,n-1}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)}{\lambda_{in}+\frac{1}{\tau_P}} \tag{B.7}$$
$$S_{Dn}(\tau_D) \equiv \gamma\!\left[e^{-(T^{in} - T_n)/\tau_D}\, H(T^{in} - T_n)\right] = \frac{\lambda_{out}\left[1 - f_{\theta L}(\lambda_{in})\right] f_{\theta L}^{\,n-1}(\lambda_{in})}{\lambda_{in}+\frac{1}{\tau_D}}.$$
Summing over all input spikes gives
$$S_P(\tau_P) = \sum_{n=1}^{\infty} S_{Pn}(\tau_P) = \frac{\lambda_{out}}{\lambda_{in}+\frac{1}{\tau_P}}, \qquad S_D(\tau_D) = \sum_{n=1}^{\infty} S_{Dn}(\tau_D) = \frac{\lambda_{out}}{\lambda_{in}+\frac{1}{\tau_D}}. \tag{B.8}$$
928
A. Burkitt, H. Meffin, and D. Grayden
The drift function, $A(w)$, defined in equation 2.5, is given by
$$A(w) = \lambda_{in}\left[\alpha_P(w)\, S_P(\tau_P) - \alpha_D(w)\, S_D(\tau_D)\right]. \tag{B.9}$$
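The geometric sums that take equation B.7 to equation B.8 are easy to verify directly. The value used for the Laplace transform $f_{\theta L}$ below is an arbitrary stand-in in $(0, 1)$, not one computed from a particular neuron model:

```python
# Check that the per-spike contributions of equation B.7 sum to the closed
# forms of equation B.8; the result is independent of the value of f_thetaL.
lam_in, lam_out, tau_P, tau_D = 5.0, 3.0, 0.020, 0.050

def summed_contributions(tau, f, n_terms=5000):
    # S_n = lam_out * (1 - f) * f**(n-1) / (lam_in + 1/tau), summed over n >= 1
    a = lam_in + 1.0 / tau
    return sum(lam_out * (1.0 - f) * f ** (n - 1) / a for n in range(1, n_terms + 1))

S_P = summed_contributions(tau_P, 0.6)   # stand-in for f_thetaL(lam_in + 1/tau_P)
S_D = summed_contributions(tau_D, 0.3)   # stand-in for f_thetaL(lam_in) in S_Dn
print(S_P, lam_out / (lam_in + 1.0 / tau_P))
print(S_D, lam_out / (lam_in + 1.0 / tau_D))
```

With these sums in hand, the drift function of equation B.9 follows immediately as $A(w) = \lambda_{in}[\alpha_P(w) S_P - \alpha_D(w) S_D]$.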
In order to calculate the diffusion function, $B(w)$, we require
$$R_{Pm,Pn} \equiv \gamma\!\left[e^{-T_m/\tau_P}\, H(T^{in}-T_m)\, e^{-T_n/\tau_P}\, H(T^{in}-T_n)\right] \qquad (n>m)$$
$$= \frac{\lambda_{out}\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{2}{\tau_P}\right)\right] f_{\theta L}^{\,m-1}\!\left(\lambda_{in}+\frac{2}{\tau_P}\right) f_{\theta L}^{\,n-m}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)}{\lambda_{in}+\frac{2}{\tau_P}}$$
$$R_{Dm,Dn} \equiv \gamma\!\left[e^{-(T^{in}-T_m)/\tau_D}\, H(T^{in}-T_m)\, e^{-(T^{in}-T_n)/\tau_D}\, H(T^{in}-T_n)\right] \qquad (n>m)$$
$$= \frac{\lambda_{out}\left[1 - f_{\theta L}(\lambda_{in})\right] f_{\theta L}^{\,m-1}(\lambda_{in})\, f_{\theta L}^{\,n-m}\!\left(\lambda_{in}+\frac{1}{\tau_D}\right)}{\lambda_{in}+\frac{2}{\tau_D}} \tag{B.10}$$
$$R_{Dm,Pn} \equiv \gamma\!\left[e^{-(T^{in}-T_m)/\tau_D}\, H(T^{in}-T_m)\, e^{-T_n/\tau_P}\, H(T^{in}-T_n)\right] \qquad (n>m)$$
$$= \frac{\lambda_{out}\lambda_{in}\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)\right] f_{\theta L}^{\,m-1}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right) f_{\theta L}^{\,n-m}\!\left(\lambda_{in}+\frac{1}{\tau_P}+\frac{1}{\tau_D}\right)}{\left(\lambda_{in}+\frac{1}{\tau_P}\right)\left(\lambda_{in}+\frac{1}{\tau_D}\right)}$$
$$R_{Pm,Dn} \equiv \gamma\!\left[e^{-T_m/\tau_P}\, H(T^{in}-T_m)\, e^{-(T^{in}-T_n)/\tau_D}\, H(T^{in}-T_n)\right] \qquad (n>m)$$
$$= \frac{\lambda_{out}\lambda_{in}\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)\right] f_{\theta L}^{\,m-1}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right) f_{\theta L}^{\,n-m}(\lambda_{in})}{\left(\lambda_{in}+\frac{1}{\tau_P}\right)\left(\lambda_{in}+\frac{1}{\tau_D}\right)}$$
Summing over all output spikes gives
$$R_1(\tau_P) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Pm,Pn} = \frac{\lambda_{out}\, f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)}{\left(\lambda_{in}+\frac{2}{\tau_P}\right)\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}\right)\right]}$$
$$R_2(\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Dm,Dn} = \frac{\lambda_{out}\, f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_D}\right)}{\left(\lambda_{in}+\frac{2}{\tau_D}\right)\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_D}\right)\right]}$$
$$R_3(\tau_P,\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Dm,Pn} = \frac{\lambda_{out}\lambda_{in}\, f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}+\frac{1}{\tau_D}\right)}{\left(\lambda_{in}+\frac{1}{\tau_P}\right)\left(\lambda_{in}+\frac{1}{\tau_D}\right)\left[1 - f_{\theta L}\!\left(\lambda_{in}+\frac{1}{\tau_P}+\frac{1}{\tau_D}\right)\right]} \tag{B.11}$$
$$R_4(\tau_P,\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Pm,Dn} = \frac{\lambda_{out}\lambda_{in}\, f_{\theta L}(\lambda_{in})}{\left(\lambda_{in}+\frac{1}{\tau_P}\right)\left(\lambda_{in}+\frac{1}{\tau_D}\right)\left[1 - f_{\theta L}(\lambda_{in})\right]}$$
$$R_5(\tau_P,\tau_D) = \sum_{n=1}^{\infty} R_{Pn,Dn} = \frac{\lambda_{out}\lambda_{in}}{\left(\lambda_{in}+\frac{1}{\tau_P}\right)\left(\lambda_{in}+\frac{1}{\tau_D}\right)}.$$
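The double geometric sums leading to equation B.11 can be verified the same way, here for $R_1$; the values $f_a$ and $f_b$ are arbitrary stand-ins for the two Laplace-transform factors appearing in $R_{Pm,Pn}$ of equation B.10:

```python
# Truncated double sum of R_{Pm,Pn} over m >= 1, n > m, compared with the
# closed form of R1 in equation B.11.  f_a and f_b stand in for
# f_thetaL(lam_in + 2/tau_P) and f_thetaL(lam_in + 1/tau_P), respectively.
lam_in, lam_out, tau_P = 5.0, 3.0, 0.020
f_a, f_b = 0.30, 0.55
a = lam_in + 2.0 / tau_P

R1 = sum(lam_out * (1.0 - f_a) * f_a ** (m - 1) * f_b ** (n - m) / a
         for m in range(1, 300) for n in range(m + 1, m + 300))

R1_closed = lam_out * f_b / (a * (1.0 - f_b))
print(R1, R1_closed)
```

The sum over $m$ cancels the $[1 - f_a]$ prefactor, and the sum over $n - m$ produces the $f_b/(1 - f_b)$ factor, which is why only one of the two transform values survives in the closed form.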
The diffusion function, $B(w)$, defined in equation 2.5, is given in terms of the above expressions as
$$B(w) = \lambda_{in}\alpha_P^2(w)\left[\sum_{n=1}^{\infty} S_{Pn}\!\left(\frac{\tau_P}{2}\right) + 2R_1(\tau_P)\right] - 2\lambda_{in}\alpha_P(w)\alpha_D(w)\left[R_5(\tau_P,\tau_D) + R_3(\tau_P,\tau_D) + R_4(\tau_P,\tau_D)\right] + \lambda_{in}\alpha_D^2(w)\left[\sum_{n=1}^{\infty} S_{Dn}\!\left(\frac{\tau_D}{2}\right) + 2R_2(\tau_D)\right]. \tag{B.12}$$
The diffusion function, $B(w)$, is thus given by
$$B(w) = \lambda_{out}\lambda_X\Biggl\{\frac{\alpha_P^2(w)}{\lambda_X+\frac{2}{\tau_P}}\left[1 + \frac{2 f_{\theta L}\!\left(\lambda_X+\frac{1}{\tau_P}\right)}{1 - f_{\theta L}\!\left(\lambda_X+\frac{1}{\tau_P}\right)}\right] - \frac{2\alpha_P(w)\alpha_D(w)\,\lambda_X}{\left(\lambda_X+\frac{1}{\tau_P}\right)\left(\lambda_X+\frac{1}{\tau_D}\right)}\left[1 + \frac{f_{\theta L}(\lambda_X)}{1 - f_{\theta L}(\lambda_X)} + \frac{f_{\theta L}\!\left(\lambda_X+\frac{1}{\tau_P}+\frac{1}{\tau_D}\right)}{1 - f_{\theta L}\!\left(\lambda_X+\frac{1}{\tau_P}+\frac{1}{\tau_D}\right)}\right] + \frac{\alpha_D^2(w)}{\lambda_X+\frac{2}{\tau_D}}\left[1 + \frac{2 f_{\theta L}\!\left(\lambda_X+\frac{1}{\tau_D}\right)}{1 - f_{\theta L}\!\left(\lambda_X+\frac{1}{\tau_D}\right)}\right]\Biggr\}, \tag{B.13}$$
where $f_{\theta L}(s)$ is the Laplace transform of the first passage time density, whose explicit expression is given in equations 2.20 and 2.21.

B.2.1 Model II: Arbitrary N. The above calculation of the drift function, $A(w)$, and the diffusion function, $B(w)$, is generalized for EPSP-AP interactions that have a time extent restricted to the $N$ nearest synaptic inputs to each output AP. This is done by generalizing equation B.4:
$$\gamma = \int_0^\infty dT_1^{in}\, p_{in}(T_1^{in}) \int_{T_1^{in}}^\infty dT_2^{in}\, p_{in}(T_2^{in}-T_1^{in}) \cdots \int_{T_{N-1}^{in}}^\infty dT_N^{in}\, p_{in}(T_N^{in}-T_{N-1}^{in}) \int_0^\infty dT_1^{out}\, p_1(T_1^{out}) \int_{T_1^{out}}^\infty dT_2^{out}\, f_\theta(T_2^{out}-T_1^{out}) \cdots \int_{T_{n-1}^{out}}^\infty dT_n^{out}\, f_\theta(T_n^{out}-T_{n-1}^{out}) \cdots, \tag{B.14}$$
where the previous input is taken as occurring at time $t = 0$ and the $k$th subsequent output AP and input EPSP times are explicitly denoted by $T_k^{out}$ and $T_k^{in}$, respectively. In addition, each of the Heaviside step functions in equation B.6 is replaced by $H(T_N^{in} - T_n^{out})$, which gives the required restriction of the EPSP-AP interactions to the $N$ nearest inputs (note also that each interaction is counted exactly once with these definitions).

B.2.2 Model II: N = 1/2. The calculation of the drift function, $A(w)$, and the diffusion function, $B(w)$, proceeds in a similar fashion to that above. The contributions are calculated using equation B.4 as before, but with
$$A(w) \equiv \lambda_{in}\,\gamma\!\left[\alpha_P(w)\sum_{n=1}^{\infty} e^{-T_n/\tau_P}\, H(T^{in}-T_n)\, H(T_n - T^{in}/2) - \alpha_D(w)\sum_{n=1}^{\infty} e^{-(T^{in}-T_n)/\tau_D}\, H(T^{in}/2 - T_n)\right] \tag{B.15}$$
$$B(w) \equiv \lambda_{in}\,\gamma\!\left[\alpha_P(w)\sum_{n=1}^{\infty} e^{-T_n/\tau_P}\, H(T^{in}-T_n)\, H(T_n - T^{in}/2) - \alpha_D(w)\sum_{n=1}^{\infty} e^{-(T^{in}-T_n)/\tau_D}\, H(T^{in}/2 - T_n)\right]^2.$$
The contributions to the potentiation and depression due to the $n$th spike are $\tilde S_{P,n}$ and $\tilde S_{D,n}$, respectively:
$$\tilde S_{Pn} \equiv \gamma\!\left[e^{-T_n/\tau_P}\, H(T^{in}-T_n)\, H(T_n - T^{in}/2)\right] = \frac{\lambda_{out}\left[1 - f_{\theta L}\!\left(2\lambda_{in}+\frac{1}{\tau_P}\right)\right] f_{\theta L}^{\,n-1}\!\left(2\lambda_{in}+\frac{1}{\tau_P}\right)}{2\lambda_{in}+\frac{1}{\tau_P}} \tag{B.16}$$
$$\tilde S_{Dn} \equiv \gamma\!\left[e^{-(T^{in}-T_n)/\tau_D}\, H(T^{in}/2 - T_n)\right] = \frac{\lambda_{out}\left[1 - f_{\theta L}\!\left(2\lambda_{in}+\frac{1}{\tau_D}\right)\right] f_{\theta L}^{\,n-1}\!\left(2\lambda_{in}+\frac{1}{\tau_D}\right)}{2\lambda_{in}+\frac{1}{\tau_D}}.$$
Summing over all input spikes gives equation 3.9.
The calculation of the diffusion function, $B(w)$, proceeds in much the same way as for model II. The cross-terms (required for $B_1$) are more straightforward to calculate because the integrals for the potentiation and depression terms are independent in this model. The resulting expression is given by
$$B(w) = \lambda_{out}\lambda_X\Biggl\{\frac{\alpha_P^2(w)}{2\lambda_X+\frac{2}{\tau_P}}\left[1 + \frac{2 f_{\theta L}\!\left(2\lambda_X+\frac{1}{\tau_P}\right)}{1 - f_{\theta L}\!\left(2\lambda_X+\frac{1}{\tau_P}\right)}\right] - \frac{2\alpha_P(w)\alpha_D(w)\,\lambda_{out}}{\left(2\lambda_X+\frac{1}{\tau_P}\right)\left(2\lambda_X+\frac{1}{\tau_D}\right)} + \frac{\alpha_D^2(w)}{2\lambda_X+\frac{2}{\tau_D}}\left[1 + \frac{2 f_{\theta L}\!\left(2\lambda_X+\frac{1}{\tau_D}\right)}{1 - f_{\theta L}\!\left(2\lambda_X+\frac{1}{\tau_D}\right)}\right]\Biggr\}. \tag{B.17}$$
B.3 Model III. The calculation of the drift function, $A(w)$, and the diffusion function, $B(w)$, proceeds in a fashion analogous to that in section B.2, except that for this model, the ISIs of the output APs define the independent events. The contributions are calculated by defining the following integral over the distribution of input and output times,
$$\gamma[\,\cdot\,] \equiv \int_0^\infty dT^{out}\, f_\theta(T^{out}) \int_0^\infty dT_1\, p_{in}(T_1) \int_{T_1}^\infty dT_2\, p_{in}(T_2 - T_1) \cdots \int_{T_{n-1}}^\infty dT_n\, p_{in}(T_n - T_{n-1}) \cdots, \tag{B.18}$$
where the previous output AP is taken as occurring at time $t = 0$ and $T^{out}$ is the output time of the subsequent AP, which has a time distribution given by the first passage time density $f_\theta(t)$, equation 2.20. $T_k$ is the time of the $k$th input EPSP following the output AP at time $t = 0$, and each $T_k$ has a Poisson distribution $p_{in}(t)$, given by equation B.2. Note that, analogous to equation B.4, the above integral does not enforce the EPSPs to be within an output ISI, but that this bound on the input times will be imposed in the definition of $A(w)$ and $B(w)$ below. The drift function, $A(w)$, and diffusion function, $B(w)$, are given by
$$A(w) \equiv \lambda_{out}\,\gamma\!\left[\alpha_P(w)\sum_{n=1}^{\infty} e^{-(T^{out}-T_n)/\tau_P}\, H(T^{out}-T_n) - \alpha_D(w)\sum_{n=1}^{\infty} e^{-T_n/\tau_D}\, H(T^{out}-T_n)\right] \tag{B.19}$$
$$B(w) \equiv \lambda_{out}\,\gamma\!\left[\alpha_P(w)\sum_{n=1}^{\infty} e^{-(T^{out}-T_n)/\tau_P}\, H(T^{out}-T_n) - \alpha_D(w)\sum_{n=1}^{\infty} e^{-T_n/\tau_D}\, H(T^{out}-T_n)\right]^2.$$
Each of these integrals represents a single ISI of the output APs, and the factor $\lambda_{out}$ is the rate of the output ISIs. The contributions are calculated using equation B.18: $S_{P,n}$ and $S_{D,n}$ are the contributions to the potentiation and depression, respectively, due to the $n$th input EPSP,
$$S_{Pn}(\tau_P) \equiv \gamma\!\left[e^{-(T^{out}-T_n)/\tau_P}\, H(T^{out}-T_n)\right] = \int_0^\infty dT_{out}\, f_\theta(T_{out})\, e^{-T_{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,n}\!\left(s - \frac{1}{\tau_P}\right)}{s}\right\}\!(T_{out}) \tag{B.20}$$
$$S_{Dn}(\tau_D) \equiv \gamma\!\left[e^{-T_n/\tau_D}\, H(T^{out}-T_n)\right] = \int_0^\infty dT_{out}\, f_\theta(T_{out})\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,n}\!\left(s + \frac{1}{\tau_D}\right)}{s}\right\}\!(T_{out}),$$
where $\mathcal{L}^{-1}\{f(s)\}$ is the inverse Laplace transform of $f(s)$. Summing over all input spikes gives
$$S_P(\tau_P) = \sum_{n=1}^{\infty} S_{Pn}(\tau_P) = \lambda_{in}\tau_P\left[1 - f_{\theta L}\!\left(\frac{1}{\tau_P}\right)\right], \qquad S_D(\tau_D) = \sum_{n=1}^{\infty} S_{Dn}(\tau_D) = \lambda_{in}\tau_D\left[1 - f_{\theta L}\!\left(\frac{1}{\tau_D}\right)\right]. \tag{B.21}$$
This gives the drift function, $A(w)$, in equation 3.11. In order to calculate the diffusion function, $B(w)$, we require
$$R_{Pm,Pn} \equiv \gamma\!\left[e^{-(T^{out}-T_m)/\tau_P}\, H(T^{out}-T_m)\, e^{-(T^{out}-T_n)/\tau_P}\, H(T^{out}-T_n)\right] \qquad (n>m)$$
$$= \int_0^\infty dT_{out}\, f_\theta(T_{out})\, e^{-2T_{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}\!\left(s-\frac{2}{\tau_P}\right) p_{in,L}^{\,n-m}\!\left(s-\frac{1}{\tau_P}\right)}{s}\right\}\!(T_{out})$$
$$R_{Dm,Dn} \equiv \gamma\!\left[e^{-T_m/\tau_D}\, H(T^{out}-T_m)\, e^{-T_n/\tau_D}\, H(T^{out}-T_n)\right] \qquad (n>m)$$
$$= \int_0^\infty dT_{out}\, f_\theta(T_{out})\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}\!\left(s+\frac{2}{\tau_D}\right) p_{in,L}^{\,n-m}\!\left(s+\frac{1}{\tau_D}\right)}{s}\right\}\!(T_{out})$$
$$R_{Dm,Pn} \equiv \gamma\!\left[e^{-T_m/\tau_D}\, H(T^{out}-T_m)\, e^{-(T^{out}-T_n)/\tau_P}\, H(T^{out}-T_n)\right] \qquad (n>m)$$
$$= \int_0^\infty dT_{out}\, f_\theta(T_{out})\, e^{-T_{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}\!\left(s+\frac{1}{\tau_D}-\frac{1}{\tau_P}\right) p_{in,L}^{\,n-m}\!\left(s-\frac{1}{\tau_P}\right)}{s}\right\}\!(T_{out})$$
$$R_{Pm,Dn} \equiv \gamma\!\left[e^{-(T^{out}-T_m)/\tau_P}\, H(T^{out}-T_m)\, e^{-T_n/\tau_D}\, H(T^{out}-T_n)\right] \qquad (n>m)$$
$$= \int_0^\infty dT_{out}\, f_\theta(T_{out})\, e^{-T_{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}\!\left(s-\frac{1}{\tau_P}+\frac{1}{\tau_D}\right) p_{in,L}^{\,n-m}\!\left(s+\frac{1}{\tau_D}\right)}{s}\right\}\!(T_{out}). \tag{B.22}$$
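Equation B.21 can be checked by simulation. The exponential output-ISI density used below is an illustrative assumption (so that $f_{\theta L}(s) = \mu/(\mu+s)$); the inputs are Poisson, as in the model:

```python
import numpy as np

# Monte Carlo check of S_P in equation B.21 for model III, assuming for
# illustration an exponential output-ISI density f_theta(t) = mu*exp(-mu*t).
rng = np.random.default_rng(0)
lam_in, mu, tau_P = 20.0, 5.0, 0.020

total, trials = 0.0, 200_000
for _ in range(trials):
    T_out = rng.exponential(1.0 / mu)             # one output ISI
    n = rng.poisson(lam_in * T_out)               # number of EPSPs in the ISI
    t = rng.uniform(0.0, T_out, n)                # their (unordered) arrival times
    total += np.exp(-(T_out - t) / tau_P).sum()   # potentiation contributions

S_P_mc = total / trials
S_P_formula = lam_in * tau_P * (1.0 - mu / (mu + 1.0 / tau_P))
print(S_P_mc, S_P_formula)
```

Averaging the summed exponential weights over many simulated output ISIs reproduces the closed form to within sampling error.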
Summing over all input spikes gives
$$R_1(\tau_P) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Pm,Pn}, \qquad R_2(\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Dm,Dn}, \qquad R_3(\tau_P,\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Dm,Pn},$$
$$R_4(\tau_P,\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Pm,Dn}, \qquad R_5(\tau_P,\tau_D) = \sum_{n=1}^{\infty} R_{Pn,Dn}, \tag{B.23}$$
with the closed forms obtained by carrying out the geometric sums and the inverse Laplace transforms in equation B.22; the resulting expressions involve $f_{\theta L}$ evaluated at combinations of $\lambda_{in}$, $\frac{1}{\tau_P}$, $\frac{2}{\tau_P}$, $\frac{1}{\tau_D}$, and $\frac{2}{\tau_D}$,
where the $R$'s are defined as in equation B.11. The diffusion function, $B(w)$, is given in terms of the above expressions by equation B.12 for model III:
$$B(w) = \lambda_{out}\Biggl[\alpha_P^2(w)\left(\sum_{n=1}^{\infty} S_{Pn}\!\left(\frac{\tau_P}{2}\right) + 2R_1(\tau_P)\right) - 2\alpha_P(w)\alpha_D(w)\bigl(R_5(\tau_P,\tau_D) + R_3(\tau_P,\tau_D) + R_4(\tau_P,\tau_D)\bigr) + \alpha_D^2(w)\left(\sum_{n=1}^{\infty} S_{Dn}\!\left(\frac{\tau_D}{2}\right) + 2R_2(\tau_D)\right)\Biggr], \qquad X = \mathrm{B,F}, \tag{B.24}$$
with the input rate evaluated at $\lambda_X$.
B.3.1 Model III: Arbitrary M. The above calculation of the drift function, $A(w)$, is generalized for EPSP-AP interactions that have a time extent restricted to the $M$ nearest AP outputs to each input EPSP. This is done by generalizing equation B.18:
$$\gamma = \int_0^\infty dT_1^{out}\, f_\theta(T_1^{out}) \int_{T_1^{out}}^\infty dT_2^{out}\, f_\theta(T_2^{out}-T_1^{out}) \cdots \int_{T_{M-1}^{out}}^\infty dT_M^{out}\, f_\theta(T_M^{out}-T_{M-1}^{out}) \int_0^\infty dT_1^{in}\, p_{in}(T_1^{in}) \int_{T_1^{in}}^\infty dT_2^{in}\, p_{in}(T_2^{in}-T_1^{in}) \cdots \int_{T_{k-1}^{in}}^\infty dT_k^{in}\, p_{in}(T_k^{in}-T_{k-1}^{in}) \cdots, \tag{B.25}$$
where the previous output AP is taken as occurring at time $t = 0$ and the $k$th subsequent output AP and input EPSP times are explicitly denoted by $T_k^{out}$ and $T_k^{in}$, respectively. In addition, each of the Heaviside step functions in equation B.19 is replaced by $H(T_M^{out} - T_n^{in})$, which gives the required
restriction of the EPSP-AP interactions to the $M$ nearest outputs (again note that each interaction is counted exactly once with these definitions).

B.4 Model IV. The calculation of the drift function, $A(w)$, and the diffusion function, $B(w)$, proceeds in a fashion analogous to that in the previous sections. Each EPSP interacts with at most two APs, and each AP interacts with at most two EPSPs. We choose here to consider one ISI of the output APs and define
$$\gamma[\,\cdot\,] \equiv \int_0^\infty dT_{out}\, f_\theta(T_{out}) \int_0^{T_{out}} dt_1\, p_{in}(t_1) \int_0^{T_{out}} dt_2\, \Bigl\{\delta(t_1 - t_2)\, e^{-\lambda_{in}(T_{out}-t_2)} + p_{in}(T_{out}-t_2)\, H(t_2 - t_1)\Bigr\}, \tag{B.26}$$
where $t_1$ is the first EPSP following the output AP at time $t = 0$, and $t_2$ is the last input EPSP before the subsequent AP at time $T_{out}$. The first term in the brackets represents the case where there is exactly one EPSP within the ISI of the output APs, and the second term represents the case where there are two or more EPSPs within the ISI. The drift function, $A(w)$, and diffusion function, $B(w)$, are given by
$$A(w) \equiv \lambda_{out}\,\gamma\!\left[\alpha_P(w)\, e^{-(T^{out}-t_2)/\tau_P} - \alpha_D(w)\, e^{-t_1/\tau_D}\right] \tag{B.27}$$
$$B(w) \equiv \lambda_{out}\,\gamma\!\left[\alpha_P(w)\, e^{-(T^{out}-t_2)/\tau_P} - \alpha_D(w)\, e^{-t_1/\tau_D}\right]^2.$$
These integrals are straightforward to evaluate, giving the drift function, equation 3.18, and the diffusion function, $B(w)$:
$$B(w) = \lambda_{out}\lambda_X \Biggl[ \alpha_P^2(w)\, \frac{1 - f_{\theta L}\!\left(\lambda_X + \frac{2}{\tau_P}\right)}{\lambda_X + \frac{2}{\tau_P}} - \frac{2\,\alpha_P(w)\,\alpha_D(w)}{\lambda_X + \frac{1}{\tau_P}} \Biggl\{ \frac{\lambda_X \left[1 - f_{\theta L}\!\left(\lambda_X + \frac{1}{\tau_D}\right)\right]}{\lambda_X + \frac{1}{\tau_D}} - \frac{f_{\theta L}\!\left(\lambda_X + \frac{1}{\tau_P}\right) - f_{\theta L}\!\left(\lambda_X + \frac{1}{\tau_D}\right)}{1 - \frac{\tau_P}{\tau_D}} \Biggr\} + \alpha_D^2(w)\, \frac{1 - f_{\theta L}\!\left(\lambda_X + \frac{2}{\tau_D}\right)}{\lambda_X + \frac{2}{\tau_D}} \Biggr], \qquad X = \mathrm{B,F}. \tag{B.28}$$
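The first term of equation B.28 can likewise be checked by simulation: conditioned on at least one EPSP in the output ISI, $t_2$ is the last EPSP before the output AP. The exponential output-ISI density is again an illustrative assumption:

```python
import numpy as np

# Monte Carlo check of the alpha_P^2 term of equation B.28:
# gamma[exp(-2(T_out - t2)/tau_P)] = lam_X*(1 - f_thetaL(lam_X + 2/tau_P))
#                                    / (lam_X + 2/tau_P),
# assuming for illustration f_theta(t) = mu*exp(-mu*t), f_thetaL(s) = mu/(mu+s).
rng = np.random.default_rng(1)
lam_X, mu, tau_P = 20.0, 5.0, 0.020

total, trials = 0.0, 200_000
for _ in range(trials):
    T_out = rng.exponential(1.0 / mu)
    n = rng.poisson(lam_X * T_out)
    if n > 0:
        t2 = rng.uniform(0.0, T_out, n).max()     # last EPSP before the output AP
        total += np.exp(-2.0 * (T_out - t2) / tau_P)

mc = total / trials
formula = lam_X * (1.0 - mu / (mu + lam_X + 2.0 / tau_P)) / (lam_X + 2.0 / tau_P)
print(mc, formula)
```

For a Poisson input, the gap between the last EPSP and the end of the ISI is exponentially distributed (with an atom at "no EPSP"), which is exactly the combination of the two bracketed terms in equation B.26.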
The expression for the drift function for the N input-, M output-restricted model IV with interspike suppressive interactions, section 4.1, calculated
using the methods described in sections B.2.1 and B.3.1, is given by
$$A(w_i) = \lambda_{out}\lambda_i\Biggl[\alpha_P(w_i)\,\tau_P\,\epsilon_L^{pre}(\lambda_i)\left(\frac{1}{\lambda_i} - \frac{1}{\lambda_i+\frac{1}{\tau_P}}\right)\frac{(-1)^{N-1}\lambda_i^{N+1}}{(N-1)!}\,\frac{d^{N-1}}{d\lambda_i^{N-1}}\left[\frac{g_{\theta L}(0) - g_{\theta L}\!\left(\lambda_i+\frac{1}{\tau_P}\right)}{\lambda_i+\frac{1}{\tau_P}}\right] - \alpha_D(w_i)\,\tau_D\,g_{\theta L}(0)\Biggl\{\lambda_i\,\epsilon_L^{pre}(\lambda_i)\left[1 - f_{\theta L}^{\,M}\!\left(\frac{1}{\tau_D}\right)\right] - \lambda_i^{N}\int_0^\infty dt\,\mathcal{L}^{-1}\!\left\{f_{\theta L}^{\,M-1}(s)\left(\frac{1}{s} - \frac{1}{s+\frac{1}{\tau_D}}\right)\frac{\epsilon_L^{pre}\!\left(s+\lambda_i+\frac{1}{\tau_D}\right)}{\left(s+\lambda_i+\frac{1}{\tau_D}\right)^{N-1}}\right\}\Biggr\}\Biggr], \tag{B.29}$$
where $\epsilon_L^{pre}(s)$ and $g_{\theta L}(s)$ are defined following equation 4.5. All the other drift functions calculated above constitute special cases of this general expression.

Acknowledgments

This work was funded by the Australian Research Council (ARC Discovery Project #DP0211972) and the Bionic Ear Institute. We acknowledge W. Gerstner's suggestion for the inclusion of ADSS into the STDP framework (outlined in section 4.2, equations 4.8 and 4.9) and thank him for detailed, helpful comments. We also thank G. Q. Bi for drawing Froemke and Dan (2002) to our attention and suggesting that we analyze it within the framework presented in this article, and an anonymous reviewer for useful comments on the manuscript.

References

Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cerebral Cortex, 6, 406–416.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1179–1183.
Abraham, W. C., & Bear, M. F. (1996). Metaplasticity: The plasticity of synaptic plasticity. Trends Neurosci., 19, 126–130.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Comput., 15, 565–596.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate—Spikes, rates and neuronal gain. Network: Comput. Neur. Sys., 2, 259–273.
Bi, G-Q., & Poo, M-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G-Q., & Poo, M-M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bliss, T. V., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 331–356.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biol. Cybern., 85, 247–255.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Burkitt, A. N., & van Hemmen, J. L. (in press). How synapses in the auditory system wax and wane: Theoretical perspectives. Biol. Cybern.
Câteau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Comput., 15, 597–620.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). BDNF regulates the intrinsic excitability of cortical neurons. Learning and Memory, 6, 284–291.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Fusi, S. (2003). Spike-driven synaptic plasticity for learning correlated patterns of mean rates. Rev. Neurosci., 14, 73–84.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., & van Hemmen, J. L. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. J. Neurosci., 23, 3697–3714.
Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximations for neuronal activity including synaptic reversal potentials. J. Theoret. Neurobiol., 2, 127–153.
Holmgren, C. D., & Zilberter, Y. (2001). Coincident spiking activity induces long-term changes in inhibition of neocortical pyramidal cells. J. Neurosci., 21, 8270–8277.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. USA, 79, 2554–2558.
Izhikevich, E., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 13, 2709–2741.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with timing of pre- and postsynaptic action potentials. Neural Comput., 12, 385–405.
Leibold, C., Kempter, R., & van Hemmen, J. L. (2002). How spiking neurons give rise to a temporal-feature map: From synaptic plasticity to axonal selection. Phys. Rev. E, 65, 051915.
Leslie, K. R., Nelson, S. B., & Turrigiano, G. G. (2001). Postsynaptic depolarization scales quantal amplitude in cortical pyramidal neurons. J. Neurosci., 21, RC170(1–6).
Lissin, D. V., Gomperts, S. N., Carroll, R. C., Christine, C. W., Kalman, D., Kitamura, M., Hardy, S., Nicoll, R. A., Malenka, R. C., & von Zastrow, M. (1998). Activity differentially regulates the surface expression of synaptic AMPA and NMDA glutamate receptors. Proc. Natl. Acad. Sci., 95, 7097–7102.
Lomo, T. (1971). Potentiation of monosynaptic EPSPs in the perforant path-dentate granule cell synapse. Exp. Brain Res., 12, 46–63.
Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—A decade of progress? Science, 285, 1870–1874.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Miller, K. D. (1996). Synaptic economics: Competition and cooperation in synaptic plasticity. Neuron, 17, 371–374.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comput., 6, 100–126.
O'Brien, R. J., Kamboj, S., Ehlers, M. D., Rosen, K. R., Fischbach, G. D., & Huganir, R. L. (1998). Activity-dependent modulation of synaptic AMPA receptor accumulation. Neuron, 21, 1067–1078.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comput., 13, 2221–2237.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Risken, H. (1996). The Fokker-Planck equation (3rd ed.). Berlin: Springer-Verlag.
Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7, 235–246.
Roberts, P. D., & Bell, C. C. (2002). Spike-timing dependent synaptic plasticity: Mechanisms and implications. Biol. Cybern., 87, 392–403.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Rutherford, L. C., Nelson, S. B., & Turrigiano, G. G. (1998). BDNF has opposite effects on the quantal amplitude of pyramidal neuron and interneuron excitatory synapses. Neuron, 21, 521–530.
Sejnowski, T. J. (1977). Statistical constraints on synaptic plasticity. J. Theor. Biol., 69, 385–389.
Sejnowski, T. J. (1999). The book of Hebb. Neuron, 24, 773–776.
Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and post-synaptic spike timing. Neural Comput., 13, 35–68.
Shouval, H. Z., Castellani, G. C., Blais, B. S., Yeung, L. C., & Cooper, L. N. (2002). Converging evidence for a simplified biophysical model of synaptic plasticity. Biol. Cybern., 87, 383–391.
Siegert, A. J. F. (1951). On the first passage time probability problem. Phys. Rev., 81, 617–623.
Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike-timing dependent plasticity. Neuron, 32, 339–350.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Comput., 13, 2005–2029.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72.
Svirskis, G., & Rinzel, J. (2000). Influence of temporal correlation of synaptic input on the rate and variability of firing in neurons. Biophys. J., 79, 629–637.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tuckwell, H. C. (1979). Synaptic transmission in a model for stochastic neural activity. J. Theor. Biol., 77, 65–81.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2, Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Hemmen, J. L. (2001). Theory of synaptic plasticity. In F. Moss & S. Gielen (Eds.), Handbook of biological physics, Vol. 4: Neuro-informatics and neural modelling (pp. 771–823). Amsterdam: Elsevier.
van Kampen, N. G. (1992). Stochastic processes in physics and chemistry. Amsterdam: North-Holland.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Zorumski, C. F., & Thio, L. L. (1992). Properties of vertebrate glutamate receptors: Calcium mobilization and desensitization. Prog. Neurobiol., 39, 295–336.
Zucker, R. S. (1999). Calcium- and activity-dependent synaptic plasticity. Curr. Opin. Neurobiol., 9, 305–313.

Received January 16, 2003; accepted September 29, 2003.
LETTER
Communicated by Jonathan Victor
Estimating the Temporal Interval Entropy of Neuronal Discharge George N. Reeke [email protected]
Allan D. Coop [email protected] Laboratory of Biological Modelling, Rockefeller University, New York, NY 10021, U.S.A.
To better understand the role of timing in the function of the nervous system, we have developed a methodology that allows the entropy of neuronal discharge activity to be estimated from a spike train record when it may be assumed that successive interspike intervals are temporally uncorrelated. The so-called interval entropy obtained by this methodology is based on an implicit enumeration of all possible spike trains that are statistically indistinguishable from a given spike train. The interval entropy is calculated from an analytic distribution whose parameters are obtained by maximum likelihood estimation from the interval probability distribution associated with a given spike train. We show that this approach reveals features of neuronal discharge not seen with two alternative methods of entropy estimation. The methodology allows for validation of the obtained data models by calculation of confidence intervals for the parameters of the analytic distribution and the testing of the significance of the fit between the observed and analytic interval distributions by means of Kolmogorov-Smirnov and Anderson-Darling statistics. The method is demonstrated by analysis of two different data sets: simulated spike trains evoked by either Poissonian or near-synchronous pulsed activation of a model cerebellar Purkinje neuron and spike trains obtained by extracellular recording from spontaneously discharging cultured rat hippocampal neurons.

1 Introduction

Spike trains generated by the discharge activity of neurons are composed of sequences of action potentials (spikes or impulses) that are generally accepted as providing the predominant mode of communication between neurons within the central nervous system (Perkel, 1970). Classically, the rate-coded properties of a spike train have been considered most important, for example, when sensory stimulus intensity is taken to be represented by the mean spike rate in a particular fiber or group of fibers (Adrian, 1928).
Neural Computation 16, 941–970 (2004)
© 2004 Massachusetts Institute of Technology
However, a significant feature of neuronal discharge activity, particularly within the cortex, is its high variability (Buracas, Zador, DeWeese, & Albright, 1998; Noda & Adey, 1970; Shadlen & Newsome, 1998; Softky & Koch, 1993; Tolhurst, Movshon, & Dean, 1983; Whitsel, Schreiner, & Essick, 1977). Although less widely accepted, the presence of discharge variability suggests that the timing of neuronal impulses, and not only their rate, may be important (Segundo, 2000), and numerous alternatives to rate coding have been proposed (Buzsáki, Llinás, Singer, Berthoz, & Christen, 1994; Eggermont, 2001; Gilbert, 2001; Meister & Berry, 1999; Perkel & Bullock, 1968). In many cases, analysis is based on the temporal dispersion or pattern of neuronal discharge within a spike train, measures of which include the standard deviation (SD), coefficient of variation (CV = SD/mean), serial correlation of intervals, features of burst discharge such as the spike number and burst duration (e.g., Lisman, 1997), or metric-space techniques (Victor & Purpura, 1996). In contrast with the classical rate-based approach, the characteristic feature of such temporal coding is the importance of the exact time at which neuronal discharge occurs. A theoretically grounded measure of the weighted average probability of individual alternative events, the entropy, originated in statistical mechanics and is fundamental to communication theory (Segundo, 1970; Shannon & Weaver, 1949). This measure has subsequently been adapted (see, e.g., Brillouin, 1962; MacKay & McCulloch, 1952; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997) to provide the necessary foundation for estimation of the transmission of information within (Manwani & Koch, 2000) and between neurons (e.g.,
Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Borst & Theunissen, 1999; Brecht, Goebel, Singer, & Engel, 2001; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Buracas & Albright, 1999; Buracas et al., 1998; Butts, 2003; deCharms & Zador, 2000; Dimitrov & Miller, 2001; Eckhorn & Pöpel, 1974; Gershon, Wiener, Latham, & Richmond, 1998; Liu, Tzonev, Rebrik, & Miller, 2001; Optican & Richmond, 1987; Reich, Mechler, Purpura, & Victor, 2000; Rieke et al., 1997; Stein, 1967; Szczepanski, Amigó, Wajnryb, & Sanchez-Vives, 2003; Werner & Mountcastle, 1965; Wiener & Richmond, 1999). One feature of the methodologies employed in many of these reports is their emphasis on the analysis of signal transmission following the sensory stimulation of visual pathways. Here, however, we are specifically concerned with estimating the entropy of individual spike trains, regardless of their dependence on any particular sensory stimulus. With the content and mechanisms of neural coding still in dispute (see, e.g., Eggermont, 1998; Friston, 1997), the estimation of entropy becomes more uncertain as neuronal impulse activity propagates from primary sensory areas to multi- and supramodal cortical areas. In part, this is due to difficulties in identifying the set of possible signals from which a particular neuronal response is selected and thus the probability with which a given response occurs. This is compounded by the lack of suitable experimental
preparations capable of providing data sets of the length required to establish good estimates of the event probabilities underlying the calculated entropy.

Here we describe the analytic distribution method (for convenience, the interval method) for the calculation of the temporal interval entropy, or simply the interval entropy, of a neuronal signal. The name interval entropy is proposed because probability estimates are based on hypothetical sets of all possible spike trains having the same distribution of interspike intervals as the spike train in question. The methodology requires no knowledge of stimulus properties and is applicable to individual spike trains. It is based on the interspike interval distribution obtained from a particular spike train and involves calculation of the entropy of an equivalent continuous analytic distribution as a means of accomplishing extrapolation to large data record lengths.

In the following sections, the interval methodology is described and then applied to analyze spike trains obtained from two different sources. Our primary purpose is to demonstrate the methodology rather than to identify the functional relevance of any spike train properties that may be revealed by the analyses. Initially, artificially generated spike trains are analyzed. Results of these analyses are compared with entropy estimates for the same spike trains obtained with either the direct method (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998) or the nonparametric method of Kozachenko and Leonenko (1987). We also apply the interval methodology to spike trains obtained via extracellular single unit recording of spontaneous discharge activity from acutely dissociated, cultured rat hippocampal neurons.

2 Previous Entropy Measures

It has previously been suggested that entropy may provide a sensitive measure for quantification of neuronal discharge variability (Sherry & Klemm, 1981).
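A minimal sketch of the interval method described in section 1, assuming (purely for illustration) that the analytic distribution fitted to the interspike intervals is exponential; the actual method admits more general analytic forms, with parameters obtained by maximum likelihood and the differential entropy converted to a discrete entropy at the chosen temporal resolution (equation 2.2 below):

```python
import numpy as np

# Sketch of the interval method with an exponential analytic distribution
# (an assumption for illustration).  For Exp(lam), the ML estimate is
# lam_hat = 1/mean(ISI), and the differential entropy is log2(e/lam) bits.
rng = np.random.default_rng(2)
isis = rng.exponential(0.050, 10_000)     # synthetic ISIs with a 50 ms mean

lam_hat = 1.0 / isis.mean()               # maximum likelihood fit
H_diff = np.log2(np.e / lam_hat)          # differential entropy, bits
dt = 0.001                                # 1 ms temporal resolution
H_interval = H_diff - np.log2(dt)         # discrete interval entropy, bits
print(lam_hat, H_interval)
```

Fitting a closed-form density and taking its entropy is what makes the extrapolation to long records possible: the estimate does not depend on having enough data to populate every interval bin directly.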
One of the simplest approaches to the analysis of neuronal discharge entropy relies on a rate-based description of the neural code. Here, the fundamental event is assumed to be the action potential (or spike), and the entropy depends on the neuronal discharge rate. When N different spike signals are equally likely, that is, the probability of each signal is 1=N, the entropy, H, is given by the relation H D log2 N (Hartley, 1924). Alternatively, the Shannon formulation derives a measure of the entropy of a signal from estimates of the probabilities that particular signals would be observed if drawn at random from among a nite set of all possible signals that might be emitted under a given set of conditions. It is given in the discrete case by
\[
H = -\sum_{i=1}^{N} p_i \log_2 p_i , \tag{2.1}
\]
944
G. Reeke and A. Coop
where pᵢ is the probability of the ith event, N is the total number of events, and Σᵢ pᵢ = 1. Equation 2.1 naturally generalizes to the case of a continuous range of values to give the so-called differential entropy function,
\[
H(X) = -\int_{S} f(x) \log_2 f(x)\, dx,
\]
where S is the support set of X with a density f(x) > 0. From H(X), the discrete entropy can be calculated by
\[
H_{\Delta}(X) = H(X) - \log_2 \Delta t, \tag{2.2}
\]
where Δt is the temporal bin width (Cover & Thomas, 1991). Several complementary methods are currently available that, via various simplifying assumptions, allow upper or lower bounds of spike train entropy to be estimated (reviewed by, e.g., Borst & Theunissen, 1999). One rate-based measure originates in a derivation of the Shannon entropy function (see equation 2.1) in which a binary representation of a spike train is obtained by dividing the time axis into discrete bins of width Δt. Each bin is coded as 1 if a spike is present in that bin and 0 otherwise (Brillouin, 1962; MacKay & McCulloch, 1952). The observed spike train is taken to have been selected from the set of all possible spike trains with the same number of 1s and 0s. This approach has been developed by various simplifying assumptions to give the mean rate entropy in bits per spike as a function of Δt (from Rieke et al., 1997),
\[
H_R \approx \log_2 \left( \frac{e}{\bar{r}\,\Delta t} \right), \tag{2.3}
\]
where e is the base of natural logarithms, r̄ is the mean spike rate, and it is assumed that the probability of finding a spike in any one bin is ≪ 1. Here, H_R is referred to as the rate entropy and is given as the mean entropy in bits per spike. It provides an upper bound to the entropy for a given activation rate and choice of Δt.

2.1 The Direct Method. As an example of a methodology for measuring entropy that takes into account both spike timing and any correlations that may be present within sequences of intervals, we briefly summarize the entropy estimation component of the so-called direct method of spike train analysis (de Ruyter van Steveninck et al., 1997; Strong et al., 1998). The direct method begins, like the calculation of the rate entropy H_R, with a binary representation of the given spike train in bins of width Δt. Consecutive bins (or "letters") are grouped into "words" of L bins. For a given word length L and resolution Δt, the entropy of the response to a particular stimulus in bits per second is obtained from the probability distribution of the observed
Estimating The Temporal Interval Entropy of Neuronal Discharge
945
words by application of
\[
H_D(L, \Delta t) = -\frac{1}{L\,\Delta t} \sum_{w \in W(L, \Delta t)} p(w) \log_2 p(w), \tag{2.4}
\]
where w is a specific word (spike pattern), W(L, Δt) is the set of all possible words comprising L bins of width Δt, and p(w) is the probability of observing pattern w in the spike train. The true entropy for a given measurement precision can be obtained only when observation times are sufficiently long to obtain accurate estimates of p(w) and after extrapolation of H_D(L, Δt) to infinite L. In one version of this method, which is used for the calculations reported here, H_D is plotted versus 1/L, the point of minimal fractional change in slope is found, and the four values of L up to and including this point are used for the extrapolation (Reinagel, Godwin, Sherman, & Koch, 1999). H_D may then be divided by the mean spike rate to give an estimate of the mean entropy in bits per spike. The advantage of the direct method is that it makes no prior assumptions about the distribution of letters in a word or about the nature of the stimulus. However, as is generally the case for accurate estimation of spike train entropy, application of the method faces three obstacles (Optican, Gawne, Richmond, & Joseph, 1991). First, it is not always clear how to make a principled choice for the temporal resolution at which a spike train should be discretized. This is referred to here as the binning problem. Second, responses to a stimulus often appear to be highly variable (but see Reinagel & Reid, 2002). Third, experimental considerations frequently limit the sample size. The impracticality of collecting the large data sets required to estimate the occurrence probability of word patterns accurately, particularly long ones, is referred to here as the sampling problem.

2.2 The KL Method. A nonparametric method of entropy estimation has been proposed by Kozachenko and Leonenko (1987). Based on the method of Dobrušin (1958), it provides an asymptotically unbiased estimator that gives the entropy of a continuous random vector from a sample of independent observations.
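Before detailing the KL method, the two binned estimators above can be made concrete. The Python sketch below is illustrative only (the paper's analyses were done in MATLAB): it computes the rate entropy of equation 2.3 and the word entropy of equation 2.4 for a single word length L, using overlapping windows and omitting the 1/L extrapolation step.

```python
import math
from collections import Counter

def rate_entropy(spike_times, duration, dt):
    """Upper-bound rate entropy H_R = log2(e / (rbar * dt)) in bits per spike (eq. 2.3)."""
    rbar = len(spike_times) / duration   # mean spike rate
    return math.log2(math.e / (rbar * dt))

def direct_entropy(spike_times, duration, dt, L):
    """Direct-method word entropy H_D(L, dt) in bits per unit time (eq. 2.4),
    from overlapping L-letter words; no extrapolation to infinite L."""
    n_bins = int(duration / dt)
    bits = [0] * n_bins
    for t in spike_times:
        i = int(t / dt)          # floor binning: 1 if a spike falls in the bin
        if i < n_bins:
            bits[i] = 1
    words = Counter(tuple(bits[i:i + L]) for i in range(n_bins - L + 1))
    total = sum(words.values())
    h = -sum((c / total) * math.log2(c / total) for c in words.values())
    return h / (L * dt)

# Illustrative use: a perfectly regular train (one spike every 2 ms, times in ms).
spikes = [k * 2.0 for k in range(500)]
hr = rate_entropy(spikes, 1000.0, 0.5)       # dt = 0.5 ms
hd = direct_entropy(spikes, 1000.0, 0.5, 4)  # L = 4 letters
```

For the regular train the words are just the four cyclic shifts of (1, 0, 0, 0), so H_D is close to 2 bits per word, i.e., 1 bit per ms at L·Δt = 2 ms.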
For convenience, it is referred to here as the KL method, and it provides an entropy estimate referred to as the KL entropy (H_KL). (H_KL should not be confused with the similarly named relative entropy or Kullback-Leibler distance, which may be calculated for two probability mass functions; e.g., Cover & Thomas, 1991.) In brief, let R^m be the m-dimensional Euclidean space with metric
\[
\rho(x_1, x_2) = \left\{ \sum_{j=1}^{m} \left( x_1^{(j)} - x_2^{(j)} \right)^2 \right\}^{1/2},
\]
where xᵢ = (xᵢ⁽¹⁾, …, xᵢ⁽ᵐ⁾) ∈ R^m, i = 1, 2, and m ≥ 1. From the sample X₁, …, X_N, N ≥ 2, compute ρᵢ = min{ρ(Xᵢ, Xⱼ) : j ∈ {1, 2, …, N}\{i}} and let
\[
\bar{\rho}_N = \left\{ \prod_{i=1}^{N} \frac{\rho_i}{\Delta t} \right\}^{1/N}, \tag{2.5}
\]
where Δt is a bin width introduced to make ρ̄_N a dimensionless quantity. The entropy may then be obtained in bits per interval from
\[
H_{KL} = \frac{1}{\ln 2} \left[ m \ln \bar{\rho}_N + \ln c_1(m) + \ln \gamma + \ln(N - 1) \right], \tag{2.6}
\]
where
\[
c_1(m) = \frac{\pi^{m/2}}{\Gamma\!\left(\frac{m}{2} + 1\right)},
\]
m = 1 for all analyses reported here, ln γ = 0.577216 (the Euler constant), and Γ(·) is the gamma function. In some runs (see section 4), the temporal resolution of spike timing was increased to that of the double floating-point number precision of MATLAB by the addition of uniform noise to each interval in a spike train. The noise was constrained such that interval durations were randomly varied within the 10 µs simulation resolution. Under this condition, all intervals contributed to the estimate of H_KL. With the exception of the lower curves in Figures 3A and 3B, all values reported for the KL entropy were obtained as the mean of 10 replicates for each data set, where the random number seed used to generate the noise was different in each analysis.

3 Methods

3.1 The Interval Method. Based on the assumption that temporal precision may typically provide a more important limiting factor on information transmission than maximum impulse rate, an early theoretical study concluded that interval modulation provides a more efficient mode of encoding than purely rate-based modulation (MacKay & McCulloch, 1952). Although subsequent studies of interspike interval distributions obtained from spontaneously discharging neurons in slice preparations of rat cerebellar cortex showed that serial correlation of adjacent intervals may extend over clusters of up to three to five intervals (Klemm & Sherry, 1981; Sherry & Klemm, 1981), for simplicity, the interval method given here assumes independence of successive intervals, as does the KL methodology. The interval method proceeds by assuming that (1) a spike train consists of a sequence of symbols composed of interval-spike pairs, where each temporal interval following the first spike is terminated by the occurrence of a spike, (2) the set of possible signals from which an observed spike train
is drawn is the set of all spike trains with the same distribution of interspike intervals, (3) the distribution of spike intervals in an arbitrarily long spike train can be obtained by maximum likelihood estimation (MLE) of the parameters that match a suitably chosen analytical distribution to an observed sample of that spike train, and (4) the entropy determined by the timing of neuronal discharge can be calculated from the continuous probability distribution for an appropriate choice of resolution, Δt. (See the end of the following section.)

3.1.1 Estimating the Interval Entropy (H_I). The interval methodology avoids the binning problem by obtaining with MLE techniques the parameter values for the underlying continuous distribution from which the intervals of a particular spike train are most likely to have been drawn. The entropy may then be obtained by analytic or numeric techniques from the continuous distribution at any desired temporal resolution. Within the context of spike train analysis, several different approaches have been formulated to address the sampling problem, that is, the estimation of a probability function from a finite sample size, and different correction factors have been proposed (Optican et al., 1991; Panzeri & Treves, 1996; Sanderson & Kobler, 1976; Wolpert & Wolf, 1995). Here, the sampling problem is addressed by employing a best-fit analytic distribution to approximate the limit of an infinitely long data set. This approach has the advantage that the spike train duration required to provide sufficient data can be estimated by noting where the entropy estimates obtained from analytic probability density functions fit to records of increasing duration asymptote to a constant value, with no further correction for undersampling required. The interval method is not limited to a particular analytical density function; any suitable function or combination of functions may be employed to estimate the interval probability density of a given spike train.
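Assumption 3 is the heart of the method. As a simple, closed-form stand-in for the full MLE fit described below (which the authors perform with MATLAB's fminsearch), moment matching illustrates how gamma parameters can be recovered from an observed interval sample; the function name, the sample size, and the shift s fixed at 0 are all illustrative choices, not the paper's procedure.

```python
import random
import statistics

def fit_gamma_moments(intervals):
    """Closed-form moment-matching estimates for a two-parameter gamma
    interval density (shape a, scale tau; shift s fixed at 0) -- a simple
    stand-in for the maximum likelihood fit described in the text."""
    m = statistics.fmean(intervals)
    v = statistics.variance(intervals)
    return m * m / v, v / m   # a (shape), tau (scale)

# Illustrative check: recover known parameters from synthetic gamma intervals.
rng = random.Random(1)
sample = [rng.gammavariate(4.0, 2.0) for _ in range(20000)]
a_hat, tau_hat = fit_gamma_moments(sample)
```

With 20,000 synthetic intervals the estimates land within a few percent of the generating values; full MLE would be preferred for short records and for the shifted or composite densities used in the paper.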
A sum of functions may be particularly useful when a neuron exhibits periodic burst discharge, a condition under which the observed interval probability density may contain multiple peaks. In the analyses reported here, typically one or both of two analytic probability density functions are employed: the generalized gamma distribution or the gaussian distribution. The generalized gamma distribution, in the case of a neuron exhibiting an absolute refractory period, requires a three-parameter form of the probability density function given by
\[
\gamma(x; a, s, \tau) = \frac{(x-s)^{a-1}}{\tau^{a}\,\Gamma(a)}\, e^{-(x-s)/\tau}, \qquad (x \ge s \ge 0;\; a, \tau > 0), \tag{3.1}
\]
where a is the order or shape parameter, and s and τ give the time-axis shift and the time-scaling parameters of the distribution, respectively. This distribution is chosen as it is characteristic of simple models of neural spike production (see, e.g., Usher, Stemmler, & Koch, 1994) and because of its relation to the Poisson distribution when a = 1. Early studies reported excellent fits between neuronal discharge activity and the gamma distribution (Stein, 1965), as have others since (e.g., Tiesinga, José, & Sejnowski, 2000). In any event, this distribution appears to be justified heuristically by its ability to fit distributions that range from exponential to near gaussian as the order parameter, a, is increased. When all three parameters of a gamma distribution are unknown, an efficient method of parameter estimation is that of MLE (Mann, Schafer, & Singpurwalla, 1974). MLE values are estimated from the observed distribution, O, by maximizing the likelihood function
\[
L = \prod_{i} p(x_i),
\]
where in the gamma case p(xᵢ) = γ(xᵢ; a, s, τ) for unimodal distributions and p(xᵢ) = f γ₁(xᵢ; a₁, s₁, τ₁) + (1 − f) γ₂(xᵢ; a₂, s₂, τ₂) for bimodal distributions. In the bimodal case, f is a constant (0 < f < 1) that distributes the analytic probability between the two peaks in O, and a₁ and a₂, s₁ and s₂, and τ₁ and τ₂ are the order, shift, and scaling parameters, respectively, of the two density functions as determined by the MLE procedure. It is often computationally advantageous to maximize the log likelihood instead of the likelihood function. Thus, for the unimodal case,
\[
\ln L = \sum_{i} \ln[\gamma(i)], \tag{3.2}
\]
and for the bimodal case,
\[
\ln L = \sum_{i} \ln[\, f \gamma_1(i) + (1-f)\gamma_2(i) \,], \tag{3.3}
\]
where
\[
\gamma_n(i) = \frac{(x_i - s_n)^{a_n - 1}}{\tau_n^{a_n}\,\Gamma(a_n)}\, e^{-(x_i - s_n)/\tau_n}, \qquad n = 1, 2.
\]
The extension to more than two component distributions is obvious. Once the parameters a, s, and τ are known, the entropy of the fitted gamma distribution can be calculated by summation of the Shannon entropy function (see equation 2.1), where for the unimodal case, the probability that an interspike interval lies in the interval x = [iΔt, (i+1)Δt] is given by
\[
p_i = \left[ Q((i+1)\Delta t;\, a, s, \tau) - Q(i\Delta t;\, a, s, \tau) \right], \tag{3.4}
\]
where i gives the bin number, Δt the bin width, and Q is the incomplete gamma function,
\[
Q(x; a, s, \tau) = \gamma\!\left(\frac{x-s}{\tau},\, a\right) \equiv \frac{1}{\Gamma(a)} \int_0^{(x-s)/\tau} t^{a-1} e^{-t}\, dt,
\]
or analytically (see equation 2.2),
\[
H_{\Delta}(X) = \frac{1}{\ln 2}\left[ \ln(\tau\,\Gamma(a)) + (1-a)\frac{\Gamma'(a)}{\Gamma(a)} + a \right] - \log_2 \Delta t, \tag{3.5}
\]
where Γ′(a) is the derivative of the gamma function. For consistency with the bimodal case, where the analytic entropy is not available in closed form (see below), the entropy of unimodal distributions is also obtained from equation 2.1. To keep the set of possible messages finite (Shannon, 1948), the summation is truncated after N terms, when the pᵢ are sufficiently small to achieve numerical convergence (we use 1 − Σ pᵢ < 1 × 10⁻⁸). In the absence of an analytic expression for H(X) in the bimodal case, the entropy is estimated from the Shannon entropy function (see equation 2.1), now using
\[
p_i = f \left[ Q((i+1)\Delta t;\, a_1, s_1, \tau_1) - Q(i\Delta t;\, a_1, s_1, \tau_1) \right] + (1-f) \left[ Q((i+1)\Delta t;\, a_2, s_2, \tau_2) - Q(i\Delta t;\, a_2, s_2, \tau_2) \right]. \tag{3.6}
\]
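Equation 3.5 is easy to evaluate numerically. The Python sketch below approximates the digamma term Γ′(a)/Γ(a) by a central difference of log-gamma, since the standard library lacks a digamma function; parameter values in the check are those of the unimodal fit reported in section 4.1.

```python
import math

def digamma(a, h=1e-6):
    """Gamma'(a)/Gamma(a) via a central difference of log-gamma
    (the standard library has lgamma but no digamma)."""
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def gamma_interval_entropy(a, tau, dt):
    """Entropy in bits per interval of a gamma interval density at
    resolution dt (eq. 3.5); the shift s does not affect the entropy."""
    h_nats = math.log(tau) + math.lgamma(a) + (1 - a) * digamma(a) + a
    return h_nats / math.log(2) - math.log2(dt)

# a = 1 reduces to the exponential case, H = log2(tau * e / dt);
# a = 3.9, tau = 2.0 ms, dt = 0.5 ms are the unimodal fit values of section 4.1.
h_exp = gamma_interval_entropy(1.0, 2.0, 0.5)
h_fit = gamma_interval_entropy(3.9, 2.0, 0.5)
```

With the section 4.1 parameters this evaluates to roughly 4.9 bits per interval, consistent with the H_I value of 4.92 reported there.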
One or more gaussian distributions or a sum of gaussian and gamma distributions may provide an appropriate description when the order parameter, a, of a gamma distribution becomes large, although we note that the matching of single gaussians to individual interval distributions obtained within the visual system has been unsuccessful (Berry, Warland, & Meister, 1997). The parameter values for the analytic gaussian probability density function,
\[
G(x; m, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(x-m)^2}{2\sigma^2} \right), \qquad (\sigma > 0), \tag{3.7}
\]
from which the observed interval distribution, O, is most likely drawn, are obtained by MLE maximization of the log-likelihood function as given above for the gamma cases. In contrast with the gamma distribution, which extends over the range [0, +∞], the gaussian distribution extends over the range [−∞, +∞]. As the probability of negative temporal intervals has no meaning, such values are avoided by truncating the distribution at x = 0 by renormalization. If one defines
\[
G_T(x) = \left( \frac{2}{\operatorname{erfc}\!\left( \frac{-m}{\sqrt{2}\,\sigma} \right)} \right) G(x; m, \sigma), \tag{3.8}
\]
where
\[
\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt
\]
is the complementary error function, then
\[
\int_0^{\infty} G_T(x)\, dx = 1 \quad \text{and} \quad \sum_{i=0}^{N} G_T(x_i) \approx 1,
\]
where N is chosen as given above such that 1 − Σ pᵢ < 1 × 10⁻⁸. An example of a more complex distribution is that of a comb function composed of multiple discrete gaussians, the amplitudes of which are constrained by a gamma envelope. Here,
\[
P(x) = \frac{1}{N_C} \sum_{j=1}^{g} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-\frac{\mu_j - s}{\tau}}\, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}, \tag{3.9}
\]
where N_C is a normalization factor given by
\[
N_C = \sum_{j=1}^{g} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-\frac{\mu_j - s}{\tau}} \sqrt{\frac{\pi}{2}}\, \sigma_j \operatorname{erfc}\!\left( \frac{-\mu_j}{\sqrt{2}\,\sigma_j} \right),
\]
g is the number of gaussian components within the gamma envelope, and μⱼ is the mean of the jth gaussian (μⱼ = s₀ + jT), where s₀ is the offset of the first gaussian and T the temporal interval separating the means of each gaussian component. The entropy of such a function may then be obtained from equation 2.1, where the probability of the ith bin is
\[
p(i) = \frac{1}{N_C} \sum_{j=1}^{g} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-\frac{\mu_j - s}{\tau}} \sqrt{\frac{\pi}{2}}\, \sigma_j \left[ \operatorname{erf}\!\left( \frac{(i+1)\Delta t - \mu_j}{\sqrt{2}\,\sigma_j} \right) - \operatorname{erf}\!\left( \frac{i\Delta t - \mu_j}{\sqrt{2}\,\sigma_j} \right) \right]. \tag{3.10}
\]
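The bin probabilities of the comb function can be checked numerically. In the sketch below the component means, widths, and envelope parameters are invented for illustration; summing p(i) over a sufficiently long bin range should give one, which also confirms that the σⱼ factor is needed for normalization.

```python
import math

def comb_bin_probs(g_params, a, s, tau, dt, n_bins):
    """Bin probabilities for a gamma-enveloped comb of gaussians
    (eqs. 3.9-3.10).  g_params is a list of (mu_j, sigma_j); assumes mu_j > s >= 0."""
    def weight(mu):                       # gamma-envelope amplitude of component j
        z = (mu - s) / tau
        return z ** (a - 1) * math.exp(-z)
    # normalization N_C: each gaussian integrates over [0, inf)
    NC = sum(weight(mu) * sigma * math.sqrt(math.pi / 2)
             * math.erfc(-mu / (math.sqrt(2) * sigma)) for mu, sigma in g_params)
    probs = []
    for i in range(n_bins):
        p = 0.0
        for mu, sigma in g_params:
            p += weight(mu) * sigma * math.sqrt(math.pi / 2) * (
                math.erf(((i + 1) * dt - mu) / (math.sqrt(2) * sigma))
                - math.erf((i * dt - mu) / (math.sqrt(2) * sigma)))
        probs.append(p / NC)
    return probs

# Illustrative parameters: two components under a gamma envelope, bins of 0.5 ms.
probs = comb_bin_probs([(10.0, 1.0), (20.0, 1.5)], 2.0, 0.0, 5.0, 0.5, 400)
```

The Shannon entropy of the comb then follows from equation 2.1 applied to the returned bin probabilities.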
Because in multimodal cases the integral is not always available in closed form, summation is carried out over bins explicitly, a procedure that is equivalent to equation 2.2 in the limit of small Δt. The continuous distribution is divided into bins of width Δt, and the probability of each bin is calculated as a definite integral of the underlying continuous distribution over the width of the bin. The Shannon entropy (see equation 2.1) is calculated from the resulting bin probabilities pᵢ. If a distribution spans only a small number of bins, Δt, the entropy may be calculated for a reduced bin width and then converted to that of the required bin width from
\[
H(\Delta t_2) = H(\Delta t_1) - \log_2\!\left( \frac{\Delta t_2}{\Delta t_1} \right), \tag{3.11}
\]
where Δt₁ and Δt₂ are the smaller and larger bin widths, respectively. Note that binning is not required to obtain the underlying probability distribution, as the parameters of the analytic distribution are obtained by the MLE methodology. Thus, the assumption of a particular value of Δt for the entropy calculation plays no role in the determination of the analytic probability distribution P(X). Intuitively, there is a range of resolutions, Δt, at which the summation of the Shannon entropy function (see equation 2.1) might be performed, corresponding to the temporal resolution of the postsynaptic targets of the spike train. For the purpose of comparing the various entropy estimation methodologies, a bin width of 0.5 ms is employed here (Regehr & Stevens, 1999; Reinagel & Reid, 2000; Sabatini & Regehr, 1999).

3.1.2 Data Analysis. Analysis and fitting routines were developed by the authors in MATLAB Version 6.5 (MathWorks Inc., Natick, MA; http://www.mathworks.com). A MATLAB-supplied function that implements the Nelder-Mead simplex direct search algorithm, fminsearch, was employed to perform the fits. To enforce the requirement sₙ ≥ 0 (see equation 3.1), this variable is transformed to a variable β,
\[
\beta = -\ln\!\left( \frac{x_{\min} - s - \varepsilon}{x_{\min}} \right), \qquad -\infty < \beta < \infty, \tag{3.12}
\]
where x_min is the smallest interval in O and ε gives the tolerance to which (x_min − s) approaches zero (we use ε = 1 × 10⁻⁸). During the MLE procedure, β is allowed to vary freely. The true value of s is recovered following parameter estimation via the inverse transformation,
\[
s = x_{\min}\left( 1 - e^{-\beta} \right) - \varepsilon. \tag{3.13}
\]
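The reparameterization of equations 3.12 and 3.13 can be verified as an exact round trip. A minimal sketch (function names are illustrative):

```python
import math

EPS = 1e-8   # tolerance epsilon from the text

def beta_of_s(s, x_min, eps=EPS):
    """Map the constrained shift s (0 <= s < x_min - eps) to the
    unconstrained fitting variable beta (eq. 3.12)."""
    return -math.log((x_min - s - eps) / x_min)

def s_of_beta(beta, x_min, eps=EPS):
    """Recover the shift parameter after fitting (eq. 3.13)."""
    return x_min * (1.0 - math.exp(-beta)) - eps
```

As s approaches x_min − ε, β diverges to +∞, so the optimizer can never propose a shift that exceeds the smallest observed interval.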
Confidence intervals were obtained from suitable modifications of a MATLAB-supplied function, mle, which uses the input data and the parameter values of the MLE fit to calculate the Fisher information matrix, the inverse of which gives the covariance matrix of the parameter value estimates, θ. Confidence intervals, Cₙ (where n gives the magnitude of the confidence interval), are expressed as θ ± cν, where ν is the square root of the corresponding diagonal element of the inverse Fisher information matrix, and c is a constant that depends on the specific confidence interval required. For the analyses reported here, this was typically 99%, that is, c = √2 erf⁻¹(0.99). As entropy is determined by the variance of a distribution and is independent
of the mean (Rieke et al., 1997), mₙ and sₙ are not included in the values reported for confidence intervals.

3.2 Statistical Tests. The validity of the fit of each analytic distribution to the observed data was assessed by both the Kolmogorov-Smirnov (KS; Press, Teukolsky, Vetterling, & Flannery, 1992) and Anderson-Darling (AD; Anderson & Darling, 1954) statistical tests. The AD test is a modification of the KS test that increases the power of the KS test in the tails of a distribution. For each test, the probability (p_D and p_W for the KS and AD tests, respectively) of obtaining values of the test statistic (D and W for the KS and AD tests, respectively) that are equal to or greater in magnitude than the observed test statistic was calculated. A p-value close to zero indicates that a significant difference is likely to exist for the sample size used. Critical values for goodness-of-fit testing determined from estimated parameters must be calculated for each distribution, as under these circumstances they are much smaller than tabulated values (e.g., Lilliefors, 1967), and in any event, values have not been tabulated for the composite distributions we used. As the parameters of an analytic distribution estimated by MLE from a given data set may be used directly (Kotz & Johnson, 1983), for each fit a number of intervals equal to the number in the observed data set are sampled with replacement from the analytic distribution and the appropriate test statistic is calculated. This procedure is repeated 5000 times to give a distribution of the test statistic from which critical and p-values may be obtained.
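The parametric bootstrap just described can be sketched in a few lines. The example below is illustrative only: it uses an exponential model rather than the paper's gamma and comb fits, and far fewer than 5000 replicates, purely for speed.

```python
import math
import random

def ks_statistic(sample, cdf):
    """Two-sided KS statistic D = sup |F_emp - F_model|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        d = max(d, abs((i + 1) / n - fx), abs(i / n - fx))
    return d

def bootstrap_pvalue(observed, cdf, sampler, n_rep=200, seed=0):
    """Parametric bootstrap p-value: resample from the fitted model and count
    how often the resampled KS statistic is at least the observed one
    (cf. section 3.2; the paper uses 5000 replicates)."""
    rng = random.Random(seed)
    n = len(observed)
    d_obs = ks_statistic(observed, cdf)
    exceed = sum(ks_statistic([sampler(rng) for _ in range(n)], cdf) >= d_obs
                 for _ in range(n_rep))
    return exceed / n_rep

# Illustrative use with an exponential model, tau = 2 (not the paper's fits):
tau = 2.0
cdf = lambda x: 1.0 - math.exp(-x / tau)
sampler = lambda rng: rng.expovariate(1.0 / tau)
data = [sampler(random.Random(42)) for _ in range(1)] + \
       [random.Random(43).expovariate(1.0 / tau)]  # placeholder; replaced below
data = [r for r in (random.Random(44).expovariate(1.0 / tau) for _ in range(200))]
p = bootstrap_pvalue(data, cdf, sampler, n_rep=200)
```

Sorting the pooled replicate statistics instead of counting exceedances yields the critical values mentioned in the text.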
When resampling from the comb function, the fractional contribution of each gaussian component is obtained from
\[
f_j = \frac{1}{N_C} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-\frac{\mu_j - s}{\tau}} \sqrt{2\pi}\, \sigma_j . \tag{3.14}
\]
The validity of the MLE fits was also assessed qualitatively by visual comparison of the observed and analytic distributions and quantitatively by calculation of the root mean squared (RMS) error of the fit with the observed distribution. In both cases, cumulative interval distributions are constructed by assuming that intervals that varied by less than the recording resolution are equivalent. The RMS error, E, may then be obtained as
\[
E = \left\{ \frac{1}{N} \sum_{j=1}^{N} \left[ O(x_j) - A(x_j) \right]^2 \right\}^{1/2} \times 100\%, \tag{3.15}
\]
where j ranges over the N observations included in the observed, O, and analytic, A, distributions. Additionally, many of the fits were also performed by direct minimization of E, with results similar to those obtained by MLE fitting. All entropy estimates are reported as the entropy per event, where an event may be either a spike or an interval.
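Equation 3.15 in code (a short sketch; the inputs are cumulative distribution values evaluated at the same N points):

```python
import math

def rms_error(observed, analytic):
    """Percent RMS error E between observed and analytic cumulative
    distributions evaluated at the same points (eq. 3.15)."""
    n = len(observed)
    return 100.0 * math.sqrt(
        sum((o - a) ** 2 for o, a in zip(observed, analytic)) / n)
```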
3.3 Source Data.

3.3.1 Simulated Spike Trains. Spike trains were obtained from a model Purkinje neuron simulated at a temporal resolution of 10 µs (similar to that described by Coop & Reeke, 2001). Different discharge patterns were evoked by changing the number and activity of afferent (parallel fiber) synapses under two different stimulus paradigms. In the first, the Poissonian paradigm, impulse activity on each afferent fiber was generated as a Poisson point process with impulses temporally uncorrelated across the active fibers. In the second, the pulsed paradigm, the stimulus was composed of packets of pulses delivered at a fixed mean interpulse interval. Each packet was composed of impulses distributed as a truncated gaussian in time across the active afferent fibers, with one impulse from the packet assigned at random to each fiber. The temporal jitter of the impulses within each pulse was controlled by varying the coefficient of variation (CV) of the gaussian (CV: 0.1 for the reported simulations). Paired-pulse facilitation following the model proposed by Atluri (1996) was included in the model PC. The number of active afferent fibers that contributed to each pulse controlled the rate and pattern of neuronal discharge. In both stimulation paradigms, the afferent pulse rate was held constant for the duration of a simulation.

3.3.2 Cultured Hippocampal Neurons. Extracellular simultaneous recordings of the spontaneous discharge of five hippocampal neurons were provided by E. Keefer of the Neurosciences Institute (San Diego, CA). These data were obtained from a culture of dissociated E18 rat hippocampal cells after 27 days in culture according to the basic methods of Ransom, Neale, Henkart, Bullock, & Nelson (1977) on a substrate-integrated 64-electrode array (Gross, Wen, & Lin, 1985). For recording, the electrode array was placed in a constant-temperature bath recording chamber and heated to 37°C. pH was maintained at 7.4 in a 90% air/10% CO₂ atmosphere.
Neuronal activity was recorded with a two-stage, 64-channel amplifier system (Plexon Inc., Dallas, TX) and digitized simultaneously on a Dell 410 workstation. Spike identification and separation were accomplished with a template-matching algorithm (Plexon Inc.) in real time to provide single-unit spike timing data at a temporal resolution of 25 µs.

4 Results

4.1 Model Purkinje Cell. The essential features of the analytic distribution method are given in Figure 1 for simulated spike trains generated by either Poisson (see Figures 1A and 1C) or pulsed (see Figures 1B and 1D) stimulation at 8 impulses per second per afferent fiber. Under Poissonian stimulation, the model PC discharged at a rate of 56 spikes per second when contacted by 1000 afferent fibers. The interval probability distribution obtained from a 100 s duration spike train was well fit (RMS error of fit: 1.7%) with a single gamma function to give an interval entropy, H_I, of 4.92 bits per interval (see Figure 1Ai). The shape and scale parameters for this fit were 3.9 ± 0.2 (a ± C₀.₉₉) and 2.0 ± 0.1 (τ ± C₀.₉₉), respectively. Although for a bimodal gamma fit (see Figure 1Aii) the C₀.₉₉ values were increased for the first (a ± C₀.₉₉: 5.7 ± 1.5, τ ± C₀.₉₉: 1.3 ± 0.3 ms) and second (a ± C₀.₉₉: 1.5 ± 0.5, τ ± C₀.₉₉: 4.4 ± 1.1 ms) gamma components, a lower RMS error (0.22%) suggested that this fit gave a better model of the data than the single gamma fit. This was confirmed by calculation of the p-values p_D and p_W, both of which were below 0.01 and above 0.99 for the single and double gamma fits, respectively. It was therefore concluded that the observed interval distribution was well described by the bimodal analytic distribution.

[Figure 1 here: panels A (Poisson) and B (pulsed) show interval probability versus interspike interval (ms) with observed data and fits; panels C and D show the corresponding cumulative probabilities. See caption below.]
When this model was used to estimate the entropy of the given spike train, a slightly reduced value of 4.90 bits per interval was obtained for H_I. Under an 8 impulse per second per fiber pulsed stimulus (CV: 0.1, i.e., the SD of afferent impulse jitter was 12.5 ms around a 125 ms interpulse period) at a convergence of 800 afferent fibers, the model PC also discharged at 56 spikes per second. A reasonable fit to the interval data (RMS error 2.4%) was obtained with a bimodal distribution composed of a single gamma (a ± C₀.₉₉: 3.9 ± 0.2, τ ± C₀.₉₉: 2.0 ± 0.1 ms) and a single gaussian distribution (σ ± C₀.₉₉: 1.7 ± 0.04). As both p_D and p_W were more than 0.99, it was concluded that the observed interval distribution was not significantly different from the composite analytic distribution, and based on this fit, H_I was estimated to be 3.52 bits per interval. We note that entropy estimates were also obtained from analytic distributions obtained by RMS error minimization. These values were little different (typically less than 2.5%) from those obtained with the MLE methodology. The use of the best equivalent continuous distribution with the interval method allows good estimates of the entropy of neural discharge to be obtained from restricted data sets (as justified below, where interval entropy estimates from longer and shorter samples of the same data are compared with those obtained with the direct and KL methods). Further, the validity of the chosen data model can be quantified by the p-values obtained for the KS and AD statistics.

Figure 1: The fit of analytic distributions by maximum likelihood estimation to observed interval data for simulated spike trains evoked in response to stimulation of a model Purkinje neuron. The stimulus had a mean rate per fiber of 8 spikes per second for both Poissonian (A, C) and pulsed (B, D) activation. By simulating 1000 afferent fibers in the Poissonian case and 800 in the pulsed case, we obtained a mean discharge rate of 56 spikes per second in both cases. The bin width of all histograms is 0.5 ms. (A) Interval probability distributions obtained from a 100 s spike train (binned histogram) appear to be well fit by either unimodal (i) or bimodal (ii) gamma distributions (dashed lines). (B) Interval probability distribution for a pulsed stimulus is well fit with a composite bimodal distribution comprising a single gamma and a single gaussian distribution (left and right peaks, respectively). The observed distribution is discontinuous in the range 10–90 ms, where no intervals are observed (see D). (C) The bimodal gamma probability distribution (dashed line) obtained from the fit in Aii is shown as a cumulative interval histogram for comparison with the cumulative interval histogram constructed from the observed data (unbroken line). (D) The composite cumulative interval probability distribution (dashed line) obtained from the fit in B is shown for comparison with a cumulative interval histogram constructed from the observed data. Cumulative interval distributions are binned at the resolution of spike detection (10 µs). Note the two discrete peaks in the histogram and the absence of intervals in the range 10–90 ms.
To gain insight into the spike train duration required to ensure data adequacy, spike trains of increasing duration (up to 5000 s) were analyzed. As before, these spike trains were generated in response to 8 impulses per second per ber Poissonian or pulsed stimulation. At a discharge rate of 6 spikes per second (750 and 270 afferent bers, Poissonian and pulsed stimulation, respectively), the entropy estimate provided by each of the interval, KL, and direct methodologies typically stabilized within 10 to 40 s (see Figures 2A and 2B) (see below for comments about estimating HKL ). In
Entropy (bits/event)
A. POISSON
B. PULSED
15
15
10
10
5
5 6 spikes/s: 57 spikes/s:
0 0.1
1
10
HI HI
HKL HKL
HD HD
0 100 1000 0.1 1 10 Spike Train Duration (s)
99% Confidence Interval (%)
C. 100
(1)
10
1
0.1
100 1000
D. (2)
100
10
a1 a2
1
1
t1 t2
10
a1 s2
t1
0.1 1 10 100 1000 Spike Train Duration (s)
100 1000
Estimating The Temporal Interval Entropy of Neuronal Discharge
957
particular, good estimates of the entropy (within 1.6% to 2.7% of asymptotic values) were provided by the interval methodology with records as short as 6 s (see Figure 2A, arrow). At a discharge rate of 56 spikes per second (1000 and 800 afferent bers, Poissonian and pulsed stimulation, respectively), HI stabilized to within 0.5% to 1.7% of asymptotic values within 6 s, while all entropy estimates were close to their asymptotic values within 100 s of discharge (see Figures 2A and 2B). The discrepancy between HR and HD , on the one hand, and HI and HKL , on the other, for all cases except slow discharge under Poissonian stimulation is explored further below. Figure 2 also shows the effect of increased spike train duration on the magnitude of C0:99 values (given as a percentage of the associated parameter value estimate, i.e., C0:99% D .cº=µ/ £ 100%/ for the parameters of the analytic distribution used to calculate HI . For both the uni- and bimodal ts
Figure 2: Facing page. Effect of spike train duration on the entropy magnitude calculated by three different methodologies for spike trains evoked at two different discharge rates under two different stimulation paradigms. (A) Entropy estimates obtained from spike trains evoked in response to 8 impulses per second per ber Poissonian stimulation. At a discharge rate of 6 spikes per second (stimulation by 750 afferent bers), entropy values estimated by the analytic distribution method (HI ), the direct method (HD ), and the Kozachenko-Leonenko method (HKL ), asymptoted to steady-state values with as little as 6, 8 and 100 s of data, respectively. At a discharge rate of 56 spikes per second (stimulation by 1000 afferent bers), all three entropy estimates asymptoted to steady-state values within 10 s. Given sufcient data, the three entropy estimates obtained at a discharge rate of 6 spikes per second were all well matched to the predicted mean rate entropy, HR (upper horizontal dashed line), whereas at 56 spikes per second only HD agreed with the upper bound given by HR (lower horizontal dashed line). (B) Entropy estimates obtained from spike trains evoked in response to 8 impulses per second per ber pulsed stimulation. At a discharge rate of 6 spikes per second (stimulation by 270 afferent bers) HI , H D , and HKL asymptoted to steady-state values with spike train durations of 10–40 s, whereas at 56 spikes per second (stimulated by 800 afferent bers) values asymptoted to steady-state values within 10 s. Given sufcient data, HD again behaved similarly to the rate-based entropy estimator H R (horizontal dashed lines). (C, D) Relationship between 99% condence intervals (given as a percentage of the t parameter value) obtained for the relevant MLE parameters of the analytic distributions t to observed interval distributions as spike train duration was increased from 1 to 5000 s, for the Poissonian and pulsed stimulation paradigms, respectively. 
For spike train durations less than 40 s (in C), a single gamma function was sufficient to ensure no significant difference between the observed and analytic distributions, whereas for longer records, two gamma functions were required. Confidence intervals for the second gamma component of the 40 s record are not shown as they were above 100% (4513% and 1348% for α2 and τ2, respectively).
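The one- and two-gamma fits described above can be illustrated with a direct maximum-likelihood fit of a two-component gamma mixture. The parameterization, optimizer, starting values, and function name below are illustrative assumptions, not the authors' implementation (the paper's procedure also admitted gaussian components and reported confidence intervals on the MLE parameters):

```python
import numpy as np
from scipy import optimize, stats

def fit_two_gamma(intervals):
    """Sketch: MLE fit of a two-component gamma mixture to an ISI
    sample by direct minimization of the negative log-likelihood.
    Returns (w, (shape1, scale1), (shape2, scale2))."""
    x = np.asarray(intervals, dtype=float)

    def nll(theta):
        w = 1.0 / (1.0 + np.exp(-theta[0]))    # logistic -> (0, 1)
        a1, s1, a2, s2 = np.exp(theta[1:])      # log -> positive
        pdf = (w * stats.gamma.pdf(x, a1, scale=s1)
               + (1.0 - w) * stats.gamma.pdf(x, a2, scale=s2))
        return -np.sum(np.log(pdf + 1e-300))

    m = x.mean()
    # crude starting guess: two shape-2 components straddling the mean
    x0 = np.array([0.0, np.log(2.0), np.log(m / 4.0),
                   np.log(2.0), np.log(m)])
    res = optimize.minimize(nll, x0, method="Nelder-Mead",
                            options={"maxiter": 20000, "fatol": 1e-6})
    w = 1.0 / (1.0 + np.exp(-res.x[0]))
    a1, s1, a2, s2 = np.exp(res.x[1:])
    return w, (a1, s1), (a2, s2)
```

Confidence intervals such as the C0.99 values above could then be obtained, for example, from the curvature of the log-likelihood at the optimum or by bootstrap resampling.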
G. Reeke and A. Coop
in Figures 2A and 2B, the C0.99% values declined with increased spike train duration (see Figures 2C and 2D, respectively). It was notable that with short data records, although confidence intervals for the MLE-estimated parameter values could be large, HI entropy values were effectively asymptotic and p-values were large. For example, under Poissonian stimulation, a 100 s spike train with a mean rate of 56 spikes per second gave C0.99% values of more than 20% (see Figure 2C), while HI varied by less than 0.3% from its asymptotic value and both pD and pW were more than 0.99. Similarly, under pulsed stimulation for the same spike train duration but a two-component fit, C0.99% values for the first and second components were more than 6% and 2.5%, respectively, while HI varied by less than 0.4% from its asymptotic value and both pD and pW were also more than 0.99 (see Figure 2D). Overall, for the HI data shown in Figure 2A, the mean values of pD and pW were more than 0.37 for 6 spikes per second discharge and more than 0.68 for 56 spikes per second discharge, respectively. The equivalent p-values for pulsed stimulation were more than 0.57 and more than 0.95 for 6 and 56 spikes per second discharge, respectively (see Figure 2B). The magnitude of the entropy obtained with the KL method (HKL; Kozachenko & Leonenko, 1987) was sensitive to the absolute requirement for continuity of the observed interval distribution, and thus to the computational resolution at which the discharge activity of the model PC was simulated. The KL method appeared to fail with small increases in PC discharge above the lowest rates observed (see Figure 3, empty circles). This failure was attributed to the reliance of the method on obtaining differences between intervals. Given the temporal resolution of the data (10 µs), as the regularity of discharge increased with increasing discharge rate, the number of intervals classed as identical also increased.
Thus, under 64 impulses per second per fiber stimulation at a discharge rate of 68 spikes per second, only 20% and 14% of the intervals in the spike train were unique and could be used for the calculation of ρN (see equation 2.5) for Poissonian and pulsed stimulation, respectively. A good match to the entropy estimates provided by the interval entropy, HI, was obtained by artificially increasing the apparent temporal resolution at which simulated spike trains were recorded (see Figure 3, filled circles, and section 3). Thus, at a mean stimulus rate of 64 impulses per second per fiber, the percentage difference between HI and HKL was 0.89% and 1.7% for discharge rates of 2 to 141 and 10 to 163 spikes per second for Poissonian and pulsed stimulation, respectively. At higher and lower discharge rates, the difference between the two entropy estimates increased greatly, for example, to 30% for discharge rates of 158 to 257 spikes per second in response to Poissonian stimulation, and to more than 10% under pulsed stimulation for discharge rates either less than 10 or more than 192 spikes per second. For discharge rates greater than 257 and 234 spikes per second under Poissonian and pulsed stimulation, respectively, the KL entropy was negative. For these reasons, the KL entropy was not explored further (see section 5).
Estimating The Temporal Interval Entropy of Neuronal Discharge
Figure 3: Effect of the addition of uniform noise to interval times contributing to entropy estimates provided by the method of Kozachenko and Leonenko (1987). The entropy values given in this figure were obtained by analysis of spike trains evoked by 64 impulses per second per fiber. The discharge rate was increased by increased afferent convergence. (A) Poissonian and (B) pulsed stimulation of the model Purkinje neuron. Entropy estimates obtained from computed interval data (open circles) were severely degraded regardless of the stimulation paradigm. Addition of uniform noise improved the apparent measurement resolution of the observed intervals. This satisfied the methodological requirement for continuity such that entropy estimates (filled circles) provided a reasonable match to those obtained with the interval method. All data in the upper curves (filled circles) were obtained from 10 replicate analyses where the random number seed generating the added noise was changed for each case.
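The degradation-and-rescue behavior shown in Figure 3 can be reproduced with a standard k = 1 form of the Kozachenko-Leonenko nearest-neighbor estimator for one-dimensional interval data, together with the uniform-noise jitter used to break ties. The exact constant below is the textbook form of the estimator, assumed here rather than taken from the paper (equation 2.5 is defined earlier in the text), and the function name is illustrative:

```python
import numpy as np

def kl_entropy_bits(intervals, jitter=0.0, rng=None):
    """Sketch of a k = 1 Kozachenko-Leonenko nearest-neighbor
    differential entropy estimate (in bits) for a 1-D ISI sample.
    Optional uniform jitter of width `jitter` breaks ties introduced
    by coarse temporal discretization (cf. Figure 3)."""
    x = np.asarray(intervals, dtype=float)
    if jitter > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        x = x + rng.uniform(-jitter / 2.0, jitter / 2.0, size=x.shape)
    x = np.sort(x)
    n = x.size
    gaps = np.diff(x)
    # nearest-neighbor distance for each point (inf-padded at the ends)
    rho = np.minimum(np.concatenate(([np.inf], gaps)),
                     np.concatenate((gaps, [np.inf])))
    rho = np.maximum(rho, 1e-12)  # tied intervals give rho = 0
    h_nats = np.mean(np.log(rho)) + np.log(2.0 * (n - 1)) + np.euler_gamma
    return h_nats / np.log(2.0)   # convert nats to bits
```

With exactly repeated intervals, the estimate collapses toward large negative values, mirroring the negative HKL reported in the text; jittering by roughly the recording resolution restores a sensible estimate.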
We next used entropy measurements to explore the discharge behavior of the model PC in response to afferent stimulation at rates of 8, 64, and 128 impulses per second per afferent fiber, where the mean impulse rate per second per fiber was held constant for the duration of each simulation and the rate of neuronal discharge was controlled by varying the number of active fibers. Under Poissonian stimulation, an increase in the stimulus rate from 8 to 128 impulses per second per fiber had little influence on the calculated interval entropy, HI, as under these conditions, the variability of the intervals in a spike train was proportional only to the discharge rate (not shown). As we have previously shown, frequency-locked discharge evoked by pulsed stimulation may be seen when afferent convergence is set such that a single action potential is evoked with each afferent stimulus pulse (Coop & Reeke,
2001). Here, this characteristic feature of spike trains, seen for stimulation rates of 64 impulses per second per fiber or lower, was lost with an increase in the stimulus rate to 128 impulses per second per fiber. Estimates of HI obtained for spike trains generated in response to this stimulation rate were similar to those obtained for Poissonian activation (data not shown). This suggested that under these conditions, the PC was in a rate-coded regime, discharging independently of the temporal characteristics of the activation process. Analysis of the spike trains generated in response to 64 impulses per second per fiber is now presented, as these data sets adequately display many of the features seen in response to lower stimulation rates. In Figure 4, the entropies obtained with the HR, HD, and HI methods for various discharge rates under both stimulation paradigms are compared. The values obtained for HR (see Figures 4A and 4B, dashed lines) were of course identical. Those obtained for HD (see Figures 4A and 4B, solid lines) under the two protocols were superimposable, thus confirming HD to be a property of the spike train alone, independent of the stimulation paradigm (Strong et al., 1998). Under the pulsed stimulation paradigm, both HKL (see the filled circles in Figure 3B) and HI (see the filled circles in Figure 4B) showed reductions at the locking frequency (stimulus rate = discharge rate) and at the frequency-halved and frequency-doubled rates of 32 and 128 spikes per second. However, both HR and HD failed to display sensitivity to these changes in the timing of cell discharge, which were attributable to the increased periodicity known to occur when each afferent pulse evokes a single action potential. To rule out data insufficiency as a source of this failure, a strictly periodic 64 spikes per second spike train was constructed.
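A naive plug-in version of the word-based direct method (without the word-length and data-size extrapolations of Strong et al., 1998; the function below is an illustrative sketch, not the paper's implementation) reproduces the qualitative point at issue, that bin-wise stepping from an arbitrary origin keeps the word entropy of even a strictly periodic train well above zero:

```python
import numpy as np
from collections import Counter

def direct_word_entropy_bits(spike_times, dt=0.0005, word_len=10):
    """Sketch of a plug-in 'direct' entropy: discretize the spike
    train into dt bins, step a word of word_len bins through the
    train one bin at a time from the start of the record, and return
    the Shannon entropy (bits per word) of the observed words."""
    t = np.asarray(spike_times, dtype=float)
    bins = np.zeros(int(np.ceil((t.max() + dt) / dt)), dtype=np.uint8)
    bins[(t / dt).astype(int)] = 1
    words = Counter(tuple(bins[i:i + word_len])
                    for i in range(bins.size - word_len + 1))
    n = sum(words.values())
    p = np.array(list(words.values()), dtype=float) / n
    return float(-np.sum(p * np.log2(p)))
```

For a strictly periodic train whose period is not an integer number of bins, the sliding word sees a spike at every possible within-word position, so the word distribution, and hence the entropy, stays well above the zero bits per interval that the interval method assigns to a perfectly periodic signal.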
The respective HD values obtained for this spike train were effectively identical at 6.251 to 6.250 bits per spike with an increase in spike train duration from 100 to 20,000 s. These values compared well with the 6.248 bits per spike obtained when the duration of an equivalent frequency-locked spike train generated by the model PC was increased from 100 to 5000 s (discharge rates both 64.1 spikes per second) and were also concordant with the values obtained for an increase in spike train duration from 100 to 5000 s under Poissonian stimulation (mean discharge rates 63.7 and 63.4 spikes per second, HD 6.257 and 6.265 bits per spike, for the respective spike train durations). We note that the change from Poissonian to pulsed stimulation typically decreased the spike train CV by a factor of five. Alternatively, analysis of the strictly periodic spike train for durations of 100 and 20,000 s with the interval methodology gave values of 0 bits per interval for HI in both cases, as expected for application of Shannon theory to a strictly periodic signal. At discharge rates higher than about 256 and 215 spikes per second under Poissonian and pulsed stimulation, respectively, the MLE methodology was unable to provide estimates of parameter values for analytic distributions that were not significantly different (α = 0.01, 0.05) from the observed distributions. This failure was attributed to the presence of a 3.5 ms absolute
Figure 4: Comparison of the rate, HR, direct, HD, and interval, HI, entropies, calculated for 100 s duration spike trains obtained from a model Purkinje neuron (PC) in response to stimulation at 64 impulses per second per afferent fiber. PC discharge rate was controlled by the number of active fibers (see the text for further details). (A) Poissonian stimulation. In comparison with estimates of HI, which assumed intervals to be independent, estimates of HD were considerably larger in magnitude for discharge rates higher than about 10 spikes per second. (B) Pulsed stimulation. In this case, the entropy provided by the direct method, HD, did not appear to be sensitive to the changes in spike timing that occur under conditions of frequency locking, where each pulse of afferent impulses evokes a single action potential and cell discharge becomes near periodic (vertical dashed lines at 32, 64, and 128 spikes per second discharge). The fact that the values of HR and HD remain effectively superimposed across the two stimulation paradigms tested suggests that, unlike the entropy estimates provided by HI, these measures fail to distinguish alterations in the temporal properties of the evoked discharge patterns. Filled circles indicate that fits of observed data to analytic distributions were not significantly different (α ≤ 0.05) by both the Kolmogorov-Smirnov and Anderson-Darling statistics. Open circles indicate the opposite. See the text for further details.
refractory period in the model PC, which provided an inflexible lower limit for the shortest interspike intervals. As the rate of PC discharge increased above 200 to 250 spikes per second, the observed interval distributions became more truncated to the left. The peak associated with this truncation could not be fit with the gamma and gaussian distributions employed. Although these peaks undoubtedly could have been well fit with one or more Dirac delta functions, this was not pursued.
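The significance screening used above, in which an analytic fit is rejected when it differs significantly from the observed distribution, can be sketched for the Kolmogorov-Smirnov half of the test. The function name, the single-gamma restriction, and the use of scipy's kstest are illustrative assumptions (scipy's standard Anderson-Darling routine does not cover arbitrary fitted gammas, so only the KS part is shown):

```python
import numpy as np
from scipy import stats

def ks_screen_gamma_fit(intervals, alpha=0.05):
    """Sketch: fit a single gamma to the ISIs by MLE, then compute
    the Kolmogorov-Smirnov statistic D and p-value of the sample
    against the fitted analytic distribution. Because the parameters
    are estimated from the same data, the nominal p-value is biased
    upward (cf. Lilliefors, 1967) and serves only as a screen."""
    shape, loc, scale = stats.gamma.fit(intervals, floc=0.0)
    d, p = stats.kstest(intervals, "gamma", args=(shape, loc, scale))
    return d, p, p > alpha
```

A refractory-truncated or strongly bimodal interval distribution fails this screen for any single-gamma fit, which is the signal that additional mixture components (or, at the extreme, delta-like components) would be needed.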
In summary, the interval method appears to provide an entropy estimate that, under the assumption of independence, is quite sensitive to the distribution of intervals, whereas the direct method provides estimates that appear insensitive to changes in impulse timing as, for example, when the stimulation paradigm is changed from Poissonian to pulsed.

4.2 Cultured Hippocampal Neurons. To further demonstrate the interval methodology, the spontaneous discharge behavior of acutely dissociated, cultured hippocampal neurons was also analyzed. Spike train data were obtained simultaneously via extracellular recording (30 minute duration) from each of five neurons located in cultured slices from rat hippocampus area CA1. Discharge characteristics for each neuron and the results of three different entropy analyses are given in Table 1. By visual inspection, the interval histogram for each neuron appeared to be bimodal, with preferred interval durations corresponding to mean spike rates of 9 and 70 spikes per second (not shown). However, the mean number of gamma and gaussian components fit with the interval methodology ranged from 4 to 6 (pD and pW both over 0.55). As was the case for the simulated spike trains, the effect of spike train duration on entropy magnitude was assessed to determine whether in each case sufficient data were available to obtain a reliable entropy estimate for the hippocampal neurons. Results for cells 1, 2, and 3 are given in Figure 5, as their discharge rates spanned the range observed for the five neurons analyzed. In each case, the direct entropy, HD, appeared to asymptote within 400 to 1000 s to the values calculated for the respective rate entropy estimates, HR (see Figure 5A), whereas HI and HKL reached their asymptotic
Table 1: Characteristics of the Spontaneous Discharge of Five Acutely Dissociated Cultured Neurons Obtained from Rat Hippocampus Area CA1.

Neuron  Spikes   Rate        pD     pW     HD            HI            HKL
(No.)   (No.)    (Spikes/s)                (Bits/event)  (Bits/event)  (Bits/event)
3         3519    2.0        0.90   0.90   11.44         10.28         10.24
4         4704    2.6        0.84   0.83   11.02          9.66          9.63
2       11,771    6.5        0.63   0.55    9.69          8.39          8.34
5       19,219   10.7        0.85   0.90    8.97          6.05          6.04
1       21,655   12.0        0.68   0.57    8.81          7.46          7.43

Notes: Data were obtained from simultaneous 30 minute duration recordings. The mean p-values for the Kolmogorov-Smirnov (pD) and Anderson-Darling (pW) statistical tests are given for the fits from which the interval entropy values plotted in Figure 5 were calculated. All entropies (HD, direct; HI, interval; HKL, Kozachenko-Leonenko) were calculated with a bin width of 0.5 ms. HKL was obtained as the mean of 10 replicates of each data set in which the random number seed for the added noise was varied. The entropy estimates obtained from the HI and HKL methods were quite similar, whereas HD differed from both of them.
Figure 5: Entropy-based analysis of the extracellularly recorded spontaneous discharge of cultured rat hippocampus CA1 neurons of unknown type. (A) Comparison of entropy estimates obtained with the rate, HR , and direct, HD , methods from segments of various durations taken from 1800 s spike train records. HD appears to asymptote to the value calculated for HR . (B) Effect of spike train duration on the entropy estimated by the interval, HI , and KL (HKL ; Kozachenko & Leonenko, 1987) methodologies. Both of these entropy estimates asymptote to constant values for spike train durations of 1000 s and longer. These asymptotic values are smaller than those obtained with HD . Further details are given in Table 1 and the text.
values for spike train durations of about 1000 s or longer (see Figure 5B). As was the case for the simulated spike trains (see Figure 2A), here HI and HKL were characteristically smaller than HD and HR.

5 Discussion

The analytic distribution method for calculating the interval entropy provides a methodology that appears to be well suited for the quantification of the entropy of biological signals such as those embodied in spike trains. When applied to the interval data obtained from a spike train, the methodology determines the interval distribution from which an observed interspike interval distribution was most likely obtained. Characterization of the corresponding analytic distribution by an appropriate MLE procedure putatively gives the population of intervals predicted to occur in a spike train of effectively infinite duration exhibiting temporal properties statistically
indistinguishable from those of the observed spike train. The mean entropy per interval, HI, can then be calculated at any desired temporal resolution, Δt, from the analytic distribution. For the purposes of this method, when applied to a spike train, the "symbol" of the Shannon formulation is taken to comprise the temporal interval between each pair of spikes and the spike that signals the completion of each interval. Thus, calculation of the mean entropy per interval for a given spike train serves to quantify an entropy associated with the timing of each action potential. The interval methodology was developed in response to two problems associated with previous measures of entropy. The first is that of the sampling error introduced with the estimation of a suitable probability density function from a discrete sample. When neural discharge is assumed to provide the signaling element propagated within the neural code, it is not immediately obvious what distribution is appropriate a priori, particularly when no reasonable assumption on the prior is self-evident. This problem has been considered both generally (Wolpert & Wolf, 1995) and more recently for neural activity (Panzeri & Treves, 1996). However, as has been shown here for the interval method, it is relatively straightforward to choose empirically a continuous function or combination of functions that describes the interval distribution obtained from a given spike train with a degree of precision better than that typically encountered for the experimental error associated with physiological experiments (less than 2% to 5%). The second methodological improvement arises because, for the interval method, the selection of the temporal resolution for the entropy calculation is decoupled from the determination of the interspike interval distribution.
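A minimal sketch of this calculation, assuming a single-gamma analytic distribution (the paper's fits also used multi-component gamma/gaussian mixtures, and the function name is illustrative), fits the distribution by MLE and then discretizes the fitted distribution at a resolution Δt chosen independently of the fit:

```python
import numpy as np
from scipy import stats

def interval_entropy_bits(intervals, dt=0.0005):
    """Sketch of H_I: fit an analytic (single-gamma) distribution to
    the observed ISIs by MLE, then compute the Shannon entropy, in
    bits per interval, of that distribution discretized into bins of
    width dt. The resolution dt is decoupled from the fit itself."""
    shape, loc, scale = stats.gamma.fit(intervals, floc=0.0)
    # discretize the fitted analytic distribution at resolution dt
    t_max = stats.gamma.ppf(1.0 - 1e-9, shape, loc=loc, scale=scale)
    edges = np.arange(0.0, t_max + dt, dt)
    p = np.diff(stats.gamma.cdf(edges, shape, loc=loc, scale=scale))
    p = p[p > 0.0]
    return float(-np.sum(p * np.log2(p)))
```

Halving dt adds roughly one bit per interval, since for a smooth density the discretized entropy behaves as the differential entropy minus log2(dt); the fit itself is untouched by this choice.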
Unlike previous methods, for which the choice of bin width Δt for digitization of the spike train also sets the resolution at which the entropy calculation is made, the interval method does not rely on a binning procedure: MLE techniques are employed to obtain the underlying probability distribution from which the population of observed intervals was most likely drawn. This allows the resolution of the entropy estimate to be independent of the estimation of the parameters of the continuous distribution from which it is obtained. This step may therefore be based on relevant principles other than those used to select an observational bin width. Other novel features of the interval method are related to quantification of the validity of the entropy estimate. They include confidence intervals on the parameters obtained for the analytic distribution, the availability of the RMS error obtained from the fit between the observed and the analytic distribution, and statistical testing of the fit between the observed and analytic distributions. Although confidence intervals were relatively large for the 100 s spike train durations employed in our analyses, entropy estimates typically asymptoted to a constant value before the confidence intervals narrowed with increasing spike train duration. Large values of the RMS error may be used as an internal indicator of situations in which the method
may not be applied with confidence. Further, visual comparison of the discrete and continuous probability distributions typically indicates whether the error is due to an incorrect choice of the analytic distribution or, with an appropriate choice, whether the error is either uniformly distributed across all interval durations or localized. Thus, data about which interval durations exhibit anomalous behavior become available. When taken together with the availability of statistical tests suited to determining the validity of the fit between the observed and fit distributions, the usefulness of the methodology is greatly enhanced. In particular, calculation of both the KS statistic, D, and the AD statistic, W, provides both quantification and localization of any anomalies: at the median or the tails of a distribution for the KS and AD tests, respectively. The nonparametric estimator provided by Kozachenko and Leonenko (1987), while at first appearing attractive, was problematic when applied to neuronal data. This was attributed to the strict requirement for continuity, which was not satisfied even at temporal resolutions of 10 µs. (We note in passing that this characteristic of the KL entropy might advantageously be employed to detect discretization of interval data.) Analyses could be performed by adding uniform noise to the intervals in an observed distribution, a procedure that artificially increased the measurement resolution. This was particularly the case when an interval distribution contained a sharp shoulder, for example, at higher discharge rates when interval durations were limited by the absolute refractory period. Ideally, the observed distribution could be regenerated by resampling from an appropriate analytic distribution, obtained, for example, by the interval methodology outlined here. This data set would then provide sufficient resolution to estimate the KL entropy.
The disadvantage of this approach is that the crucial steps required to estimate HKL are also those required for estimating the interval entropy, HI, with the entropy in both cases determined by the validity of the analytic distribution obtained. Further, the requirement for continuity of the distribution for the KL method would appear to preclude resampling techniques for obtaining confidence intervals on the KL entropy estimate. Finally, the KL method provides a single entropy value that is devoid of the mechanistic insights that may be obtained using the interval methodology. We now turn to a comparison of the interval and direct methodologies. Here it was notable that, as might be predicted, the two methodologies gave entropy estimates in good agreement at low discharge rates, when the activation process was Poissonian and interspike intervals could be assumed to be independent events. Nonetheless, it should be understood that the basic assumptions of the two methods regarding the nature of the fundamental signaling event are different, and so the results cannot be expected to agree in all cases. This was particularly apparent as the periodicity of neuronal discharge increased near the locking frequency under a pulsed stimulation paradigm. In this case, spike train entropy should be minimal
in accord with Shannon theory, on the grounds of the low variability of interval duration and an increased correlation in interval sequences, both of which lower the number of messages that in principle could be conveyed by such spike trains. The apparent overestimation of HD in such cases was attributed to the generation of novel words by the procedure of stepping the word bin-wise through the discretized spike train from an arbitrary origin. This increased the number of different word patterns observed in a spike train and led to a reduction in the probability attributed to any given spike or spike pattern and a consequent inflation of its contribution to the calculated entropy. The interval methodology avoids this problematic aspect of the direct method because intervals are assumed to be independent and, as might be considered appropriate for a biological context, no origin for the enumeration of different interval patterns needs to be defined. Comparison of entropies obtained for spike trains of equivalent discharge rates but differing temporal patterns (as occurred when the stimulation paradigm was altered from Poissonian to pulsed) supported the proposition that the entropy estimate provided by the direct method was not sensitive to changes in spike timing. The fact that HD appeared insensitive to interval correlations under conditions of near-periodic neuronal discharge suggests that this measure may be unfavorably biased when applied within a biological context. Finally, because of its ability to quantify temporal aspects of neuronal discharge with a computationally convenient procedure that is effective with relatively short data records, we consider that the interval methodology shows considerable promise for the analysis of both experimental and simulated spike trains.

Acknowledgments

This work was supported by the Neurosciences Research Foundation. We thank E.
Keefer of the Neurosciences Institute, San Diego, California, for supplying the data on cultured hippocampal neurons, V. Piëch for insightful discussion regarding implementation of the KL entropy calculations, and the anonymous referees for their helpful comments.

References

Adrian, E. D. (1928). The basis of sensation. New York: Norton.
Anderson, T. W., & Darling, D. A. (1954). A test of goodness of fit. Journal of the American Statistical Association, 49, 765–769.
Atluri, P. P., & Regehr, W. G. (1996). Determinants of the time course of facilitation at the granule cell to Purkinje cell synapse. Journal of Neuroscience, 16, 5661–5671.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proceedings of the National Academy of Sciences USA, 94, 5411–5416.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.
Brecht, M., Goebel, R., Singer, W., & Engel, A. K. (2001). Synchronization of visual responses in the superior colliculus of awake cats. Neuroreport, 12, 43–47.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552.
Brillouin, L. (1962). Science and information theory (2nd ed.). New York: Academic Press.
Buracas, G. T., & Albright, T. D. (1999). Gauging sensory representations in the brain. Trends in Neurosciences, 22, 303–309.
Buracas, G. T., Zador, A. M., DeWeese, M. R., & Albright, T. D. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Butts, D. A. (2003). How much information is associated with a particular stimulus? Network: Computation and Neural Systems, 14, 177–187.
Buzsáki, G., Llinás, R., Singer, W., Berthoz, A., & Christen, Y. (Eds.). (1994). Temporal coding in the brain. Berlin: Springer-Verlag.
Coop, A. D., & Reeke, G. N. (2001). Deciphering the neural code: Neuronal discharge variability is preferentially controlled by the temporal distribution of afferent impulses. Neurocomputing, 38, 153–157.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
deCharms, R. C., & Zador, A. (2000). Neural representation and the cortical code. Annual Review of Neuroscience, 23, 613–647.
Dimitrov, A. G., & Miller, J. P. (2001). Neural coding and decoding: Communication channels and quantization. Network: Computation and Neural Systems, 12, 441–472.
Dobrušin, R. L. (1958). A simplified method of experimentally evaluating the entropy of a stationary sequence. Teoriya Veroyatnostei i ee Primeneniya, 3, 462–464.
Eckhorn, R., & Pöpel, E. (1974). Rigorous and extended application of information theory to the afferent visual system of the cat. I. Basic concepts. Kybernetik, 16, 191–200.
Eggermont, J. J. (1998). Is there a neural code? Neuroscience and Biobehavioral Reviews, 22, 355–370.
Eggermont, J. J. (2001). Between sound and perception: Reviewing the search for a neural code. Hearing Research, 157, 1–42.
Friston, K. J. (1997). Another neural code? NeuroImage, 5, 213–220.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. Journal of Neurophysiology, 79, 1135–1144.
Gilbert, C. D. (2001). The neural basis of perceptual learning. Neuron, 31, 681–697.
Gross, G. W., Wen, W., & Lin, J. (1985). Transparent indium-tin oxide patterns for extracellular, multisite recording in neuronal cultures. Journal of Neuroscience Methods, 15, 243–252.
Hartley, R. V. L. (1928). Transmission of information. Bell System Technical Journal, 7, 535–563.
Klemm, W. R., & Sherry, C. J. (1981). Serial ordering in spike trains: What's it "trying to tell us"? International Journal of Neuroscience, 14, 15–33.
Kotz, S., & Johnson, N. L. (Eds.). (1983). Encyclopedia of statistical sciences. New York: Wiley.
Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of the entropy of a random vector. Problems of Information Transmission, 23, 95–101.
Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399–402.
Lisman, J. E. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. Trends in Neurosciences, 20, 38–43.
Liu, R. C., Tzonev, S., Rebrik, S., & Miller, K. D. (2001). Variability and information in a neural code of the cat lateral geniculate nucleus. Journal of Neurophysiology, 86, 2789–2806.
MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bulletin of Mathematical Biophysics, 14, 127–135.
Mann, N. R., Schafer, R. E., & Singpurwalla, N. D. (1974). Methods for statistical analysis of reliability and life data. New York: Wiley.
Manwani, A., & Koch, C. (2000). Detecting and estimating signals over noisy and unreliable synapses: Information-theoretic analysis. Neural Computation, 13, 1–33.
Meister, M., & Berry, M. J., II. (1999). The neural code of the retina. Neuron, 22, 435–450.
Noda, H., & Adey, W. R. (1970). Firing variability in cat association cortex during sleep and wakefulness. Brain Research, 18, 513–526.
Optican, L. M., Gawne, T. J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures of transmitted information and channel capacity from multivariate neuronal data. Biological Cybernetics, 65, 305–310.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of Neurophysiology, 57, 162–178.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network: Computation and Neural Systems, 7, 87–107.
Perkel, D. H. (1970). Spike trains as carriers of information. In F. O. Schmitt (Ed.), The Neurosciences Second Study Program (pp. 587–596). New York: Rockefeller University Press.
Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neurosciences Research Program Bulletin, 6, 221–348.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical Recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Ransom, B. R., Neale, E., Henkart, M., Bullock, P. N., & Nelson, P. G. (1977). Mouse spinal cord in cell culture. Journal of Neurophysiology, 40, 1132–1150.
Regehr, W. G., & Stevens, C. F. (1999). Physiology of synaptic transmission and short-term plasticity. In G. Stuart, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 135–175). New York: Oxford University Press.
Reich, D. S., Mechler, F. M., Purpura, K. P., & Victor, J. D. (2000). Interspike intervals, receptive fields, and information encoding in primary visual cortex. Journal of Neuroscience, 20, 1964–1974.
Reinagel, P., Godwin, D., Sherman, S. M., & Koch, C. (1999). Encoding of visual information by LGN bursts. Journal of Neurophysiology, 81, 2558–2569.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. Journal of Neuroscience, 20, 5392–5400.
Reinagel, P., & Reid, R. C. (2002). Precise firing events are conserved across neurons. Journal of Neuroscience, 22, 6837–6841.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Sabatini, B. L., & Regehr, W. G. (1999). Timing of synaptic transmission. Annual Review of Physiology, 61, 521–542.
Sanderson, A. C., & Kobler, B. (1976). Sequential interval histogram analysis of non-stationary neuronal spike trains. Biological Cybernetics, 22, 61–71.
Segundo, J. P. (1970). Communication and coding by nerve cells. In F. O. Schmitt (Ed.), The Neurosciences Second Study Program (pp. 569–586). New York: Rockefeller University Press.
Segundo, J. P. (2000). Some thoughts about neural coding and spike trains. BioSystems, 58, 3–7.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18, 3870–3896.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Sherry, C. J., & Klemm, W. R. (1981). Entropy as an index of the informational state of neurons. International Journal of Neuroscience, 15, 171–178.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophysical Journal, 5, 173–194.
Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophysical Journal, 7, 797–826.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–200.
Szczepanski, J., Amigó, J. M., Wajnryb, E., & Sanchez-Vives, M. V. (2003). Application of the Lempel-Ziv complexity to the analysis of neural discharges. Network: Computation and Neural Systems, 14, 335–350.
970
G. Reeke and A. Coop
Tiesinga, P. H. E., Jos´e, J. V., & Sejnowski, T. J. (2000). Comparison of currentdriven and conductance-driven neocortical model neurons with HodgkinHuxley voltage-gated channels. Physical Review E, 62, 8413–8419. Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research, 23, 775–785. Usher, M., Stemmler, M., & Koch, C. (1994). Network amplication of local uctuations causes high spike rate variability, fractal ring patterns and oscillatory local eld potentials. Neural Computation, 6, 795–836. Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. Journal of Neurophysiology,76, 1310– 1326. Werner, G., & Mountcastle, V. B. (1965). Neural activity in mechanoreceptive cutaneous afferents: Stimulus response relations, Weber functions, and information transmission. Journal of Neurophysiology, 28, 359–397. Whitsel, B. L., Schreiner, R. C., & Essick, G. K. (1977). An analysis of variability in somatosensory cortical neuron discharge. Journal of Neurophysiology, 40, 589–607. Wiener, M. C., & Richmond, B. J. (1999). Using response models to estimate channel capacity for neuronal classication of stationary visual stimuli using temporal coding. Journal of Neurophysiology, 82, 2861–2875. Wolpert, D. H., & Wolf, D. R. (1995). Estimating functions of probability distributions from a nite set of samples. Physical Review E, 52, 5841–6854. Received March 13, 2002; accepted September 29, 2003.
LETTER
Communicated by Emanuel Todorov
Dynamic Analysis of Neural Encoding by Point Process Adaptive Filtering

Uri T. Eden
[email protected]
Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, and Division of Health Sciences and Technology, Harvard Medical School/Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Loren M. Frank [email protected]
Riccardo Barbieri [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A.
Victor Solo [email protected] School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia, and Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, U.S.A.
Emery N. Brown [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, and Division of Health Sciences and Technology, Harvard Medical School/Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Neural receptive fields are dynamic in that, with experience, neurons change their spiking responses to relevant stimuli. To understand how neural systems adapt their representations of biological information, analyses of receptive field plasticity from experimental measurements are crucial. Adaptive signal processing, the well-established engineering discipline for characterizing the temporal evolution of system parameters, suggests a framework for studying the plasticity of receptive fields. We use the Bayes' rule Chapman-Kolmogorov paradigm with a linear state equation and point process observation models to derive adaptive filters appropriate for estimation from neural spike trains. We derive point
Neural Computation 16, 971–998 (2004)
© 2004 Massachusetts Institute of Technology
process filter analogues of the Kalman filter, recursive least squares, and steepest-descent algorithms and describe the properties of these new filters. We illustrate our algorithms in two simulated data examples. The first is a study of slow and rapid evolution of spatial receptive fields in hippocampal neurons. The second is an adaptive decoding study in which a signal is decoded from ensemble neural spiking activity as the receptive fields of the neurons in the ensemble evolve. Our results provide a paradigm for adaptive estimation for point process observations and suggest a practical approach for constructing filtering algorithms to track neural receptive field dynamics on a millisecond timescale.

1 Introduction

Experience-dependent plasticity has been well documented in a number of brain regions in both animals and humans (Kaas, Merzenich, & Killackey, 1983; Merzenich, Nelson, Stryker, Cynader, Schoppmann, & Zook, 1984; Donoghue, 1995; Weinberger, 1993; Pettet, 1992; Jog, Kubota, Connolly, Hillegaart, & Graybiel, 1999). In rats, for example, place cells recorded in the CA1 region of the hippocampus change their receptive field location, scale, and temporal firing characteristics as the animal explores familiar environments (Mehta, Barnes, & McNaughton, 1997; Mehta, Quirk, & Wilson, 2000; Frank, Eden, Solo, Wilson, & Brown, 2002). Similarly, the firing properties of neurons in the primary motor cortex of primates change with the load against which the animal must exert force (Gandolfo, Li, Benda, Schioppa, & Bizzi, 2000; Li, Padoa-Schioppa, & Bizzi, 2001). Characterizing this receptive field plasticity is important for understanding how neural activity represents biological information. To study the plasticity of receptive fields, we turn to adaptive signal processing, a well-established engineering discipline for characterizing the temporal evolution of system parameters (Solo & Kong, 1995; Haykin, 1996).
Several well-known filtering algorithms that serve this purpose, such as the Kalman filter, recursive least-squares, and steepest-descent algorithms, fall into a class of filters that can be derived from Bayes' rule and the Chapman-Kolmogorov equations. But because these algorithms require continuous-valued measurements, they are of limited utility in the analysis of neural plasticity, where the spike train observations are point processes. We previously introduced a steepest-descent point process adaptive filter that used spike train measurements to analyze neural plasticity (Brown, Nguyen, Frank, Wilson, & Solo, 2001). This adaptive filter was derived from an instantaneous log-likelihood criterion function and was used in a detailed study of receptive field plasticity in hippocampal CA1 and entorhinal cortical neurons (Frank et al., 2002) to track the evolution of these receptive fields on a millisecond timescale. Those analyses suggested a number of aspects of point process filtering for further development. In particular, those filters used a static learning rate and did not allow calculation of confidence
bounds for the parameter estimates. Furthermore, the relation of the point process filters to well-established adaptive filters for continuous-valued data was not discussed. Here we extend our point process adaptive filtering framework by deriving a new set of filters using the Bayes' rule Chapman-Kolmogorov paradigm with a linear state equation and point process observation models. This extends the derivation in Brown et al. (2001) and allows us to derive point process analogues of the Kalman filter, recursive least squares, and steepest-descent algorithms. We discuss the properties of the new filters and illustrate their performance in two simulated data examples. The first is a study of both linear and jump evolution of place receptive fields in hippocampal neurons. The second is an adaptive decoding study in which a signal is decoded from ensemble neural spiking activity as the receptive field properties of the neurons in the ensemble evolve and this evolution is simultaneously estimated.

2 Theory

2.1 The Conditional Intensity Function. Given an observation interval (0, T], let N(t) be the counting process giving the total number of spikes fired by a given neuron in the interval (0, t], for t ∈ (0, T]. A stochastic point process representing neural spike data can be characterized by its conditional intensity function, λ(t | x(t), θ(t), H(t)), where x(t) is a vector representing the biological signal to which the neural system is tuned, θ(t) is a vector representing the tuning function parameters for this neuron, and H(t) denotes the history of the spiking process and the biological signal up to time t (Snyder & Miller, 1991). This conditional intensity function defines the neuron's instantaneous firing rate in terms of the instantaneous probability of spiking as

λ(t | x(t), θ(t), H(t)) = lim_{Δt→0} Pr(N(t + Δt) − N(t) = 1 | x(t), θ(t), H(t)) / Δt.   (2.1)

Thus, the probability of the neuron's firing a single spike in a small interval [t, t + Δt) can be approximated as λ(t | x(t), θ(t), H(t))Δt. This conditional intensity function is a history-dependent generalization of the inhomogeneous Poisson rate function (Daley & Vere-Jones, 1988). Here, we switch from a continuous time to a discrete time framework and present our methods as recursive algorithms rather than as differential equations in order to simplify ensuing calculations. The observation interval can be partitioned into {t_k}_{k=0}^{K}, with individual time steps represented by Δt_k = t_k − t_{k−1}. The continuous-valued state, parameter, and firing history functions defined above are the same in discrete time, but are now indexed
974
U. Eden, L. Frank, R. Barbieri, V. Solo, and E. Brown
by an integer k representing the associated time step in the partition. For example, we use x_k for x(t_k), N_k for N(t_k), and so forth. ΔN_k = N_k − N_{k−1} is then the new spiking information observed during the time interval (t_{k−1}, t_k]. If Δt_k is sufficiently small, then the probability of more than one spike occurring in this interval is negligible. In that case, ΔN_k takes on the value 0 if there is no spike in (t_{k−1}, t_k] or 1 if there is a spike. For this reason, we refer to ΔN_k as the spike indicator function for the neuron at time t_k. We let ΔN_{1:k} = [ΔN_1, …, ΔN_k] denote the ensemble spike observations up to time t_k. For a fine partition, this is the sequence of 0s and 1s that is typically used to represent a spike train in discrete time. To simplify notation, we aggregate all of the history that is pertinent to the firing probability at t_k into a common term H_k = [θ_{1:k−1}, x_{1:k}, N_{1:k−1}].

2.2 Derivation of the Point Process Adaptive Filter. We assume that θ_k is a time-varying parameter vector to be estimated at each discretized point in time. To develop an adaptive filter, a recursive expression for θ_k is derived in terms of its previous value and the state and spiking histories. Therefore, estimates of θ_k will be based on its posterior density, conditioned on the set of past observations, p(θ_k | ΔN_k, H_k). This posterior density evolves with time and with each observation. Tracking this evolution allows for real-time estimation of the parameters of the conditional intensity function based on all of the spiking data observed up to the current time.

2.2.1 The System and Observation Equations. We define the system equation as a set of parameter vectors with linear evolution processes and gaussian errors,
θ_{k+1} = F_k θ_k + η_k,   (2.2)

where F_k is a system evolution matrix and η_k is a zero-mean white-noise process with covariance matrix Q_k. If θ_0 and η_k have gaussian distributions, then each future θ_k will also have a gaussian distribution with its own mean and variance. The second essential component for the construction of a recursive estimation equation is the likelihood or observation model, p(ΔN_k | θ_k, H_k). This model defines the probability of observing spikes in the interval (t_{k−1}, t_k], based on the current state of the system and the past spiking information. From the conditional intensity function in equation 2.1, we obtain an expression for the probability of observing a spike in the interval (t_{k−1}, t_k], Pr(ΔN_k) = [λ(t_k | θ_k, H_k)Δt_k]^{ΔN_k} [1 − λ(t_k | θ_k, H_k)Δt_k]^{1−ΔN_k} + o(Δt_k). For small values of Δt_k, this Bernoulli probability expression is well approximated as

Pr(ΔN_k spikes in (t_{k−1}, t_k] | θ_k, H_k) = exp(ΔN_k log(λ(t_k | θ_k, H_k)Δt_k) − λ(t_k | θ_k, H_k)Δt_k)   (2.3)
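As a numerical sanity check on this approximation, the sketch below (not from the paper; the function and variable names are illustrative) compares the Poisson-form bin probability of equation 2.3 against the exact Bernoulli probability for a small bin:

```python
import numpy as np

def log_lik_bin(dN, lam, dt):
    """Log of the Poisson-form bin probability in equation 2.3:
    dN*log(lam*dt) - lam*dt, with spike indicator dN in {0, 1}."""
    return dN * np.log(lam * dt) - lam * dt

# Exact Bernoulli bin probabilities vs. the equation 2.3 approximation
# for an illustrative rate of 20 spikes/s and a 1 ms bin.
lam, dt = 20.0, 0.001
p_spike_exact, p_quiet_exact = lam * dt, 1.0 - lam * dt
p_spike_approx = np.exp(log_lik_bin(1, lam, dt))
p_quiet_approx = np.exp(log_lik_bin(0, lam, dt))
# The two agree to order (lam*dt)^2 as the bin width shrinks.
```
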
(Brown, Barbieri, Eden, & Frank, 2003). Bayes' rule can be used to obtain the posterior density for the parameter vector at time t_k as a function of a prediction density of the state at time t_k given the observations only up to time t_{k−1}, and the probability of firing a spike in the interval (t_{k−1}, t_k],

p(θ_k | ΔN_k, H_k) = p(ΔN_k | θ_k, H_k) p(θ_k | H_k) / p(ΔN_k | H_k).   (2.4)

The first term in the numerator of equation 2.4 is the likelihood in equation 2.3, and the second term is the one-step prediction density defined by the Chapman-Kolmogorov equation as

p(θ_k | H_k) = ∫ p(θ_k | θ_{k−1}, H_k) p(θ_{k−1} | ΔN_{k−1}, H_{k−1}) dθ_{k−1}.   (2.5)

Equation 2.5 has two components: p(θ_k | θ_{k−1}, H_k), given by the state evolution model in equation 2.2, and p(θ_{k−1} | ΔN_{k−1}, H_{k−1}), the posterior density from the last iteration step. The denominator in equation 2.4 defines a normalizing factor that ensures that the posterior distribution will integrate to 1. Substituting equation 2.5 into equation 2.4 yields the recursive expression for the evolution of the posterior density of the time-varying parameter vector,

p(θ_k | ΔN_k, H_k) = [λ(t_k | θ_k, H_k)Δt_k]^{ΔN_k} exp(−λ(t_k | θ_k, H_k)Δt_k)
    × ∫ p(θ_k | θ_{k−1}, H_k) p(θ_{k−1} | ΔN_{k−1}, H_{k−1}) dθ_{k−1} / p(ΔN_k | H_k).   (2.6)

Equation 2.6 is the exact posterior density equation. It allows us to compute the evolution of the density of the state vector, conditioned on all the spike observations made up to that time. There are a number of ways to proceed in evaluating equation 2.6, depending on the desired balance between accuracy and computational efficiency. In this letter, a gaussian approximation is made to this posterior distribution at each point in time. This leads to a simple algorithm that is computationally tractable as the dimension of the parameter vector increases. By assumption, the state evolution in equation 2.2 has a gaussian random component, so that p(θ_k | θ_{k−1}) is a gaussian function in the variable θ_k − θ_{k−1}. If the posterior density at the last time step is approximated by a gaussian, then equation 2.5 is the convolution of two gaussians, and hence the one-step prediction density, p(θ_k | H_k), must be gaussian itself. Let θ_{k|k−1} = E[θ_k | H_k] and W_{k|k−1} = var[θ_k | H_k] be the mean and variance of the one-step prediction density, and θ_{k|k} = E[θ_k | ΔN_k, H_k] and W_{k|k} = var[θ_k | ΔN_k, H_k] be the mean and variance of the posterior density. The goal in this derivation is to obtain a recursive expression for the
posterior mean, θ_{k|k}, and the posterior variance, W_{k|k}, in terms of observed and previously estimated quantities. As detailed in the appendix, by writing a quadratic expansion of the log of the posterior density in equation 2.6 about θ_{k|k−1}, the following recursive algorithm for updating the posterior mean and variance can be obtained:

θ_{k|k−1} = F_k θ_{k−1|k−1},   (2.7)

W_{k|k−1} = F_k W_{k−1|k−1} F_k' + Q_k,   (2.8)

(W_{k|k})^{−1} = (W_{k|k−1})^{−1} + [ (∂ log λ/∂θ_k)' [λΔt_k] (∂ log λ/∂θ_k) − (ΔN_k − λΔt_k) ∂² log λ/(∂θ_k ∂θ_k') ]_{θ_{k|k−1}},   (2.9)

θ_{k|k} = θ_{k|k−1} + W_{k|k} [ (∂ log λ/∂θ_k)' (ΔN_k − λΔt_k) ]_{θ_{k|k−1}},   (2.10)

where, for simplicity, λ denotes λ(t_k | x_k, θ_k, H_k), and λ and its derivatives are evaluated at θ_{k|k−1}. Equations 2.7 through 2.10 comprise a recursive point process adaptive filter. Equations 2.7 and 2.8 are the one-step prediction equations for the parameter vector mean and covariance, respectively, corresponding to the distribution on the left-hand side of equation 2.5, whereas equations 2.10 and 2.9 are the posterior mean and variance equations corresponding to the distribution in equation 2.4. Because this adaptive estimation algorithm tracks the evolution of a stochastic state or parameter vector using point process observations, we will refer to it as a stochastic state point process filter (SSPPF).

2.3 Properties of the Algorithm

2.3.1 Innovations. The term (ΔN_k − λ(t_k | θ_{k|k−1}, H_k)Δt_k) in equation 2.10 is the point process version of the innovation vector from standard adaptive filter theory (Daley & Vere-Jones, 1988; Haykin, 1996; Brown, Frank, Tang, Quirk, & Wilson, 1998). For short intervals, ΔN_k is the spike indicator function, and λ(t_k | θ_{k|k−1}, H_k)Δt_k is approximately the probability of a spike in the interval (t_{k−1}, t_k]. Therefore, at each step of the recursive update algorithm, the innovation can be thought of as the difference between the probability of observing a spike given the current model estimate and whether one is actually observed. At all observation times when a spike does occur, the parameter estimates are adjusted to increase the instantaneous firing rate estimate, and at all nonspike observation times, they are adjusted to decrease it. Thus, this adaptive estimation procedure uses information from both spiking and nonspiking intervals to compute its receptive field estimate.
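The one-step update of equations 2.7 through 2.10 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function signature, the NumPy representation, and the scalar toy model λ = exp(θ) used in the example are all assumptions made for the sketch.

```python
import numpy as np

def sspff_step(theta, W, dN, dt, F, Q, lam_fn, loglam_grad, loglam_hess):
    """One step of the stochastic state point process filter (eqs. 2.7-2.10).

    theta, W    : posterior mean and covariance from the previous time step
    dN          : spike indicator (0 or 1) for the current bin of width dt
    F, Q        : state evolution matrix and noise covariance (equation 2.2)
    lam_fn      : conditional intensity lambda as a function of theta
    loglam_grad : d(log lambda)/d(theta), returned as a 1-D array
    loglam_hess : d^2(log lambda)/(d theta d theta'), returned as a matrix
    """
    # One-step prediction (equations 2.7 and 2.8)
    theta_pred = F @ theta
    W_pred = F @ W @ F.T + Q
    # Evaluate lambda and its derivatives at the prediction mean
    lam = lam_fn(theta_pred)
    g = loglam_grad(theta_pred).reshape(-1, 1)
    H = loglam_hess(theta_pred)
    # Posterior covariance (equation 2.9)
    W_post_inv = np.linalg.inv(W_pred) + (lam * dt) * (g @ g.T) - (dN - lam * dt) * H
    W_post = np.linalg.inv(W_post_inv)
    # Posterior mean (equation 2.10); (dN - lam*dt) is the innovation
    theta_post = theta_pred + (W_post @ g).ravel() * (dN - lam * dt)
    return theta_post, W_post

# Illustrative scalar model: lambda = exp(theta), so d(log lambda)/d(theta) = 1
# and the hessian is 0 (an assumed toy model, not the place field of eq. 3.1).
lam_fn = lambda th: float(np.exp(th[0]))
grad = lambda th: np.array([1.0])
hess = lambda th: np.zeros((1, 1))
theta0, W0 = np.array([np.log(10.0)]), np.array([[0.1]])
F, Q = np.eye(1), 1e-4 * np.eye(1)
th_spike, _ = sspff_step(theta0, W0, 1, 0.001, F, Q, lam_fn, grad, hess)
th_quiet, _ = sspff_step(theta0, W0, 0, 0.001, F, Q, lam_fn, grad, hess)
# A spike raises the rate estimate; a silent bin lowers it.
```

Note that the learning rate here is W_{k|k} itself, so no separate gain schedule is needed; running the same step with Q_k set to zero gives the RLS-type special case discussed below.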
2.3.2 Learning Rate. Each algorithm uses the innovation vector as the basic error signal between the instantaneous rate estimate of the receptive field model and the current spiking activity. The term that modulates the innovation vector controls the learning rate of the algorithm. This is analogous to the gain on a standard adaptive filter and is determined by the choice of a criterion function or the method of derivation (Haykin, 1996). In the case of the SSPPF, the learning rate is adaptive and is proportional to the one-step prediction variance. In this manner, it is similar to the Kalman gain for a standard Kalman filter. As with the Kalman gain, this learning rate uses the statistical properties of the underlying process to characterize the uncertainty of its estimates. One important feature of the standard Kalman filter is that the recursive update equation for the Kalman gain is independent of the observations and can therefore be computed prior to filtering. The posterior variance in equation 2.9, on the other hand, contains a term relating to the current firing rate estimate and another term relating to the innovation, neither of which is known in advance, and so it must be computed during filtering.

2.3.3 Confidence Intervals. Since we estimate the covariance of the parameter vector at each point in time, we can construct confidence intervals about the parameter estimates. If the observation and state models are correct, the 1 − α confidence interval provides bounds that contain the true parameter value with probability 1 − α. For our gaussian posterior densities, the 99% confidence interval for the ith component of the parameter vector is given by θ^i_{k|k} ± 2.575 (W^{i,i}_{k|k})^{1/2}, where W^{i,i}_{k|k} is the ith diagonal element of the posterior covariance estimate.

2.4 Related Algorithms. As is true for the Kalman filter, we can derive two additional filters as special cases of the SSPPF algorithm.

2.4.1 A Point Process Analogue of the RLS Filter.
If the state evolution noise η_k in equation 2.2 is zero, that is, if the state evolution is deterministic rather than stochastic, then the SSPPF reduces to an analogue of the well-known recursive least squares (RLS) filter (Haykin, 1996). The resulting point process algorithm is

θ_{k|k−1} = F_k θ_{k−1|k−1},   (2.11)

(W_{k|k})^{−1} = (F_k W_{k−1|k−1} F_k')^{−1} + [ (∂ log λ/∂θ_k)' [λΔt_k] (∂ log λ/∂θ_k) − (ΔN_k − λΔt_k) ∂² log λ/(∂θ_k ∂θ_k') ]_{θ_{k|k−1}},   (2.12)

θ_{k|k} = θ_{k|k−1} + W_{k|k} [ (∂ log λ/∂θ_k)' (ΔN_k − λΔt_k) ]_{θ_{k|k−1}}.   (2.13)
This filter is applicable in situations in which the receptive field parameters undergo deterministic linear evolution.

2.4.2 A Steepest-Descent Point Process Filter (SDPPF). If we replace the adaptive learning rate term in the posterior state equation by a constant matrix ε and set F_k = I, we can derive a point process analogue of the method of steepest descent (Haykin, 1996; Brown et al., 2001). This algorithm eliminates the variance update equations 2.8 and 2.9 and thus becomes

θ_{k|k} = θ_{k−1|k−1} + ε [ (∂ log λ/∂θ_k)' (ΔN_k − λΔt_k) ]_{θ_{k−1|k−1}}.   (2.14)

We have kept the (∂ log λ/∂θ_k)'|_{θ_{k−1|k−1}} term in the update equation. This term serves to scale the innovation so that updates in θ cause appropriately sized changes in the predicted firing rate. In order to apply a steepest-descent filter, it is first necessary to choose an appropriate learning rate matrix. A diagonal matrix is often chosen, with individual elements representing the learning rates of the individual parameters. In some cases, a good conjecture for these values can be suggested by the nature of the coding problem. When a training data set is available and the values of the parameters are already well estimated, an appropriate matrix can be obtained either by minimizing some error function over this data set or by running the full SSPPF on the training data and then averaging the resulting adaptive gain matrices, W̄ = K^{−1} Σ_{k=1}^{K} W_{k|k}, and using either this full matrix or just its diagonal terms. This algorithm is computationally simpler than the SSPPF because it requires no matrix inversions. What is lost in this simplification is the flexibility associated with an adaptive learning rate, as well as the measure of uncertainty that the posterior variance estimate provides.

2.5 Goodness-of-Fit Measures. Point process observation models based on the conditional intensity function offer a framework for measuring the goodness of fit between the model and the observed spike trains. The Kolmogorov-Smirnov (KS) statistic is one such measure. As described in Brown, Barbieri, Ventura, Kass, and Frank (2002), if λ(t_k | x_k, θ_k, H_k) correctly describes the rate of the process generating the spike train, then the set of rescaled interspike intervals (ISIs) z_i = 1 − exp(−∫_{l_{i−1}}^{l_i} λ(u | θ_u, H_u) du), where l_i is the time of the ith spike, should be independently distributed according to a uniform density. The KS statistic is then the maximum absolute deviation between the order statistics of these rescaled ISIs and the cumulative density of a uniform. It ranges from 0 to 1, with lower values representing
better fit to the data. To give this number meaning, we generally compare the KS statistic to its 95% confidence bound, 1.36 · N^{−1/2}, where N is the number of observed ISIs. If the estimated model is sufficiently close to the one that produced the spike train, then the KS statistic should be less than this confidence bound with probability 0.95. In addition, in our simulation studies, we compute the mean squared error (MSE) between the true and estimated parameter trajectories.

3 Applications

3.1 Dynamic Analysis of Spatial Receptive Fields in CA1 Hippocampal Pyramidal Neurons. In this example, we examine the simulated response of pyramidal cells in the CA1 region of the rat hippocampus to the movement of the animal on a linear track. These neurons have place receptive fields in that a neuron fires only when the animal is in certain regions of space (O'Keefe & Dostrovsky, 1971), and these place fields evolve during the course of an experiment, even when the animal is in a familiar environment (Mehta et al., 1997, 2000; Frank, Brown, & Wilson, 2000). In previous work (Brown et al., 2001), we demonstrated the feasibility of tracking place field evolution using an SDPPF. In the current analysis of two simulated place field evolution scenarios, we compare the performance of the SSPPF, the SDPPF, a pass-by-pass method, and an extended Kalman filter algorithm using an empirical firing rate estimate. The receptive field model is identical to the one in our previous analyses (Brown et al., 2001),

λ(t_k | θ_k) = exp{α_k − (x_k − μ_k)² / (2σ_k²)},   (3.1)

where x_k is the position of the animal on the linear track at time t_k, μ_k is the center of the place field, σ_k is the scale, exp(α_k) is the maximum firing rate, and θ_k = [α_k μ_k σ_k]^T. In each of the two evolution scenarios we studied, the parameter vector evolved from an initial value of [α μ σ]_0 = [log(10) 250 12^{1/2}] to a final value [α μ σ]_{800} = [log(30) 150 20^{1/2}] over 800 seconds.
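A minimal sketch of the place field intensity in equation 3.1, evaluated at the initial parameter values above, is given below. The per-bin Bernoulli spike draw is a simple illustrative stand-in rather than the spike generator used in the study, and all names are hypothetical.

```python
import numpy as np

def place_field_rate(x, alpha, mu, sigma):
    """Conditional intensity of equation 3.1: a gaussian place field with
    peak rate exp(alpha), center mu (cm), and scale sigma."""
    return np.exp(alpha - (x - mu) ** 2 / (2.0 * sigma ** 2))

# Initial parameter values from the text: peak rate 10 spikes/s at 250 cm,
# scale sigma = 12**0.5.
alpha0, mu0, sigma0 = np.log(10.0), 250.0, np.sqrt(12.0)
peak_rate = place_field_rate(250.0, alpha0, mu0, sigma0)  # 10 spikes/s at center

# Evaluate the field along one sweep of the 300 cm track and draw spikes with
# a per-bin Bernoulli rule (probability lambda*dt per 20 ms bin).
rng = np.random.default_rng(0)
dt = 0.02
x = np.linspace(0.0, 300.0, 1500)
lam = place_field_rate(x, alpha0, mu0, sigma0)
spikes = rng.random(x.size) < lam * dt
```
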
The values of the initial and final parameter vectors are based on observed changes in actual hippocampal place field properties (Mehta et al., 1997, 2001; Frank et al., 2002). In the first evolution scenario, each parameter evolved linearly, as in Brown et al. (2001). The resulting receptive field was a slowly evolving gaussian curve that steadily grew in maximum firing rate and increased in scale as its center migrated at a constant velocity. This scenario is consistent with the types of changes observed when a rat executes a behavioral task in a familiar linear environment (Mehta et al., 1997, 2001). In the second scenario, the parameters remained constant at their initial values for 400 seconds, after which they jumped instantaneously to their final values, yielding a sudden change in
the location, scale, and height of the gaussian place field. This evolution scenario is motivated by recent studies of learning in the rat while recording from the hippocampus, which have suggested that place fields can appear or transform very rapidly (Frank, Stanley, & Brown, 2003). We simulated the movement of a rat running back and forth along a 300 cm linear track at a speed of 125 cm per second. These position values, along with the parameter vectors θ_k, determined the firing rate at each time point using equation 3.1 under an inhomogeneous Poisson process model. The spike trains for the linear and jump evolution place fields were generated using the time-rescaling algorithm in Brown et al. (2002). Figure 1 shows the simulated spike trains for each of the place field evolution scenarios. Our goal was to estimate the temporal evolution of the place fields from the observed spike trains. In order to apply the SSPPF to this problem, we first define a state model for the evolution of the parameter vector. For simplicity, we used a random walk for the state evolution model of the parameter vector in equation 2.2 by setting F_k = I, letting Q_k be the diagonal matrix Q_k = Diag(10^{−5}, 10^{−3}, 10^{−4}), and using a Δt_k of 20 msec. The covariance parameters were selected based on the maximum rate of change observed in actual hippocampal place receptive fields (Mehta et al., 1997; Brown et al., 2001). We applied the SSPPF in equations 2.7 through 2.10 with the first and second derivative terms evaluated as

∂ log λ/∂θ |_{θ_{k|k−1}} = [1, σ_{k|k−1}^{−2}(x(t_k) − μ_{k|k−1}), σ_{k|k−1}^{−3}(x(t_k) − μ_{k|k−1})²]',   (3.2)

∂² log λ/(∂θ ∂θ') |_{θ_{k|k−1}} =
[ 0   0   0
  0   −σ_{k|k−1}^{−2}   −2σ_{k|k−1}^{−3}(x(t_k) − μ_{k|k−1})
  0   −2σ_{k|k−1}^{−3}(x(t_k) − μ_{k|k−1})   −3σ_{k|k−1}^{−4}(x(t_k) − μ_{k|k−1})² ].   (3.3)

Next, we applied the SDPPF in equation 2.14, letting ε be a time-invariant diagonal 3 × 3 matrix with diagonal elements ε_α = 0.02, ε_μ = 10, and ε_σ = 1, representing the learning rates for the α, μ, and σ parameters, respectively. These parameters were selected by minimizing the mean squared error between the true and estimated receptive field parameters over a simulated training data set. We also used a pass-by-pass approach, in which a histogram of the spiking activity of the neuron over one back-and-forth pass through the environment was used to compute a static estimate of the receptive field. To implement the pass-by-pass method, on each pass we collected the spikes in 1 cm bins and let x_b and N_b give the location of and the number of spikes fired in the bth bin, respectively. The maximum firing rate was computed as
[Figure 1 about here: simulated spike rasters, Track (cm) versus Time (sec), 0–800 sec; panel A, Linear Evolution; panel B, Jump Evolution.]
Figure 1: Simulated spiking activity of a rat hippocampal place cell with (A) linear and (B) jump evolution in place field parameters. The simulated animal ran back and forth on a 300 cm linear track. The place cells are directional in that they fire as the animal passes through the place field in one direction but not in the other (see the insets). Due to the stochastic nature of the spiking activity, the number of spikes fired can change appreciably from one pass to the next, even if there is little or no change in the place field.
e^{α̂} = T_pass^{−1} Σ_b N_b, the place field center was estimated as μ̂ = (Σ_b N_b)^{−1} · Σ_b x_b N_b, and the place field scale factor as σ̂² = (Σ_b N_b)^{−1} · Σ_b (x_b − μ̂)² N_b, where T_pass is the time required to complete one full back-and-forth pass. For this algorithm, there was no continuous time representation of the place field and therefore no continual updating of the place field estimates. Instead, the place field parameters were estimated only once, after the rat completed a
back-and-forth run along the track. Previous analyses of place field plasticity have been conducted with a pass-by-pass method (Mehta et al., 1997, 2000). To construct the extended Kalman filter, we computed a continuous firing rate estimate by convolving the observed spike train with a normalized half-gaussian curve with a fixed width of 0.25 seconds. This convolution kernel was chosen to preserve the causality of the relation between incoming spiking activity and the firing rate estimate. The extended Kalman filter was defined as

(W_{k|k})^{−1} = (W_{k−1|k−1} + Q_k)^{−1} + [ (∂ log λ/∂θ)' (λΔt_k) (∂ log λ/∂θ) ]_{θ_{k−1|k−1}},   (3.4)

θ_{k|k} = θ_{k−1|k−1} + W_{k|k} [ (∂ log λ/∂θ)' (r(t_k) − λ)Δt_k ]_{θ_{k−1|k−1}},   (3.5)

where r(t_k) is the firing rate estimate (Haykin, 1996). Recent analyses of neural decoding in motor cortex have used similar rate-based extended Kalman filters (Wu et al., 2002). We compared the ability of these four algorithms to track the evolving receptive fields as estimated from the simulated spike trains for both the linear and rapid jump evolution scenarios, using three different measures: the mean squared error between the true parameters and their estimated values, the percentage of the time that the true parameter values were covered by the estimated 99% confidence intervals, and the Kolmogorov-Smirnov (KS) statistics. Figure 2 shows examples of the parameter estimates predicted by each filter for each place field evolution scenario.

Figure 2: Facing page. Example of tracking place field parameter evolution with each of the adaptive estimation and other statistical analysis methods for the (A) linear or (B) jump parameter evolution scenarios in Figure 1. The estimated parameter trajectories from each procedure are plotted along with the true parameter trajectories in the indicated column: pass-by-pass, column 1; extended Kalman filter (EKF), column 2; steepest-descent point process filter (SDPPF), column 3; and stochastic state point process filter (SSPPF), column 4. The pass-by-pass method provides only one estimate per pass, while the other three algorithms provide an estimate at each 20 msec time step. Additionally, the EKF and SSPPF are plotted with their instantaneous 95% confidence intervals (gray lines above and below the trajectory estimates). The pass-by-pass method tracks the evolution of the place field parameters least well. Both the EKF and SDPPF track the parameter trajectories well in the linear evolution scenario but underestimate the changes in α and μ and overestimate the change in σ in the jump evolution scenario. The SSPPF tracks all three parameters well in both scenarios.

In the linear evolution scenario,
Point Process Adaptive Filtering
[Figure 2 appears here: panels A (linear) and B (jump) plot alpha (log(spikes/sec)), mu (cm), and sigma (cm^2) against time (sec) for the pass-by-pass, EKF, SDPPF, and SSPPF estimates.]
U. Eden, L. Frank, R. Barbieri, V. Solo, and E. Brown
each algorithm tracked each parameter with the correct direction of movement and approximately the correct magnitude of change (see Figure 2A). In addition, the EKF and SSPPF provided instantaneous variance estimates, which we used to compute the confidence intervals shown in gray. The pass-by-pass method gave the most variable parameter trajectory estimates. This was because on several passes, the actual number of spikes observed was small, and thus there was little information on which to base the estimate. Several passes had no spikes, and hence no place field estimate could be computed. For the pass-by-pass method, there was also a lag in tracking the true parameter evolution because it estimated the parameters only after completing each full pass along the track. For the EKF, SDPPF, and SSPPF, we obtained comparable tracking for each of the three parameters, with the exception of the EKF tracking of the place field center, μ, which showed a bias toward overestimation. The confidence intervals were comparable between the EKF and SSPPF algorithms for each parameter. In the jump evolution scenario, the pass-by-pass method again performed poorly (see Figure 2B). Its estimates were noisy in both the baseline condition and after the jump. The EKF showed the same bias as in the linear evolution scenario, as can be seen in the initial segment of tracking μ. Both the EKF and the SDPPF had problems tracking the place field after the jump. Both algorithms overestimated the change in the place field scale and underestimated both the changes in the place field center and in its peak firing rate. Immediately after the jump in the place field parameters, the first indication that the place field estimate was no longer reasonable came when the observed firing activity increased drastically in an area where the place field estimate predicted a near-zero firing rate. There are multiple ways to update the parameters based on these observations.
The rate parameter, α, could increase, causing an increase in the firing everywhere; the place field width, σ, could increase to include the current area of firing; or the center of the place field, μ, could shift. In this example, all three occur simultaneously, but the most prominent effect is the shift of the field center. The EKF and SDPPF were unable to determine rapidly the relative effects of each of these and tended to overweight the learning rate for σ and underweight the rates for α and μ. The SSPPF tracked the place field dynamics in the rapid evolution scenario well. This was because it was the only algorithm that adapted its learning rate in the proper way to scale the information obtained about each parameter from the spiking activity. The SSPPF scaled up each of its relatively low learning rates right after the jump, and then scaled them down rapidly so as to continue to track correctly the again-constant parameter trajectories. As shown in Table 1, the mean-squared error (MSE) between the true and estimated place field parameter trajectories was highest in both evolution scenarios for the pass-by-pass method. Because this algorithm estimated
Table 1: Summary Analysis of the Adaptive Parameter Estimation of the Place Receptive Field Dynamics for the Linear and Jump Parameter Evolution Scenarios.

Linear Evolution
                                    Pass-by-Pass    EKF     SDPPF    SSPPF
MSE                         α            0.5        0.1      0.03     0.01
                            μ         2200        200       12       60
                            σ           35          2.5      1.1      0.5
Coverage of 99%             α            NA        68        NA      98
confidence interval         μ            NA         7        NA      74
                            σ            NA        99        NA      99
KS statistic                             0.18       0.09     0.057    0.058

Jump Evolution
MSE                         α            0.4        0.15     0.1      0.04
                            μ         3000        400      200       50
                            σ           25         10       40        2
Coverage of 99%             α            NA        55        NA      99
confidence interval         μ            NA        15        NA      99
                            σ            NA        64        NA      92
KS statistic                             0.09       0.22     0.11     0.06

Notes: Each entry is derived from an average of the indicated method being applied to 10 simulated spike trains. MSE (mean squared error); KS (Kolmogorov-Smirnov) statistic. The coverage probability is the fraction of time the 99% confidence intervals for the parameters contained the true parameter value. If the intervals are correct, they should cover the true parameter value 99% of the time. The 95% confidence interval on the KS statistic is 0.044 for the linear evolution scenario and 0.040 for the jump evolution scenario.
only one place field over a full pass, it required multiple passes before a reasonable statement about the place field could be made. Hence, it was difficult to track the place field evolution accurately. For the linear evolution, the SSPPF gave the lowest MSE for the place field scale parameter and maximum firing rate parameter trajectories, and the SDPPF gave the lowest MSE for the place field center trajectory. For each of these, the lowest MSE was approximately one order of magnitude smaller than the EKF MSE and two orders of magnitude smaller than that of the pass-by-pass algorithm. For the jump evolution scenario, the SSPPF gave the smallest MSE of all the algorithms for each parameter. Neither the pass-by-pass algorithm nor the SDPPF provided an instantaneous variance estimate. Therefore, no measure of confidence could be ascribed to their trajectory estimates. For the SSPPF and EKF, the degree to
which their parameter trajectory estimates remained within the 99% confidence intervals measured how well these algorithms characterized their uncertainty. In both scenarios, the SSPPF parameter trajectories stayed within its confidence intervals more than the EKF's. This was especially true for the place field center, which was in the 99% confidence interval 74% and 99% of the time for the SSPPF, but only 7% and 15% of the time for the EKF, for the linear and jump evolution scenarios, respectively. To assess goodness-of-fit, the observed ISIs were rescaled using the time-rescaling theorem (Brown et al., 2002) in order to compute the KS statistic for each algorithm. Figure 3 shows examples of KS plots comparing the cumulative distribution of the transformed ISIs with the cumulative distribution function of a uniform on the interval [0, 1) for each trajectory scenario for each of the four estimation strategies. The 95% confidence intervals for the KS statistic are plotted as gray bounds in each panel in Figure 3. The KS plots show that both the pass-by-pass and rate-based estimators tended to predict fewer short ISIs than were observed. The expected ISI for an animal running through its place field at a constant speed decreased as the α parameter increased. Therefore, if the estimation algorithm were slow to track these changes, it would underestimate the probability of shorter ISIs. For the linear evolution scenario, the SDPPF and SSPPF were equivalent in terms of their KS statistics and significantly better than the pass-by-pass and EKF algorithms. In the rapid jump scenario, the SSPPF performed much better than the SDPPF. In this case, the pass-by-pass estimates actually described the data better than the SDPPF. This is because even though the pass-by-pass estimates were very imprecise, they contained no history dependence and were adjusted to approximate the new place field in a single pass after the jump. Although this estimate was inaccurate, it

Figure 3: Facing page.
Examples of Kolmogorov-Smirnov (KS) plots assessing goodness-of-fit by comparing the fits of the pass-by-pass, EKF, SDPPF, and SSPPF to the spike trains from the (A) linear and (B) jump evolution scenarios. Under the time-rescaling transformation, the cumulative distribution function is that of the uniform on the interval [0, 1). In each panel, the center 45-degree gray line represents perfect agreement between the indicated model fit and the spike train data. The parallel 45-degree black lines on either side are the 95% confidence limits. The wiggly black line is the model estimate of the Kolmogorov-Smirnov plot. For the linear evolution scenario, the KS plots for the pass-by-pass method and the EKF lie outside the 95% confidence intervals, while the SDPPF and SSPPF plots lie within the bounds, suggesting that the latter two point process filters give reasonably good estimates of the observed spiking data for the linear place field evolution scenario. For the jump evolution scenario, none of the KS plots was consistently between the 95% confidence intervals. However, the KS statistic for the SSPPF was much smaller than those from the other three algorithms. This suggests that although no model describes the observed spiking data completely, the SSPPF offers a more accurate description than the other three algorithms.
[Figure 3 appears here: KS plots (cumulative distribution function vs. empirical quantiles) for the pass-by-pass, EKF, SDPPF, and SSPPF fits in the (A) linear and (B) jump evolution scenarios.]
did not lead to the in-between waiting times predicted by the more slowly adapting SDPPF and EKF. Nevertheless, the pass-by-pass method had a larger KS statistic than the SSPPF, which used a more appropriate learning rate. In order to reevaluate the choice of the constant learning rate matrix in the steepest-descent point process algorithm, the average adaptive gain matrix from the SSPPF over the entire trial, \bar{W}_{k|k} = K^{-1} \sum_{k=1}^{K} W_{k|k}, was computed, and its diagonal terms were compared to the steepest-descent learning rate. The resulting averaged learning rate parameters were consistently of the same order of magnitude as those obtained by minimizing the mean squared error of the estimates over a simulated training data set. Thus, the SSPPF was able to track the parameter trajectories in the jump evolution scenario better than the SDPPF even though the average magnitudes of their learning rates were the same. We conclude that the SSPPF algorithm's ability to adapt its learning rate estimate was essential to its tracking success. Overall, the SSPPF provided the most accurate tracking of both the linear and jump evolution scenarios.
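The time-rescaling transformation used for the KS analysis above can be sketched as follows; an illustrative implementation assuming the conditional intensity is available on a fine time grid (function and variable names are ours, not the paper's):

```python
import numpy as np

def ks_statistic_time_rescaling(spike_times, lam, t_grid):
    """Time-rescaling goodness-of-fit: rescale each ISI by the integrated
    conditional intensity; under a correct model the rescaled values
    z_k = 1 - exp(-tau_k) are uniform on [0, 1), and the KS statistic is
    the largest deviation of their ordered values from uniform quantiles."""
    # Cumulative integral of the intensity along the time grid.
    Lam = np.concatenate(([0.0], np.cumsum(lam[:-1] * np.diff(t_grid))))
    taus = np.diff(np.interp(spike_times, t_grid, Lam))   # rescaled ISIs
    z = np.sort(1.0 - np.exp(-taus))
    n = len(z)
    b = (np.arange(1, n + 1) - 0.5) / n                   # uniform quantiles
    return np.max(np.abs(z - b))
```

A model whose intensity is systematically too high or too low pushes the rescaled values away from uniformity and inflates the statistic, which is what the KS plots in Figure 3 display graphically.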
3.2 A Simulation Study of Simultaneous Adaptive Estimation of Receptive Fields and Neural Spike Train Decoding

In this example, we show how to combine an adaptive encoding algorithm with a neural spike train decoding algorithm in order to reconstruct a signal in the presence of receptive fields that are evolving. A standard decoding paradigm involves a two-step process (Brown et al., 1998). First, an encoding step, performed on a subset of the spike data, is used to characterize the receptive field properties of each neuron in the ensemble. Then, in a separate step, a decoding algorithm uses the receptive field estimates from the encoding step along with the remaining spiking activity to reconstruct the signal. This process assumes that the receptive fields are static during both the encoding and decoding periods. If the receptive field evolves during the decoding interval, no such separation between encoding and decoding periods can be made. Instead, the decoding algorithm must simultaneously update each neuron's receptive field estimate and use the updated estimates to reconstruct the signal. This creates an interdependence between encoding and decoding in that the accuracy of the receptive field estimate depends on the accuracy of the signal estimate and vice versa. To study this problem, we examined an ensemble of four simulated neurons responding to a one-dimensional stimulus. We used a notation similar to that used above to describe the receptive field dynamics of hippocampal place cells. We added superscripts to denote each neuron in the ensemble. For illustrative purposes, we regard these simulated cells as motor neurons and the simulated stimulus as a one-dimensional hand velocity. Our receptive field model consists of a background firing rate modulated by the hand
velocity, λ^j(t_k) = exp(μ + β_k^j v_k), where exp(μ) represents the background firing rate for each neuron in the absence of any movement and β_k^j represents, for neuron j, the modulation of the firing rate due to the arm velocity v_k, where j = 1, ..., 4. This model is conceptually similar to the velocity- and direction-tuned motor cortical neural model developed by Moran and Schwartz (1999), though simplified for a one-dimensional movement. We allowed both the hand velocity and the modulation parameters to be time varying. The hand movement velocity was simulated as a random walk with a noise variance of 2.5 × 10^{-5} at each 1 ms time step. The background rate for each cell was fixed at exp(μ) = 1. The time-dependent tuning functions for the cells in this simulated ensemble are shown as the straight gray lines in Figure 4. Neurons 1 and 2 had fixed modulation parameters throughout the 800-second simulation, β_k^1 = 3 and β_k^2 = −3, respectively. Neurons 3 and 4 had time-varying modulation parameters given by

\beta_k^3 = \begin{cases} 0 & 0 \le t_k \le 200 \\ 0.0125\,(t_k - 200) & 200 \le t_k \le 400 \\ 2.5 & 400 \le t_k \le 800 \end{cases}    (3.6)

\beta_k^4 = \begin{cases} 0 & 0 \le t_k \le 400 \\ -0.0125\,(t_k - 400) & 400 \le t_k \le 600 \\ -2.5 & 600 \le t_k \le 800 \end{cases}    (3.7)
Thus, initially, only two of the four neurons responded to the hand velocity; however, by the end of the simulation, all four neurons responded (see Figure 4C). The goal of this analysis was to decode the hand velocity while simultaneously tracking the evolution of the receptive field modulation parameters. To conduct this analysis, we augmented the state-space of the SSPPF to accommodate all the receptive field parameter estimates as well as the hand velocity signal (Ljung & Soderstrom, 1987; Haykin, 1996). That is, the five-dimensional augmented state vector is defined as θ_k = [v_k β_k^1 β_k^2 β_k^3 β_k^4]′. The state equation for this system is given by equation 2.2 with F_k = Diag(0.99, 1, 1, 1, 1) and Q_k = 10^{-5} · Diag(2.5, 1, 1, 1, 1). The derivation of the SSPPF in the appendix is sufficiently general to allow the analysis of this ensemble adaptive decoding problem. From equations A.7 and A.8, the derivatives of the intensity function with respect to θ required to define the SSPPF are
\left. \frac{\partial \log \lambda}{\partial \theta} \right|_{\theta_{k|k-1}} =
\begin{bmatrix}
\beta_{k|k-1}^1 & \beta_{k|k-1}^2 & \beta_{k|k-1}^3 & \beta_{k|k-1}^4 \\
v_{k|k-1} & 0 & 0 & 0 \\
0 & v_{k|k-1} & 0 & 0 \\
0 & 0 & v_{k|k-1} & 0 \\
0 & 0 & 0 & v_{k|k-1}
\end{bmatrix},    (3.8)
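The structure of equation 3.8 follows directly from log λ^j = μ + β^j v: differentiating with respect to v gives β^j, and differentiating with respect to β^i gives v when i = j and zero otherwise. A small sketch (function name ours) that assembles the matrix from the current state estimate:

```python
import numpy as np

def dlog_lambda_dtheta(theta):
    """Column j of the returned 5 x 4 matrix is the gradient of
    log lambda^j = mu + beta^j * v with respect to the augmented state
    theta = [v, beta^1, beta^2, beta^3, beta^4] (cf. equation 3.8)."""
    v, betas = theta[0], theta[1:]
    J = np.zeros((5, 4))
    J[0, :] = betas                  # d(log lambda^j)/dv = beta^j
    J[1:, :] = np.diag(np.full(4, v))  # d(log lambda^j)/d(beta^j) = v
    return J
```

In the filter, theta would be the one-step prediction θ_{k|k−1}, so the β's and v appearing in the matrix are the predicted values, as the subscripts in equation 3.8 indicate.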
[Figure 4 appears here: left column (A, C, E), modulation parameter (log(spikes/sec) · sec/cm) vs. time (sec); right column (B, D, F), hand velocity (cm/sec) vs. time (sec).]
Figure 4: Adaptive-decoding analysis of simulated ensemble neural spike train activity using the SSPPF algorithm. The true modulation parameter trajectory is gray in A, C, and E, and the SSPPF estimate is in black. In B, D, and F, the true hand velocity signal is gray and the SSPPF decoding estimate of the hand velocity signal is black. (A) Adaptive estimation of the modulation parameters in the two-neuron ensemble consisting of neurons 1 and 2, with true modulation parameter values of β_k^1 = 3 and β_k^2 = −3. (B) True hand velocity trajectory and decoded hand velocity trajectory estimated from the spiking activity of the two neurons in A. (C) Adaptive estimation of the modulation parameters in a four-neuron ensemble consisting of neurons 3 and 4, with true modulation parameter values of β_k^3 and β_k^4, defined in equations 3.6 and 3.7, respectively, included with neurons 1 and 2 from A. (D) True hand velocity trajectory and the decoded hand velocity trajectory estimated from the spiking activity of the four neurons in C. (E) Adaptive estimation of the modulation parameters of a four-neuron ensemble consisting of neurons 5 and 6, with modulation parameters β_k^5 = 2.5 and β_k^6 = −2.5, respectively, included with neurons 1 and 2 from A. (F) True hand velocity trajectory and the decoded hand velocity trajectory estimated from the spiking activity of the four neurons in E.
where ∂ log λ^j/∂θ |_{θ_{k|k−1}} is the jth column of equation 3.8, and ∂² log λ^j/∂θ ∂θ′ |_{θ_{k|k−1}} is the 5 × 5 matrix with a 1 in the first row and (j + 1)th column, a 1 in the first column and (j + 1)th row, and zeros elsewhere. Figure 4 shows the SSPPF tracking simultaneously the evolution of the modulation parameter of each of the four neurons (see Figure 4C) while reconstructing the hand velocity (see Figure 4D). Initially, there were two neurons with static nonzero modulation parameters, but after 200 seconds, an additional two neurons began spiking, and their output was used as well in the hand velocity reconstruction. The algorithm tracks well the trajectories of the neurons with static modulation parameters as well as the trajectories of the modulation parameters of the two new neurons. To analyze the performance of the simultaneous adaptive estimation (see Figure 4C) and the decoding (see Figure 4D) with the SSPPF algorithm, we compared these findings to its performance in adaptive decoding in the case where there are two neurons with static modulation parameters (see Figure 4A) and four neurons with static modulations (see Figure 4E). The two neurons in Figure 4A are neurons 1 and 2 in Figure 4C, whereas the four neurons in Figure 4E are neurons 1 and 2 in addition to neurons 5 and 6, whose modulation parameters are β_k^5 = 2.5 and β_k^6 = −2.5, respectively. The hand velocity reconstruction is better for the ensemble of four neurons with static nonzero modulation parameters (see Figure 4F) than for the two neurons (see Figure 4B). The MSE between the true and estimated velocity trajectories is 0.01 for the four neurons and 0.038 for the two neurons. By comparison, for the first 200 seconds, the quality of the decoding and the MSE of 0.041 for the ensemble of adapting neurons in Figure 4C are comparable to decoding with the two neurons with static modulation parameters.
However, after 200 seconds, as the second two neurons begin to spike and the SSPPF estimates their modulation parameters (see Figure 4C) and uses their output in the decoding, the quality of the decoding improves (see Figure 4D) and the MSE decreases. The MSE for reconstruction for the time between 600 and 800 seconds for the dynamic ensemble is 0.01, similar to the reconstruction error from four neurons with static modulation parameters. This example illustrates the feasibility of reconstructing a neural signal from an ensemble whose receptive fields are evolving.
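The augmented-state decoding loop of this section can be sketched for a reduced ensemble. This is our illustrative reimplementation, not the authors' code: it applies equations A.1, A.2, A.8, and A.7 to the augmented state at each time step, and the simulation in the usage below (two static neurons, constant true velocity) is deliberately simpler than the experiment in the text:

```python
import numpy as np

def adaptive_decode(spikes, dt, F, Q, theta0, W0, mu=0.0):
    """Augmented-state SSPPF: jointly estimate a velocity signal and the
    per-neuron modulation parameters.  State theta = [v, beta^1, ..., beta^C],
    intensities lambda^j = exp(mu + beta^j * v); spikes is a (T, C) array
    of bin counts.  Returns the (T, C + 1) trajectory of state estimates."""
    T, C = spikes.shape
    theta, W = theta0.astype(float), W0.astype(float)
    path = np.empty((T, C + 1))
    for k in range(T):
        theta = F @ theta                      # one-step prediction (A.1)
        W = F @ W @ F.T + Q                    # one-step variance (A.2)
        v, betas = theta[0], theta[1:]
        lam = np.exp(mu + betas * v)
        G = np.zeros((C + 1, C))               # columns: d(log lambda^j)/dtheta
        G[0, :] = betas
        G[1:, :] = np.diag(np.full(C, v))
        err = spikes[k] - lam * dt             # innovations dN^j - lambda^j dt
        Winv = np.linalg.inv(W)                # posterior variance (A.8)
        for j in range(C):
            Winv += np.outer(G[:, j], G[:, j]) * lam[j] * dt
            Hj = np.zeros((C + 1, C + 1))      # Hessian of log lambda^j has
            Hj[0, j + 1] = Hj[j + 1, 0] = 1.0  # only mixed v-beta^j terms
            Winv -= err[j] * Hj
        W = np.linalg.inv(Winv)
        theta = theta + W @ (G @ err)          # posterior mean (A.7)
        path[k] = theta
    return path
```

Setting the initial variances of the β components near zero, as in the usage below, effectively freezes them and reduces the loop to pure decoding; enlarging them recovers the simultaneous encoding-decoding behavior described in the text.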
4 Discussion

We have presented a paradigm for constructing adaptive filters to track the temporal evolution of neural receptive fields from spike train observations. Our formulation involves first specifying a probabilistic model of neural spiking and then combining it with a state evolution model using the Chapman-Kolmogorov equation in a Bayesian framework to obtain a recursive expression for the posterior distribution of a parameter vector to be tracked. We constructed a set of adaptive filtering algorithms using a gaussian approximation to this posterior distribution. These filters extended the SDPPF that we have already applied in actual neural data analyses (Brown et al., 2001; Frank et al., 2002; Wirth et al., 2003). We developed a stochastic state point process filter by taking a quadratic approximation to the logarithm of the posterior density in equation 2.6. We did this by expanding the log posterior at t_k in a Taylor series about an arbitrary point, θ*, and retaining only terms of order 2 or smaller. By choosing θ* to be the one-step prediction estimate, θ_{k|k−1}, we obtained our SSPPF, which is a point process analogue of the Kalman filter. Alternatively, if we had chosen θ* = θ_{k|k}, we would have obtained a nonlinear point process adaptive filter that gives the maximum a posteriori estimate as used in Brown et al. (1998). Our filter construction provided two additional adaptive estimation algorithms as special cases: point process analogues of the RLS and steepest-descent algorithms. Additional algorithms could be similarly constructed based on alternative probability distributions. One option is the smoothing distribution, p(θ_k | x_{1:K}, N_{1:K}), which can be used to estimate the parameter θ at time k if all the data from time steps 1 to K ≥ k have been observed. Both our SSPPF and EKF use gaussian approximations to the posterior density at each time point. The EKF computes its gaussian approximation by linearizing the state evolution and observation equations to yield a gaussian estimate of the state and observation noise models (Haykin, 1996). In the SSPPF, we used point process observation models to construct an exact posterior density update equation, whose solution is then approximated by a gaussian. The SSPPF maintains the point process description of the neural spiking and obviates using a gaussian approximation to a spike train model, an approximation that may be highly inaccurate in neural systems with low spiking rates.
In order to apply the SSPPF, it is first necessary to specify an observation model that describes the spiking probability distribution in terms of a biological signal, a set of receptive field model parameters, and the nature of the history dependence. The history dependence allows us to construct non-Poisson neural spike train models. In Frank et al. (2002), we showed how a history-dependent process could be used to obtain improved predictions of theta and burst-dependent firing in rodent hippocampus. In Barbieri, Quirk, Frank, Wilson, and Brown (2001), history dependence was incorporated into the model by using a change of variables to construct inhomogeneous gamma and inverse gaussian temporal firing models. These models can be incorporated into our current framework as well. Another essential component for these adaptive filters is the state model for the evolution of the place field parameters. In our examples, the exact state model used to simulate the data was not used in the estimation algorithm. In fact, knowledge of the exact state model for neural receptive fields is not possible at present. It is important, however, to select a model that encompasses the anticipated quality and degree of change to be tracked. In
our hippocampal example, we used a random walk model, which is useful in situations where the expected direction of change of each of the parameters is unknown, but the magnitude of change can be estimated. In some cases, previous research can suggest a more informative system model. For the example of the rat hippocampus, we could have applied our knowledge that the place field's center tends to move backward while its height tends to increase, in order to construct a set of F_k matrices that capture these trends. Additionally, if the form of the model can be specified but the model parameters are unknown, then the parameters can be estimated using an expectation-maximization algorithm (Smith & Brown, 2003; Roweis & Ghahramani, 2000). The advantages of using point process filters on neural data are highlighted in the first example, in which we compared several traditional and point process adaptive estimation algorithms tracking the evolution of simulated hippocampal place fields. The pass-by-pass method of estimating the receptive field by histogramming the spike train over a full pass computes only a single place field estimate for each pass through the environment. In contrast, our filters, based on a stochastic parameter evolution model, allow us to update our receptive field estimates instantaneously, whether or not a spike was observed. The model-based algorithms impose an implicit continuity constraint on the model place field dynamics, preventing the large fluctuations in the parameter estimates observed with the pass-by-pass method from one pass to the next. The SSPPF maintains adaptive estimates of both the parameter mean and variance, permitting the construction of confidence intervals about the receptive field estimates. The one-step variance estimate provides an implicit learning rate, which allows the filter to track trajectories that vary over multiple time courses.
Finally, as mentioned above, our filters are based on point process observation models of the spike train, so that data enter the algorithm as a point process. This eliminates the need to construct a separate rate estimation algorithm, which could lead to biased estimates and erroneous learning rates, as in the case of the EKF algorithm. The SSPPF was the only one of these estimation algorithms that had all of these advantages and was the only one to track place field parameters accurately in both the linear and jump evolution scenarios. The second example illustrates a novel application of our adaptive algorithm: the reconstruction of a biological signal from ensemble spiking activity while the receptive fields of the neurons in the ensemble are evolving. To do this, we couple adaptive estimation and decoding. Such adaptive decoding may be necessary for reconstructing signals in neural systems where the receptive fields are dynamic. For example, during a center-out reaching task, the preferred direction of MI neurons can change depending on the load against which the animal is reaching (Li et al., 2001). If the objective were to reconstruct hand trajectory under these conditions, we would need to be able simultaneously to track the receptive field evolution and
to use these estimates to track the intended movement. Another situation where this application becomes important is when the firing properties of a population of neurons are changing. This may occur with long-term electrode implants, where some cells stop responding or are lost to the recording device while the activity of other neurons becomes observable. Both of these problems are likely to arise with the eventual development of neural prostheses (Donoghue, 2002). In this example, the key to adaptive decoding is the augmented state-space model, which allows us to incorporate into one equation the temporal evolution of the hand trajectory and of each neuron's receptive field properties. If the number of neurons in the ensemble is large, it may be computationally more tractable to implement separate adaptive filters in lock step, one estimating the current state variable from previously estimated receptive field model parameters and the other updating the model parameters using the current state estimate. This approach treats the encoding and decoding components as separate problems, in much the same way as do two-step static decoders (Brown et al., 1998). While our goodness-of-fit tests suggest that the SSPPF algorithm is performing well, a detailed study of the accuracy of the gaussian approximation we use should be undertaken. This can be studied by computing the posterior density (see equation 2.6) and the one-step prediction density (see equation 2.7) using numerical integration or sequential Monte Carlo methods (Kitagawa & Gersch, 1996; Doucet, de Freitas, & Gordon, 2001). The SSPPF can be used to construct the proposal densities in the implementation of the sequential Monte Carlo algorithms. The difference between the Monte Carlo–derived posterior density and the gaussian estimate would provide a straightforward measure of the error in the approximation.
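For a scalar parameter, this comparison can be sketched by evaluating the unnormalized one-step posterior on a grid and measuring its distance from a moment-matched gaussian. This is our illustrative check (function name and grid choices are ours, not the paper's implementation); the equation number refers to the text:

```python
import numpy as np

def gaussian_approx_error(theta_pred, w_pred, dN, dt, lam):
    """Evaluate the exact one-step posterior (point process likelihood
    times gaussian prior, cf. equation 2.6) on a grid, then return a
    total-variation-style distance to its moment-matched gaussian."""
    grid = theta_pred + np.linspace(-6.0, 6.0, 4001) * np.sqrt(w_pred)
    dx = grid[1] - grid[0]
    lam_g = lam(grid)
    log_post = (dN * np.log(lam_g) - lam_g * dt
                - 0.5 * (grid - theta_pred) ** 2 / w_pred)
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * dx                       # normalize on the grid
    mean = (grid * post).sum() * dx
    var = ((grid - mean) ** 2 * post).sum() * dx
    gauss = np.exp(-0.5 * (grid - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return 0.5 * np.abs(post - gauss).sum() * dx  # ~ total variation distance
```

When the observation interval is short, the likelihood barely perturbs the gaussian prior and the returned distance is near zero; large ∆t or high counts expose any skew that the gaussian approximation cannot capture.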
In summary, the point process filtering paradigm we have developed should be useful for tracking the dynamics of neural receptive fields across a wide variety of conditions.

Appendix: Derivation of the Stochastic State Point Process Filter

For the purpose of this derivation, we examine the general situation of an ensemble of neurons firing simultaneously. For the most part, in the encoding problems we described, information from individual neurons was studied independently. However, in the second example and in most decoding or reconstruction problems, it will be necessary to combine information from an ensemble of spiking neurons. We use superscript notation to distinguish the properties of individual cells. For example, for an ensemble containing C cells, θ_k^i represents the parameter vector, λ^i(t_k | x_k, θ_k^i, N_{1:k−1}) represents the conditional intensity function, and ΔN_k^i is the number of spikes fired by the ith cell in the ensemble.
The definitions of the means and variances of the one-step prediction and posterior densities, along with the evolution model, are sufficient to compute the gaussian one-step prediction probability density, whose moments are given by

\theta_{k|k-1}^i = F_k^i \theta_{k-1|k-1}^i,    (A.1)

W_{k|k-1}^i = F_k^i W_{k-1|k-1}^i F_k^{i\prime} + Q_k^i.    (A.2)
We call these two equations the one-step prediction equation and one-step variance equation, respectively. We write the posterior probability density of the neural activity of the ensemble in the interval (t_{k−1}, t_k] by multiplying over all neurons the independent probabilities in equation 2.6, and we make a gaussian approximation:

p(\theta_k \mid \Delta N_k, H_k) \propto \prod_{j=1}^{C} \left( \lambda^j(t_k \mid \theta_k^j, H_k)\,\Delta t_k \right)^{\Delta N_k^j} \exp\!\Big( -\lambda^j(t_k \mid \theta_k^j, H_k)\,\Delta t_k - \tfrac{1}{2}(\theta_k^j - \theta_{k|k-1}^j)' (W_{k|k-1}^j)^{-1} (\theta_k^j - \theta_{k|k-1}^j) \Big)

\propto \prod_{i=1}^{C} \exp\!\Big( -\tfrac{1}{2}(\theta_k^i - \theta_{k|k}^i)' (W_{k|k}^i)^{-1} (\theta_k^i - \theta_{k|k}^i) \Big).    (A.3)

The first line in equation A.3 represents the posterior density as a function of the variable θ_k^i, and the last line is a gaussian approximation to this function with the parameters θ_{k|k}^i and W_{k|k}^i still to be determined. We could expand the first line of equation A.3 in a Taylor series about some point, drop all terms of third order or higher, and complete a square to obtain an expression of the form of the second line of equation A.3. Here instead we present an alternate derivation. Taking the log of both sides gives

-\frac{1}{2} \sum_{i=1}^{C} (\theta_k^i - \theta_{k|k}^i)' (W_{k|k}^i)^{-1} (\theta_k^i - \theta_{k|k}^i) = \sum_{j=1}^{C} \Big[ \Delta N_k^j \log \lambda^j - \lambda^j \Delta t_k - \tfrac{1}{2}(\theta_k^j - \theta_{k|k-1}^j)' (W_{k|k-1}^j)^{-1} (\theta_k^j - \theta_{k|k-1}^j) \Big] + R.    (A.4)

Here, R is a catch-all constant that contains terms relating to the state evolution statistics and the normalizing constants. We can now differentiate this expression with respect to θ_k^i in order to get an equation with only linear terms:

(W_{k|k}^i)^{-1} (\theta_k^i - \theta_{k|k}^i) = (W_{k|k-1}^i)^{-1} (\theta_k^i - \theta_{k|k-1}^i) - \sum_{j=1}^{C} \left( \frac{\partial \log \lambda^j}{\partial \theta_k^i} \right)' (\Delta N_k^j - \lambda^j \Delta t_k).    (A.5)
If the gaussian approximation is valid, then this relation should be approximately true for all values of θ_k^i. We can therefore choose any specific point at which to evaluate this expression. Evaluating at θ_k^i = θ_{k|k−1}^i gives

-(W_{k|k}^i)^{-1} (\theta_{k|k-1}^i - \theta_{k|k}^i) = \sum_{j=1}^{C} \left[ \left( \frac{\partial \log \lambda^j}{\partial \theta_k^i} \right)' (\Delta N_k^j - \lambda^j \Delta t_k) \right]_{\theta_{k|k-1}^i}.    (A.6)

Solving for θ_{k|k}^i then gives the posterior state equation,

\theta_{k|k}^i = \theta_{k|k-1}^i + W_{k|k}^i \sum_{j=1}^{C} \left[ \left( \frac{\partial \log \lambda^j}{\partial \theta_k^i} \right)' (\Delta N_k^j - \lambda^j \Delta t_k) \right]_{\theta_{k|k-1}^i}.    (A.7)
Differentiating equation A.5 again with respect to θ_k^i and evaluating at θ_k^i = θ_{k|k−1}^i gives the posterior variance equation,

(W_{k|k}^i)^{-1} = (W_{k|k-1}^i)^{-1} + \sum_{j=1}^{C} \left[ \left( \frac{\partial \log \lambda^j}{\partial \theta_k^i} \right)' [\lambda^j \Delta t_k] \left( \frac{\partial \log \lambda^j}{\partial \theta_k^i} \right) - \frac{\partial^2 \log \lambda^j}{\partial \theta_k^i \, \partial \theta_k^{i\prime}} (\Delta N_k^j - \lambda^j \Delta t_k) \right]_{\theta_{k|k-1}^i}.    (A.8)
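For a single neuron and a scalar parameter, the recursion defined by equations A.1, A.2, A.8, and A.7 reduces to the update below; a hedged sketch (function and argument names are ours), with the exponential intensity model in the usage chosen by us purely for illustration:

```python
def sspf_update(theta_prev, w_prev, f, q, dN, dt, lam, dlog, d2log):
    """One SSPPF step for a single neuron with a scalar parameter
    (equations A.1, A.2, A.8, A.7 with C = 1).  lam, dlog, d2log are the
    intensity and the first and second derivatives of its logarithm."""
    theta_pred = f * theta_prev                 # one-step prediction (A.1)
    w_pred = f * w_prev * f + q                 # one-step variance (A.2)
    l = lam(theta_pred)                         # intensity at the prediction
    w_inv = (1.0 / w_pred                       # posterior variance (A.8)
             + dlog(theta_pred) ** 2 * l * dt
             - d2log(theta_pred) * (dN - l * dt))
    w_post = 1.0 / w_inv
    theta_post = theta_pred + w_post * dlog(theta_pred) * (dN - l * dt)  # (A.7)
    return theta_post, w_post
```

With λ(θ) = exp(θ), the log intensity is linear in θ, so dlog ≡ 1 and d2log ≡ 0: observing a spike (ΔN = 1) in a short bin pulls the estimate upward, while the posterior variance shrinks by the expected-count term λ∆t.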
In the case where we observe the firing of only one cell at a time, the sums in equations A.7 and A.8 simplify to a single term, and the superscripts are no longer needed.

Acknowledgments

This research was supported in part by NIH grants MH59733, MH61637, and DA015644, and NSF grant IBN 0081548 to E.N.B. and NIH grant MH65108 to L.M.F. Part of this research was performed while E.N.B. was on sabbatical at the Laboratory for Information and Decision Systems in the Department of Electrical Engineering and Computer Science at MIT.

References

Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (2001). Construction and analysis of non-Poisson stimulus response models of neural spike train activity. J. Neurosci. Meth., 105, 25–37.

Brown, E. N., Barbieri, R., Eden, U. T., & Frank, L. M. (2003). Likelihood methods for neural data analysis. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach (pp. 253–286). London: CRC.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14, 325–346.

Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.

Brown, E. N., Nguyen, D. P., Frank, L. M., Wilson, M. A., & Solo, V. (2001). An analysis of neural receptive field plasticity by point process adaptive filtering. PNAS, 98, 12261–12266.

Daley, D., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag.

Donoghue, J. P. (1995). Plasticity of sensorimotor representations. Curr. Opin. Neurobiol., 5, 749–754.

Donoghue, J. P. (2002). Connecting cortex to machines: Recent advances in brain interfaces. Nat. Neurosci., 5, 1085–1088.

Doucet, A., de Freitas, N., & Gordon, N. (2001). Sequential Monte Carlo methods in practice. New York: Springer-Verlag.

Frank, L. M., Brown, E. N., & Wilson, M. A. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27, 169–178.

Frank, L. M., Eden, U. T., Solo, V., Wilson, M. A., & Brown, E. N. (2002). Contrasting patterns of receptive field plasticity in the hippocampus and the entorhinal cortex: An adaptive filtering approach. J. Neurosci., 22, 3817–3830.

Frank, L. M., Stanley, G. B., & Brown, E. N. (2003). New places and new representations: Place field plasticity in the hippocampus and cortex (Program No. 91.18). Washington, DC: Society for Neuroscience.

Gandolfo, F., Li, C. S. R., Benda, B. J., Schioppa, C. P., & Bizzi, E. (2000). Cortical correlates of learning in monkeys adapting to a new dynamical environment. Proc. Natl. Acad. Sci. USA, 97, 2259–2263.

Haykin, S. (1996). Adaptive filter theory. Englewood Cliffs, NJ: Prentice Hall.

Jog, M. S., Kubota, Y., Connolly, C. I., Hillegaart, V., & Graybiel, A. M. (1999). Building neural representations of habits. Science, 286, 1745–1749.

Kaas, J. H., Merzenich, M. M., & Killackey, H. P. (1983). The reorganization of somatosensory cortex following peripheral nerve damage in adult and developing mammals. Annu. Rev. Neurosci., 6, 325–356.

Kitagawa, G., & Gersh, W. (1996). Smoothness priors analysis of time series. New York: Springer-Verlag.

Li, C. S. R., Padoa-Schioppa, C., & Bizzi, E. (2001). Neuronal correlates of motor performance and motor learning in the primary motor cortex of monkeys adapting to an external force field. Neuron, 30, 593–607.

Ljung, L., & Soderstrom, T. (1987). Theory and practice of recursive identification. Cambridge, MA: MIT Press.

Mehta, M. R., Barnes, C. A., & McNaughton, B. L. (1997). Experience-dependent, asymmetric expansion of hippocampal place fields. Proc. Natl. Acad. Sci. USA, 94, 8918–8921.

Mehta, M. R., Quirk, M. C., & Wilson, M. A. (2000). Experience-dependent, asymmetric shape of hippocampal receptive fields. Neuron, 25, 707–715.
Merzenich, M. M., Nelson, R. J., Stryker, M. P., Cynader, M. S., Schoppmann, A., & Zook, J. M. (1984). Somatosensory cortical map changes following digit amputation in adult monkeys. J. Comp. Neurol., 224, 591–605.

Moran, D. W., & Schwartz, A. B. (1999). Motor cortical representation of speed and direction during reaching. J. Neurophysiol., 82, 2676–2692.

O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34, 171–175.

Pettet, N. M. (1992). Dynamic changes in receptive-field size in cat primary visual cortex. Proc. Natl. Acad. Sci. USA, 89, 8366–8370.

Roweis, S., & Ghahramani, Z. (2000). An EM algorithm for identification of nonlinear dynamical systems. Available on-line: http://www.gatsby.ucl.ac.uk/publications/papers/index.html.

Smith, A. C., & Brown, E. N. (2003). Estimating a state-space model from point process observations. Neural Computation, 15, 965–991.

Snyder, D., & Miller, M. (1991). Random point processes in time and space. New York: Springer-Verlag.

Solo, V., & Kong, X. (1995). Adaptive signal processing algorithms: Stability and performance. Upper Saddle River, NJ: Prentice Hall.

Weinberger, N. M. (1993). Learning-induced changes of auditory receptive fields. Curr. Opin. Neurobiol., 3, 570–577.

Wirth, S., Yanike, M., Frank, L. M., Smith, A. C., Brown, E. N., & Suzuki, W. A. (2003). Single neurons in the monkey hippocampus and learning of new associations. Science, 300(5625), 1578–1581.

Wu, W., Black, M. J., Gao, Y., Bienenstock, E., Serruya, M., Shaikhouni, A., & Donoghue, J. P. (2002). Neural decoding of cursor motion using a Kalman filter. Paper presented at the Neural Information Processing Systems Conference, Vancouver, BC.

Received April 7, 2003; accepted October 6, 2003.
LETTER
Communicated by Maxim Bazhenov
Odor-Driven Attractor Dynamics in the Antennal Lobe Allow for Simple and Rapid Olfactory Pattern Classification

Roberto Fdez. Galán
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, D-10115 Berlin, Germany
Silke Sachse
[email protected]
Institute for Neurobiology, Freie Universität Berlin, D-14195 Berlin, Germany, and Laboratory of Neurogenetics and Behavior, Rockefeller University, New York, New York 10021, U.S.A.
C. Giovanni Galizia
[email protected]
Institute for Neurobiology, Freie Universität Berlin, D-14195 Berlin, Germany, and Department of Entomology, University of California, Riverside, CA 92521, U.S.A.
Andreas V. M. Herz
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, D-10115 Berlin, Germany
The antennal lobe plays a central role in odor processing in insects, as demonstrated by electrophysiological and imaging experiments. Here we analyze the detailed temporal evolution of glomerular activity patterns in the antennal lobe of honeybees. We represent these spatiotemporal patterns as trajectories in a multidimensional space, where each dimension accounts for the activity of one glomerulus. Our data show that the trajectories reach odor-specific steady states (attractors) that correspond to stable activity patterns at about 1 second after stimulus onset. As revealed by a detailed mathematical investigation, the trajectories are characterized by different phases: response onset, steady-state plateau, response offset, and periods of spontaneous activity. An analysis based on support-vector machines quantifies the odor specificity of the attractors and the optimal time needed for odor discrimination. The results support the hypothesis of a spatial olfactory code in the antennal lobe and suggest a perceptron-like readout mechanism that is biologically implemented in a downstream network, such as the mushroom body.
Neural Computation 16, 999–1012 (2004)
© 2004 Massachusetts Institute of Technology
1 Introduction

The antennal lobe (AL) is the primary brain structure of insects involved in odor coding (Galizia & Menzel, 2001) and short-term memory (Menzel, 1999). This neural structure is functionally homologous to the olfactory bulb in mammals, fish, and amphibia. The neural activity in the AL is highly structured in both space and time, which makes this system especially interesting for studies of neural dynamics and coding at the systems level.

The spatial order arises from the precise arrangement of neural clusters called glomeruli. Honeybees possess around 160 glomeruli (Flanagan & Mercer, 1989), and the three to five projection neurons (PN) of each glomerulus relay olfactory information to higher processing areas: the mushroom bodies and the lateral protocerebrum. The temporal response of the PNs shows different features on various timescales: spike activity of single neurons with timing precision of a few milliseconds, membrane-potential oscillations of around 20 Hz, and slow modulations of the membrane potential at around 2 Hz. Membrane oscillations reflect network oscillations that emerge from the transient synchronization of PNs during odor presentation (Wehr, 1999). The slow membrane response is odor specific and matches the overall PN firing rate (Wehr, 1999; Laurent et al., 2001).

The choice of a timescale to analyze the neural dynamics in the antennal lobe has profound effects on the interpretation of the olfactory code. Two main hypotheses have been proposed to explain how odors are encoded. According to the first hypothesis, odors are encoded through a stimulus-induced transient synchronization of PN assemblies during stimulation (Laurent, 1996). The second hypothesis emphasizes the role of spatial patterns of neural activity for odor coding (Hildebrand & Shepherd, 1997; Korsching, 2002), as already suggested by the modular structure of the AL.
This hypothesis is supported by calcium-imaging studies showing that a given set of glomeruli is specifically activated by a given odor (Galizia & Menzel, 2001). The activity of the PNs in these glomeruli, however, is not static but modulated by inhibitory mechanisms (Sachse & Galizia, 2002).

In this work, we focus on the detailed temporal development of glomerular activity patterns in the antennal lobe recorded with calcium imaging. A novel experimental technique to stain exclusively projection neurons (Sachse & Galizia, 2002) permits us to specifically study the evolution of those odor-evoked responses that are transmitted to higher processing areas. We represent the AL dynamics in a multidimensional space in which each dimension corresponds to the activity of one glomerulus. Our analysis reveals that during stimulation, different odors trigger different odor-specific trajectories that separate rapidly after stimulus onset. For odor stimuli with a steplike time course, the trajectories converge to stable spatial activity patterns within odor-specific regions after about 1 second and remain unchanged, up to small fluctuations, until the end of stimulation. Using methods from dynamical systems theory and support-vector machines, we quantify the fine structure of the trajectories and their discriminability. Our findings support the hypothesis that odors are encoded through odor-specific attractors that may easily be read out by a downstream network.

2 Methods

2.1 Animal Preparation and Data Recording. The preparation and calcium imaging of PNs were performed as described in Sachse and Galizia (2002). Briefly, adult worker honeybees were caught and fixed in a plexiglas stage. Projection neurons were backfilled from the protocerebrum with the calcium-sensitive dye fura-dextran (potassium salt, 3000 MW, Molecular Probes, Eugene, OR, USA). Imaging was done using a T.I.L.L. Photonics imaging system (Gräfelfing, Germany). For each measurement, a series of 60 double frames was taken at a sampling frequency of 6 Hz. The interstimulus interval was 40 s. Odors with a duration of 2 s were delivered to the antennae using a custom-made, computer-controlled olfactometer. For each of the tested odors (isoamylacetate, 1-hexanol, 1-octanol, and 1-nonanol; Sigma-Aldrich, Deisenhofen, Germany), 6 µl of the odorant dissolved in mineral oil were applied to a filter paper (1 cm²) in a plastic syringe. Neural responses were calculated as absolute changes of the fluorescence ratio between 340 nm and 380 nm excitation light. Signals were attributed to identified glomeruli by reconstructing the glomerular structure in the fura ratio images. Glomeruli were identified on the basis of their morphological borderlines using a digital atlas of the AL (Galizia, McIlwrath, & Menzel, 1999). In total, data from seven bees were analyzed.

2.2 Data Analysis. At each point in time t, the activity pattern of the AL network is represented by a point in a multidimensional space in which each dimension corresponds to the calcium signal from one glomerulus. Depending on the bee, between 18 and 23 glomeruli could be identified. For simplicity, we will refer to this glomerular subspace as the AL space. During an odor presentation, the AL activity draws an open curve in this multidimensional space.
We first study the system's evolution along these trajectories. Since the data were sampled at a frequency of 6 Hz, the temporal resolution is $\Delta t = 0.167$ s. We denote by $\vec{r}_A(t)$ the trajectory triggered by odor A. To quantify the rate of activity changes, we use the velocity $v_A(t)$ of the activity state along the trajectory,

$$ v_A(t) = \frac{|\Delta \vec{r}(t)|}{\Delta t} = \frac{|\vec{r}_A(t + \Delta t) - \vec{r}_A(t)|}{\Delta t}. $$

As a measure of the force causing changes of the neural activity patterns, we introduce the magnitude $a_A(t)$ of the acceleration vector,

$$ a_A(t) = \frac{|\vec{v}_A(t + \Delta t) - \vec{v}_A(t)|}{\Delta t}. $$
The acceleration vector can be decomposed into a tangential component $a^t_A(t)$ and a normal component $a^n_A(t)$ relative to the trajectory. These two components are given by

$$ a^t_A(t) = \frac{|\vec{v}_A(t + \Delta t)| - |\vec{v}_A(t)|}{\Delta t} $$

and

$$ a^n_A(t) = \sqrt{a_A(t)^2 - a^t_A(t)^2}. $$
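The finite-difference kinematics above can be sketched in a few lines of NumPy. This is a sketch under our own naming conventions; the authors' analysis code is not given in the paper.

```python
import numpy as np

def trajectory_kinematics(r, dt):
    """Velocity, acceleration magnitude, and the tangential/normal
    decomposition for a sampled trajectory r (shape: time x glomeruli),
    following the finite-difference definitions of section 2.2."""
    dr = np.diff(r, axis=0)                        # r(t+dt) - r(t)
    v_vec = dr / dt                                # velocity vectors
    speed = np.linalg.norm(v_vec, axis=1)          # v_A(t)
    a_vec = np.diff(v_vec, axis=0) / dt            # acceleration vectors
    a = np.linalg.norm(a_vec, axis=1)              # a_A(t)
    a_t = np.diff(speed) / dt                      # tangential component
    a_n = np.sqrt(np.maximum(a**2 - a_t**2, 0.0))  # normal component
    return speed, a, a_t, a_n
```

A trajectory moving along a straight line at constant speed, for example, yields constant velocity and vanishing tangential and normal acceleration.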
The first component provides information about the amount of force parallel to the current trajectory, preserving the movement direction but changing the movement velocity. The second component measures the amount of force that leads to changes in the direction of the trajectory.

In addition to their detailed dynamical properties, we also study the reproducibility and specificity of the odor-driven trajectories. To this end we use support-vector machines (SVMs). An SVM (Vapnik, 1998; Burges, 1998; Boser, Guyon, & Vapnik, 1992) is an algorithm that calculates a hyperplane separating a multidimensional space into two regions: the first region contains all points of a specified data set within the given data, and the other contains the remaining points. Let us call these regions I and II, respectively. A hyperplane is completely determined by a normalized orthogonal vector $\vec{w}$ and the distance $|b|$ to the origin. In fact, every point $\vec{x}_{hp}$ on the hyperplane satisfies

$$ \vec{w} \cdot \vec{x}_{hp} = b. $$

The optimal separating hyperplane between the two data sets is calculated by maximizing the margin with respect to the points of both regions. This mathematical problem can be turned into the minimization of a quadratic form in the positive quadrant. A separating hyperplane exists only if the two data sets do not overlap. However, the minimization problem can be reformulated to tolerate some overlap, and the calculation of an optimal separating hyperplane is then also possible.

According to SVM theory (Vapnik, 1998; Burges, 1998; Boser et al., 1992), the vector $\vec{w}$ that characterizes the separating hyperplane can always be expressed as a linear combination of the points that lie on the margin, the so-called support vectors. Once $\vec{w}$ and $b$ of the optimal hyperplane are known, it is straightforward to cast new data $\vec{x}$ into I or II according to the following criterion:

$$ \vec{x} \text{ belongs to I if } \vec{w} \cdot \vec{x} > b; \text{ otherwise } \vec{x} \text{ belongs to II.} \tag{2.1} $$
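To make the construction concrete, here is a hedged sketch that fits a soft-margin linear SVM by stochastic subgradient descent (a Pegasos-style solver, used here as a stand-in for whichever quadratic-programming solver the authors used) and returns the hyperplane $(\vec{w}, b)$ of criterion 2.1. All names and parameters are illustrative.

```python
import numpy as np

def linear_svm(X, y, lam=0.01, epochs=300, seed=0):
    """Soft-margin linear SVM via Pegasos-style stochastic subgradient
    descent on the hinge loss with L2 regularization on w.
    X: (n, d) points; y: (n,) labels in {-1, +1}.
    Returns (w, b) such that points with w . x > b are cast into class +1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decreasing step size
            if y[i] * (X[i] @ w - b) < 1:      # margin violation
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b -= eta * y[i]
            else:                              # only shrink w (regularizer)
                w = (1 - eta * lam) * w
    return w, b
```

On well-separated data, the returned hyperplane classifies every training point correctly via the sign of $\vec{w} \cdot \vec{x} - b$.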
The SVM method can be extended to separating manifolds other than hyperplanes by applying appropriate coordinate transformations (expressed as kernel functions), equivalent to a change in the metric of the space under consideration, in our case the AL space. The scalar product, whose associated separating manifold is the hyperplane, was the kernel that provided the best separation of odors among those tested (scalar product, second- and third-order polynomial, gaussian, and sigmoidal).

To quantify the ability of the SVM to classify an odor correctly, we use a measure called classification performance. This measure is defined as the fraction of points cast into the correct region, expressed as a percentage. It yields 100% when the data set of a given odor does not overlap with the rest and less when it does. When two data sets overlap, the optimal separating hyperplane lies at the intersection, and its orientation is very sensitive to the particular distribution of points there. In this case, it is interesting to quantify the stability of the plane through the generalization performance, defined as the relative number of points that can be singly removed without affecting the classification performance. A lower-bound estimator of the generalization performance is obviously the fraction of points that do not belong to the intersection (unambiguous patterns). The exact value of the generalization performance is calculated based on its definition, which represents a bootstrap or leave-one-out method (Efron & Tibshirani, 1993) commonly used to test SVMs (Boser et al., 1992). Let A be the set of n points we want to separate from the rest. Let us now remove one point P of the set A and compute the separating hyperplane. We then check whether P falls into the correct region, where the other n − 1 points of A lie. We repeat this procedure for all n points of A. The fraction of points successfully classified is expressed as a percentage and measures the generalization performance of a hyperplane.
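The leave-one-out bookkeeping just described can be sketched as follows. For self-containment we fit each hyperplane by ordinary least squares rather than with an SVM solver (an assumption of ours, swapped in to keep the example dependency-free; the leave-one-out logic is identical).

```python
import numpy as np

def loo_generalization(X, y):
    """Leave-one-out estimate of the generalization performance:
    remove each point in turn, refit the separating hyperplane on the
    rest, and check whether the held-out point is classified correctly.
    X: (n, d) activity patterns; y: (n,) labels in {-1, +1}.
    Returns the fraction of successes as a percentage, as in the text."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        # Affine fit: append a constant column so the bias b is learned too.
        Xi = np.hstack([X[mask], np.ones((n - 1, 1))])
        w = np.linalg.lstsq(Xi, y[mask], rcond=None)[0]
        pred = np.sign(np.append(X[i], 1.0) @ w)
        correct += int(pred == y[i])
    return 100.0 * correct / n
```

For two tight, well-separated clusters, the leave-one-out estimate is 100%, matching the intuition that removing any single point leaves the hyperplane essentially unchanged.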
The average of the generalization performance over all odors (all separating hyperplanes) quantifies the reliability of the partition of the AL space into odor-specific regions. We call this measure the separability of odors.

Separating any set of points from the rest of a given ensemble becomes a trivial task in a space of many dimensions. As the number of dimensions increases, it becomes easier to separate the data points. Equivalently, any subset of a small ensemble of points can typically be separated from the rest, even in a low-dimensional space. Therefore, before carrying out the SVM analyses (classification performance, generalization performance, and separability), we pool the data of the seven bees investigated. By doing this, we study the robustness of the olfactory code across trials and also across individuals. In total, 70 points were available, corresponding to four different odors. Up to eight common glomeruli could be identified in all bees. Thus, the AL space used for the SVM analysis has eight dimensions.

3 Results

We represent the activity of the AL network by considering each glomerulus as a component of a vector in a multidimensional space (see section 2.2). At
Figure 1: Odor-specific trajectories of the antennal lobe activity. Several odors were presented twice to the same bee, and data were recorded at fixed time intervals (167 ms). Accordingly, the distance between successive data points represents the speed of activity changes. The trajectories depart rapidly from the origin and slow down when they approach odor-specific regions. To generate this figure, the original 21-dimensional AL space has been projected onto the first three principal components (PC). These three components account for more than 58% of the whole variance of the trajectories.
each instant, the state of the network thus corresponds to one point in the AL space; the temporal evolution of the network is represented by a trajectory. Figure 1 shows such trajectories, caused by several odors presented to a single bee. These data demonstrate that repeating the same stimulus yields similar trajectories. The trajectories do not evolve at a constant speed but start to slow down after approximately 0.5 s and reach odor-specific stable activity patterns after about 1 s. These patterns can be interpreted as stimulus-induced attractors of the network dynamics. After stimulation, the network returns slowly to the resting state (see Figure 2). This relaxation follows a return path that is different from the trajectory during stimulation. Although trajectories corresponding to the same odor are similar, they exhibit some variability over several repetitions and also across individuals (data not shown).
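For illustration, a projection of pooled trajectory points onto the first principal components, as used to generate Figure 1, can be sketched with plain NumPy. The authors' exact preprocessing is not specified, so this is only a sketch.

```python
import numpy as np

def project_onto_pcs(trajectories, n_pcs=3):
    """Project pooled trajectory points (shape: time x glomeruli) onto
    their first principal components.  Returns the projected points and
    the fraction of variance explained by each retained component."""
    X = trajectories - trajectories.mean(axis=0)      # center the data
    # Principal axes are the right singular vectors of the centered matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s**2) / np.sum(s**2)                 # variance fractions
    return X @ Vt[:n_pcs].T, explained[:n_pcs]
```

A low-dimensional projection like this preserves the separation between odor-specific regions as long as the retained components carry a substantial fraction of the total variance (more than 58% for the first three in Figure 1).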
Figure 2: Trajectories of the antennal lobe activity during post-stimulus relaxation in one bee. The solid lines represent the trajectories during stimulation, as in Figure 1. The dashed lines represent the evolution of the trajectories 5 seconds after stimulation. The trajectories during stimulation and relaxation do not coincide; the latter are more irregular.
The observations based on Figures 1 and 2 can be quantified by computing the velocity and the acceleration (see section 2) of the trajectories (see Figure 3). The velocity increases rapidly in the first 400 ms of stimulation, reaches a maximum, and then decreases. After approximately 1 s, the trajectories do not evolve further, and only small fluctuations remain, which are of the same order as those during ongoing activity (see Figure 3A). When the stimulation ends, there is a second peak in the velocity, which is smaller than the first one. This finding demonstrates that the relaxation to the resting state is slower than the excitation during stimulation. The decomposition of the acceleration vector into a tangential and a normal component relative to the trajectory provides further insight into the origin of the fluctuations (see Figures 3B and 3C). Only during odor presentation is the tangential component substantially different from zero. At the beginning, it is positive, indicating an increase of velocity. After approximately 400 ms, it becomes negative, which means that the velocity decreases. During spontaneous activity, the tangential acceleration fluctuates
around zero. In contrast, the normal component of the acceleration vector, responsible for changes in the direction of the trajectory, has a certain bias during spontaneous activity and increases during stimulation. However, the larger contribution to the total acceleration during stimulation comes from the tangential component. Thus, in the AL space, odors act as deterministic forces that pull and smoothly bend trajectories of neural activity, whereas the background activity acts as a stochastic force that primarily causes random changes in the direction of the trajectories.

How odor-specific are the stable activity patterns to which the trajectories converge? The separability measure (see section 2) allows one to answer this question, as it quantifies the stability of the hyperplanes that divide the AL space into odor-specific regions. The separability at each point in time (see Figure 4, dotted line) shows that the trajectories become more and
more odor specific during stimulation. The separability reaches a maximum approximately 1.5 s after stimulus onset. At this point in time, an optimal separating hyperplane for each odor is calculated. The classification performance (see section 2) for our data is 100% for octanol and nonanol, 93% for isoamylacetate, and 79% for hexanol. These values are substantially larger than chance level (50%) and contrast with the fact that all the odors used evoke similar, overlapping patterns. The result supports the hypothesis that odors are encoded in the AL as stable spatial patterns of neural activity.

4 Discussion

Based on a multidimensional representation of the calcium dynamics, our analysis shows that trajectories associated with different odors rapidly evolve toward different regions in AL space. As a result of the network dynamics, the trajectories converge to odor-specific attractors. This finding is incompatible with the "winnerless competition" network proposed as a model for the AL (Rabinovich et al., 2001; Laurent et al., 2001), since such a network encodes stimuli in complex trajectories on heteroclinic orbits. However, our results are in agreement with the findings on the dynamic representation of odors in the olfactory bulb of the zebrafish (Friedrich & Laurent, 2001), where the activity patterns also become increasingly odor specific during stimulation.
Figure 3: Facing page. Kinematics of the antennal lobe activity. The symbols represent the mean values of the velocity (A) and acceleration (B, C) of activity changes at each point in time for a given odor. The continuous line denotes the mean computed from all odors. The black bar in A indicates the duration of stimulus presentation. The whole data set, pooled over trials with seven bees, was taken into account for the calculation. (A) Before stimulation, the velocity has a roughly constant value, which deviates from zero due to spontaneous activity. At the beginning of the stimulation, the sudden change in velocity indicates a transition leading to a stimulus-dependent state, where activity changes are mainly driven by noise, as for the spontaneous activity, but with slightly higher values. When the stimulation stops, another peak appears, though not as pronounced as the first one. This indicates that relaxation to the resting state is slower. (B, C) The acceleration can be decomposed into a tangential and a normal component. During the stimulation period, the tangential component is larger than the normal component. This indicates that a given odor drives the network into a new stable, odor-specific attractor state along an open, smooth curve. The negative peak of the tangential component corresponds to the arrival at the attractor. The normal component has a certain bias in the absence of stimulation. This means that the effect of the background activity on the AL dynamics consists of random changes in the direction of the trajectory.
Figure 4: Classification performance and separability as a function of time. For the calculations, all data collected across trials with seven bees were taken into account. During stimulation (black bar), the partition of the antennal lobe space into odor-specific regions becomes possible, as the increase of the separability (dotted line) shows. At the time of highest separability, one SVM is trained for each odor. Then, the SVMs are tested over time. The classification performance for a given odor (solid lines) is the fraction of points that are recognized as that odor by the respective SVM. Note that except for hexanol, the odors can be discriminated very well even before the trajectories reach the attractors.
Our experimental technique permits us to accurately resolve the large-scale activity patterns in the antennal lobe. These dynamical patterns reliably reproduce the odor-specific firing rates of the PNs involved (Wehr, 1999). However, we cannot resolve fast events, such as the 20 Hz network oscillation or transient odor-specific synchronizations between the spikes of PNs (Laurent & Davidowitz, 1994; Wehr & Laurent, 1996; Laurent, 1996; Laurent, Wehr, & Davidowitz, 1996). This implies that we cannot answer the question whether, ultimately, the olfactory code is based on sequences of oscillating PNs or on the attractors of the slow temporal patterns reported here. As part of the ongoing controversy between spike-timing and firing-rate codes, this question remains open for further investigation.

Recent work by Lei, Christensen, and Hildebrand (2002) on the moth's antennal lobe helps us understand the relation between the phenomena observed with electrophysiological and imaging techniques. These authors have studied with intracellular recordings the activity of PNs enclosed in
glomeruli that show an odor-specific response in imaging experiments. They found that PNs within the same glomerulus fire coherently during odor presentation and that the degree of synchrony is modulated by lateral inhibition between active neighboring glomeruli. Hence, the modulation of the odor-induced synchrony is translated into variations of the calcium concentration within the glomerulus, as observed with imaging techniques.

According to our results, the regions of stability, or correspondingly the associated stable calcium patterns, are a reliable mapping of the input onto the AL, and they potentially provide a reliable output to the next neural structures: the mushroom body and the lateral protocerebrum. Within this framework, neurons of a network downstream of the AL could carry out the same task as we have done in order to interpret the AL code: explore the AL space and "look" at which region the evoked response patterns converge to. Support-vector machines provide a good starting point for realizing such a readout process. Criterion 2.1 can be mapped onto a perceptron network (see Figure 5; Rumelhart & McClelland, 1986), where some (or all) units $x_n$ of a bottom layer (the antennal lobe) synapse with strength $w_{An}$ onto a given unit in an upper layer (e.g., a Kenyon cell of the mushroom body). If the activation threshold of this unit is set to $b$, it fires only in response to odor A:

$$ \text{if } \sum_{n}^{N} x_n w_{An} = \vec{x} \cdot \vec{w}_A > b, \text{ then } \vec{x} \text{ belongs to A; otherwise } \vec{x} \text{ does not belong to A.} $$
Figure 5: Sketch of the readout mechanism for the network downstream of the AL. The algorithm used to analyze the imaging data can be mapped onto a perceptron-like neural network whose architecture is compatible with the anatomy of the bee brain. The units in the lower layer represent individual glomeruli in the antennal lobe. The unit in the upper layer represents a neuron of the mushroom body (Kenyon cell). This unit responds to a given odor A only if the whole activity of the lower units weighted by their synaptic strengths, $\sum_{n}^{N} x_n w_{An}$, exceeds a threshold $b$.
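A perceptron-like readout layer of this kind can be sketched as follows. The random initialization illustrates the point that upper units with random weight vectors each "look" at a random direction of the AL space; all names, shapes, and the threshold value here are illustrative, not from the paper.

```python
import numpy as np

def random_readout_layer(n_units, n_glomeruli, b, seed=0):
    """Perceptron-like readout (cf. Figure 5): each upper unit fires iff
    the weighted glomerular activity sum_n x_n * w_n exceeds threshold b.
    Weights are random unit vectors, one random AL-space direction per unit."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_units, n_glomeruli))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # normalize each direction

    def respond(x):
        # Boolean firing pattern of the upper layer for AL activity x.
        return W @ x > b

    return respond
```

With enough upper units, at least one unit will have its random direction aligned closely enough with any given odor-specific attractor to fire selectively for it, which is the no-learning argument made in the text.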
Within the theory of articial neural networks (see, e.g., Hertz, Krogh, & Palmer, 1991) it has been shown how synaptic weights can adapt through E A , which together with the threshold b Hebbian learning to the vector w determines the optimal separating plane in the AL space for odor A. Notice, however, that no learning process is necessary to identify odors with such a network. Suppose that the threshold b is set constant for all units in the upper layer. By connecting each upper unit with random synaptic weights E to the units of the bottom layer, each upper unit will look at a random w direction of the AL space. If the number of units in the upper layer is large enough, sufciently many directions of the AL space will be covered, and, hence, for any given odor-specic attractor, there will always be at least one upper unit that reacts to it. This perceptron-based readout mechanism is consistent with the anatomy of the olfactory system (Strausfeld, 1976). The very high number of Kenyon cells in the mushroom body suggests that there might even be multiple upper units reacting to one odor. The resulting population code would lead to higher signal-to-noise ratios and thus be benecial for further downstream information processing. Although the trajectories need about 1 second to reach the attractors, when they can be optimally separated, odors are already well discriminated after 300 ms. This result leads to a testable prediction: If it could be shown that bees can differentiate odors in behavioral tests with minimal reaction times that are signicantly shorter than 1 second (the time to reach the steady plateau state), this would provide direct evidence for the coding scheme proposed here. In addition, it would be interesting to investigate whether and to what extent the behavioral discriminability matches that obtained from the neurobiological model. Acknowledgments This work has been supported by a NaFoG ¨ grant (R.F.G.). References Boser, B. 
E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). New York: ACM Press. Available on-line: http://citeseer.nj.nec.com/boser92training.html.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167. Available on-line: http://citeseer.nj.nec.com/burges98tutorial.html.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. London: Chapman and Hall.
Simple and Rapid Olfactory Pattern Classification
Flanagan, D., & Mercer, A. R. (1989). An atlas and 3-dimensional reconstruction of the antennal lobes in the worker honeybee. International Journal of Insect Morphology and Embryology, 18, 145–159.
Friedrich, R. W., & Laurent, G. (2001). Dynamic optimization of odor representations by slow temporal patterning of mitral cell activity. Science, 291, 889–894.
Galizia, C. G., McIlwrath, S. L., & Menzel, R. (1999). A digital three-dimensional atlas of the honeybee antennal lobe glomeruli based on optical sections acquired using confocal microscopy. Cell and Tissue Research, 295, 383–394.
Galizia, C. G., & Menzel, R. (2001). The role of glomeruli in the neural representation of odours: Results from optical recording studies. Journal of Insect Physiology, 47, 115–130.
Hertz, J., Krogh, A., & Palmer, R. G. (2001). Introduction to the theory of neural computation. New York: Perseus Books.
Hildebrand, J. G., & Shepherd, G. M. (1997). Mechanisms of olfactory discrimination: Converging evidence for common principles across phyla. Annu. Rev. Neurosci., 20, 595–631.
Korsching, S. (2002). Olfactory maps and odor images. Curr. Opin. Neurobiol., 12, 387–392.
Laurent, G. (1996). Dynamical representation of odors by oscillating and evolving neural assemblies. Trends Neurosci., 19, 489–496.
Laurent, G., & Davidowitz, H. (1994). Encoding of olfactory information with oscillating neural assemblies. Science, 265, 1872–1875.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. (2001). Odor encoding as an active, dynamical process: Experiments, computation and theory. Annu. Rev. Neurosci., 24, 263–297.
Laurent, G., Wehr, M., & Davidowitz, H. (1996). Temporal representations of odors in an olfactory network. J. Neurosci., 16, 3837–3847.
Lei, H., Christensen, T. A., & Hildebrand, J. G. (2002). Local inhibition modulates odor-evoked synchronization of glomerulus-specific output neurons. Nature Neuroscience, 5, 557–565.
Menzel, R. (1999).
Memory dynamics in the honeybee. J. Comp. Physiol. A, 185, 323–340.
Rabinovich, M., Volkovskii, A., Lecanda, P., Huerta, R., Abarbanel, H., & Laurent, G. (2001). Dynamical encoding by networks of competing neuron groups: Winnerless competition. Phys. Rev. Lett., 87, 68102.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press.
Sachse, S., & Galizia, C. G. (2002). Role of inhibition for temporal and spatial odor representation in olfactory output neurons: A calcium imaging study. Journal of Neurophysiology, 87, 1106–1117.
Strausfeld, N. (1976). Atlas of an insect brain. Berlin: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wehr, M. (1999). Oscillatory sequences of firing in the locust olfactory system: Mechanisms and functional significance. Unpublished doctoral dissertation, California Institute of Technology. Available on-line: http://www.cns.caltech.edu/~mike/thesis.html.
Wehr, M., & Laurent, G. (1996). Odor encoding by temporal sequences of firing in oscillating neural assemblies. Nature, 384, 162–166.

Received April 24, 2003; accepted October 6, 2003.
LETTER
Communicated by Steven Zucker
Neural Mechanisms for the Robust Representation of Junctions Thorsten Hansen [email protected] Giessen University, Department of Psychology, D-35394 Giessen, Germany
Heiko Neumann [email protected] Ulm University, Department of Neural Information Processing, D-89069 Ulm, Germany
Junctions provide important cues in various perceptual tasks, such as the determination of occlusion relationships for figure-ground separation, transparency perception, and object recognition, among others. In computer vision, junctions are used in a number of tasks, like point matching for image tracking or correspondence analysis. We propose a biologically motivated approach to junction representation in which junctions are implicitly characterized by high activity for multiple orientations within a cortical hypercolumn. A local measure of circular variance is suggested to extract junction points from this distributed representation. Initial orientation measurements are often fragmented and noisy. A coherent contour representation can be generated by a model of V1 utilizing mechanisms of collinear long-range integration and recurrent interaction. In the model, local oriented contrast estimates that are consistent within a more global context are enhanced while inconsistent activities are suppressed. In a series of computational experiments, we compare junction detection based on the new recurrent model with a feedforward model of complex cells. We show that localization accuracy and positive correctness in the detection of generic junction configurations such as L- and T-junctions is improved by the recurrent long-range interaction. Further, receiver operating characteristics analysis is used to evaluate the detection performance on both synthetic and camera images, showing the superior performance of the new approach. Overall, we propose that nonlocal interactions implemented by known mechanisms within V1 play an important role in detecting higher-order features such as corners and junctions.

Neural Computation 16, 1013–1037 (2004). © 2004 Massachusetts Institute of Technology

1 Introduction and Motivation

Corners and junctions are points in the image where two or more edges join or intersect. Whereas edges lead to variations of the image intensity along a
single direction, corners and junctions are characterized by variations in at least two directions. In other words, edges are intrinsically one-dimensional signals, whereas corners and junctions are intrinsically two-dimensional signals. Compared to regions of homogeneous intensity, edges are rare events. Likewise, compared to edges, corners and junctions are rare events of high information content. Moreover, corners and junctions are invariant under different viewing angles and viewing distances. Both the sparseness of the signal and the invariance under affine transformations and scale variations establish corners and junctions as important image features. Points of intrinsically two-dimensional signal variations such as corners and junctions have also been termed keypoints (Heitger, Rosenthaler, von der Heydt, Peterhans, & Kübler, 1992; Michaelis, 1997) or interest points (Schmid, Mohr, & Bauckhage, 2000).

Corners and junctions are useful for various higher-level vision tasks such as the determination of occlusion relationships, matching of stereo images, object recognition, and scene analysis. The importance of corner and junction points for human object recognition has been demonstrated in a number of psychophysical experiments (Attneave, 1954; Biederman, 1985, 1987). Biederman showed that object perception of line drawings is severely impaired when corners (i.e., contours of high curvature) are removed, but largely preserved when contours of low curvature are deleted. Junctions also seem to play an important role in the perception of brightness and transparency (Adelson, 1993, 2000; Metelli, 1974; Todorović, 1997). Recently, Rubin (2001) proposed that local occlusion cues as signaled by junctions are necessary to trigger modal and amodal surface completion.
Rubin showed that other cues, such as surface relatability and surface similarity, did not lead to the perception of illusory contours or amodal completion when junction cues are removed from otherwise unchanged stimuli. In physiological studies in monkey visual cortex, cells have been reported that selectively respond to corners and line ends (Hubel & Wiesel, 1968). Das and Gilbert (1999) showed that correlated activities of V1 cells can signal the presence of smooth outline patterns as well as patterns of orientation discontinuity as occurring at corners and junctions. Cells preferentially responding to curves and angles have been found in area V4 (Pasupathy & Connor, 1999, 2001). Recently, McDermott (2001, 2002) studied the performance of human observers for the detection of junctions in natural images. He found that the ability to detect junctions is severely impaired if subjects view the location of a possible junction through a small aperture. Detection performance and observers' confidence ratings decreased with decreasing size of the aperture. The results suggest that a substantial number of junctions in natural images cannot be detected by local mechanisms.

In this article, we propose a new mechanism for corner and junction detection based on a distributed representation of contour responses within a model hypercolumn (Zucker, Dobbins, & Iverson, 1989). Unlike local approaches as proposed in computer vision (e.g., Harris, 1987; Mokhtarian & Suomela, 1998; Parida & Geiger, 1998), the new scheme is based on a more global, recurrent long-range interaction for the coherent computation of contour responses. Such nonlocal interactions evaluate local responses within a more global context and generate a robust contour representation. A measure of circular variance is used to extract corner and junction points at positions of large responses for more than one orientation.

The letter is organized as follows. In section 2, we present the model of recurrent collinear long-range interactions and detail the new junction detection scheme. Simulation results for a number of synthetic and real-world camera images are presented in section 3. A discussion of the results is given in section 4. Section 5 concludes the article. A short version reporting this research has been published in Hansen and Neumann (2002).
2 A Neural Model for Corner and Junction Detection

Corner and junction configurations can be characterized by significantly increased responses for two or more orientations at a particular location in the visual space. A cortical hypercolumn is the neural representation for oriented responses at a particular location. Corners and junctions are thus characterized by significant activity of multiple neurons within a hypercolumn, as proposed by Zucker et al. (1989).

Multiple oriented activities as measured by a simple feedforward mechanism are sensitive to noisy signal variations. In previous work, we have proposed a model of recurrent collinear long-range interaction in the primary visual cortex for contour enhancement (Hansen & Neumann, 1999, 2001). During the recurrent long-range interactions, the initially noisy activities are evaluated within a larger context. In this recurrent process, only coherent orientation responses are preserved—responses supported by responses in the spatial neighborhood—while other responses are suppressed. Besides the enhancement of coherent contours, the proposed model also preserves multiple activity at corners and junctions. Corners and junctions are thus implicitly characterized by a distributed representation of high multiple activity within a hypercolumn. Such a distributed representation may suffice for subsequent neural computations. However, at least for the purpose of visualization and comparison to other junction detection schemes, an explicit representation is required. Following the above considerations, corners and junctions can be marked if multiple orientations are active and high overall activity exists within a hypercolumn.

In the following, we first present the proposed model of collinear recurrent long-range interactions in V1 and then detail a mechanism to explicitly mark corner and junction points. The analysis and experiments are restricted to patterns consisting of straight or extremely low curvature boundaries.
2.1 Coherent Contour Representation by a Model of Collinear Recurrent Long-Range Interaction in V1. The model of collinear long-range interactions in V1 is motivated by biological mechanisms. The core mechanisms of the model include localized receptive fields (RFs) for oriented contrast processing, interlaminar feedforward and feedback processing, cooperative horizontal long-range integration, and lateral competitive interactions. The key properties of the model are motivated by empirical findings:

• Horizontal long-range connections. The grouping of aligned contours requires a mechanism that links cells of proper orientation over larger distances. Horizontal long-range connections found in the superficial layers of V1 may provide such a mechanism. They span large distances (Gilbert & Wiesel, 1983; Rockland & Lund, 1983) and selectively link cells with similar feature preference (Gilbert & Wiesel, 1989) and collinearly aligned RFs (Bosking, Zhang, Schofield, & Fitzpatrick, 1997; Schmidt, Goebel, Löwel, & Singer, 1997). Evidence for nonlocal integration also comes from psychophysical experiments on contrast detection (Kapadia, Ito, Gilbert, & Westheimer, 1995; Kapadia, Westheimer, & Gilbert, 2000; Polat & Sagi, 1993, 1994) and contour integration (Field, Hayes, & Hess, 1993; Yen & Finkel, 1998). Since facilitation is known to be strong for collinear flankers, here we focus on interactions between collinear, iso-oriented items. The more general case of mutual facilitation for cocircular arrangements is considered in Parent and Zucker (1989) and Zucker et al. (1989).

• Short-range connections. Short-range connections are rather unspecific for a particular orientation (Amir, Harel, & Malach, 1993; Bosking et al., 1997; DeAngelis, Freeman, & Ohzawa, 1994) and most likely belong to an inhibitory system (Kisvarday, Kim, Eysel, & Bonhoeffer, 1994).

• Modulating feedback.
Several physiological studies indicate that feedback projections have a modulating or gating rather than a generating effect on cell activities (Hirsch & Gilbert, 1991; Hupé et al., 1998; Salin & Bullier, 1995). Feedback alone is not sufficient to drive cell responses (Sandell & Schiller, 1982); initial bottom-up activity is necessary to generate activity.

The model architecture is defined by a sequence of preprocessing stages and a recurrent loop of long-range interaction, realizing a simplified architecture of V1 (see Figure 1). Processing within the recurrent loop defines a functional architecture of two interacting regions, each with a distinctive purpose (Hansen, Sepp, & Neumann, 2001; Neumann & Sepp, 1999): the lower region serves as a stage of feature measurement and signal detection; the higher region represents expectations about visual structural entities and context information to be matched against the incoming data carried by the feedforward pathway.
[Figure 1: processing stages — input image → LGN cells → simple cells → complex cells → combination → long-range]
Figure 1: Overview of model stages together with a sketch of the sample receptive fields of cells at each stage for 0 degree orientation. For the long-range stage, the spatial weighting function of the long-range filter is shown together with the spatial extent of the inhibitory short-range interactions (sketched as black disk).
Feedback from the higher region then selectively enhances those activities of the lower region that are consistent with the top-level expectations or a priori assumptions (Carpenter & Grossberg, 1988; Mumford, 1991).

2.1.1 Feedforward Preprocessing. In the feedforward path, the initial luminance distribution I is processed by isotropic lateral geniculate nucleus (LGN) cells K^on/off, followed by orientation-selective simple cells S and complex cells C (see Figure 1). The interactions in the feedforward path are governed by basic linear equations to keep the processing in the feedforward path relatively simple and to focus on the contribution of the recurrent interaction. In our model, complex cell responses C provide an initial local estimate of contour strength, position, and orientation, which is used as bottom-up input for the recurrent loop. The equations that govern the computations in the feedforward path are detailed in appendix A.

2.1.2 Recurrent Long-Range Interaction. The output of the feedforward preprocessing defines the input to the recurrent loop. The recurrent loop has two stages: a combination stage, where bottom-up and top-down inputs are fused, and a stage of long-range interaction. The equations that govern the interaction at both the combination stage and the long-range stage employ nonlinear, multiplicative interactions. The divisive inhibition resulting from this interaction is sometimes referred to as shunting inhibition (Furman, 1965; Grossberg, 1970; Hodgkin, 1964; Levine, 2000). Shunting inhibition has been proposed as a possible mechanism for divisive interactions in a number of studies (Borg-Graham, Monier, & Frégnac, 1998; Carandini & Heeger, 1994). However, a number of results, in particular from modeling studies, have challenged this view, proposing that shunting inhibition has a subtractive rather than a divisive effect on cell firing
rates (Douglas & Martin, 1990; Holt & Koch, 1997). More recent studies have shown that shunting inhibition can indeed divisively modulate firing rates in neurons with high-variability synaptic inputs and dendritic saturation (Mitchell & Silver, 2003; Prescott & Koninck, 2003). Based on these results, we propose that shunting inhibition is a physiologically plausible candidate mechanism for the divisive inhibition as used in the model.

Combination stage. At the combination stage, feedforward complex cell responses C and feedback long-range responses W are added and subject to a nonlinear compression of high-amplitude activity following the Weber–Fechner law (Fechner, 1889; Weber, 1905):

V = β_V net_θ / (α_V + net_θ), where net_θ = C + δ_V W.   (2.1)

This equation is the steady-state response ∂_t V = 0 of the differential equation

∂_t V = −α_V V + (β_V − V) net_θ.   (2.2)

The divisive inhibition resulting from this interaction yields a bounded activity in the range [0, β_V]. More precisely, the activity V is described by

V = V(net_θ) = β_V net_θ / (α_V + net_θ) = β_V n/(1 + n)  for net_θ = n α_V, n ∈ ℝ⁺,   (2.3)

with V → 0 as net_θ → 0 and V → β_V as net_θ → ∞.
Plots of the function defining V for different values of the decay parameter α_V and a unit-valued scaling parameter β_V = 1 are depicted in Figure 2. The weighting parameter δ_V = 2 in equation 2.1 is chosen so that dimensions of C and W are approximately equal, the decay parameter α_V = 0.2 is chosen small compared to net_θ, and the scaling parameter β_V = 10 scales up the activity to be sufficiently large for the subsequent long-range interaction. In the first iteration step, feedback responses W are set to C.

Long-range interaction. At the long-range stage, the contextual influences on cell responses are modeled. Orientation-specific, anisotropic long-range connections provide the excitatory input. The inhibitory input is given by isotropic interactions in both the spatial and orientational domains. Long-range connections are modeled by a filter whose spatial layout is similar to the bipole filter as first proposed by Grossberg and Mingolla (1985a). The spatial weighting function of the long-range filter is narrowly tuned
Figure 2: Compressive nonlinearity of the function y = x/(x + α) employed in equation 2.1. (Left) Plot of the compression function for various values of the decay parameter α = 0.1, 0.25, 0.5, and 1. (Right) The decay parameter α in equation 2.1 can be viewed as defining the unit length of the input values x in the compression function; input values x given as a multiple of α, that is, x = nα, are mapped to n/(1 + n) independent of α. Scaling of the x-axis by 1/α thus maps all y-values as obtained for different values of α onto a single curve given by y = n/(1 + n).
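The properties stated in the caption are easy to verify numerically. The following sketch (the helper name `compress` is ours, not the paper's) evaluates the nonlinearity of equation 2.1 and checks both the steady-state relation of equation 2.2 and the α-invariance x = nα ↦ n/(1 + n):

```python
import numpy as np

def compress(net, alpha=0.2, beta=1.0):
    """Compressive nonlinearity of equation 2.1: y = beta * net / (alpha + net).

    The model uses beta_V = 10; beta = 1 reproduces the curves of Figure 2."""
    return beta * net / (alpha + net)

# Steady-state check: dV/dt = -alpha*V + (beta - V)*net vanishes at V = compress(net).
net = np.linspace(0.01, 2.0, 50)
V = compress(net, alpha=0.2, beta=1.0)
dVdt = -0.2 * V + (1.0 - V) * net
assert np.allclose(dVdt, 0.0)

# Scaling property from Figure 2: inputs net = n*alpha map to n/(1 + n) for any alpha.
n = 3.0
for alpha in (0.1, 0.25, 0.5, 1.0):
    assert np.isclose(compress(n * alpha, alpha=alpha), n / (1.0 + n))
```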
to the preferred orientation, reflecting the highly significant anisotropies of long-range fibers in visual cortex (Bosking et al., 1997; Schmidt et al., 1997). The size of the long-range filter is about four times the size of the RF of a complex cell, as sketched in Figure 1. Essentially, excitatory input is provided by correlation of the feedforward input with the long-range filter B_θ. A cross-orientation inhibition prevents the integration of cell responses at positions where responses for the orthogonal orientation also exist. The excitatory input is governed by

L_θ = [V_θ − V_θ⊥]⁺ ⋆ B_θ,   (2.4)

where ⋆ denotes spatial correlation and [x]⁺ = max{x, 0} denotes half-wave rectification. The long-range filter is defined as a polar-separable function:

B_θ(φ, r) = B_ang(φ) B_rad(r).   (2.5)

The angular function B_ang is maximal for the preferred direction θ and smoothly rolls off in a cosine fashion, being zero for angles deviating more than α/2 from the preferred orientation:

B_ang(φ) = cos((2π/(2α)) (θ − φ)) for |φ − θ| ≤ α/2, and 0 for |φ − θ| > α/2.   (2.6)

The parameter α, which defines the opening angle of the long-range filter, is set to 20 degrees. The radial function B_rad is constant for values smaller
Figure 3: Spatial weighting function for the long-range interaction for a reference orientation of 0 degrees.
than r_max = 25 and smoothly decays to zero in a gaussian fashion for values larger than r_max:

B_rad(r) = 1 for r ≤ r_max, and exp(−(r − r_max)²/(2σ²)) for r > r_max.   (2.7)
The standard deviation of the gaussian is set to σ = 3. The long-range filter is finally normalized such that the filter integrates to one. A plot of the long-range filter for a reference orientation of 0 degrees is depicted in Figure 3.

Responses are not salient if nearby cells of similar orientation preference also show a strong response. Therefore, an inhibitory term is introduced that gathers activity from the spatio-orientational neighborhood of the target cell. Gaussian weighting functions are used to sample activity in both the spatial domain (x, y) and the orientation domain θ. The inhibitory input M to the long-range interaction is thus provided by the three-dimensional (3D) correlation of the input L with a 3D gaussian G_σ:

M(x) = G_σ(x) ⋆ L(x).   (2.8)
Since the 3D gaussian G_σ(x) is separable in each dimension, the 3D correlation can be efficiently implemented as the successive correlation with three one-dimensional (1D) gaussians g_σsur(x), g_σsur(y), and g_σθ(θ). Since correlation is associative, the correlation can also be conceptualized as two successive correlations: a two-dimensional (2D) correlation in the spatial domain with an isotropic 2D gaussian G_σsur, followed by a correlation in the orientational domain with a 1D gaussian g_σo:

M(x) ≡ M(x, y, θ) = L(x, y, θ) ⋆ G_σsur(x, y) ⋆ g_σo(θ).   (2.9)
The standard deviation of the gaussian in the spatial domain is set to σ_sur = 8 to model the smaller extent of the inhibitory short-range connections. This parameterization results in an effective spatial extension of about half the size of the excitatory long-range interaction modeled by the long-range filter. The standard deviation in the orientational domain is set to σ_o = 0.5 to give near-zero input for the orthogonal orientation. The 3D gaussian kernel is visualized in Figure 4.
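To make the construction concrete, the long-range filter of equations 2.5 to 2.7 can be rendered on a pixel grid roughly as in the following sketch (NumPy; the grid size, the folding of pixel directions into orientation differences, and the `bipole_filter` name are our assumptions, not the authors' implementation):

```python
import numpy as np

def bipole_filter(theta=0.0, alpha_deg=20.0, r_max=25, sigma=3.0, size=71):
    """Polar-separable long-range filter B_theta (eqs. 2.5-2.7), sketched on a grid."""
    c = size // 2
    y, x = np.mgrid[-c:c + 1, -c:c + 1]
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)                       # direction of each grid point
    # fold direction difference into [-pi/2, pi/2) so both collinear lobes are kept
    d = np.mod(phi - theta + np.pi / 2, np.pi) - np.pi / 2
    alpha = np.deg2rad(alpha_deg)
    # angular part (eq. 2.6): cosine roll-off, zero beyond alpha/2
    B_ang = np.where(np.abs(d) <= alpha / 2,
                     np.cos((2 * np.pi / (2 * alpha)) * d), 0.0)
    # radial part (eq. 2.7): constant up to r_max, gaussian decay beyond
    B_rad = np.where(r <= r_max, 1.0, np.exp(-(r - r_max) ** 2 / (2 * sigma ** 2)))
    B = B_ang * B_rad
    return B / B.sum()                           # normalize to integrate to one

B = bipole_filter()
assert np.isclose(B.sum(), 1.0) and (B >= 0).all()
```

The inhibitory kernel of equations 2.8 and 2.9 would correspondingly be realized as successive 1D gaussian correlations along x, y (σ_sur = 8), and θ (σ_o = 0.5).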
Figure 4: 3D gaussian in (x, y, θ)-space visualized as O_max = 4 gaussians in (x, y) space. (Left to right) Gaussians for θ − Δθ, θ, θ + Δθ, and θ + 2Δθ. Note that the gaussian for the orientation θ⊥ = θ + 2Δθ orthogonal to θ is near zero at all positions.
The final activity of the long-range stage results from interactions between the excitatory long-range input L and the inhibitory input M. The excitatory long-range input L is gated by the activity V to implement a modulating rather than generating effect of lateral interaction on cell activities (Hirsch & Gilbert, 1991; Hupé et al., 1998). Similar to equation 2.1, the inhibition from M is divisive. The final activity of the long-range stage thus reads

W = β_W V (1 + η⁺ L) / (α_W + η⁻ M),   (2.10)

where α_W = 0.2 is the decay parameter and η⁺ = 5, η⁻ = 2, and β_W = 0.001 are scale factors. This equation is the steady-state response ∂_t W = 0 of the differential equation

∂_t W = −α_W W + β_W V (1 + η⁺ L) − η⁻ W M.

The result of the long-range stage is fed back and combined with the feedforward complex cell responses in equation 2.1, thus closing the recurrent loop. The multiplicative interactions governing both the long-range interactions and the combination of feedback and feedforward input ensure rapid equilibration of the dynamics after a few recurrent cycles and result in graded responses within a bounded range of activations.

The model is robust against parameter changes, mainly owing to the compressive transformation equations employed. For the combination of responses (see equation 2.1), however, it is crucial to have activities in both streams of a similar order of magnitude. Also, the relative RF sizes must not be substantially altered. The current parameter setting results in relative RF sizes of complex cells : isotropic inhibitory short-range filter : long-range interaction of about 1 : 2.5 : 4, assuming a cut-off of the gaussians at 2σ (or 95% of the total energy).
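A single pass through the recurrent loop (equations 2.1, 2.4, 2.9, and 2.10) can be sketched as follows. This is an illustration only, not the authors' implementation; the array layout, the boundary modes, and the precomputed per-orientation filters `bipoles` are our assumptions:

```python
import numpy as np
from scipy.ndimage import correlate, gaussian_filter, gaussian_filter1d

def recurrent_cycle(C, W, bipoles, alpha_V=0.2, beta_V=10.0, delta_V=2.0,
                    alpha_W=0.2, beta_W=0.001, eta_p=5.0, eta_m=2.0,
                    sigma_sur=8.0, sigma_o=0.5):
    """One recurrent cycle, sketched.

    C, W : arrays of shape (n_orient, h, w); bipoles : one 2D filter per orientation."""
    n = C.shape[0]
    # combination stage (eq. 2.1): shunting fusion of feedforward and feedback input
    net = C + delta_V * W
    V = beta_V * net / (alpha_V + net)
    # excitatory long-range input (eq. 2.4): cross-orientation inhibition, half-wave
    # rectification, then correlation with the orientation-specific bipole filter
    L = np.empty_like(V)
    for k in range(n):
        E = np.maximum(V[k] - V[(k + n // 2) % n], 0.0)
        L[k] = correlate(E, bipoles[k], mode="nearest")
    # inhibitory pool (eq. 2.9): separable gaussian in space and (wrapped) orientation
    M = np.stack([gaussian_filter(L[k], sigma_sur, mode="nearest") for k in range(n)])
    M = gaussian_filter1d(M, sigma_o, axis=0, mode="wrap")
    # long-range stage (eq. 2.10): modulatory excitation, divisive inhibition
    return beta_W * V * (1.0 + eta_p * L) / (alpha_W + eta_m * M)
```

In the first iteration, the feedback W is set to C; iterating the function a dozen times corresponds to the 12 recurrent cycles after which the dynamics are reported to saturate.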
2.2 Junction Detection by Readout of Distributed Information Using a Measure of Circular Variance. As stated above, corners and junctions are characterized by points in the visual space where responses for multiple orientations are present and high overall activity exists within a hypercolumn. For the readout of this distributed information, a measure of circular variance is used to signal multiple orientations. The overall activity is given by the sum across all orientations within a hypercolumn. Thus, the junction map J for a distributed hypercolumnar representation, such as the activity of the long-range stage W (see equation 2.10), is given by

J = circvar(W_θ)² Σ_θ W_θ.   (2.11)
The junction map J thus receives high values from a combined measure of high activations in multiple orientation channels and high overall activity. The function "circvar" is a measure of the circular variance within a hypercolumn. The squaring operation enhances the ratio between high and low circular variances. Circular variance is defined as one minus the normalized population vector:

circvar(W_θ) = 1 − |Σ_θ W_θ exp(2iθ)| / Σ_θ W_θ.   (2.12)
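The readout of equations 2.11 and 2.12 amounts to a few lines of array code; the sketch below (our illustration) assumes orientation channels evenly spaced in [0, π):

```python
import numpy as np

def junction_map(W):
    """Junction map J (eqs. 2.11-2.12) from orientation responses W, shape (n_orient, h, w)."""
    n = W.shape[0]
    theta = np.pi * np.arange(n) / n
    total = W.sum(axis=0)
    # population vector with doubled angles so that theta and theta + pi agree
    pop = np.abs(np.tensordot(np.exp(2j * theta), W, axes=1))
    circvar = 1.0 - pop / np.maximum(total, 1e-12)   # eq. 2.12
    return circvar ** 2 * total                      # eq. 2.11

# single active orientation -> circular variance 0 -> no junction response
W = np.zeros((4, 1, 1)); W[0] = 1.0
assert np.isclose(junction_map(W)[0, 0], 0.0)
# equal activity in all orientations -> circular variance 1 -> J equals summed activity
W = np.ones((4, 1, 1))
assert np.isclose(junction_map(W)[0, 0], 4.0)
```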
Circular variance takes values in the range [0, 1]. A value of 0 denotes a single response, whereas a value of 1 occurs if all orientations have the same activity. Circular variance has been used in a number of physiological studies to characterize the response properties of cells in V1 (McLaughlin, Shapley, Shelley, & Wielaard, 2000; Pugh, Ringach, Shapley, & Shelley, 2000; Ringach, Hawken, & Shapley, 1997).

To visualize the data, single junction points are marked as local maxima in the junction map. First, the junction map is smoothed with a gaussian (σ = 3) to regularize the junction map and improve the localization accuracy. Corner points are then marked as local maxima whose strength must exceed a fraction of the maximum response in the smoothed junction map. Local maxima are computed within a 3 × 3 neighborhood. These operations help to visualize the junction data and are not meant to have a direct physiological counterpart.

The corner detection scheme detailed in this section reads out a scalar value of junction activity from the population vector of orientation responses within a hypercolumn. The readout operation does not rely on any specific properties of the long-range stage and thus can be applied to any input where multiple orientations are locally represented within a hypercolumn. To demonstrate the advantages of the recurrent long-range interaction as opposed to a purely feedforward processing, it is instructive to compare the detection results obtained for two different kinds of distributed input:
the long-range activity W and the complex cell activity C. Results of this comparison are presented in the next section.

3 Simulations

In this section, we show the competencies of the proposed junction detection scheme for a variety of synthetic and real-world camera images. In particular, the robustness to noise and the localization properties of the new scheme are evaluated. In order to focus on the relative merits of the recurrent long-range interactions for the task of corner and junction detection, the proposed scheme is evaluated using two different kinds of input: the activity W of the long-range stage and the purely feedforward activity C of the complex cell stage. Model parameters as specified in section 2 are used in all simulations.

The multiplicative interactions that govern the computations at the long-range stage and the combination stage ensure an equilibration of the model dynamics after a small number of recurrent cycles. For all input images, model dynamics have saturated after 12 recurrent cycles. The model responses continuously improve with each iteration, with the largest improvement achieved in the first iterations (Hansen, 2003). A fixed number of recurrent cycles is therefore not crucial for the model performance. This is of particular relevance for real biological networks, where the number of recurrent interactions is limited by the time available to process a stimulus.

3.1 Localization of Generic Junction Configurations. From the outset of corner and junction detection in computer vision, the junction types have been partitioned into distinct classes like T-, L-, and W-junctions (Huffman, 1971). This catalog of junction configurations has been extended by other junction types such as Ψ-junctions, which have been suggested to provide strong cues for inferring surface shading and reflectance (Adelson, 2000; Sinha & Adelson, 1993).
In the first simulation, we compare the localization accuracy of junction responses based on feedforward versus recurrent long-range responses for L-, T-, Y-, W-, and Ψ-junctions (see Figure 5). To evaluate the localization accuracy, the Euclidean distance between the ground-truth location and the location as measured by either method is computed. For all junction types, the localization is considerably better for the method based on the recurrent long-range interaction.

3.2 Processing of Synthetic and Camera Images. In the second set of simulations, we evaluate the detection performance of the model for a variety of synthetic and real-world camera images. Receiver operating characteristic (ROC) analysis (Dayan & Abbott, 2001; Green & Swets, 1974) is used for a threshold-free evaluation of the different approaches. Details of the application of ROC analysis to junction detection are given in appendix B.
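For concreteness, a generic threshold-sweep ROC computation of the kind used in these evaluations can be sketched as follows (our illustration only; the paper's actual matching of detections to manually marked junction points is specified in its appendix B):

```python
import numpy as np

def roc_curve(J, truth, n_thresholds=100):
    """ROC sketch: sweep a detection threshold over a junction map J and compare
    against a boolean ground-truth mask `truth` of the same shape.

    Returns (false-alarm rates, hit rates), one pair per threshold."""
    thresholds = np.linspace(J.max(), J.min(), n_thresholds)
    hits, fas = [], []
    for t in thresholds:
        detected = J >= t
        hits.append((detected & truth).sum() / max(truth.sum(), 1))
        fas.append((detected & ~truth).sum() / max((~truth).sum(), 1))
    return np.array(fas), np.array(hits)
```

Plotting hit rate against false-alarm rate over all thresholds yields the ROC curves shown in the figures; a curve lying above another indicates higher detection accuracy at every operating point.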
Figure 5: Distance in pixels from ground-truth location (ordinate) for L-, T-, Y-, W-, and Ψ-junctions (abscissa). The deviation from ground-truth position is considerably smaller for the recurrent long-range interaction (open bars) compared to the complex cell responses (solid bars).
3.2.1 Synthetic Images. In the first simulation, we employ a synthetic corner test image from Smith and Brady (1997). The ROC curve for the new method based on recurrent long-range interaction is well above the ROC curve obtained for the complex cell responses, indicating a higher accuracy of the new method (see Figure 6). Next, we address the processing of junctions made of lines meeting at small angles. We study junctions with different numbers of lines (two, three, and four) joining at different small angles (5, 10, and 15 degrees). Input images together with the corresponding ROC curves are shown in Figure 7. For all input images, the ROC curves indicate a better performance of the junction detection based on the recurrent long-range processing.
Figure 6: (Left) Synthetic test image from Smith and Brady (1997) used for the evaluation of corner detection schemes. (Middle) Overlay of ground-truth junction points. (Right) ROC curves (long-range stage, solid lines; complex cells, dashed lines). The abscissa denotes the false alarm rate, and the ordinate denotes the hit rate.
Neural Mechanisms for the Robust Representation of Junctions
Figure 7: ROC curves for synthetic junctions of small angles. (Left panel) 3 × 3 arrangement of input images, each with junctions made of lines meeting at 5, 10, and 15 degrees (top to bottom row). (Right panel) ROC curves (long-range stage, solid lines; complex cells, dashed lines). The abscissa denotes the false alarm rate between zero and one, and the ordinate denotes the hit rate. For better visualization, cut-outs of the left part (0 to 0.15) of the ROC curves are shown. The ROC curves indicate a higher accuracy for the results based on the recurrent long-range interaction. Note that for a single junction, the hit rate can take only two values, 0 and 1.
3.2.2 Camera Images. In this section, the junction detection performance is evaluated for various camera images. For each image, the ground-truth data are obtained by manually marking the junction points. Detection performance is evaluated for a cube image within a laboratory environment, a staircase image, a laboratory scene (Mokhtarian & Suomela, 1998), and a subimage of a plant taken from a database of natural images (van Hateren & van der Schaaf, 1998). For each image, sample detection results for a fixed threshold and the full ROC curves are shown (see Figure 8). For all images, the ROC curves obtained from the junction detection based on recurrent long-range interaction lie above the curves obtained for the feedforward complex cell responses. Thus, detection accuracy is increased by the recurrent long-range processing. The sample plots reveal two main factors that contribute to the increased detection performance. First, many false-positive responses occur based on the complex cell responses due to noisy variations of the initial orientation measurement (e.g., at the long vertical edges of the cube or at the top and bottom edges of the staircase). These variations are reduced at the long-range stage by the recurrent interaction, such that only the positions of significant orientation variations remain. Second, false-negative responses occur at the complex cell stage because localized measurements within a small aperture fail to detect junctions of locally low contrast (McDermott, 2001, 2002). Such missing responses occur, for example, at the bottom right corner of the staircase or for the plant image. The recurrent long-range interaction enhances those small activities supported by aligned responses within a more global context, such that even junctions of low contrast can be detected. Overall, the results obtained based on the long-range responses are superior to the results based on the purely feedforward complex cell responses. However, the results are not perfect in the sense that every corner is detected by the new method. The focus of this article is not to propose an ideal junction detector, but to show how mechanisms of recurrent long-range processing in V1 lead to a coherent representation of contours and junctions.

Figure 8: Simulation of the corner detection scheme for camera images. (Left to right) Input image with an overlay of manually marked ground-truth junction points; detected corners and junctions based on the complex cell responses; detected corners and junctions based on the long-range responses; and the corresponding ROC curves (long-range stage, solid lines; complex cells, dashed lines). The two gray-scale images in the second and third columns depict the edge maps based on the complex and long-range responses, respectively. These edge maps show the overall activity given by the sum across all orientations within each hypercolumn. In the edge maps, high activity is coded black, and low activity is coded white. In the ROC curves, the abscissa denotes the false alarm rate, and the ordinate denotes the hit rate. For better visualization, cutouts of the left part of the ROC curves are shown. The recurrent long-range interaction results in a decrease of false-positive and false-negative responses, leading to an increase in overall accuracy.

4 Discussion

4.1 Models of Recurrent Long-Range Interaction. A number of models have been proposed in the context of contour grouping and texture segmentation that incorporate principles of nonlocal, long-range interactions, or recurrent processing, or both (Grossberg & Mingolla, 1985b; Grossberg & Raizada, 2000; Heitger, von der Heydt, Peterhans, Rosenthaler, & Kübler, 1998; Li, 1998, 1999, 2001; Neumann & Sepp, 1999; Yen & Finkel, 1998). A comprehensive overview of these different approaches can be found in Neumann and Mingolla (2001) and Hansen (2003). These models have focused on the role of recurrent long-range processing for contour enhancement (Grossberg & Mingolla, 1985b; Li, 1998), preattentive segmentation and pop-out (Li, 1999; Yen & Finkel, 1998), recurrent interaction between two reciprocally connected cortical areas V1 and V2 (Neumann & Sepp, 1999), or the role of the laminar architecture in V1 for contrast-sensitive perceptual grouping (Grossberg & Raizada, 2000). With the exception of Heitger et al. (1998), none of these models has addressed the computation and representation of corners and junctions.
The model of Heitger et al., however, relies on the explicit computation of corners and line end points by specific end-stopped filters and is purely feedforward. The idea of an explicit end-stopped filter has been challenged by physiological studies suggesting that end-stopped properties emerge from surround inhibition within a recurrent interlaminar loop (Bolz & Gilbert, 1986; Bolz, Gilbert, & Wiesel, 1989). Further, the use of a purely feedforward computation scheme neglects the role of feedback and recurrent interactions, which presumably play an important role in visual processing for figure-ground segregation (Hupé et al., 1998; Lamme, 1995; Zipser, Lamme, & Schiller, 1996). Zucker and coworkers have investigated the process of line finding based on a relaxation labeling mechanism followed by a stage of spline approximation to minimize a global energy functional (Parent & Zucker, 1989; Zucker et al., 1989). The representation of junction configurations is only implicitly addressed there. For example, the inference of smooth contours from responses of oriented line detectors necessitates stable endings that neither shrink nor grow at discontinuities in the orientation field. By explicitly
coloring nodes of a relaxation grid and thus splitting activations into separate layers, multiple orientations can be represented at one spatial location (David & Zucker, 1990). These approaches, however, have not investigated the capability to read out junction configurations from orientation fields or analyzed the robustness of the junction representation for varying opening angles. Our approach is computationally simple, requiring no explicit class labeling operation. Our investigation of detecting and representing corners and junctions shows that the network model gives robust results for varying scene parameters, demonstrating that such meaningful visual structure information is already available at the stage of V1.

4.2 Decoding Population Codes. In population coding, a single quantity is represented by a number or population of neurons with overlapping sensitivity or tuning curves (Oram, Földiák, Perret, & Sengpiel, 1998; Pouget, Dayan, & Zemel, 2000). Population coding is a general principle in neural systems and is advantageous because of its robustness to both single-cell damage and noisy response fluctuations (Pouget et al., 2000). The benefits of population coding of stimulus orientation within a cortical hypercolumn have been studied in a model by Vogels (1990). It has been shown that stimulus orientation can be estimated with high accuracy from cells that have broad orientation tuning curves. Population codes also allow representing the certainty of a signal by the summed overall activity (Hinton, 1992). Further, the presence of more than one signal can be represented by a multimodal activity distribution. Corners and junctions can be signaled if multiple activities of high amplitude occur within a cortical hypercolumn (Zucker et al., 1989).
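The multiplicative combination of summed hypercolumn activity and circular variance used by the model to read out junctions from such multimodal distributions can be sketched as follows. The function name, the dictionary representation of a hypercolumn, and the angle doubling (which respects the π-periodicity of orientation) are our illustrative assumptions:

```python
import cmath
from math import pi

def junction_measure(responses):
    """Multiplicatively combine summed hypercolumn activity with the
    circular variance of the orientation responses.

    responses: dict mapping orientation theta (radians, in [0, pi))
    to a nonnegative response amplitude.
    """
    total = sum(responses.values())  # presence/certainty of a signal
    if total == 0.0:
        return 0.0
    # Orientation is pi-periodic (axial data), so angles are doubled
    # before computing the mean resultant vector.
    resultant = sum(m * cmath.exp(2j * theta) for theta, m in responses.items())
    circ_var = 1.0 - abs(resultant) / total  # 0: one orientation; near 1: spread
    # The product is high only when activity is both strong and
    # distributed over multiple orientations, i.e., at corners and junctions.
    return total * circ_var

edge = {0.0: 1.0, pi / 2: 0.0}    # a single active orientation
corner = {0.0: 1.0, pi / 2: 1.0}  # two orthogonal orientations
assert junction_measure(edge) < 1e-9
assert junction_measure(corner) > 1.9
```

Neither factor alone suffices, as discussed in the text: the sum cannot distinguish one strong orientation from several medium ones, and the circular variance cannot distinguish weak noise from strong multiple orientations.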
Most schemes for decoding population codes are based on the underlying assumption that a single value is encoded (Oram et al., 1998; Seung & Sompolinsky, 1993), with a few exceptions (see Zemel, Dayan, & Pouget, 1998). To extract a single value of junction activity from the distributed, multimodal representation of orientation responses within a hypercolumn, we suggest a simple mechanism that combines two measures in a multiplicative way. The summed overall activity signals the presence or certainty of a signal (Hinton, 1992), and high circular variance signals the presence of multiple orientations. A single measurement of the summed overall activity does not allow distinguishing between high activity generated by a strong response at a single orientation and that generated by medium responses at multiple orientations. Likewise, a single measurement of circular variance does not allow distinguishing between high circular variance generated by small responses in the presence of noise and that generated by strong responses at multiple orientations. The combination of both measurements naturally allows for extracting the desired information.

4.3 Circular Variance Function. The proposed junction detection mechanism uses a measure of circular variance for the explicit representation of
distributed orientation responses within a hypercolumn. While this measure is motivated mainly computationally, circular variance is essentially a weighted sum of information locally available within a hypercolumn. We hypothesize that a related measure of junction responses can be realized by neurons in the visual cortex. First evidence comes from a study by Das and Gilbert (1999), indicating a graded specialization of neurons for the processing of corners and T-junctions. The degree of selectivity for processing corners was shown to increase with the overlap of the neuron's dendritic arborization with neighboring orientation columns. The circular variance function used in the model is invariant under rotations of the junction and increases with the number of lines meeting at one point. A neural circuit based on a measure of circular variance is thus predicted to realize a general junction detector, in contrast to detectors specialized for a particular junction type such as T-junctions.

4.4 Multiscale Processing for Junction Detection. A number of studies have stressed the importance of multiple scales for the proper extraction of corner and junction information (Lindeberg, 1998; Mokhtarian & Suomela, 1998; Würtz & Lourens, 2000). The present model operates on only a single scale. However, in the recurrent interaction, the model neurons integrate information over successively increasing regions of the visual space. Thus, information of different scales is available during the temporal evolution of orientation responses. We have shown for a number of generic junction configurations that precise localization can be achieved by the proposed model without the need for tracking responses along multiple scales from coarse to fine, as suggested by Mokhtarian and Suomela (1998).

4.5 Evaluation of Junction Detection Schemes. A number of different methods have been proposed to evaluate the various approaches to corner detection.
The different methods can be classified into methods based on visual inspection, localization accuracy, and theoretical analysis (Schmid et al., 2000). A simple and popular method relies on visual inspection of the detection results: a number of different detectors are applied to a set of images, and the results are presented. This method suffers from a number of drawbacks, since the number of false alarms and misses as well as the precise localization may not be judged correctly. More importantly, results are shown only for a particular choice of the threshold separating corner from noncorner points. Such a threshold is involved in virtually every corner detection scheme, and detection results crucially depend on the proper choice of the threshold. Localization accuracy is another evaluation method and can be measured based on the correct projection of 3D scene points to 2D image points (Coelho, Heller, Mundy, Forsyth, & Zisserman, 1991; Heyden & Rohr, 1996). Since this method requires precise knowledge of 3D points, the evaluation is restricted to simple scenes of, for example, polyhedral objects. The performance of various corner detectors can also be
assessed by theoretical analysis (Deriche & Giraudon, 1990; Rohr, 1994). Analytical studies are limited to particular configurations such as L-corners. Here we have introduced the method of ROC analysis in the context of junction detection. ROC analysis allows assessment of the capabilities of the detectors over the full range of possible thresholds for every test image. Consequently, ROC-based evaluation results are not flawed by the choice of a particular threshold, which can strongly bias the obtained results.

5 Conclusions

We have proposed a novel method for corner and junction detection based on a distributed representation of orientation information within a hypercolumn (Zucker et al., 1989). The explicit representation of a number of orientations in a cortical hypercolumn is shown to constitute a powerful and flexible multipurpose scheme that can be used to code intrinsically 1D signal variations like edges as well as 2D variations like corners and junctions. Orientation responses within a hypercolumn can be robustly and reliably computed by using contextual information. We have proposed a model of recurrent long-range interactions to compute coherent orientation responses (Hansen & Neumann, 2001). In the context of corner and junction detection, we have shown the benefits of using contextual information and recurrent interactions, leading to a considerable increase in localization accuracy and detection performance compared to a simple feedforward scheme. We have used ROC analysis for the evaluation of the different junction detection schemes; the results for both synthetic and real-world camera images show that the detection performance of the new scheme is superior to that of the basic feedforward scheme. These results have been obtained with a scheme that focuses on the interaction between collinear items. One might expect results to be even more impressive for a scheme that includes curves and textures.
Overall, we have shown how robust and accurate junction detection can be realized based on biologically plausible mechanisms.

Appendix A: Feedforward Processing

A.1 LGN On- and Off-Cells. Responses of isotropic LGN cells are modeled by correlation of the initial input stimulus I with a difference-of-Gaussians (DoG) operator. Two types of LGN cells are modeled, on and off, which generate rectified output responses $K_{on/off}$:

\[ K = \mathrm{DoG}_{\sigma_c,\sigma_s} \star I, \quad K_{on} = [K]^+, \quad K_{off} = [-K]^+, \tag{A.1} \]
where $\star$ is the spatial correlation operator and $[x]^+ := \max\{x, 0\}$ denotes half-wave rectification. The DoG is parameterized by the standard deviations of the center and surround Gaussians ($\sigma_c = 1$, $\sigma_s = 3$), respectively.

A.2 Simple Cells. Simple cells in V1 have elongated subfields (on and off) that sample the input of appropriately aligned LGN responses. Input sampling is modeled by correlation with rotated, anisotropic Gaussians. The Gaussians are shifted perpendicular to their main axis by $\pm\tau = 3$ to model the left and right subfields of an odd-symmetric simple cell. Thus, for example, for the on-channel, the equations read

\[ R_{on,left,\theta} = K_{on} \star G_{\sigma_x,\sigma_y,0,-\tau,\theta}, \quad R_{on,right,\theta} = K_{on} \star G_{\sigma_x,\sigma_y,0,\tau,\theta}. \tag{A.2} \]
The activations of the off-channel are computed analogously. Simple cells are modeled for two polarities (dark-light and light-dark) in $O_{max} = 4$ orientations ($\theta = 0, \pi/O_{max}, \ldots, (O_{max}-1)\,\pi/O_{max}$). The standard deviations of the anisotropic Gaussians are set to $\sigma_y = 1$, $\sigma_x = 3\sigma_y$. For each orientation, the simple cell activity is computed by pooling the two subfield responses. The equations for light-dark (ld) and dark-light (dl) simple cells read

\[ S_{ld,\theta} = R_{on,left,\theta} + R_{off,right,\theta}, \quad S_{dl,\theta} = R_{off,left,\theta} + R_{on,right,\theta}. \tag{A.3} \]
A.3 Complex Cells. Cortical complex cells are polarity insensitive. Their response is generated by pooling simple cells of opposite polarities. Before pooling, simple cells of opposite polarities compete and are spatially blurred. The corresponding equations read

\[ \tilde{S}_{ld,\theta} = \big[(S_{ld,\theta} - S_{dl,\theta}) \star G_{\sigma_x,\sigma_y,0,0,\theta}\big]^+, \quad C_\theta = \tilde{S}_{ld,\theta} + \tilde{S}_{dl,\theta}, \tag{A.4} \]

with $\tilde{S}_{dl,\theta}$ obtained by interchanging the roles of the ld and dl responses.
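A concrete reading of equations A.1 to A.4 can be sketched in NumPy. The kernel size, the toy stimulus, and all function names are our illustrative assumptions; only the parameter values ($\sigma_c = 1$, $\sigma_s = 3$, $\tau = 3$, $\sigma_y = 1$, $\sigma_x = 3\sigma_y$, $O_{max} = 4$) follow the appendix:

```python
import numpy as np

def gauss(sigma_x, sigma_y, shift, theta, size=19):
    """Rotated anisotropic Gaussian, shifted by `shift` perpendicular to
    its main axis (the fourth index in G_{sx,sy,0,shift,theta})."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)   # along the main axis
    yr = -x * np.sin(theta) + y * np.cos(theta)  # perpendicular to it
    g = np.exp(-0.5 * ((xr / sigma_x) ** 2 + ((yr - shift) / sigma_y) ** 2))
    return g / g.sum()

def correlate(img, k):
    """Plain zero-padded 2D spatial correlation (the star operator)."""
    kh, kw = k.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def rect(x):
    return np.maximum(x, 0.0)  # half-wave rectification [x]^+

def complex_cells(I, tau=3.0, sigma_y=1.0, o_max=4):
    sigma_x = 3.0 * sigma_y
    # A.1: LGN on/off responses from a DoG (sigma_c = 1, sigma_s = 3).
    K = correlate(I, gauss(1, 1, 0, 0) - gauss(3, 3, 0, 0))
    K_on, K_off = rect(K), rect(-K)
    C = {}
    for n in range(o_max):
        theta = n * np.pi / o_max
        g_l = gauss(sigma_x, sigma_y, -tau, theta)
        g_r = gauss(sigma_x, sigma_y, tau, theta)
        # A.2/A.3: left/right subfield responses pooled into simple cells.
        S_ld = correlate(K_on, g_l) + correlate(K_off, g_r)
        S_dl = correlate(K_off, g_l) + correlate(K_on, g_r)
        g0 = gauss(sigma_x, sigma_y, 0, theta)
        # A.4: competition between polarities, blur, rectify, pool.
        C[theta] = (rect(correlate(S_ld - S_dl, g0))
                    + rect(correlate(S_dl - S_ld, g0)))
    return C

# A vertical step edge should drive the vertically tuned complex cell most.
img = np.zeros((40, 40))
img[:, 20:] = 1.0
C = complex_cells(img)
assert C[np.pi / 2].sum() > C[0.0].sum()
```

The recurrent long-range and combination stages of section 2 would operate on these complex cell activities C; they are omitted here.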
Appendix B: Applying ROC for the Evaluation of Different Junction Detectors ROC analysis allows characterizing different detectors over the full range of possible biases or thresholds. In virtually all junction detection schemes, some kind of thresholding is involved, and the detection performance crucially depends on the determination of the “optimal” threshold value. A threshold-free evaluation of different detectors as provided by ROC analysis allows separating the sensitivity of the detector from its threshold selection strategy.
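The threshold sweep underlying such an evaluation can be sketched as follows. The acceptance of detections within an error radius around each ground-truth point and the parameter values (N = 40 thresholds, r_err = 3 pixels) follow this appendix, while the function name and the dictionary representation of the response map are our illustrative assumptions:

```python
import math

def roc_curve(response, ground_truth, r_err=3, n_steps=40):
    """Sweep a threshold over normalized junction responses and return a
    list of (false-alarm rate, hit rate) pairs.

    response: dict mapping (x, y) pixel -> normalized response in [0, 1].
    ground_truth: list of (x, y) junction locations.
    """
    def dist(p, g):
        return math.hypot(p[0] - g[0], p[1] - g[1])

    curve = []
    for step in range(n_steps + 1):
        thr = 1.0 - step / n_steps  # from 1 down to 0
        detected = [p for p, v in response.items() if v >= thr]
        # A ground-truth point counts as a hit if any detection lies
        # within the error radius r_err around it.
        hits = sum(1 for g in ground_truth
                   if any(dist(p, g) <= r_err for p in detected))
        false_alarms = sum(1 for p in detected
                           if all(dist(p, g) > r_err for g in ground_truth))
        negatives = len(response) - len(ground_truth)
        curve.append((false_alarms / negatives, hits / len(ground_truth)))
    return curve

# Hypothetical toy map: one true junction with a strong response,
# plus a weak spurious response far away.
resp = {(5, 5): 0.9, (20, 20): 0.2, (0, 0): 0.0}
gt = [(5, 5)]
curve = roc_curve(resp, gt)
assert curve[0] == (0.0, 0.0)  # threshold 1: nothing detected
assert curve[-1][1] == 1.0     # threshold 0: the true junction is hit
```

A detector is better the closer its curve runs to the upper-left corner, which is how the long-range and complex cell stages are compared in the figures above.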
ROC analysis in general is based on ground-truth verification, that is, the comparison of a detection result with ground truth. Thus, the first step in applying ROC analysis to junction detection is the specification of ground-truth junction points for each test image. For synthetic images, the ground-truth position of junction points is known from the definition of the image or can be rather easily inferred from the gray-level variations. For real-world camera images, no such ground truth exists. In this case, junction points are marked by visual inspection of an enlarged version of the image. The list of these junction points serves as an approximation of an objective ground-truth image. The next step is the estimation of junction points using a particular junction detection scheme. We include two different methods in the comparison, based on long-range responses and feedforward complex cell responses. The resulting junction responses are normalized to the range [0, 1] to compensate for variations of the response amplitude across different methods. ROC curves are then computed based on the ground-truth image and the normalized junction response as follows. A threshold is varied in N steps over the full range [0, 1] of junction responses, and for each value of the threshold, the proportion of true-positive (hits) and false-positive (false alarms) responses is computed. To obtain true-positive responses despite localization errors of the methods, responses are accepted within a certain error radius r_err around each ground-truth location. Finally, the ROC curve characterizing the detection performance of the particular method is obtained by plotting the true-positive rates against the false-positive rates. To sum up, ROC analysis of the performance of junction detection schemes involves the following five steps:

1. Selection of an input image and determination of the ground-truth position of junction points

2. Application of a particular junction detection scheme to the image

3. Normalization of the junction responses to the range [0, 1]

4. Variation of a threshold in N steps from 1 to 0 and computation of the respective true-positive (tp) and false-positive (fp) rates

5. Plotting of the ROC curve, that is, plotting tp against fp

The free parameters of the approach are the number of thresholds N and the error radius r_err. We use N = 40, which allows for a sufficiently fine resolution of the ROC curves. The error radius is set to r_err = 3 pixels.

Acknowledgments

We acknowledge the insightful comments and criticisms of two anonymous reviewers whose suggestions helped to improve the article considerably.
References

Adelson, E. H. (1993). Perceptual organization and the judgment of brightness. Science, 262(5142), 2042–2044.
Adelson, E. H. (2000). Lightness perception and lightness illusions. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 339–351). Cambridge, MA: MIT Press.
Amir, Y., Harel, M., & Malach, R. (1993). Cortical hierarchy reflected in the organization of intrinsic connections in the macaque monkey visual cortex. J. Comp. Neurol., 334, 19–46.
Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61(3), 183–193.
Biederman, I. (1985). Human image understanding: Recent research and a theory. Computer Vision, Graphics, Image Proc., 32(1), 29–73.
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychol. Rev., 94(2), 115–147.
Bolz, J., & Gilbert, C. D. (1986). Generation of end-inhibition in the visual cortex via interlaminar connections. Nature, 320(6060), 362–365.
Bolz, J., Gilbert, C. D., & Wiesel, T. N. (1989). Pharmacological analysis of cortical circuitry. Trends Neurosci., 12(8), 292–296.
Borg-Graham, L. J., Monier, C., & Frégnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393, 369–373.
Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17(6), 2112–2127.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264(5163), 1333–1336.
Carpenter, G. A., & Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21, 77–88.
Coelho, C., Heller, A., Mundy, J. L., Forsyth, D., & Zisserman, A. (1991). An experimental evaluation of projective invariants. In Proc. DARPA-ESPRIT Workshop on Applications of Invariants in Computer Vision, Reykjavik, Iceland (pp. 273–293).
Das, A., & Gilbert, C. D. (1999). Topography of contextual modulations mediated by short-range interactions in primary visual cortex. Nature, 399, 655–661.
David, C., & Zucker, S. W. (1990). Potentials, valleys, and dynamic global coverings. Int. J. Comput. Vision, 5, 219–238.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
DeAngelis, G. C., Freeman, R. D., & Ohzawa, I. (1994). Length and width tuning of neurons in the cat's primary visual cortex. J. Neurophysiol., 71(1), 347–374.
Deriche, R., & Giraudon, G. (1990). Accurate corner detection: An analytical study. In Proc. 3rd Int. Conf. Computer Vision (pp. 66–70). Los Alamitos, CA: IEEE Computer Society Press.
Douglas, R. J., & Martin, K. A. C. (1990). Control of neuronal output by inhibition at the axon initial segment. Neural Comput., 2(3), 283–292.
Fechner, G. T. (1889). Elemente der Psychophysik. Leipzig: Breitkopf & Härtel.
Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.
Furman, G. (1965). Comparison of models for subtractive and shunting lateral-inhibition in receptor-neuron fields. Kybernetik, 2, 257–274.
Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9(7), 2432–2442.
Green, D. M., & Swets, J. A. (1974). Signal detection theory and psychophysics. Huntington, NY: Krieger.
Grossberg, S. (1970). Neural pattern discrimination. J. Theoret. Biol., 27, 291–337.
Grossberg, S., & Mingolla, E. (1985a). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychol. Rev., 92, 173–211.
Grossberg, S., & Mingolla, E. (1985b). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentation. Percept. Psychophys., 38, 141–171.
Grossberg, S., & Raizada, R. D. S. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Res., 40(10), 1413–1432.
Hansen, T. (2003). A neural model of early vision: Contrast, contours, corners and surfaces. Doctoral dissertation, Ulm University. Available on-line: http://vts.uni-ulm.de/query/longview.meta.asp?document id=3022.
Hansen, T., & Neumann, H. (1999). A model of V1 visual contrast processing utilizing long-range connections and recurrent interactions. In Proc. 9th Int. Conf. on Artificial Neural Networks (ICANN99) (pp. 61–66). London: Institution of Electrical Engineers.
Hansen, T., & Neumann, H. (2001). Neural mechanisms for representing surface and contour features. In S. Wermter, J. Austin, & D.
Willshaw (Eds.), Emergent neural computational architectures based on neuroscience (pp. 139–153). Berlin: Springer-Verlag.
Hansen, T., & Neumann, H. (2002). A biologically motivated scheme for robust junction detection. In H. H. Bülthoff, S.-W. Lee, T. A. Poggio, & C. Wallraven (Eds.), Biologically motivated computer vision (pp. 16–26). Berlin: Springer-Verlag.
Hansen, T., Sepp, W., & Neumann, H. (2001). Recurrent long-range interactions in early vision. In S. Wermter, J. Austin, & D. Willshaw (Eds.), Emerging neural architectures based on neuroscience (pp. 127–138). Berlin: Springer-Verlag.
Harris, C. J. (1987). Determination of ego-motion from matched points. In Proc. Third Alvey Vision Conference (pp. 189–192). Cambridge, UK.
Heitger, F., Rosenthaler, L., von der Heydt, R., Peterhans, E., & Kübler, O. (1992). Simulation of neural contour mechanisms: From simple to end-stopped cells. Vision Res., 32(5), 963–981.
Heitger, F., von der Heydt, R., Peterhans, E., Rosenthaler, L., & Kübler, O. (1998). Simulation of neural contour mechanisms: Representing anomalous contours. Image Vision Comput., 16, 407–421.
Heyden, A., & Rohr, K. (1996). Evaluation of corner extraction schemes using invariance methods. In Proc. 13th Int. Conf. Pattern Recognition, Vienna, Austria (Vol. 1, pp. 895–899). Los Alamitos, CA: IEEE Computer Society Press.
Hinton, G. E. (1992). How neural networks learn from experience. Sci. Am., 267(3), 144–151.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11(6), 1800–1809.
Hodgkin, A. L. (1964). The conduction of nervous impulses. Liverpool: Liverpool University Press.
Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Comput., 9(5), 1001–1013.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243.
Huffman, D. A. (1971). Impossible objects as nonsense sentences. In B. Meltzer & D. Michie (Eds.), Machine intelligence 6 (pp. 295–323). Edinburgh: Edinburgh University Press.
Hupé, J. M., James, A. C., Payne, B. R., Lomber, S. G., Girard, P., & Bullier, J. (1998). Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons. Nature, 394, 784–787.
Kapadia, M. K., Ito, M., Gilbert, C. D., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15(4), 843–856.
Kapadia, M. K., Westheimer, G., & Gilbert, C. D. (2000). Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J. Neurophysiol., 84(4), 2048–2062.
Kisvarday, Z. F., Kim, D. S., Eysel, U. T., & Bonhoeffer, T. (1994).
Relationship between lateral inhibitory connections and the topography of the orientation map in cat visual cortex. Europ. J. Neurosci., 6(10), 1619–1632.
Lamme, V. A. F. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. J. Neurosci., 15(2), 1605–1615.
Levine, D. S. (2000). Introduction to neural and cognitive modeling (2nd ed.). Mahwah, NJ: Erlbaum.
Li, Z. (1998). A neural model of contour integration in the primary visual cortex. Neural Comput., 10(4), 903–940.
Li, Z. (1999). Pre-attentive segmentation in the primary visual cortex. Spat. Vis., 13, 25–50.
Li, Z. (2001). Computational design and nonlinear dynamics of a recurrent network model of the primary visual cortex. Neural Comput., 13(8), 1749–1780.
Lindeberg, T. (1998). Feature detection with automatic scale selection. Int. J. Comput. Vision, 30(2), 77–116.
McDermott, J. H. (2001). Some experiments on junctions in real images. Unpublished master's thesis, University College London. Available on-line: http://persci.mit.edu/~jmcderm.
McDermott, J. H. (2002). Psychophysics with junctions in real images. J. Vis., 2(7), 131a.
McLaughlin, D., Shapley, R., Shelley, M. J., & Wielaard, D. J. (2000). A neural network model of macaque primary visual cortex (V1): Orientation selectivity and dynamics in the input layer 4Cα. Proc. Natl. Acad. Sci. USA, 97(14), 8087–8092.
Metelli, F. (1974). The perception of transparency. Sci. Am., 230(4), 90–98.
Michaelis, M. (1997). Low level image processing using steerable filters. Unpublished doctoral dissertation, Christian-Albrechts-Universität Kiel.
Mitchell, S. J., & Silver, R. A. (2003). Shunting inhibition modulates neuronal gain during synaptic excitation. Neuron, 38(3), 433–445.
Mokhtarian, F., & Suomela, R. (1998). Robust image corner detection through curvature scale space. IEEE Trans. Pattern Anal. Mach. Intell., 20(12), 1376–1381.
Mumford, D. (1991). On the computational architecture of the neocortex II: The role of cortico-cortical loops. Biol. Cybern., 65, 241–251.
Neumann, H., & Mingolla, E. (2001). Computational neural models of spatial integration and perceptual grouping. In T. F. Shipley & P. J. Kellman (Eds.), From fragments to objects: Segmentation and grouping in vision (pp. 353–400). Amsterdam: Elsevier Science.
Neumann, H., & Sepp, W. (1999). Recurrent V1–V2 interaction in early visual boundary processing. Biol. Cybern., 81, 425–444.
Oram, M. W., Földiák, P., Perret, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends Neurosci., 21(6), 259–265.
Parent, P., & Zucker, S. W. (1989). Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Anal. Mach. Intell., 11(8), 823–839.
Parida, L., & Geiger, D. (1998). Junctions: Detection, classification, and reconstruction. IEEE Trans. Pattern Anal. Mach. Intell., 20(7), 687–698.
Pasupathy, A., & Connor, C. E. (1999). Responses to contour features in macaque area V4. J. Neurophysiol., 82(5), 2490–2502.
Pasupathy, A., & Connor, C. E.
(2001). Shape representation in area V4: Position-specific tuning for boundary conformation. J. Neurophysiol., 86(5), 2505–2519.
Polat, U., & Sagi, D. (1993). Lateral interactions between spatial channels: Suppression and facilitation revealed by lateral masking experiments. Vision Res., 33(7), 993–999.
Polat, U., & Sagi, D. (1994). The architecture of perceptual spatial interactions. Vision Res., 34(1), 73–78.
Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nat. Rev. Neurosci., 1(2), 125–132.
Prescott, S. A., & Koninck, Y. D. (2003). Gain control of firing rate by shunting inhibition: Roles of synaptic noise and dendritic saturation. Proc. Natl. Acad. Sci. USA, 100(4), 2076–2081.
Pugh, M. C., Ringach, D. L., Shapley, R., & Shelley, M. J. (2000). Computational modeling of orientation tuning dynamics in monkey primary visual cortex. J. Comput. Neurosci., 8(2), 143–159.
Ringach, D. L., Hawken, M. J., & Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387, 281–284.
Neural Mechanisms for the Robust Representation of Junctions
1037
Rockland, K. S., & Lund, J. S. (1983). Intrinsic laminar lattice connections in primate visual cortex. J. Comp. Neurol., 216, 303–318. Rohr, K. (1994). Localization properties of direct corner detectors. J. Math. Imag. Vision, 4, 139–150. Rubin, N. (2001). The role of junctions in surface completion and contour matching. Perception, 30(3), 339–366. Salin, P.-A., & Bullier, J. (1995). Corticocortical connections in the visual system: Structure and function. Physiol. Rev., 75(1), 107–154. Sandell, J. H., & Schiller, P. H. (1982). Effect of cooling area 18 on striate cortex cells in the squirrel monkey. J. Neurophysiol., 48(1), 38–48. Schmid, C., Mohr, R., & Bauckhage, C. (2000). Evaluation of interest point detectors. Int. J. Comput. Vision, 37(2), 151–172. Schmidt, K. E., Goebel, R., Lowel, ¨ S., & Singer, W. (1997). The perceptual grouping criterion of collinearity is reected by anisotropies of connections in the primary visual cortex. Europ. J. Neurosci., 9, 1083–1089. Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl Acad. Sci. USA, 90(22), 10749–10753. Sinha, P., & Adelson, E. H. (1993). Recovering reectance in a world of painted polyhedra. In Proc. Fourth International Conference on Computer Vision (ICCV’ 93), Berlin, Germany (pp. 156–163). New York: IEEE. Smith, S., & Brady, J. (1997). SUSAN—a new approach to low level image processing. Int. J. Comput. Vision, 23(1), 45–78. Todorovi´c, D. (1997). Lightness and junctions. Perception, 26(4), 379–394. van Hateren, J. H., & van der Schaaf, A. (1998). Independent component lters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. London (B), 265(1394), 359–366. Vogels, R. (1990). Population coding of stimulus orientation by striate cortical cells. Biol. Cybern., 64, 25–31. Weber, E. H. (1905). Tastsinn und Gemeingef uhl. ¨ Handw¨orterbuch der Physiologie. Leipzig: Engelmann. Wurtz, ¨ R. P., & Lourens, T. (2000). 
Corner detection in color images through a multiscale combination of end-stopped cortical cells. Image Vision Comput., 18(6–7), 531–541. Yen, S.-C., & Finkel, L. H. (1998). Extraction of perceptually salient contours by striate cortical networks. Vision Res., 38(5), 719–741. Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Comput., 10(2), 403–430. Zipser, K., Lamme, V. A. F., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. J. Neurosci., 16(22), 7376–7389. Zucker, S. W., Dobbins, A., & Iverson, L. A. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Comput., 1, 68–81. Received July 29, 2002; accepted October 7, 2003.
LETTER
Communicated by Alan Yuille
Greedy Learning of Multiple Objects in Images Using Robust Statistics and Factorial Learning

Christopher K. I. Williams
[email protected]
Michalis K. Titsias
[email protected]
School of Informatics, University of Edinburgh, Edinburgh EH1 2QL, U.K.
We consider data that are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.

1 Introduction

In this letter, we consider data that are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. Previous approaches (discussed in more detail below) have approached this as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. By factorial learning, we mean a situation where multiple causes (factors) are needed to explain the data (image).1 A serious concern with the factorial learning problem is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. Suppose there are $L$ possible objects and $J$ possible values that the instantiation parameters of any one object can take on. We will then need to consider $O(J^L)$ combinations to explain any image. In contrast, in our approach, we find one object at a time, thus avoiding the combinatorial explosion.
1 This is the same terminology as used in the factor analysis model from statistics, although that model uses linear and gaussian assumptions.
Neural Computation 16, 1039–1062 (2004) © 2004 Massachusetts Institute of Technology
In unsupervised learning, we aim to identify regularities in data such as images. One fairly simple unsupervised learning model is clustering, which can be viewed as a mixture model where there is a finite number of types of object, and data are produced by choosing one of these objects and then generating the data conditional on this choice. As a means of discovering objects in images, standard clustering approaches are limited, as they do not take into account the variability that can arise due to translations, rotations, and so forth (the instantiation parameters) of the object. Suppose that there are $m$ different instantiation parameters; then a single object will sweep out an $m$-dimensional manifold in the image space. Learning about objects taking this regularity into account has been called transformation-invariant clustering by Frey and Jojic (1999, 2003). However, this work is still limited to finding a single object in each image.

A more general model for data is that where the observations are explained by multiple causes. In our example, this will be that in each image there are L objects. The approach of Frey and Jojic (1999, 2003) can be extended to this case by explicitly considering the simultaneous instantiation of all L objects (Jojic & Frey, 2001). However, this gives rise to a large search problem over the instantiation parameters of all objects simultaneously, and approximations such as variational methods are needed to carry out the inference. In our method, by contrast, we discover the objects one at a time using a robust statistical method. Sequential object discovery is possible because multiple objects combine by occluding each other or the background, or both. The general problem of factorial learning has a longer history (see, e.g., Barlow, 1989; Hinton & Zemel, 1994; Ghahramani, 1995).
However, Frey and Jojic made the important step for image analysis problems of using explicit transformations of object models, which allows the incorporation of prior knowledge about these transformations and leads to good interpretability of the results. A related line of research is that concerned with discovering part decompositions of objects. Lee and Seung (1999) described a nonnegative matrix factorization method addressing this problem, although their work does not deal with parts undergoing transformations. Other relevant work, including that by Shams and von der Malsburg (1999) on learning parts and work from the computer vision community on layered representations of images, is discussed in section 4.

The structure of the remainder of this letter is as follows. In section 2, we describe the model, first for images containing only a single object and then for images containing multiple objects. In section 3, we present experimental results finding objects appearing against static, moving, and random backgrounds. We conclude with a discussion in section 4.2
2 This letter is a revised and extended version of Williams and Titsias (2003), which was presented at NIPS 2002.
2 Theory

In section 2.1, we describe how to learn about an object when there is only a single object (plus background) in every image. In section 2.2, we discuss how this model can be robustified to deal with foreground and background occlusion caused by other objects being present in the images. Then in section 2.3, we describe a model that fully explains L objects in the images. An efficient greedy algorithm for training this model is described in section 2.4.

2.1 Learning One Object. In this section, we consider the problem of learning about one object that can appear at various locations in an image. The object is in the foreground, with a background behind it. The problem is set up in terms of a generative model for the image $\mathbf{x}$ given the transformations of the foreground and background. The background can be one of three cases: (1) a static background that is fixed for all training images, (2) a moving background that occurs, for example, when a moving camera captures a sequence of frames, and (3) random backgrounds where each image can have a completely different background.

The two key issues that we must deal with are the notion of a pixel being modeled as foreground or background and the problem of transformations of the object and the background. We consider first the foreground-background issue and assume that the background is static; cases 2 and 3 are discussed later in this section.
Consider an image $\mathbf{x}$ of size $P_x \times P_y$ containing $P \stackrel{\mathrm{def}}{=} P_x P_y$ pixels, arranged as a length-$P$ vector. Our aim is to learn appearance-based representations of the foreground $\mathbf{f}$ and the static background $\mathbf{b}$. As the object will be smaller than $P_x \times P_y$ pixels, we will need to specify which pixels belong to the background and which to the foreground; this is achieved by a vector of binary latent variables $\mathbf{s}$, one for each pixel. Each binary variable in $\mathbf{s}$ is drawn independently from the corresponding entry in a vector of probabilities $\boldsymbol{\pi}$. For pixel $p$, if $\pi_p \simeq 0$, then the pixel will be ascribed to the background with high probability, and if $\pi_p \simeq 1$, it will be ascribed to the foreground with high probability. We sometimes refer to $\boldsymbol{\pi}$ as a mask. $x_p$ is modeled by a mixture distribution,

$$x_p \sim \begin{cases} p_f(x_p; f_p) = N(x_p; f_p, \sigma_f^2) & \text{if } s_p = 1, \\ p_b(x_p; b_p) = N(x_p; b_p, \sigma_b^2) & \text{if } s_p = 0, \end{cases} \qquad (2.1)$$

where $\sigma_f^2$ and $\sigma_b^2$ are, respectively, the foreground and background variances. Thus, ignoring transformations, we obtain

$$p(\mathbf{x}) = \prod_{p=1}^{P} [\pi_p\, p_f(x_p; f_p) + (1 - \pi_p)\, p_b(x_p; b_p)]. \qquad (2.2)$$
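As a concrete illustration, the per-pixel mixture of equation 2.2 is straightforward to evaluate numerically. The following sketch is our own (function names are hypothetical), with images held as flat NumPy vectors:

```python
import numpy as np

def gauss(x, mean, var):
    """Pixelwise gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def log_lik(x, f, b, pi, var_f, var_b):
    """Log likelihood of one vectorized image under equation 2.2:
    sum_p log[pi_p N(x_p; f_p, var_f) + (1 - pi_p) N(x_p; b_p, var_b)]."""
    return np.sum(np.log(pi * gauss(x, f, var_f)
                         + (1.0 - pi) * gauss(x, b, var_b)))
```

Setting $\pi_p = 1$ everywhere recovers a pure foreground model, and $\pi_p = 0$ a pure background model.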
Notice that the fact that each pixel follows a mixture distribution ensures that the foreground and background appearances strictly combine by occlusion, and thus no transparency between them is allowed.

The second issue that we must deal with is that of transformations of the foreground object. Below, we consider only translations, although the ideas can be extended to deal with other transformations such as scaling and rotation (see, e.g., Frey & Jojic, 2002). Each possible transformation (e.g., translations in units of one pixel) is represented by a corresponding transformation matrix, so that matrix $T_{j_f}$ corresponds to transformation $j_f$ and $T_{j_f}\mathbf{f}$ is the transformed foreground model. In our implementation, the translations use wraparound, so that each $T_{j_f}$ is in fact a permutation matrix. The semantics of foreground and background mean that the mask $\boldsymbol{\pi}$ must also be transformed, so that we obtain

$$p(\mathbf{x}|j_f) = \prod_{p=1}^{P} [(T_{j_f}\boldsymbol{\pi})_p\, p_f(x_p; (T_{j_f}\mathbf{f})_p) + (\mathbf{1} - T_{j_f}\boldsymbol{\pi})_p\, p_b(x_p; b_p)], \qquad (2.3)$$
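Because each $T_{j_f}$ implements a wraparound translation, it is a permutation of the pixels, and applying it to a vectorized image amounts to rolling the underlying 2-D array. A minimal sketch (our own, assuming NumPy; the helper name is hypothetical):

```python
import numpy as np

def translate(v, shift, shape):
    """Apply a wraparound translation T_j to a vectorized image v.
    shift = (dy, dx) in pixels; shape = (Py, Px) is the 2-D image size."""
    return np.roll(v.reshape(shape), shift, axis=(0, 1)).ravel()
```

Since the translation is a permutation, applying it to the mask merely reorders its entries, so each element of $T_{j_f}\boldsymbol{\pi}$ remains a valid probability, as equation 2.3 requires.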
where $\mathbf{1}$ denotes the length-$P$ vector that contains ones. Notice that the foreground $\mathbf{f}$ and mask $\boldsymbol{\pi}$ are transformed by $T_{j_f}$, but the static background $\mathbf{b}$ is not. In order for equation 2.3 to make sense, each element of $T_{j_f}\boldsymbol{\pi}$ must be a valid probability (lying in $[0, 1]$). This is certainly true for the case when $T_{j_f}$ is a permutation matrix (and can be true more generally). To complete the model, we place a prior probability $P_{j_f}$ on each transformation $j_f$; this is taken to be uniform over all possibilities, so that $p(\mathbf{x}) = \sum_{j_f=1}^{J_f} P_{j_f}\, p(\mathbf{x}|j_f)$.

So far, the background $\mathbf{b}$ was considered static. However, in many cases, for example, when a video camera follows an object, the background can change from frame to frame. Next we generalize our method to deal with moving backgrounds. To model a moving background, we assume an underlying static background, which is typically much larger than the input images. We sometimes refer to this large static background as the panorama. When we generate an image, a part of this panorama scene is selected and used as the current background of the image, similarly to Rowe and Blake (1995). More specifically, we assume that the background $\mathbf{b}$ corresponds to an $M_x \times M_y$ image, where in general $M_x \geq P_x$ and $M_y \geq P_y$. $\mathbf{b}$ is represented as an $M$-dimensional vector with $M = M_x M_y$. We introduce a transformation variable $j_b$ that explains how the background of a data image is selected from the panorama $\mathbf{b}$. In our implementation, we consider as possible backgrounds all the $P_x \times P_y$ image blocks (aligned to the axes of the background image) taken from any possible location within the panorama $\mathbf{b}$.3 Clearly, $j_b$ takes on $J_b = (M_x - P_x + 1)(M_y - P_y + 1)$ total values, and a certain value $j_b$ is represented by an $M \times P$ transformation matrix $T_{j_b}$, so that $T_{j_b}\mathbf{b}$ selects the appropriate $P_x \times P_y$ image block from $\mathbf{b}$. The conditional density of an image given the transformation variables now becomes

$$p(\mathbf{x}|j_f, j_b) = \prod_{p=1}^{P} [(T_{j_f}\boldsymbol{\pi})_p\, p_f(x_p; (T_{j_f}\mathbf{f})_p) + (\mathbf{1} - T_{j_f}\boldsymbol{\pi})_p\, p_b(x_p; (T_{j_b}\mathbf{b})_p)], \qquad (2.4)$$

and the likelihood of an image $\mathbf{x}$ is $p(\mathbf{x}) = \sum_{j_f=1}^{J_f} \sum_{j_b=1}^{J_b} P_{j_f} P_{j_b}\, p(\mathbf{x}|j_f, j_b)$. Note also that a static background is a special case of the above model; by choosing the background $\mathbf{b}$ to have the same size as the data images, there is only one possible value for $j_b$, so the background is static. For random backgrounds, we do not try to model the backgrounds explicitly but simply use a large-variance gaussian at each pixel, which can account for the large background variability; $\mathbf{b}$ is the mean of this gaussian.

3 Of course, the model does not account for rotations or scaling of the background, and it can only approximately model such kinds of situations.

Given a data set $\{\mathbf{x}^n\}$, $n = 1, \ldots, N$, we can adapt the parameters $\theta = (\mathbf{f}, \boldsymbol{\pi}, \mathbf{b}, \sigma_f^2, \sigma_b^2)$ by maximizing the log likelihood $L(\theta) = \sum_{n=1}^{N} \log p(\mathbf{x}^n|\theta)$. This can be achieved by using the expectation-maximization (EM) algorithm to handle the missing data, which are the transformations $j_f$ and $j_b$. However, an exact EM algorithm requires a search over $J_f J_b$ possibilities, which can be very demanding even for small images. Our greedy training algorithm deals separately with each transformation by learning first the background and then the foreground object. This algorithm is presented for the more general case of L foreground objects in section 2.4 and also in the appendix.

2.2 Learning One Object Using Robust Statistics. Suppose that apart from the one foreground object being modeled, the images can additionally contain some other objects. However, we consider these objects as "outlying" information and thus do not wish to model their appearances. Our objective is to learn only the one object of interest and efficiently ignore all the other objects. A way to learn one object under the above assumptions is to robustify the model described in section 2.1 so that foreground and background occlusion can be tolerated. More specifically, for a foreground pixel, some other objects may be interposed between the camera and our object, thus perturbing the pixel value.
This can be modeled with a mixture distribution as

$$p_f(x_p; f_p) = \alpha_f N(x_p; f_p, \sigma_f^2) + (1 - \alpha_f) U(x_p), \qquad (2.5)$$

where $\alpha_f$ is the fraction of times a foreground pixel is not occluded and the robustifying component $U(x_p)$ is a uniform distribution common for all
image pixels. When an object pixel is occluded, this should be explained by the uniform component. Such robust models have been used for image-matching tasks by a number of authors, notably Black and colleagues (Black & Jepson, 1996). Similarly for the background, a different object from the one being modeled may be interposed between the background and the camera, so that we again have a mixture model,

$$p_b(x_p; b_p) = \alpha_b N(x_p; b_p, \sigma_b^2) + (1 - \alpha_b) U(x_p), \qquad (2.6)$$

with similar semantics for the parameter $\alpha_b$. Note that for random backgrounds, the above robustification makes less sense (since the gaussian will have large variance $\sigma_b^2$), but it will apply to the static or moving background cases. It is not necessary that the robustifying component be a uniform distribution; for example, a broad gaussian would also work. However, as pixels do have maximum and minimum values, the uniform distribution is a natural choice and is also the maximum entropy distribution. Training this model is completely analogous to the nonrobust case.

In practice, the above model can be used to learn multiple objects in images. By random parameter initializations and on different runs, we can find different objects. We denote such an algorithm as RANDOM STARTS.

2.3 Learning Multiple Objects. One way to learn multiple objects in images is by applying the RANDOM STARTS algorithm described in the above section. However, we have found (Williams & Titsias, 2003) that this is rather inefficient, as the basins of attraction for the different objects may be very different in size given the initialization. Thus, in this section, we describe a model that explicitly assumes L foreground objects in the images, and in section 2.4 we present the GREEDY algorithm that learns the background and the objects sequentially.

Assume that each image contains $L$ foreground objects. Similarly to the single-object case, each object $\ell$, with $\ell = 1, \ldots, L$, is modeled by a separate foreground appearance $\mathbf{f}_\ell$ and mask $\boldsymbol{\pi}_\ell$. The background can be thought of as the $(L+1)$th object having a mask $\boldsymbol{\pi}_b = \mathbf{1}$, since the background is present everywhere. For each foreground object $\ell$, we assume a transformation variable $j_\ell$ representing all possible translations. Below, we assume a moving background, where the transformation variable $j_b$ is defined as in section 2.1; however, all derivations also apply for static or random backgrounds by simply ignoring the variable $j_b$.
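Before introducing the multiple-object likelihood, it may help to see the robustified pixel densities of equations 2.5 and 2.6 in code. A toy sketch of our own (names hypothetical; pixel values assumed to lie in [0, 1], so the uniform component has constant height $U(x_p) = 1$):

```python
import numpy as np

def gauss(x, mean, var):
    """Pixelwise gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def robust_density(x, mean, var, alpha, u=1.0):
    """Equations 2.5/2.6: alpha * N(x; mean, var) + (1 - alpha) * U(x),
    where u is the constant height of the uniform outlier component."""
    return alpha * gauss(x, mean, var) + (1.0 - alpha) * u
```

An occluded pixel that falls far from `mean` still receives density of about $(1-\alpha)u$ from the outlier component, so it does not drag the gaussian parameters during learning.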
It will be instructive to introduce the model for the case that there are only two foreground objects. Assuming $L = 2$, an image $\mathbf{x}$ is generated by instantiating the transformation variables $j_1$, $j_2$, and $j_b$ and then drawing $\mathbf{x}$ according to

$$p(\mathbf{x}|j_b, j_1, j_2) = \prod_{p=1}^{P} \{(T_{j_1}\boldsymbol{\pi}_1)_p\, p_{f_1}(x_p; (T_{j_1}\mathbf{f}_1)_p) + (1 - T_{j_1}\boldsymbol{\pi}_1)_p [(T_{j_2}\boldsymbol{\pi}_2)_p\, p_{f_2}(x_p; (T_{j_2}\mathbf{f}_2)_p) + (1 - T_{j_2}\boldsymbol{\pi}_2)_p\, p_b(x_p; (T_{j_b}\mathbf{b})_p)]\}, \qquad (2.7)$$

where the $p_{f_1}$, $p_{f_2}$, and $p_b$ pixel densities are gaussians given as in equation 2.1. Note that each image pixel follows a three-component mixture distribution, so that with probability $(T_{j_1}\boldsymbol{\pi}_1)_p$ the pixel can belong to the first object, with probability $(1 - T_{j_1}\boldsymbol{\pi}_1)_p (T_{j_2}\boldsymbol{\pi}_2)_p$ to the second object, and with the rest of the probability to the background. The fact that the probabilities corresponding to the second object's pixels are always multiplied by $(1 - T_{j_1}\boldsymbol{\pi}_1)_p$ implies an occlusion ordering between these two objects, so that the first object can occlude the second one, but the opposite is not allowed. In the general case with an arbitrary number of objects, the model 2.7 becomes
$$p(\mathbf{x}|j_b, j_1, \ldots, j_L) = \prod_{p=1}^{P} p(x_p|j_b, j_1, \ldots, j_L), \qquad (2.8)$$

where $p(x_p|j_b, j_1, \ldots, j_L)$ is an $(L+1)$-component mixture model,

$$p(x_p|j_b, j_1, \ldots, j_L) = \sum_{\ell=1}^{L} \prod_{k=1}^{\ell-1} (1 - T_{j_k}\boldsymbol{\pi}_k)_p\, (T_{j_\ell}\boldsymbol{\pi}_\ell)_p\, p_{f_\ell}(x_p; (T_{j_\ell}\mathbf{f}_\ell)_p) + \prod_{k=1}^{L} (1 - T_{j_k}\boldsymbol{\pi}_k)_p\, p_b(x_p; (T_{j_b}\mathbf{b})_p), \qquad (2.9)$$

where if $\ell = 1$, the term $\prod_{k=1}^{\ell-1} (1 - T_{j_k}\boldsymbol{\pi}_k)_p$ in equation 2.9 is defined to be equal to 1. The order (from left to right) of the object models in equation 2.9 corresponds to the occlusion allowed between the objects. In particular, the first object lies closest to the camera; thus, it can never be occluded by any other object, the second object can be occluded only by the first object, and so on. The background lies at the furthest distance from the camera.

Notice that in the above model there is an asymmetry between the objects because of the specified occlusion ordering. If the objects can arbitrarily occlude one another, so that the occlusion ordering can change from image to image, then the above model is no longer appropriate. A principled way to deal with this situation is to consider all $L!$ possible rearrangements of the objects (using an additional hidden variable). An alternative way is to
replace the foreground pixel densities $p_{f_\ell}$ by their robust counterparts given by equation 2.5. This can allow for arbitrary occlusion between the objects without increasing the model complexity. Of course, such a model will not be able to answer immediately what the occlusion ordering is in a given image, but that can be done in a postprocessing stage. From now on, we will assume that both the foreground $p_{f_\ell}$ and background $p_b$ pixel densities are robustified as described in section 2.2. This robustification is the key to our GREEDY algorithm finding the objects one at a time. Bear in mind that robustifying $p_{f_\ell}$ also makes sense in terms of allowing arbitrary occlusion between the L foreground objects.

2.4 The GREEDY Training Algorithm. An exact EM algorithm (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997) for training the above model is highly intractable. This is because a full search over all transformations of the objects requires considering $J_f^L J_b$ possibilities, which can be extremely large even for small $L$. An alternative is to consider approximations; Ghahramani (1995) suggests mean field and Gibbs sampling approximations, and Jojic and Frey (2001) use approximate variational inference. Below, we describe a different learning algorithm that finds the background and all the foreground objects sequentially, one after the other.

Each component in the mixture distribution of equation 2.9 corresponds to an object model, which is either one of the L foreground objects or the background. The key idea of our learning algorithm is to learn this mixture model (and thus the relation with the associated transformation variable) sequentially, by learning the objects one at a time. An intuitive way to introduce this algorithm is that originally we constrain the mixture distribution so that the background takes all the probability and the masks of the foreground objects are zero.
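For concreteness, the occlusion-ordered mixture of equation 2.9 can be sketched as follows (our own illustration; the arguments are the already-transformed masks $(T_{j_\ell}\boldsymbol{\pi}_\ell)$ and the per-pixel component densities, with objects ordered from front to back):

```python
import numpy as np

def occlusion_mixture(masks, fg_densities, bg_density):
    """Per-pixel density of equation 2.9 for L objects plus background.
    masks[l], fg_densities[l]: arrays over pixels for object l;
    bg_density: background pixel densities p_b(x_p; (T_jb b)_p)."""
    visible = np.ones_like(bg_density)   # running product prod_{k<l} (1 - mask_k)
    total = np.zeros_like(bg_density)
    for mask, dens in zip(masks, fg_densities):
        total += visible * mask * dens   # object l claims what is still visible
        visible *= 1.0 - mask
    return total + visible * bg_density  # background takes the leftover weight
```

The per-pixel mixing weights sum to one by construction, which is why the background, as the $(L+1)$th component, needs no mask of its own.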
Since the background pixel densities are robustified according to equation 2.6, we can learn the background by "ignoring" all the foreground objects. When a pixel of the background is occluded by a foreground object, this should be explained by the outlier component in equation 2.6, so that the pixel will not affect the estimation of the background. At each subsequent stage, the mask of a foreground object is set free to take a nonzero value, and the corresponding object model is learned. Below we first describe learning the background in section 2.4.1; then we discuss learning the first object in section 2.4.2 and further objects in section 2.4.3. We summarize the algorithm in section 2.4.4. Further details are given in the appendix.

2.4.1 Finding the Background. The GREEDY algorithm starts by first finding the background. By constraining all the masks $\{\boldsymbol{\pi}_\ell\}_{\ell=1}^{L}$ to be zero, the mixture model 2.9 has only one component (corresponding to the background), and thus equation 2.8 takes the form $p(\mathbf{x}|j_b) = \prod_{p=1}^{P} p_b(x_p; (T_{j_b}\mathbf{b})_p)$. Assuming a uniform prior $P_{j_b}$ for transformation $j_b$, the log likelihood of the
training images is $L_b = \sum_{n=1}^{N} \log \sum_{j_b=1}^{J_b} P_{j_b}\, p(\mathbf{x}^n|j_b)$, which can be maximized with respect to $\{\mathbf{b}, \sigma_b^2\}$ by the EM algorithm. This algorithm searches over $J_b$ possibilities and is tractable; details are provided in section A.1 in the appendix.

2.4.2 Finding the First Object. At the second stage of the algorithm, we learn the first foreground object. By allowing the mask $\boldsymbol{\pi}_1$ to take on nonzero values, equation 2.8 becomes

$$p(\mathbf{x}|j_b, j_1) = \prod_{p=1}^{P} [(T_{j_1}\boldsymbol{\pi}_1)_p\, p_{f_1}(x_p; (T_{j_1}\mathbf{f}_1)_p) + (1 - T_{j_1}\boldsymbol{\pi}_1)_p\, p_b(x_p; (T_{j_b}\mathbf{b})_p)] = \prod_{p=1}^{P} p(x_p|j_b, j_1). \qquad (2.10)$$
The log likelihood of the training images is $L_1 = \sum_{n=1}^{N} \log \sum_{j_1, j_b} P_{j_1} P_{j_b}\, p(\mathbf{x}^n|j_1, j_b)$, and a direct maximization using the EM algorithm can be quite demanding, since inference involves searching over $J_f J_b$ possibilities. Our GREEDY algorithm drops the complexity of the search to $J_f$ possibilities by applying a constrained EM algorithm (Neal & Hinton, 1998) that exploits the fact that we already know the background. In particular, for each training image $\mathbf{x}^n$, we introduce the distribution $Q_n(j_1, j_b) = Q_n(j_1|j_b) Q_n(j_b)$ over the transformations, and we express a lower bound (based on Jensen's inequality) of the log likelihood $L_1$:

$$F_1 = \sum_{n=1}^{N} \sum_{j_b, j_1} Q_n(j_1|j_b) Q_n(j_b) \left\{ \log \left( P_{j_1} P_{j_b} \prod_{p=1}^{P} p(x_p^n|j_b, j_1) \right) - \log Q_n(j_1|j_b) Q_n(j_b) \right\}. \qquad (2.11)$$

This lower bound becomes tight by choosing $Q_n(j_b, j_1)$ to be the posterior $P(j_b, j_1|\mathbf{x}^n)$ for every image $\mathbf{x}^n$. Since we have learned the background, we can use the posterior probability $P(j_b|\mathbf{x}^n)$ (computed as described in section A.1) to find the most probable transformation $j_b^n$ that best explains image $\mathbf{x}^n$, and then we approximate $Q_n(j_b)$ so that it gives probability one for $j_b = j_b^n$ and zero for the remaining values.4 Thus, $F_1$ takes the form

$$F_1 = \sum_{n=1}^{N} \sum_{j_1=1}^{J_f} Q_n(j_1) \left\{ \sum_{p=1}^{P} \log p(x_p^n|j_b^n, j_1) - \log Q_n(j_1) \right\} + \text{const}, \qquad (2.12)$$
4 It would be possible to make a "softer" version of this, where the transformations are weighted by their posterior probabilities, but in practice, we have found that these probabilities are usually 1 for the best-fitting transformation and 0 otherwise after learning.
where const depends on the uniform probabilities $P_{j_b}$ and $P_{j_1}$. Also, the dependence of $Q_n(j_1)$ on $j_b^n$ has for simplicity been omitted from our notation. Maximization of $F_1$ can be carried out by the EM algorithm, where in the E-step we maximize $F_1$ with respect to the $Q_n$ distributions (see equation A.6) and in the M-step with respect to the object parameters $\{\mathbf{f}_1, \boldsymbol{\pi}_1, \sigma_1^2\}$. Thus, the computational complexity for learning the object has been kept to a minimum, since we have only to search over the $J_f$ possible transformations of the object. Recall that the pixel densities $p_{f_1}$ and $p_b$ are robustified, which allows us to deal with occlusion that can be caused by the remaining $L - 1$ not-yet-discovered objects.

2.4.3 Learning Further Objects. The algorithm for learning the second and subsequent foreground objects is a bit more complicated, as we subtract out the objects learned so far. We first describe how we learn the second object and then generalize to the case of the $\ell$th object. To learn a second foreground object, we first allow the mask $\boldsymbol{\pi}_2$ to take on nonzero values, so that the conditional density of $\mathbf{x}$ given the hidden transformations becomes $p(\mathbf{x}|j_b, j_1, j_2) = \prod_{p=1}^{P} p(x_p|j_b, j_1, j_2)$, where $p(x_p|j_b, j_1, j_2)$ is given by equation 2.9. We learn the second object by maximizing a lower bound of the log likelihood $L_2 = \sum_{n=1}^{N} \log \sum_{j_b, j_1, j_2} P_{j_b} P_{j_1} P_{j_2}\, p(\mathbf{x}^n|j_b, j_1, j_2)$. In particular, since we have learned the background and the first object, we use the most probable transformations $j_b^n$ and $j_1^n$ that explain image $\mathbf{x}^n$ to lower-bound the log likelihood $L_2$:

$$F_2 = \sum_{n=1}^{N} \sum_{j_2=1}^{J_f} Q_n(j_2) \left\{ \sum_{p=1}^{P} \log p(x_p^n|j_b^n, j_1^n, j_2) - \log Q_n(j_2) \right\} + \text{const}. \qquad (2.13)$$
$F_2$ can be tractably optimized over $Q_n(j_2)$ and over the parameters of the second object $\{\mathbf{f}_2, \boldsymbol{\pi}_2, \sigma_2^2\}$. However, we can make the search for the second object much more efficient (ensuring that we will find a different object) by further constraining equation 2.13 so that all the pixels belonging to the first object are removed from consideration. First, note that the values of the transformed mask $T_{j_1^n}\boldsymbol{\pi}_1$ will be close to 1 for all pixels of image $\mathbf{x}^n$ that are part of the first object. All of these pixels should be removed from consideration unless they are occluded by other not-yet-discovered objects. Thus, we consider the vector $\boldsymbol{\rho}_1^n = (T_{j_1^n}\boldsymbol{\pi}_1) * \mathbf{r}^{j_1^n}$, where $\mathbf{y} * \mathbf{z}$ denotes the element-wise product of the vectors $\mathbf{y}$ and $\mathbf{z}$ and

$$r_p^{j_1^n} = \frac{\alpha_f N(x_p^n; (T_{j_1^n}\mathbf{f}_1)_p, \sigma_1^2)}{\alpha_f N(x_p^n; (T_{j_1^n}\mathbf{f}_1)_p, \sigma_1^2) + (1 - \alpha_f) U(x_p^n)}.$$

Thus, $\boldsymbol{\rho}_1^n$ will roughly give values close to 1 only for the nonoccluded object pixels of image $\mathbf{x}^n$, and these are the pixels that we wish to remove from consideration. Now, considering $(\rho_1^n)_p$ as the probability according to which pixel $p$ of image $\mathbf{x}^n$ is part of the first object, we once again lower-bound $F_2$ using the inequality $\log \sum_i y_i = \log \left( \sum_i p_i \frac{y_i}{p_i} \right) \geq \sum_i p_i \log \frac{y_i}{p_i}$ (obtained from
Jensen's inequality) to obtain

$$F_2 = \sum_{n=1}^{N} \sum_{j_2=1}^{J_f} Q_n(j_2) \left\{ \sum_{p=1}^{P} (\rho_1^n)_p \log\{(T_{j_1^n}\boldsymbol{\pi}_1)_p\, p_{f_1}(x_p^n; (T_{j_1^n}\mathbf{f}_1)_p)\} + (\bar{\rho}_1^n)_p \log\{(1 - T_{j_1^n}\boldsymbol{\pi}_1)_p [(T_{j_2}\boldsymbol{\pi}_2)_p\, p_{f_2}(x_p^n; (T_{j_2}\mathbf{f}_2)_p) + (1 - T_{j_2}\boldsymbol{\pi}_2)_p\, p_b(x_p^n; (T_{j_b^n}\mathbf{b})_p)]\} - \log Q_n(j_2) \right\} + \text{const}, \qquad (2.14)$$

where $\bar{\boldsymbol{\rho}}_1^n = \mathbf{1} - \boldsymbol{\rho}_1^n$ and the const is a constant term containing the entropic term $-\sum_{n=1}^{N} \sum_{p=1}^{P} \{(\rho_1^n)_p \log (\rho_1^n)_p + (\bar{\rho}_1^n)_p \log (\bar{\rho}_1^n)_p\}$ plus terms involving the uniform probabilities $\{P_{j_b}, P_{j_1}, P_{j_2}\}$. Since the parameters of the first object are fixed, the above quantity can be further written as
$$F_2 = \sum_{n=1}^{N} \sum_{j_2=1}^{J_f} Q_n(j_2) \left\{ \sum_{p=1}^{P} (\bar{\rho}_1^n)_p \log [(T_{j_2}\boldsymbol{\pi}_2)_p\, p_{f_2}(x_p^n; (T_{j_2}\mathbf{f}_2)_p) + (1 - T_{j_2}\boldsymbol{\pi}_2)_p\, p_b(x_p^n; (T_{j_b^n}\mathbf{b})_p)] - \log Q_n(j_2) \right\} + \text{const}. \qquad (2.15)$$
Note that when $(\bar{\rho}_1^n)_p \simeq 0$ for a pixel $p$ of image $\mathbf{x}^n$, this pixel is removed from consideration (in a probabilistic fashion) according to equation 2.15.

Further objects are learned similarly to the two-object case, except that the pixels of all previously learned foreground objects should be removed from consideration. This is achieved by setting $\mathbf{z}_0^n = \mathbf{1}$ for all $n = 1, \ldots, N$ and using the recursion $\mathbf{z}_\ell^n = \mathbf{z}_{\ell-1}^n * \bar{\boldsymbol{\rho}}_\ell^n$. Note that $\mathbf{z}_1^n = \bar{\boldsymbol{\rho}}_1^n$. For object $\ell$, the objective function $F_\ell$ given in equation 2.16 is optimized to yield $\{\mathbf{f}_\ell, \boldsymbol{\pi}_\ell, \sigma_\ell^2\}$.

Note that the GREEDY algorithm treats the background and the rest of the L objects differently, since pixels ascribed to the nonoccluded background are not removed from consideration as is the case for the foreground objects. We implemented an alternative greedy algorithm that treats the background similarly to the remaining objects (removing nonoccluded background pixels from consideration). However, this algorithm did not work so well in practice, as some pixels can wrongly be removed from consideration because their values happen to agree with pixels of the occluding object. For the background, the number of such pixels can be large, since the background is always occluded by all of the L objects. We observed experimentally that this can result in noisy estimates for some of the L foreground
objects since many of their pixels are accidentally removed from consideration after the background is learned. On the other hand, the case of the foreground objects is not problematic since occlusion occurs only in some images and is typically partial. 2.4.4 Summary of the GREEDY Algorithm. 1. Learn the background and infer the most probable transformation jnb for each image xn . 2. Initialize the vectors zn0 D 1 for n D 1; : : : ; N. 3. For ` D 1 to L:
² Learn the `th object parameters ff` ; ¼ ` ; ¾`2 g by maximizing F` using EM algorithm, where
F_ℓ = Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) { Σ_{p=1}^P (z^n_{ℓ−1})_p log[ (T_{j_ℓ} π_ℓ)_p p_{f_ℓ}(x_p; (T_{j_ℓ} f_ℓ)_p) + (1 − T_{j_ℓ} π_ℓ)_p p_b(x_p; (T_{j_b^n} b)_p) ] − log Q^n(j_ℓ) }.   (2.16)
• Infer the most probable transformation j_ℓ^n, and update the weights z_ℓ^n = z_{ℓ−1}^n ∗ ρ_ℓ^n, where ρ_ℓ^n is computed as described in the text.

The update equations used at any stage of the above algorithm are given in the appendix.

3 Experiments

We describe five experiments extracting movable objects from images using static, moving, and random backgrounds. In these experiments, the uniform distribution U(x_p) is based on the maximum and minimum pixel values of all training image pixels. In all the experiments reported below, except experiment 5, the parameters α_f and α_b were chosen to be 0.9.⁵ In experiment 5, α_b was set to 1. Also, we assume that the total number of objects L that appear in the images is known; thus, the GREEDY algorithm terminates when we discover the Lth object. To apply the GREEDY algorithm, we have to initialize the model parameters at each stage. We first describe how we initialize the background parameters, as the background is learned at the first stage of the algorithm. The background appearance b corresponds to an image that is larger than

5 These parameters could be learned with some additional care, but in this implementation, we do not do so.
the input image size. We initialize the centered P_x × P_y block of b to be equal to the mean of the training images. The rest of the pixels of b are initialized by repeating the border lines of pixels in the centered block of b and then adding gaussian noise to these pixels. The variance σ_b² is initialized to a large value (a much larger value than the overall variance of all image pixels⁶). The parameters of an object learned at each subsequent stage of the GREEDY algorithm are always initialized in the same way. Each element of the mask π is initialized to 0.5 and the variance σ_ℓ² to a large value equal to the σ_b² initial value. To initialize the foreground appearance f_ℓ, we compute the pixelwise mean of the training images and add independent gaussian noise with equal variances at each pixel, where the variance is set large enough that the range of pixel values found in the training images can be explored. In all of the experiments described below, the above initialization scheme proved effective, and we obtained good results by performing one or two runs of the GREEDY algorithm. At each stage of the algorithm, typically 100 iterations were sufficient to reach convergence.

3.1 Experiment 1. Figure 1 illustrates the detection of two objects against a static background.⁷ Some examples of the 44 training images of size 118 × 248 (excluding the black border) are shown in Figure 1a, and results are shown for the GREEDY algorithm in Figure 1b. For both objects, we show both the learned mask and the element-wise product of the learned foreground and mask. In most runs, the person with the striped shirt (Frey) is discovered first. It is interesting to comment on how the GREEDY algorithm operates in case the person with the lighter shirt (Jojic) is found first. As explained in section 2.4, once an object is discovered by the GREEDY algorithm, roughly speaking its nonoccluded pixels are removed from consideration.
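The weight recursion z_ℓ^n = z_{ℓ−1}^n ∗ ρ_ℓ^n that implements this removal can be sketched in a few lines. This is an illustrative numpy sketch, not the article's code; the toy array values are made up for the example:

```python
import numpy as np

def update_mask_weights(z_prev, rho):
    """One step of the GREEDY weight recursion z_l = z_{l-1} * rho_l.

    z_prev : (P,) array -- weight of each pixel after stage l-1
             (1 means "still in play", ~0 means "already explained").
    rho    : (P,) array -- probability that each pixel is NOT explained
             by the object just learned (stand-in for rho_l in the text).
    """
    return z_prev * rho  # element-wise product, as in the recursion

# Toy example: 6 pixels, all in play initially (z_0 = 1).
z = np.ones(6)
rho1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0])  # object 1 explains pixels 2-3
z = update_mask_weights(z, rho1)                  # z_1 = rho_1
rho2 = np.array([0.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # object 2 explains pixel 0
z = update_mask_weights(z, rho2)
print(z)  # pixels 0, 2, and 3 are now removed from consideration
```

Each stage thus multiplies in a new "not yet explained" probability, so a pixel claimed by any earlier object keeps a near-zero weight in all later objective functions.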
Figure 2 illustrates this point for two frames of the video sequence—one without occlusion and one with occlusion. Figure 2a shows the two video frames and Figure 2b the pixels (displayed in white) that are masked out from the next run. Note that when Frey occludes Jojic, the white stripes of Frey’s shirt are accounted for by the Jojic model. This is because the white color of these stripes agrees with the learned white color of Jojic’s shirt. This does not cause problems for learning the second object (Frey), as there are many frames where the occlusion does not take place. In other experiments with two people wearing different colored clothes, no such effect takes place. Video sequences of the raw data and the extracted objects can be viewed on-line at http://www.dai.ed.ac.uk/homes/s0129556/lmo.html.
6 In our experiments, the input image pixels are normalized to lie in [0, 1], and the background variance σ_b², as well as any foreground object variance σ_ℓ², is initialized to 2.
7 These data are used in Jojic and Frey (2001). We thank N. Jojic and B. Frey for making these data available on-line via http://www.psi.toronto.edu/layers.html.
Figure 1: Learning two objects against a static background. (a) Some frames of the training images. (b) The two objects and background found by the GREEDY algorithm. The plots in the upper row of b show the masks π_1 and π_2. The first two plots in the lower row of b display the element-wise products π_1 ∗ f_1 and π_2 ∗ f_2, while the third plot displays the background b.
3.2 Experiment 2. We also conducted an interesting variant on experiment 1. Rather than walking independently, two people now move together, keeping the same distance apart. This led to the extraction of a mask containing both people. Note that this is expected, since the pixels of the two people can be explained by the same transformation and so are considered as one object. Of course, it is open to debate whether we would wish to think of what is learned as one or two objects. In our opinion, the ability to extract such regularities is very sensible and quite widespread (e.g., in finding pairs
Figure 2: What the GREEDY algorithm removes from consideration once Jojic is found. (a) Two frames of the training images. (b) The corresponding ρ_1 vectors (see section 2.4) that indicate the pixels masked out from the second iteration.
of eyes). If it was desired, it would be a simple matter to run a connected-components algorithm on the thresholded mask to pick out the two objects.

3.3 Experiment 3. In the data shown in Figure 3, two objects move against a moving background. Figure 3a shows some of the 36 images of size 70 × 140 from the video sequence. Note that the background changes from frame to frame because of the camera’s movement. Notice also that there is motion blur in some of the frames and that one person is occluded by the other in many frames as they walk in the same direction. Figure 3b shows the results of the GREEDY algorithm, where at the first stage we find the background, and at the next two stages the moving objects are found.

3.4 Experiment 4. In Figure 4, five objects are learned against a static background, using a data set of 80 images of size 66 × 88. Notice the large amount of occlusion in some of the training images shown in Figure 4a. Results are shown in Figure 4b for the GREEDY algorithm.

3.5 Experiment 5. In Figure 5, we consider learning objects against random backgrounds. Actually, three different backgrounds were used, as can be seen in the example images shown in Figure 5a. Note that in this case, we set α_b = 1, since robustifying a random background does not make sense. There were 67 images of size 66 × 88 in the training set. The results with the GREEDY algorithm are shown in Figure 5b.
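The connected-components suggestion above can be sketched directly. This is an illustrative Python sketch with a made-up toy mask; in practice a library routine such as scipy.ndimage.label would do the same job:

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """Label the 4-connected components of a binary mask with a simple BFS."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                      # pixel already labeled
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:                      # flood-fill this component
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if (0 <= rr < mask.shape[0] and 0 <= cc < mask.shape[1]
                        and mask[rr, cc] and not labels[rr, cc]):
                    labels[rr, cc] = current
                    queue.append((rr, cc))
    return labels, current

# A toy learned mask pi containing two separate "people"; thresholding at
# 0.5 and labeling picks the two objects apart.
pi = np.array([[0.9, 0.8, 0.0, 0.0, 0.0],
               [0.9, 0.7, 0.0, 0.8, 0.9],
               [0.0, 0.0, 0.0, 0.7, 0.8]])
labels, n = connected_components(pi > 0.5)
print(n)  # -> 2
```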
Figure 3: Learning two objects against a moving background. (a) Some frames of the training images. (b) The panorama-background and the masks and rendered objects found by the GREEDY algorithm. To show the rendered objects, we reverse our usual convention and show the objects against a light background, as the objects themselves are mainly dark.
In some other experiments using a few random backgrounds, our algorithm has not worked well. In these cases, it seems that the foreground models tend to model structure that appears in some backgrounds rather than the foreground objects. These problems might be overcome by using more random backgrounds, as this would fit the random background assumptions better.
Figure 4: Learning five objects against a static background. (a) Some of the training images. (b) The masks and objects (displayed as described in the caption of Figure 1) learned by the GREEDY algorithm.
4 Discussion

The starting point for this work is a full factorial model for the data, instantiating multiple objects in their correct positions. However, as we have seen, a direct search over all O(J_b J_f^L) values of the hidden variables is not feasible. Rather than use approximate simultaneous inference of the hidden variables, we have developed a sequential method that extracts the background and foreground objects one at a time from the input images. This is achieved by robustifying the generative model so that occlusions of either foreground or background can be tolerated. The results show that this GREEDY algorithm is very effective at finding the background and foreground objects in the data.

It is interesting to compare our work with that of Shams and von der Malsburg (1999). They obtained candidate parts by matching images in a pairwise fashion, trying to identify corresponding regions in the two images. These candidate image patches were then clustered to compensate for the effect of occlusions. We make four observations. (1) Instead of directly
Figure 5: Two objects are learned from a set of images with three different backgrounds. (a) Some examples of the training images. (b) The masks and objects found by the GREEDY algorithm, displayed as described in the caption of Figure 1.
learning the models, they match each image against all others (with complexity O(N²)), as compared to the linear scaling with N in our method. (2) In their method, the background must be removed; otherwise it would give rise to large match regions. (3) They do not define a probabilistic model for the images (with all its attendant benefits). (4) Their data (although based on realistic CAD-type models) are synthetic and designed to focus learning on shape-related features by eliminating complicating factors such as background and surface markings.

If video sequence data are available, then it is possible to compute optical flow information, and this can be used as a cue to discover objects by
clustering flow vectors into “layers.” Some early work on this topic is by Wang and Adelson (1994), and an example of more recent work is that of Tao, Sawhney, and Kumar (2000). Note that our method does not require a video sequence and can be applied to unordered collections of images, as illustrated in experiments 4 and 5. Also, problems can arise for optical-flow-based methods in regions of low texture, where flow information can be sparse.

In our work, the model for each pixel is a mixture of gaussians. There is some previous work on pixelwise mixtures of gaussians (see, e.g., Rowe & Blake, 1995), which can, for example, be used to achieve background subtraction and highlight moving objects against a static background. Our work extends beyond this by gathering the foreground pixels into objects and also allows us to learn objects in the more difficult nonstatic background case.

The GREEDY method has an analog in neural network methods for principal components analysis (PCA). To carry out PCA, we can extract the principal component using Hebbian learning. If we then subtract the projection of the input onto the principal direction, we can again use Hebbian learning to extract the second principal component, and so on (Sanger, 1989). This process parallels the successive discovery of objects in our method. However, we note that this sequential algorithm cannot be used if a full factor analysis model (with different noise variances on different visible dimensions) is to be learned.

The GREEDY algorithm has shown itself to be an effective factorial learning algorithm for image data. We are currently investigating issues such as dealing with richer classes of transformations, detecting L automatically, and allowing objects not to appear in all images. Furthermore, although we have described this work in relation to image modeling, it can be applied to other domains.
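The deflation analogy (extract a component, subtract its projection, repeat) can be sketched as follows. This is an illustrative numpy sketch on synthetic data, using power iteration in place of Hebbian learning; the function name and data are our own:

```python
import numpy as np

def sequential_pca(X, k, iters=200):
    """Extract k principal components one at a time by deflation:
    find the leading direction, remove its projection from X, repeat."""
    X = X - X.mean(axis=0)
    comps = []
    for _ in range(k):
        w = np.random.default_rng(0).normal(size=X.shape[1])
        for _ in range(iters):          # power iteration on X^T X
            w = X.T @ (X @ w)
            w /= np.linalg.norm(w)
        comps.append(w)
        X = X - np.outer(X @ w, w)      # deflate: subtract the projection
    return np.array(comps)

rng = np.random.default_rng(1)
# Synthetic data with variance mostly along the first axis, then the second.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])
W = sequential_pca(X, 2)
# Successive directions come out orthogonal, like successive principal
# components found one at a time.
print(abs(W[0] @ W[1]) < 1e-6)
```

As the text notes, this one-at-a-time scheme works because each deflation removes exactly the variance already explained, mirroring how nonoccluded pixels of a discovered object are removed from consideration.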
For example, one can apply the GREEDY approach to fitting mixture models, as we will describe in a forthcoming article. Also, one can make a model for sequence data by having hidden Markov models (HMMs) for a foreground pattern and the background. Faced with sequences containing multiple foreground patterns, one could extract these patterns sequentially using an algorithm similar to that described above. It is true that for sequence data, it would be possible to train a compound HMM consisting of L + 1 HMM components simultaneously, but there may be severe local minima problems in the search space, so the sequential approach might be preferable.

Appendix: Details of the GREEDY Algorithm

We introduce some notation. If y and z are two vectors of the same size, then y ∗ z defines the element-wise product between these vectors, and y ∗ y is written as y² for compactness. Similarly, the element-wise division between two vectors is denoted by y ./ z. A vector containing 1s is denoted by 1. Also,
summations of the form Σ_{p=1}^P y_p z_p are written in vector notation as y^T z; for example, y^T 1 denotes the sum of the elements of y.

In our implementation, the transformation matrices of the foreground objects T_{j_ℓ} are permutation matrices. However, our derivations regarding these matrices require only two constraints: (1) that the value of each element of T_{j_ℓ} π_ℓ is a valid probability (i.e., lies in [0, 1]) and (2) that log(T_{j_ℓ} π_ℓ) = T_{j_ℓ} log π_ℓ and log(1 − T_{j_ℓ} π_ℓ) = T_{j_ℓ} log(1 − π_ℓ), where log v denotes the element-wise logarithm of a vector v. These constraints certainly hold for matrices that have one 1 (and the other entries 0) in each row.

A.1 Learning the Background. Here we derive the EM algorithm for learning a static or moving background. Learning the background constitutes the first stage of the GREEDY algorithm and is carried out by maximizing the following log likelihood:

L_b = Σ_{n=1}^N log Σ_{j_b=1}^{J_b} P_{j_b} Π_{p=1}^P { α_b N(x_p^n; (T_{j_b} b)_p, σ_b²) + (1 − α_b) U(x_p^n) }.   (A.1)
Clearly, this log likelihood corresponds to a mixture model (with J_b components) where the component densities are factorized over the pixels and each pixel density is a two-component mixture. Application of EM is straightforward, and we can easily show that the expected complete-data log likelihood in the EM framework is

Q_b = Σ_{n=1}^N Σ_{j_b=1}^{J_b} P(j_b | x^n) { (r^n_{j_b})^T [ −(1/(2σ_b²)) (x^n − T_{j_b} b)² − (1/2) log σ_b² 1 ] } + const,   (A.2)

where P(j_b | x^n) = P_{j_b} p(x^n | j_b) / Σ_{i=1}^{J_b} P_i p(x^n | i) is the posterior probability of the transformation hidden variable j_b given the image x^n, and r^n_{j_b} is a length-P vector whose pth element stores the probability according to which the pth pixel of image x^n is part of the nonoccluded background given j_b:

(r^n_{j_b})_p = α_b N(x_p^n; (T_{j_b} b)_p, σ_b²) / [ α_b N(x_p^n; (T_{j_b} b)_p, σ_b²) + (1 − α_b) U(x_p^n) ].
In the E-step of the algorithm, P(j_b | x^n) and r^n_{j_b} are obtained using the current parameter values. In the M-step, the Q function is maximized with respect to the parameters {b, σ_b²}, giving the following update equations:

b ← Σ_{n=1}^N Σ_{j_b=1}^{J_b} P(j_b | x^n) [T^T_{j_b} (r^n_{j_b} ∗ x^n)] ./ Σ_{n=1}^N Σ_{j_b=1}^{J_b} P(j_b | x^n) [T^T_{j_b} r^n_{j_b}],   (A.3)

σ_b² ← Σ_{n=1}^N Σ_{j_b=1}^{J_b} P(j_b | x^n) [(r^n_{j_b})^T (x^n − T_{j_b} b)²] / Σ_{n=1}^N Σ_{j_b=1}^{J_b} P(j_b | x^n) [(r^n_{j_b})^T 1].   (A.4)

The above equations provide an exact M-step. The update for the background appearance b is very intuitive. For example, consider the case when P(j_b | x^n) = 1 for j_b = j* and 0 otherwise. For pixels that are ascribed to nonoccluded background (i.e., (r^n_{j_b})_p ≈ 1), the values of x^n are transformed by T^T_{j*}, which maps the P_x × P_y image x^n into a larger image of size M_x × M_y so that x^n is located in the position specified by j_b and the rest of the image pixels are filled with zero values. Thus, the nonoccluded pixels found in each training image are located properly in the big panorama image and averaged to produce b. Note also that in the special case where the background is static, the effect of the transformation j_b is removed from all update equations, and the parameters b and σ_b² are updated according to b ← Σ_{n=1}^N (r^n ∗ x^n) ./ Σ_{n=1}^N r^n and σ_b² ← Σ_{n=1}^N (r^n)^T (x^n − b)² / Σ_{n=1}^N (r^n)^T 1, respectively.
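The static-background special case of these updates fits in a compact EM loop. This is an illustrative numpy sketch with synthetic data; the function name, toy inputs, and iteration count are ours, not the article's:

```python
import numpy as np

def em_static_background(X, alpha=0.9, iters=50):
    """Robust EM for a static background (the special case noted above):
    each pixel is background N(b_p, sigma2) with prob alpha, else uniform.

    X : (N, P) array of N images with P pixels each, values in [0, 1].
    """
    U = 1.0                 # uniform density on [0, 1]
    b = X.mean(axis=0)
    sigma2 = 2.0            # large initial variance, as in the text
    for _ in range(iters):
        # E-step: responsibility that each pixel is nonoccluded background.
        gauss = np.exp(-(X - b) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        r = alpha * gauss / (alpha * gauss + (1 - alpha) * U)
        # M-step: b <- sum_n(r * x) ./ sum_n r ; sigma2 <- weighted variance.
        b = (r * X).sum(axis=0) / r.sum(axis=0)
        sigma2 = (r * (X - b) ** 2).sum() / r.sum()
    return b, sigma2

# Toy data: a fixed background occasionally occluded by a bright object.
rng = np.random.default_rng(0)
true_b = np.linspace(0.2, 0.6, 20)
X = true_b + 0.01 * rng.normal(size=(100, 20))
X[:30, 5:10] = 0.95   # an occluder covers pixels 5-9 in the first 30 images
b, sigma2 = em_static_background(X)
print(np.max(np.abs(b - true_b)) < 0.05)
```

The responsibilities r downweight the occluded observations, so the recovered b matches the true background even where an object covered it in a third of the images.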
For random backgrounds, the EM algorithm is not needed. In this case, we simply set b to the mean of all training images and σ_b² to the mean of all the different pixel variances. These background parameters are kept fixed for later stages.

A.2 Learning the Foreground Objects. Assume that we have already found the background as described previously. At each subsequent stage, the GREEDY algorithm searches for a foreground object. Below we describe how the ℓth foreground object is found, where ℓ = 1, …, L. When we search for the ℓth object, the background as well as the ℓ − 1 foreground objects have been found in previous stages.⁸ As explained in section 2.4.3, we learn the ℓth object by maximizing the objective function:
F_ℓ = Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) { Σ_{p=1}^P (z^n_{ℓ−1})_p log[ (T_{j_ℓ} π_ℓ)_p p_{f_ℓ}(x_p^n; (T_{j_ℓ} f_ℓ)_p) + (1 − T_{j_ℓ} π_ℓ)_p p_b(x_p^n; (T_{j_b^n} b)_p) ] − log Q^n(j_ℓ) }.   (A.5)
The above maximization can be done by a variational EM algorithm. In the

8 Of course, when we search for the first object, there will be no previously learned foreground objects.
E-step, we maximize F_ℓ with respect to the Q^n(j_ℓ), which gives

Q^n(j_ℓ) ∝ P_{j_ℓ} exp{ Σ_{p=1}^P (z^n_{ℓ−1})_p log[ (T_{j_ℓ} π_ℓ)_p p_{f_ℓ}(x_p^n; (T_{j_ℓ} f_ℓ)_p) + (1 − T_{j_ℓ} π_ℓ)_p p_b(x_p^n; (T_{j_b^n} b)_p) ] },   (A.6)
with Q^n(j_ℓ) normalized to sum to one. In the M-step, we maximize F_ℓ with respect to the object parameters {f_ℓ, π_ℓ, σ_ℓ²}. For this maximization, we need the EM algorithm again. The EM algorithm operates on the following Q function:

Q_ℓ = Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) (z^n_{ℓ−1})^T [ s^n_{j_ℓ} ∗ log T_{j_ℓ} π_ℓ + (1 − s^n_{j_ℓ}) ∗ log(1 − T_{j_ℓ} π_ℓ) + s^n_{j_ℓ} ∗ r^n_{j_ℓ} ∗ ( −(1/(2σ_ℓ²)) (x^n − T_{j_ℓ} f_ℓ)² − (1/2) log σ_ℓ² 1 ) ] + const,   (A.7)

where each element of the vector s^n_{j_ℓ} stores the value

(s^n_{j_ℓ})_p = (T_{j_ℓ} π_ℓ)_p p_{f_ℓ}(x_p^n; (T_{j_ℓ} f_ℓ)_p) / [ (T_{j_ℓ} π_ℓ)_p p_{f_ℓ}(x_p^n; (T_{j_ℓ} f_ℓ)_p) + (1 − (T_{j_ℓ} π_ℓ)_p) p_b(x_p^n; (T_{j_b^n} b)_p) ],

expressing the probability that the pixel is part of the object. Each element of r^n_{j_ℓ} stores the probability that the pixel is nonoccluded:

(r^n_{j_ℓ})_p = α_f N(x_p^n; (T_{j_ℓ} f_ℓ)_p, σ_ℓ²) / [ α_f N(x_p^n; (T_{j_ℓ} f_ℓ)_p, σ_ℓ²) + (1 − α_f) U(x_p^n) ].

The algorithm in the E-step computes the quantities Q^n(j_ℓ), s^n_{j_ℓ}, and r^n_{j_ℓ} as described above, and in the M-step, we update the parameters as follows:
π_ℓ ← Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) T^T_{j_ℓ} [z^n_{ℓ−1} ∗ s^n_{j_ℓ}] ./ Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) T^T_{j_ℓ} z^n_{ℓ−1},   (A.8)

f_ℓ ← Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) T^T_{j_ℓ} [z^n_{ℓ−1} ∗ s^n_{j_ℓ} ∗ r^n_{j_ℓ} ∗ x^n] ./ Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) T^T_{j_ℓ} [z^n_{ℓ−1} ∗ s^n_{j_ℓ} ∗ r^n_{j_ℓ}],   (A.9)

σ_ℓ² ← Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) (z^n_{ℓ−1})^T [s^n_{j_ℓ} ∗ r^n_{j_ℓ} ∗ (x^n − T_{j_ℓ} f_ℓ)²] / Σ_{n=1}^N Σ_{j_ℓ=1}^{J_f} Q^n(j_ℓ) (z^n_{ℓ−1})^T [s^n_{j_ℓ} ∗ r^n_{j_ℓ}].   (A.10)
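As a sanity check on the form of updates A.8 to A.10, here is an illustrative numpy sketch of the M-step for the simplified case of a single known (identity) transformation with Q^n(j*) = 1; the function and variable names are ours:

```python
import numpy as np

def m_step(X, z, s, r):
    """Simplified M-step for one object (T = identity, Q^n(j*) = 1).
    X, z, s, r are (N, P) arrays: N images, P pixels; z, s, r are the
    per-pixel weights defined in the text."""
    w = z * s                                              # pixels ascribed to the object
    pi = w.sum(axis=0) / z.sum(axis=0)                     # A.8
    f = (w * r * X).sum(axis=0) / (w * r).sum(axis=0)      # A.9
    sigma2 = (w * r * (X - f) ** 2).sum() / (w * r).sum()  # A.10
    return pi, f, sigma2

N, P = 50, 8
rng = np.random.default_rng(0)
f_true = np.full(P, 0.7)
X = f_true + 0.02 * rng.normal(size=(N, P))
z = np.ones((N, P))   # no earlier objects claimed any pixel
s = np.ones((N, P))   # every pixel ascribed to this object
r = np.ones((N, P))   # no occlusion
pi, f, sigma2 = m_step(X, z, s, r)
print(np.allclose(f, f_true, atol=0.02), np.allclose(pi, 1.0))
```

With all weights equal to 1, the updates reduce to a plain average over images, which is the intuitive behavior described next.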
As with the updates for b and σ_b², these updates make intuitive sense. Consider, for example, the ℓth appearance model f_ℓ when Q^n(j_ℓ) = 1 for
j_ℓ = j* and 0 otherwise. For pixels that are ascribed to the ℓth foreground and are not occluded (i.e., (z^n_{ℓ−1} ∗ s^n_{j*} ∗ r^n_{j*})_p ≈ 1), the values in x^n are transformed by T^T_{j*} (which is T^{−1}_{j*}, as the transformations are permutation matrices). This removes the effect of the transformation and thus allows the foreground pixels found in each training image to be averaged to produce f_ℓ.

Acknowledgments

C.W. thanks Geoff Hinton for helpful discussions concerning the idea of learning one object at a time, Zoubin Ghahramani for helpful discussions on the GREEDY algorithm, and Andrew Fitzgibbon for pointers to the “layers” literature. We thank the anonymous referees for their comments, which helped improve the article.

References

Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Black, M. J., & Jepson, A. (1996). EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. In B. Buxton & R. Cipolla (Eds.), Proceedings of the Fourth European Conference on Computer Vision, ECCV’96 (pp. 329–342). Berlin: Springer-Verlag.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Frey, B. J., & Jojic, N. (1999). Estimating mixture models of images and inferring spatial transformations using the EM algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1999. Fort Collins, CO: IEEE Computer Society Press.
Frey, B. J., & Jojic, N. (2002). Fast, large-scale transformation-invariant clustering. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 721–727). Cambridge, MA: MIT Press.
Frey, B. J., & Jojic, N. (2003). Transformation invariant clustering using the EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1), 1–17.
Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 617–624). San Mateo, CA: Morgan Kaufmann.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6. San Mateo, CA: Morgan Kaufmann.
Jojic, N., & Frey, B. J. (2001). Learning flexible sprites in video layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2001. Kauai, HI: IEEE Computer Society Press.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm. New York: Wiley.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer.
Rowe, S., & Blake, A. (1995). Statistical background modelling for tracking with a virtual camera. In D. Pycock (Ed.), Proceedings of the 6th British Machine Vision Conference (Vol. 2, pp. 423–432). Sheffield, UK: BMVA Press.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Shams, L., & von der Malsburg, C. (1999). Are object shape primitives learnable? Neurocomputing, 26–27, 855–863.
Tao, H., Sawhney, H. S., & Kumar, R. (2000). Dynamic layer representation with applications to tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society.
Wang, J. Y. A., & Adelson, E. H. (1994). Representing moving images with layers. IEEE Transactions on Image Processing, 3(5), 625–638.
Williams, C. K. I., & Titsias, M. K. (2003). Learning about multiple objects in images: Factorial learning without factorial search. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Received March 13, 2003; accepted October 7, 2003.
LETTER
Communicated by Massimiliano Pontil
Are Loss Functions All the Same?

Lorenzo Rosasco
[email protected]
INFM–DISI, Università di Genova, 16146 Genova, Italy

Ernesto De Vito
[email protected]
Dipartimento di Matematica, Università di Modena, 41100 Modena, Italy, and INFN, Sezione di Genova, 16146 Genova, Italy

Andrea Caponnetto
[email protected]
DISI, Università di Genova, 16146 Genova, Italy

Michele Piana
[email protected]
INFM–DIMA, Università di Genova, 16146 Genova, Italy

Alessandro Verri
[email protected]
INFM–DISI, Università di Genova, 16146 Genova, Italy
In this letter, we investigate the impact of choosing different loss functions from the viewpoint of statistical learning theory. We introduce a convexity assumption, which is met by all loss functions commonly used in the literature, and study how the bound on the estimation error changes with the loss. We also derive a general result on the minimizer of the expected risk for a convex loss function in the case of classification. The main outcome of our analysis is that for classification, the hinge loss appears to be the loss of choice. Other things being equal, the hinge loss leads to a convergence rate practically indistinguishable from the logistic loss rate and much better than the square loss rate. Furthermore, if the hypothesis space is sufficiently rich, the bounds obtained for the hinge loss are not loosened by the thresholding stage.

1 Introduction

A main problem of statistical learning theory is finding necessary and sufficient conditions for the consistency of the empirical risk minimization principle. Traditionally, the role played by the loss is marginal, and the choice of

Neural Computation 16, 1063–1076 (2004) © 2004 Massachusetts Institute of Technology
which loss to use for which problem is usually regarded as a computational issue (Vapnik, 1995, 1998; Alon, Ben-David, Cesa-Bianchi, & Haussler, 1993; Cristianini & Shawe-Taylor, 2000). The technical results are usually derived in a form that makes it difficult to evaluate the role, if any, played by different loss functions.

The aim of this letter is to study the impact of choosing a different loss function from a purely theoretical viewpoint. By introducing a convexity assumption, which is met by all loss functions commonly used in the literature, we show that different loss functions lead to different theoretical behaviors. Our contribution is twofold. First, we extend the framework introduced in Cucker and Smale (2002b), based on the square loss for regression, to a variety of loss functions for both regression and classification, allowing for an effective comparison between the convergence rates achievable using different loss functions. Second, in the classification case, we show that for all convex loss functions, the sign of the minimizer of the expected risk coincides with the Bayes optimal solution. This can be interpreted as a consistency property supporting the meaningfulness of the convexity assumption at the basis of our study. This property is related to the problem of Bayes consistency (Lugosi & Vayatis, in press; Zhang, in press).

The main outcome of our analysis is that for classification, the hinge loss appears to be the loss of choice. Other things being equal, the hinge loss leads to a convergence rate that is practically indistinguishable from the logistic loss rate and much better than the square loss rate. Furthermore, the hinge loss is the only one for which, if the hypothesis space is sufficiently rich, the thresholding stage has little impact on the obtained bounds.

The plan of the letter is as follows. In section 2, we fix the notation and discuss the mathematical conditions we require on loss functions.
In section 3, we generalize the result in Cucker and Smale (2002b) to convex loss functions. In section 4, we discuss the convergence rates in terms of various loss functions and focus our attention on classification. In section 5, we summarize the obtained results.

2 Preliminaries

In this section, we fix the notation and then make explicit the mathematical properties required in the definition of loss functions that can be profitably used in statistical learning.

2.1 Notation. We denote with p(x, y) the density describing the probability distribution of the pair (x, y) with x ∈ X and y ∈ Y. The sets X and Y are compact subsets of R^d and R, respectively. We let Z = X × Y, and we recall that p(x, y) can be factorized in the form p(x, y) = p(y|x) · p(x), where p(x) is the marginal distribution defined over X and p(y|x) is the conditional
distribution.¹ We write the expected risk for a function f as

I[f] = ∫_Z V(f(x), y) p(x, y) dx dy,   (2.1)

for some nonnegative-valued function V, named the loss function, the properties of which we discuss in the next subsection. The ideal estimator—or target function—denoted with f_0 : X → R, is the minimizer of

min_{f ∈ F} I[f],
where F is the space of measurable functions for which I[f] is well defined. In practice, f_0 cannot be found, since the probability distribution p(x, y) is unknown. What is known is a training set D of examples, D = {(x_1, y_1), …, (x_ℓ, y_ℓ)}, obtained by drawing ℓ independent and identically distributed pairs in Z according to p(x, y). A natural estimate of the expected risk is given by the empirical risk

I_emp[f] = (1/ℓ) Σ_{i=1}^ℓ V(f(x_i), y_i).   (2.2)

The minimizer f_D of

min_{f ∈ H} I_emp[f]   (2.3)

can be seen as a coarse approximation of f_0. In equation 2.3, the search is restricted to a function space H, named the hypothesis space, allowing for the effective computation of the solution. A central problem of statistical learning theory is to find conditions under which f_D mimics the behavior of f_0.

2.2 RKHS and Hypothesis Space. The approximation of f_0 from a finite set of data is an ill-posed problem (Girosi, Jones, & Poggio, 1995; Evgeniou, Pontil, & Poggio, 2000). The treatment of the functional and numerical pathologies due to ill-posedness can be addressed by using regularization theory. The conceptual approach of regularization is to look for approximate solutions by setting appropriate smoothness constraints on the hypothesis space H. Within this framework, a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950) provides a natural choice for H (Wahba, 1990; Girosi et al., 1995; Evgeniou et al., 2000). In what follows, we briefly summarize the properties of RKHSs needed in the next section.

1 The results obtained throughout the article, however, hold for any probability measure on Z.
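As a concrete illustration of the empirical risk in equation 2.2, it can be computed directly for any candidate function. This is an illustrative Python sketch with made-up data, here using the square loss:

```python
import numpy as np

# I_emp[f] = (1/l) * sum_i V(f(x_i), y_i), for a chosen loss V.
def empirical_risk(f, V, xs, ys):
    return np.mean([V(f(x), y) for x, y in zip(xs, ys)])

square = lambda w, y: (w - y) ** 2   # the square loss V(w, y)
xs = np.array([0.0, 1.0, 2.0, 3.0])  # toy inputs
ys = np.array([0.1, 0.9, 2.2, 2.8])  # toy targets
f = lambda x: x                      # a simple candidate hypothesis
print(empirical_risk(f, square, xs, ys))  # -> 0.025 (approximately)
```

Minimizing this quantity over f ∈ H is exactly the search in equation 2.3.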
An RKHS is a Hilbert space H characterized by a symmetric positive definite function K(x, s), named a Mercer kernel (Aronszajn, 1950), K : X × X → R, such that K(·, x) belongs to H for all x ∈ X, and the following reproducing property holds:

f(x) = ⟨f, K(·, x)⟩_H.   (2.4)

For each given R > 0, we consider as hypothesis space H_R the ball of radius R in the RKHS H, or

H_R = {f ∈ H : ‖f‖_H ≤ R}.

We assume that K is a continuous function on X × X. It follows from equation 2.4 that H_R is a compact subset of C(X) in the sup norm topology,

‖f‖_∞ = sup_{x ∈ X} |f(x)|.

In particular, this implies that given f ∈ H_R, we have

‖f‖_∞ ≤ R C_K  with  C_K = sup_{x ∈ X} √K(x, x).

We conclude by observing that C_K is finite, since X is compact.

2.3 Loss Functions. In equations 2.1 and 2.2, the loss function V : R × Y → [0, +∞) represents the price we are willing to pay by predicting f(x) in place of y. The choice of the loss function is typically regarded as an empirical problem, the solution to which depends essentially on computational issues. We now introduce a mathematical requirement that a function needs to satisfy in order to be naturally thought of as a loss function.

We first notice that the loss function is always a true function of only one variable t, with t = w − y for regression and t = wy for classification. The basic assumption we make is that the mapping t → V(t) is convex for all t ∈ R. This convexity hypothesis has two technical implications (Rockafellar, 1970):

1. A loss function is a Lipschitz function; that is, for every M > 0, there exists a constant L_M > 0 such that

|V(w_1, y) − V(w_2, y)| ≤ L_M |w_1 − w_2|

for all w_1, w_2 ∈ [−M, M] and for all y ∈ Y.
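The classification losses plotted in Figure 1 can be checked numerically against the convexity assumption and against their common value 1 at t = 0. This is an illustrative sketch, not part of the formal development:

```python
import numpy as np

# The three classification losses, written as functions of t = wy.
square   = lambda t: (1 - t) ** 2
hinge    = lambda t: np.maximum(1 - t, 0.0)
logistic = lambda t: np.log(1 + np.exp(-t)) / np.log(2)  # (ln 2)^-1 normalization

for V in (square, hinge, logistic):
    assert abs(V(0.0) - 1.0) < 1e-12  # all three losses equal 1 at t = 0
    # Crude convexity check on a grid: the value at each midpoint must lie
    # on or below the chord between its neighbors.
    t = np.linspace(-3, 3, 601)
    assert np.all(V((t[:-1] + t[1:]) / 2) <= (V(t[:-1]) + V(t[1:])) / 2 + 1e-12)
print("all checks passed")
```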
[Figure 1 plots the square loss (1 − t)², the hinge loss |1 − t|_+, and the logistic loss log₂(1 + e^{−t}) as functions of t.]

Figure 1: Various loss functions used in classification. Here t = yf(x).
2. There exists a constant C_0 such that, for all y ∈ Y, V(0, y) ≤ C_0.

The explicit values of L_M and C_0 depend on the specific form of the loss function. In what follows, we consider the square loss V(w, y) = (w − y)², the absolute value loss V(w, y) = |w − y|, and the ε-insensitive loss V(w, y) = max{|w − y| − ε, 0} =: |w − y|_ε for regression, and the square loss V(w, y) = (w − y)² = (1 − wy)², the hinge loss V(w, y) = max{1 − wy, 0} =: |1 − wy|_+, and the logistic loss V(w, y) = (ln 2)^{−1} ln(1 + e^{−wy}) for classification (see Figure 1). The constant in the logistic loss ensures that all losses for classification equal 1 for w = 0. The values of L_M and C_0 for the various loss functions are summarized in Table 1.

Table 1: Optimal Values of L_M and C_0 for a Number of Loss Functions for Regression and Classification.

Problem        | Loss           | L_M                          | C_0
Regression     | Quadratic      | 2M + δ                       | δ²
Regression     | Absolute value | 1                            | δ
Regression     | ε-insensitive  | 1                            | δ
Classification | Quadratic      | 2M + 2                       | 1
Classification | Hinge          | 1                            | 1
Classification | Logistic       | (ln 2)^{−1} e^M / (1 + e^M)  | 1

Note: In regression (Y = [a, b]), δ = max{|a|, |b|}.

We observe that for regression, the value of δ in C_0
L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri
depends on the interval $[a, b]$ in $\mathbb{R}$ and is defined as $\delta = \max\{|a|, |b|\}$. For classification, instead, $C_0 = 1$ for all loss functions. Notice that the 0-1 loss, the natural loss function for binary classification, does not satisfy the convexity assumption. In practice, this does not constitute a limitation, since the 0-1 loss leads to intractable optimization problems.

3 Estimation Error Bounds for Convex Loss Functions

It is well known that by introducing a hypothesis space $H_R$, the generalization error $I[f_D] - I[f_0]$ can be written as

$$I[f_D] - I[f_0] = (I[f_D] - I[f_R]) + (I[f_R] - I[f_0]),$$  (3.1)

with $f_R$ defined as the minimizer of $\min_{f \in H_R} \{I[f]\}$. The first term on the right-hand side of equation 3.1 is the sample or estimation error, whereas the second term, which does not depend on the data, is the approximation error. In this section, we provide a bound on the estimation error for all loss functions through a rather straightforward extension of theorem C in Cucker and Smale (2002b). We let $N(\epsilon)$ be the covering number of $H_R$ (which is well defined because $H_R$ is a compact subset of $C(X)$) and start by proving the following sufficient condition for uniform convergence, from which the derivation of the probabilistic bound on the estimation error will be trivially obtained.

Lemma 1. Let $M = C_K R$ and $B = L_M M + C_0$. For all $\epsilon > 0$,

$$\mathrm{Prob}\Big\{D \in Z^\ell \,\Big|\, \sup_{f \in H_R} |I[f] - I_{emp}[f]| \le \epsilon\Big\} \ge 1 - 2\,N\Big(\frac{\epsilon}{4L_M}\Big)\, e^{-\frac{\ell \epsilon^2}{8B^2}}.$$  (3.2)

Proof. Since $H_R$ is compact, both $M$ and $B$ are finite. We start by writing

$$L_D[f] = I[f] - I_{emp}[f].$$

Given $f_1, f_2 \in H_R$, by the Lipschitz property we have that

$$|L_D[f_1] - L_D[f_2]| \le |I[f_1] - I[f_2]| + |I_{emp}[f_1] - I_{emp}[f_2]| \le 2L_M \|f_1 - f_2\|_\infty.$$  (3.3)

The mean value $\mu$ of the random variable on $Z$ defined as $\xi(x, y) = V(f(x), y)$ is

$$\mu := \int_Z V(f(x), y)\, dp(x, y) = I[f],$$

and since $0 \le \mu, \xi \le B$, we have that $|\xi(x, y) - \mu| \le B$ for all $x \in X$, $y \in Y$. Given $f \in H_R$, we denote by

$$A_f = \{D \in Z^\ell \mid |L_D[f]| \ge \epsilon\}$$

the collection of training sets for which convergence in probability of $I_{emp}[f]$ to $I[f]$ with high confidence is not attained. By the Hoeffding inequality (Cucker & Smale, 2002b), we have that

$$p^\ell(A_f) \le 2 e^{-\frac{\ell \epsilon^2}{2B^2}}.$$

If $m = N(\frac{\epsilon}{2L_M})$, by definition of covering number, there exist $m$ functions $f_1, \ldots, f_m \in H_R$ such that the $m$ balls of radius $\frac{\epsilon}{2L_M}$, $B(f_i, \frac{\epsilon}{2L_M})$, cover $H_R$, or $H_R \subset \bigcup_{i=1}^m B(f_i, \frac{\epsilon}{2L_M})$. Equivalently, for all $f \in H_R$, there exists some $i \in \{1, \ldots, m\}$ such that $f \in B(f_i, \frac{\epsilon}{2L_M})$, or

$$\|f - f_i\|_\infty \le \frac{\epsilon}{2L_M}.$$  (3.4)

If we now define $A = \bigcup_{i=1}^m A_{f_i}$, we then have

$$p^\ell(A) \le \sum_{i=1}^m p^\ell(A_{f_i}) \le m\, 2 e^{-\frac{\ell \epsilon^2}{2B^2}}.$$

Thus, for all $D \notin A$, we have that $|L_D[f_i]| \le \epsilon$ and, by combining equations 3.3 and 3.4, we obtain

$$|L_D[f] - L_D[f_i]| \le \epsilon.$$

Therefore, for all $f \in H_R$ and $D \notin A$, we have that $|L_D[f]| \le 2\epsilon$. The thesis follows, replacing $\epsilon$ with $\frac{\epsilon}{2}$.

Lemma 1 can be compared to the classic result in Vapnik (1998, Chaps. 3 and 5), where a different notion of covering number, one that depends on the given sample, is considered. The relation between these two complexity measures of the hypothesis space has been investigated by Zhou (2002) and Pontil (in press). In particular, from the results in Pontil (in press), the generalization of our proof to the case of data-dependent covering numbers does not seem straightforward. We are now in a position to generalize theorem C in Cucker and Smale (2002b) and obtain the probabilistic bound by observing that for a fixed $\eta$, the confidence term in equation 3.2 can be solved for $\epsilon$.
Theorem 1. Given $0 < \eta < 1$, $\ell \in \mathbb{N}$, and $R > 0$, with probability at least $1 - \eta$,

$$I[f_D] \le I_{emp}[f_D] + \epsilon(\eta, \ell, R) \quad \text{and} \quad |I[f_D] - I[f_R]| \le 2\epsilon(\eta, \ell, R),$$

with $\lim_{\ell \to \infty} \epsilon(\eta, \ell, R) = 0$.
Proof. The first inequality is evident from lemma 1 and the definition of $\epsilon$. The second follows observing that with probability at least $1 - \eta$,

$$I[f_D] \le I_{emp}[f_D] + \epsilon(\eta, \ell, R) \le I_{emp}[f_R] + \epsilon(\eta, \ell, R) \le I[f_R] + 2\epsilon(\eta, \ell, R),$$

and, by definition of $f_R$, $I[f_R] \le I[f_D]$. That $\lim_{\ell \to \infty} \epsilon(\eta, \ell, R) = 0$ follows elementarily by inverting $\eta = \eta(\epsilon)$ with respect to $\epsilon$.

4 Statistical Properties of Loss Functions

We now move on to study some statistical properties of various loss functions.

4.1 Comparing Convergence Rates. Using equation 3.2, we first compare the convergence rates of the various loss functions. This is possible because the constants appearing in the bounds depend explicitly on the choice of the loss function. For simplicity, we assume $C_K = 1$ throughout. For regression, we have that the absolute value and the $\epsilon$-insensitive loss functions have the same confidence,

$$2\,N\Big(\frac{\epsilon}{4}\Big) \exp\Big(-\frac{\ell \epsilon^2}{8(R + \delta)^2}\Big),$$  (4.1)

from which we see that the radius, $\epsilon/4$, does not decrease when $R$ increases, unlike the case of the square loss, in which the confidence is

$$2\,N\Big(\frac{\epsilon}{4(2R + \delta)}\Big) \exp\Big(-\frac{\ell \epsilon^2}{8\big(R(2R + \delta) + \delta^2\big)^2}\Big).$$  (4.2)

Notice that for the square loss, the convergence rate is also much slower, given the different leading power of the $R$ and $\delta$ factors in the denominators of the exponential arguments of equations 4.1 and 4.2. In Figure 2A, we compare the dependence of the estimated confidence $\eta$ on the sample size $\ell$ for the square and the $\epsilon$-insensitive loss for some fixed values of the various parameters (see the legend for details). The covering number has been estimated from the upper bounds found in Zhou (2002) for the gaussian kernel. Clearly, a steeper slope corresponds to a better convergence rate.
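The regression comparison can be reproduced in outline. Treating the covering number as a fixed constant (in the text it is estimated from Zhou's bounds for the gaussian kernel; here it is simply dropped, so only the exponential factors of equations 4.1 and 4.2 are compared), a short sketch:

```python
import math

R, delta, eps = 1.5, 1.5, 0.2  # the parameter values used for Figure 2A

def tail_factor_abs(ell):
    # Exponential factor of eq. 4.1 (absolute value / eps-insensitive loss):
    # exp(-ell * eps^2 / (8 (R + delta)^2))
    return math.exp(-ell * eps ** 2 / (8 * (R + delta) ** 2))

def tail_factor_square(ell):
    # Exponential factor of eq. 4.2 (square loss):
    # exp(-ell * eps^2 / (8 (R (2R + delta) + delta^2)^2))
    return math.exp(-ell * eps ** 2 / (8 * (R * (2 * R + delta) + delta ** 2) ** 2))

# At the same sample size, the eps-insensitive factor is far smaller:
# a steeper slope on a semilog plot, hence a faster convergence rate.
ell = 10000
faster = tail_factor_abs(ell) < tail_factor_square(ell)
```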
Figure 2: Semilogarithmic plots of the convergence rates of various loss functions for regression (A: square vs. $\epsilon$-insensitive/absolute value) and classification (B: square, hinge, and logistic), plotting the estimated confidence $\eta$ (up to a constant) against the sample size. In both cases, $R = 1.5$, $\epsilon = 0.2$, and the dimensionality of the input space used to estimate the covering number is 10. For regression, we set $\delta = 1.5$.
Qualitatively, the behavior of the square loss does not change moving from regression to classification. For the hinge loss, the confidence reads

$$2\,N\Big(\frac{\epsilon}{4}\Big) \exp\Big(-\frac{\ell \epsilon^2}{8(R + 1)^2}\Big).$$

Here again, the covering number does not depend on $R$, and the convergence rate is much better than for the square loss. The overall behavior of the
logistic loss,

$$2\,N\Big(\frac{\epsilon}{4(\ln 2)^{-1} e^R/(1 + e^R)}\Big) \exp\Big(-\frac{\ell \epsilon^2}{8\big(R(\ln 2)^{-1} e^R/(e^R + 1) + 1\big)^2}\Big),$$

is very similar to the hinge case. This agrees with the intuition that these two losses have a similar shape (see Figure 1). The behavior of the convergence rates for these three loss functions is depicted in Figure 2B (again, the covering number has been estimated using the upper bounds found in Zhou, 2002, for the case of the gaussian kernel, and a steeper slope corresponds to a better convergence rate). We conclude this section by pointing out that this analysis is made possible by the fact that, unlike in previous work, mathematical properties of the loss function have been incorporated directly into the bounds.

4.2 Bounds for Classification. We now focus on the case of classification. We start by showing that the convexity assumption ensures that the thresholded minimizer of the expected risk equals the Bayes optimal solution, independent of the loss function. We then find that the hinge loss is the one for which the obtained bounds are tighter. The natural restriction to indicator functions for classification corresponds to considering the 0-1 loss. Due to the intractability of the optimization problems posed by this loss, real-valued loss functions must be used instead (effectively solving a regression problem), and classification is obtained by thresholding the output. We recall that in this case, the best solution $f_b$ for a binary classification problem is provided by the Bayes rule, defined for $p(1|x) \ne p(-1|x)$ as

$$f_b(x) = \begin{cases} 1 & \text{if } p(1|x) > p(-1|x) \\ -1 & \text{if } p(1|x) < p(-1|x). \end{cases}$$
We now prove the following fact relating the Bayes optimal solution to the real-valued minimizer of the expected risk for a convex loss.

Fact. Assume that the loss function $V(w, y) = V(wy)$ is convex and that it is decreasing in a neighborhood of 0. If $f_0(x) \ne 0$, then $f_b(x) = \mathrm{sgn}(f_0(x))$.

Proof. We recall that since $V$ is convex, $V$ admits left and right derivatives in 0, and since it is decreasing, $V'_-(0) \le V'_+(0) < 0$. Observe that

$$I[f] = \int_X \big( p(1|x)\,V(f(x)) + (1 - p(1|x))\,V(-f(x)) \big)\, p(x)\, dx;$$
Are Loss Functions All the Same?
1073
with $p(1|x) = p(y = 1|x)$, we have $f_0(x) = \mathrm{argmin}_{w \in \mathbb{R}}\, \psi(w)$, where $\psi(w) = p(1|x)\,V(w) + (1 - p(1|x))\,V(-w)$ (we assume existence and uniqueness to avoid pathological cases). Assume, for example, that $p(1|x) > \frac{1}{2}$. Then,

$$\psi'_-(0) = p(1|x)\,V'_-(0) - (1 - p(1|x))\,V'_+(0) \le p(1|x)\,V'_+(0) - (1 - p(1|x))\,V'_+(0) = (2p(1|x) - 1)\,V'_+(0) \le 0.$$

Since $\psi$ is also a convex function in $w$, this implies that for all $w \le 0$,

$$\psi(w) \ge \psi(0) + \psi'_-(0)\, w \ge \psi(0),$$

so that the minimum point $w^*$ of $\psi(w)$ is such that $w^* \ge 0$. Since $f_0(x) = w^*$, it follows that if $f_0(x) \ne 0$,

$$\mathrm{sgn}\, f_0(x) = \mathrm{sgn}(2p(1|x) - 1) = f_b(x).$$

Remark. The technical condition $f_0(x) \ne 0$ is always met by all loss functions considered in this letter and in practical applications and is equivalent to requiring the differentiability of $V$ in the origin.2

The above fact ensures that in the presence of infinite data, all loss functions used in practice, though only rough approximations of the 0-1 loss, lead to consistent results. Therefore, our result can be interpreted as a consistency property shared by all convex loss functions. It can be shown that for the hinge loss (Lin, Wahba, Zhang, & Lee, 2003),

$$I[f_0] = I[f_b].$$  (4.3)
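The fact above can be illustrated numerically: for a fixed $x$ with $p(1|x) > 1/2$, a grid search for the minimizer of $\psi(w) = p(1|x)V(w) + (1 - p(1|x))V(-w)$ returns a nonnegative $w^*$ for each convex loss, so the thresholded minimizer agrees with the Bayes rule. The grid and the choice $p(1|x) = 0.7$ are illustrative only:

```python
import math

p = 0.7  # an illustrative conditional probability p(1|x) > 1/2

# Classification losses as functions of t = wy (cf. Figure 1).
losses = {
    "square": lambda t: (1 - t) ** 2,
    "hinge": lambda t: max(1 - t, 0.0),
    "logistic": lambda t: math.log(1 + math.exp(-t)) / math.log(2),
}

def psi(V, w):
    # psi(w) = p(1|x) V(w) + (1 - p(1|x)) V(-w)
    return p * V(w) + (1 - p) * V(-w)

grid = [i / 100.0 for i in range(-500, 501)]  # w in [-5, 5], step 0.01
minimizers = {name: min(grid, key=lambda w: psi(V, w)) for name, V in losses.items()}
# For the square loss, the minimizer is exactly 2 p(1|x) - 1 = 0.4;
# all three minimizers are positive, matching sgn(f0) = f_b.
```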
By directly computing $f_0$ for different loss functions (see Hastie, Tibshirani, & Friedman, 2001, p. 381, for example), it is easy to prove that this result does not hold for the other loss functions used in this letter. We now use this result to show that the hinge loss has a further advantage over the other loss functions. In the case of finite data, we are interested in bounding

$$I[\mathrm{sgn}(f_D)] - I[f_b],$$
(4.4)
2 Consider the case $p(1|x) > \frac{1}{2}$. Computing the right derivative of $\psi$ in 0, $\psi'_+(0)$, and observing that $\psi'_+(0) \ge 0$ for $p(1|x) \in \big(\frac{1}{2}, \frac{V'_-(0)}{V'_-(0) + V'_+(0)}\big)$, it follows that this interval is empty if and only if $V'_-(0) = V'_+(0)$. For more details, see Rosasco, De Vito, Caponnetto, Piana, and Verri (2003).
but we can produce bounds only of the type

$$I[f_D] - I[f_R] \le 2\epsilon(\eta, \ell, R).$$

We observe that for all loss functions,

$$I[\mathrm{sgn}(f_D)] \le I[f_D]$$  (4.5)

(see Figure 1). Now, if the hypothesis space is rich enough to contain $f_0$, that is, when the approximation error can be neglected, we have $f_R = f_0$. For the hinge loss, using equations 4.3 and 4.5 and the theorem, we obtain that for $0 < \eta < 1$ and $R > 0$, with probability at least $1 - \eta$,

$$0 \le I[\mathrm{sgn}(f_D)] - I[f_b] \le I[f_D] - I[f_0] \le 2\epsilon(\eta, \ell, R).$$

We stress that the simple derivation of the above bound follows naturally from the special property of the hinge loss expressed in equation 4.3. For other loss functions, similar results can be derived through a more complex analysis (Lugosi & Vayatis, in press; Zhang, in press).

5 Conclusion

In this article, we consider a probabilistic bound on the estimation error based on covering numbers and depending explicitly on the form of the loss function, for both regression and classification problems. Our analysis makes explicit an implicit convexity assumption met by all loss functions used in the literature. Unlike previous results, constants related to the behavior of different loss functions are directly incorporated in the bound. This allows us to analyze the role played by the choice of the loss function in statistical learning. We conclude that the built-in statistical robustness of loss functions like the hinge or the logistic loss for classification and the $\epsilon$-insensitive loss for regression leads to better convergence rates than the classic square loss. It remains to be seen whether the same conclusions on the convergence rates can be drawn using different bounds. Furthermore, for classification, we derived in a simple way results relating the classification problem to the regression problem that is actually solved in the case of real-valued loss functions. In particular, we pointed out that only for the hinge loss does the solution of the regression problem with infinite data return Bayes' rule. Using this fact, the bound on the generalization error for the hinge loss can be written ignoring the thresholding stage.

Finally, we observe that our results are found considering the regularization setting of the Ivanov type, that is, empirical risk minimization in balls of radius $R$ in the RKHS $H$. Many kernel methods consider a functional of the form

$$I_{emp}[f] + \lambda \|f\|^2_H$$
that can be seen as the Tikhonov version of the above regularization problem. The question arises of whether the results presented in this article can be generalized to the Tikhonov setting. For the square loss, a positive answer is given in Cucker and Smale (2002a), where the proofs heavily rely on the special properties of the square loss. Current work focuses on extending this result to the wider class of loss functions considered in this article.

Acknowledgments

We thank Stefano Tassano for useful discussions. L.R. is supported by an INFM fellowship. A.C. is supported by a fellowship within the PRIN project "Inverse Problems in Medical Imaging," n. 2002013422. This research has been partially funded by the INFM Advanced Research Project MAIA, the FIRB Project ASTA RBAU01877P, and the EU Project IST-2000-25341 KerMIT.

References

Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence and learnability. In Proceedings of the 34th Annual IEEE Conference on Foundations of Computer Science (pp. 292-301). New York: IEEE.
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337-404.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Cucker, F., & Smale, S. (2002a). Best choices for regularization parameters in learning theory: On the bias-variance problem. Foundations of Computational Mathematics, 2, 413-428.
Cucker, F., & Smale, S. (2002b). On the mathematical foundation of learning. Bull. A.M.S., 39, 1-49.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comp. Math., 13, 1-50.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219-269.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag.
Lin, Y., Wahba, G., Zhang, H., & Lee, Y. (2003). Statistical properties and adaptive tuning of support vector machines. Machine Learning, 48, 115-136.
Lugosi, G., & Vayatis, N. (in press). On the Bayes risk consistency of regularized boosting methods. Annals of Statistics.
Pontil, M. (in press). A note on different covering numbers in learning theory. J. Complexity.
Rockafellar, R. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., & Verri, A. (2003). Notes on the use of different loss functions (Tech. Rep. No. DISI-TR-03-07). Genova: Department of Computer Science, University of Genova, Italy.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Zhang, T. (in press). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics.
Zhou, D. (2002). The covering number in learning theory. J. Complexity, 18, 739-767.

Received July 14, 2003; accepted September 26, 2003.
LETTER
Communicated by Helen Hao Zhang
Trading Variance Reduction with Unbiasedness: The Regularized Subspace Information Criterion for Robust Model Selection in Kernel Regression

Masashi Sugiyama
sugi@cs.titech.ac.jp
Fraunhofer FIRST, IDA, 12489 Berlin, Germany, and Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8552, Japan
Motoaki Kawanabe
nabe@first.fhg.de
Fraunhofer FIRST, IDA, 12489 Berlin, Germany
Klaus-Robert Müller
klaus@first.fhg.de
Fraunhofer FIRST, IDA, 12489 Berlin, Germany, and Department of Computer Science, University of Potsdam, 14482 Potsdam, Germany
A well-known result by Stein (1956) shows that in particular situations, biased estimators can yield better parameter estimates than their generally preferred unbiased counterparts. This letter follows the same spirit, as we stabilize unbiased generalization error estimates by regularization and finally obtain more robust model selection criteria for learning. We trade a small bias against a larger variance reduction, which has the beneficial effect of being more precise on a single training set. We focus on the subspace information criterion (SIC), which is an unbiased estimator of the expected generalization error measured by the reproducing kernel Hilbert space norm. SIC can be applied to kernel regression, and it was shown in earlier experiments that a small regularization of SIC has a stabilization effect. However, it remained open how to appropriately determine the degree of regularization in SIC. In this article, we derive an unbiased estimator of the expected squared error between SIC and the expected generalization error and propose determining the degree of regularization of SIC such that the estimator of the expected squared error is minimized. Computer simulations with artificial and real data sets illustrate that the proposed method works effectively for improving the precision of SIC, especially in high-noise-level cases. We furthermore compare the proposed method to the original SIC, cross-validation, and an empirical Bayesian method in ridge parameter selection, with good results.
Neural Computation 16, 1077-1104 (2004) © 2004 Massachusetts Institute of Technology
M. Sugiyama, M. Kawanabe, and K. Müller
1 Introduction

Estimating the generalization capability of learning machines has been extensively studied because a good estimator of the generalization error can be used for model selection (Vapnik, 1982, 1995, 1998; Bishop, 1995; Devroye, Györfi, & Lugosi, 1996; Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001). Existing work on estimating the generalization error can be roughly classified into two approaches. One is to estimate the expected generalization error (e.g., Mallows, 1964, 1973; Akaike, 1974; Takeuchi, 1976; Sugiura, 1978; Craven & Wahba, 1979; Wahba, 1990; Murata, Yoshizawa, & Amari, 1994; Konishi & Kitagawa, 1996; Murata, 1998; Sugiyama & Ogawa, 2001; Sugiyama & Müller, 2002), and the other is to estimate the worst-case generalization error (e.g., Vapnik, 1995; Cherkassky, Shao, Mulier, & Vapnik, 1999; Cucker & Smale, 2002; Bousquet & Elisseeff, 2002). Both approaches have strong theoretical properties: the accuracy of the estimators of the expected generalization error is theoretically guaranteed in the sense of asymptotic or exact unbiasedness,1 and the validity of the estimators of the worst-case generalization error (i.e., upper bounds on the generalization error) is theoretically guaranteed with certain probability. So far, these methods have been successfully applied to various practical learning tasks. However, unbiased estimators of the expected generalization error can have large variance, and the probabilistic upper bounds on the generalization error can be loose. For this reason, it is very important to reduce the variance of the unbiased estimators of the expected generalization error or to tighten the probabilistic upper bounds on the generalization error. In this article, we focus on reducing the variance and propose a method for improving the precision of unbiased estimators of the expected generalization error by regularization.
Since we are trying to shrink unbiased estimators of the expected generalization error, this work can be regarded as an application of the idea of the Stein estimator (Stein, 1956) to model selection. The variance of the unbiased estimators of the expected generalization error has previously been investigated (e.g., Felsenstein, 1985; Linhart, 1988; Shimodaira, 1997, 1998), in a context where small differences in the values of Akaike's information criterion (AIC; Akaike, 1974) are not statistically significant.2 These articles proposed using a set of "good" models whose values of AIC are relatively small rather than selecting the single best model that minimizes AIC. Although these studies pointed out the need to investigate the variance of the unbiased estimators of the expected generalization error, they are not primarily intended to improve the precision of the estimators. Tsuda, Sugiyama, and Müller (2002) gave a method for reducing the variance of the subspace information criterion (SIC; Sugiyama & Ogawa, 2001; Sugiyama & Müller, 2002) by introducing a regularization parameter to SIC.3 It was experimentally shown that a small regularization of SIC contributes greatly to stabilization. This work alluded to the possibility of obtaining more precise estimators of the expected generalization error. At the same time, it raised the question, so far unresolved, of how to appropriately determine the degree of regularization in regularized SIC (RSIC). In this letter, we therefore propose a method for appropriately determining the degree of regularization in RSIC such that the expected squared error between RSIC and the expected generalization error is minimized. However, we cannot do so directly, since the expected squared error includes the unknown expected generalization error. To cope with this problem, we derive an unbiased estimator of the expected squared error that can be calculated from the data and propose determining the degree of regularization in RSIC such that this estimator of the expected squared error is minimized. Finally, we apply the proposed method to ridge parameter selection in ridge regression. There are several interesting works that theoretically investigate the asymptotic optimality of the choice of the ridge parameter (Craven & Wahba, 1979; Wahba, 1985; Li, 1986). Although we believe that showing the asymptotic optimality of the proposed method may be possible, we are especially interested in the performance with finite samples. For this reason, we shall experimentally investigate the model selection performance of the proposed method in finite sample situations.

1 Here, the term exact unbiasedness is used for expressing ordinary unbiasedness (i.e., the expectation agrees with the true value for finite samples) in order to emphasize the contrast with asymptotic unbiasedness (the expectation converges to the true value as the number of samples goes to infinity).
2 AIC is an asymptotically unbiased estimator of the expected generalization error measured by the Kullback-Leibler divergence.
Simulations with artificial and benchmark data sets show that our regularization approach contributes to improving the precision of SIC; it especially has a stabilizing effect for high noise, and consequently the model selection performance is improved. The rest of this letter is organized as follows. The regression problem is formulated in section 2, and the derivation of SIC is briefly reviewed in section 3. Section 4 introduces RSIC and gives a method for determining the degree of regularization in RSIC. Computer simulations with artificial and real data sets are performed in section 5, illustrating how RSIC works. Finally, section 6 gives the conclusions and future prospects.

2 Problem Formulation

In this section, we formulate the regression problem of approximating a target function from training samples.
3 SIC is an unbiased estimator of the expected generalization error measured by the reproducing kernel Hilbert space norm. As described by Sugiyama and Ogawa (2001), SIC can be regarded as an extension of Mallows's $C_L$ (Mallows, 1973).
1080
M. Sugiyama, M. Kawanabe, and K. Muller ¨
Let us denote the learning target function by $f(x)$, which is a real-valued function of $d$ variables defined on a subset $\mathcal{D}$ of the $d$-dimensional Euclidean space $\mathbb{R}^d$. We are given a set of $n$ samples called the training examples. A training example consists of a sample point $x_i$ in $\mathcal{D}$ and a sample value $y_i$ in $\mathbb{R}$. We consider the case that $y_i$ is degraded by unknown additive noise $\epsilon_i$, which is independently drawn from a normal distribution with mean zero and variance $\sigma^2$.4 Then the training examples are expressed as

$$\{(x_i, y_i) \mid y_i = f(x_i) + \epsilon_i\}_{i=1}^{n}.$$  (2.1)
We assume that the unknown learning target function $f(x)$ belongs to a specified reproducing kernel Hilbert space (RKHS) $H$.5 The reproducing kernel of a functional Hilbert space $H$, denoted by $K(x, x')$, is a bivariate function defined on $\mathcal{D} \times \mathcal{D}$ that satisfies the following conditions (see, e.g., Aronszajn, 1950; Bergman, 1970; Saitoh, 1988, 1997; Wahba, 1990; Vapnik, 1998; Cristianini & Shawe-Taylor, 2000):

• For any fixed $x'$ in $\mathcal{D}$, $K(x, x')$ is a function of $x$ in $H$.

• For any function $f$ in $H$ and for any $x'$ in $\mathcal{D}$, it holds that

$$\langle f(\cdot), K(\cdot, x')\rangle_H = f(x'),$$  (2.2)

where $\langle\cdot, \cdot\rangle_H$ stands for the inner product in $H$.
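The reproducing property of equation 2.2 can be checked concretely in a toy finite-dimensional case that is not taken from the letter: for the linear kernel $K(x, x') = \langle x, x'\rangle$ on $\mathbb{R}^3$, $H$ consists of linear functions $f(x) = \langle w, x\rangle$ with $\langle f, g\rangle_H = \langle w_f, w_g\rangle$, and $K(\cdot, x')$ is the element of $H$ with weight vector $x'$:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def K(x, xp):  # linear kernel on R^3
    return dot(x, xp)

# f in the span of {K(., x_i)}: f(x) = sum_i alpha_i K(x, x_i).
xs = [[1.0, 0.0, 2.0], [0.5, -1.0, 1.0]]
alphas = [2.0, -3.0]

def f(x):
    return sum(a * K(x, xi) for a, xi in zip(alphas, xs))

# Weight vector of f: w_f = sum_i alpha_i x_i, so <f, K(., x')>_H = <w_f, x'>.
w_f = [sum(a * xi[j] for a, xi in zip(alphas, xs)) for j in range(3)]

x_prime = [1.0, 3.0, -2.0]
lhs = dot(w_f, x_prime)  # the RKHS inner product <f(.), K(., x')>_H
rhs = f(x_prime)         # direct evaluation f(x')
# The reproducing property says lhs == rhs.
```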
We will employ the following kernel regression model $\hat f(x)$:

$$\hat f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i),$$  (2.3)

where $\{\alpha_i\}_{i=1}^n$ are parameters to be estimated from training examples. Let us denote the estimated parameters by $\{\hat\alpha_i\}_{i=1}^n$. We consider the case that the estimated parameters $\{\hat\alpha_i\}_{i=1}^n$ are given by linear combinations of the sample values $\{y_i\}_{i=1}^n$. More specifically, letting

$$y = (y_1, y_2, \ldots, y_n)^\top,$$  (2.4)

$$\hat{\alpha} = (\hat\alpha_1, \hat\alpha_2, \ldots, \hat\alpha_n)^\top,$$  (2.5)
4 The normality of the noise is not assumed in our previous works (Sugiyama & Ogawa, 2001; Sugiyama & Müller, 2002). We do assume normality here because we are dealing with higher-order statistics. The discussion in this article may be generalized to any noise distribution for which up to the fourth-order moments of the noise are known or can be estimated. However, for simplicity, we focus on normal noise.
5 In our early work (Sugiyama & Ogawa, 2001), only finite-dimensional RKHSs could be dealt with. However, this restriction has been completely removed by Sugiyama and Müller (2002). This article is based on the latter work, so we do not impose any restrictions on the choice of the RKHS; for example, infinite-dimensional RKHSs are also allowed.
Trading Variance Reduction with Unbiasedness
1081
where $\top$ denotes the transpose of a vector (or a matrix), we consider the case that the estimated parameter vector $\hat\alpha$ is given by

$$\hat\alpha = X y,$$  (2.6)

where $X$ is an $n$-dimensional matrix that does not depend on the noise $\{\epsilon_i\}_{i=1}^n$. The matrix $X$, which we call the learning matrix, can be any matrix, but it is usually determined on the basis of a prespecified learning criterion. For example, in the case of ridge regression (Hoerl & Kennard, 1970), the learning matrix $X$ is determined by minimizing the regularized training error

$$\min_{\alpha} \Big( \sum_{i=1}^{n} \big(\hat f(x_i) - y_i\big)^2 + \lambda \sum_{j=1}^{n} \alpha_j^2 \Big),$$  (2.7)

where $\lambda$ is a positive scalar called the ridge parameter. A minimizer of equation 2.7 is given by the following learning matrix,

$$X = (K^2 + \lambda I)^{-1} K,$$  (2.8)

where $I$ denotes the identity matrix and $K$ is the so-called kernel matrix; that is, the $(i, j)$th element of $K$ is given by

$$K_{i,j} = K(x_i, x_j).$$  (2.9)
(2.10)
where k ¢ kH denotes the norm in the RKHS H . Using the function space norm as the error measure is rather common in the eld of function approximation (e.g., Daubechies, 1992; Donoho & Johnstone, 1994; Donoho, 1995). The use of the RKHS norm is advantageous in the machine learning context since we can measure various types of errors such as the interpolation error, the extrapolation error, the test error at points of interest, the error at
training sample points (Mallows, 1973), the error measured by a weighted norm in the frequency domain (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), or the error measured by the Sobolev norm (Wahba, 1990). When unlabeled samples ($\{x_j\}$ without $\{y_j\}$) are available in addition to the usual training examples $\{(x_i, y_i)\}_{i=1}^n$, another advantage of the RKHS norm is that we can use those unlabeled samples beneficially and in a straightforward manner (Sugiyama & Ogawa, 2002; Tsuda et al., 2002). (For further discussion of this generalization measure, see Sugiyama & Müller, 2002.) As stated in section 1, we focus on estimating the expected generalization error,

$$J_0[X] = E_\epsilon \|\hat f - f\|^2_H,$$  (2.11)

where $E_\epsilon$ denotes the expectation over the noise $\{\epsilon_i\}_{i=1}^n$. Note that we do not take the expectation over the training sample points $\{x_i\}_{i=1}^n$, which is often done in statistical learning frameworks (e.g., Akaike, 1974; Takeuchi, 1976; Murata et al., 1994; Konishi & Kitagawa, 1996; Murata, 1998). Thus, our framework is more data dependent. We denote the expected generalization error $J_0$ as a functional of the learning matrix $X$ since, under the above setting, specifying $\hat f$ is equivalent to specifying the learning matrix $X$. In the following, we often omit $X$ if it is not relevant. As can be seen from equation 2.11, $J_0$ includes the unknown learning target function $f(x)$, so it cannot be directly calculated. The aim of this article is to give an estimator of equation 2.11 that can be calculated from the given data.

3 Brief Review of the Subspace Information Criterion

The subspace information criterion (SIC; Sugiyama & Ogawa, 2001; Sugiyama & Müller, 2002) is an unbiased estimator of an essential part of the expected generalization error $J_0$. In this section, we briefly review the derivation of SIC. Let $S$ be the subspace spanned by $\{K(x, x_i)\}_{i=1}^n$, and let $f_S(x)$ be the orthogonal projection of $f(x)$ onto $S$. Then the expected generalization error $J_0$ is expressed by

$$J_0 = E_\epsilon \|\hat f - f_S\|^2_H + \|f_S - f\|^2_H,$$  (3.1)

where the second term $\|f_S - f\|^2_H$ does not depend on $\hat f$. For this reason, we will ignore it and denote the first term by $J_1$:

$$J_1[X] = E_\epsilon \|\hat f - f_S\|^2_H.$$  (3.2)
Since the projection $f_S(x)$ belongs to $S$, it can be expressed by

    f_S(x) = \sum_{i=1}^n \alpha_i^* K(x, x_i),    (3.3)

where the parameters $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_n^*)^\top$ are unknown.[6] For convenience, let us define the weighted norm in $\mathbb{R}^n$,

    \| \alpha \|_K^2 = \langle K\alpha, \alpha \rangle,    (3.4)

where the inner product $\langle \cdot, \cdot \rangle$ on the right-hand side is the ordinary Euclidean inner product in $\mathbb{R}^n$. Then $J_1$ is expressed as

    J_1 = E_\varepsilon \| \hat{\alpha} - \alpha^* \|_K^2.    (3.5)
It is known that the above $J_1$ can be decomposed into bias and variance terms (see, e.g., Geman, Bienenstock, & Doursat, 1992; Heskes, 1998):

    J_1 = \| E_\varepsilon \hat{\alpha} - \alpha^* \|_K^2 + E_\varepsilon \| \hat{\alpha} - E_\varepsilon \hat{\alpha} \|_K^2.    (3.6)

The variance term $E_\varepsilon \| \hat{\alpha} - E_\varepsilon \hat{\alpha} \|_K^2$ can be expressed as

    E_\varepsilon \| \hat{\alpha} - E_\varepsilon \hat{\alpha} \|_K^2 = \sigma^2 \, \mathrm{tr}(K X X^\top),    (3.7)

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, that is, the sum of the diagonal elements. Equation 3.7 implies that the variance term in equation 3.6 can be calculated if the noise variance $\sigma^2$ is available. When $\sigma^2$ is unknown, one practical estimate is given as follows (see, e.g., Wahba, 1990; Gu, Heckman, & Wahba, 1992):

    \hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{f}(x_i))^2}{n - \mathrm{tr}(KX)} = \frac{\| KXy - y \|^2}{n - \mathrm{tr}(KX)}.    (3.8)

Note that $\| \cdot \|$ in the numerator on the right-hand side of equation 3.8 denotes the ordinary Euclidean norm in $\mathbb{R}^n$.

The bias term $\| E_\varepsilon \hat{\alpha} - \alpha^* \|_K^2$ in equation 3.6 is totally inaccessible since both $E_\varepsilon \hat{\alpha}$ and $\alpha^*$ are unknown. The key idea of SIC is to assume that a linear unbiased estimate $\hat{\alpha}_u$ of the unknown true parameter vector $\alpha^*$ is available:

    E_\varepsilon \hat{\alpha}_u = \alpha^*,    (3.9)
[6] When $\{K(x, x_i)\}_{i=1}^n$ are linearly dependent, $\alpha^*$ is not determined uniquely. In this case, we adopt the minimum norm one.
M. Sugiyama, M. Kawanabe, and K. Müller
where $\hat{\alpha}_u$ is given by

    \hat{\alpha}_u = X_u y.    (3.10)

Sugiyama and Müller (2002) proved that such an $X_u$ is given by

    X_u = K^\dagger,    (3.11)

where $\dagger$ denotes the Moore-Penrose generalized inverse. Using the unbiased estimate $\hat{\alpha}_u$, the bias term $\| E_\varepsilon \hat{\alpha} - \alpha^* \|_K^2$ in equation 3.6 is expressed by

    \| E_\varepsilon \hat{\alpha} - \alpha^* \|_K^2 = \| \hat{\alpha} - \hat{\alpha}_u \|_K^2
        + 2 \langle K E_\varepsilon(\hat{\alpha} - \hat{\alpha}_u), E_\varepsilon(\hat{\alpha} - \hat{\alpha}_u) - (\hat{\alpha} - \hat{\alpha}_u) \rangle
        - \| E_\varepsilon(\hat{\alpha} - \hat{\alpha}_u) - (\hat{\alpha} - \hat{\alpha}_u) \|_K^2.    (3.12)

However, the second and third terms on the right-hand side of equation 3.12 are still inaccessible since $E_\varepsilon(\hat{\alpha} - \hat{\alpha}_u)$ is unknown, so we replace them by their expectations over the noise. Then we have the SIC (Sugiyama & Ogawa, 2001; Sugiyama & Müller, 2002):[7]

    \mathrm{SIC}_1[X] = \| (X - X_u) y \|_K^2 - \sigma^2 \, \mathrm{tr}\big(K (X - X_u)(X - X_u)^\top\big) + \sigma^2 \, \mathrm{tr}(K X X^\top).    (3.13)

Note that the subscript 1 is added to SIC in order to emphasize that it is an estimator of $J_1$ (cf. section 4.1). It was shown that for any learning matrix X, $\mathrm{SIC}_1$ is an unbiased estimator of $J_1$:

    E_\varepsilon \, \mathrm{SIC}_1[X] = J_1[X].    (3.14)
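In matrix form, equations 3.8, 3.11, and 3.13 translate directly into code. The following NumPy sketch (variable and function names are ours, not the article's) assumes a kernel Gram matrix K, a learning matrix X with $\hat{\alpha} = Xy$, and the sample value vector y:

```python
import numpy as np

def sigma2_hat(K, X, y):
    """Noise variance estimate of equation 3.8."""
    r = K @ X @ y - y
    return (r @ r) / (len(y) - np.trace(K @ X))

def sic1(K, X, y, sigma2):
    """Subspace information criterion SIC_1 of equation 3.13.

    K      : n x n kernel Gram matrix, K[i, j] = K(x_i, x_j)
    X      : n x n learning matrix (alpha_hat = X y)
    y      : sample value vector
    sigma2 : noise variance (true value, or an estimate such as sigma2_hat)
    """
    Xu = np.linalg.pinv(K)                    # X_u = K^dagger (equation 3.11)
    D = X - Xu
    Dy = D @ y
    bias_part = Dy @ K @ Dy                   # ||(X - X_u) y||_K^2
    return (bias_part
            - sigma2 * np.trace(K @ D @ D.T)
            + sigma2 * np.trace(K @ X @ X.T))
```

By equation 3.14, averaging `sic1` over many noise realizations should recover $J_1$; this is easy to check numerically on a small, well-conditioned Gram matrix.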
4 Regularization Approach to Stabilizing SIC

SIC is an unbiased estimator of the essential generalization error $J_1$, and this good property holds even in finite sample cases (i.e., nonasymptotic cases). Sugiyama and Müller (2002) demonstrated that SIC can be successfully applied to ridge parameter selection when the noise level is low or medium. However, when the noise level is very high, the performance of SIC sometimes becomes unstable because the variance of SIC can be large. In this section, we propose a method for stabilizing SIC.

[7] The phrase subspace information criterion (SIC) came from the fact that it was first introduced for selecting subspace models (Sugiyama & Ogawa, 2001). However, SIC is now used not only for choosing the subspace (i.e., the range of X) but also for choosing the learning matrix X itself (Sugiyama & Müller, 2002). Therefore, in equation 3.13, we describe SIC as a functional of the learning matrix X. For example, in the case of ridge regression (see equation 2.7), SIC is regarded as a function of the ridge parameter $\lambda$ and can be used for choosing the best ridge parameter.
4.1 Extracting the Essential Part of SIC. $\mathrm{SIC}_1$ defined by equation 3.13 includes terms that do not depend on X. Indeed, $\mathrm{SIC}_1$ can be expressed as

    \mathrm{SIC}_1[X] = \langle KXy, Xy \rangle - 2 \langle KXy, X_u y \rangle + \langle K X_u y, X_u y \rangle
        + 2\sigma^2 \, \mathrm{tr}(X_u^\top K X) - \sigma^2 \, \mathrm{tr}(X_u^\top K X_u).    (4.1)

Since $\mathrm{SIC}_1$ is used for choosing the learning matrix X, the third and fifth terms in equation 4.1 can be ignored for this purpose. From here on, we use the term SIC for referring to equation 4.1 without the third and fifth terms; that is, we define

    \mathrm{SIC}[X] = \langle KXy, Xy \rangle - 2 \langle KXy, X_u y \rangle + 2\sigma^2 \, \mathrm{tr}(X_u^\top K X).    (4.2)

Similarly, $J_1$ defined by equation 3.2 can be expressed as

    J_1[X] = E_\varepsilon \| \hat{f} \|_{\mathcal{H}}^2 - 2 E_\varepsilon \langle \hat{f}, f_S \rangle_{\mathcal{H}} + \| f_S \|_{\mathcal{H}}^2
           = E_\varepsilon \langle KXy, Xy \rangle - 2 E_\varepsilon \langle KXy, X_u z \rangle + \langle K X_u z, X_u z \rangle,    (4.3)

where z is the noiseless sample value vector defined by

    z = (f(x_1), f(x_2), \ldots, f(x_n))^\top.    (4.4)

Let us denote the first two terms in equation 4.3 by J:

    J[X] = E_\varepsilon \| \hat{f} \|_{\mathcal{H}}^2 - 2 E_\varepsilon \langle \hat{f}, f_S \rangle_{\mathcal{H}}
         = E_\varepsilon \langle KXy, Xy \rangle - 2 E_\varepsilon \langle KXy, X_u z \rangle.    (4.5)

Then it can be confirmed that for any learning matrix X, SIC given by equation 4.2 is an unbiased estimator of J:

    E_\varepsilon \, \mathrm{SIC}[X] = J[X].    (4.6)
4.2 The Regularized SIC. According to Tsuda et al. (2002), the instability of SIC is mainly caused by the large variance of the unbiased estimate $\hat{\alpha}_u$, which plays an essential role in the derivation of SIC (see section 3). In order to reduce the variance of SIC, Tsuda et al. (2002) proposed replacing the linear unbiased estimate $\hat{\alpha}_u$ by a linear regularized estimate $\hat{\alpha}_r$:

    \hat{\alpha}_r = X_r y.    (4.7)

Namely, the bias term $\| E_\varepsilon \hat{\alpha} - \alpha^* \|_K^2$ in equation 3.6 is roughly estimated by $\| \hat{\alpha} - \hat{\alpha}_u \|_K^2$ in the original SIC, while Tsuda et al. (2002) proposed estimating it by $\| \hat{\alpha} - \hat{\alpha}_r \|_K^2$ (see Figure 1). The regularized estimate $\hat{\alpha}_r$ is slightly biased,
[Figure 1: Basic idea of the regularized SIC (RSIC). The bias term $\| E_\varepsilon \hat{\alpha} - \alpha^* \|_K^2$ (depicted by the solid line) is roughly estimated by $\| \hat{\alpha} - \hat{\alpha}_r \|_K^2$ (depicted by the dotted line), where $\hat{\alpha}_r$ is a regularized estimate. The regularized estimate $\hat{\alpha}_r$ is slightly biased, so its expectation $E_\varepsilon \hat{\alpha}_r$ no longer agrees with the true parameter $\alpha^*$. On the other hand, the scatter of $\hat{\alpha}_r$ (denoted by the lighter circle) may be far smaller than that of the unbiased estimate $\hat{\alpha}_u$ (denoted by the darker circle).]
so its expectation $E_\varepsilon \hat{\alpha}_r$ no longer agrees with the true parameter $\alpha^*$. On the other hand, the scatter of $\hat{\alpha}_r$ may be far smaller than that of the unbiased estimate $\hat{\alpha}_u$. The learning matrix $X_r$ that provides the linear regularized estimate $\hat{\alpha}_r$ is given, for example, by

    X_r = (K^2 + \gamma I)^{-1} K,    (4.8)

where $\gamma$ is the regularization parameter that controls the degree of regularization in SIC. Note that the following discussions are valid for any learning matrix $X_r$, but we mainly focus on equation 4.8 for simplicity. We refer to SIC defined by equation 4.2 with $X_u$ replaced by $X_r$ as the regularized SIC (RSIC):

    \mathrm{RSIC}[X; X_r] = \langle KXy, Xy \rangle - 2 \langle KXy, X_r y \rangle + 2\sigma^2 \, \mathrm{tr}(X_r^\top K X),    (4.9)

where the notation $\mathrm{RSIC}[X; X_r]$ means that RSIC is a functional of a learning matrix X with a parameter matrix $X_r$. It was experimentally shown that this regularization approach works effectively for stabilizing SIC (Tsuda et al., 2002). However, the degree of regularization (e.g., the regularization parameter $\gamma$ in equation 4.8) should be appropriately determined, which is still an open problem. In the following, we propose a method to determine the degree of regularization of RSIC.
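As a concrete illustration (variable names are ours, not code from the article), RSIC with the regularized learning matrix of equation 4.8 can be computed as:

```python
import numpy as np

def rsic(K, X, y, sigma2, gamma):
    """Regularized SIC of equation 4.9, with the regularized learning
    matrix X_r = (K^2 + gamma I)^{-1} K of equation 4.8."""
    n = K.shape[0]
    Xr = np.linalg.solve(K @ K + gamma * np.eye(n), K)
    KXy = K @ (X @ y)
    return ((X @ y) @ KXy                           # <KXy, Xy>
            - 2 * (Xr @ y) @ KXy                    # -2 <KXy, X_r y>
            + 2 * sigma2 * np.trace(Xr.T @ K @ X))  # +2 sigma^2 tr(X_r^T K X)
```

For $\gamma \to 0$ (with K invertible), $X_r \to K^{-1} = K^\dagger$, so RSIC reduces to the SIC of equation 4.2; increasing $\gamma$ shrinks $X_r y$ toward zero, trading bias for variance.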
4.3 Expected Squared Error of RSIC. Let us define the expected squared error (ESE) between RSIC and J by

    \mathrm{ESE}_{\mathrm{RSIC}}[X_r; X] = E_\varepsilon \big( \mathrm{RSIC}[X; X_r] - J[X] \big)^2,    (4.10)

where the notation $\mathrm{ESE}_{\mathrm{RSIC}}[X_r; X]$ means that we treat $\mathrm{ESE}_{\mathrm{RSIC}}$ as a functional of the matrix $X_r$ with a parameter matrix X. In the following, we often omit $[X_r; X]$. Our aim is to determine $X_r$ in RSIC so that the above $\mathrm{ESE}_{\mathrm{RSIC}}$ is minimized. Similar to equation 3.6, $\mathrm{ESE}_{\mathrm{RSIC}}$ can be decomposed into bias and variance terms:

    \mathrm{ESE}_{\mathrm{RSIC}}[X_r; X] = \mathrm{Bias}^2_{\mathrm{RSIC}}[X_r; X] + \mathrm{Var}_{\mathrm{RSIC}}[X_r; X],    (4.11)

where

    \mathrm{Bias}_{\mathrm{RSIC}}[X_r; X] = E_\varepsilon \, \mathrm{RSIC}[X; X_r] - J[X],    (4.12)
    \mathrm{Var}_{\mathrm{RSIC}}[X_r; X] = E_\varepsilon \big( \mathrm{RSIC}[X; X_r] - E_\varepsilon \, \mathrm{RSIC}[X; X_r] \big)^2.    (4.13)

Note that the bias of SIC is zero (see equation 4.6), but there is no guarantee that the ESE of SIC is small since the variance of SIC can be large.

Let B and C be n-dimensional matrices defined by

    B = 2 X_u^\top K X - 2 X_r^\top K X,    (4.14)
    C = X^\top K X - 2 X_r^\top K X.    (4.15)

Then we have the following lemmas:

Lemma 1. $\mathrm{Bias}_{\mathrm{RSIC}}$ is expressed by

    \mathrm{Bias}_{\mathrm{RSIC}} = \langle Bz, z \rangle,    (4.16)

where z is defined by equation 4.4.

Lemma 2. Under the assumption that $\{\varepsilon_i\}_{i=1}^n$ are independently drawn from the normal distribution with mean zero and variance $\sigma^2$, $\mathrm{Var}_{\mathrm{RSIC}}$ is expressed by

    \mathrm{Var}_{\mathrm{RSIC}} = \sigma^2 \| (C + C^\top) z \|^2 + \sigma^4 \, \mathrm{tr}\big(C^2 + C^\top C\big).    (4.17)
Sketches of the proofs of all lemmas and theorems are given in the appendix. See the separate technical report (Sugiyama, Kawanabe, & Müller, 2003) for the complete proofs. Note that the normality of the noise is used only in lemma 2, not in lemma 1.
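Once B, C, z, and $\sigma^2$ are in hand, the two lemma expressions cost only a few matrix products to evaluate; a sketch (our variable names):

```python
import numpy as np

def bias_var_rsic(B, C, z, sigma2):
    """Bias of RSIC (equation 4.16) and variance of RSIC (equation 4.17),
    given the matrices B and C of equations 4.14 and 4.15 and the
    noiseless sample value vector z of equation 4.4."""
    bias = z @ B @ z                                   # <Bz, z>
    Cz = (C + C.T) @ z
    var = sigma2 * (Cz @ Cz) + sigma2 ** 2 * np.trace(C @ C + C.T @ C)
    return bias, var
```

Writing $S = (C + C^\top)/2$, equation 4.17 is the familiar variance of the gaussian quadratic form $y^\top C y$, namely $4\sigma^2 \|Sz\|^2 + 2\sigma^4 \mathrm{tr}(S^2)$, since $\mathrm{tr}(C^2 + C^\top C) = 2\,\mathrm{tr}(S^2)$.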
4.4 Estimating the Expected Squared Error of RSIC. In equations 4.16 and 4.17, the noiseless sample value vector z defined by equation 4.4 is unknown. Therefore, $\mathrm{Bias}_{\mathrm{RSIC}}$ and $\mathrm{Var}_{\mathrm{RSIC}}$ cannot be directly calculated in practice. Now let us define

    \widehat{\mathrm{Bias}}{}^2_{\mathrm{RSIC}}[X_r; X] = \langle By, y \rangle^2 - \sigma^2 \| (B + B^\top) y \|^2
        - 2\sigma^2 \, \mathrm{tr}(B) \langle By, y \rangle
        + \sigma^4 \, \mathrm{tr}\big(B^2 + B^\top B\big) + \sigma^4 \, \mathrm{tr}(B)^2,    (4.18)

    \widehat{\mathrm{Var}}_{\mathrm{RSIC}}[X_r; X] = \sigma^2 \| (C + C^\top) y \|^2 - \sigma^4 \, \mathrm{tr}\big(C^2 + C^\top C\big).    (4.19)

Then the following theorem holds:

Theorem 1. Under the assumption that $\{\varepsilon_i\}_{i=1}^n$ are independently drawn from the normal distribution with mean zero and variance $\sigma^2$, the following relations hold for any $X_r$ and X:

    E_\varepsilon \, \widehat{\mathrm{Bias}}{}^2_{\mathrm{RSIC}}[X_r; X] = \mathrm{Bias}^2_{\mathrm{RSIC}}[X_r; X],    (4.20)
    E_\varepsilon \, \widehat{\mathrm{Var}}_{\mathrm{RSIC}}[X_r; X] = \mathrm{Var}_{\mathrm{RSIC}}[X_r; X].    (4.21)

The above theorem shows that $\widehat{\mathrm{Bias}}{}^2_{\mathrm{RSIC}}$ and $\widehat{\mathrm{Var}}_{\mathrm{RSIC}}$ are unbiased estimators of $\mathrm{Bias}^2_{\mathrm{RSIC}}$ and $\mathrm{Var}_{\mathrm{RSIC}}$, respectively. Let us define

    \widehat{\mathrm{ESE}}_{\mathrm{RSIC}}[X_r; X] = \widehat{\mathrm{Bias}}{}^2_{\mathrm{RSIC}}[X_r; X] + \widehat{\mathrm{Var}}_{\mathrm{RSIC}}[X_r; X].    (4.22)

Then, from theorem 1, we immediately have the following corollary:

Corollary 1. Under the assumption that $\{\varepsilon_i\}_{i=1}^n$ are independently drawn from the normal distribution with mean zero and variance $\sigma^2$, the following relation holds for any $X_r$ and X:

    E_\varepsilon \, \widehat{\mathrm{ESE}}_{\mathrm{RSIC}}[X_r; X] = \mathrm{ESE}_{\mathrm{RSIC}}[X_r; X].    (4.23)
Corollary 1 shows that $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ defined by equation 4.22 is an unbiased estimator of $\mathrm{ESE}_{\mathrm{RSIC}}$. Based on this corollary, we propose using $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}[X_r; X]$ for determining the degree of regularization of RSIC; that is, $X_r$ is determined such that $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}[X_r; X]$ is minimized. Note that $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}[X_r; X]$ depends on the learning matrix X, so $X_r$ is individually optimized for each X. For example, when X and $X_r$ are both obtained by ridge regression,[8] RSIC is treated as a function of $\lambda$ with a tuning parameter $\gamma$, and $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ is treated as a

[8] X is given by equation 2.8, and $X_r$ is given by equation 4.8.
function of $\gamma$ that depends on $\lambda$. The regularization parameter $\gamma$ in RSIC is determined for each ridge parameter $\lambda$ such that $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ is minimized, and then $\lambda$ is determined such that RSIC is minimized:

    \hat{\lambda}_{\mathrm{RSIC}} = \mathop{\mathrm{argmin}}_{\lambda} \, \mathrm{RSIC}(\lambda; \hat{\gamma}_\lambda),    (4.24)

where

    \hat{\gamma}_\lambda = \mathop{\mathrm{argmin}}_{\gamma} \, \widehat{\mathrm{ESE}}_{\mathrm{RSIC}}(\gamma; \lambda).    (4.25)
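Putting equations 4.9, 4.18, 4.19, 4.24, and 4.25 together, the two-stage selection can be sketched as a nested grid search. This is only an illustrative implementation under our assumptions (ridge learning matrices of the form $(K^2 + \lambda I)^{-1} K$, analogous to equation 4.8, and the noise variance estimated by equation 3.8); the names are ours:

```python
import numpy as np

def select_ridge_parameter(K, y, lambdas, gammas):
    """Two-stage selection of equations 4.24 and 4.25 (a sketch).

    For each candidate ridge parameter lambda, the parameter gamma of
    RSIC is chosen to minimize the ESE estimate of equation 4.22;
    lambda is then chosen to minimize RSIC itself.
    """
    n = K.shape[0]
    I = np.eye(n)
    Xu = np.linalg.pinv(K)                                 # equation 3.11

    def ese_hat(X, Xr, s2):
        B = 2 * Xu.T @ K @ X - 2 * Xr.T @ K @ X            # equation 4.14
        C = X.T @ K @ X - 2 * Xr.T @ K @ X                 # equation 4.15
        bBy = y @ B @ y
        bias2 = (bBy ** 2 - s2 * np.sum(((B + B.T) @ y) ** 2)
                 - 2 * s2 * np.trace(B) * bBy
                 + s2 ** 2 * np.trace(B @ B + B.T @ B)
                 + s2 ** 2 * np.trace(B) ** 2)             # equation 4.18
        var = (s2 * np.sum(((C + C.T) @ y) ** 2)
               - s2 ** 2 * np.trace(C @ C + C.T @ C))      # equation 4.19
        return bias2 + var                                 # equation 4.22

    def rsic(X, Xr, s2):                                   # equation 4.9
        KXy = K @ (X @ y)
        return ((X @ y) @ KXy - 2 * (Xr @ y) @ KXy
                + 2 * s2 * np.trace(Xr.T @ K @ X))

    best_lam, best_score = None, np.inf
    for lam in lambdas:
        X = np.linalg.solve(K @ K + lam * I, K)            # ridge learning matrix
        s2 = np.sum((K @ X @ y - y) ** 2) / (n - np.trace(K @ X))  # equation 3.8
        Xr = min((np.linalg.solve(K @ K + g * I, K) for g in gammas),
                 key=lambda M: ese_hat(X, M, s2))          # equation 4.25
        score = rsic(X, Xr, s2)
        if score < best_score:                             # equation 4.24
            best_lam, best_score = lam, score
    return best_lam
```

Note that $\gamma$ is re-optimized for every candidate $\lambda$, matching the per-$\lambda$ optimization described above.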
When the noise variance $\sigma^2$ is unknown, it can be estimated, for example, by equation 3.8.

5 Computer Simulations

In this section, the effectiveness of the proposed generalization error estimation method is investigated through computer simulations.

5.1 Illustrative Examples. First, a simple artificial simulation shows how the proposed method works.[9]

5.1.1 Setting. For illustration purposes, let the dimension d of the input vector be 1. We use the gaussian RKHS with width c = 1, which may be one of the standard RKHSs (see, e.g., Vapnik, 1998; Schölkopf, Smola, Williamson, & Bartlett, 2000):

    K(x, x') = \exp\left( -\frac{(x - x')^2}{2 c^2} \right).    (5.1)

We use $f(x) = \mathrm{sinc}(x)$ as the learning target function (see Figure 2), which is often used as an illustrative regression example (e.g., Vapnik, 1998; Schölkopf et al., 2000). Note that the above sinc function is included in the gaussian RKHS.[10]

[9] Because of space limitations, we describe the results only briefly here. For extensive discussions, see Sugiyama et al. (2003).

[10] As described in Smola et al. (1998) and Girosi (1998), the gaussian RKHS is spanned by the functions $f(x)$ that belong to $L_2(\mathbb{R})$ and satisfy

    \int_{-\infty}^{\infty} \frac{|\tilde{f}(\omega)|^2}{\tilde{k}(\omega)} \, d\omega < \infty,

where $\tilde{f}(\omega)$ is the Fourier transform of the function $f(x)$ and $\tilde{k}(\omega)$ is the Fourier transform of $\exp(-x^2/(2c^2))$. The sinc function belongs to $L_2(\mathbb{R})$, and its Fourier transform is zero for $|\omega| > \pi$. Therefore, the above conditions are fulfilled, so the sinc function is included in the gaussian RKHS.
[Figure 2: Learning target function and 50 training examples with noise variance $\sigma^2 = 0.09$.]
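The toy setting (sinc target, gaussian kernel of width c = 1, uniform sample points) is easy to reproduce. In the sketch below (our variable names), the ridge learning matrix of equation 2.8 is assumed to have the form $X_\lambda = (K^2 + \lambda I)^{-1} K$, analogous to equation 4.8; note that NumPy's `sinc` is the normalized $\sin(\pi t)/(\pi t)$, so $\sin(x)/x$ is obtained as `np.sinc(x / np.pi)`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, sigma2 = 50, 1.0, 0.09

x = rng.uniform(-np.pi, np.pi, n)            # training sample points
z = np.sinc(x / np.pi)                       # sinc(x) = sin(x)/x, noiseless values
y = z + np.sqrt(sigma2) * rng.normal(size=n)

# Gaussian kernel Gram matrix, equation 5.1
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * c ** 2))

lam = 0.1
X_lam = np.linalg.solve(K @ K + lam * np.eye(n), K)   # assumed ridge learning matrix
alpha = X_lam @ y                                     # learned coefficients

def f_hat(t):
    """Learned kernel regression function (kernel expansion over the x_i)."""
    return np.exp(-(t[:, None] - x[None, :]) ** 2 / (2 * c ** 2)) @ alpha
```

Evaluating `f_hat` at the training points reduces to `K @ alpha`, which is the fitted sample value vector used throughout the criteria above.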
The sample points $\{x_i\}_{i=1}^n$ are independently drawn from the uniform distribution on $(-\pi, \pi)$. The sample values $\{y_i\}_{i=1}^n$ are created as $y_i = f(x_i) + \varepsilon_i$, where the noise $\{\varepsilon_i\}_{i=1}^n$ are independently drawn from the normal distribution with mean zero and variance $\sigma^2$. We consider the following four cases for the number n of training examples and the noise variance $\sigma^2$:

    (n, \sigma^2) = (100, 0.01), (100, 0.09), (50, 0.01), (50, 0.09),    (5.2)

that is, we investigate the cases with small or large noise levels and small or large samples. An example of the training set is illustrated in Figure 2. The simulations are repeated 100 times for each $(n, \sigma^2)$ in equation 5.2, randomly drawing the sample points $\{x_i\}_{i=1}^n$ and noise $\{\varepsilon_i\}_{i=1}^n$ from scratch in each trial. Note that in theory, we fix the training sample points $\{x_i\}_{i=1}^n$ and change only the noise $\{\varepsilon_i\}_{i=1}^n$ (see section 2). However, in this experiment, we change both the training sample points and the noise because we would like to investigate whether the proposed method works regardless of the choice of the training set. We use the kernel regression model, equation 2.3, and the parameters $\{\alpha_i\}_{i=1}^n$ in the model are learned by ridge regression; that is, the learning matrix is given by equation 2.8.

5.1.2 Investigating Generalization Error Estimation Performance. First, we illustrate how SIC and RSIC work in generalization error estimation. The precision of SIC and RSIC is investigated as a function of the ridge parameter
$\lambda$, using the following values:

    \lambda \in \{10^{-3}, 10^{-2.5}, 10^{-2}, \ldots, 10^{3}\}.    (5.3)

When the ridge regression, equation 2.8, is used, it holds that $K^\top = K$, $X^\top = X$, and $K^\dagger K X = X$. Therefore, SIC given by equation 4.2 can be expressed in the following simpler form,

    \mathrm{SIC}(\lambda) = \langle X_\lambda K X_\lambda y, y \rangle - 2 \langle X_\lambda y, y \rangle + 2\sigma^2 \, \mathrm{tr}(X_\lambda),    (5.4)

where $X_\lambda$ denotes the learning matrix, equation 2.8, with ridge parameter $\lambda$. We calculate SIC by the above simpler form, where the noise variance $\sigma^2$ is estimated by

    \hat{\sigma}^2_\lambda = \frac{\| K X_\lambda y - y \|^2}{n - \mathrm{tr}(K X_\lambda)}.    (5.5)

RSIC is calculated by equation 4.9, where the ridge form, equation 4.8, is used for obtaining the regularized estimator $\hat{\alpha}_r$. The regularization parameter $\gamma$ in RSIC is determined so that $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}(\gamma; \lambda)$ is minimized (see equation 4.22). Note that the optimization of $\gamma$ is carried out individually for each $\lambda$ in equation 5.3. The regularization parameter $\gamma$ is selected from $\{10^{-3}, 10^{-2.5}, 10^{-2}, \ldots, 10^{3}\}$. The noise variance $\sigma^2$ in RSIC and $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ is estimated by equation 5.5.

In this experiment, we measure the generalization error by the following criterion, which is equivalent to J without the expectation $E_\varepsilon$ (see equation 4.5):

    \mathrm{Error}(\lambda) = \| \hat{f}_\lambda \|_{\mathcal{H}}^2 - 2 \langle \hat{f}_\lambda, f_S \rangle_{\mathcal{H}} = \langle K X_\lambda y, X_\lambda y \rangle - 2 \langle X_\lambda y, z \rangle,    (5.6)

where $\hat{f}_\lambda$ denotes the learned function with ridge parameter $\lambda$.

Figure 3 displays the values of $\mathrm{Error}(\lambda)$, $\mathrm{SIC}(\lambda)$, and $\mathrm{RSIC}(\lambda)$ as a function of the ridge parameter $\lambda$ for each $(n, \sigma^2)$ in equation 5.2. The horizontal axis denotes the value of $\lambda$ in log scale. From the top, the graphs denote the mean Error with error bars, the mean SIC with error bars, and the mean RSIC with error bars. The mean is taken over 100 trials, and the error bars denote the standard deviation over 100 trials. In order to compare the mean curves clearly, the mean Error is also drawn by the dashed line in the bottom two graphs. Figure 4 depicts the values of ESE (see equation 4.10), Bias$^2$ (see equation 4.12), and Var (see equation 4.13) of SIC and RSIC as a function of the ridge parameter $\lambda$. Note that in this simulation, the expectation over the
[Figure 3: Values of $\mathrm{Error}(\lambda)$, $\mathrm{SIC}(\lambda)$, and $\mathrm{RSIC}(\lambda)$ for (A) $(n, \sigma^2) = (100, 0.01)$ and (B) $(n, \sigma^2) = (100, 0.09)$. The horizontal axis denotes the value of $\lambda$ in log scale. From top, the graphs denote the mean Error with error bars, the mean SIC with error bars, and the mean RSIC with error bars. Dashed curves in the bottom two graphs are the mean Error (same as the curve in the top graph).]
[Figure 4: Values of ESE (see equation 4.10), Bias$^2$ (see equation 4.12), and Var (see equation 4.13) for SIC and RSIC, for (A) $(n, \sigma^2) = (100, 0.01)$ and (B) $(n, \sigma^2) = (100, 0.09)$. The horizontal axis denotes the value of $\lambda$ in log scale.]
noise included in the definitions of ESE, Bias, and Var is replaced by the mean over 100 trials, where both the training sample points $\{x_i\}_{i=1}^n$ and the noise $\{\varepsilon_i\}_{i=1}^n$ are changed.

When $(n, \sigma^2) = (100, 0.01)$, the left graphs in Figure 3 show that the mean SIC captures the mean Error very well and that the size of the error bars looks reasonable. The mean RSIC looks almost the same as the mean SIC for medium to large $\lambda$, but the mean RSIC is slightly overestimated for small $\lambda$. In exchange, the error bars of RSIC are slightly smaller than those of SIC for small $\lambda$. Indeed, the left graphs in Figure 4 show that for small $\lambda$, $\mathrm{Bias}^2_{\mathrm{RSIC}}$ is slightly larger than $\mathrm{Bias}^2_{\mathrm{SIC}}$ but $\mathrm{Var}_{\mathrm{RSIC}}$ is slightly smaller than $\mathrm{Var}_{\mathrm{SIC}}$. Consequently, $\mathrm{ESE}_{\mathrm{RSIC}}$ and $\mathrm{ESE}_{\mathrm{SIC}}$ are comparable.

When $(n, \sigma^2) = (100, 0.09)$, the right graphs in Figure 3 show that the mean SIC still captures the mean Error very well. However, the size of the error bars is rather large for small $\lambda$. In contrast, the error bars of RSIC are compressed for small $\lambda$, in exchange for the slight overestimation of the mean RSIC there. Indeed, the right graphs in Figure 4 show that while the variance is largely suppressed for small $\lambda$, the increase in the squared bias is relatively small. As a result, the ESE is much improved for small $\lambda$, and it stays almost the same for medium to large $\lambda$.

When the number n of training examples is 50, all the results are almost identical to the case with n = 100. For this reason, we omit the graphs. The above simulation results show that RSIC with $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ maintains the good performance of SIC when the noise level is low, and it greatly improves the precision over SIC when the noise level is high. Furthermore, it is notable that the simulation results are almost unchanged even when the number of training examples is decreased. This may be a useful property in practice.

5.1.3 Investigating Model Selection Performance. Now we illustrate how SIC and RSIC work in model selection. We choose the ridge parameter $\lambda$ from equation 5.3 so that SIC or RSIC is minimized. The goodness of the selected ridge parameter is again evaluated by the Error from equation 5.6. Figure 5 depicts the values of Error obtained with the ridge parameter selected based on SIC or RSIC. The box plot notation specifies marks at the 5, 25, 50, 75, and 95 percentiles from bottom. OPT indicates the optimal choice of the ridge parameter; we actually calculated Error for each $\lambda$ in equation 5.3 and selected the one that minimizes Error. Note that the values of Error can be negative since a positive constant is ignored in the definition of Error, equation 5.6 (cf. equation 2.10).

When $(n, \sigma^2) = (100, 0.01)$, the error obtained by RSIC is comparable to that of SIC (see the left plot in Figure 5); this fact is also confirmed by the 95% t-test (see, e.g., Henkel, 1979). When $(n, \sigma^2) = (100, 0.09)$, the distributions of the error obtained by SIC and RSIC are comparable at the 5th, 25th, and 50th percentiles, but RSIC improves the 75th and 95th percentiles over SIC (see the right plot in Figure 5). The t-test says that RSIC surely improves over SIC. When the number n of training examples is 50, all the results are again similar to
[Figure 5: Box plots of Error obtained with the ridge parameter selected based on SIC or RSIC, for (A) $(n, \sigma^2) = (100, 0.01)$ and (B) $(n, \sigma^2) = (100, 0.09)$. The box plot notation specifies marks at the 5, 25, 50, 75, and 95 percentiles of values from bottom. OPT indicates the optimal choice of the ridge parameter. Note that the values of Error can be negative since a positive constant is ignored.]
the case with n = 100 (although the improvement of RSIC over SIC is not statistically significant when $(n, \sigma^2) = (50, 0.09)$). For this reason, we omit the plots. The above model selection results show that RSIC and SIC perform similarly when the noise level is low, and RSIC works better than SIC when the noise level is high. In particular, RSIC mostly improves the higher percentiles of the obtained error (see Figure 5), from which we conjecture that RSIC is a model selection criterion robust against "wicked" training sets.

5.2 Real Data Sets. In section 5.1, we found that RSIC works well for a very simple artificial data set. Here we apply RSIC to real data sets and evaluate whether this good property carries over to practical problems. We use 10 practical data sets provided by DELVE (Rasmussen et al., 1996): Abalone, Boston, Bank-8fm, Bank-8nm, Bank-8fh, Bank-8nh, Kin-8fm, Kin-8nm, Kin-8fh, and Kin-8nh.

The Abalone data set contains 4177 samples, each of which consists of nine physical measurements. The task is to estimate the last attribute (the age of abalones) from the rest. The first attribute is qualitative (male/female/infant), so it is ignored; 7-dimensional input and 1-dimensional output data are used. The Boston data set contains 506 samples with 13-dimensional input and 1-dimensional output data. The Bank data family consists of four data sets. They are labeled fm, nm, fh, and nh, where f or n signifies "fairly linear" or "nonlinear," respectively, and m or h signifies "medium unpredictability/noise" or "high unpredictability/noise," respectively. Each of the four data sets contains 8192 samples, consisting of 8-dimensional input
and 1-dimensional output data. The Kin data family also consists of four data sets labeled fm, nm, fh, and nh. Each of the four data sets has 8192 samples, consisting of 8-dimensional input and 1-dimensional output data. For convenience, every attribute is normalized to [0, 1]. One hundred randomly selected samples $\{(x_i, y_i)\}_{i=1}^{100}$ are used for training.

For the real data sets, we cannot measure the generalization error by equation 5.6 since neither the true function f nor its projection $f_S$ is known. Instead, we evaluate the performance by the mean squared test error defined by

    \mathrm{Test\ Error} = \frac{1}{n'} \sum_{i=1}^{n'} \big( \hat{f}(x'_i) - y'_i \big)^2,    (5.7)

where $\{(x'_i, y'_i)\}_{i=1}^{n'}$ denote the test samples that are not used for training. A gaussian kernel with width c = 1 is again employed (see equation 5.1), and the kernel regression model, equation 2.3, with ridge regression, equation 2.8, is used for learning. The ridge parameter $\lambda$ is selected from

    \lambda \in \{10^{-3}, 10^{-2}, 10^{-1}, \ldots, 10^{3}\}.    (5.8)

As ridge parameter selection strategies, we compare SIC, RSIC, leave-one-out cross-validation (CV),[11] and an empirical Bayesian method (EB) (Akaike, 1980). SIC is calculated by equation 5.4, where the noise variance $\sigma^2$ is estimated by equation 5.5. For each $\lambda$ in equation 5.8, the regularization parameter $\gamma$ in RSIC is chosen from $\{10^{-3}, 10^{-2}, 10^{-1}, \ldots, 10^{3}\}$ so that $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ is minimized. The noise variance $\sigma^2$ in RSIC and $\widehat{\mathrm{ESE}}_{\mathrm{RSIC}}$ is estimated by equation 5.5.

The simulation is repeated 100 times, randomly selecting the training set $\{(x_i, y_i)\}_{i=1}^{100}$ from scratch in each trial (i.e., sampling without replacement). Note that the test set $\{(x'_i, y'_i)\}_{i=1}^{n'}$ also varies in each trial. Simulation results are summarized in Table 1. The table describes the normalized mean test errors and their standard deviations, where the values of the test error are normalized so that the mean test error obtained by the optimal ridge parameter is 1. The results of the best method and of all other methods with no significant difference (95% t-test) are set in italics. The results show that RSIC gives the best or comparable results for 8 of the 10 data sets. It is interesting to note that RSIC outperforms SIC for the data sets with high noise (Bank-8fh, Bank-8nh, Kin-8fh, and Kin-8nh), while

[11] For the kernel regression model, equation 2.3, there are two possibilities for calculating the leave-one-out error. One is to use the full kernel regression model with n kernels all through the leave-one-out procedure; when one sample is left out, the corresponding kernel function is kept. The other is to use the reduced kernel regression model with n − 1 kernels; when one sample is left out, the corresponding kernel function is also left out. We took the former standpoint and used the closed formula for calculating the leave-one-out error (see, e.g., Wahba, 1990; Orr, 1996).
[Table 1: Normalized mean test errors and their standard deviations for SIC, RSIC, leave-one-out cross-validation, and empirical Bayes on the Abalone, Boston, Bank-8fm/nm/fh/nh, and Kin-8fm/nm/fh/nh data sets. Italics denote the results of the best method and of all other methods with no significant difference (95% t-test).]
RSIC gives fairly comparable results to SIC for the data sets with medium noise (Bank-8nm, Kin-8fm, and Kin-8nm). Therefore, RSIC can improve the degraded performance of SIC in the high-noise cases, and it tends to maintain the good performance of SIC in the medium-noise cases. In theory, we assumed that the noise $\{\varepsilon_i\}_{i=1}^n$ are independently drawn from a normal distribution with mean zero and common variance. This assumption may not be fulfilled in the DELVE data sets, which suggests that when using RSIC in practice, the assumption on the noise does not have to be rigorously satisfied. Compared with CV and EB, RSIC is comparable or better for most of the data sets. From the above experimental results, we conjecture that RSIC can be regarded as a practical model selection criterion for choosing the ridge parameter.

Finally, we compare our results with $\varepsilon$-support vector regression ($\varepsilon$-SVR) (Vapnik, 1998; Schölkopf & Smola, 2002), which has recently become one of the most popular regression algorithms. In SVR, we used the same gaussian kernel with width c = 1 (see equation 5.1). The regularization parameter C and the tube width $\varepsilon$ in SVR are chosen from a wide range of values using 10-fold cross-validation. We obtained the solutions of SVR with the SVMlight package (Joachims, 1999). The simulation results are described in Table 2, where the results of the significantly better method (95% t-test) are set in boldface. The table shows that SVR works well for the Boston, Bank-8nh, Kin-8fh, and Kin-8nh data sets (although the 95% t-test does not say that they are significantly different from the results of ridge regression with RSIC), and it tends
Table 2: Normalized Mean Test Errors and Their Standard Deviations for the Ridge Regression with RSIC and the Support Vector Regression with 10-Fold Cross-Validation.

    Data        Ridge+RSIC         SVR+10CV
    Abalone     1.015 ± 0.045      1.096 ± 0.118
    Boston      1.000 ± 0.218      0.955 ± 0.198
    Bank-8fm    1.013 ± 0.071      1.217 ± 0.168
    Bank-8nm    1.034 ± 0.100      1.157 ± 0.148
    Bank-8fh    1.037 ± 0.097      1.095 ± 0.127
    Bank-8nh    1.008 ± 0.056      0.988 ± 0.104
    Kin-8fm     1.000 ± 0.077      1.059 ± 0.143
    Kin-8nm     1.006 ± 0.056      1.056 ± 0.103
    Kin-8fh     1.022 ± 0.061      1.020 ± 0.091
    Kin-8nh     1.077 ± 0.091      1.078 ± 0.101

Note: Boldface denotes the results of the significantly better method (95% t-test).
to give larger errors for the other data sets. Given that the Boston, Bank-8nh, and Kin-8fh data sets may include large noise, the $\varepsilon$-insensitive loss seems to be more robust in such large-noise cases (cf. Müller et al., 1998). However, SVR tends to give large errors for the data sets that include small noise (Bank-8fm, Bank-8nm, Kin-8fm, and Kin-8nm). Therefore, the $\varepsilon$-insensitive loss is not as effective as the squared loss in the medium- and small-noise cases considered in the table (see also Müller et al., 1998; Schölkopf & Smola, 2002). Note that the main difference between ridge regression and SVR is the loss function: ridge regression uses the squared loss (see equation 2.7), while SVR uses the $\varepsilon$-insensitive loss. Which one is advantageous certainly depends on the noise type inherent to the data-generating process. Note also that the computation time for ridge regression with RSIC is shorter than that for SVR with cross-validation because the latter requires retraining.[12] For this reason, we consider ridge regression with RSIC to be advantageous in practice.
[12] Note that retraining is not needed for ridge regression with leave-one-out cross-validation because the leave-one-out error can be calculated analytically (see, e.g., Wahba, 1990; Orr, 1996).
6 Conclusions and Outlook

In this article, we have proposed using Stein's idea in the context of model selection: we suggested that the use of a biased estimator, for example, by means of regularization, can yield a more stable and robust, and thus better, estimator of the generalization error than its unbiased counterpart. Thus, we sacrificed unbiasedness for the sake of variance reduction in a model selection criterion by actively optimizing and balancing this bias-variance trade-off. This general idea was applied to a particular criterion: we regularized the unbiased estimator of the expected generalization error called the subspace information criterion (SIC). Our approach was to directly estimate the expected squared error between the generalization error estimator and the expected generalization error, and to determine the degree of regularization in the regularized SIC (RSIC) such that the estimator of the expected squared error is minimized. Computer simulations with artificial and real data sets showed that our approach indeed contributes to obtaining a more precise estimator of the expected generalization error, and that it can be successfully applied to ridge parameter selection.

We focused on the case where SIC is regularized by $X_r$ given by equation 4.8. However, the proposed method for determining the degree of regularization of RSIC is valid for any type of regularization; the estimator of the expected squared error given by equation 4.22 does not depend on the form of $X_r$. Finding improved ways of regularization, in particular using domain knowledge, is left to future exploration. Furthermore, it would be interesting to extend the current framework so that efficient nonlinear estimators such as the LASSO (Tibshirani, 1996) can be dealt with.

In equation 4.22, we gave an unbiased estimator of the expected squared error between RSIC and the expected generalization error.
The simulation results reported in section 5 showed that the unbiased estimator of the expected squared error contributes beneficially to stabilizing SIC. However, the unbiased estimator of the expected squared error can again have large variance because of its unbiasedness (see the experimental results reported in Sugiyama et al., 2003, for details). One promising future direction is to improve the unbiased estimator of the expected squared error so as to enhance the precision of RSIC further.

The theoretical discussion in section 4 does not include an analysis of the estimation of the noise variance σ². In the simulations with artificial data sets (see section 5.1), the influence of estimating the noise variance σ² appears unproblematic, because the unbiasedness of SIC is almost preserved and RSIC can therefore improve the precision over SIC. It remains open whether this property can be shown to hold in general. It is therefore an important further step to investigate the influence of the noise variance estimation more formally.

Our previous work (Sugiyama & Müller, 2002) showed that a linear unbiased estimate of the projection f_S exists if and only if the regression model
is included in the span of {K(x, x_i)}_{i=1}^n. For this reason, we chose to use the kernel regression model given by equation 2.3. However, due to this fact, the SIC given in Sugiyama and Müller (2002) cannot be used for selecting the kernel parameters (e.g., the kernel width). In RSIC, the linear unbiased estimate of the projection f_S no longer appears explicitly in the definition (see equation 4.9). Therefore, in principle, RSIC could be applied to regression models that are not included in the span of {K(x, x_i)}_{i=1}^n, such as models with a different kernel width. However, we still use the linear unbiased estimate of the projection f_S for determining the degree of regularization in RSIC (see section 4.3). It would therefore be interesting to devise other methods for determining the degree of regularization in RSIC that do not use the linear unbiased estimate of the projection f_S, so as to enable an optimization of even the kernel parameters by RSIC.

In this article, we pursued a better estimator of the generalization error. Another important issue in model selection research is to investigate model selection performance. For several model selection criteria, such as Mallows's C_L (Mallows, 1964, 1973) and generalized cross-validation (Craven & Wahba, 1979; Wahba, 1990), the asymptotic optimality of the choice of the model has been investigated thoroughly (Craven & Wahba, 1979; Wahba, 1985; Li, 1986). It will be instructive to see whether similar discussions can be carried out for SIC and RSIC.

Finally, another future direction is to apply our general idea of stabilizing model selection criteria to other existing criteria. For example, the leave-one-out error is known to be an almost unbiased estimate of the expected generalization error (Luntz & Brailovsky, 1969; see also Schölkopf & Smola, 2002), but it can have a large variance. For this reason, it is often recommended to use 5- or 10-fold cross validation (i.e., to divide the training set into 5 or 10 disjoint sets).
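For concreteness, k-fold cross validation as referred to above can be sketched as follows (our own minimal numpy illustration for a ridge regressor with a fixed ridge parameter; the data and names are hypothetical):

```python
import numpy as np

def kfold_cv_error(X, y, lam, k, rng):
    """k-fold cross-validation estimate of the generalization error of
    ridge regression with a fixed ridge parameter lam."""
    n, d = X.shape
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)             # k disjoint validation sets
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(d),
                            X[train].T @ y[train])
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)
for k in (2, 5, 10, n):                        # k = n is leave-one-out
    print(k, kfold_cv_error(X, y, lam=1.0, k=k, rng=rng))
```

Small k trains on less data (more bias in the estimate); large k yields strongly overlapping training sets (more variance): this is the fold-count trade-off discussed in the text.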
However, the number of folds in cross validation actually controls the trade-off between the bias and the variance of the cross-validation estimate of the expected generalization error. For this reason, it is highly important to determine the number of folds in cross validation so that the expected squared error between the cross-validation estimate and the expected generalization error is minimized. We conjecture that the approach taken in this article can also play an important role in this challenging problem.

Appendix A: Sketch of Proof of Lemma 1

It follows from equation 4.9 that E_\varepsilon \mathrm{RSIC} is expressed as

    E_\varepsilon \mathrm{RSIC} = \langle X^\top K X z, z \rangle + \sigma^2 \,\mathrm{tr}(X^\top K X) - 2 \langle X_r^\top K X z, z \rangle.   (A.1)
Similarly, it follows from equation 4.5 that J is expressed as

    J = \langle X^\top K X z, z \rangle + \sigma^2 \,\mathrm{tr}(X^\top K X) - 2 \langle X_u^\top K X z, z \rangle,   (A.2)
where only the third term is different from equation A.1. Then \mathrm{Bias}_{\mathrm{RSIC}} is expressed as

    \mathrm{Bias}_{\mathrm{RSIC}} = \langle (2 X_u^\top K X - 2 X_r^\top K X) z, z \rangle.   (A.3)
Equations A.3 and 4.14 yield equation 4.16.

Appendix B: Sketch of Proof of Lemma 2

Let ε be the noise vector defined by

    \varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^\top.   (B.1)

Then \mathrm{Var}_{\mathrm{RSIC}} is expressed as

    \mathrm{Var}_{\mathrm{RSIC}} = \sigma^2 \|(C + C^\top) z\|^2 + E_\varepsilon \langle C\varepsilon, \varepsilon \rangle^2 + \sigma^4 (\mathrm{tr}\, C)^2 + 2 E_\varepsilon \langle (C + C^\top) z, \varepsilon \rangle \langle C\varepsilon, \varepsilon \rangle - 2 \sigma^4 (\mathrm{tr}\, C)^2.   (B.2)

On the other hand, it holds that

    E_\varepsilon \langle C\varepsilon, \varepsilon \rangle^2 = E_\varepsilon \sum_{i,j,k,l=1}^n C_{i,j} C_{k,l} \varepsilon_i \varepsilon_j \varepsilon_k \varepsilon_l,   (B.3)

    E_\varepsilon \langle (C + C^\top) z, \varepsilon \rangle \langle C\varepsilon, \varepsilon \rangle = E_\varepsilon \sum_{i,j,k,l=1}^n (C_{i,j} + C_{j,i}) C_{k,l} z_i \varepsilon_j \varepsilon_k \varepsilon_l,   (B.4)

where C_{i,j} denotes the (i, j)th element of C. It is known that when the random variable ε_i is drawn from the normal distribution with mean zero and variance σ², it holds that E_\varepsilon \varepsilon_i^3 = 0 and E_\varepsilon \varepsilon_i^4 = 3\sigma^4 (e.g., Lehmann, 1983). These facts imply that all terms in E_\varepsilon \sum_{i,j,k,l=1}^n C_{i,j} C_{k,l} \varepsilon_i \varepsilon_j \varepsilon_k \varepsilon_l vanish except in four cases: i = j = k = l, i = j ≠ k = l, i = k ≠ j = l, and i = l ≠ j = k. Therefore, we have

    E_\varepsilon \langle C\varepsilon, \varepsilon \rangle^2 = \sigma^4 (\mathrm{tr}\, C)^2 + \sigma^4 \,\mathrm{tr}(C^\top C) + \sigma^4 \,\mathrm{tr}(C^2).   (B.5)

Similarly, all terms in \sum_{i,j,k,l=1}^n (C_{i,j} + C_{j,i}) C_{k,l} z_i \varepsilon_j \varepsilon_k \varepsilon_l vanish:

    E_\varepsilon \langle (C + C^\top) z, \varepsilon \rangle \langle C\varepsilon, \varepsilon \rangle = 0.   (B.6)

Substituting equations B.5 and B.6 into equation B.2, we obtain equation 4.17.
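The gaussian fourth-moment identity underlying equation B.5 can be checked numerically. The following sketch (ours, with an arbitrary matrix C) compares a Monte Carlo estimate of E_ε⟨Cε, ε⟩² against the closed form σ⁴[(tr C)² + tr(CᵀC) + tr(C²)]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 4, 1.5
C = rng.normal(size=(n, n))                   # arbitrary (not symmetric) matrix

# Monte Carlo estimate of E[<C eps, eps>^2] for eps ~ N(0, sigma^2 I)
eps = sigma * rng.normal(size=(200_000, n))
quad = np.einsum('ti,ij,tj->t', eps, C, eps)  # <C eps, eps> for each sample
mc = np.mean(quad ** 2)

# Closed form implied by E[eps_i^3] = 0 and E[eps_i^4] = 3 sigma^4
exact = sigma**4 * (np.trace(C)**2 + np.trace(C.T @ C) + np.trace(C @ C))
print(mc, exact)                              # close, up to Monte Carlo error
```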
Appendix C: Sketch of Proof of Theorem 1

It holds that

    E_\varepsilon \langle B y, y \rangle^2 = \mathrm{Bias}_{\mathrm{RSIC}}^2 + \sigma^2 E_\varepsilon \|(B + B^\top) y\|^2 + 2 \sigma^2 \,\mathrm{tr}(B)\, E_\varepsilon \langle B y, y \rangle - \sigma^4 \,\mathrm{tr}(B^2 + B^\top B) - \sigma^4 (\mathrm{tr}\, B)^2,   (C.1)

from which we have equation 4.20. Similarly, it holds that

    \mathrm{Var}_{\mathrm{RSIC}} = E_\varepsilon \left( \sigma^2 \|(C + C^\top) y\|^2 \right) - \sigma^4 \,\mathrm{tr}(C^2 + C^\top C),   (C.2)

from which we have equation 4.21.

Acknowledgments

We thank Stefan Harmeling, Pavel Laskov, Koji Tsuda, Hidemitsu Ogawa, Hidetoshi Shimodaira, and Takafumi Kanamori for their valuable comments. Special thanks also go to Satoru Tanaka for finding an error in the proof. We are also grateful to the anonymous reviewers; their comments greatly helped us to improve the readability of this article. M.S. thanks Fraunhofer FIRST for warm hospitality during his research stay in Berlin. M.S. also acknowledges partial financial support from MEXT, Grants-in-Aid for Scientific Research 14380158 and 14780262, and the Alexander von Humboldt Foundation. K.-R.M. acknowledges the BMBF under contract FKZ 01IBB02A.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6), 716–723.
Akaike, H. (1980). Likelihood and the Bayes procedure. In N. J. Bernardo, M. H. DeGroot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian statistics (pp. 141–166). Valencia: University Press.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bergman, S. (1970). The kernel function and conformal mapping. Providence, RI: American Mathematical Society.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
Cherkassky, V., Shao, X., Mulier, F. M., & Vapnik, V. N. (1999). Model complexity control for regression using VC generalization bounds. IEEE Transactions on Neural Networks, 10(5), 1075–1089.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377–403.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Cucker, F., & Smale, S. (2002). On the mathematical foundation of learning. Bulletin of the American Mathematical Society, 39(1), 1–49.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: Society for Industrial and Applied Mathematics.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer.
Donoho, D. L. (1995). De-noising by soft thresholding. IEEE Transactions on Information Theory, 41(3), 613–627.
Donoho, D. L., & Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81, 425–455.
Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39, 783–791.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Gu, C., Heckman, N., & Wahba, G. (1992). A note on generalized cross-validation with replicates. Statistics and Probability Letters, 14, 283–287.
Henkel, R. E. (1979). Tests of significance. Beverly Hills, CA: Sage.
Heskes, T. (1998). Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10(6), 1425–1433.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3), 55–67.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Konishi, S., & Kitagawa, G. (1996). Generalized information criteria in model selection. Biometrika, 83, 875–890.
Lehmann, E. L. (1983). Theory of point estimation. New York: Wiley.
Li, K. (1986). Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. Annals of Statistics, 14(3), 1101–1112.
Linhart, H. (1988). A test whether two AIC's differ significantly. South African Statistical Journal, 22, 153–161.
Luntz, A., & Brailovsky, V. (1969). On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3. (In Russian)
Mallows, C. L. (1964, May). Choosing variables in a linear regression. Paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, KS.
Mallows, C. L. (1973). Some comments on C_P. Technometrics, 15(4), 661–675.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12, 181–201.
Müller, K.-R., Smola, A. J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1998). Using support vector machines for time series prediction. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 243–254). Cambridge, MA: MIT Press.
Murata, N. (1998). Bias of estimators and regularization terms. In Proceedings of the 1998 Workshop on Information-Based Induction Sciences (IBIS'98) (pp. 87–94). Izu, Japan.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872.
Orr, M. J. L. (1996). Introduction to radial basis function networks (Tech. Rep.). Edinburgh: Center for Cognitive Science, University of Edinburgh. Available on-line: http://www.anc.ed.ac.uk/~mjo/papers/intro.ps.gz.
Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Available on-line: http://www.cs.toronto.edu/~delve/.
Saitoh, S. (1988). Theory of reproducing kernels and its applications. White Plains, NY: Longman.
Saitoh, S. (1997). Integral transforms, reproducing kernels and their applications. White Plains, NY: Longman.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A. J., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12(5), 1207–1245.
Shimodaira, H. (1997). Assessing the error probability of the model selection test. Annals of the Institute of Statistical Mathematics, 49(3), 395–410.
Shimodaira, H. (1998). An application of multiple comparison techniques to model selection. Annals of the Institute of Statistical Mathematics, 50(1), 1–13.
Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11(4), 637–649.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, 1 (pp. 197–206). Berkeley, CA: University of California Press.
Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics: Theory and Methods, 7(1), 13–26.
Sugiyama, M., Kawanabe, M., & Müller, K.-R. (2003). Trading variance reduction with unbiasedness—The regularized subspace information criterion for robust model selection in kernel regression (Tech. Rep. No. TR03-0003). Tokyo: Department of Computer Science, Tokyo Institute of Technology. Available on-line: http://www.cs.titech.ac.jp/.
Sugiyama, M., & Müller, K.-R. (2002). The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, 3, 323–359.
Sugiyama, M., & Ogawa, H. (2001). Subspace information criterion for model selection. Neural Computation, 13(8), 1863–1889.
Sugiyama, M., & Ogawa, H. (2002). Optimal design of regularization term and regularization parameter by subspace information criterion. Neural Networks, 15(3), 349–361.
Takeuchi, K. (1976). Distribution of information statistics and validity criteria of models. Mathematical Science, 153, 12–18. (In Japanese)
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Tsuda, K., Sugiyama, M., & Müller, K.-R. (2002). Subspace information criterion for non-quadratic regularizers—Model selection for sparse regressors. IEEE Transactions on Neural Networks, 13(1), 70–80.
Vapnik, V. N. (1982). Estimation of dependencies based on empirical data. New York: Springer-Verlag.
Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics, 13(4), 1378–1402.
Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.
Williams, C. K. I. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning in graphical models (pp. 599–621). Cambridge, MA: MIT Press.
Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, 8 (pp. 514–520). Cambridge, MA: MIT Press.

Received March 11, 2003; accepted November 6, 2003.
LETTER

Communicated by Peter Latham

Nonlinear Population Codes

Maoz Shamir
maoz@fiz.huji.ac.il
Haim Sompolinsky
haim@fiz.huji.ac.il
Racah Institute of Physics and Center for Neural Computation, Hebrew University of Jerusalem, Jerusalem 91904, Israel
Theoretical and experimental studies of distributed neuronal representations of sensory and behavioral variables usually assume that the tuning of the mean firing rates is the main source of information. However, recent theoretical studies have investigated the effect of cross-correlations in the trial-to-trial fluctuations of the neuronal responses on the accuracy of the representation. Assuming that only the first-order statistics of the neuronal responses are tuned to the stimulus, these studies have shown that in the presence of correlations similar to those observed experimentally in cortical ensembles of neurons, the amount of information in the population is limited, yielding nonzero error levels even in the limit of infinitely large populations of neurons. In this letter, we study correlated neuronal populations whose higher-order statistics, and in particular response variances, are also modulated by the stimulus. We ask two questions: Does the correlated noise limit the accuracy of the neuronal representation of the stimulus? And how can a biological mechanism extract most of the information embedded in the higher-order statistics of the neuronal responses? Specifically, we address these questions in the context of a population of neurons coding an angular variable. We show that the information embedded in the variances grows linearly with the population size despite the presence of strong correlated noise. This information cannot be extracted by linear readout schemes, including the linear population vector. Instead, we propose a bilinear readout scheme that involves spatial decorrelation, quadratic nonlinearity, and population vector summation. We show that this nonlinear population vector scheme yields accurate estimates of stimulus parameters, with an efficiency that grows linearly with the population size. This code can be implemented using biologically plausible neurons.
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1105–1136 (2004)

1 Introduction

In many areas of the brain, information on certain stimulus features is coded in a distributed manner—by a population of neurons (e.g., Georgopoulos, Schwartz, & Kettner, 1986; Lee, Rohrer, & Sparks, 1988; Wilson & McNaughton, 1993; Fitzpatrick, Batra, Stanford, & Kuwada, 1997; Young & Yamane, 1992). It is generally accepted that the average response of each neuron is tuned to the stimulus; hence, the average firing rate codes for the stimulus. The firing rates of each neuron fluctuate from trial to trial about their average value. If the trial-to-trial fluctuations of different neurons are uncorrelated, then by pooling information from large populations of neurons, the brain can overcome the noise inherent in the single-neuron responses. However, experimental findings suggest that correlations between the fluctuations of different neurons are considerable (Zohary, Shadlen, & Newsome, 1994; Lee, Port, Kruse, & Georgopoulos, 1998). Theoretical studies that investigated the effect of correlations on population coding yielded conflicting results (e.g., see Zohary et al., 1994; Abbott & Dayan, 1999). Recently, Sompolinsky, Yoon, Kang, and Shamir (2001) studied the effect of correlations on the accuracy of population coding. They showed that long-range positive correlations that vary smoothly with the functional distance between the neurons lead to saturation of the accuracy by which the stimulus parameters can be extracted to a finite value, even in the limit of infinitely large ensembles. This work, however, was limited to stimulus-independent covariance matrices.

Experimental findings show that information exists not only in the mean firing rates but also in higher-order statistics of the neuronal responses. The approximate linear relationship between the average firing rate and its variance reported in many brain areas (e.g., see Vogels, Spileers, & Orban, 1989; Britten, Shadlen, Newsome, & Movshon, 1992) implies that the variance is also tuned to the stimulus. Maynard et al. (1999) showed a (weak) dependence of the cross-correlations of different neurons on the stimulus. Here we address two questions.
Is information coded in the higher-order statistics of the neuronal responses bounded by the correlated noise in the system? And how can such a code be read by a biologically plausible mechanism? For concreteness, we discuss the coding for an angle, such as the direction of arm movement in M1 during a simple unimanual movement task or the coding for the direction of movement of visual stimuli in area V5.

The outline of this letter is as follows. In section 1, the problem of coding information in a correlated population of neurons is introduced. First, the stochastic model of the neuronal population that codes for the angle is defined. Then the Fisher information of the system is calculated, and the effect of the correlated noise on linear readout mechanisms is discussed. The saturation of information coded in the first-order statistics of the neuronal behavior, on the one hand, and the inability of linear readout schemes to extract information coded in the higher-order statistics, on the other hand, encourage the study of nonlinear readout schemes in section 2. Motivated by the results, we suggest a nonlinear readout mechanism and study its efficiency. We then discuss the statistical properties of the neurons in the readout layer. A combination of linear and nonlinear readout schemes is
presented and is shown to have superior performance. In section 3, we summarize the results of our analysis, discuss the effects of different variations of this model, and compare our model to other existing models of nonlinear population readout. Preliminary results have been published in Shamir and Sompolinsky (2001a, 2001b).

1.1 The Statistical Model of the Input Layer. We consider a population of N neurons that code for an angle θ and denote by r_i the response of the ith neuron (i ∈ {1, 2, …, N}) to the stimulus θ. We shall refer to this population of neurons hereafter as the input layer. Assuming the firing rates of the neurons are sufficiently high, we model the probability distribution of the neuronal activities according to a multivariate gaussian distribution,

    P(\{r_i\} \mid \theta) = \frac{1}{\sqrt{(2\pi)^N \det(C)}} \exp\left( -\frac{1}{2} \sum_{ij} C^{-1}_{ij} (r_i - f_i(\theta))(r_j - f_j(\theta)) \right),   (1.1)

where f_i(θ) is the average firing rate of neuron i, and C_{ij}(θ) is the covariance of the responses of neurons i and j for a given stimulus θ. The average firing rate of each neuron is modeled by a smooth stereotypical tuning curve with a single peak at the neuron's preferred direction (PD), denoted here by φ_i, as shown in Figure 1a for a neuron with a PD of 0 degrees. The tuning curve of a neuron with a PD of φ is obtained by a horizontal translation of this tuning curve by φ. In our numerical simulations, we use the following tuning curve:

    f_i(\theta) = f(\theta - \phi_i)   (1.2)

    f(\theta) = (f_{\max} - f_{\mathrm{ref}}) \exp\left( \frac{\cos(\theta) - 1}{\sigma_f^2} \right) + f_{\mathrm{ref}},   (1.3)

where σ_f is the tuning width, f_max is the peak response at the neuron's PD, and f_ref determines the modulation amplitude of the tuning curve. We assume that the PDs of the neurons are evenly spaced on the ring: φ_k = −π(N − 1)/N + 2πk/N.

The covariance matrix, C, on the ring is, in general, a function of three angles: C_{ij}(θ) = C(φ_i, φ_j, θ). Assuming isotropy of the system implies that there is no preferred spatial direction. Hence, C should obey C(φ_i, φ_j, θ) = C(φ_i − ψ, φ_j − ψ, θ − ψ) for arbitrary ψ. Thus, C is a function of only two angles:

    C_{ij}(\theta) = C(\phi_i - \theta, \phi_j - \theta).   (1.4)
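The tuning-curve model of equations 1.2 and 1.3 is easy to instantiate. The sketch below (our own code, with parameter values taken from the caption of Figure 1) builds the mean rates f_i(θ) on the ring of preferred directions:

```python
import numpy as np

N = 100
# preferred directions, evenly spaced on the ring
phi = -np.pi * (N - 1) / N + 2 * np.pi * np.arange(N) / N
f_max, f_ref, sigma_f = 80.0, 40.0, np.sqrt(2.0)   # rates in spikes/sec

def f(theta):
    """Stereotypical tuning curve, equation 1.3."""
    return (f_max - f_ref) * np.exp((np.cos(theta) - 1) / sigma_f**2) + f_ref

def mean_rates(theta):
    """f_i(theta) = f(theta - phi_i), equation 1.2."""
    return f(theta - phi)

rates = mean_rates(0.0)
# the peak response is close to f_max at the PD nearest the stimulus;
# far from the PD the rate decays toward (f_max - f_ref)/e + f_ref
print(rates.max(), rates.min())
```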
Figure 1: The statistics of neurons in the input layer in model 1. (a) The tuning curve of a neuron i with a PD of φ_i = 0 degrees. The average firing rate of the neuron, f_i(θ), is plotted as a function of the stimulus, θ. The tuning curve of a neuron with a PD of φ can be obtained by a horizontal shift of the above curve by φ. (b) The tuning of the variance. The variance, b_i(θ), of neuron i with a PD of φ_i = 0 degrees is plotted as a function of the stimulus θ. (c) The decay of the covariance, C_ij, between neuron i with a PD of 0 degrees and neuron j with a PD of φ_j, plotted as a function of φ_j. Note that in model 1, C_ij, for i ≠ j, is independent of θ. The above figures present the rate statistics, which are equivalent to the spike count statistics in a 1 sec time interval. In order to obtain the spike count statistics in different time intervals, these tuning curves should be scaled by the relevant time window. Unless stated otherwise, the parameters that were used for the numerical simulations of the input layer are: for the tuning curve, f_max = 80 sec⁻¹, f_ref = 40 sec⁻¹, σ_f = √2; for the variance, b_max = 80 sec⁻¹, b_ref = 40 sec⁻¹, σ_b = √(1/2); for the cross-correlations, c = 0.3, ρ = 1. Note that these parameters are given in rates, that is, in numbers per 1 sec. In all the simulations, we used a time window of T = 0.25 sec by which the averages and the correlations were scaled linearly.
A particularly simple example is a system in which only the variances of the neurons depend on the external stimulus, while the cross-correlations do not depend on the stimulus but depend on the angular coordinates of the neurons. We shall refer to this model hereafter as model 1. We assume that correlations are stronger for pairs of neurons with closer PDs, a property that has been observed experimentally in some systems (Zohary et al., 1994; Lee et al., 1998; van Kan, Scobey, & Gabor, 1985; Mastronarde, 1983). For concreteness, we have used the following form of C for model 1:

    C_{ij} = \delta_{ij} b_i(\theta) + c\, b_{\mathrm{ref}} (1 - \delta_{ij}) \exp\left( -\frac{|\phi_i - \phi_j|}{\rho} \right)   (model 1)   (1.5)

    b_i(\theta) = b(\phi_i - \theta) = (b_{\max} - b_{\mathrm{ref}}) \exp\left( \frac{\cos(\theta - \phi_i) - 1}{\sigma_b^2} \right) + b_{\mathrm{ref}},   (1.6)

where δ_ij is a Kronecker delta. Here the variance, b_i(θ), has a unimodal tuning with regard to θ with a peak at the neuron's PD, as shown in Figure 1b. The tuning of the variance is determined by three parameters: σ_b is the tuning width, b_max is the peak variance, and b_ref determines the modulation of the variance tuning curve. This stimulus dependence of the neuronal variances is qualitatively similar to the monotonic relationship between variance and mean observed in many neurons. The dependence of the cross-correlations on the distance between neurons (i.e., the difference in their PDs) is modeled as an exponential decay with correlation length ρ and correlation strength c. Here |x| denotes angular distance and ranges from 0 to π. Figure 1c shows the decay of the correlation between the neurons as a function of the distance between them. We assume that the correlations are long range, that is, ρ = O(1) (note that the maximum distance is π), and that ρ remains fixed when we change the size of the population.

For example, for model 1 with the parameters c = 0.3, ρ = 1, b_max = 80, b_ref = 40, and σ_b = √(1/2) for a time interval of T = 1 sec (see Figure 1), the average correlation coefficient (CC) across pairs of neurons with PD differences of less than 90 degrees is 0.12, while for pairs of neurons with PD differences of more than 90 degrees, the average CC is 0.025. For comparison, Zohary et al. (1994) report an average CC of 0.18 for neurons with close PDs and 0.04 for neurons with PD differences of more than 90 degrees. Although these correlation values may seem weak, their effect on the population code is dramatic, as shown below.

Another simple example is the multiplicative model (Abbott & Dayan, 1999), which we shall refer to as model 2. Here, the covariance matrix has the form

    C_{ij}(\theta) = \sqrt{b_i(\theta) b_j(\theta)} \left( \delta_{ij} + c (1 - \delta_{ij}) \exp\left( -\frac{|\phi_i - \phi_j|}{\rho} \right) \right)   (model 2)   (1.7)

The neuronal variances are given by b_i(θ), as before. The cross-correlations consist of a stimulus-dependent multiplicative component, \sqrt{b_i(\theta) b_j(\theta)}, multiplied by the distance-dependent function. The two models for the correlation matrix are introduced as two examples that will yield qualitatively similar results (see below). As will be clear from the analysis below, the essential features of the correlation matrices are the long range of the cross-correlations and the discontinuity of the correlation matrix on the diagonal resulting from the variance term.

1.2 Estimation Errors and Fisher Information. Throughout this letter, we will evaluate the efficiency of an estimator θ̂ of the stimulus angle θ by the inverse of the mean square of its estimation error, ⟨(δθ̂)²⟩⁻¹, where δθ̂ = θ̂ − ⟨θ̂⟩ and ⟨…⟩ denotes averaging with regard to the distribution of the neuronal responses for a given stimulus θ. The Fisher information (FI) (Thomas & Cover, 1991; Kay, 1993),

    J = \left\langle \left( \frac{\partial \log P(r \mid \theta)}{\partial \theta} \right)^2 \right\rangle,   (1.8)

provides an upper bound on the efficiency of any unbiased estimator of θ. The FI of the gaussian ensemble, equation 1.1, can be written as a sum of the FI of the mean responses, denoted as J_mean, and that of the tuning of the covariance matrix, J_cov (see, e.g., Kay, 1993):

    J = J_{\mathrm{mean}} + J_{\mathrm{cov}}   (1.9)

    J_{\mathrm{mean}} = f'^\top C^{-1} f'   (1.10)

    J_{\mathrm{cov}} = \frac{1}{2} \mathrm{Trace}\{ C' C^{-1} C' C^{-1} \}.   (1.11)

Here f' and C' are derivatives with regard to θ of the mean response vector and the covariance matrix, respectively. In the case of an uncorrelated population, the FI is given by J = J_mean + J_var, where

    J_{\mathrm{mean}} = N \int_{-\pi}^{\pi} \frac{d\phi}{2\pi} \frac{f'^2(\phi)}{b(\phi)}   (1.12)

and

    J_{\mathrm{var}} = N \int_{-\pi}^{\pi} \frac{d\phi}{2\pi} \frac{b'^2(\phi)}{2 b^2(\phi)}.   (1.13)
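Equations 1.9 to 1.11 can be evaluated numerically for model 1. The sketch below (our own code; covariance and tuning parameters follow equations 1.5 and 1.6 and the caption of Figure 1, with derivatives taken by central differences) reproduces the behavior described in the text: J_cov grows roughly linearly with N, while J_mean grows much more slowly.

```python
import numpy as np

f_max, f_ref, sf2 = 80.0, 40.0, 2.0     # mean tuning, eqs. 1.2-1.3
b_max, b_ref, sb2 = 80.0, 40.0, 0.5     # variance tuning, eq. 1.6
c, rho = 0.3, 1.0                        # cross-correlations, eq. 1.5

def fisher_terms(N, theta=0.0, h=1e-5):
    phi = -np.pi * (N - 1) / N + 2 * np.pi * np.arange(N) / N

    def f(th):                           # mean rates
        return (f_max - f_ref) * np.exp((np.cos(th - phi) - 1) / sf2) + f_ref

    def C(th):                           # model 1 covariance matrix
        b = (b_max - b_ref) * np.exp((np.cos(th - phi) - 1) / sb2) + b_ref
        d = np.abs(phi[:, None] - phi[None, :])
        d = np.minimum(d, 2 * np.pi - d)         # angular distance in [0, pi]
        Cm = c * b_ref * np.exp(-d / rho)
        np.fill_diagonal(Cm, b)
        return Cm

    fp = (f(theta + h) - f(theta - h)) / (2 * h)     # f'
    Cp = (C(theta + h) - C(theta - h)) / (2 * h)     # C'
    Cinv = np.linalg.inv(C(theta))
    J_mean = fp @ Cinv @ fp                          # eq. 1.10
    J_cov = 0.5 * np.trace(Cp @ Cinv @ Cp @ Cinv)    # eq. 1.11
    return J_mean, J_cov

for N in (100, 400, 1600):
    print(N, fisher_terms(N))
```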
Here and throughout the letter, we assume that the system size, N, is large; hence, spatial summations can be converted to integrals over angles.

Figure 2: FI of a correlated population of neurons. The FI, J, of the input layer (model 1) is plotted as a function of the number of cells, N, in the pool (solid line). The contributions of the terms J_mean and J_cov are plotted by the dash-dot and dotted lines, respectively. Notice that J_mean saturates for large pools, while J_cov grows linearly with the size of the pool, N. The Fisher information J, J_mean, and J_cov was calculated using equations 1.9–1.11 and the parameters appearing in Figure 1 for model 1. Qualitatively similar results are obtained in model 2 as well.

Figure 2 shows the values of J, J_mean, and J_cov for model 1 as a function of N. Note that while the FI of the covariance matrix, J_cov, grows linearly with the population size, N, the FI of the average firing rates, J_mean, saturates to a finite value, in contrast to the FI of an uncorrelated population, equation 1.12. The saturation of J_mean in the large-N limit is not specific to this model but is characteristic of systems with long-range positive correlations, as previously shown (Sompolinsky et al., 2001). Thus, in the present system, most of the information in the neuronal responses comes from the stimulus dependence of the covariance matrix.

In this letter, we study readout models for correlated populations. As we will show below, a linear readout is incapable of extracting the information embedded in the second-order statistics; its performance is bounded by J_mean. This raises the question of whether there is a nonlinear readout scheme whose performance is close to the total J, and if so, how complex such a readout needs to be. Here we present a relatively simple bilinear readout that extracts a large fraction of the information embedded in the covariance matrix, providing a plausible model for the readout of population codes with stimulus-dependent correlations.
1.3 Linear and Recurrent Network Readouts. A linear readout is an estimator of the form \sum_i r_i w_i, where {w_i} is a set of readout weights. Here we shall consider the linear readout in the context of two tasks: estimation and discrimination of an angular variable.

1.3.1 Estimation. A well-known example of a linear estimator of θ is the linear population vector (LPV), first proposed by Georgopoulos et al. (1986) for the readout of arm-reaching-movement directions in two dimensions from the responses of motor cortical cells. The LPV can be written as

    \hat{z} = \sum_{i=1}^N r_i e^{i \phi_i},   (1.14)

where we have used a complex representation of two-dimensional unit vectors. In this notation, ẑ is an estimator of the unit vector z = e^{iθ}, which represents a movement in the direction of the angle θ. Using the rotational symmetry of the system, one can show that the estimator ẑ/|⟨ẑ⟩| is an unbiased estimator of z = e^{iθ} and that it is the optimal linear readout in terms of squared error averaged over all angles θ, with a fixed set of weights, w (see appendix A.1). The efficiency of any estimator that respects the symmetry of the system (i.e., its parameters are periodic functions of φ_i) is independent of θ. Assuming that the fluctuations in the estimation of θ are small, one can write the average estimation error of such an estimator as a signal-to-noise relation for θ = 0 degrees (see appendix A.1):

    \langle (\delta\hat{\theta})^2 \rangle^{-1} = \left( \frac{\mathrm{signal}}{\mathrm{noise}} \right)^2   (1.15)

    \mathrm{signal} = \langle \hat{x} \rangle   (1.16)

    \mathrm{noise}^2 = \langle (\delta\hat{y})^2 \rangle,   (1.17)
where we have used the notation $\hat{z} = \hat{x} + i\hat{y}$. For the LPV, these signal and noise terms are

$$\langle\hat{x}\rangle = \sum_{i=1}^{N} f(\phi_i)\cos(\phi_i), \qquad (1.18)$$

$$\langle(\delta\hat{y})^2\rangle = \sum_{i,j=1}^{N} C_{ij}(\theta)\sin(\phi_i)\sin(\phi_j), \qquad (1.19)$$
where $C_{ij}(\theta)$ is the covariance matrix given by equations 1.5 and 1.6 for model 1. In the case of an uncorrelated population of neurons, $C(\phi_i,\phi_j) = b(\phi_i)\,\delta_{ij}$; hence, both equations 1.16 and 1.17 grow linearly with N, yielding
Nonlinear Population Codes
(see also Seung & Sompolinsky, 1993)

$$\langle(\delta\hat{\theta})^2\rangle^{-1} = N\,\frac{2 f_1^2}{b_0 - b_2}, \qquad (1.20)$$
where $f_n = \int \frac{d\phi}{2\pi}\, f(\phi)\, e^{in\phi}$ and similarly $b_n$. This result should be compared to the upper bound provided by $J = J_{mean} + J_{cov}$ of equations 1.12 and 1.13. Note that the performance of the LPV is bounded by Jmean (see section 1.3.2), reflecting the fact that the linear readout extracts information only from the mean responses. In fact, the slope of the LPV performance versus N, equation 1.20, is smaller than that of Jmean, equation 1.12: because the LPV weights are smooth, the readout is sensitive only to the low Fourier components of f and b (the first Fourier component of f, and the zeroth and second Fourier components of b), as can be seen from equation 1.20. In the uncorrelated case, the efficiency of the LPV grows linearly with N, although with a suboptimal slope. This is not the case in the presence of correlations. Figure 3 shows the effect of the correlations on the accuracy of the LPV readout (open circles) for model 1 for several values of the parameter c, equation 1.5. As can be seen from the figure, for c = 0 the efficiency of the LPV grows linearly with N. However, for c > 0, there exists some Nsat such that for N ≳ Nsat the efficiency saturates to a size-independent limit. Note that because the tuning curves of the mean rates for the chosen parameters are broad, they contain little power in high Fourier modes. Consequently, the LPV performs close to the bound of Jmean.

1.3.2 Discrimination. Another interesting readout problem is that of discriminating between two close angles. We model a two-interval discrimination task by a system that is given two sets of neuronal activities, $r^{(1)}$ and $r^{(2)}$, generated by two proximal stimuli θ and θ + δθ. From these responses, the system must infer which stimulus generated which activity.
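Both tasks operate on the same kind of simulated responses. As a concrete illustration of the estimation setting above, the following sketch draws model-1-style correlated gaussian responses and applies the LPV of equation 1.14. The tuning curves and parameters here are illustrative assumptions, not the paper's exact Figure 1 values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, rho, theta = 500, 0.3, 1.0, 0.3
phi = np.linspace(-np.pi, np.pi, N, endpoint=False)      # preferred angles

f = 10 + 40 * np.exp((np.cos(phi - theta) - 1) / 0.5)    # assumed mean tuning
b = f.copy()                                             # variance tuning b = f
b_ref = b.min()

# model-1-style covariance: variances b on the diagonal, smooth
# stimulus-independent cross-correlations off the diagonal
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)
C = c * b_ref * np.exp(-d / rho)
np.fill_diagonal(C, b)

L = np.linalg.cholesky(C + 1e-9 * np.eye(N))
r = f + rng.standard_normal((200, N)) @ L.T              # 200 correlated trials

est = np.angle(r @ np.exp(1j * phi))                     # LPV estimate, eq. 1.14
print("LPV rms error (rad):", np.sqrt(np.mean((est - theta) ** 2)))
```

For small c, enlarging N in this sketch shrinks the rms error, while for c > 0 the improvement levels off, qualitatively matching the saturation shown in Figure 3.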
For cases where the FI is large, the maximum likelihood (ML) discrimination for two close angles yields a probability of error given by $H(d'/\sqrt{2})$, where $H(x) = (2\pi)^{-1/2}\int_x^{\infty} dt\, e^{-t^2/2}$, and the discriminability $d'$ equals

$$d' = |\delta\theta|\,\sqrt{J(\theta)}, \qquad (1.21)$$
where $J(\theta)$ is the FI of the system (Seung & Sompolinsky, 1993). A linear readout can be used to perform the discrimination task. We define the discrimination rule according to the sign of $q = w\,(r^{(1)} - r^{(2)})$, where a plus sign is interpreted as θ + δθ having been presented first and a minus sign as θ having been presented first. The set of weights w that optimizes the discrimination for close angles, δθ ≪ 1, is given by $w = C^{-1} f'$. Note that these weights depend on the angle θ. The probability of error of the optimal
Figure 3: The effect of correlations on the efficiency of the LPV and RN estimators. The efficiency of the readouts is shown in terms of the reciprocal squared estimation error. The efficiency of the LPV (circles) and the RN (boxes) is plotted as a function of the number of cells in the pool, for different values of correlation strength: from the bottom, c = 0.3, c = 0.03, and c = 0; all other parameters are as defined in Figure 1. The efficiency was calculated numerically by simulating the neuronal responses and averaging the LPV/RN squared estimation errors over 4000 trials, for model 1. The contribution of the average firing rates to the FI of the system, Jmean, is given by the solid lines, from bottom to top for c = 0.3, c = 0.03, and c = 0. Jmean was calculated using equation 1.10.
linear discriminator is given by $H(d'_L/\sqrt{2})$, where (see section A.2)

$$d'_L = |\delta\theta|\,\sqrt{J_{mean}(\theta)}, \qquad (1.22)$$
which is inferior to the ML performance, equation 1.21. Note that the linear discriminator is the optimal local linear estimator (defined by a linear readout with weights that depend on the neighborhood of angles to be estimated). Thus, Jmean is an upper bound on the performance of any linear estimator of θ.

1.3.3 Recurrent Networks. Recently, a recurrent network (RN) model has been proposed (Deneve, Latham, & Pouget, 1999) as an improvement on the performance of the LPV in estimation tasks. Essentially, the RN readout is superior to the LPV in that it is sensitive to the finer structure of the neuronal tuning curves. The efficiency of the RN readout is presented in
Figure 3 (open boxes) for different values of c, as a function of N. In this example, the RN performance is only slightly better than that of the LPV (circles). This is because, for smooth tuning curves, the LPV already captures almost all the information in the mean responses. The numerical results agree with the claim (Deneve et al., 1999) that the efficiency of the optimal RN readout in the limit of small estimation errors approaches $1/\langle(\delta\hat{\theta}_{RN})^2\rangle = J_{mean}$. Thus, although the performance of the RN readout is in general superior to that of the LPV, it too saturates to a size-independent limit in the presence of long-range positive correlations. In summary, both the LPV and the RN schemes read out only the information that is embedded in the first-order statistics of the data, that is, the tuning curves of the neurons, and hence both are bounded by Jmean.

2 The Bilinear Readout

2.1 Covariance Fisher Information. In order to understand how the information in the second-order statistics can be extracted, it is important to gain insight into the properties of the FI, equations 1.9–1.11. As observed above, the first-order statistics contain only a finite amount of information. The reason for the saturation of Jmean is that the tuning curves for the average firing rates are relatively smooth functions of the stimulus angle; hence, most of the embedded information resides in the low Fourier modes of the network responses. However, since the cross-correlations vary smoothly with angular distance, these slowly varying modes contain noise whose standard deviation grows linearly with N, yielding a signal-to-noise ratio of order 1. This raises the question of why the second-order statistics of the same smooth system yield information that grows linearly with N. Indeed, our study shows that most of the information in Jcov resides in the stimulus dependence of the variances, not the cross-correlations.
More precisely, the covariance matrix of a system with smoothly varying cross-correlations (such as equations 1.5 and 1.7) has a discontinuity on the diagonal, resulting from the fact that the variance, $b_i(\theta)$, is larger than the zero-distance limit of the cross-correlations, which is $c\,b_{ref}$ in model 1 (see Figure 1) and $c\,b_i(\theta)$ in model 2. It is useful to write the matrix C as

$$C_{ij} = \delta_{ij}\, D_i(\theta) + S_{ij}, \qquad (2.1)$$
with S the continuous (smooth) part of C and D the discontinuous part of C on the diagonal. In the models described above, we obtain

$$D_i(\theta) = b(\phi_i - \theta) - c\,b_{ref}, \qquad (2.2)$$

$$S_{ij} = c\,b_{ref}\,\exp\!\left(-\frac{|\phi_i - \phi_j|}{\rho}\right), \qquad (2.3)$$
Figure 4: The difference between the contribution of the correlations to the total FI of the system, Jcov, and the FI of an uncorrelated gaussian population of neurons with zero means and variances D, JD, is shown as a function of the size of the system, N, for model 1 (left) and for model 2 (right). Note that the difference |Jcov − JD| is a sublinear function of N. For model 2, the tuning curves of the mean and variance are as defined in Figure 1, with ρ = 1 and c = 0.3 (see equation 1.7).
for model 1, and

$$D_i(\theta) = b(\phi_i - \theta)(1 - c), \qquad (2.4)$$

$$S_{ij} = c\,\sqrt{b_i(\theta)\, b_j(\theta)}\,\exp\!\left(-\frac{|\phi_i - \phi_j|}{\rho}\right), \qquad (2.5)$$
for model 2. As will be explained below, the discontinuity in the covariance matrix along the diagonal plays an important role in the system's behavior. In fact, for both models, we find numerically that for large N, Jcov can be approximated by
$$J_{cov} \approx J_D \equiv \frac{1}{2}\sum_{i=1}^{N}\left(\frac{D_i'(\theta)}{D_i(\theta)}\right)^2 \approx N \int \frac{d\phi}{2\pi}\,\frac{D'^2(\phi)}{2 D^2(\phi)}. \qquad (2.6)$$
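The approximation in equation 2.6 can be checked directly: compute the covariance Fisher information as $\frac{1}{2}\mathrm{Tr}[(C^{-1}C')^2]$, the standard gaussian FI for a stimulus-dependent covariance, and compare it with the diagonal-only sum $J_D$. The tuning shapes and parameters below are illustrative assumptions, not the paper's exact ones.

```python
import numpy as np

N, c, rho, theta, eps = 400, 0.3, 1.0, 0.0, 1e-4
phi = np.linspace(-np.pi, np.pi, N, endpoint=False)

def b(x):                                    # assumed variance tuning
    return 10 + 40 * np.exp((np.cos(x) - 1) / 0.5)

b_ref = b(np.pi)                             # baseline variance far from the peak
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)
S = c * b_ref * np.exp(-d / rho)             # smooth part, eq. 2.3

def C(t):                                    # C = S + diag(D), eqs. 2.1, 2.2
    return S + np.diag(b(phi - t) - c * b_ref)

Cp = (C(theta + eps) - C(theta - eps)) / (2 * eps)   # dC/dtheta
A = np.linalg.solve(C(theta), Cp)
J_cov = 0.5 * np.trace(A @ A)                # gaussian FI of the covariance

D = b(phi - theta) - c * b_ref
Dp = (b(phi - theta - eps) - b(phi - theta + eps)) / (2 * eps)  # dD/dtheta
J_D = 0.5 * np.sum((Dp / D) ** 2)            # first form of eq. 2.6
print("J_cov =", J_cov, " J_D =", J_D)
```

With these assumed parameters the two quantities come out close, in line with the claim that |Jcov − JD| is sublinear in N.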
Equation 2.6 is a central result of this article. Figure 4 shows the difference between Jcov and JD as a function of N for models 1 (left) and 2 (right). In both cases, as can be seen from the figure, the difference |Jcov − JD| is a sublinear function of N for large N. From equation 2.6, however, JD scales linearly with N; hence, to leading order in N, one obtains the result of equation 2.6. In section 2.4, we present an analytical calculation that bounds Jcov from below by JD to leading order in N. This result, equation 2.6, suggests that the reason the second-order statistics
of the correlated system contains an extensive amount of information is the discontinuous nature of the variance. Because of this discontinuity, the information in the variance resides in all modes, including those with high Fourier numbers, and is therefore immune to the correlated noise, which resides mostly in the slowly varying modes. Comparing equation 2.6 with equation 1.13 indicates that the information in the second-order statistics of the correlated system is equivalent to that of an uncorrelated population of neurons with zero mean tuning but with stimulus-dependent variances equal to $D_i(\theta)$. This important insight motivates our proposal for a nonlinear readout in the next section.

2.2 The Bilinear Readout Mechanism. Based on the above results, we propose a two-layer readout network (see Figure 5) for extracting information from the second-order statistics. The first layer of the readout system receives as input the activities of the neurons in the input layer, $\{r_i\}$, and performs a linear filtering of the inputs followed by a quadratic nonlinearity. As for the input-layer neurons, the output response of a neuron in the first readout layer is a real scalar value. Denoting the output of the ith neuron in this layer by $R_i$, we define $R_i$ according to the deterministic mapping

$$R_i = \left(\sum_{j=1}^{N} W_{ij}\, r_j\right)^2, \qquad (2.7)$$
where $\{W_{ij}\}$ is a set of linear filtering weights between presynaptic neuron j in the input layer and postsynaptic neuron i in the first processing layer. The form of the N × N filter W will be discussed below. The output layer of the readout mechanism receives, via feedforward connections, the responses of the neurons in the first layer (see Figure 5) and calculates their population vector, termed hereafter the nonlinear population vector (NLPV), $\hat{Z} = \hat{X} + i\hat{Y}$:

$$\hat{Z} = \sum_{i=1}^{N} R_i e^{i\phi_i}. \qquad (2.8)$$
Note that we associate with the ith neuron of the first layer an angle, $\phi_i$, which is the same as the PD of the ith input neuron. The rationale for this will become evident when we specify W below. Using the symmetry of the correlation matrix $C_{ij}(\theta)$, equation 1.4, and constraining the linear filter W to be of the form $W_{ij} = W(\phi_i - \phi_j)$, one obtains

$$\langle\hat{Z}\rangle = e^{i\theta}\left(\sum_{ijk}\bigl(C(\phi_j,\phi_k) + f(\phi_j)\, f(\phi_k)\bigr)\, W(\phi_j - \phi_i)\, W(\phi_k - \phi_i)\, e^{i\phi_i}\right). \qquad (2.9)$$
Figure 5: The architecture of the NLPV readout model. The input to the first layer of the readout mechanism is a linear combination of the activities of neurons in the input layer. The first layer responds with a nonlinear transfer function, and its output is transmitted in a feedforward manner to the second layer, which calculates its population vector. At the bottom left is a plot of the NLPV weight matrix, $W_{ij}$, as a function of $(\phi_i - \phi_j)$, for N = 300 and p = 6.
Hence, the estimator $\hat{Z}/|\langle\hat{Z}\rangle|$ is an unbiased estimator of θ. As before, we calculate its variance for θ = 0 degrees by $\langle(\delta\hat{\theta})^2\rangle = \langle(\delta\hat{Y})^2\rangle / \langle\hat{X}\rangle^2$ (see appendix A.1); the results are shown below.
2.3 The Efficiency of the Bilinear Readout. In order to complete the definition of the readout, one has to choose the form of the linear filter W. Motivated by the observation that most of the signal lies in the discontinuous part of the correlation matrix on the diagonal and that the noise resides mainly in the slowly varying modes of the network, that is, in the low-order
Fourier modes, we choose

$$W_{kj} = \delta_{kj} - \frac{1}{N}\sum_{n=-p}^{p} e^{in(\phi_k - \phi_j)} = \frac{1}{N}\sum_{|n|>p} e^{in(\phi_k - \phi_j)}. \qquad (2.10)$$
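The full mechanism of equations 2.7, 2.8, and 2.10 can be sketched in a few lines: build W as the identity minus the projector onto the low Fourier modes, square the filtered responses, and take a population vector. The demo stimulus and tuning curve below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 300, 6
phi = np.linspace(-np.pi, np.pi, N, endpoint=False)

# eq. 2.10: W = I - projector onto the Fourier modes |n| <= p
n = np.arange(-p, p + 1)
E = np.exp(1j * np.outer(phi, n)) / np.sqrt(N)
W = np.eye(N) - np.real(E @ E.conj().T)

def nlpv(r):
    R = (W @ r) ** 2                                 # eq. 2.7: filter, then square
    return np.angle(np.sum(R * np.exp(1j * phi)))    # eq. 2.8: population vector

# W annihilates a low mode (|n| = 2 <= p) and passes a high one (|n| = 20 > p)
low, high = np.cos(2 * phi), np.cos(20 * phi)
print(np.abs(W @ low).max(), np.abs(W @ high - high).max())

# single-trial demo: noise whose variance is tuned to the stimulus theta = 0.5
f = 10 + 40 * np.exp((np.cos(phi - 0.5) - 1) / 0.5)
r = f + np.sqrt(f) * rng.standard_normal(N)
print("NLPV estimate of theta = 0.5:", nlpv(r))
```

The first print confirms the high-pass property of W numerically; the second shows that the readout recovers the stimulus angle from the variance tuning alone in a single trial, up to trial-to-trial noise.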
With this choice, the matrix W simply filters out the first $(2p+1)$ low-order Fourier modes of the inputs, $\{r_i\}$. The parameter p determines how many low-order Fourier modes are filtered out by W. Substituting equations 2.8 and 2.10 into equation 1.16, the signal term reduces to

$$\langle\hat{X}\rangle = 2N\,\mathrm{Re}\left\{\sum_{n>p}^{N}\left(\tilde{C}_{n+1,n} + f_{n+1}\, f_n^{*}\right)\right\}, \qquad (2.11)$$
where $\tilde{C}_{m,n}$ is the double Fourier transform of the correlation matrix,

$$\tilde{C}_{m,n} = \frac{1}{N^2}\sum_{kj} e^{i m\phi_k - i n\phi_j}\, C_{kj}. \qquad (2.12)$$

In the large-N limit, the Fourier transform is given by (see equation 2.1)

$$\tilde{C}_{m,n} = \frac{1}{N}\, D_{m-n} + S_{m,n}, \qquad (2.13)$$

$$D_m = \int \frac{d\phi}{2\pi}\, D(\phi)\, e^{im\phi}, \qquad (2.14)$$

$$S_{m,n} = \int \frac{d\phi\, d\psi}{(2\pi)^2}\, S(\phi,\psi)\, e^{im\phi - in\psi}. \qquad (2.15)$$
Both S and f are relatively smooth functions of the angles; hence, $S_{m,n}$ and $f_n$ decay quickly as m and n increase. Therefore, for sufficiently large p, one can neglect the contribution of S and f to the signal, equation 2.11. Finally, noting that $\tilde{C}_{n+1,n} = \frac{1}{N} D_1 + S_{n+1,n}$, we obtain

$$\langle\hat{X}\rangle = \sum_{|n|>p}^{N} D_1 = [N - (2p+1)]\, D_1 \approx N \int \frac{d\phi}{2\pi}\, D(\phi)\, e^{i\phi}, \qquad (2.16)$$
where D is the discontinuous piece of the covariance matrix, equations 2.2 and 2.4. It is important to note the limit in which equation 2.16 is valid. First, we have assumed (in the last step) that p ≪ N, such that terms of order p are negligible compared to terms of order N. On the other hand, we have assumed that p is sufficiently large that the Fourier transforms of S and f can be neglected relative to the contribution of the terms containing D. As will be shown below, this latter condition is equivalent to the requirement
Figure 6: Nonlinear PV efficiency as a function of the size of the pool for different values of p. The efficiency of the NLPV readout, in terms of the reciprocal average squared estimation error, is shown by solid lines for p = 1, 3, 6, from bottom to top, for model 1 (left) and model 2 (right). The dashed line is the asymptotic value of the NLPV efficiency, equation 2.19.
N ≪ Nsat(p), where Nsat(p) is a strongly growing function of p. The form of this function depends on the asymptotic decay of the Fourier transform of S. Thus, the range of validity of equation 2.16 is p ≪ N ≪ Nsat(p). A similar but more tedious analysis of the noise term yields (see appendix A.3), for p ≪ N ≪ Nsat(p),

$$\langle(\delta\hat{Y})^2\rangle \approx \frac{N}{2}\sum_{m}\left(D_m D_{-m} - D_m D_{2-m}\right) = N \int \frac{d\phi}{2\pi}\, B(\phi)\,(1 - \cos 2\phi), \qquad (2.17)$$

where

$$B(\phi) = 2 D^2(\phi). \qquad (2.18)$$

Combining equations 2.16 and 2.17, we obtain

$$\langle(\delta\hat{\theta})^2\rangle^{-1} = N\,\frac{2|D_1|^2}{B_0 - B_2} \equiv N J_0, \qquad p \ll N \ll N_{sat}(p). \qquad (2.19)$$
This result shows that for sufficiently large p, such that N ≪ Nsat(p), the squared estimation error scales like 1/N. Figure 6 shows the efficiency of the NLPV readout as a function of the pooling size N, for models 1 (left) and 2 (right) and several values of p. The dashed line is the asymptotic limit of the performance of the NLPV, given by equation 2.19. As the results indicate, initially the estimation efficiency of the NLPV grows linearly with
Figure 7: Typical network size for saturation of the NLPV efficiency, Nsat, as a function of p, for model 1. The solid line is the exact expression, $N_{sat} = N I_D / I_S$ (equation A.22, calculated for N = 2000), presented as a function of p. The approximate result, equation A.19, is shown by the dashed line.
N. However, for any given p, there exists some scale of N, Nsat(p), above which the efficiency of the NLPV begins to saturate. Indeed, we show in the appendix that for any fixed p, if N is made large enough, the standard deviation of the noise, $\sqrt{\langle(\delta\hat{Y})^2\rangle}$, eventually grows like N, saturating the signal-to-noise ratio. Analyzing this saturation in detail for model 1, we derive the following expression for the saturation size:

$$N_{sat}(p) = \frac{(B_0 - B_2)\, p^3}{\frac{4}{3}\left(c\, b_{ref}\, \pi/\rho\right)^2 \left(1 - e^{-2\pi/\rho}\right)}, \quad \text{model 1}, \qquad (2.20)$$
where B(φ) is given in equation 2.18. The fact that the saturation value grows quickly with p implies that the bilinear readout yields an accurate estimate of θ even for moderate values of p, as shown in Figure 6. Figure 7 shows Nsat as a function of p in model 1. The exact expression for Nsat (equation A.23 in the appendix) is given by the solid line. The dashed line shows the approximate analytical calculation, equation 2.20. Finally, when N is large compared with Nsat, the saturated value of the efficiency of the NLPV is given by

$$\lim_{N\to\infty} \langle(\delta\hat{\theta})^2\rangle^{-1} = J_0\, N_{sat}, \qquad (2.21)$$
Figure 8: Typical result of simulating the activities of neurons in the model during a single trial of a presentation of θ = 0 degrees. (a) Activities of a population of N = 500 neurons in the input layer during a single trial of presentation of θ = 0 degrees, model 1. The neuronal activities are plotted as a function of their PDs by the thin line. The thick smooth line is the across-trial average activity of the different neurons. (b) The activities of neurons in layer 1 of the readout are shown as a function of their PDs by the thin line. The input to the system in this single trial is given in a. The average activity of the neurons is shown by the thick line. For this graph, p = 10 was used.
where $J_0$ is defined in equation 2.19. Thus, Nsat determines both the regime in which the efficiency of the NLPV readout is linear in N and the asymptotic efficiency of this readout. Qualitatively similar results are obtained for model 2.

2.4 Statistics of the Neuronal Responses in Layer 1 of the Readout. To obtain more insight into the performance of the NLPV, we compared the statistics of the responses of neurons in the input layer, $\{r_i\}$, with those in the first readout layer, $\{R_i\}$. An example of the two responses is displayed in Figure 8 for a network of 500 neurons during a presentation of stimulus θ = 0 degrees, as a function of their PDs. Figure 8a shows the activities in the input layer (thin line) together with their trial-averaged activities (thick line). The rapid jitter of the neuronal responses is caused by fluctuations in the high-order Fourier modes of the system. However, it is the fluctuations in the low-order Fourier modes that cause the population activity profile to shift left (as in the figure) or right of the average profile (i.e., the population tuning curve), and thus cause substantial errors in the estimation of θ. In the specific example shown here, the input-layer activities (thin line) look as though they were shifted from the average activity profile (thick
Figure 9: (a) Covariance matrix of the stochastic network. The covariance between the activities of neurons i and j is plotted on the vertical axis as a function of $\phi_i$ and $\phi_j$. The covariance is calculated with respect to the distribution of the $r_i$, given θ = 0 degrees, in model 1. Note that the covariance between the activities of different neurons (off-diagonal elements) is independent of θ and decays slowly with the functional distance between the neurons. The variance (diagonal elements) depends on θ and peaks at $\phi_i = \theta$. (b) Covariance matrix of neurons in layer 1 of the readout. The covariance is calculated with respect to the distribution of their inputs, $\{r_i\}$, given θ = 0 degrees, for p = 15. Note that the covariance between different neurons is negligible. The variance (diagonal elements of the matrix) is stimulus dependent and peaks at $\phi_i = \theta$.
line) by 30 degrees, causing an error of ≈ 30 degrees in the LPV estimation of θ. Figure 8b shows the activities of neurons in the first readout layer. The rapid jitter in the activities is obvious, but the long-range fluctuations observed in the input layer are absent, suggesting that the fluctuations in $R_i$ are largely independent. Indeed, using calculations similar to those mentioned in the previous section, we show (see appendix A.3) that in the limit p ≪ N ≪ Nsat(p), the statistics of the $R_i$ obey

$$\langle R_i \rangle = D_i, \qquad (2.22)$$

$$\langle \delta R_i\, \delta R_j \rangle = 2\,\delta_{ij}\,(D_i)^2. \qquad (2.23)$$
These analytical results are supported by the numerical results of Figure 9b, which shows the covariance matrix of neurons in layer 1, $C^R_{ij}(\theta) = \langle\delta R_i\,\delta R_j\rangle$, for θ = 0 degrees. As can be seen, the matrix $C^R_{ij}(\theta)$ is almost diagonal; the covariance between different neurons is negligible. In contrast, the covariance matrix of the input neurons, Figure 9a, contains a substantial smooth off-diagonal part that gives rise to the collective fluctuations that limit the accuracy of the LPV (see also Figure 1c). Figures 10a and 10b show the average and standard deviation of the $R_i$, respectively, as
Figure 10: Statistics of neurons in the first layer of the readout. (a) The population tuning profile. The average activity of neurons in the first layer is plotted as a function of their PDs (solid line). The population profile peaks at $\phi_i = \theta$. For comparison, $D_i(\phi_i)$ is also plotted (dashed line). (b) The standard deviation of neurons in the first layer is plotted as a function of their PDs (solid line). For comparison, $\sqrt{2(D_i(\phi_i))^2}$ is also plotted (dashed line). The statistics were calculated for a system in model 1 with N = 300 and p = 4.
a function of their PDs, for stimulus θ = 0 degrees. They show a good fit with the analytical predictions of equations 2.22 and 2.23. We conclude that for sufficiently large p, the linear filtering of the low-order Fourier modes effectively decorrelates the activities of the different neurons. The subsequent squaring of the filtered activity is required so that the information embedded in the second-order statistics of the $R_i$ can be extracted by a linear readout downstream (the second layer of weights in our readout). The results of equations 2.22 and 2.23 also shed light on our expression for the efficiency of the bilinear readout, equation 2.19. Comparing equation 2.19 with equation 1.20, one observes that the efficiency of the bilinear readout is equivalent to the efficiency of a linear population vector of independent neurons with mean responses $D(\phi_i - \theta)$ and standard deviations $\sqrt{2}\,D(\phi_i - \theta)$, in line with equations 2.22 and 2.23.

2.4.1 The FI of the First Readout Layer Population. Our bilinear readout consists of a decorrelation step, a nonlinear operation, and finally a population vector summation. We may inquire how much better we could do if we replaced the last layer by a more complicated operation (e.g., an RN). This question can be addressed by calculating the FI, $J_R$, of the R layer. To do this, we use the fact that the $R_i$ are squares of gaussian variables: $R_i = t_i^2$, with $t_i = \sum_j W_{ij} r_j$. Hence, the FI of the $R_i$ is equal to the FI of the $t_i$. In the limit p ≪ N ≪ Nsat(p), the $t_i$ become independent random gaussian variables with zero mean and variance $D_i$. Applying equation 1.13, we obtain in this limit $J_R = J_D$, where $J_D$ is given by equation 2.6. A corollary of this result is that $J_{cov} \geq J_D + o(N)$. In particular, one obtains that Jcov scales (at least) linearly with N. In fact, as indicated above (see equation 2.6), our numerical results indicate that $J_{cov} \approx J_D$, implying that the filtering out of the low-lying modes of the inputs by the $R_i$ leaves most of the information intact. Nevertheless, the NLPV estimator does not saturate the FI of the system (or of the $R_i$), as its variance, equation 2.19, is larger than $1/J_{cov} \approx 1/J_D$. This is due to the choice of a population vector for the second-layer weights, which extracts only the low-order components in the tuning of the $R_i$.

2.4.2 Optimal Bilinear Readout and Discrimination. For a stimulus-independent set of second-layer weights, the NLPV weights are optimal in the sense of squared error averaged over all angles, as in the case of a linear PV. However, one can obtain optimal results with a bilinear readout appropriate for local tasks. Considering a discrimination task, as described in section 1.3, we define a bilinear discriminator based on the sign of $q = W(R^{(1)} - R^{(2)})$. Minimizing the noise, $\langle(\delta q)^2\rangle \approx 2\, W C^R W^{t}$, while constraining the signal, $\langle q \rangle \approx \delta\theta\, W \langle R \rangle'$, for small angular deviations, one obtains the optimal weights $W_i = D_i'/(2 D_i^2)$, for p ≪ N ≪ Nsat. Calculating the discriminability in this limit (similarly to the calculation in section A.2) yields

$$d'_R = |\delta\theta|\,\sqrt{J_D}. \qquad (2.24)$$
This result shows once again that the R layer retains most of the information coded in the correlated neurons of the input layer and, furthermore, that locally this information can be extracted by a further linear pooling.

2.5 Combined Population Vector. So far, we have focused on extracting the information from the covariance matrix. In fact, linear and nonlinear readout schemes can be combined to implement a readout mechanism that is sensitive to the different aspects of the statistics of the neural responses. We define a combined readout as a linear combination of the LPV estimator, $\hat{z}_L$, equation 1.14, and the NLPV estimator, $\hat{z}_B$, equation 2.8:

$$\hat{z} = a_L \hat{z}_L + a_B \hat{z}_B. \qquad (2.25)$$

Note that this estimator is unbiased, and its accuracy is independent of θ. Minimizing the quadratic error with respect to the coefficients $a_L$ and $a_B$, we obtain the optimal set of coefficients,

$$\begin{pmatrix} a_L \\ a_B \end{pmatrix} = \begin{pmatrix} \langle(\delta\hat{y}_L)^2\rangle & \langle\delta\hat{y}_L\,\delta\hat{y}_B\rangle \\ \langle\delta\hat{y}_L\,\delta\hat{y}_B\rangle & \langle(\delta\hat{y}_B)^2\rangle \end{pmatrix}^{-1} \begin{pmatrix} \langle\hat{x}_L\rangle \\ \langle\hat{x}_B\rangle \end{pmatrix}, \qquad (2.26)$$
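The calibration of equation 2.26 can be sketched as follows: estimate the 2 × 2 noise matrix of the imaginary parts and the mean real parts of the two estimators from trials at θ = 0 degrees, and solve for the coefficients. All parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, c, rho = 300, 10, 0.3, 1.0
phi = np.linspace(-np.pi, np.pi, N, endpoint=False)
u = np.exp(1j * phi)

f = 10 + 40 * np.exp((np.cos(phi) - 1) / 0.5)        # mean tuning at theta = 0
b = f.copy()
b_ref = b.min()
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)
C = c * b_ref * np.exp(-d / rho)
np.fill_diagonal(C, b)                               # model-1 covariance
L = np.linalg.cholesky(C + 1e-9 * np.eye(N))

n = np.arange(-p, p + 1)
E = np.exp(1j * np.outer(phi, n)) / np.sqrt(N)
W = np.eye(N) - np.real(E @ E.conj().T)              # high-pass filter, eq. 2.10

r = f + rng.standard_normal((5000, N)) @ L.T         # trials at theta = 0
zL = r @ u                                           # LPV, eq. 1.14
zB = ((r @ W.T) ** 2) @ u                            # NLPV, eq. 2.8

M = np.cov(np.stack([zL.imag, zB.imag]))             # noise matrix of eq. 2.26
s = np.array([zL.real.mean(), zB.real.mean()])
aL, aB = np.linalg.solve(M, s)                       # optimal coefficients
z = aL * zL + aB * zB                                # combined readout, eq. 2.25
print("combined rms error (rad):", np.sqrt(np.mean(np.angle(z) ** 2)))
```

In practice the coefficients would be calibrated once and then applied to new trials; here the same trials are reused purely to illustrate the algebra of equations 2.25 and 2.26.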
Figure 11: Comparison of the efficiency of different readout schemes. The efficiency of the LPV (boxes), NLPV (circles), and the combined readout (asterisks), as defined in section 2.5, was calculated by averaging the estimation errors of these readout schemes over 4000 trials. For comparison, we show the Fisher information, J (solid line), and the asymptotic limit of the NLPV, equation 2.19 (dashed line). For the NLPV, p = 12 was used. In this figure, model 1 was used for the input-layer neurons.
where the averages are taken with respect to the probability distribution of the neural responses, $\{r_i\}$, for θ = 0 degrees. The efficiency of the combined readout is shown in Figure 11 (asterisks). As can be seen, the performance of the combined readout is superior to that of the NLPV, owing to the addition of the information embedded in the first-order statistics.

3 Summary and Discussion

We have investigated readout mechanisms for correlated populations of neurons that code for an angle and whose second-order statistics vary with the stimulus. Using the Fisher information, we show that the total amount of information in the system about the stimulus is extensive: it grows linearly with population size (see equation 2.6). This is in contrast to the case of stimulus-independent correlations, where the FI saturates to a finite value due to the long-range correlations (e.g., compare J and Jmean in Figure 2), as has been shown previously (Sompolinsky et al., 2001). This information is embedded in the stimulus-dependent covariance matrix, since the information in the tuning of the mean rates saturates to a finite value (see Figure 2).
We find that the main source of information in the second-order statistics of the correlated populations is the stimulus dependence of the variances, not the cross-correlations (see Figure 4). This is because the variances, being larger than the cross-correlations between nearby neurons, induce a stimulus-dependent discontinuity in the spatial structure of the covariance matrix (see Figure 9a). This stimulus sensitivity is spread across all spatial modes and hence is not significantly suppressed by the slowly varying modes of the correlated noise. The above analysis has a direct bearing on the assessment of distributed coding in cortical neuronal ensembles. Although pairs of cortical neurons often exhibit significant cross-correlations, the degree and ubiquity of the modulation of the noise cross-correlograms by changes of the stimulus are unclear (Aertsen, Gerstein, Habib, & Palm, 1989; Ahissar, Ahissar, Bergman, & Vaadia, 1992; Maynard et al., 1999). On the other hand, the variance of the spike rates of cortical cells typically changes monotonically with the mean rates; hence, it is strongly modulated by the stimulus. In addition, the variance in spike counts is in general substantially larger than the mean cross-correlations of nearby neurons (see, e.g., Lee et al., 1998; note that the condition for the variance, $b(\phi_i)$, to be larger than the correlation with nearby neurons, $c\,b_{ref}$, in model 1 is equivalent to the demand that the correlation coefficient of nearby neurons, $c\,b_{ref}/b(\phi_i)$, be less than one in absolute value), validating our assumption of a discontinuity in the spatial structure of the neuronal covariance matrices. If this analysis is correct, then despite the presence of significant noise cross-correlations in cortical ensembles, the responses of large cortical pools contain accurate information about the coded stimuli, mainly due to the tuning of the variance of the firing rates.
This suggests that experimental characterization of the tuning properties of the response variances may be as important as the traditional characterization of the tuning properties of the trial-averaged responses. Extracting the information embedded in the response variances requires more complex readouts than the ones usually assumed in modeling of neuronal population coding. We show that linear schemes (e.g., the population vector; see Figure 3) are incapable of extracting information from second-order statistics. Their efficiency is bounded by the relatively small Fisher information of the mean rates. Here we propose a bilinear readout model that we call a nonlinear population vector. The NLPV consists of two stages of processing (see Figure 5). In the first stage, a linear filtering of the input neurons is performed, such that the slowly varying Fourier modes of the inputs are subtracted out. This high-pass filtering is followed by a quadratic nonlinearity. We show that this stage effectively decorrelates the neuronal responses and generates a population of uncorrelated responses with stimulus-dependent means and standard deviations (see Figures 9 and 10), both of which are proportional to the variances of the input neurons (see equations 2.22 and 2.23). In the second stage, the outputs of the first stage are linearly summed, similar to a linear population vector. The resultant estimate has an efficiency that is extensive in the pool size, although it is below the Fisher information of the system (see Figure 11). Our readout model suggests that the degree of noise correlations in cortical ensembles will exhibit substantial heterogeneity, where ensembles higher in the sensory processing pathways are decorrelated by high-pass filtering mediated by the internal (or interlaminar) synaptic patterns. In addition, the decorrelated neurons should exhibit more significant input-output nonlinearity than correlated lower-level neurons. Our readout model employs a quadratic nonlinearity. The realization of such a computation by neurons has been discussed in several previous studies (Poirazi & Mel, 2000; Schwartz & Simoncelli, 2001; Deneve et al., 1999). We have also investigated the robustness of our readout to deviations from the precise form of quadratic nonlinearity. We have studied numerically the efficiency of a more general nonlinearity by taking

$$R_i = \Big(\sum_{j=1}^{N} W_{ij}\, r_j\Big)^{\alpha}, \tag{3.1}$$
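As a numerical illustration of this two-stage readout, the sketch below (not the authors' code; the ring geometry, the equal spacing of preferred angles, and the absolute value inside the nonlinearity are assumptions made for the demonstration) builds a high-pass filter that removes the Fourier modes $|n| \le p$ of the population activity and then applies a pointwise nonlinearity with exponent $\alpha$:

```python
import numpy as np

def highpass_filter(N, p):
    """Build the high-pass filter W: identity minus the projector onto
    the slow Fourier modes |n| <= p of a ring of N equally spaced neurons
    (a sketch of the filter described in the text; normalization assumed)."""
    phi = 2 * np.pi * np.arange(N) / N
    low_pass = np.zeros((N, N))
    for n in range(-p, p + 1):
        # outer(e^{in phi_i}, e^{-in phi_j}) / N projects onto mode n
        low_pass += np.real(np.outer(np.exp(1j * n * phi),
                                     np.exp(-1j * n * phi))) / N
    return np.eye(N) - low_pass

def nlpv_first_stage(r, W, alpha=2.0):
    """First stage of the NLPV: linear high-pass filtering followed by a
    pointwise nonlinearity with exponent alpha (equation 3.1). The absolute
    value keeps non-integer alpha well defined (an added assumption)."""
    return np.abs(W @ r) ** alpha

N, p = 100, 5
W = highpass_filter(N, p)
phi = 2 * np.pi * np.arange(N) / N

R = nlpv_first_stage(1.0 + 0.5 * np.cos(phi), W)   # only modes n = 0, 1
R2 = nlpv_first_stage(np.cos(10 * phi), W)         # mode n = 10 > p
print(R.max(), R2.max())  # slow input is filtered to ~0; fast input passes
```

With the purely slow input pattern, the filtered output is numerically zero, while a mode of order $> p$ passes through unchanged, which is the decorrelating high-pass behavior described in the text.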
where $\alpha$ is a parameter of the neural responses that characterizes the neuron's input-output nonlinearity. The simulations have shown that moderate deviations of $\alpha$ from 2 yield NLPV performances that are qualitatively the same as the quadratic one, as shown in Figure 12. Further reasonable altering of other parameters of the model will not qualitatively change the results presented here. For example, changing the DC component of the mean response tuning curves (i.e., changing $f_{\mathrm{ref}}$ while keeping $f_{\mathrm{max}} - f_{\mathrm{ref}}$ constant) will not affect any result. Increasing the modulation of the tuning curve, $f_{\mathrm{max}} - f_{\mathrm{ref}}$, by a factor $\beta$ will increase $J_{\mathrm{mean}}$ and the efficiency of the LPV by a factor $\beta^2$; however, they will still saturate to a finite limit at the same system size as before the scaling. An interesting feature of our readout scheme is (approximate) scale invariance. As can be seen from equation 2.19, the performance of the readout depends on the tuning of the variances but is independent of their absolute scale. Such a change in the scale of the variances may be implemented by a change in the overall time window of the observed responses. In contrast, a linear readout will be sensitive to a change in the overall time window, since the change will affect its signal-to-noise ratio. This scale invariance of the bilinear readout can be traced to the form of the Fisher information, equations 1.10 and 1.11. Changing the time window by a factor of, say, $\gamma$ is expected to change both the mean spike counts and the covariance matrix by the same factor. Hence, whereas $J_{\mathrm{mean}}$ will change by $\gamma$, $J_{\mathrm{cov}}$ will not. This suggests that for short time intervals, the NLPV is superior to the LPV even for uncorrelated populations. It should be noted, however, that this scale invariance is a property of the gaussian statistics assumed here, which is no longer a
[Figure 12 appears here: $(\Delta\theta)^2$ (deg$^2$) plotted against pool size $N$; legend: $\alpha = 3.2$, $\alpha = 2$, $\alpha = 1.6$, NLPV asymptote.]
Figure 12: Efficiency of the NLPV readout for small deviations from the quadratic nonlinearity (see equation 3.1), as a function of the number of neurons in the pool. From the bottom: $\alpha = 1.6$ (diamonds), $\alpha = 3.2$ (squares), $\alpha = 2$ (circles). The asymptotic limit of the NLPV, equation 2.19, is shown for comparison (dashed line). For all plots, $p = 10$ was used. In this figure, model 1 was used for the input-layer neurons.
valid model for the neuronal responses at sufficiently low spike counts (i.e., at very short observation times). Finally, we have shown that linear and nonlinear readout schemes can be combined to implement a readout mechanism that is sensitive to several aspects of the statistics of the neural responses (see equation 2.25). Even in this simple example, a wisely chosen weighted sum of the linear and nonlinear PVs yields a readout that is more accurate than the LPV and the NLPV (see Figure 11). In this article, we have chosen a high-pass filter form of the linear weights matrix, $W$, based on heuristic arguments. Alternatively, one may find the optimal weights by minimizing the squared error of the readout (averaged over all angles). However, finding the optimal weights is complicated. In a previous study we used this approach for model 2 (Shamir & Sompolinsky, 2001a, 2001b). We found the efficiency of the optimal readout is similar to that of the high-pass NLPV with an appropriate value of $p$ (for details, see Shamir & Sompolinsky, 2001a). Recent studies have used model neurons with nonlinear input-output functions in order to obtain an improved population readout. Deneve et al. (1999) used recurrent network (RN) nonlinear neurons, followed by linear population vector pooling. They show that their network yields superior
performance compared to the LPV. In fact, they showed that for low noise levels, their network extracts all the information that is coded in the average firing rates of the neurons, $J_{\mathrm{mean}}$. Interestingly, almost all of the information that can be read by their mechanism is obtained in the first iteration of the dynamics. The first iteration step of their dynamics can be modeled according to equation 2.7 with a feedforward linear filter $W$ identical to their recurrent weight matrix. Because information coded in the average firing rates of the neurons resides mainly in the low-order Fourier modes of the network, their choice of filter, $W$, leaves only these few slow modes of the system and filters out all of the higher Fourier modes. Thus, their readout suffers from two major drawbacks. First, it is unable to read information coded in the variance of the responses. Second, it does not overcome the problem of the strong correlated noise in the slow Fourier modes of the system (see Figure 3). Note that their study shows how bilinear readouts can be used in order to read information efficiently from the first-order statistics of the neuronal responses. In the context of this work, we have studied the effect of replacing the feedforward layer $R_i$ by an RN with recurrent weights equal to our high-pass filter $W$. We have found (results not shown) that iterating the dynamics of the network beyond the first step causes a deterioration of the readout accuracy of the system rather than an improvement. On the other hand, it is expected that feeding the output of our $R_i$ into an RN with weights similar to that of Deneve et al. (1999) might yield slightly improved performance, closer to the bound given by $J_{\mathrm{cov}} \approx J_D$ (see equation 2.6).

Appendix

A.1 Derivation of Angular Estimation Error. Let $\hat\theta = \arg(\hat z)$ be an unbiased estimator of $\theta$. Define $\hat z = \hat x + i\hat y$. For small fluctuations in $\hat x$ and $\hat y$, one can expand $\hat\theta$ in $\hat x$ and $\hat y$ around their mean, obtaining
$$\langle(\delta\hat\theta)^2\rangle = \frac{1}{|\langle\hat z\rangle|^2}\,(-\sin\theta,\ \cos\theta)\begin{pmatrix}\langle(\delta\hat x)^2\rangle & \langle\delta\hat x\,\delta\hat y\rangle\\ \langle\delta\hat x\,\delta\hat y\rangle & \langle(\delta\hat y)^2\rangle\end{pmatrix}\begin{pmatrix}-\sin\theta\\ \cos\theta\end{pmatrix}, \tag{A.1}$$
where averages are taken with respect to the distribution of the neuronal responses for a given $\theta$. If the estimation error, $\langle(\delta\hat\theta)^2\rangle$, is independent of $\theta$, one can evaluate the estimation error for $\theta = 0$ degrees. In this case, equation A.1 reduces to

$$\langle(\delta\hat\theta)^2\rangle = \frac{\langle(\delta\hat y)^2\rangle}{\langle\hat x\rangle^2}, \tag{A.2}$$

which is in the form of a signal, $\langle\hat x\rangle$, to noise, $\sqrt{\langle(\delta\hat y)^2\rangle}$, relation.
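Equation A.2 can be checked with a quick Monte Carlo experiment (an illustration only; the mean and standard deviations below are arbitrary choices): draw $\hat x$ and $\hat y$ as independent gaussians with $\theta = 0$, so that $\langle\hat z\rangle$ is real, and compare the variance of $\arg(\hat z)$ with $\langle(\delta\hat y)^2\rangle/\langle\hat x\rangle^2$:

```python
import numpy as np

# Monte Carlo check of equation A.2 at theta = 0: in the small-fluctuation
# regime, var(arg(z)) should equal <(dy)^2> / <x>^2.
rng = np.random.default_rng(0)
mu_x, sig_x, sig_y = 10.0, 0.2, 0.3      # illustrative values (assumed)
x = mu_x + sig_x * rng.standard_normal(200_000)
y = sig_y * rng.standard_normal(200_000)

empirical = np.var(np.angle(x + 1j * y))
predicted = sig_y**2 / mu_x**2
print(empirical, predicted)  # the two agree to within Monte Carlo error
```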
A.1.1 Optimality of the LPV. We will now show the optimality of the LPV, assuming isotropy of the neural network. Let $\hat z = \sum_i w_i r_i$ be an estimator of $e^{i\theta}$. Define a cost function, $E$, as the squared estimation error of $\hat z$ averaged over all angles:

$$E = \frac{1}{2}\int \frac{d\theta}{2\pi}\,\langle|e^{i\theta} - \mathbf{w}\mathbf{r}|^2\rangle. \tag{A.3}$$

Minimizing $E$ with respect to $\mathbf{w}$, one obtains that the optimal set of linear weights, $\mathbf{w}_{\mathrm{opt}}$, is given by

$$\mathbf{w}_{\mathrm{opt}} = Q^{-1}\mathbf{u} \tag{A.4}$$
$$u_k = \int \frac{d\theta}{2\pi}\, e^{i\theta} f_k(\theta) \tag{A.5}$$
$$Q_{ij} = \int \frac{d\theta}{2\pi}\, \big(C_{ij}(\theta) + f_i(\theta) f_j(\theta)\big). \tag{A.6}$$

Now, using the symmetry of the tuning curves, $f_i(\theta) = f(\phi_i - \theta)$, one obtains from equation A.5, $u_i = f_1 e^{i\phi_i}$. Both the correlation matrix $C_{ij}$ and the outer product $f_i f_j$ are functions of two angles, $(\phi_i - \theta)$ and $(\phi_j - \theta)$. After integration over all values of $\theta$ (e.g., equation A.6), one obtains by a change of variables $\theta' = \theta - \phi_j$ that $Q$ is a function of only one angle, $Q_{ij} = Q(\phi_i - \phi_j)$. Hence, $\mathbf{u}$ is an eigenvector of $Q$. Thus, the optimal linear weights for discrimination are $w_i \propto e^{i\phi_i}$; that is, $\hat z$ is the LPV.

A.2 Derivation of the Discrimination Error. One can apply the linear readout to the task of discrimination. We define the readout according to the sign of $q = \mathbf{w}(\mathbf{r}^{(1)} - \mathbf{r}^{(2)})$. Minimizing the noise, $\langle(\delta q)^2\rangle = 2\mathbf{w}C\mathbf{w}^t$, while constraining the signal, $\langle q\rangle \approx \mathbf{w}\mathbf{f}'\,\delta\theta$, one obtains that the optimal linear readout weights are given by $\mathbf{w} = C^{-1}\mathbf{f}'$. Thus, for the optimal discriminator, $q$ is a gaussian random variable with mean $\delta\theta\,\mathbf{f}'^{\,t}C^{-1}\mathbf{f}' = \delta\theta\,J_{\mathrm{mean}}$ and variance $2J_{\mathrm{mean}}$. The probability of discrimination error, $q\,\delta\theta < 0$, for the optimal linear discriminator is given by $H(d'_L/\sqrt{2})$ with a discriminability

$$d'_L = |\delta\theta|\,\sqrt{J_{\mathrm{mean}}(\theta)}, \tag{A.7}$$

thus yielding the result of equation 1.22.

A.3 Derivation of the NLPV Noise Term. For the calculation of the NLPV noise term, it is convenient to define

$$\hat Z = \mathrm{Tr}\{A\,\mathbf{r}\mathbf{r}^t\} \tag{A.8}$$
$$A_{ij} = \sum_{k=1}^{N} W_{ik}\, e^{i\phi_k}\, W_{kj}, \tag{A.9}$$
where $\mathbf{r}\mathbf{r}^t$ is an $(N \times N)$ matrix of rank 1. Using the above definition of $A$ with the notation $A = A^R + iA^I$, with $A^R$ and $A^I$ real matrices (and writing $\hat Z = \hat X + i\hat Y$), the signal can be written as $\langle\hat X\rangle = \mathrm{Tr}(A^R C) + \mathrm{Tr}(A^R \mathbf{f}\mathbf{f}^t)$. The noise is given by

$$\langle(\delta\hat Y)^2\rangle = \sum_{ijkl} A^I_{ij} A^I_{kl} M_{ijkl} \tag{A.10}$$
$$M_{ijkl} = \langle\delta(r_i r_j)\,\delta(r_k r_l)\rangle, \tag{A.11}$$

where $\delta(r_i r_j) = r_i r_j - \langle r_i r_j\rangle$. Using the gaussianity of the $r_i$, we obtain

$$M_{ijkl} = C_{ik}C_{jl} + C_{il}C_{jk} + C_{ik}f_j f_l + C_{il}f_j f_k + C_{jk}f_i f_l + C_{jl}f_i f_k. \tag{A.12}$$

Substituting equation A.12 into equation A.10 yields

$$\langle(\delta\hat Y)^2\rangle = 2\,\mathrm{Tr}\{(A^I C)^2\} + 4\,\mathrm{Tr}\{(A^I C)\,A^I \mathbf{f}\mathbf{f}^t\}. \tag{A.13}$$
The last term on the right-hand side of equation A.13 contains only terms including Fourier transforms of the tuning curve, $f(\theta)$, of order $> p$. Due to the smoothness of $f(\theta)$, these terms will decay faster than any power of $p$ and hence will be neglected in our analysis. Using the form of $W$, equation 2.10, the expression for $A^I$, the imaginary part of $A$ (equation A.9), reduces to

$$A^I_{ij} = \frac{1}{2iN}\sum_{n > p \ \text{or}\ n < -p-1}\Big(e^{i(n+1)\phi_i - in\phi_j} - e^{in\phi_i - i(n+1)\phi_j}\Big). \tag{A.14}$$

Substituting equation A.14 into equation A.13, we obtain

$$\langle(\delta\hat Y)^2\rangle = N^2 \sum_{m,n\, >\, p \ \text{or}\ <\, -p-1}\Big\{\tilde C_{m,n}\tilde C_{n+1,m+1} - \tilde C_{m+1,n}\tilde C_{n+1,m}\Big\}, \tag{A.15}$$
where $\tilde C_{m,n}$ is the double Fourier transform of $C$, as defined in equation 2.12. We now consider the contribution of the different parts of the correlation matrix, namely $S$ and $D$, to the noise term, equation A.15. As mentioned above (equation 2.13), the transform of $C$ is a sum of transforms of $S$ and $D$. Thus, the products $\tilde C_{m,n}\tilde C_{t,u}$ contain three different terms: products of transforms of $D$, products of transforms of $S$, and a mixture term containing products of transforms of $D$ and transforms of $S$. Using the relation, equation 2.13, we obtain the contribution of the terms containing only $D$, $I_D$, to the noise:

$$I_D = \sum_{m,n\, >\, p \ \text{or}\ <\, -p-1}\{D_{m-n}D_{n-m} - D_{m+1-n}D_{n+1-m}\} \approx N\sum_m\{D_m D_{-m} - D_m D_{2-m}\} = N\,\frac{B_0 - B_2}{2}, \tag{A.16}$$
where we have neglected terms of order $p$ relative to $N$. $B_n$ is the $n$th Fourier transform of $B(\phi) = 2D^2(\phi)$. Note that this term scales linearly with $N$ and does not decay fast as $p$ grows. The contribution of terms containing only $S$, $I_S$, is given by

$$I_S = N^2 \sum_{m,n\, >\, p \ \text{or}\ <\, -p-1}\big\{S_{m,n}S_{n+1,m+1} - S_{m+1,n}S_{n+1,m}\big\}. \tag{A.17}$$
This term, $I_S$, scales like $N^2$. However, since it contains only transforms of $S$ that are of order $> p$, it will be a strongly decaying function of $p$. Below, we focus on model 1. In model 1, the Fourier modes are eigenvectors of $S$: $S_{mn} = \delta_{mn}S(n)$, thus yielding

$$I_S = 2N^2 \sum_{n>p} S(n)\,S(n+1), \quad \text{model 1.} \tag{A.18}$$
For large $n$, we can approximate $S(n) \approx (c\,b_{\mathrm{ref}}/(\pi\rho))\,(1 - (-1)^n e^{-\pi/\rho})\,n^{-2}$ in model 1. Replacing the summation in the last equation by an integral, we obtain

$$I_S = \frac{2N^2}{3}\left(\frac{c\,b_{\mathrm{ref}}}{\pi\rho}\right)^2 (1 - e^{-2\pi/\rho})\,p^{-3}, \quad \text{model 1.} \tag{A.19}$$
The contribution of the mixed term, $I_{SD}$, is

$$I_{SD} = 2N(D_0 - D_2)\left(2\sum_{n>p} S(n) - S(p)\right) \propto N p^{-1}, \quad \text{model 1.} \tag{A.20}$$
Note that although this term yields a contribution that scales linearly with $N$, this contribution is suppressed due to the algebraic decay of $I_{SD}$ with the increase of $p$. Neglecting the contribution of $I_{SD}$, we obtain that the efficiency of the NLPV is given by

$$\langle(\delta\hat\theta)^2\rangle^{-1} = N J_0\,\frac{1}{1 + N/N_{\mathrm{sat}}} \tag{A.21}$$
$$N_{\mathrm{sat}}(p) = \lim_{N\to\infty}\frac{N I_D}{I_S} \approx \frac{B_0 - B_2}{(4/3)\left(\frac{c\,b_{\mathrm{ref}}}{\pi\rho}\right)^2(1 - e^{-2\pi/\rho})}\;p^3, \quad \text{model 1.} \tag{A.22}$$
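The saturation behavior in equation A.21 is easy to see numerically. In the sketch below (the values of $J_0$ and $N_{\mathrm{sat}}$ are arbitrary illustrations, not parameters from the paper), the inverse squared error grows linearly with $N$ for $N \ll N_{\mathrm{sat}}$ and approaches the ceiling $J_0 N_{\mathrm{sat}}$ for $N \gg N_{\mathrm{sat}}$; by equation A.22, raising the cutoff $p$ raises this ceiling as $p^3$:

```python
def nlpv_efficiency(N, J0, Nsat):
    """Inverse squared estimation error of the NLPV (equation A.21):
    extensive for N << Nsat, saturating at J0 * Nsat for N >> Nsat."""
    return N * J0 / (1.0 + N / Nsat)

J0 = 0.05        # single-neuron information scale (illustrative)
Nsat = 1000.0    # saturation pool size; grows as p**3 (equation A.22)
for N in [10, 100, 1_000, 10_000, 100_000]:
    print(N, round(nlpv_efficiency(N, J0, Nsat), 3))
# small N: efficiency ~ N * J0 (extensive); large N: approaches J0 * Nsat
```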
The above results are qualitatively the same for model 2 as well. The algebraic decay of the transforms of $S$ results from the discontinuity of the derivatives of $S$ on the diagonal. This discontinuity causes the $n$th Fourier component to decay asymptotically like $n^{-2}$ in model 1. In model 2, the Fourier components are not its eigenvectors. However, the scaling of the
different contributions to the noise term, namely $I_D$, $I_S$, and $I_{SD}$, is the same as in model 1. Thus, for model 2, the relation, equation A.21, still holds with the definition

$$N_{\mathrm{sat}}(p) = \lim_{N\to\infty}\frac{N I_D}{I_S}. \tag{A.23}$$
A.3.1 The Calculation of the Moments of $R_i$. The $R_i$ are squares of gaussian variables: $R_i = t_i^2$, $t_i = \sum_j W_{ij} r_j$, with average

$$\langle t_i\rangle = \sum_{|n|>p} e^{in\phi_i} f_n \tag{A.24}$$

and covariance

$$\langle\delta t_i\,\delta t_j\rangle = \sum_{|m|,|n|>p} e^{-in\phi_i + im\phi_j}\,\tilde C_{mn}. \tag{A.25}$$

In the limit of $p \ll N \ll N_{\mathrm{sat}}(p)$, the averages of $t$ are exponentially small in $p$ and hence will be neglected:

$$\langle t_i\rangle \approx 0, \quad p \ll N \ll N_{\mathrm{sat}}(p). \tag{A.26}$$

Using equation 2.13, we obtain from equation A.25 in the limit of $p \ll N \ll N_{\mathrm{sat}}(p)$:

$$\langle\delta t_i\,\delta t_j\rangle = D_i\,\delta_{ij}, \quad p \ll N \ll N_{\mathrm{sat}}(p). \tag{A.27}$$
From equations A.26 and A.27 and the gaussianity of the $t_i$, we obtain the results of equations 2.22 and 2.23.

Acknowledgments

This research is partially supported by the Israel Science Foundation, Center of Excellence Grant no. 8006/00. M.S. is supported by a scholarship from the Clore Foundation.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1), 91–101.
Aertsen, A. M., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Ahissar, M., Ahissar, E., Bergman, H., & Vaadia, E. (1992). Encoding of sound-source location and movement: Activity of single neurons and interactions between adjacent neurons in the monkey auditory cortex. J. Neurophysiol., 67(1), 203–215.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12(12), 4745–4765.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nat. Neurosci., 2(8), 740–745.
Fitzpatrick, D. C., Batra, R., Stanford, T. R., & Kuwada, S. (1997). A neuronal population code for sound localization. Nature, 388(6645), 871–874.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Kay, S. M. (1993). Fundamentals of statistical signal processing. London: Prentice Hall International.
Lee, C., Rohrer, W. H., & Sparks, D. L. (1988). Population coding of saccadic eye movements by neurons in the superior colliculus. Nature, 332(6162), 357–360.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neurosci., 18(3), 1161–1170.
Mastronarde, D. N. (1983). Correlated firing of cat retinal ganglion cells. II. Responses of X- and Y-cells to single quantal events. J. Neurophysiol., 49(2), 325–349.
Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. J. Neurosci., 19(18), 8083–8093.
Poirazi, P., & Mel, B. W. (2000). Choice and value flexibility jointly contribute to the capacity of a subsampled quadratic classifier. Neural Comput., 12(5), 1189–1205.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nat. Neurosci., 4(8), 819–825.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90(22), 10749–10753.
Shamir, M., & Sompolinsky, H. (2001a). Correlation codes in neuronal networks. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Shamir, M., & Sompolinsky, H. (2001b). Nonlinear population vector for correlated neurons. Abstract presented at the Society for Neuroscience's 31st Annual Meeting.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5 Pt 1), 051904.
Thomas, J. A., & Cover, T. M. (1991). Elements of information theory. New York: Wiley.
van Kan, P. L., Scobey, R. P., & Gabor, A. J. (1985). Response covariance in cat visual cortex. Exp. Brain Res., 60(3), 559–563.
Vogels, R., Spileers, W., & Orban, G. A. (1989). The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res., 77(2), 432–436.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261(5124), 1055–1058.
Young, M. P., & Yamane, S. (1992). Sparse population coding of faces in the inferotemporal cortex. Science, 256(5061), 1327–1331.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.

Received August 6, 2003; accepted November 6, 2003.
LETTER
Communicated by William Bialek
Enhancement of Information Transmission Efficiency by Synaptic Failures

Mark S. Goldman*
[email protected]
Brain and Cognitive Sciences Department, MIT, Cambridge, MA 02139, U.S.A.
Many synapses have a high percentage of synaptic transmission failures. I consider the hypothesis that synaptic failures can increase the efficiency of information transmission across the synapse. I use the information transmitted per vesicle release about the presynaptic spike train as a measure of synaptic transmission efficiency and show that this measure can increase with the synaptic failure probability. I analytically calculate the Shannon mutual information transmitted across two model synapses with probabilistic transmission: one with a constant probability of vesicle release and one with vesicle release probabilities governed by the dynamics of synaptic depression. For inputs generated by a non-Poisson process with positive autocorrelations, both synapses can transmit more information per vesicle release than a synapse with perfect transmission, although the information increases are greater for the depressing synapse than for a constant-probability synapse with the same average transmission probability. The enhanced performance of the depressing synapse over the constant-release-probability synapse primarily reflects a decrease in noise entropy rather than an increase in the total transmission entropy. This indicates a limitation of analysis methods, such as decorrelation, that consider only the total response entropy. My results suggest that synaptic transmission failures governed by appropriately tuned synaptic dynamics can increase the information-carrying efficiency of a synapse.

1 Introduction

Synaptic transmission failures are commonly seen in recordings of single central synapses, with the fraction of synaptic failures at a given synapse often exceeding the fraction of successful transmissions (Stevens & Wang, 1995; Murthy, Sejnowski, & Stevens, 1997). When a synapse fails to transmit, information about the presynaptic spike train is lost.
Nevertheless, the prevalence of synaptic failures suggests that they may have a functional role, and a variety of such roles have been proposed. Synaptic failures could reflect the conservation of limited synaptic resources for a synapse with a limited pool of primed, readily releasable vesicles (Murthy et al., 1997). Alternatively, synaptic failures could provide an energy-saving mechanism that reduces the large metabolic costs of synaptic transmission (Balasubramanian, Kimber, & Berry, 2001; Laughlin, 2001; Levy & Baxter, 2002; Schreiber, Machens, Herz, & Laughlin, 2002). Congruent with these proposals is the suggestion that synapses could serve as dynamic filters of the presynaptic spike train, with less important spikes being filtered out while more important spikes are transmitted (Maass & Zador, 1999; Natschlager & Maass, 2001; Goldman, Maldonado, & Abbott, 2002). Short-term synaptic dynamics such as synaptic depression or facilitation have been suggested as the mechanism underlying this role. Similarly, synaptic failures associated with synaptic depression have been suggested as a means of gain control by which a postsynaptic neuron can respond to important information arriving along particular synapses while filtering out less interesting inputs arriving at other synapses (Abbott, Varela, Sen, & Nelson, 1997). Common to the above proposals is the suggestion that synaptic failures can increase the efficiency of information transmission by a synapse. One possible source of inefficiency in natural spike trains is temporal autocorrelations. Spike train autocorrelations have been widely observed in the sensory areas of animals viewing natural stimuli (Dan, Atick, & Reid, 1996) and of freely viewing animals (Baddeley et al., 1997; Goldman et al., 2002) and indicate that information is encoded in a redundant manner (Barlow, 1961). Reduction of this redundancy has been suggested as a computational principle underlying early sensory processing (see, e.g., Attneave, 1954; Barlow, 1961; Srinivasan, Laughlin, & Dubs, 1982; Barlow & Foldiak, 1989).

*Present address: Department of Physics and Program in Neuroscience, Wellesley College, Wellesley, MA 02481, U.S.A.

Neural Computation 16, 1137–1162 (2004) © 2004 Massachusetts Institute of Technology
It has been suggested that the removal of spike train autocorrelations by the synapse could produce a more efficient representation of the encoded information (Goldman et al., 2002). However, the mechanism by which this decorrelation was proposed to occur, synaptic depression, is inherently stochastic. This means that the same spike train, if presented multiple times, will produce different sets of transmissions across a depressing synapse. In information-theoretic terms, for a fixed number of transmissions, decorrelation leads to maximization of the total response entropy of the transmissions but does not generally lead to a reduction in the noise entropy. In this article, I re-explore the hypothesis that synaptic failures can increase the efficiency of information transmission across a synapse. As a measure of information transmission efficiency, I use the Shannon information transmitted per vesicle release by a sequence of postsynaptic vesicle releases about a sequence of presynaptic action potentials (see section 2). For the cases considered here, it will be seen that the noise entropy rather than the total response entropy dominates the behavior of this quantity. Both the noise and the total entropies have been calculated in previous
studies that use direct numerical calculations of information (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). However, these studies are typically constrained to timescales on the order of tens of milliseconds. This limitation is due to a combinatoric explosion in the number of spike train probabilities that need to be computed numerically. Spike train autocorrelations and short-term synaptic dynamics often extend for many hundreds of milliseconds, making numerical calculations intractable for typical spike rates. Previous studies have approximated the information transmitted across a depressing synapse under various simplifying assumptions (Goldman, 2000; de la Rocha, Nevado, & Parga, 2002; Fuhrmann, Segev, Markram, & Tsodyks, 2002; Goldman et al., 2002). Here, I perform an exact analytic calculation of the information transmitted across a model depressing synapse. To make this calculation tractable, I use simplified models of a stochastic depressing synapse and of positively autocorrelated spike trains. Using several examples, I analyze the effects of synaptic failures, spike train autocorrelations, and synaptic dynamics on the transmission of information across a synapse. In section 3, the information transmitted per vesicle release by a synapse with constant probability of transmission is compared to that transmitted by a perfectly transmitting synapse with no synaptic failures. In section 4, the information per vesicle release for a model depressing synapse is compared to that for the perfectly transmitting synapse. In section 5.1, the depressing and constant-probability synapses are compared directly. This last comparison gives insight into how synaptic dynamics can increase the efficiency of information transmission across a synapse. The results of the analytic calculations are constrained to a particular, analytically tractable synaptic model and form of spike train autocorrelations.
In section 5.2, a qualitative framework is presented for understanding more generally how synaptic dynamics can enhance the efficiency of information transmission across a synapse when spike trains are autocorrelated.

2 Information Calculations with Point Process Spike Trains and Transmissions

The calculations that follow compute the Shannon mutual information $I(S;V)$ conveyed by a sequence of $N$ synaptic transmissions (vesicle releases) occurring at times $\mathbf{v} = (v_1, v_2, \ldots, v_N)$ about a presynaptic spike train with $n$ spikes occurring at times $\mathbf{s} \equiv (s_1, s_2, \ldots, s_n)$:

$$I(S;V) = H(V) - H(V|S) \tag{2.1}$$
$$= -\sum_{\mathbf{v}} P(\mathbf{v})\log_2(P(\mathbf{v})) + \sum_{\mathbf{s}} P(\mathbf{s})\sum_{\mathbf{v}} P(\mathbf{v}|\mathbf{s})\log_2(P(\mathbf{v}|\mathbf{s})), \tag{2.2}$$

where $P(\cdot)$ denotes probability, the random variables $S$ and $V$ denote the ensembles of spike trains and vesicle releases, respectively, and the sums are
over all possible realizations $\mathbf{s}$ and $\mathbf{v}$, respectively, of the random variables $S$ and $V$. Written in this form, the information is defined as the difference between the transmission entropy $H(V)$ and the noise entropy $H(V|S)$ of the train of synaptic vesicle releases. $H(V)$ quantifies the variability in the sequence of synaptic vesicle releases and gives the information-carrying capacity of the vesicle releases. Subtracting $H(V|S)$ from the transmission entropy removes the portion of the vesicle release train variability that arises when the presynaptic spike train is held fixed. The presynaptic spike and postsynaptic vesicle release sequences are modeled by point processes, with spike and release times specified with a constant precision $\delta T$. The probabilities $P(\mathbf{v})$ and $P(\mathbf{s})$ are then expressed in terms of probability densities $\mathcal{P}(\mathbf{v})$ and $\mathcal{P}(\mathbf{s})$ as $P(\mathbf{v}) = \mathcal{P}(\mathbf{v})(\delta T)^N$ and $P(\mathbf{s}) = \mathcal{P}(\mathbf{s})(\delta T)^n$, respectively. A consequence of this model is that it gives rise to a term $-N\log_2(\delta T)$ in $H(V)$. Correspondingly, the information per vesicle release $I(S;V)/N$ contains a constant term $-\log_2(\delta T)$ that approaches $\infty$ as $\delta T \to 0$. This reflects the fact that, in this limit, each of the $N$ vesicle releases is specified with infinitesimal resolution and has a correspondingly large capacity to carry information. In practice, any numerical calculation of the information requires the choice of a finite time bin that defines the meaningful precision for a given system. Since the biologically relevant precision $\delta T$ is not necessarily known, it is useful to have a measure for comparing synaptic transmission efficiency that is independent of $\delta T$. Because measures such as the information transmitted per unit time or energy are dependent on $\delta T$, I choose not to use these, even though they could provide interesting results that are calculable for specific choices of $\delta T$. I instead use the difference between the information per vesicle release transmitted by two synapses, which is independent of $\delta T$.
In section 6, I point out some qualitative distinctions between these different measures.
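The definitions in equations 2.1 and 2.2 can be made concrete with a deliberately tiny discrete example (added here for illustration; the probabilities $q$ and $p$ are assumed values, not from the text): a single time bin in which a spike occurs with probability $q$ and a spike triggers a vesicle release with probability $p$. Then $H(V)$ is the binary entropy of the overall release probability $qp$, and the noise entropy is the release uncertainty that remains once the spike train is fixed:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

def release_information(q, p):
    """I(S;V) = H(V) - H(V|S) (equations 2.1-2.2) for one time bin:
    spike with probability q, release given a spike with probability p."""
    HV = entropy([q * p, 1 - q * p])   # total release entropy
    HVgS = q * entropy([p, 1 - p])     # noise entropy: only a spike
                                       # leaves the release uncertain
    return HV - HVgS

print(release_information(0.2, 1.0))  # perfect synapse: I equals H(V)
print(release_information(0.2, 0.5))  # failures add noise entropy
```

Even in this one-bin caricature, the unreliable synapse transmits less information, anticipating the result derived next for Poisson input.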
3 Information Transmission for a Synapse with Constant Vesicle Release Probability

3.1 Poisson Input with Time-Varying Rate. I first calculate the information $I(S;V)$ for a synapse with constant vesicle release probability $p$ receiving input described by a Poisson process of arbitrary rate $r(t)$. This example illustrates the effect of synaptic unreliability in the case that (1) spikes are generated independently and (2) transmission failures are governed by a process that has no dynamics. After the calculation, I consider the effects of relaxing these two assumptions. Because the transmission probability $p$ is time independent and the input is Poisson, $H(V)$ and $H(V|S)$ can be found from the infinitesimal entropies $\delta H(V)$ and $\delta H(V|S)$ corresponding to a single time bin $(t, t + \delta T)$, where $\delta T$ is assumed to be small. Inserting the probabilities of vesicle release, $p\,r(t)\delta T$, and failure, $1 - p\,r(t)\delta T$, into equation 2.2 and ignoring terms of order $(\delta T)^2$
gives

$$\delta H(V) = -p\,r(t)\delta T \log_2(p\,r(t)\delta T) - (1 - p\,r(t)\delta T)\log_2(1 - p\,r(t)\delta T) \tag{3.1}$$
$$\approx p\,r(t)\delta T\left[\frac{1}{\ln 2} - \log_2(p\,r(t)\delta T)\right] \tag{3.2}$$
$$\delta H(V|S) = -r(t)\delta T\,\big[\,p\log_2(p) + (1-p)\log_2(1-p)\,\big]. \tag{3.3}$$
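Equation 3.2 is the small-$\delta T$ expansion of the exact one-bin entropy in equation 3.1. A quick numerical check (illustrative values only), writing $q = p\,r(t)\,\delta T$ for the per-bin release probability:

```python
import numpy as np

def exact_bin_entropy(q):
    """Equation 3.1: exact entropy (bits) of one bin with release prob. q."""
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def approx_bin_entropy(q):
    """Equation 3.2: small-q expansion, with q = p * r(t) * dT."""
    return q * (1 / np.log(2) - np.log2(q))

for q in [1e-2, 1e-3, 1e-4]:
    print(q, exact_bin_entropy(q), approx_bin_entropy(q))
# the relative error of the expansion vanishes as q -> 0
```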
Summing the above equations over time and dividing by the average number of transmissions in a time $T_{\mathrm{total}}$, $N = \int_0^{T_{\mathrm{total}}} p\,r(t)\,dt$, gives the entropies and information per vesicle release (here and throughout, I assume $\delta T$ is small and denote sums over time intervals of size $\delta T$ by integrals with respect to the dummy integration variable $t$):

$$\frac{H(V)}{N} = \frac{1}{\ln 2} - \frac{\int_0^{T_{\mathrm{total}}} r(t)\log_2(p\,r(t)\delta T)\,dt}{\int_0^{T_{\mathrm{total}}} r(t)\,dt} \tag{3.4}$$
$$\frac{H(V|S)}{N} = -\left[\log_2(p) + \frac{1-p}{p}\log_2(1-p)\right] \tag{3.5}$$
$$\frac{I(S;V)}{N} = \frac{1}{\ln 2} + \frac{1-p}{p}\log_2(1-p) - \frac{\int_0^{T_{\mathrm{total}}} r(t)\log_2(r(t)\delta T)\,dt}{\int_0^{T_{\mathrm{total}}} r(t)\,dt}. \tag{3.6}$$

It should be noted that, although different in detail, the above derivation is similar in form and spirit to that for the information carried by the arrival time of a single spike (Brenner, Strong, et al., 2000). For a synapse with perfect transmission ($p = 1$), the corresponding information per vesicle release equals the entropy per spike present in the presynaptic spike train,

$$\frac{I_{p=1}(S;V)}{N} = \frac{H(S)}{n} = \frac{1}{\ln 2} - \frac{\int_0^{T_{\mathrm{total}}} r(t)\log_2(r(t)\delta T)\,dt}{\int_0^{T_{\mathrm{total}}} r(t)\,dt}, \tag{3.7}$$

where $n$ is the number of spikes in the presynaptic train. The difference $\Delta I_{\mathrm{const}}/N$ between the information transmitted per vesicle release by the constant-probability and perfectly transmitting synapses is independent of the rate $r(t)$,

$$\frac{\Delta I_{\mathrm{const}}}{N} = \frac{I(S;V)}{N} - \frac{I_{p=1}(S;V)}{N} = \frac{1-p}{p}\log_2(1-p), \tag{3.8}$$

and is shown in Figure 1.
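Equation 3.8 is simple enough to evaluate directly; the short check below (an added illustration) reproduces the curve in Figure 1, including the exact one-bit loss at $p = 0.5$:

```python
import numpy as np

def delta_I_const(p):
    """Equation 3.8: information difference per vesicle release between a
    constant-probability synapse (release probability p) and a perfectly
    transmitting one; independent of the Poisson rate r(t)."""
    return (1 - p) / p * np.log2(1 - p)

for p in [0.9, 0.5, 0.1]:
    print(p, delta_I_const(p))
# p = 0.5 gives exactly -1 bit per release; the loss grows as p decreases
```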
[Figure 1 appears here: $\Delta I_{\mathrm{const}}/N$ (bits), from 0 to $-1.5$, plotted against the synaptic failure probability, from 0 to 1.]

Figure 1: Synaptic unreliability decreases information per vesicle release for a constant-probability synapse receiving time-varying Poisson input. The information difference $\Delta I_{\mathrm{const}}/N = I_{\mathrm{const}}/N - I_{p=1}/N$ decreases monotonically with the synaptic failure probability $1 - p$. This result is independent of the Poisson rate $r(t)$.
One might have naively expected the information per transmission to be independent of $p$ for a constant-probability synapse. This would be the case if the information transmitted were proportional to the number of transmissions. Instead, the information per transmission decreases monotonically with the failure probability $1 - p$ (see Figure 1). For example, a constant-probability synapse with $p = 0.5$ transmits 1 bit per transmission less than a perfectly transmitting synapse. Mathematically, the loss of information per vesicle release derives from the contribution of the failures to the noise entropy, expressed by the second term of equation 3.5. Qualitatively, the information per transmission is less for the constant-probability synapse because, in addition to reducing the number of transmissions, there is added uncertainty associated with not knowing how many presynaptic action potentials failed to transmit and which of the many possible arrangements of these untransmitted action potentials was actually present in the presynaptic spike train (a similar situation is described by Shannon & Weaver, 1949).

3.2 Input Generated by a Stationary Renewal Process. For Poisson input, the probability of producing a spike in any given time bin depends on only the rate $r(t)$ and is independent of the spiking activity in any other time bin. I next show that when the locations of spikes depend on the locations of previous spikes, the constant-probability synapse can transmit more information per vesicle release than a perfectly transmitting synapse. To introduce spike-spike dependencies, I model the presynaptic spike train by a
Enhancement of Information Transmission Efficiency
1143
renewal process, which generates successive interspike intervals (ISIs) independently from one another. Furthermore, for tractability of calculation, the analyses presented here are restricted to stationary renewal processes, which are characterized by a time-independent probability density P_ISI(Δ_S) of generating an ISI of duration Δ_S. The stationary renewal process is a generalization of the stationary (or homogeneous) Poisson process. Unlike the homogeneous Poisson process, the stationary renewal process generally has nonzero autocorrelations. Spike train autocorrelations are an indication of redundancy in the neural code (Barlow, 1961). This suggests that synaptic transmission failures might increase the efficiency of information transmission by removing redundancy from the presynaptic spike trains. Below, I derive general expressions for H(V)/N and H(V|S)/N for a constant-probability synapse receiving stationary renewal spike train input. I then apply these results to a model renewal spike train constructed to have nonnegative autocorrelations. The autocorrelation function of the stationary renewal process spike trains is derived in appendix A.

3.2.1 General Results. The probability P(s) of generating a particular n-spike train s from a stationary renewal process is given by

    P(s) = \prod_{i=1}^{n} \bigl[ P_{ISI}(\Delta_{S,i}) \, \delta T \bigr],    (3.9)
where δT gives the precision with which all spike and transmission times are defined, and Δ_{S,i} denotes the duration of the ith ISI. For a synapse with constant vesicle release probability p receiving a spike train generated by a stationary renewal process, the vesicle release train, with intervesicle release interval (IVI) distribution P_IVI(Δ_V), is also a stationary renewal process. This is because the transmission probabilities across the constant-probability synapse do not depend on the locations of spikes previous to the one transmitted. Thus, the probability of generating a particular N-vesicle-release train v is given by

    P(v) = \prod_{i=1}^{N} \bigl[ P_{IVI}(\Delta_{V,i}) \, \delta T \bigr],    (3.10)
where Δ_{V,i} denotes the duration of the ith IVI. The IVI distribution P_IVI(Δ_V) for the constant-probability synapse can be derived from the ISI distribution P_ISI(Δ_S) by the method of Laplace transforms (Cox, 1962; Cox & Lewis, 1966; de la Rocha et al., 2002). The resulting expression, derived in appendix B, is

    P_{IVI}(\Delta_V) = \mathcal{L}^{-1}\!\left\{ \frac{p \, \tilde{P}_{ISI}(u)}{1 - (1-p) \, \tilde{P}_{ISI}(u)} \right\},    (3.11)
where \tilde{P} denotes the Laplace transform of the distribution P, u denotes the argument of the Laplace transform, and \mathcal{L}^{-1}\{\cdot\} indicates the operation of taking the inverse Laplace transform. Inserting equation 3.10 into the definition of the transmission entropy (equation 2.2) and noting that the integral separates into N identical terms gives the transmission entropy per vesicle release H(V)/N as

    \frac{H(V)}{N} = -\int_0^\infty d\Delta_V \, P_{IVI}(\Delta_V) \log_2\bigl(P_{IVI}(\Delta_V)\,\delta T\bigr).    (3.12)

Using equations 3.11 and 3.12, H(V)/N can be calculated for any spike train generated by a stationary renewal process.

I next find a corresponding expression for the noise entropy per release H(V|S)/N. To calculate this quantity, note that the vesicle releases across the constant-probability synapse have no history dependence. This implies that the conditional probability P(v|s) can be written as a product of probabilities Q(b_i|Δ_{S,i}) that the ith spike triggers a vesicle release, b_i = 1, or does not, b_i = 0:

    P(v|s) = \prod_{i=1}^{n} Q(b_i \mid \Delta_{S,i}),    (3.13)

where the probability Q(b|Δ_S) is given by

    Q(b \mid \Delta_S) = \begin{cases} 1-p, & b = 0 \\ p, & b = 1. \end{cases}    (3.14)
Substituting equation 3.13 into the definition of the noise entropy (equation 2.2) gives the noise entropy per spike H(V|S)/n,

    \frac{H(V|S)}{n} = -\int_0^\infty d\Delta_S \, P_{ISI}(\Delta_S) \sum_{b=0,1} Q(b|\Delta_S) \log_2\bigl(Q(b|\Delta_S)\bigr).    (3.15)

Converting this to a noise entropy per vesicle release H(V|S)/N by dividing by the probability of vesicle transmission p gives, after substitution for Q(b|Δ_S) from equation 3.14,

    \frac{H(V|S)}{N} = -\left[ \log_2 p + \frac{(1-p)}{p} \log_2(1-p) \right].    (3.16)

This expression is independent of the ISI distribution P_ISI(Δ_S) and is identical to that derived previously for the Poisson spike train, equation 3.5. From the general expressions given above for H(V)/N (equation 3.12, using P_IVI(Δ_V) from equation 3.11) and H(V|S)/N (equation 3.16), the information per vesicle release I(S;V)/N can be obtained for any spike train defined by a stationary renewal process.
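Two quick sanity checks on these general results, written here as a reader's sketch rather than taken from the text. First, equation 3.16 is just the per-spike binary entropy H2(p) divided by the transmission probability p, which makes its independence from P_ISI transparent. Second, for homogeneous Poisson input, \tilde{P}_ISI(u) = r/(u + r), and equation 3.11 reduces to P_IVI(Δ_V) = pr e^{−prΔ_V}: thinning a Poisson train with probability p yields a Poisson train of rate pr. The parameter values below are illustrative:

```python
import random
from math import log2

# --- check 1: eq. 3.16 equals binary entropy per spike divided by p ---
def H2(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.1, 0.37, 0.5, 0.8, 0.99):
    eq_3_16 = -(log2(p) + (1 - p) / p * log2(1 - p))
    assert abs(eq_3_16 - H2(p) / p) < 1e-12

# --- check 2: eq. 3.11 for Poisson input gives exponential IVIs of rate p*r ---
random.seed(0)
r, p = 20.0, 0.4                       # illustrative rate and release probability
ivis, acc = [], 0.0
while len(ivis) < 50_000:
    acc += random.expovariate(r)       # presynaptic ISI
    if random.random() < p:            # constant-probability release
        ivis.append(acc)
        acc = 0.0

mean = sum(ivis) / len(ivis)
var = sum((x - mean) ** 2 for x in ivis) / len(ivis)
assert abs(mean - 1 / (p * r)) < 0.005        # exponential mean 1/(p r)
assert abs(var ** 0.5 / mean - 1.0) < 0.03    # CV = 1 for an exponential
```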
3.2.2 Input with Nonnegative Autocorrelations That Decay Exponentially. In this section, the general expressions derived above are evaluated for a stationary renewal spike train of average rate r and with positive autocorrelations of the form C(τ) = A e^{−|τ|/τ_corr}. This example allows independent study of the effects of changing the average rate r, the magnitude of autocorrelations A (normalized to be independent of r), and the exponential decay time of the autocorrelations τ_corr. In appendix A, I show that autocorrelations of this form are generated when the ISI distribution is given by (Cox, 1962; Cox & Lewis, 1966; de la Rocha et al., 2002)

    P_{ISI}(\Delta_S) = \alpha \lambda_- e^{-\lambda_- \Delta_S} + (1-\alpha) \lambda_+ e^{-\lambda_+ \Delta_S},    (3.17)

with

    \lambda_\pm = K \pm \sqrt{K^2 - r \tau_{corr}^{-1}},    (3.18)

    \alpha = \frac{\tau_{corr}^{-1} - \lambda_-}{\lambda_+ - \lambda_-},    (3.19)

and where K ≡ [r(1 + A) + τ_corr^{−1}]/2. P_ISI(Δ_S) above corresponds to choosing each ISI randomly from a homogeneous Poisson distribution of either rate λ_− (with probability α) or rate λ_+ (with probability 1 − α). The inverse relations for r, A, and τ_corr as functions of α, λ_−, and λ_+ are given in appendix A.

To determine I(S;V)/N from equations 3.12 and 3.16, one need only calculate P_IVI(Δ_V). For the constant-probability synapse, P_IVI(Δ_V) can be obtained by the following trick. Because the constant-probability synapse has no history dependence, vesicle release failures at the constant-probability synapse are equivalent to failures to receive a spike. Therefore, the vesicle release train produced by a constant-probability synapse with release probability p receiving input of average rate r is statistically identical to that produced by a reliable synapse receiving input of average rate pr. This implies that P_IVI(Δ_V) can be determined from P_ISI(Δ_V) by replacing r in equations 3.17 through 3.19 by pr.

The difference ΔI_const/N between I(S;V)/N for the unreliable (p < 1) and perfectly transmitting (p = 1) synapses is shown in Figure 2. When the input spike train has no autocorrelations (A = 0; see Figure 2, dashed lines), the input is a homogeneous Poisson process, and ΔI_const/N is identical to that found in section 3.1 (see Figure 1). When the spike trains have sufficiently strong autocorrelations (see Figure 2, A = 5 and rτ_corr ≥ 3), the constant-probability synapse can transmit more information per vesicle release than a reliable synapse. This reflects that when inputs are generated by a non-Poisson process, the location of
Figure 2: Information per vesicle release difference ΔI_const/N = I(S;V)/N − I_{p=1}(S;V)/N (in bits) for a constant-probability synapse receiving input generated by the stationary renewal process of equation 3.17. Columns correspond to inputs with different decay times of positive autocorrelations, measured relative to the spike rate r (rτ_corr = 0.5, 3, and 15). Individual traces give ΔI_const/N as a function of failure probability for the autocorrelation magnitudes A = 0, 1, and 5. ΔI_const/N tends to be negative except when the input is strongly correlated and the transmission probability is very low.
one spike is correlated with the location of other spikes. Observation of a vesicle release therefore not only identifies the location of the spike that caused the transmission but also partially identifies the locations of correlated spikes. When the spikes are sufficiently strongly autocorrelated, the extra information gained about the locations of nontransmitted spikes can outweigh the noisiness of the constant-probability synapse. In this case, ΔI_const/N can be greater than zero. However, the positive information differences tend to be small and occur at only very low transmission probabilities (see Figure 2: <7% transmitted for A = 5, rτ_corr = 3; <3% transmitted for A = 5, rτ_corr = 15). Thus, the information per unit time transmitted by such a synapse would be quite low in this regime (see section 6).

The constant-probability synapse has the same autocorrelation function C(τ) as the presynaptic spike train and has maximal noise entropy for a given number of releases. Therefore, positive information differences are attributable to increases in transmission entropy as the transmission rate becomes lower, rather than to changes in the statistical structure of the releases. For the same statistical structure, lower transmission rates tend to increase the entropy per transmission because transmissions are less probable, and thus more surprising, at lower rates (for a concrete example, consider the low-rate limit of equation 3.4 for the Poisson process). As I will show, positive information differences for positively autocorrelated spike trains can be much more dramatic when the synapse has the dynamics of synaptic depression.

4 Information Transmission for a Stochastic Depressing Synapse

4.1 Model of Synaptic Depression. I next calculate the information transmitted across a stochastic model synapse with failures of transmission resulting from post-vesicle-release refractoriness. The model is intended to capture, in an analytically tractable way, both the stochasticity of synaptic transmission and the essential dynamics of depressing synapses, in which the average transmission amplitude falls dramatically following a presynaptic spike before recovering over a characteristic timescale (Abbott et al., 1997; Tsodyks & Markram, 1997; Varela et al., 1997). I assume a model synapse with a single active zone containing a single vesicle release site. The synapse transmits perfectly when a vesicle is present at the release site and does not transmit when the vesicle is absent (a condition I will refer to as the depleted state). Following vesicle release, the synapse becomes refractory until the vesicle is replenished. Vesicle replenishment is governed by a homogeneous Poisson process of rate 1/τ_D (for similar models, see Vere-Jones, 1966; Goldman, 2000; Matveev & Wang, 2000). The simplifying assumption that the synapse either transmits perfectly (when the vesicle is available) or is refractory (when the vesicle is unavailable) makes the calculation of information transmission for this model analytically tractable. The ensemble-averaged behavior of this synapse is that the transmission probability drops to zero following release and then recovers exponentially back to one with a time constant τ_D. Because the synaptic transmission probability drops all the way to zero following release, this model represents a very strongly depressing synapse.
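The model just described is easy to simulate. The sketch below (with illustrative parameter values, not taken from the text) tracks a single vesicle whose replenishment time is exponential with mean τ_D and feeds the synapse Poisson input at rate r. Averaging the recovery probability 1 − e^{−Δ/τ_D} over exponential ISIs gives an average transmission probability of 1/(1 + rτ_D), which the simulation should reproduce:

```python
import random

random.seed(1)
r, tau_D = 10.0, 0.1        # illustrative input rate and recovery time constant
n_spikes = 200_000

t, recover_at, released = 0.0, 0.0, 0
for _ in range(n_spikes):
    t += random.expovariate(r)              # next presynaptic spike (Poisson input)
    if t >= recover_at:                     # vesicle present: release and deplete
        released += 1
        recover_at = t + random.expovariate(1.0 / tau_D)  # replenishment, rate 1/tau_D

p_hat = released / n_spikes
# average transmission probability from integrating r e^{-r D}(1 - e^{-D/tau_D}) dD:
p_bar = 1.0 / (1.0 + r * tau_D)
assert abs(p_hat - p_bar) < 0.01
```

Because vesicle replenishment is memoryless, keeping a single absolute `recover_at` time is equivalent to redrawing the recovery clock after each failed spike.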
Failures of transmission by the model depressing synapse result exclusively from vesicle depletion and are governed by the dynamics of vesicle replenishment. This contrasts with the failures of transmission by the constant-probability synapse, which result from a purely static probability of vesicle release failure. Studying these two types of synapses allows a comparison between the effects of stochasticity in the presence or absence of synaptic dynamics.

4.2 Input Generated by a Stationary Renewal Process. In this section, the calculations of section 3.2 are repeated for the model depressing synapse. As in that section, general results are first given for an arbitrary stationary renewal process and then applied to the autocorrelated renewal process described in section 3.2.2.

4.2.1 General Results. The calculations of H(V)/N and H(V|S)/N for the depressing synapse follow closely those for the constant-probability
synapse (section 3.2). This is a consequence of the simplified model of synaptic depression, as described below. I consider a stationary renewal process with P(s) defined as in equation 3.9. Because the ISIs are independent and the depressing synapse is in an identical (depleted) state after each transmission, the IVIs are independent as well. This implies that P(v) takes the form of a product, as in equation 3.10, and that H(V)/N assumes the form given in equation 3.12. The only difference between H(V)/N for the constant-probability and model depressing synapses is in the form of the IVI distribution function P_IVI(Δ_V). This can be found for the depressing synapse by the method of Laplace transforms (Cox, 1962; Cox & Lewis, 1966; de la Rocha et al., 2002). The calculation is outlined in appendix B. The resulting distribution is

    P_{IVI}(\Delta_V) = \mathcal{L}^{-1}\!\left\{ \frac{\tilde{P}_{ISI}(u) - \tilde{P}_{ISI}(u + \tau_D^{-1})}{1 - \tilde{P}_{ISI}(u + \tau_D^{-1})} \right\}.    (4.1)
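As a reader's check on equation 4.1, consider homogeneous Poisson input, \tilde{P}_ISI(u) = r/(u + r). The inverse transform then works out to the convolution of two exponentials: after a release, the IVI is an exponential recovery time (mean τ_D) plus an exponential wait for the next spike (mean 1/r). The short simulation below (illustrative parameters) confirms the implied mean and variance of the IVI distribution:

```python
import random

random.seed(3)
r, tau_D = 10.0, 0.1          # illustrative Poisson input rate and recovery constant
n_rel = 50_000

ivis, t, recover_at, last_rel = [], 0.0, 0.0, 0.0
while len(ivis) < n_rel:
    t += random.expovariate(r)            # next presynaptic spike
    if t >= recover_at:                   # vesicle available: release
        ivis.append(t - last_rel)
        last_rel = t
        recover_at = t + random.expovariate(1.0 / tau_D)

mean = sum(ivis) / len(ivis)
var = sum((x - mean) ** 2 for x in ivis) / len(ivis)
# sum of independent exponentials: mean tau_D + 1/r, variance tau_D^2 + 1/r^2
assert abs(mean - (tau_D + 1 / r)) < 0.005
assert abs(var - (tau_D ** 2 + 1 / r ** 2)) < 0.002
```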
The noise entropy per spike H(V|S)/n is also of the same form as that for the constant-probability synapse (see equation 3.15). This is because the depressing synapse is in an identical (depleted) state not only following each transmission, but also following each spike: if the synapse is depleted just before the spike, it remains depleted; if the synapse is not depleted before the spike, it releases and becomes depleted. As a consequence, the conditional probabilities P(v|s) factor into a product as in equation 3.13, leading to H(V|S)/n as in equation 3.15. The difference from the constant-probability synapse is in the expression for the probability Q(b|Δ_S). For the depressing synapse, Q(b|Δ_S) is given by the exponential probability that the synapse recovers during the ISI Δ_S:

    Q(b \mid \Delta_S) = \begin{cases} e^{-\Delta_S/\tau_D}, & b = 0 \\ 1 - e^{-\Delta_S/\tau_D}, & b = 1. \end{cases}    (4.2)
H(V|S)/n can be converted to a noise entropy per transmission H(V|S)/N by dividing by the average probability of vesicle transmission \bar{p},

    \bar{p} = \frac{N}{n} = \int_0^\infty d\Delta_S \, P_{ISI}(\Delta_S) \, Q(1 \mid \Delta_S),    (4.3)

giving

    \frac{H(V|S)}{N} = -\frac{\int_0^\infty d\Delta_S \, P_{ISI}(\Delta_S) \sum_{b=0,1} Q(b|\Delta_S) \log_2\bigl(Q(b|\Delta_S)\bigr)}{\int_0^\infty d\Delta_S \, P_{ISI}(\Delta_S) \, Q(1|\Delta_S)}.    (4.4)

The information per vesicle release I(S;V)/N is the difference between H(V)/N (equation 3.12, using P_IVI(Δ_V) from equation 4.1) and H(V|S)/N (equation 4.4).
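The pipeline above can be exercised end to end for the two-rate mixture input of equation 3.17. The sketch below (illustrative parameter values) maps (r, A, τ_corr) to (α, λ±) via equations 3.18 and 3.19, evaluates the average transmission probability of equation 4.3 in closed form for the depressing synapse, and checks both the implied mean rate and a Monte Carlo estimate of \bar{p}. Drawing each spike's transmission independently with probability Q(1|Δ_S) is valid here because exponential replenishment is memoryless:

```python
import math
import random

random.seed(2)
r, A, tau_corr = 10.0, 1.0, 0.3       # illustrative input parameters
tau_D = 0.1                            # illustrative depression time constant

# equations 3.18 and 3.19
K = (r * (1 + A) + 1 / tau_corr) / 2
root = math.sqrt(K ** 2 - r / tau_corr)
lam_p, lam_m = K + root, K - root
alpha = (1 / tau_corr - lam_m) / (lam_p - lam_m)

# consistency: the mixture's mean ISI must equal 1/r
assert abs(1 / (alpha / lam_m + (1 - alpha) / lam_p) - r) < 1e-9

# equation 4.3 in closed form with Q(1|D) = 1 - exp(-D/tau_D)
p_bar = 1 - (alpha * lam_m / (lam_m + 1 / tau_D)
             + (1 - alpha) * lam_p / (lam_p + 1 / tau_D))

# Monte Carlo: draw ISIs from the mixture; each spike transmits with Q(1|ISI)
n, released = 200_000, 0
for _ in range(n):
    lam = lam_m if random.random() < alpha else lam_p
    isi = random.expovariate(lam)
    if random.random() < 1 - math.exp(-isi / tau_D):
        released += 1
assert abs(released / n - p_bar) < 0.01
```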
Figure 3: Information per vesicle release difference ΔI_dep/N = I(S;V)/N − I_{p=1}(S;V)/N (in bits) for the model depressing synapse receiving input generated by the stationary renewal process of equation 3.17. Graphs are organized as in Figure 2, with traces for the autocorrelation magnitudes A = 0, 1, and 5 and columns for rτ_corr = 0.5, 3, and 15. ΔI_dep/N is positive for a wide range of parameters when the input has nonzero autocorrelations.
4.2.2 Input with Nonnegative Autocorrelations That Decay Exponentially. For positively correlated inputs, I(S;V)/N can be much larger for the depressing synapse than for a reliable (p = 1) synapse. I demonstrate this by evaluating the general results derived above for the exponentially correlated renewal process (see section 3.2.2, equations 3.17 through 3.19). P_IVI(Δ_V) and \bar{p} for this process are calculated in appendix B. Here, I give only the resulting information differences ΔI_dep/N = I(S;V)/N − I_{p=1}(S;V)/N between the depressing and reliable synapses.

Figure 3 shows ΔI_dep/N as a function of the input parameters and average synaptic failure probability 1 − \bar{p}. ΔI_dep/N > 0 over a wide range of parameter values, indicating that the depressing synapse can transmit more information per vesicle release than a reliable synapse. This contrasts with the corresponding result for the constant-probability synapse (see Figure 2), in which positive information differences occur only for extremely low transmission probabilities.

For uncorrelated input, the depressing synapse always transmits less information per vesicle release than the perfectly transmitting synapse (ΔI_dep < 0; see Figure 3, dashed lines). This is to be expected because for uncorrelated input, the perfectly transmitting synapse has both maximal total transmission entropy (for a given number of transmissions) and no noise entropy. However, for correlated input, the perfectly transmitting synapse has nonmaximal total transmission entropy and may transmit less information per
vesicle release than the depressing synapse (ΔI_dep > 0). The positive information differences tend to increase with the magnitude of autocorrelations A (see Figure 3, thin and thick solid lines). Large increases in ΔI_dep occur around values of \bar{p} where the correlation and depression timescales match, τ_D = τ_corr (these values are indicated by the arrows in Figure 4B), and tend to increase more rapidly for higher values of τ_corr (compare the three graphs in Figure 3). Because higher values of τ_D correspond to higher failure probabilities (see equation B.13), the rapid increases in ΔI_dep shift to higher failure probabilities when τ_corr is increased.

5 Comparison of the Information Transmitted Across the Constant-Probability and Depressing Synapses

The information differences between the depressing and reliable synapses reflect differences in both average transmission probability and synaptic dynamics. To isolate the effects of synaptic dynamics, I next compare the information transmitted by the depressing synapse to that transmitted by a constant-probability synapse with the same average transmission probability. The nonnegatively autocorrelated renewal spike train is considered below. This is followed by a qualitative analysis that enables the information transmission process to be understood in a framework that is more generally applicable to non-analytically-tractable studies.

5.1 Input with Nonnegative Autocorrelations That Decay Exponentially. A comparison of Figures 2 and 3 reveals that the information transmitted by the depressing synapse, I_dep(S;V), exceeds that transmitted by the constant-probability synapse, I_const(S;V), when the depressing and constant-probability synapses have the same average transmission probability (i.e., when \bar{p} = p, so that the average number of transmissions N is the same for both synapses).
Figure 4A shows this explicitly, plotting ΔI_dep−const/N = I_dep(S;V)/N − I_const(S;V)/N as a function of the input parameters and average failure probability. For all parameter values tested, ΔI_dep−const/N ≥ 0. This demonstrates that the depressing synapse transmits more information about these trains than does the constant-probability synapse.

The larger transmission of information per vesicle release by the depressing synapse, compared to the constant-probability synapse, primarily reflects a decrease in the noise entropy H(V|S)/N (see Figure 4C) rather than an increase in the total entropy H(V)/N (see Figure 4B). This is surprising in light of previous information studies (Srinivasan, Laughlin, & Dubs, 1982; Barlow & Foldiak, 1989; Dan et al., 1996; de la Rocha et al., 2002; Goldman et al., 2002; Wang, Liu, Sanchez-Vives, & McCormick, 2003) that have primarily used decorrelation, or maximization of H(V) for a fixed number of transmissions, as a measure of information transmission efficiency. For uncorrelated input, the constant-probability synapse has maximal total entropy per release H(V)/N, reflecting that the transmissions are completely
Figure 4: Synaptic depression increases the information per vesicle release relative to a constant-probability synapse with the same average transmission probability. (A) Information per vesicle release difference ΔI_dep−const/N = I_dep/N − I_const/N (in bits) between the depressing and constant-probability synapses, as a function of failure probability. Graphs are organized as in Figures 2 and 3. (B) The corresponding response entropy difference ΔH_dep−const(V)/N = H_dep(V)/N − H_const(V)/N is negative for uncorrelated input (dashed line) but can be positive for correlated input (solid lines). Peak positive values are located near where τ_D = τ_corr (indicated by arrows). These results reflect the autocorrelation structure of the transmissions (panel D). (C) The noise entropy difference ΔH_dep−const(V|S)/N = H_dep(V|S)/N − H_const(V|S)/N is negative for all parameter values and is larger in magnitude than ΔH_dep−const(V)/N. (D) Autocorrelations C(τ) of the vesicle releases across a constant-probability (left) and depressing (right) synapse receiving uncorrelated input, plotted for τ from 0 to 500 ms. Each synapse transmitted 50% of the presynaptic spikes. The depressing synapse transmits more information per vesicle release than the constant-probability synapse even though the transmissions across the constant-probability synapse are uncorrelated.
uncorrelated (see Figure 4D, left). By contrast, the depressing synapse has nonzero autocorrelations (see Figure 4D, right). Thus, for a given failure probability, the constant-probability synapse has larger total entropy per release than the depressing synapse in this case (ΔH_dep−const(V)/N < 0; see Figure 4B, dashed lines). Nevertheless, the depressing synapse transmits more information per release (ΔI_dep−const/N > 0; see Figure 4A, dashed lines), reflecting that the relative decrease in total entropy for the depressing synapse is more than compensated for by the relative decrease in noise entropy (ΔH_dep−const(V|S)/N < 0; see Figure 4C, dashed lines).

The noise entropy per release is always larger for the constant-probability synapse because a synapse with a constant probability of transmission is maximally noisy in the sense that all spikes are transmitted with equal probability. The depressing synapse is less noisy because it has a nonuniform probability of transmission: spikes that follow long ISIs are transmitted more reliably, and spikes that follow short ISIs are transmitted less reliably. For correlated input, the noise entropy differences ΔH_dep−const(V|S)/N are even larger in magnitude (see Figure 4C, solid lines). This is because the positively correlated input tends to have clusters of spikes that enhance the nonuniformity of transmission probabilities across the depressing synapse: the first spike in a cluster is transmitted with high probability, whereas the later spikes are not. Moreover, ΔH_dep−const(V)/N can be greater than zero when the input has positive autocorrelations (see Figure 4B, solid lines). This reflects the decorrelation of the positively correlated spike trains by the depressing synapse and, as noted in previous work (Goldman et al., 2002), tends to be maximal when τ_D = τ_corr (indicated by the arrows in Figure 4B).

5.2 Qualitative Analysis.
The previous sections give insight into how the total entropies compare for the depressing and constant-probability synapses and how the noise entropies compare. In some cases, such as for uncorrelated input (see Figure 4, dashed lines), both the total entropy and the noise entropy are smaller for the depressing synapse than for the constant-probability synapse. To understand why the depressing synapse transmits more information per vesicle release (ΔI_dep−const/N > 0) in these cases therefore requires the relatively nonintuitive assessment of whether the difference in total entropy, ΔH_dep−const(V)/N, or the difference in noise entropy, ΔH_dep−const(V|S)/N, is larger in magnitude. However, as demonstrated below, it becomes obvious that ΔI_dep−const/N > 0 for uncorrelated input if the information is expressed instead as

    I(S;V) = H(S) - H(S|V)    (5.1)
           = -\sum_s P(s) \log_2(P(s)) + \sum_v P(v) \sum_s P(s|v) \log_2(P(s|v)).    (5.2)
When expressed in this manner, I(S;V) gives the amount by which the uncertainty in the presynaptic spike train ensemble S can be reduced by observation of the transmissions. The spike train entropy H(S) gives the uncertainty in the spike train when the transmissions are not known. The conditional spike train entropy H(S|V) gives the uncertainty remaining after the transmissions have been observed. The advantage of writing I(S;V) in this form is that H(S) is independent of the synaptic properties. Thus, the difference in the information transmitted by two synapses about a given spike train ensemble S is determined solely by differences in H(S|V); that is, ΔI_dep−const(S;V) = −ΔH_dep−const(S|V).

H(S|V) is given by an average over transmission trains v of the entropy term h(S|v) = −Σ_s P(s|v) log2(P(s|v)). To understand the qualitative behavior of H(S|V), I approximate the average over transmission trains by considering a typical set of transmissions (see Figure 5, v). I then compare h(S|v) for the constant-probability and depressing synapses by estimating the probabilities P(s|v) of various presynaptic spike trains (see Figure 5, s1, s2, and s3). More nonuniform distributions P(s|v) correspond to lower h(S|v) and therefore to larger I(S;V).

Figure 5A illustrates the information transmission process when the presynaptic spike train is generated by a homogeneous Poisson process. Consider a set of transmissions that could be produced by either the constant-probability or the depressing synapse (see Figure 5, v). P(s|v) is nonzero only for presynaptic spike trains with spikes occurring at the times of transmission. I consider three such trains (see Figure 5, s1, s2, and s3). Because these trains have equal numbers of spikes, they have equal prior probabilities P(s) of being generated by the homogeneous Poisson process.
For the constant-probability synapse, the posterior probabilities P(s|v) are also equal (see Figure 5A, top right) because each spike in the presynaptic train has an equal probability of transmission. By contrast, these trains have nonuniform probabilities of being transmitted by the depressing synapse (see Figure 5A, bottom right). Spikes located far away from a previous transmission are more likely to be transmitted by the depressing synapse. Therefore, the spike trains most likely to be consistent with the shown transmissions are those that do not have any nontransmitted spikes located far away from the first transmission. This implies that P(s|v) is lowest for train s3 and greatest for train s1. This pictorial argument suggests that P(s|v) is more nonuniform for the depressing synapse than for the constant-probability synapse, and therefore I(S;V) is greater for the depressing synapse.

The nonuniformity of P(s|v) would be expected for a synapse with any form of dynamics. Thus, this result is not particular to depressing synapses but applies more generally to synapses with dynamic vesicle release probabilities. Qualitatively, synaptic dynamics allow information to be obtained not only about presynaptic spikes occurring at the
Figure 5: Qualitative comparison of the information transmitted across a depressing and a constant-probability synapse with the same average transmission probability. The information transmitted corresponds to the ability to identify which of the many possible presynaptic trains s gives rise to an observed set of transmissions v. (A) The depressing synapse transmits more information about a Poisson spike train than does the constant-probability synapse. P(s|v) is uniform for the constant-probability synapse (top right) but nonuniform for the depressing synapse (bottom right). This reflects that spikes located far from the previous transmission are more likely to be transmitted by the depressing synapse. (B) The depressing synapse is particularly well tuned for carrying information about positively correlated trains. Correlated spike trains tend to have clusters of spikes (top left) that lead to a strongly nonuniform P(s|v) for the depressing synapse. In both A and B, we assume that the shown trains are typical and have equal prior probabilities P(s). In the top right panels, P(s) has been scaled to the same height as P(s|v) for the constant-probability synapse. The arrow labeled τ_D indicates the time constant of depression. See the text for details.
times of transmission but also about spikes occurring when there are no transmissions.

For correlated spike trains, the form of the dynamics is important. Spike trains generated by a positively correlated process tend to have clustered spikes. Consider a typical train of transmissions v and three typical presynaptic trains with clustered spikes (see Figure 5B, left). For simplicity, I assume that these spike trains have equal prior probabilities P(s) and do not consider less clustered spike trains that would have lower prior probabilities. By the same arguments made for uncorrelated input, the constant-probability synapse has a uniform distribution P(s|v) over these trains. The depressing synapse, however, has a highly nonuniform distribution because the clustering of spikes makes it relatively likely that trains will have spikes located either very far away from previous transmissions (like s1 and s2) or very close to the previous transmission (like s3). This gives rise to a much more nonuniform distribution P(s|v) than was found for uncorrelated input (see Figure 5B, left). This suggests why ΔI_dep−const/N tends to be larger for correlated input than for uncorrelated input (see Figure 4A).

The result for correlated input does depend on the form of the dynamics. For example, a facilitating synapse would differ from the depressing synapse in that it would be more likely, for a fixed number of transmissions, to transmit multiple spikes from the same cluster and no spikes from other clusters. Information about the nontransmitted clusters would be lost, and the representation of the transmitted clusters would be redundant. This suggests that a facilitating synapse would be less efficient at transmitting these trains than a depressing synapse.
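The pictorial argument can be turned into a small exact calculation. The toy model below is the reader's, not the article's: spikes fall independently in four time bins, the history-dependent synapse is an extreme deterministic caricature of depression (it transmits a spike only if it did not release in the previous bin, so its noise entropy is zero), and the constant-probability synapse is given the same mean number of releases. Enumerating all trains shows the history-dependent synapse transmits more bits, as the argument above predicts:

```python
from itertools import product
from math import log2
from collections import defaultdict

T, q = 4, 0.5                      # four bins, iid spike probability q per bin

def prior(s):
    return q ** sum(s) * (1 - q) ** (T - sum(s))

def entropy(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

def depress(s):
    """Deterministic caricature of depression: transmit a spike only if
    the synapse did not release in the previous bin."""
    v, prev = [], 0
    for b in s:
        out = 1 if (b and not prev) else 0
        v.append(out)
        prev = out
    return tuple(v)

# history-dependent synapse: deterministic, so H(V|S) = 0 and I = H(V)
P_v_dep, mean_rel = defaultdict(float), 0.0
for s in product((0, 1), repeat=T):
    v = depress(s)
    P_v_dep[v] += prior(s)
    mean_rel += prior(s) * sum(v)
I_dep = entropy(P_v_dep)

# constant-probability synapse with the same mean number of releases
p = mean_rel / (q * T)
P_v, H_VgS = defaultdict(float), 0.0
for s in product((0, 1), repeat=T):
    for v in product((0, 1), repeat=T):
        if any(vi and not si for vi, si in zip(v, s)):
            continue               # releases can occur only where there are spikes
        pv_s = p ** sum(v) * (1 - p) ** (sum(s) - sum(v))
        P_v[v] += prior(s) * pv_s
        if pv_s > 0:
            H_VgS -= prior(s) * pv_s * log2(pv_s)
I_const = entropy(P_v) - H_VgS

assert I_dep > I_const             # history dependence transmits more bits
```

With these choices the mean release counts match by construction, so the comparison isolates the effect of history dependence, in the spirit of section 5.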
6 Discussion

I have studied how the information transmission across a synapse depends on the correlation structure of the presynaptic spike trains, the stochastic nature of synaptic transmission, and the dynamics of synaptic depression. In summary, when spikes are generated by an uncorrelated process, synaptic transmission failures reduce the information per vesicle release transmitted across a synapse. Intuitively, this is because the input has no redundancy that can be reduced by synaptic failures. For correlated input, synaptic failures can increase the information transmitted per vesicle release. Information efficiency increases are limited for the constant-probability synapse but can be large for a depressing synapse. For the nonnegatively autocorrelated inputs considered here, the information transmitted by the depressing synapse exceeds that transmitted across a constant-probability synapse with the same average transmission probability.

I have used information per vesicle release as the measure of synaptic transmission efficiency. This measure is easily interpretable and allows a comparison between synapses that is independent of the precision δT with which spike times are specified (see section 2). An artifact of the use of this
measure of efficiency is that in some cases, its maximum value occurs when the transmission probability equals zero (see Figures 2 and 3). This suggests that other information-maximization measures might be more suitable in the low-transmission-probability regime. For example, maximizing information per unit time favors perfect transmission. Maximizing information per energy usage (Laughlin, de Ruyter van Steveninck, & Anderson, 1998; Balasubramanian et al., 2001; Levy & Baxter, 2002; Schreiber et al., 2002) similarly tends to avoid very low transmission probabilities because of the costs of transmission-independent energy expenditures.

Many studies of information transmission in sensory areas focus on the role of decorrelation in maximizing information transmission (Srinivasan et al., 1982; Barlow & Foldiak, 1989; Dan et al., 1996; Goldman et al., 2002; Wang et al., 2003). Decorrelation corresponds to a maximization of the response entropy that, when the noise entropy is small, also corresponds to an approximate maximization of the information transmitted. However, the results here show that the noise entropy cannot be neglected. Rather, changes in the noise entropy are the dominant contributor to the enhanced performance of the depressing synapse over the constant-probability synapse. The constant-probability synapse is maximally noisy because it transmits all spikes with equal probability. By contrast, the synaptic failures across the depressing synapse are much more predictable, with higher probabilities of failure following short ISIs and lower probabilities following long ISIs.

The noise entropy is likely to be smaller for many synaptic transmission situations whose behavior is more complex than those modeled here. For example, noise in the transmission process can be reduced if multiple independently releasing active zones, in either the same or different synapses, receive the same input (de Ruyter van Steveninck & Laughlin, 1996; Manwani & Koch, 2000).
In such cases, maximization of the total transmission entropy (see Figure 4B) would assume a relatively larger role in maximizing the information transmission. In the limit that the total transmission entropy dominates, the result of maximizing the information per transmission for a given fraction of transmissions would be expected to approach that found in previous decorrelation studies (Goldman et al., 2002). However, the dominance of the noise entropy in the examples studied here suggests that for many situations, the noise entropy is a significant factor that cannot be neglected. This is consistent with studies of other adaptation (Wainwright, 1999; Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001) or refractory (Berry & Meister, 1998) processes that implicate a significant role for noise reduction in the information transmission process. Direct numerical calculations of information (de Ruyter van Steveninck et al., 1997; Strong et al., 1998) explicitly account for noise entropies but are limited in practice by the need to use either large time bins or brief duration word sizes. These constraints can be problematic when considering spike trains that may have high temporal precision but long timescales of
Enhancement of Information Transmission Efficiency
autocorrelations. I handle this problem by analytically solving the synaptic information transmission problem for simplified models of positively correlated spike trains and depressing synapses. Attempts to include more realistic features of spike trains and synapses in calculation of the Shannon information have worked in limits that consider only total transmission entropies (Goldman et al., 2002; de la Rocha et al., 2002), or used approximate but more easily calculated measures of the information transmission (Fuhrmann et al., 2002). The ideal measure of information transmission efficacy must consider the synapse in the context of the whole neuron or network. Calculations of the information that a postsynaptic neuron's spike train conveys about a particular presynaptic input have thus far been limited to constant-transmission-probability models of stochastic synapses (Zador, 1998; Manwani & Koch, 2001) or have ignored synaptic stochasticity altogether (London, Schreibman, Hausser, Larkum, & Segev, 2002). An interesting open question is how synaptic dynamics and stochasticity may be incorporated into such models of information transmission by the postsynaptic neuron. Because the calculations here and elsewhere are applicable only to limited situations, qualitative analysis provides a valuable tool for gaining insight into the synaptic transmission process. The framework presented here formulates the information transmission process as a stimulus reconstruction problem in which the likelihoods of different presynaptic spike trains are estimated based on a given set of transmissions. This analysis, for example, shows clearly how synaptic dynamics can increase the information transmitted across a synapse by providing information about the locations of spikes that are not themselves transmitted (see Figure 5).
For uncorrelated input (see Figure 5A), the presence of synaptic dynamics, rather than the exact form of the dynamics, was seen to be the essential feature that leads to increases in information transmission. For correlated input, the dynamics of synaptic depression were seen to be particularly well adapted to transmitting information about positively correlated inputs. In both examples, general insights were drawn without requiring a detailed mathematical model of the synapse. I expect that this framework will provide similar insights for systems with different spike train autocorrelations and different synaptic dynamics.
Appendix A: Autocorrelation Function of a Stationary Renewal Process Spike Train

Here, the normalized autocorrelation function C(τ) of a stationary renewal process spike train is related to the ISI distribution P_ISI(ΔS) from which the spike train was generated. The normalized autocorrelation function C(τ) for a point-process spike train s(t) is defined by

C(\tau) \equiv \frac{\langle s(t)\, s(t+\tau) \rangle - r^2}{r^2},  (A.1)
where ⟨·⟩ denotes ensemble averaging. Following the notation of Cox and Lewis (1966), C(τ) can be expressed as

C(\tau) = \frac{m_f(\tau) - r}{r},  (A.2)
where m_f(τ) gives the probability density that some pair of events is separated by a time interval τ. m_f(τ) is expressed as a sum of probabilities f_i(τ) that the ith ISI is in the interval (τ, τ + dτ),

m_f(\tau) = \sum_{i=1}^{\infty} f_i(\tau).  (A.3)
For a renewal process, f_i(τ) is given by convolving the ISI distribution P_ISI(ΔS) with itself i times. Laplace transforming the above equation thus gives

\tilde{m}_f(u) = \sum_{i=1}^{\infty} \left[ \tilde{P}_{ISI}(u) \right]^i = \frac{\tilde{P}_{ISI}(u)}{1 - \tilde{P}_{ISI}(u)}.  (A.4)
Inverse Laplace transforming and substituting into equation A.2 gives C(τ). For the ISI distribution given in equation 3.17 with Laplace transform given by equation B.4, substitution into equation A.4 gives exponentially decaying autocorrelations of the form C(τ) = A e^{-τ/τ_corr}, with A, τ_corr, and the average rate r given by

A = \frac{\alpha (1-\alpha) (\lambda_- - \lambda_+)^2}{\lambda_- \lambda_+}  (A.5)

\tau_{corr} = \frac{1}{(1-\alpha)\lambda_- + \alpha\lambda_+}  (A.6)

r = \frac{\lambda_- \lambda_+}{(1-\alpha)\lambda_- + \alpha\lambda_+}.  (A.7)
It can be shown that this ISI distribution can assume any coefficient of variation between one and infinity, depending on the choice of the distribution parameters (Cox, 1962).
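As a numerical sanity check (a sketch, not part of the paper), equation A.7 and the statement about the coefficient of variation can be verified by sampling ISIs from the two-exponential mixture P_ISI(ΔS) = αλ₋e^{-λ₋ΔS} + (1 - α)λ₊e^{-λ₊ΔS}, whose Laplace transform is equation B.4. The parameter values below are arbitrary illustrative choices:

```python
import numpy as np

# Sketch (not from the paper): check equation A.7 by sampling ISIs from the
# two-exponential mixture P_ISI(dS) = a*lm*exp(-lm*dS) + (1-a)*lp*exp(-lp*dS),
# whose Laplace transform is equation B.4. Parameter values are arbitrary.
rng = np.random.default_rng(0)
a, lm, lp = 0.3, 5.0, 50.0      # alpha, lambda_minus, lambda_plus
n = 200_000

branch = rng.random(n) < a      # choose the slow (lm) or fast (lp) branch
isi = np.where(branch, rng.exponential(1 / lm, n), rng.exponential(1 / lp, n))

r_empirical = 1.0 / isi.mean()                     # rate = 1 / mean ISI
r_formula = lm * lp / ((1 - a) * lm + a * lp)      # equation A.7
cv = isi.std() / isi.mean()                        # coefficient of variation
print(r_empirical, r_formula, cv)
```

The empirical rate agrees with equation A.7, and the coefficient of variation exceeds one, as stated above.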
Appendix B: Calculation of P_IVI(ΔV) and p for the Stationary Renewal Spike Train

Below, the IVI distributions P_IVI(ΔV) are derived for the constant-probability or model depressing synapse receiving input generated by a stationary renewal process. For the constant-probability synapse, P_IVI(ΔV) is obtained by noting that the probability of having an IVI of duration ΔV can be broken into two terms as
P_{IVI}(\Delta V) = p\, P_{ISI}(\Delta V)  (B.1)

\qquad + (1 - p) \int_0^{\Delta V} d\Delta S\; P_{ISI}(\Delta S)\, P_{IVI}(\Delta V - \Delta S).  (B.2)
The first term gives the probability that the first ISI is of duration ΔV and the vesicle releases. The second term gives the sum over all possible ways that an IVI of duration ΔV could occur when the first ISI does not cause a vesicle release. P_IVI(ΔV) is obtained from the Laplace transform of the above equation:

\tilde{P}_{IVI}(u) = \frac{p\, \tilde{P}_{ISI}(u)}{1 - (1 - p)\tilde{P}_{ISI}(u)}.  (B.3)
For the exponentially correlated spike train model described by equation 3.17, the Laplace transform of P_ISI(ΔS) is

\tilde{P}_{ISI}(u) = \frac{\alpha \lambda_-}{u + \lambda_-} + \frac{(1-\alpha)\lambda_+}{u + \lambda_+}.  (B.4)
Inserting the above into equation B.3 and inverse Laplace transforming gives that P_IVI(ΔV) is identical in form to the ISI distribution P_ISI(ΔV) (see equation 3.17) except for the replacement of the average rate r by pr. This result reflects that failures to transmit and failures to receive a presynaptic spike are equivalent for the constant-probability synapse. For the model depressing synapse, P_IVI(ΔV) can similarly be decomposed into two terms corresponding to an IVI of duration ΔV resulting from the first ISI or multiple ISIs:

P_{IVI}(\Delta V) = P_{ISI}(\Delta V)\,(1 - e^{-\Delta V/\tau_D})  (B.5)

\qquad + \int_0^{\Delta V} d\Delta S\; P_{ISI}(\Delta S)\, e^{-\Delta S/\tau_D}\, P_{IVI}(\Delta V - \Delta S).  (B.6)
Solving this equation by using Laplace transforms gives

\tilde{P}_{IVI}(u) = \frac{\tilde{P}_{ISI}(u) - \tilde{P}_{ISI}(u + \tau_D^{-1})}{1 - \tilde{P}_{ISI}(u + \tau_D^{-1})}.  (B.7)
For the exponentially correlated spike train, P_IVI(ΔV) is obtained by substituting equation B.4 into equation B.7 and inverse Laplace transforming. The resulting expression gives P_IVI(ΔV) as a linear combination of four exponential distributions with inverse time constants λ_1 = λ_-, λ_2 = λ_+, λ_3 = τ_D^{-1}, and λ_4 = τ_corr^{-1} + τ_D^{-1} and corresponding weights γ_i:

P_{IVI}(\Delta V) = \sum_{i=1}^{4} \gamma_i \lambda_i e^{-\lambda_i \Delta V},  (B.8)

where

\gamma_1 = \frac{\alpha \lambda_3 (\lambda_2 - \lambda_1 + \lambda_3)}{(\lambda_4 - \lambda_1)(\lambda_3 - \lambda_1)}  (B.9)

\gamma_2 = \frac{(1-\alpha) \lambda_3 (\lambda_1 - \lambda_2 + \lambda_3)}{(\lambda_4 - \lambda_2)(\lambda_3 - \lambda_2)}  (B.10)

\gamma_3 = \frac{r\,(\tau_{corr}^{-1} - \lambda_3)}{(\lambda_3 - \lambda_2)(\lambda_3 - \lambda_1)}  (B.11)

\gamma_4 = \frac{-A r \lambda_3}{(\lambda_4 - \lambda_2)(\lambda_4 - \lambda_1)}.  (B.12)
The weights γ_i can be negative, so, unlike α in the ISI distribution (see equation 3.17), they cannot be interpreted as probabilities. The average transmission probabilities p = N/n are obtained by substituting P_ISI(ΔS) (see equation 3.17) into equation 4.3. The result is:

p = 1 - \frac{\alpha \lambda_1}{\lambda_1 + \lambda_3} - \frac{(1-\alpha)\lambda_2}{\lambda_2 + \lambda_3}.  (B.13)
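Equation B.13 can also be checked numerically. The sketch below assumes the transmission model implied by equations B.5 and B.6 (each spike is transmitted with probability 1 - e^{-ΔS/τ_D}, where ΔS is the preceding ISI); parameter values are illustrative, not taken from the paper:

```python
import numpy as np

# Sketch (assumed model, not verbatim from the paper): each presynaptic spike
# is transmitted with probability 1 - exp(-dS/tau_D), where dS is the
# preceding ISI, matching the failure weight exp(-dS/tau_D) in eqs. B.5-B.6.
rng = np.random.default_rng(1)
a, lam1, lam2, tau_D = 0.3, 5.0, 50.0, 0.05   # alpha, lambda_-, lambda_+, recovery
lam3 = 1.0 / tau_D
n = 500_000

branch = rng.random(n) < a
isi = np.where(branch, rng.exponential(1 / lam1, n), rng.exponential(1 / lam2, n))
transmitted = rng.random(n) < (1.0 - np.exp(-isi / tau_D))

p_empirical = transmitted.mean()
p_formula = 1 - a * lam1 / (lam1 + lam3) - (1 - a) * lam2 / (lam2 + lam3)  # eq. B.13
print(p_empirical, p_formula)
```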
For uncorrelated input (α = 0 or 1), p depends on only the product rτ_D. For correlated input, p depends more generally on the form of the autocorrelations. This implies that for a given value of p, rτ_D differs across the graphs shown in Figures 3 and 4. The arrows labeling τ_D = τ_corr in Figure 4B give an example of this dependence.

Acknowledgments

This work was supported by NIH grants MH58754 and MH60651, NSF grant IBN-9817194, the Sloan Center for Theoretical Neurobiology at Brandeis University, the W.M. Keck Foundation, and the Howard Hughes Medical Institute. I thank Larry Abbott and Dan Butts for helpful discussions and comments on the manuscript, and Xiaohui Xie for helpful comments on the manuscript.
References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Baddeley, R., Abbott, L. F., Booth, M. C., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B Biol. Sci., 264, 1775–1783.
Balasubramanian, V., Kimber, D., & Berry, M. J., II. (2001). Metabolically efficient information processing. Neural Comput., 13, 799–815.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B., & Foldiak, P. (1989). Adaptation and decorrelation in the cortex. In R. Durbin, C. Miall, & G. Mitchinson (Eds.), The computing neuron. Reading, MA: Addison-Wesley.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Comput., 12, 1531–1552.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Dan, Y., Atick, J. J., & Reid, R. C. (1996). Efficient coding of natural scenes in the lateral geniculate nucleus: Experimental test of a computational theory. J. Neurosci., 16, 3351–3362.
de Ruyter van Steveninck, R. R., & Laughlin, S. B. (1996). The rate of information transfer at graded-potential synapses. Nature, 379, 642–645.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
de la Rocha, J., Nevado, A., & Parga, N. (2002). Information transmission by stochastic synapses with short-term depression: Neural coding and optimization. Neurocomputing, 44–46, 85–90.
Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Fuhrmann, G., Segev, I., Markram, H., & Tsodyks, M. (2002). Coding of temporal information by activity-dependent synapses. J. Neurophysiol., 87, 140–148.
Goldman, M. S. (2000). Computational implications of activity-dependent neuronal processes. Unpublished doctoral dissertation, Harvard University.
Goldman, M. S., Maldonado, P., & Abbott, L. F. (2002). Redundancy reduction and sustained firing with stochastic depressing synapses. J. Neurosci., 22, 584–591.
Laughlin, S. B. (2001). Energy as a constraint on the coding and processing of sensory information. Curr. Opin. Neurobiol., 11, 475–480.
Laughlin, S. B., de Ruyter van Steveninck, R. R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nat. Neurosci., 1, 36–41.
Levy, W. B., & Baxter, R. A. (2002). Energy-efficient neuronal computation via quantal synaptic failures. J. Neurosci., 22, 4746–4755.
London, M., Schreibman, A., Hausser, M., Larkum, M. E., & Segev, I. (2002). The information efficacy of a synapse. Nat. Neurosci., 5, 332–340.
Maass, W., & Zador, A. M. (1999). Dynamic stochastic synapses as computational units. Neural Comput., 11, 903–917.
Manwani, A., & Koch, C. (2001). Detecting and estimating signals over noisy and unreliable synapses: Information theoretic analysis. Neural Comput., 13, 1–33.
Matveev, V., & Wang, X.-J. (2000). Differential short-term synaptic plasticity and transmission of complex spike trains: To depress or to facilitate? Cereb. Cortex, 10, 1143–1153.
Murthy, V. N., Sejnowski, T. J., & Stevens, C. F. (1997). Heterogeneous release properties of visualized individual hippocampal synapses. Neuron, 18, 599–612.
Natschlager, T., & Maass, W. (2001). Computing the optimally fitted spike train for a synapse. Neural Comput., 13, 2477–2494.
Schreiber, S., Machens, C. K., Herz, A. V. M., & Laughlin, S. B. (2002). Energy-efficient coding with discrete stochastic events. Neural Comput., 14, 1323–1346.
Shannon, C., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216, 427–459.
Stevens, C. F., & Wang, Y. (1995). Facilitation and depression at single central synapses. Neuron, 14, 795–802.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. U.S.A., 94, 719–723.
Varela, J., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Vere-Jones, D. (1966). Simple stochastic models for the release of quanta of transmitter from a nerve terminal. Aust. J. Stat., 8, 53–63.
Wainwright, M. J. (1999). Visual adaptation as optimal information transmission. Vision Res., 39, 3960–3974.
Wang, X. J., Liu, Y., Sanchez-Vives, M. V., & McCormick, D. A. (2003). Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J. Neurophysiol., 89, 3279–3293.
Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophysiol., 79, 1219–1229.

Received May 15, 2003; accepted November 6, 2003.
LETTER
Communicated by Javier Movellan
An Algorithm for the Detection of Faces on the Basis of Gabor Features and Information Maximization Hitoshi Imaoka [email protected] Multimedia Research Laboratories, NEC Corporation, Miyamaeku, Kawasaki, Kanagawa, 216-8555 Japan
Kenji Okajima [email protected] Fundamental Research Laboratories, NEC Corporation, Tsukuba, 305-8501 Japan
We propose an algorithm for the detection of facial regions within input images. The characteristics of this algorithm are (1) a vast number of Gabor-type features (196,800) in various orientations, and with various frequencies and central positions, which are used as feature candidates in representing the patterns of an image, and (2) an information maximization principle, which is used to select several hundred features that are suitable for the detection of faces from among these candidates. Using only the selected features in face detection leads to reduced computational cost and is also expected to reduce generalization error. We applied the system, after training, to 42 input images with complex backgrounds (Test Set A from the Carnegie Mellon University face data set). The result was a high detection rate of 87.0%, with only six false detections. We compared the result with other published face detection algorithms.

1 Introduction

In this letter, we propose a feature-based algorithm for the detection of faces in frontal view within images. The algorithm operates by extracting subwindows from the input image. It then applies Gabor filters to extract edge features from these subwindows and uses a modified decision tree classifier to decide whether each subwindow contains a face. The classifier is generated through a learning process based on the information-maximization principle (Linsker, 1989). The algorithm is applicable to the detection of other objects because it is not customized for faces. Many algorithms for face detection have been proposed (Rowley, Baluja, & Kanade, 1995, 1998; Féraud & Bernier, 1997; Féraud, Bernier, Viallet, & Collobert, 2001; Sung & Poggio, 1998; Osuna, Freund, & Girosi, 1997; Terrillon, Shirazi, Sadek, Fukamachi, & Akamatsu, 2000; Schneiderman & Kanade, 1998, 2000; Wiskott, Fellous, Krüger, & von der Malsburg, 1997; Huang &
Neural Computation 16, 1163–1191 (2004) © 2004 Massachusetts Institute of Technology
Mariani, 2000; Hung, Gutta, & Wechsler, 1996). This is because variations in illumination, view angle, expression, and so on make face detection difficult (Chellappa, Wilson, & Sirohey, 1995). In finding ways to detect an object with a high degree of accuracy, we have to answer two questions: How do we select effective features that are suitable for detecting the object? and How do we classify the pattern into the correct class (face or nonface) by using the selected features? In response to the second question, algorithms such as neural network-based algorithms (Rowley et al., 1995, 1998; Féraud & Bernier, 1997; Féraud et al., 2001), cluster-based algorithms (Sung & Poggio, 1998), support vector machines (Osuna et al., 1997; Terrillon et al., 2000), and the histogram method (Schneiderman & Kanade, 1998, 2000) have been applied. Attempts at answering the first question involve manual feature selection (Wiskott et al., 1997; Huang & Mariani, 2000), the use of a linear feature selection algorithm (Turk & Pentland, 1991), AdaBoost-based feature selection (Viola & Jones, 2001), and others. (See also Fasel & Movellan, 2002, for a comparison of techniques used in most successful neurally inspired face detectors.) Our purposes in the work presented in this letter are to obtain a new method of feature selection that applies a probabilistic approach and to construct an algorithm that applies the features thus selected in face detection. Specifically, we prepare 196,800 Gabor-type features and select about 400 features from among these through a learning process based on an empirically determined informational criterion. The learning process also generates a decision tree, which we call a modified decision tree, as a classifier. In section 2, we give an outline of the face detection algorithm and in section 3 propose the modified decision tree classifier. In section 4, we present a face detection algorithm in which this classifier is applied.
In section 5, we describe our evaluation of the algorithm's performance on the CMU test images and its robustness on side views of faces. We summarize our results and provide discussion in section 6.

2 Outline of the Face Detection Algorithm

Figure 1 outlines the flow of the algorithm. The flow is almost the same as that of several other algorithms for the same purpose (Rowley et al., 1995, 1998; Féraud & Bernier, 1997; Féraud et al., 2001). The first step is to extract the subwindows from the input image. The subwindow size is a fixed 30 by 40 pixels. Each possible subwindow of this size is extracted from the image and then tested to see whether it contains a face. To detect faces of different size, subsampling is repeatedly applied to the input image, reducing its size on each step by a factor of 1/1.2, and subwindows are extracted and tested at each size. The number of extracted subwindows is thus large. For example, from one input image with 300 by 200 pixels, nine images, each having a different size, were generated, and about 114,000 subwindows were extracted and tested.
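The subwindow enumeration just described can be sketched as follows (the 30 × 40 window and the 1/1.2 subsampling factor are from the text; the rounding convention for the shrunken image sizes is an assumption):

```python
# Sketch of the subwindow enumeration described above: repeatedly shrink the
# image by a factor of 1/1.2 and count every 30x40 subwindow at each scale.
def count_subwindows(width, height, win_w=30, win_h=40, factor=1 / 1.2):
    total, scales = 0, 0
    while width >= win_w and height >= win_h:
        total += (width - win_w + 1) * (height - win_h + 1)  # all placements
        scales += 1
        width, height = int(width * factor), int(height * factor)
    return total, scales

total, scales = count_subwindows(300, 200)
print(total, scales)   # on the order of the ~114,000 subwindows / 9 scales cited
```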
Figure 1: Outline of the face detection process.
The second step is to narrow the candidate face subwindows by a factor of 1/100 ∼ 1/1,000; this is achieved by filtering the subwindows, keeping those that are more likely to contain faces. The purpose of this step is to reduce the algorithm's overall computational cost. In this step, we use a compact and fast classifier that is designed to have a high detection rate but also has a rather high false detection rate. We call this classifier a prefilter. The final step is to make the final decision. Of the subwindows that were passed by the prefilter, we use an accurate classifier to detect the subwindows that contain faces. The accurate classifier is designed to have a high detection rate and low false-detection rate (but with a slower processing speed than the prefilter). We call this classifier an accurate filter. A subwindow that is passed by the accurate filter is taken to be a face subwindow, and its location and size within the original image are recorded. In the next section, we propose a classifier for use as the accurate filter. A compact version of the same classifier is used as the prefilter.

3 Face/Nonface Classifier Based on a Modified Decision Tree

We describe our classifier in this section. Sections 3.1 and 3.2 describe the learning process. In section 3.1, we show how to select suitable features for use in the detection of faces. In section 3.2, we show how the selected features are used in constructing the classifier, a modified decision tree. In section 3.3, we describe the classification rule that is used to detect faces during detection runs.

3.1 Feature Selection. First, we let F = {f_1, f_2, …, f_n} be a set of features to be extracted from a subwindow. Each feature f_i takes states {0, 1}, which are determined as described below. The subwindow image, I(x⃗), is filtered through a Gabor filter, and the magnitude (i.e., the absolute value) of the complex response is calculated by

g_{\vec{u}_i, \vec{k}_i, \sigma_i} = \left| \sum_{\vec{x}} I(\vec{x})\, \psi_{\vec{k}_i, \sigma_i}(\vec{x} - \vec{u}_i) \right| \Big/ a_i,  (3.1)
where ψ_{k⃗_i,σ_i}(u⃗) denotes a Gabor filter that is defined by

\psi_{\vec{k}_i, \sigma_i}(\vec{u}) = \exp(-|\vec{u}|^2 / 2\sigma_i^2 + i\, \vec{k}_i \cdot \vec{u}),  (3.2)

with σ_i and k⃗_i denoting the width and the wave vector of the Gabor filter, respectively. The denominator a_i is a normalization factor, which is introduced to make g_{u⃗_i,k⃗_i,σ_i} stable against changes in the absolute brightness, I(x⃗) → αI(x⃗):

a_i = \sqrt{\sum_{\vec{x}} I^2(\vec{x})\, \exp(-|\vec{x} - \vec{u}_i|^2 / 2\sigma_i^2)^2}.  (3.3)
Here, we use the magnitude of the Gabor filter's output, because this remains almost constant against changes in the positions of objects in the subwindow. We are then able to extract features that are robust against shifts in object position. According to the energy model (Adelson & Bergen, 1985; Heeger, 1992), the response of a complex cell in the visual cortex is approximated by the magnitude of a complex Gabor filter's output (see also Fasel, Bartlett, & Movellan, 2002, for the Gabor filter for use in facial landmark detection). The state of a feature f_i is determined by comparing g_{u⃗_i,k⃗_i,σ_i} with a threshold t_i; that is, f_i takes state 1 if g_{u⃗_i,k⃗_i,σ_i} ≥ t_i, and f_i takes state 0 otherwise. Actually, however, we introduce additive gaussian noise to the filter's output g_{u⃗_i,k⃗_i,σ_i} and use the following equation to compute the probability that the feature f_i takes the state 1:

p(f_i = 1) = \frac{1}{\sqrt{2\pi \sigma_{n_i}^2}} \int_{t_i}^{\infty} \exp(-(g - g_{\vec{u}_i, \vec{k}_i, \sigma_i})^2 / 2\sigma_{n_i}^2)\, dg.  (3.4)

The probability p(f_i = 0) is determined by p(f_i = 0) = 1 - p(f_i = 1). In equation 3.4, t_i and σ_{n_i}² denote the threshold and noise variance, respectively, and these are determined depending on the feature. A procedure for determining the threshold and noise variance is described in appendix A. We introduce the noise so that those features whose outputs g_{u⃗_i,k⃗_i,σ_i} take values above or below the threshold by large margins are selected during the learning process. This improves the detection ability for new examples that were not used in training (see appendix A). For each subwindow, we prepared a large number of feature candidates, each having a different set of the parameters (u⃗_i, k⃗_i, σ_i) (i = 1, 2, …, 196,800) from among the values listed in Table 1. The central positions u⃗_i include all possible positions in the subwindows (30 × 40 = 1200). For each central position, we prepare 14 sets of (σ, |k⃗|) (among them, |k⃗| = 0 in four sets) and 16 angles as orientations of k⃗. From Table 1, we see that the wavelengths λ = 2π/|k⃗| of the filters (whose |k⃗| ≠ 0) are 2√2, 4, 4√2, and 8 pixels. The average distance between the centers of eyes is 18.2 pixels in our 30 × 40 image plane
Table 1: Parameter Values of u⃗, σ, k⃗ Used in the Gabor Filter.

Parameter               Values                                                Number of Possible
                                                                              Parameter Values
u⃗                       All possible positions in a subwindow                 1200
(σ, |k⃗|),               (2√2/3, 0), (4/3, 0), (4√2/3, 0), (8/3, 0),           14
σ (pixels),             (2√2/3, √2π/2), (4/3, π/2), (4/3, √2π/2),
|k⃗| (1/pixels)          (4√2/3, √2π/4), (4√2/3, π/2), (4√2/3, √2π/2),
                        (8/3, π/4), (8/3, √2π/4), (8/3, π/2), (8/3, √2π/2)
Angle of k⃗ (radians)    0, π/16, …, 15π/16                                    16
Figure 2: Distance between the centers of eyes.
(see Figure 2). The total number of feature candidates per subwindow was thus 196,800 (= 1200 × 10 × 16 + 1200 × 4). We use F = {f_1, f_2, …, f_{196,800}} to denote these feature candidates. Next, we apply a learning process to select suitable features from among the above features. Suppose we select m features F^* = (f_1^*, f_2^*, …, f_m^*) from the feature candidates F. We introduce a variable ν as a label for each subwindow (we set ν = 1 for a face subwindow and ν = 0 for a nonface subwindow). Our purpose is to find features that are applicable to classifying the subwindows as a face subwindow or a nonface subwindow with minimal error. A suitable objective function for this purpose is the mutual information between ν and F^* = (f_1^*, f_2^*, …, f_m^*), that is, I(ν; F^*). A procedure for calculating an empirically based value for mutual information is described in appendix B. We select an optimal set of features F̃^* = (f̃_1^*, f̃_2^*, …, f̃_m^*) by maximizing the mutual information:

\tilde{F}^* = \arg\max_{F^*} I(\nu; F^*).  (3.5)
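A minimal sketch of the per-feature computation in equations 3.1 through 3.4 follows (the subwindow contents, parameter values, threshold, and noise level are illustrative assumptions, not the paper's trained values):

```python
import numpy as np
from math import erf

# Sketch of equations 3.1-3.4: Gabor magnitude response normalized by a_i,
# then mapped to p(f_i = 1) through a noisy threshold. All parameter values
# below are illustrative assumptions, not the paper's trained values.
def gabor_feature_prob(I, u, k, sigma, t, sigma_n):
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    dy, dx = ys - u[0], xs - u[1]
    env = np.exp(-(dx**2 + dy**2) / (2 * sigma**2))
    psi = env * np.exp(1j * (k[0] * dy + k[1] * dx))      # equation 3.2
    a = np.sqrt(np.sum(I**2 * env**2))                    # equation 3.3
    g = np.abs(np.sum(I * psi)) / a                       # equation 3.1
    # Equation 3.4: p(f_i = 1) = P(g + gaussian noise >= t).
    return 0.5 * (1 - erf((t - g) / (np.sqrt(2) * sigma_n)))

rng = np.random.default_rng(0)
I = rng.random((40, 30))                                  # a 30x40-pixel subwindow
p1 = gabor_feature_prob(I, u=(20, 15), k=(0.0, np.pi / 2),
                        sigma=8 / 3, t=0.2, sigma_n=0.05)
print(p1)
```

Because the numerator of equation 3.1 and a_i scale identically under I(x⃗) → αI(x⃗), the resulting probability is invariant to absolute brightness, as the normalization intends.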
However, nCm calculations of mutual information (one for each possible choice of m features from the n candidates) are necessary in selecting features from among all possible F^* according to equation 3.5. If n or m is large, the computational cost of finding features F̃^* becomes prohibitive. We avoid this difficulty by using a simplified sequential maximization algorithm instead of the exact maximization algorithm, 3.5. We start by applying the chain rule to write the mutual information as

I(\nu; F^*) = I(\nu; f_1^*) + I(\nu; f_2^* \mid f_1^*) + \cdots + I(\nu; f_m^* \mid f_1^*, f_2^*, \ldots, f_{m-1}^*)  (3.6)

and then determine optimal features by sequentially maximizing each of the terms on the right-hand side of this equation, that is,

\tilde{f}_1^* = \arg\max_{f_i \in F} I(\nu; f_i),
\tilde{f}_2^* = \arg\max_{f_i \in F} I(\nu; f_i \mid \tilde{f}_1^*),
\;\vdots
\tilde{f}_m^* = \arg\max_{f_i \in F} I(\nu; f_i \mid \tilde{f}_1^*, \tilde{f}_2^*, \ldots, \tilde{f}_{m-1}^*).  (3.7)
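The sequential maximization of equation 3.7 is greedy, conditional information-gain selection. A minimal sketch on synthetic binary data (the conditioning on previously selected features is implemented through the partition of the training set they induce; all data and sizes are illustrative assumptions):

```python
import numpy as np

# Sketch of equation 3.7: at each step, pick the binary feature with the
# largest information gain about the label v, conditioned on the partition
# induced by the features already selected. Data are random stand-ins.
def entropy(labels):
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def greedy_select(features, labels, m):
    # features: (n_examples, n_features) binary; labels: (n_examples,) in {0,1}
    selected, node = [], np.zeros(len(labels), dtype=int)
    for _ in range(m):
        best, best_gain = None, -1.0
        for j in range(features.shape[1]):
            if j in selected:
                continue
            gain = 0.0   # I(v; f_j | history) = H(v|hist) - H(v|f_j,hist)
            for h in np.unique(node):
                idx = node == h
                split = features[idx, j]
                cond = sum((split == s).mean() * entropy(labels[idx][split == s])
                           for s in (0, 1) if (split == s).any())
                gain += idx.mean() * (entropy(labels[idx]) - cond)
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        node = 2 * node + features[:, best]   # refine the partition
    return selected

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
X = rng.integers(0, 2, (400, 8))
X[:, 3] = y ^ (rng.random(400) < 0.1)   # feature 3 is highly informative
print(greedy_select(X, y, 2))           # feature 3 is chosen first
```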
It should be noted that F̃^* obtained through equation 3.7 is in general different from the true solution. However, equation 3.7 requires only that we evaluate m × n values of mutual information. Hence, the computational cost is markedly lower than with equation 3.5. In equation 3.7, the mutual information I(ν; f_i | f̃_1^*), for example, is defined as I(ν; f_i | f̃_1^*) = H(ν | f̃_1^*) - H(ν | f_i, f̃_1^*). The conditional entropies, H(ν | f̃_1^*) and H(ν | f_i, f̃_1^*), can be empirically evaluated in a similar way as that described in appendix B. We stop selecting new features when the maximal information gain falls below a threshold, that is, when max_{f_i∈F} I(ν; f_i | f̃_1^*, f̃_2^*, …, f̃_m^*) < threshold.

3.2 Generation of Modified Decision Tree. In selecting the optimal features according to equation 3.7, a decision tree is produced automatically. In the top layer, the first feature f̃_1^* is used as the test to classify the subwindows, and two nodes {0, 1} are generated (e.g., a subwindow for which f̃_1^* = 1 is assigned to node 1). Similarly, the feature f̃_2^* is used as the test in the second layer, and four nodes (f̃_1^*, f̃_2^*) = (0, 0), (0, 1), (1, 0), (1, 1) are generated. In this way, a decision tree with m layers is generated (see Figure 3), and this is used to classify unknown subwindows during detection runs. The following problems remain even when we follow the procedure based on equation 3.7: when the number of selected features m is large (several hundreds), the number of nodes explodes (e.g., there are 2^m nodes
Figure 3: Structure of the decision tree.
in the mth layer), and too few training subwindows are assigned to each of the nodes when the number of nodes is large. For this reason, we merge some of the nodes. In creating the rule for this process of merging, we apply an important property of the mutual information: if any two nodes D_1 and D_2 satisfy the relation p(ν | D_1) = p(ν | D_2) for all ν, the mutual information has the same value after merging of the nodes D_1 and D_2 as before the merging. Here, p(ν | D_i) denotes the probability that a subwindow belongs to a class ν when it belongs to the node D_i. The proof of the node-merging property is given in appendix C. According to this property, we introduce a variable q_i = p(ν = 1 | D_i)/p(ν = 0 | D_i). If q_1 and q_2 satisfy q_1 = q_2, the relation p(ν | D_1) = p(ν | D_2) is satisfied for all ν. Hence, we use the variable q_i in the criterion for merging; that is, when q_i ≈ q_j, we merge nodes D_i and D_j. The details of the merging procedure are described below and illustrated in Figure 4. According to the procedure described below, the number of nodes in each layer always remains constant (= 2l).

1. A first feature f̃_1^* is selected according to equation 3.7:

\tilde{f}_1^* = \arg\max_{f_i \in F} I(\nu; f_i).  (3.8)
2. We rewrite each node {0, 1}, which is generated after the first test by using f̃_1^*, as {D_1^1, D_2^1}. We prepare 2l nodes *D_1^1, *D_2^1, …, *D_{2l}^1. We set n = 1.

3. All of the training subwindows that have been assigned to the node D_i^n (i = 1, 2 for n = 1 and i = 1, 2, …, 4l for n ≠ 1) are transferred (merged) to node *D_j^n (j = 1, 2, …, 2l) according to the following transition rule (merging rule). Let q_i ≡ p(ν = 1 | D_i^n)/p(ν = 0 | D_i^n).
Figure 4: Structure of the modified decision tree. The merging process and splitting process are repeated in an alternating fashion. In each layer i, the probability p(ν = 1 | *D_j^i) increases with j.
The rule is then written as

D_i^n \to {}^*D_j^n = \begin{cases} {}^*D_1^n, & q_i < 2^{-l+1} \\ {}^*D_2^n, & 2^{-l+1} \le q_i < 2^{-l+2} \\ \;\vdots & \;\vdots \\ {}^*D_{2l-1}^n, & 2^{l-2} \le q_i < 2^{l-1} \\ {}^*D_{2l}^n, & 2^{l-1} \le q_i. \end{cases}  (3.9)
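The transition rule of equation 3.9 bins the odds ratio q_i by powers of two into 2l target nodes. A minimal sketch (the value of l here is an illustrative choice):

```python
import math

# Sketch of the merging rule (equation 3.9): assign a node with odds ratio
# q = p(v=1|D)/p(v=0|D) to one of 2l bins whose interior boundaries are the
# powers of two 2^(-l+1), ..., 2^(l-1).
def merge_bin(q, l):
    """Return j in 1..2l such that node D is transferred to *D_j."""
    if q < 2.0 ** (-l + 1):
        return 1
    if q >= 2.0 ** (l - 1):
        return 2 * l
    # interior bin j satisfies 2^(-l+j-1) <= q < 2^(-l+j)
    return int(math.floor(math.log2(q))) + l + 1

l = 3
print([merge_bin(q, l) for q in (0.1, 0.5, 1.0, 3.0, 10.0)])  # -> [1, 3, 4, 5, 6]
```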
4. The (n+1)th feature is selected by

   f̃*_{n+1} = arg max_{f_i ∈ F} I(ν; f_i | *D^n).   (3.10)

5. We split each node *D^n_j according to a test using the (n+1)th feature f̃*_{n+1}; for example, a subwindow x ∈ *D^n_k for which f̃*_{n+1} = 0 is assigned to a node D^{n+1}_{2k−1}, and one for which f̃*_{n+1} = 1 is assigned to D^{n+1}_{2k}. Thus, we obtain 4l nodes, D^{n+1}_1, D^{n+1}_2, ..., D^{n+1}_{2l×2}, from the nodes *D^n = {*D^n_1, *D^n_2, ..., *D^n_{2l}}. We prepare nodes *D^{n+1}_1, *D^{n+1}_2, ..., *D^{n+1}_{2l}. We set n = n + 1 and go to step 3.
6. We continue to apply procedures 3 to 5 until the maximal information gain has fallen below a threshold: max_{f_i ∈ F} I(ν; f_i | *D^n) < threshold.
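Since the bin boundaries in equation 3.9 are consecutive powers of two, the transition rule amounts to binning ⌊log₂ q_i⌋ and clamping to the available nodes. A minimal sketch, assuming 1-based node indices (the function name and the clamping style are ours, not the paper's):

```python
import math

def merge_destination(q: float, l: int) -> int:
    """Destination node index j in 1..2l for the merging rule of
    equation 3.9: *D_1 collects q < 2**(-l+1), *D_2l collects
    q >= 2**(l-1), and the bin boundaries double at each step."""
    if q <= 0:
        return 1  # q = 0 falls into the lowest bin
    j = math.floor(math.log2(q)) + l + 1
    return max(1, min(2 * l, j))
```

With 2l = 52 intermediate nodes as used in the paper (l = 26), q_i = 1 (equal face and nonface posteriors) maps to node 27, just above the middle of the range.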
Applying this rule for merging ensures that the number of merged nodes (*D^n_j) in each layer is always 2l. That is, the number of nodes in the nth layer is decreased from 2^n to 2l. In each layer n, the probability p(ν = 1 | *D^n_j) increases with j. Note that we select features according to equations 3.8 and 3.10 instead of 3.7. In equation 3.10, the mutual information I(ν; f_i | *D^n) is defined as

   I(ν; f_i | *D^n) = H(ν | *D^n) − H(ν | f_i, *D^n),   (3.11)

where the conditional entropies, H(ν | *D^n) and H(ν | f_i, *D^n), are respectively defined as

   H(ν | *D^n) = − Σ_j Σ_{ν=0,1} p(ν, *D^n_j) log p(ν | *D^n_j)   (3.12)

and

   H(ν | f_i, *D^n) = − Σ_{f_i=0,1} Σ_j Σ_{ν=0,1} p(ν, f_i, *D^n_j) log p(ν | f_i, *D^n_j).   (3.13)
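The conditional information gain of equations 3.11 to 3.13 can be evaluated empirically from joint frequencies. The sketch below uses hard binary feature outcomes and hard node assignments for clarity; the paper instead propagates the soft probabilities of equation 3.4 (not reproduced in this excerpt), so this is an illustrative simplification:

```python
import numpy as np

def conditional_info_gain(labels, node_ids, feature, n_nodes):
    """Empirical I(nu; f | D) = H(nu | D) - H(nu | f, D) of equations
    3.11-3.13, with hard 0/1 features and hard node assignments."""
    def cond_entropy(groups):
        h = 0.0
        for g in groups:
            for v in (0, 1):
                p_joint = np.mean(g & (labels == v))  # p(nu, group)
                if p_joint > 0:
                    h -= p_joint * np.log2(p_joint / np.mean(g))
        return h
    nodes = [node_ids == j for j in range(n_nodes)]             # H(nu | D)
    splits = [(node_ids == j) & (feature == b)                  # H(nu | f, D)
              for j in range(n_nodes) for b in (0, 1)]
    return cond_entropy(nodes) - cond_entropy(splits)
```

A feature identical to the label yields the full conditional entropy as gain; a feature independent of the label yields (almost) zero gain, which is the stopping condition in step 6.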
During the training, equation 3.11 is empirically evaluated by using training examples. When the training has been completed, the selected feature set F̃* = (f̃*_1, f̃*_2, ..., f̃*_m) and the rule for merging (transition rule) in each layer, D^n_i → *D^n_j, are memorized. These data are used in detecting faces during detection runs, as described in the next section. We call the above type of decision tree a modified decision tree, because the number of nodes in each layer is reduced by the merging process.

3.3 Classification of Subwindows. We now write a decision rule for determining the class of a subwindow x. We use the following approximation
for the a posteriori probability of the subwindow x belonging to the face class:

   p(ν = 1 | x) = Σ_{i=1}^{2l} p(ν = 1 | *D^m_i) p(*D^m_i | x),   (3.14)

where p(*D^m_i | x) denotes the probability of x being assigned to a terminal node *D^m_i, and p(ν = 1 | *D^m_i) denotes the probability that a subwindow, having been assigned to *D^m_i, contains a face. The probability p(ν = 1 | *D^m_i) is empirically evaluated by using training examples. The detailed procedure used in calculating p(ν = 1 | *D^m_i) is given in appendix D. The probability p(*D^m_i | x) is determined in the following way. First, we write p(*D^m_i | x) as

   p(*D^m_i | x) = Σ_j Σ_{f̃*_m ∈ {0,1}} p(*D^m_i | *D^{m−1}_j, f̃*_m) p(*D^{m−1}_j, f̃*_m | x).   (3.15)
Here, p(*D^m_i | *D^{m−1}_j, f̃*_m) is given by the node-merging rule (transition rule) that was obtained during the learning process:

   p(*D^m_i | *D^{m−1}_j, f̃*_m) = 1 when D^m_{2j−1+f̃*_m} is transferred to *D^m_i, and 0 otherwise.   (3.16)
The probability p(*D^{m−1}_j, f̃*_m | x) is modified to obtain

   p(*D^{m−1}_j, f̃*_m | x) = p(*D^{m−1}_j | x) p(f̃*_m | *D^{m−1}_j, x).   (3.17)

Note that p(f̃*_m | *D^{m−1}_j, x) is determined by equation 3.4 regardless of *D^{m−1}_j. Thus, equation 3.17 is written as

   p(*D^{m−1}_j, f̃*_m | x) = p(*D^{m−1}_j | x) p(f̃*_m | x).   (3.18)
Substituting equation 3.18 into equation 3.15, we obtain

   p(*D^m_i | x) = Σ_j Σ_{f̃*_m ∈ {0,1}} p(*D^m_i | *D^{m−1}_j, f̃*_m) p(f̃*_m | x) p(*D^{m−1}_j | x).   (3.19)
Hence, we can iteratively compute values of p(*D^m_i | x) from the probability p(*D^{m−1}_i | x). The probability p(*D^1_i | x) is given as

   p(*D^1_i | x) = Σ_{f̃*_1 ∈ {0,1}} p(*D^1_i | f̃*_1) p(f̃*_1 | x),   (3.20)

where p(*D^1_i | f̃*_1) is given by the node-merging rule. Therefore, the a posteriori probability p(ν = 1 | x) is calculated by using the probabilities p(f̃*_i | x) and the transition probabilities of the modified decision tree,
p(*D^m_i | *D^{m−1}_j, f̃*_m). The decision rule is then given by

   p(ν = 1 | x) ≥ T  ⇒  x is a face subwindow,
   p(ν = 1 | x) < T  ⇒  x is a nonface subwindow,   (3.21)

where T is a threshold.

4 Using the Modified Decision Tree Classifier in a Face Detection Algorithm

We now describe the face detection algorithm that operates on the basis of the decision tree proposed in the previous section; that is, we describe the details of the procedure applied to detect faces.

4.1 Preprocessing. We extract a large number of subwindows from an input image in order to detect faces of various sizes and at various locations. However, the extracted subwindows vary in overall and local brightness with the illumination. To reduce the effects of such variation, we take the convolution with a difference-of-gaussian (DoG) filter,

   I′(x⃗) = Σ_{x⃗′} D(x⃗′) I(x⃗′ − x⃗),   (4.1)

where

   D(x⃗) = (1/2πσ_e²) exp(−|x⃗|²/2σ_e²) − (1/2πσ_i²) exp(−|x⃗|²/2σ_i²).   (4.2)
In equation 4.1, I(x⃗) and I′(x⃗) denote subwindow images before and after filtering, respectively, and σ_e and σ_i in equation 4.2 are constants that satisfy σ_e < σ_i. We set σ_e² = 5/3 and σ_i² = 12/3 (pixels). This filtering suppresses low-frequency components in the subwindows. The preprocessing is optional, but we found that it improves the detection performance, especially for faces that have high brightness and low contrast. During the tests described in section 5, we incorporated the preprocessing. Presumably, other preprocessing, such as mean and variance normalization or histogram equalization, would also improve the detection performance for such low-contrast faces.

We can efficiently compute equation 4.1 by using the Fourier transform; that is, the DoG filter outputs of equation 4.1 for subwindows at various locations can be simultaneously computed by one Fourier transform of the whole image, a filtering in the frequency domain, and an inverse Fourier transform. Since we compute the Fourier transform anyway (see section 4.2), the main additional computation needed is an inverse Fourier transform, which requires ~N log₂ N multiplications (N: number of pixels), far fewer than required by other possible preprocessing. For example, mean and variance normalization for each subwindow requires ~N × 1200 multiplications in total, that is, ~75 N log₂ N multiplications when N = 200 × 300 pixels.

4.2 Prefiltering. In constructing the prefilter, we prepared 2400 feature candidates. All possible positions in the subwindow are used as central positions (30 × 40 = 1200). The orientation of the Gabor filter is either vertical or horizontal (i.e., angle of k⃗ equal to 0 or π/2). The size and spatial frequency are fixed at (σ, |k⃗|) = (4√2/3, √2 π/2). There are 2l = 52 intermediate nodes in the application of equation 3.9.

We used only two combinations of (σ, k⃗) because we wanted to calculate the Gabor outputs quickly. Remember that the prefilter has to test all the subwindows. Thus, we need to calculate the Gabor outputs at (almost) every pixel within a whole input image. These can be efficiently computed via the Fourier transform, that is, by a Fourier transform of the whole image, a filtering in the frequency domain, and an inverse Fourier transform. We need to carry out such calculations only for the two fixed combinations (σ, k⃗) of the filter. The accurate filter computes the Gabor outputs through product-sum operations, since it needs to test only subwindows that are passed by the prefilter.

In training the prefilter, we used 200 face subwindows and 1000 nonface subwindows; the training data were gathered from the World Wide Web. The face images are of various sizes and are set against various backgrounds. The positions of the eyes in these images were labeled manually. These positions are used in extracting face subwindows and normalizing their sizes:

1. Let the positions of the right and left eyes be z⃗_r and z⃗_l. We use z⃗_c = (z⃗_r + z⃗_l)/2 and d = |z⃗_r − z⃗_l| as the central position and the distance between the eyes, respectively.

2. We extract the region with left-hand, right-hand, upper, and lower bounds at 0.65d, 0.65d, 1.2d, and 3.2d from z⃗_c.

3. Resize the region thus extracted to 30 by 40 pixels.
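Returning to the preprocessing of section 4.1, the whole-image DoG filtering can be sketched with the FFT as follows. We build the filter directly in the frequency domain, using the fact that the unit-integral gaussian (1/2πσ²) exp(−|x⃗|²/2σ²) of equation 4.2 has Fourier transform exp(−σ²|ω|²/2); the circular (periodic) boundary handling is a simplification of ours:

```python
import numpy as np

def dog_filter_fft(image, sigma_e, sigma_i):
    """Apply the difference-of-gaussian filter of equations 4.1-4.2 to a
    whole image via one forward and one inverse FFT (section 4.1)."""
    h, w = image.shape
    fy = 2 * np.pi * np.fft.fftfreq(h).reshape(-1, 1)  # angular frequencies
    fx = 2 * np.pi * np.fft.fftfreq(w).reshape(1, -1)
    w2 = fx ** 2 + fy ** 2
    dog = np.exp(-0.5 * sigma_e ** 2 * w2) - np.exp(-0.5 * sigma_i ** 2 * w2)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * dog))
```

Because the excitatory and inhibitory gaussians have the same DC gain, a constant image maps to (numerically) zero, which is exactly the low-frequency suppression described above.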
The nonface subwindows are gathered from images of natural objects of various sizes and in various positions. One hundred features were selected from among the 2400 feature candidates. The prefilter obtained through the above procedure showed a high detection rate (above 90%) along with a rather high rate of false detections (about 1%).

4.3 Accurate Filtering. We construct a modified decision tree classifier according to the method described in section 3. There are 196,800 feature candidates for a subwindow, and the number of intermediate nodes 2l in the application of equation 3.9 is 52. We used 1500 examples of faces gathered from the World Wide Web. The nonface subwindows are extracted from images of natural scenes that were also gathered from the Web. From these images, the nonface subwindows are gathered in a bootstrap manner (Rowley et al., 1998) according to the following procedure:

1. Prepare 200 nonface subwindows by randomly extracting them from the images.

2. Construct a modified decision tree classifier, and test new nonface subwindows that are randomly extracted from the images. Choose 100 nonface subwindows that have been misclassified as faces during the test. Add them to the training set of nonface subwindows.

3. Reconstruct the modified decision tree with the enlarged set of nonface subwindows.

This process is repeated until the number of nonface subwindows reaches 3000. In this bootstrap method, missed nonface subwindows are learned repeatedly, so that the incidence of false detections gradually decreases. We finally succeeded in reducing the rate of false detections to ~10⁻⁶. In the final decision tree, the number of selected features was about 400.

Figure 5 shows the first to ninth selected features, f̃*_1, f̃*_2, ..., f̃*_9, in the final accurate filter. In Figure 5, the left-hand image shows the central position of the selected Gabor filter superimposed on an averaged face, and the right-hand image shows a density plot of the real component of the Gabor filter. Figure 6 shows the distribution of the output of each Gabor filter. In each plot, the horizontal axis represents the output value of the Gabor filter, and the vertical axis represents the number of subwindows in each bin. The white bars indicate the histogram for face subwindows, and the black bars indicate that for nonface subwindows. The thin line represents the threshold t_i, and the dotted lines represent the values t_i ± σ_{n_i}, where σ_{n_i} is the noise variance.

From Figure 5, we notice that features that seem reasonable are automatically selected through the process of learning. For example, the second feature corresponds to a horizontal edge detected at the position of an eye.
Detection of this feature is expected in most face subwindows (see Figure 6). The first feature corresponds to an edge near the nose. Since the orientation of this edge differs from the orientation of the nose, this edge is not detected in most face subwindows (see Figure 6). Figure 7A shows the distribution of selected spatial frequencies (A-1), the distribution of selected orientations (A-2), and the distribution of selected positions in the 30 × 40 image plane (A-3). We also plot in Figure 7B the information gain at each frequency (B-1), each orientation (B-2), and each position (B-3). For example, in Figure 7B-2, the information gain for the selected features, Σ_{f*_n ∈ F̃*, θ_{f*_n} = θ} I(ν; f*_n | f*_1, f*_2, ..., f*_{n−1}), is plotted at each orientation θ, while in Figure 7A-2, Σ_{f*_n ∈ F̃*, θ_{f*_n} = θ} 1 is plotted at each orientation θ. Here, F̃* denotes the set of selected features, and θ_f denotes the orientation of a feature f. We notice that the information gain is large around θ = 0 (vertical
Figure 5: The first to ninth selected features in the final accurate filter. The left-hand image shows the central position of the selected Gabor filter superimposed on the averaged face, and the right-hand image shows the density plot of the real component of the Gabor filter.
orientation) and around θ = 90 (horizontal orientation). From Figure 7B-3, we see that the information gain is large around the eyes, the nose, and the mouth.

4.4 Heuristics for the Precise Estimation of the Positions and Sizes of Faces. The accurate filter will usually detect several face subwindows around a face. In order to remove such overlapping detections and to estimate the size and position of the face correctly, we apply the following heuristics after the stage of accurate filtering:

1. Select the subwindow x*_1 with the highest a posteriori probability p(ν = 1 | x) of containing an image of a face.

2. Extract the set of subwindows y = {y_1, y_2, ..., y_n} that are located within 5 pixels of x*_1 and have been passed by the prefilter.

3. Average the a posteriori probabilities p(ν = 1 | y_i) over these subwindows: p(ν = 1 | x*_1) = (1/n) Σ_i p(ν = 1 | y_i). If the average is above a threshold (= 0.25), the subwindow x*_1 is determined as belonging to
Figure 6: Distributions of the outputs of the Gabor filters for the first to ninth features. In each plot, the horizontal axis represents the output value of the Gabor filter, and the vertical axis represents the number of subwindows in each bin. The white bars indicate the histogram for face subwindows, and the black bars indicate that for nonface subwindows. The thin line represents the threshold t_i, and the dotted lines represent the values t_i ± σ_{n_i}, where σ_{n_i} is the noise variance.
Figure 7: (A) The distribution of selected spatial frequencies (A-1), selected orientations (A-2), and selected positions in a 30 × 40 image plane (A-3). (B) Information gain at each spatial frequency (B-1), at each orientation (B-2), and at each position (B-3).
the face class. The a posteriori probabilities for the subwindows that overlap with x*_1 are set to 0 in order to prevent overlapping detections.

4. Select the subwindow with the second highest a posteriori probability p(ν = 1 | x). Steps 2 and 3 are applied to determine whether the selected subwindow contains a face.

5. Repeat steps 3 and 4 for all subwindows.

Through these procedures, we eliminate overlapping detections and determine the precise sizes and positions of faces.

4.5 Combination of Two Decision Trees. We combine two decision trees to further reduce the rate of false detections. The first decision tree is the one generated in the manner described above. The second is generated by changing the examples of nonface subwindows. The trees are combined by (1) using the two filters separately to detect regions that contain faces and (2) in cases where almost the same region has been detected as containing a face by both filters, considering the region to contain a face. In step 2 in particular, the following ratio is used as the criterion: (area of overlap between S_1 and S_2)/(average area of S_1 and S_2), where S_1 and S_2 are the facial regions detected by the first and second filters, respectively. If the ratio is above a threshold (in this trial, we used 0.8), the region is selected as one that contains an image of a face. The region S_i (i = 1 or 2) that has the larger a posteriori probability is then taken as the correct region.

5 Experimental Results

To verify the ability of the modified decision tree face detector, we investigated its results on 42 gray-scale images with complex backgrounds, collected by the computer vision group at Carnegie Mellon University (CMU) (Rowley et al., 1995, 1998). The set is called CMU test set A and contains 169 faces. In this investigation, 4,836,713 subwindows were tested.
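The combination criterion of section 4.5 can be sketched as follows, assuming an axis-aligned box representation (x0, y0, x1, y1) for the detected regions (the representation is our choice, not the paper's):

```python
def overlap_ratio(s1, s2):
    """Criterion of section 4.5: (area of overlap between S1 and S2)
    divided by the average area of S1 and S2. A ratio above 0.8 means
    the two detectors found (almost) the same face region."""
    ix0, iy0 = max(s1[0], s2[0]), max(s1[1], s2[1])
    ix1, iy1 = min(s1[2], s2[2]), min(s1[3], s2[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / ((area(s1) + area(s2)) / 2)
```

Dividing by the average area (rather than the union, as in the more common intersection-over-union score) makes identical boxes score exactly 1 and disjoint boxes exactly 0.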
To search for relatively small faces (smaller than the 30 × 40 subwindow), the original images were enlarged by a factor of 1.5. In Figure 8, we give some examples of our system's detection results. The results for the image at the upper-left corner show one false detection and two missed faces. Table 2 shows the levels of performance of our single and double decision tree methods, along with some other algorithms of state-of-the-art performance (Rowley et al., 1995; Féraud et al., 2001). The result for the double decision tree is a high detection rate of 87.0%, with six false detections. The structure of our system used for the test is summarized in Table 3. Table 4 lists the detection rate of our double decision tree system evaluated on another data set, the (CMU+MIT) test set (A+B+C), or CMU-130 (Rowley et al., 1998; Sung & Poggio, 1998). This set consists of 130 images with 507 frontal faces. The CMU test set A is a subset of this test set. Table 4
Figure 8: Examples of detection results.
Table 2: Results on CMU Test Set A.

Model                                                Detection  False       False            Computational
                                                     Rate (%)   Detections  Detection Rate   Time (s)
Single decision tree                                 87.6       25          5.17 × 10⁻⁶      18.1
Double decision trees                                87.0        6          1.24 × 10⁻⁶      25.3
Constrained generative model (Féraud et al., 2001)   87         39          1.15 × 10⁻⁶      –
Neural network model (Rowley et al., 1995)           85.2       47          2.13 × 10⁻⁶      –

Notes: This test set consists of 42 images and contains 169 faces. We tested 4,836,713 subwindows. The computational times are for an Intel Pentium 1.0 GHz processor operating on 200-by-300 pixel images.
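For our two detectors, the false detection rates in Table 2 are simply the false detection counts divided by the 4,836,713 subwindows tested (the rates quoted for the comparison systems presumably refer to their own subwindow counts):

```python
subwindows = 4_836_713          # subwindows tested on CMU test set A
rate_single = 25 / subwindows   # single decision tree
rate_double = 6 / subwindows    # double decision trees
print(f"{rate_single:.2e}, {rate_double:.2e}")  # 5.17e-06, 1.24e-06
```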
Table 3: Structure of the Decision Trees Used for Performance Evaluation.

                        Number of           Number of
                        Selected Features   Intermediate Nodes, 2l
Single decision tree
  Prefilter             100                 52
  Accurate filter       402                 52
Double decision tree
  Prefilter             100                 52
  Accurate filter 1     402                 52
  Accurate filter 2     397                 52
Table 4: Detection Rates for Various Numbers of False Detections on the CMU-130 Test Set.

Detector                        Correct Detection Rate / False Detections
Imaoka and Okajima^a            82.2%/9   86.6%/32  88.6%/65  89.3%/94  90.3%/165  91.9%/405
Viola & Jones (2001)            78.3%/10  85.2%/31  89.8%/65  90.8%/95  91.8%/167  93.7%/422
Rowley et al. (1995, 1998)      83.2%/10  86.2%/23  86.0%/31  89.2%/95  90.1%/167  89.9%/422
Féraud et al. (2001)            86%/8
Schneiderman & Kanade (2000)    (94.4%/65)
Yang et al. (2000)              (93.6%/74)
Roth et al. (2000)              (94.8%/78)

a. Work in progress.
also lists the detection rates for other published systems, as compiled by Hjelmas and Low (2001) and by Viola and Jones (2001). The system of Schneiderman and Kanade (2000), that of Yang, Ahuja, and Kriegman (2000), and that of Roth, Yang, and Ahuja (2000) have high detection rates with a relatively small number of false detections. Their results are on the CMU-130 test set but exclude some hand-drawn faces, so they are for a subset of CMU-130 containing 125 images with 483 faces. If the full test set were used, their detection rates would be lower and their numbers of false detections higher. The result for Féraud et al. (2001) was obtained with a system different from the one they used for test set A. Their result for test set A (see Table 2) was obtained with a system composed of two detectors, while they used a system of four detectors to obtain the result in Table 4. Also, the results for Rowley et al. (1995, 1998) are for different versions of their detector (summarized in Table 3 of Viola and Jones, 2001). As for processing speed, the Viola-Jones detector is fast: it can process a 384 × 288 pixel image in about 0.067 second on a 700 MHz Pentium III processor (Viola & Jones, 2001). At present, the processing speed of our system (see Table 2) is about
[ROC plot: correct detection rate (%) versus false positives, for the combined detector and for decision trees 1 and 2 without heuristics.]
Figure 9: ROC curves for our face detection system on the CMU-130 test set.
1000 times slower than their system. Reducing the processing time of our system is left for future studies. (At present, our algorithm is implemented on a MATLAB platform.)

Figure 9 shows receiver operating characteristic (ROC) curves for the double decision tree system on the CMU-130 test set. In Figure 9, ROC curves for each single decision tree are also shown. To obtain these curves, the single decision trees were tested without the heuristics described in section 4.4 (overlapping detections were eliminated). Next, to investigate the robustness of the algorithm against variation in the angle of view, we tested images of faces seen from various angles. For this test, we used a data set collected at Sussex University (Howell & Buxton, 1995), which is composed of 100 images of 10 people spanning a 90 degree range of viewing angles (10 views per person). Table 5 shows the results of the double decision tree and of one other method (Féraud et al., 2001). Although we used frontal views of faces as the training data, we found that our algorithm detected faces in 40 degree side views at a rate of 80% and with no false detections. This indicates that our face detector is robust against variation in the angle of view.

6 Discussion

In this letter, we have proposed a modified decision tree classifier and applied it to the problem of detecting faces. Since our method is not specialized
Table 5: Results on the Sussex Database.

Face View   Double Decision Tree                   Féraud et al. (2001)
(degrees)   Detection Rate (%)  False Detections   Detection Rate (%)
 0          100.0               0                  100.0
10          100.0               0                   87.5
20          100.0               0                   87.5
30          100.0               0                   62.5
40           80.0               0                   50.0
50           60.0               1                    0.0
60           40.0               0                    0.0
70           20.0               0                    0.0
80           10.0               0                    –
90            0.0               0                    –

Notes: The database is composed of 100 images of 10 people with a 90 degree range of views. The results under Féraud et al. (2001) are for a detector that was trained by using frontal views of faces.
for the detection of faces, we expect it to be applicable to other object detection problems.

The operation of the modified decision tree resembles the ID3 method proposed by Quinlan (1979). However, the merging of nodes in our method leads to markedly lower computational times than for ID3. Furthermore, our algorithm achieves improved detection performance on new examples that were not used in training by introducing noise into the feature output.

AdaBoost (for an overview, see, e.g., Schapire, 2002) can be used as a method for feature selection if a weak learner in AdaBoost is constrained to return a weak classifier that depends on only a single feature (Viola & Jones, 2001). Both our method and the AdaBoost-based method select features sequentially. Our infomax-based method selects each feature by maximizing the additional (or conditional) information gain. Thus, in our method, the "importance" of each selected feature can be evaluated by its information gain, and the stopping criterion is simple: stop selecting new features when the maximal additional information gain becomes (almost) zero. The AdaBoost-based method also computes a weight for each feature (for each weak classifier). However, the sum of the weights is not bounded, and thus determining the stopping criterion is not straightforward. For example, Schapire (2002) describes a case in which the test error continues to decrease with the number of boosting rounds even after the training error becomes zero.

Viola and Jones (2001) used the AdaBoost-based feature selection method in their face detection system. They reported that the first and second features selected were located around the eyes. Our experiments showed, consistent with their result, that the information gain around the eyes is large (see Figure 7). The first feature selected in their system is a two-rectangle
feature (one rectangle for an excitatory subfield and the other for an inhibitory subfield) whose width is larger than the distance between the eyes. This feature size is larger than the largest feature size possible in our system (3σ of the largest Gabor feature in our system was set to 8 pixels, 0.44 times the distance between the eyes). In light of their result, the detection performance of our system might be improved if we added features of larger size to the ensemble of possible features.

Appendix A: Procedure for Determining the Threshold and Noise Variance for the Gabor Filter's Output

In this appendix, we give a procedure for determining the threshold t_i and noise variance σ²_{n_i} for use in equations 3.4 and 3.5. Let x = {x_1, x_2, ..., x_n} be a training set of subwindows, each of which is assumed to belong to the face class (ν = 1) or the nonface class (ν = 0). A threshold for each feature, t_i, is determined by maximizing the empirical value of the mutual information between the variable ν and each feature f_i (i = 1, 2, ..., 196,800). Let f_i(t′_i) be the feature with threshold t′_i and I(ν; f_i(t′_i)) the corresponding mutual information. The threshold is then chosen as the t′_i that produces the maximum value of the mutual information:

   t_i = arg max_{t′_i ∈ Θ_i} I(ν; f_i(t′_i)),   (A.1)

where Θ_i = {g^{av}_i + (3√(g^{va}_i)/20) × j | j = −20, −19, ..., 19, 20}, and g^{av}_i and g^{va}_i are the average and variance of the Gabor filter's output over the training subwindows, respectively. The average and variance are given as

   g^{av}_i = (1/n) Σ_{j=1}^{n} g^j_{u⃗_i, k⃗_i, σ_i},   (A.2)

   g^{va}_i = (1/n) Σ_{j=1}^{n} (g^j_{u⃗_i, k⃗_i, σ_i})² − (g^{av}_i)²,   (A.3)

where g^j_{u⃗_i, k⃗_i, σ_i} is the Gabor filter's output for the jth subwindow and n is the number of training subwindows. (The noise variance is assumed to be zero in the calculation of A.1.) A procedure for calculating the mutual information is given in appendix B.

For the noise variance, we assume that it is proportional to the variance of the Gabor filter's outputs:

   σ²_{n_i} = α g^{va}_i,   (A.4)
where α is a fixed constant. We used α = 0.2 in constructing the classifier and in testing images.

The noise is introduced to select features that can separate the training examples with large margins. Suppose, for simplicity, we have two training examples, x_face and x_nonface, and these can be successfully classified by using a feature (filter output) g_1 or by using g_2 (see Figure 10a). The training error is zero in both cases, but we expect the generalization error to be smaller for the feature that separates the data with the larger margin. By introducing the noise, the training error will increase in both cases (see Figure 10b), but the increase will depend on the margin (presumably it also depends on the generalization error), and thus we will be able to select the better feature. In this sense, the noise in our algorithm plays a role similar to that of a complexity penalty term introduced in model selection problems.

Figure 10: Effect of the noise on the training error (schematic representations).

Equation 3.4 means that for each training example, we generate an ensemble of data according to a certain data-generation model, and we use these ensembles for the training. Suppose we generate an ensemble S_x for a training example x. Then the filter output g_i(x′) for x′ ∈ S_x is determined by a probability density that is set in our model to

   p(g_i(x′)) = (1/√(2πσ²_{n,i})) exp(−(g_i(x′) − g_i(x))²/2σ²_{n,i}),

where g_i(x) denotes the filter output for x. We assumed that the parameter σ²_{n_i} is proportional to the variance g^{va}_i, that is, σ²_{n_i} = α g^{va}_i, and experimentally optimized the value of α as α = 0.2. From the data-generation-model viewpoint, the noise has the effect of enlarging the effective training data size. We hoped that this would reduce
the generalization error. Indeed, we found that the noise is effective in reducing the generalization error in our system.

Appendix B: Procedure for Calculating the Empirical Mutual Information

In this appendix, we give a procedure for calculating the empirical mutual information I(ν; F*) between the variable ν, which represents a subwindow label, and a set of features F* = (f*_1, f*_2, ..., f*_m). We let x_1, x_2, ..., x_{l_face} denote the face subwindows used in training and x_{l_face+1}, x_{l_face+2}, ..., x_{l_face+l_nonface} denote the nonface subwindows used in training, where l_face and l_nonface (= l − l_face) denote the numbers of training subwindows that belong to the respective classes. We write the probability that a feature f takes state i ∈ {0, 1} for a subwindow x as p_f(i | x). We now derive a procedure for calculating the mutual information I(ν; F*) by using p_f(i | x), l_face, l_nonface, and l.

By definition, the mutual information is given by

   I(ν; F*) = H(ν) − H(ν | F*),   (B.1)

where the first term denotes the entropy of ν and the second term denotes the conditional entropy. The first term is given by

   H(ν) = −p(ν = 1) log p(ν = 1) − p(ν = 0) log p(ν = 0),   (B.2)

where p(ν = 1) and p(ν = 0) are the probabilities that a subwindow belongs to the face and nonface class, respectively. These are given by

   p(ν = 1) = l_face / l,   (B.3)

   p(ν = 0) = l_nonface / l.   (B.4)

The second term on the right-hand side of equation B.1 is given by

   H(ν | F*) = − Σ_{i_1 ∈ f*_1} Σ_{i_2 ∈ f*_2} ⋯ Σ_{i_m ∈ f*_m} Σ_{ν=0,1} p(ν, i_1, i_2, ..., i_m) log p(ν | i_1, i_2, ..., i_m).   (B.5)

Note that the probability of a subwindow x having a feature f*_j = i_j (0 or 1), that is, p_{f*_j}(i_j | x), is determined by equation 3.4. Thus, the joint probability is written as

   p(i_1, i_2, ..., i_m | x) = Π_{j=1}^{m} p_{f*_j}(i_j | x).   (B.6)
Using equation B.6, the elements of equation B.5 are rewritten as

   p(ν, i_1, i_2, ..., i_m) = Σ_x p(i_1, i_2, ..., i_m | x) p(x, ν)
                            = (1/l) Σ_{x ∈ ν} Π_{j=1}^{m} p_{f*_j}(i_j | x),   (B.7)

   p(ν | i_1, i_2, ..., i_m) = p(ν, i_1, i_2, ..., i_m) / p(i_1, i_2, ..., i_m)
                             = [Σ_{x ∈ ν} Π_{j=1}^{m} p_{f*_j}(i_j | x)] / [Σ_x Π_{j=1}^{m} p_{f*_j}(i_j | x)].   (B.8)
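Specialized to a single discrete node variable with hard assignments, equations B.1 to B.8 reduce to the usual empirical mutual information, and the node-merging invariance proved in appendix C can then be checked numerically. The hard assignments below are an illustrative simplification of the paper's soft probabilities p_f(i | x):

```python
import numpy as np

def mutual_information(labels, node_ids):
    """Empirical I(nu; D) = sum_{nu,d} p(nu,d) log[p(nu,d)/(p(nu)p(d))]."""
    mi = 0.0
    for d in np.unique(node_ids):
        for v in (0, 1):
            p_joint = np.mean((node_ids == d) & (labels == v))
            if p_joint > 0:
                mi += p_joint * np.log2(
                    p_joint / (np.mean(node_ids == d) * np.mean(labels == v)))
    return mi

# Appendix C: merging two nodes with identical class posteriors p(nu | D)
# leaves the mutual information unchanged.
labels = np.array([1, 0, 1, 0, 1, 1])
nodes  = np.array([0, 0, 1, 1, 2, 2])   # p(nu=1 | D_0) = p(nu=1 | D_1) = 1/2
merged = np.array([0, 0, 0, 0, 2, 2])   # D_0 and D_1 merged into one node
assert np.isclose(mutual_information(labels, nodes),
                  mutual_information(labels, merged))
```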
We then use equations B.2, B.5, B.7, and B.8 to obtain the empirical mutual information I(ν; F*) from p_f(i | x), l_face, l_nonface, and l.

Appendix C: Proof of the Invariance of Mutual Information Through Node Merging

We consider the case where two nodes D_1 and D_2 are merged into a single node D. We show that if p(ν | D_1) is equal to p(ν | D_2) for all ν (= 0 or 1), the mutual information is invariant under merging: I(ν; D_before) = I(ν; D_after). Here, the variables D_before and D_after are respectively defined as

   D_before(x) = D_i when the subwindow x belongs to D_i, and

   D_after(x) = D_i when x ∈ D_i, i ≠ 1, 2,
              = D   when x ∈ D_1 or x ∈ D_2.
First, we write I(ν; D_before) as

   I(ν; D_before) = H(ν) − H(ν | D_before).   (C.1)

The second term on the right-hand side of equation C.1 is written as

   H(ν | D_before) = − Σ_{i=1,2} p(D_i) Σ_{ν=0,1} p(ν | D_i) log p(ν | D_i)
                     − Σ_{i≠1,2} p(D_i) Σ_{ν=0,1} p(ν | D_i) log p(ν | D_i).   (C.2)

Our hypothesis is

   p(ν | D) = p(ν | D_1) = p(ν | D_2).   (C.3)

The following equation holds when the nodes are merged:

   p(D) = p(D_1) + p(D_2).   (C.4)
Substituting equations C.3 and C.4 into C.2, we obtain

   H(ν | D_before) = −p(D) Σ_{ν=0,1} p(ν | D) log p(ν | D)
                     − Σ_{i≠1,2} p(D_i) Σ_{ν=0,1} p(ν | D_i) log p(ν | D_i)
                   = H(ν | D_after).   (C.5)
From equations C.1 and C.5, we obtain I(ν; D_before) = I(ν; D_after).

Appendix D: Calculation of p(ν = 1 | *D^m_i) in Equation 3.14

We give a procedure for calculating p(ν = 1 | *D^m_i) on the basis of training examples. Let Y = {y_1, y_2, ..., y_n} be the set of training examples. We determine p(ν = 1 | *D^m_i) by

   p(ν = 1 | *D^m_i) = Σ_{y ∈ Y} p(ν = 1 | y) p(y | *D^m_i).   (D.1)

Here, p(ν = 1 | y) is given as

   p(ν = 1 | y) = 1 when y is a face subwindow, and 0 when y is a nonface subwindow.   (D.2)

On the other hand, p(y | *D^m_i) is given as

   p(y | *D^m_i) = p(*D^m_i | y) p(y) / Σ_{y ∈ Y} p(*D^m_i | y) p(y),   (D.3)

where p(y) = 1/n for all y, and p(*D^m_i | y) is written as

   p(*D^m_i | y) = Σ_j Σ_{f̃*_m ∈ {0,1}} p(*D^m_i | *D^{m−1}_j, f̃*_m) p(*D^{m−1}_j, f̃*_m | y).   (D.4)
Here, $p({}^*D_i^m \mid {}^*D_j^{m-1}, \tilde{f}_m^*)$ is obtained by applying the node-merging rule:

$$p({}^*D_i^m \mid {}^*D_j^{m-1}, \tilde{f}_m^*) = \begin{cases} 1 & \text{when } D^m_{2j-1+\tilde{f}_m^*} \text{ is transferred to } {}^*D_i^m, \\ 0 & \text{otherwise.} \end{cases} \tag{D.5}$$
Note that a subwindow $x \in {}^*D_j^{m-1}$ for which $\tilde{f}_m^* = 0\,(1)$ is assigned to the node $D^m_{2j-1}$ ($D^m_{2j}$) (see section 3.2). The probability $p({}^*D_j^{m-1}, \tilde{f}_m^*(y))$ is modified to obtain

$$p({}^*D_j^{m-1}, \tilde{f}_m^* \mid y) = p({}^*D_j^{m-1} \mid y)\, p(\tilde{f}_m^* \mid {}^*D_j^{m-1}, y). \tag{D.6}$$
An Algorithm for the Detection of Faces
Note that $p(\tilde{f}_m^* \mid {}^*D_j^{m-1}, y)$ is determined by equation 3.4, regardless of ${}^*D_j^{m-1}$. Thus, equation D.6 is written as

$$p({}^*D_j^{m-1}, \tilde{f}_m^* \mid y) = p({}^*D_j^{m-1} \mid y)\, p(\tilde{f}_m^* \mid y). \tag{D.7}$$
Substituting equation D.7 into equation D.4, we obtain

$$p({}^*D_i^m \mid y) = \sum_j \sum_{\tilde{f}_m^* \in \{0,1\}} p({}^*D_i^m \mid {}^*D_j^{m-1}, \tilde{f}_m^*)\, p(\tilde{f}_m^* \mid y)\, p({}^*D_j^{m-1} \mid y). \tag{D.8}$$
Hence, we can iteratively compute the probability $p({}^*D_i^m \mid y)$ from the probability $p({}^*D_i^{m-1} \mid y)$. The probability $p({}^*D_i^1 \mid y)$ is given as

$$p({}^*D_i^1 \mid y) = \sum_{\tilde{f}_1^* \in \{0,1\}} p({}^*D_i^1 \mid \tilde{f}_1^*)\, p(\tilde{f}_1^* \mid y), \tag{D.9}$$
where $p({}^*D_i^1 \mid \tilde{f}_1^*)$ is given by the node-merging rule. Therefore, the probability $p(y \mid {}^*D_i^m)$ in equation D.3 is calculated by using the probabilities $p(\tilde{f}_i^* \mid y)$ ($i = 1, 2, \ldots, m$) and the transitional probabilities $p({}^*D_i^k \mid {}^*D_j^{k-1}, \tilde{f}_k^*)$ ($i, j = 1, 2, \ldots, 2l$; $k = 1, 2, \ldots, m$). Consequently, by substituting equations D.2 and D.3 into equation D.1, we obtain

$$p(\nu = 1 \mid {}^*D_i^m) = \frac{\sum_{y \in \mathrm{face}} p({}^*D_i^m \mid y)}{\sum_{y \in Y} p({}^*D_i^m \mid y)}, \tag{D.10}$$
where $p({}^*D_i^m \mid y)$ is given by equations D.8 and D.9.
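As a quick sanity check on the derivation, equation D.1 combined with D.2 and D.3 must reduce to the ratio in equation D.10. The following sketch verifies this numerically; the node likelihoods $p({}^*D \mid y)$ here are arbitrary random stand-ins, not values produced by the actual detector.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10                                  # number of training examples in Y
is_face = rng.random(n) < 0.4           # p(nu=1|y): 1 for face subwindows (eq. D.2)
p_node_given_y = rng.random(n)          # hypothetical p(*D_i^m | y) for one node

# Eq. D.3 with p(y) = 1/n: posterior over training examples given the node.
p_y_given_node = p_node_given_y / p_node_given_y.sum()

# Eq. D.1: p(nu=1 | *D_i^m) as an expectation over training examples.
post_d1 = float(np.sum(is_face * p_y_given_node))

# Eq. D.10: the same quantity as a ratio of sums over face vs. all subwindows.
post_d10 = float(p_node_given_y[is_face].sum() / p_node_given_y.sum())

print(abs(post_d1 - post_d10) < 1e-12)
```

The two expressions agree to floating-point precision, confirming that the normalization $p(y) = 1/n$ cancels out of the ratio.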
Received October 4, 2002; accepted December 2, 2003.
LETTER
Communicated by Barak Pearlmutter
Analysis of Sparse Representation and Blind Source Separation

Yuanqing Li
[email protected]
Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198, Japan, and Automation Science and Engineering Institute, South China University of Technology, Guangzhou, China

Andrzej Cichocki
[email protected]
Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198, Japan, and Department of Electrical Engineering, Warsaw University of Technology, Warsaw, Poland

Shun-ichi Amari
[email protected]
Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198, Japan
In this letter, we analyze a two-stage cluster-then-$l_1$-optimization approach for sparse representation of a data matrix, which is also a promising approach for blind source separation (BSS) in which fewer sensors than sources are present. First, sparse representation (factorization) of a data matrix is discussed. For a given overcomplete basis matrix, the corresponding sparse solution (coefficient matrix) with minimum $l_1$ norm is unique with probability one and can be obtained using a standard linear programming algorithm. The equivalence of the $l_1$-norm solution and the $l_0$-norm solution is also analyzed within a probabilistic framework. If the obtained $l_1$-norm solution is sufficiently sparse, then it is equal to the $l_0$-norm solution with high probability. Furthermore, the $l_1$-norm solution is robust to noise, but the $l_0$-norm solution is not, showing that the $l_1$ norm is a good sparsity measure. These results can be used as a recoverability analysis of BSS, as discussed. The basis matrix in this article is estimated using a clustering algorithm followed by normalization, in which the matrix columns are the cluster centers of normalized data column vectors. Zibulevsky, Pearlmutter, Bofill, and Kisilev (2000) used this kind of two-stage approach in underdetermined BSS. Our recoverability analysis shows that this approach can deal with the situation in which the sources overlap to some degree in the analyzed domain and with the case in which the number of sources is unknown. It is also robust to additive noise and to estimation error in the mixing matrix. Finally, four simulation examples and an EEG data analysis example are presented to illustrate the algorithm's utility and demonstrate its performance.

Neural Computation 16, 1193–1234 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Some learning problems can be expressed in terms of matrix factorization (e.g., Zibulevsky, Pearlmutter, Bofill, & Kisilev, 2000; Lee & Seung, 1999; Cichocki & Amari, 2002). Different cost functions and imposed constraints may lead to different types of factorization. Sparse representation, or sparse coding, of signals, which can be modeled using matrix factorization, has received a great deal of attention in recent years. Sparse representation of signals using large-scale linear programming under given overcomplete bases (e.g., wavelets) was discussed by Chen, Donoho, and Saunders (1998). A sparse image coding approach using the wavelet pyramid architecture was also presented (Olshausen, Sallee, & Lewicki, 2001), in which wavelet bases were adopted to best represent natural images in terms of sparse coefficients. From the simulation presented in that letter, we find that the algorithm is time-consuming, and its convergence is not discussed. The algorithms of Olshausen and Field (1997) are easy to implement; however, like that of Olshausen et al. (2001), no convergence proof is given. Several improved FOCUSS-based algorithms (Delgado et al., 2003) have been designed to solve underdetermined linear inverse problems in cases when both the dictionary and the sources are unknown; however, generally only a locally optimal solution can be obtained. Recently, Donoho and Elad (2003) discussed optimally sparse representation in general (nonorthogonal) dictionaries via $l_1$ minimization.
Using a deterministic approach, many interesting results have been obtained; for example, less than 50% concentration implies equivalence between the $l_0$-norm solution and the $l_1$-norm solution, and two sufficient conditions are proposed under which the equivalence holds. For instance, one result (theorem 12 of Donoho & Elad, 2003) is that if the number of nonzero entries of the $l_0$-norm solution is less than $\frac{M+1}{2M}$, then the $l_0$-norm solution is the unique $l_1$-norm solution, where $M$ is a bound on all off-diagonal entries of the Gram matrix ($G = D^T D$) of the dictionary $D$. If all columns of the basis matrix (dictionary) are close to orthogonal and the parameter $M$ is very small, then this is a good, strong result. However, if the basis is arbitrary and the number of basis vectors is large, $M$ may not be as small as expected (e.g., it may be larger than $0.5$); this implies that only one or two entries of the $l_0$-norm solution can be nonzero. For this reason, it is not convenient to use the above results unless the basis matrix is very sparse or its columns are close to orthogonal. For instance, in example 1 in this letter,
the $10 \times 15$-dimensional mixing matrix (which Donoho & Elad, 2003, refer to as a dictionary) $A$ is selected randomly from a uniform distribution and its columns are normalized; then, generally, the bound satisfies $\frac{M+1}{2M} < 1.5$. This means that the $l_0$-norm solution is guaranteed to be equivalent to the $l_1$-norm solution only if it has a single nonzero entry. In fact, from example 1, we see that the equivalence still holds with a probability higher than 0.97, even if the $l_0$-norm solution has five nonzero entries. In this letter, different results are obtained using a probabilistic approach.

Sparse representation can be used for blind source separation (BSS). In several references, the mixing matrix and sources were estimated using the maximum a posteriori approach and the maximum likelihood approach (Zibulevsky & Pearlmutter, 2001; Zibulevsky et al., 2000; Lee, Lewicki, Girolami, & Sejnowski, 1999). A variational expectation-maximization algorithm for sparse representation was proposed by Girolami (2001), which can also be used in overcomplete BSS, especially in a noisy environment. Jang and Lee (2003) presented a technique of maximum likelihood subspace projection for extracting multiple sources in cases when only a single-channel observation is available. However, these kinds of algorithms are limited by local minima. Furthermore, a two-step approach was proposed for underdetermined BSS in which the mixing matrix and the sources are estimated separately (Bofill & Zibulevsky, 2001). Bofill and Zibulevsky (2001) performed the BSS in the transformed domain. A potential-function-based method was presented for estimating the mixing matrix, which was very effective in the two-dimensional observable data space. The sources were then estimated by minimizing the $l_1$ norm (the shortest-path separation criterion). Blind source separation was also performed using multinode sparse representation (Zibulevsky, Kisilev, Zeevi, & Pearlmutter, 2001).
Based on several subsets of wavelet packet coefficients, the mixing matrix was estimated using the fuzzy C-means clustering algorithm, and the sources were recovered using the inverse of the estimated mixing matrix.

The above approach is promising; however, several fundamental problems are still open. In this letter, we discuss the following issues:

1. Why and when can the $l_1$-norm solution be taken as the true source vector? This is a recoverability problem. If the sources are nonoverlapped (in the time domain or a transformed domain), the answer is obvious. But if the sources are overlapped (i.e., some source vectors have more than one nonzero entry), the answer is not obvious and needs further investigation. Sparse representation of a known data matrix is discussed first in this letter, and several results on the equivalence of the $l_0$-norm solution and the $l_1$-norm solution are obtained from the viewpoint of probability. Based on these results, the recoverability analysis in BSS is established. A simulation example is given to illustrate our recoverability analysis.
2. According to our recoverability analysis, the sparser the sources are in the time or transformed domain, the higher the probability is that each source vector can be recovered correctly. In fact, the original sources must be sufficiently sparse in order to estimate or identify the mixing matrix using clustering methods. Thus, sparsity of the sources in the time or the transformed domain plays an important role in the performance of the two-stage BSS approach. This fact has not been investigated deeply until now.

3. How should we deal with the case in which the number of sources is also unknown? In contrast to other publications, in our method we assume that the number of sources is generally unknown and that only an upper limit on the number of sources needs to be roughly known. The estimated mixing matrix in our approach is overestimated; it contains more columns than the original mixing matrix. The effectiveness of this kind of overestimated mixing matrix has been proved theoretically and demonstrated by simulations.

4. We consider the case when the number of observations (sensor signals) is arbitrary, generally more than two. According to our recoverability analysis, more sensor signals are helpful in dealing with more sources and with more overlapped (less sparse) sources.

5. We analyze the robustness of this kind of approach to additive noise and to estimation errors in the mixing matrix.

This letter first considers sparse representation (factorization) of a data matrix based on the following model,

$$X = BS, \tag{1.1}$$

where $X = [x(1), \ldots, x(K)] \in R^{n \times K}$ ($K \gg 1$) is a known data matrix, $B = [b_1, \ldots, b_m]$ is an $n \times m$ basis matrix, and $S = [s_1, \ldots, s_K] = [s_{ij}]_{m \times K}$ is a coefficient matrix, also called a solution corresponding to the basis matrix $B$. Generally, $m > n$, which implies that the basis is overcomplete. To find a reasonable overcomplete basis for equation 1.1, such that the coefficients are as sparse as possible, the following two trivial cases should be removed from consideration in advance: (1) the number of basis vectors is arbitrary, and (2) the norms of the basis vectors are unbounded. In case 1, we could set the number of basis vectors to be that of the data vectors, with the basis composed of the data vectors themselves. In case 2, if the norms of the basis vectors tend to infinity, the coefficients tend to zero. Thus, we begin with two assumptions:

Assumption 1: The number of basis vectors $m$ is fixed in advance and satisfies the condition $n \leq m \ll K$.

Assumption 2: All basis vectors are unit vectors, with their 2-norms equal to 1.

It is well known that, generally, there exists an infinite number of solutions of the factorization equation 1.1. For a given basis matrix, under the
sparsity measure of the $l_1$ norm, a uniqueness result for the sparse solution is obtained in this letter. The equivalence of the $l_1$- and $l_0$-norm solutions and their robustness to noise are also discussed. The results can be used as a recoverability analysis in blind sparse source separation. Theoretically, among all the basis matrices, there exists a best basis, such that the corresponding coefficient matrix is the sparsest. Although finding the best basis remains an open problem, we find that the basis whose column vectors are the cluster centers of $X$ is a suboptimal basis. It can be obtained using the K-means clustering algorithm followed by normalization. Using this kind of basis matrix, the corresponding coefficient matrix will be very sparse, a result that is evident in simulation. Additionally, as will be evident in this letter, the basis matrix can be used as an estimate of the mixing matrix in BSS. In the next section, this kind of sparse representation approach is used in blind sparse source separation for problems having fewer sensors than sources and an unknown number of sources. The clustering algorithm and the minimum-$l_1$-norm criterion are used for estimating the mixing matrix and the sources, respectively. From the equivalence analysis of the $l_1$- and $l_0$-norm solutions, we obtain the recoverability result: if the source vector is sufficiently sparse, then it can be successfully recovered with high probability. If this is the case, blind separation can be carried out directly. Otherwise, blind separation can be implemented in the time-frequency domain after wavelet packet transformation preprocessing of the mixture signals. The noise in blind sparse source separation comes from two categories: additive noise and the estimation error of the mixing matrix.
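The clustering stage just described can be sketched in a few lines. This is an illustration only: the function `kmeans_basis` and its farthest-point initialization are our own simplified stand-ins, not the letter's implementation.

```python
import numpy as np

def kmeans_basis(X, m, iters=20):
    """Cluster the normalized data columns with plain K-means and return the
    normalized cluster centers as the estimated basis (mixing) matrix."""
    cols = (X / np.linalg.norm(X, axis=0, keepdims=True)).T   # K points in R^n
    centers = [cols[0]]
    for _ in range(m - 1):                                    # farthest-point init
        d = np.min([((cols - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(cols[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):                                    # Lloyd iterations
        labels = np.argmin(((cols[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = cols[labels == j].mean(axis=0)
    B = centers.T
    return B / np.linalg.norm(B, axis=0, keepdims=True)

# Demo: each data column is a positive multiple of one (hypothetical) basis
# column, so the normalized columns cluster exactly at the basis vectors.
rng = np.random.default_rng(0)
n, m, K = 5, 3, 200
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)
S = np.zeros((m, K))
S[rng.integers(0, m, size=K), np.arange(K)] = rng.uniform(0.5, 1.5, size=K)
B = kmeans_basis(A @ S, m)
match = np.abs(A.T @ B).max(axis=1)   # best |cosine| per true column
print(np.all(match > 1 - 1e-6))
```

In this idealized 1-sparse demo, the cluster centers recover the true columns up to permutation; with partially overlapped sources they are only approximations.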
If the additive noise is sufficiently low and the estimated matrix contains a submatrix that is sufficiently close to the original mixing matrix, then we can recover the sources efficiently, even if the estimated matrix has more columns than the original mixing matrix. Thus, the approach in this letter can be used to deal with the cases in which there are more sources than sensors and in which the number of sources is unknown.

The remainder of this letter is organized as follows. Section 2 analyzes the sparse representation of a data matrix based on the $l_1$-norm sparsity measure. Section 3 discusses BSS, in which fewer sensors are present than sources, via sparse representation. Simulation results are presented in section 4. Concluding remarks in section 5 review the advantages of the proposed algorithm and state the remaining aspects to study.

2 Sparse Representation of Data Matrix

In this section, the two-stage cluster-then-$l_1$-optimization approach is analyzed for sparse representation of a data matrix. Two algorithms are used: a linear programming algorithm for estimating the coefficient matrix and a K-means algorithm for estimating the basis matrix. Several results are obtained on the equivalence of the $l_1$-norm solution and the $l_0$-norm solution and on the robustness of these kinds of solutions to noise. These results can be used as a recoverability analysis of BSS, as will be explained.

2.1 Linear Programming Algorithm for Estimating a Coefficient Matrix for a Given Basis Matrix. For a given basis matrix $B$ in equation 1.1, the coefficient matrix can be found by maximizing the posterior distribution $P(S \mid X, B)$ of $S$ (Lewicki & Sejnowski, 2000). Under the assumption that the prior of $S$ is Laplacian, maximizing the posterior distribution can be implemented by solving the following optimization problem (Chen et al., 1998; Zibulevsky et al., 2000):

$$\min \sum_{i=1}^{m} \sum_{j=1}^{K} |s_{ij}|, \quad \text{subject to } BS = X. \tag{2.1}$$

Thus, the $l_1$ norm,

$$J(S) = \sum_{i=1}^{m} \sum_{j=1}^{K} |s_{ij}|, \tag{2.2}$$

is used as the sparsity measure in this article. To facilitate the discussion, we first present a definition.

Definition 1. For a given basis matrix $B$, denote the set of all solutions of equation 1.1 as $\mathcal{D}$. The solution $S_B = \arg\min_{S \in \mathcal{D}} J(S)$ is called a sparse solution with respect to the basis matrix $B$. The corresponding factorization 1.1 is said to be a sparse factorization or sparse representation.
For a given basis matrix $B$, the sparse solution of equation 1.1 can be obtained by solving the linear programming problem, equation 2.1. It is not difficult to prove that this linear programming problem is equivalent to the following $K$ smaller-scale linear programming problems,

$$\min \sum_{i=1}^{m} |s_{ij}|, \quad \text{subject to } B s_j = x(j), \tag{2.3}$$

where $j = 1, \ldots, K$. Setting $S = U - V$, where $U = [u_{ij}]_{m \times K}$ and $V = [v_{ij}]_{m \times K}$ are nonnegative, equation 2.3 can be converted to the following linear programming problems with nonnegativity constraints,

$$\min \sum_{i=1}^{m} (u_{ij} + v_{ij}), \quad \text{subject to } [B, -B][u_j^T, v_j^T]^T = x(j), \; u_j \geq 0, \; v_j \geq 0, \tag{2.4}$$

where $j = 1, \ldots, K$, $u_j = [u_{1j}, u_{2j}, \ldots, u_{mj}]^T$, and $v_j = [v_{1j}, v_{2j}, \ldots, v_{mj}]^T$.
Theorem 1. For almost all bases $B$, the sparse solution of equation 1.1 is unique. That is, the set of bases $B$ under which the sparse solution of equation 1.1 is not unique is of measure zero. Moreover, the solution has at most $n$ nonzero entries.

The proof is in appendix A. In some references, the $l_0$ norm,

$$J_0(S) = \sum_{i=1}^{m} \sum_{j=1}^{K} |s_{ij}|^0,$$

which is the number of nonzero entries of $S$, is used as a sparsity measure of $S$. Under this measure, the sparsest solution is obtained by solving the problem

$$\min \sum_{i=1}^{m} \sum_{j=1}^{K} |s_{ij}|^0, \quad \text{subject to } BS = X. \tag{2.5}$$
In the next section, we discuss the equivalence of the $l_0$-norm solution and the $l_1$-norm solution.

2.2 Equivalence of the $l_0$-Norm Solution and $l_1$-Norm Solution. First, we introduce two optimization problems:

$$(P_0) \quad \min \sum_{i=1}^{m} |s_i|^0, \quad \text{subject to } As = x;$$

$$(P_1) \quad \min \sum_{i=1}^{m} |s_i|, \quad \text{subject to } As = x,$$

where $A \in R^{n \times m}$ and $x \in R^n$ are a known basis matrix and a known data vector, respectively, and $s \in R^m$, $n \leq m$. Suppose that $s^0$ is a solution of $(P_0)$ and $s^1$ is a solution of $(P_1)$. Although the solution of $(P_0)$ is the sparsest, solving $(P_0)$ is not an efficient way of finding the sparsest solution, for three reasons:

1. If $\|s^0\|_0 = n$, then the solution of $(P_0)$ is generally not unique, since any solution satisfying the constraint equations with $n$ nonzero entries is a solution of $(P_0)$.

2. Until now, no effective algorithm existed to solve the optimization problem $(P_0)$.

3. As will be seen in section 2.3, the solution of $(P_0)$ is not robust to noise.

However, the solution of $(P_1)$ is unique with probability one and, from theorem 1, has at most $n$ nonzero entries. It is well known that there are many strong tools to solve the problem $(P_1)$. The following problem arises: What is the condition under which the solution of $(P_1)$ is one of the sparsest
solutions; that is, the solution has the same number of nonzero entries as the solution of $(P_0)$? In the following, we discuss this problem.

Lemma 1. Suppose that $x \in R^n$ and $A \in R^{n \times m}$ are selected randomly. If $x$ is represented by a linear combination of $k$ column vectors of $A$, then $k \geq n$ generally; that is, the probability that $k < n$ is zero.

Proof. Since $A \in R^{n \times m}$ is selected randomly, any $n$ columns of $A$ are linearly independent with probability one. Given that $x$ is selected randomly, it is linearly independent of any $n - 1$ columns of $A$ with probability one. Thus the result of the lemma holds.

From the results of Donoho and Elad (2003), we can see that if $\|s^0\|_0 < \frac{n+1}{2}$, then $s^0$ is the unique solution of $(P_0)$. Furthermore, if $\|s^1\|_0 < \frac{n+1}{2}$, then $s^0 = s^1$. Furthermore, we have the following result:
Theorem 2. If $\|s^0\|_0 < n$, then $s^0$ is the unique solution of $(P_0)$ with probability one. And if $\|s^1\|_0 < n$, then $s^0 = s^1$ with probability one.

Proof. Given $\|s^0\|_0 < n$, if $s^0$ were not unique, then $x$ would have another representation as a linear combination of fewer than $n$ column vectors of $A$. From lemma 1, the probability of this is zero. Hence, $s^0$ is unique with probability one. Similarly, from lemma 1, if $\|s^1\|_0 < n$, then $x = As^1$ is, with probability one, the unique representation using fewer than $n$ columns of $A$. Thus, $s^0 = s^1$ with probability one.

Now we introduce four models and then define several related probabilities. The first model is

$$x = A_l s_l, \tag{2.6}$$

where $A_l \in R^{n \times l}$, $s_l \in R^l$, $l \leq n$, and all entries of $A_l$ and $s_l$ are selected randomly according to a distribution (e.g., a uniform distribution). All columns of $A_l$ are normalized to unit length. The second model is

$$x = A^l_{(i_1, \ldots, i_k)} s^l_{(i_1, \ldots, i_k)}, \tag{2.7}$$

where $x$ is generated by equation 2.6, $A^l_{(i_1, \ldots, i_k)} \in R^{n \times n}$ is composed of the $l - k$ columns of $A_l$ with indices $\{1, \ldots, l\} - \{i_1, \ldots, i_k\}$ together with $n - l + k$ normalized
column vectors newly selected randomly according to the same distribution as in equation 2.6, where $1 \leq k \leq l$; $s^l_{(i_1, \ldots, i_k)} \in R^n$ is the solution of the equations. The third model is

$$x = \bar{A}_l \bar{s}_l, \tag{2.8}$$

where $x$ is generated by equation 2.6, $\bar{A}_l \in R^{n \times n}$ is composed of at most $l$ columns of $A_l$, with the other normalized column vectors newly selected randomly according to the same distribution as in equation 2.6, and $\bar{s}_l \in R^n$ is the solution of the equations.

Suppose that $\bar{A}_{m-l} \in R^{n \times (m-l)}$ is selected randomly according to the same distribution as in equation 2.6, with its columns normalized. We have the fourth model,

$$x = A_{(i_1, \ldots, i_k, j_1, \ldots, j_{n-l+k})} s_{(i_1, \ldots, i_k, j_1, \ldots, j_{n-l+k})}, \tag{2.9}$$

where $x$ is generated by equation 2.6, and $A_{(i_1, \ldots, i_k, j_1, \ldots, j_{n-l+k})} \in R^{n \times n}$ is obtained by using the $n - l + k$ columns of $\bar{A}_{m-l}$ with indices $(j_1, \ldots, j_{n-l+k})$ to replace the $k$ columns of $A_l$ with indices $(i_1, \ldots, i_k)$; $s_{(i_1, \ldots, i_k, j_1, \ldots, j_{n-l+k})}$ is the corresponding solution of equation 2.9.

Define the probabilities

$$\beta(k, l, n) = P\big(\|s_l\|_1 < \|s^l_{(i_1, \ldots, i_k)}\|_1\big), \tag{2.10}$$

where $1 \leq i_1 < \cdots < i_k \leq l$ are $k$ indices selected randomly;

$$P(k, l, n) = P\big(\|s_l\|_1 < \min\{\|s^l_{(i_1, \ldots, i_k)}\|_1,\ 1 \leq i_1 < \cdots < i_k \leq l\}\big), \tag{2.11}$$

where $k = 1, \ldots, l$, and denote by $s^l_k$ the solution of equation 2.7 satisfying $\|s^l_k\|_1 = \min\{\|s^l_{(i_1, \ldots, i_k)}\|_1,\ 1 \leq i_1 < \cdots < i_k \leq l\}$; and

$$P(k, l, n, m) = P\big(\|s_l\|_1 < \min\{\|s_{(i_1, \ldots, i_k, j_1, \ldots, j_{n-l+k})}\|_1,\ 1 \leq i_1 < \cdots < i_k \leq l,\ 1 \leq j_1 < \cdots < j_{n-l+k} \leq m - l\}\big), \tag{2.12}$$

where $k = 1, \ldots, l$.

Now we consider the optimization problem $(P_0)$. If $l$ denotes $\|s^0\|_0$ and $l < n$, then by theorem 2, $s^0$ is the unique solution of $(P_0)$ with probability one. The data vector $x$ can be seen as generated by the model 2.6, where $s_l$ is composed of the nonzero entries of $s^0$ and $A_l$ is composed of the column vectors of $A$ with the same indices as the entries of $s_l$. From theorem 1, the solution $s^1$ of the problem $(P_1)$ has at most $n$ nonzero entries; thus we consider only the solutions of the constraint equation of $(P_0)$ that have
at most $n$ nonzero entries ($s^1$ is one of these solutions). A solution of the constraint equation of $(P_0)$ can be seen as generated by the model 2.9 for some $k$, where $\bar{A}_{m-l} = A \setminus A_l$. In view of the definition of $P(k, l, n, m)$, we have the following probability equality,
$$P(s^0 = s^1) = P\left(\bigcap_{k=1}^{l} \big\{\|s_l\|_1 \leq \min\{\|s_{(i_1, \ldots, i_k, j_1, \ldots, j_{n-l+k})}\|_1,\ 1 \leq i_1 < \cdots < i_k \leq l,\ 1 \leq j_1 < \cdots < j_{n-l+k} \leq m - l\}\big\}\right) \geq \prod_{k=1}^{l} P(k, l, n, m). \tag{2.13}$$
Thus, the probability $P(k, l, n, m)$ plays an important role in the equivalence analysis of the $l_0$-norm solution and the $l_1$-norm solution. In the following, several results regarding the probability $P(k, l, n, m)$ are obtained after some necessary preparations. For $l = n$, we first present two lemmas about the probabilities $\beta(j, n, n)$ and $P(j, n, n)$.

Lemma 2.
1. $\beta(1, n, n) \leq \beta(2, n, n) \leq \cdots \leq \beta(n, n, n) \leq 1$.
2. $\lim_{n \to +\infty} \beta(n, n, n) = 1$.

The proof is in appendix B.

Lemma 3.
1. $P(1, n, n) \leq P(2, n, n) \leq \cdots \leq P(n, n, n) \leq 1$.
2. For fixed $k \geq 0$, $\lim_{n \to +\infty} P(n - k, n, n) = 1$.
The proof is in appendix C. Figure 1 shows several simulation results illustrating the two results of lemma 3. For the models 2.6 and 2.7, all coefficient matrices and $s_l$ are drawn from a uniform distribution. All probabilities are estimated from 1000 independent, repeated experiments. In the first subplot, the probability curve of $P(k, 12, 12)$ is presented for $k = 1, \ldots, 12$. In the second subplot, two probability curves are presented: the curve with stars corresponds to $P(4n, 4n, 4n)$, and the curve with points corresponds to $P(4n - 1, 4n, 4n)$, where $n$ runs from 1 to 30. Now we present several conclusions about $P(k, l, n)$.
Figure 1: (Left) Probability curve $P(k, 12, 12)$. (Right) Two probability curves: the curve with stars refers to $P(4n, 4n, 4n)$, and the curve with points refers to $P(4n - 1, 4n, 4n)$.
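Probability curves of this kind can be estimated with a short Monte Carlo sketch. Note that this is a simplified stand-in: it estimates the chance that the minimum-$l_1$-norm solution recovers a random $l$-sparse source for a random normalized basis, rather than $P(k, l, n)$ itself, and the function names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(A, x):
    """Minimum l1-norm solution of A s = x via the LP substitution s = u - v."""
    n, m = A.shape
    res = linprog(np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=x,
                  bounds=[(0, None)] * (2 * m))
    return res.x[:m] - res.x[m:]

def recovery_rate(n, m, l, trials=50, seed=0):
    """Fraction of trials in which the l1 solution equals the l-sparse source."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        A = rng.standard_normal((n, m))
        A /= np.linalg.norm(A, axis=0)
        s = np.zeros(m)
        idx = rng.choice(m, size=l, replace=False)
        s[idx] = rng.uniform(0.5, 1.5, size=l) * rng.choice([-1.0, 1.0], size=l)
        if np.allclose(l1_solve(A, A @ s), s, atol=1e-6):
            hits += 1
    return hits / trials

rate_sparse = recovery_rate(10, 15, 1)   # very sparse: recovery is typical
rate_dense = recovery_rate(10, 15, 8)    # less sparse: recovery becomes rarer
print(rate_sparse, rate_dense)
```

Sweeping `l` (or `n` at fixed ratios) reproduces the qualitative shape of the curves: the rate falls as the number of nonzero entries grows.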
Theorem 3.
1. $P(1, l, n) \leq P(2, l, n) \leq \cdots \leq P(l, l, n) \leq 1$.
2. $1 = P(1, 1, n) \geq P(1, 2, n) \geq \cdots \geq P(1, n, n)$.
3. For fixed $l$, $\lim_{n \to +\infty} P(k, l, n) = 1$, $1 \leq k \leq l$.
4. For the solution $\bar{s}_l$ of equation 2.8, the probability $P(\|s_l\|_1 \leq \|\bar{s}_l\|_1) \geq (P(1, l, n))^l$, and for fixed $l$, $\lim_{n \to +\infty} P(\|s_l\|_1 \leq \|\bar{s}_l\|_1) = 1$.
The proof is in appendix D. We also have several conclusions about the probability $P(k, l, n, m)$ in equation 2.12.

Theorem 4.
1. $P(1, l, n, m) \leq P(2, l, n, m) \leq \cdots \leq P(l, l, n, m) \leq 1$.
2. $1 = P(1, 1, n, m) \geq P(1, 2, n, m) \geq \cdots \geq P(1, n, n, m)$.
3. For fixed $l$, $m - n$, and $1 \leq k \leq l$, $\lim_{n \to +\infty} P(k, l, n, m) = 1$.
4. For fixed $k, l, n$, $\lim_{m \to +\infty} P(k, l, n, m) = 0$.
5. Considering $(P_1)$ with $A = [A_l, \bar{A}_{m-l}]$, we have $P(\hat{s}_l = s^1) \geq (P(1, l, n, m))^l$, where $\hat{s}_l = [s_l^T, 0, \ldots, 0]^T \in R^m$. Furthermore, for fixed $l$ and $m - n$, $\lim_{n \to +\infty} P(\hat{s}_l = s^1) = 1$.
6. For the optimization problems $(P_0)$ and $(P_1)$, set $l = \|s^0\|_0$. Then $P(s^0 = s^1) \geq (P(1, l, n, m))^l$. Furthermore, for given positive integers $l_0$ and $n_0$, if $l \leq l_0$ and $m - n \leq n_0$, then $\lim_{n \to +\infty} P(s^0 = s^1) = 1$.
The proof is in appendix E.

Remark 1.
1. From the fifth result of theorem 4, for fixed $l$ and $m - n$, if $n$ is sufficiently large, then $s_l$ can be recovered correctly with high probability by solving the optimization problem $(P_1)$.
2. From the second and fifth results, if $n$ and $m$ are fixed and $l$ is sufficiently small, then $s_l$ can also be recovered correctly with high probability by solving the optimization problem $(P_1)$. These results can be seen in simulation example 1.
3. The sixth result of this theorem is the equivalence result for the $l_0$-norm solution and the $l_1$-norm solution. We can see that if the $l_0$-norm solution is sufficiently sparse (i.e., $\|s^0\|_0$ is sufficiently small), then $s^0 = s^1$ with high probability (e.g., larger than 0.95). However, if this assumption is not satisfied for the analyzed data, a preprocessing step (e.g., a Fourier transformation, wavelet transformation, or wavelet packet transformation) applied to the known data matrix $X$ is necessary so that the assumption holds in the transform domain (see section 3 and simulation example 3). The applied preprocessing depends on the nature of the analyzed data.
4. Theorem 4 will be used as a recoverability analysis in the blind sparse source separation of this article.

An interesting problem is the following: if only the ratios $\gamma_1 = l/n$ and $\gamma_2 = n/m$ are fixed (implying that $l$ tends to $+\infty$ as $n$ tends to $+\infty$), how does the probability $P(\hat{s}_l = s^1)$ change as $n \to +\infty$? Set $P(\gamma_1, \gamma_2, n) = P(\hat{s}_l = s^1)$. We present a conjecture and its variant next.

Conjecture.
Conjecture. There exists θ1(γ2) ∈ (0, 1] such that for γ1 < θ1(γ2),

lim_{n→+∞} P(γ1, γ2, n) = 1. (2.14)

There is another version of the conjecture:

Conjecture. There exists θ2(γ1) ∈ (0, 1] such that for γ2 ≥ θ2(γ1),

lim_{n→+∞} P(γ1, γ2, n) = 1. (2.15)

Note that if l < n, then ŝ_l is equal to the solution s^0 of (P0) with probability one from theorem 2, where A = [A_l, Ā_{m−l}].
Analysis of Sparse Representation and Blind Source Separation
Figure 2: (Left) Three probability curves P(γ1, γ2, l) with γ2 = 0.5 and γ1 = 1/3, 0.5, and 0.8, from top to bottom, respectively. (Right) Three probability curves P(γ1, γ2, l) with γ1 = 0.5 and γ2 = 0.7, 0.5, and 0.25, from top to bottom, respectively, where l is the number of nonzero entries of the source vector s_l.
Now we present several simulation results to illustrate the two conjectures. For the problem (P1), the coefficient matrix A = [A_l, Ā_{m−l}] and the vector s_l composed of the nonzero entries of the original source vector are selected randomly according to the uniform distribution. All the probabilities are estimated from 300 independent repeated experiments. Figure 2 shows the estimated probability curves as a function of the number of nonzero entries of the source vectors. In the first subplot, γ2 = 0.5; the three curves marked with ·, □, and + correspond to γ1 = 1/3, 0.5, and 0.8, respectively. In the second subplot, γ1 = 0.5; the three curves marked with ·, □, and + correspond to γ2 = 0.7, 0.5, and 0.25, respectively.

2.3 Robustness Analysis of the l1 and l0 Norm Solutions Under Noise Conditions. In this section, we reconsider the optimization problems (P0) and (P1) in the presence of noise and analyze their robustness. Rewrite the models (P0) and (P1) with a noise term as follows:

(P0')  min Σ_{i=1}^m |s_i|^0, subject to A s + v = x,

(P1')  min Σ_{i=1}^m |s_i|, subject to A s + v = x,

where v ∈ R^n is a noise vector.
For a basis matrix A and a data vector x, denote the solutions of (P0') and (P1') as s_0^v and s_1^v, respectively. First, we discuss the robustness of the solution of (P0') to noise. There are two cases: (1) ||s^0||_0 < n and (2) ||s^0||_0 = n. We consider them in turn.

In case 1, ||s^0||_0 < n. This implies that x can be represented by a combination of fewer than n columns of A, and the solution s^0 is unique with probability one from theorem 2. From the constraint equation of (P0'), we have

x − v = A s_0^v. (2.16)

Noting that v is a noise vector and that equation 2.16 is a representation of the vector x − v, we have ||s_0^v||_0 = n with probability one from lemma 1. There is another obvious fact: if ||s_0^v||_0 = n, then s_0^v is not a unique solution of the problem (P0').

In case 2, ||s^0||_0 = n. In this case, ||s_0^v||_0 = n with probability one, and s_0^v is not a unique solution of the problem (P0').

It follows from the preceding discussion that the solution of (P0) is not robust with respect to additive noise. Now we analyze the robustness of the solution of (P1). From theorem 1, ||s^1||_0 ≤ n. We consider only the case ||s^1||_0 = n (the case ||s^1||_0 < n can be discussed similarly).

Suppose that the indices of the nonzero entries of s^1 are (i_1, …, i_n), the matrix Ã_1 is composed of the n columns of A with indices (i_1, …, i_n), and the n-dimensional column vector s̃^1 is composed of all the nonzero entries of s^1. Thus, we have

x = Ã_1 s̃^1. (2.17)
Considering the determined equations

x = Ã_1 s̃^v + v, (2.18)

we have

s̃^v = Ã_1^{−1} x − Ã_1^{−1} v. (2.19)

Thus,

||s̃^v||_1 ≤ ||Ã_1^{−1} x||_1 + ||Ã_1^{−1} v||_1 ≤ ||s^1||_1 + ||Ã_1^{−1}||_1 ||v||_1. (2.20)

Define

α_0 = min{ ||s̃||_1 − ||s^1||_1 : x = Ã s̃, Ã ≠ Ã_1 }, (2.21)
where Ã is an n × n submatrix of A. Since there are only C_m^n − 1 minor matrices of A other than Ã_1, and the solution s^1 is unique with probability one from theorem 1, it follows that α_0 > 0 with probability one. Furthermore, consider the determined equations

x = Ã s̄^v + v; (2.22)

we have

s̄^v = Ã^{−1} x − Ã^{−1} v. (2.23)

In view of equations 2.23, 2.21, and 2.20, we have

||s̄^v||_1 ≥ ||Ã^{−1} x||_1 − ||Ã^{−1} v||_1
        ≥ (||s^1||_1 + α_0) − ||Ã^{−1}||_1 ||v||_1
        ≥ (||s̃^v||_1 − ||Ã_1^{−1}||_1 ||v||_1 + α_0) − ||Ã^{−1}||_1 ||v||_1
        ≥ ||s̃^v||_1 + α_0 − 2M ||v||_1, (2.24)

where M = max{ ||Ã^{−1}||_1 : Ã is an n × n submatrix of A }.
From equation 2.24, if ||v||_1 < α_0 / (2M), then

||s̃^v||_1 < ||s̄^v||_1, (2.25)

that is, s̃^v = s̃_1^v, which is composed of all the nonzero entries of s_1^v. This implies that s_1^v has the same nonzero entry indices as s^1. From equations 2.19 and 2.17,

||s̃^v − s̃^1||_1 ≤ ||Ã_1^{−1}||_1 ||v||_1 ≤ M ||v||_1, (2.26)

that is,

||s_1^v − s^1||_1 ≤ M ||v||_1. (2.27)
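The robustness expressed by the bound 2.27 can be observed numerically. A minimal sketch (hypothetical Gaussian data and noise level, not from the letter; SciPy's linprog stands in for the linear programming solver, and we only check that a small perturbation v produces a correspondingly small change in the l1 solution):

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(A, x):
    """min ||s||_1 subject to A s = x, via the LP with s = u - v, u, v >= 0."""
    m = A.shape[1]
    res = linprog(np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=x,
                  bounds=(0, None))
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(2)
n, m = 6, 10
A = rng.standard_normal((n, m))
s = np.zeros(m)
s[[1, 4]] = [1.0, -0.5]              # sparse source vector
x = A @ s

v = 1e-6 * rng.standard_normal(n)    # small additive noise, A s + v = x
s1 = l1_solve(A, x)                  # noise-free l1 solution
s1v = l1_solve(A, x - v)             # l1 solution under the noisy constraint

# The change in the solution is of the order of ||v||_1 (cf. inequality 2.27).
print(np.sum(np.abs(s1v - s1)))
```

The printed l1 distance between the two solutions should be tiny, of the same order as ||v||_1, whereas no such continuity holds for the l0 solution.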
From the analysis above, we find that the solution of (P1) is robust to noise. Thus, we have the following result:

Theorem 5. The solution of (P0) is not robust to noise, while the solution of (P1) is robust to noise, at least to some degree.

2.4 The Algorithm for Estimating a Suboptimal Basis Matrix. It follows from theorem 1 that for any given basis, there exists a unique sparse solution of equation 2.1 with probability one. Among all basis matrices
that satisfy assumptions 1 and 2, there exists at least one basis matrix such that the corresponding solution of equation 2.1 is the sparsest. It is very difficult to find this best basis. We therefore use either the gradient-type algorithm presented below or the typical K-means clustering algorithm to estimate a suboptimal basis matrix composed of the cluster centers of the normalized known data vectors, as in Zibulevsky et al. (2000). With this kind of cluster-center basis matrix, the corresponding coefficient matrix can become very sparse, although it is generally not the sparsest. This can be proved when the data vectors are two-dimensional, and it is evident in simulations.

First, we introduce an objective function,

J2(b_1, …, b_m) = Σ_{j=1}^m Σ_{x(i) ∈ θ_j^0(b_j)} d(x(i), b_j)
                = Σ_{j=1}^m Σ_{x(i) ∈ θ_j^0(b_j)} sqrt( (x_{1i} − b_{1j})^2 + … + (x_{ni} − b_{nj})^2 ), (2.28)
where θ_j^0(b_j) is the set of data column vectors with center b_j; x(i) ∈ θ_j^0(b_j) if and only if d(x(i), b_j) = min{ d(x(i), b_k), k = 1, …, m }. By solving the following optimization problem, we can obtain the suboptimal basis matrix:

min J2, s.t. b_{1j}^2 + … + b_{nj}^2 = 1, j = 1, …, m. (2.29)
The gradient of J2 with respect to b_{lj} is

∂J2/∂b_{lj} = − Σ_{x(i) ∈ θ_j^0(b_j)} ( (x_{1i} − b_{1j})^2 + … + (x_{ni} − b_{nj})^2 )^{−1/2} (x_{li} − b_{lj}), (2.30)

where l = 1, …, n and j = 1, …, m. We have the following gradient-type algorithm, followed by normalization:

b'_{lj}(k+1) = b_{lj}(k) + η Σ_{x(i) ∈ θ_j^0(b_j)} ( (x_{1i} − b_{1j})^2 + … + (x_{ni} − b_{nj})^2 )^{−1/2} (x_{li} − b_{lj}),  l = 1, …, n,

b_j(k+1) = b'_j(k+1) / ||b'_j(k+1)||, (2.31)

where η is the step size and j = 1, …, m. We can also use a K-means clustering algorithm to estimate the basis matrix. Thus, we have the following algorithm steps:
Algorithm Outline 1
Step 1. Normalize the data vectors,

X_0 = [ x(1)/||x(1)||, …, x(K)/||x(K)|| ] = [x_1^0, …, x_K^0],

where x(i), i = 1, …, K, are the column vectors of the data matrix X.
Step 2. Determine the starting points of the iteration. For i = 1, …, n, determine the maximum and minimum of the ith row of X_0, denoted M_i and m_i. Choose a sufficiently large positive integer N, and divide the set [m_1, M_1] × … × [m_n, M_n] into N subsets. The centers of the N subsets are used as the initial values.
Step 3. Run a K-means clustering iteration followed by normalization, or the iteration of equation 2.31, to estimate the suboptimal basis matrix.
End

3 Blind Source Separation Based on Sparse Representation

In this section, we discuss an application of sparse representation to BSS. The mixing matrix and the source matrix are taken as, respectively, the basis matrix and the coefficient matrix in the sparse factorization of the mixture signal matrix. If the sources are sufficiently sparse, the normalized mixture vectors form several clusters whose center vectors are close in direction to the column vectors of the mixing matrix. Thus, the mixing matrix can be estimated as in the estimation of the basis matrix in the previous section. After the mixing matrix is estimated correctly, a sparse source vector (one whose number of nonzero entries is sufficiently small) corresponds to an l0 norm solution of the sparse representation. According to the discussion above, this is equivalent to the l1 norm solution with a high probability, which can be obtained using a linear programming algorithm. The proposed approach is suitable for the case in which the number of sensors is less than or equal to the number of sources and for the case in which the number of sources is unknown. We consider the following noise-free and noisy models:

X = A S, (3.1)

X = A S + V, (3.2)

where the mixing matrix A ∈ R^{n×m} is unknown, the matrix S ∈ R^{m×K} is composed of the m unknown sources, and the only observable quantity X ∈ R^{n×K} is a data matrix whose rows contain mixtures of the sources, with n ≤ m. V ∈ R^{n×K} is a Gaussian noise matrix.
The task of BSS is to recover the sources using only the observable mixture matrix X.

3.1 Blind Sparse Source Separation. If the sources are sufficiently sparse, the sparse representation approach presented in this letter can be used directly in blind separation. The process is divided into two steps. Step 1 is to estimate the mixing matrix using Algorithm Outline 1, presented in section 2.4. When the iteration terminates, the obtained matrix, denoted Ā, is the estimate of the original mixing matrix A. Note that Ā may contain more columns than A.

Intuitively, if all source column vectors have only one nonzero entry, that is, if the sources are nonoverlapping, then every mixture vector is parallel to one of the columns of the mixing matrix. In practice, although the sources sometimes overlap, if there exists a fraction of source vectors with only one nonzero entry (a single dominant entry for noisy data), the mixing matrix can still be estimated.

In the following, we present a simulation in which the mixing matrix is estimated using a K-means algorithm followed by normalization of the columns to unit length (see Algorithm Outline 1). In the simulation, the mixing matrix A ∈ R^{5×10} is drawn randomly from a uniform distribution on [0, 1], with all columns normalized to unit length. The source matrix S ∈ R^{10×1000} contains two classes of column vectors (source vectors). The first class of source vectors contains only one nonzero entry, whose index is chosen randomly. All entries of the second class of column vectors are drawn randomly from a uniform distribution on [0, 1]. The simulations are carried out for six different but fixed ratios of first-class columns to the whole population of source vectors (the total column number is 1000).

For an estimated mixing matrix Ā, the estimation error is calculated as

er = ( Σ_{i=1}^{10} ||ā_i − a_i||_2 ) / 10,

where ā_i is the column of Ā closest to the column a_i of A.
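The clustering step of Algorithm Outline 1 and the error measure er can be sketched as follows; a minimal illustration in which every source column has a single nonzero entry (the ratio-1.0 extreme, so each normalized mixture column equals a column of A exactly), with scikit-learn's KMeans standing in for the clustering iteration (dimensions and seed are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, m, K = 5, 10, 1000

# Mixing matrix: uniform on [0, 1], columns normalized to unit length.
A = rng.uniform(size=(n, m))
A /= np.linalg.norm(A, axis=0)

# Sources: every column has a single nonzero entry, so each mixture
# column is a positive multiple of one column of A.
S = np.zeros((m, K))
S[rng.integers(0, m, size=K), np.arange(K)] = rng.uniform(0.1, 1.0, size=K)
X = A @ S

# Step 1 of Algorithm Outline 1: normalize the mixture columns.
X0 = X / np.linalg.norm(X, axis=0)

# Cluster the normalized mixtures; the centers estimate the columns of A.
centers = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X0.T).cluster_centers_
A_hat = (centers / np.linalg.norm(centers, axis=1, keepdims=True)).T

# Estimation error er: match each true column to its closest estimate.
er = np.mean([np.linalg.norm(
    A_hat[:, np.argmin(np.linalg.norm(A_hat - A[:, [i]], axis=0))] - A[:, i])
    for i in range(m)])
print(er)
```

In this idealized nonoverlapping case er should be essentially zero; with overlapping (second-class) columns mixed in, the error grows, as Figure 3 illustrates.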
Figure 3: Mean estimation error of the mixing matrix versus the ratio of single-nonzero-entry source vectors.
Although Ā generally has more columns than A, the sources can be obtained effectively provided that Ā contains a submatrix sufficiently close to A. This can be seen in the following analysis and in the simulation examples in section 4.

A potential recoverability problem arises here: are the estimated sources equal to the true sources, even if the mixing matrix is estimated correctly? From theorem 4, for fixed n (the number of sensors) and m, if the number l of nonzero entries of the source vector is sufficiently small compared to n (that is, the source vector is sufficiently sparse), then the source can be recovered with a high probability. For instance, the mixing matrix in simulation example 1 is 10 × 15 dimensional, and the source vector is uniformly distributed. If l ≤ 5, then the source can be recovered with probability higher than 0.97. The implication is that the sparser the sources are, the higher the probability that they can be recovered. Furthermore, if the sources are sparser, it is easier to estimate the mixing matrix (i.e., it is easier to obtain the cluster centers of the data column vectors). Thus, the sparsity of the sources plays an important role in the approach presented in this letter.

3.2 Blind Sparse Source Separation for the Noisy Case. Now we consider blind sparse source separation in the presence of noise. The noise has two origins: (1) the estimation error of the mixing matrix and (2) additive noise. Thus, we consider the model

(P2)  min Σ_{i=1}^{m1} |s̄_i|, subject to Ā s̄ + v = x,
where Ā ∈ R^{n×m1} (m1 ≥ m) is the estimate of the mixing matrix A and v ∈ R^n is the additive noise vector. Since the case of additive noise has already been discussed in section 2.3, we discuss only the following problem here:

(P3)  min Σ_{i=1}^{m1} |s̄_i|, subject to Ā s̄ = x.
There are two cases. In case 1, m1 = m; that is, Ā ∈ R^{n×m}. Denote Ā = A + ΔA. Using the notation of sections 2.2 and 2.3, from equation 2.17 we have

s̃^1 = Ã_1^{−1} x. (3.3)

Now considering the equations

(Ã_1 + ΔÃ_1) s̄^1 = x, (3.4)

we have

s̄^1 = (Ã_1 + ΔÃ_1)^{−1} x = ( (Ã_1 + ΔÃ_1)^* / det(Ã_1 + ΔÃ_1) ) x, (3.5)
where (Ã_1 + ΔÃ_1)^* is the adjugate matrix of Ã_1 + ΔÃ_1. Noting that s̄^1 can also be made a solution of the constraint equation of (P3) by appending m − n zero entries, it is taken as a solution of the constraint in what follows. From equation 3.5, if ||ΔÃ_1|| tends to zero, then s̄^1 tends to s̃^1.

Furthermore, consider the two sets of equations

Ã s̄^n = x,

(Ã + ΔÃ) s̄^v = x,

where Ã is an n × n minor of A different from Ã_1, and ΔÃ is the n × n minor of ΔA with the same row and column indices as Ã.

We have s̄^n = Ã^{−1} x and ||s̄^n||_1 > ||s^1||_1. If ||ΔA||_1 tends to zero, then s̄^v tends to s̄^n, and ||s̄^v||_1 > ||s^1||_1. Thus, s̄^1 is the solution of the problem (P3) if ||ΔA||_1 is sufficiently small.

From the discussion above, if ||ΔÃ_1||_1 is sufficiently small, then ||s̄^1 − s^1||_1 is sufficiently small. Furthermore, if s^1 = s, we obtain an ideal estimate of the original source vector s by solving (P3).

In case 2, m1 > m; that is, Ā has more columns than A. Denote ΔA_1 = Ā(:, [1 : m]) − A and ΔA_2 = Ā(:, [m + 1 : m1]). Set A^0 = [A, ΔA_2] and ΔA^0 = [ΔA_1, 0] ∈ R^{n×m1}. Obviously, Ā = A^0 + ΔA^0.
Consider the problem

(P4)  min Σ_{i=1}^{m1} |q_i|, subject to A^0 q = x,

and denote the solution of (P4) as q^1. From the discussion in section 2.2, if the source vector is sufficiently sparse, the following result holds with a high probability:

q^1([1 : m]) = s^1 = s,  q_{m+1} = … = q_{m1} = 0. (3.6)

Now we consider a variant of (P3),

(P5)  min Σ_{i=1}^{m1} |q̄_i|, subject to (A^0 + ΔA^0) q̄ = x,

and denote the solution of (P5) as q̄^1. From the discussion in case 1, if ||ΔA^0||_1 is sufficiently small, then q̄^1 ≈ q^1. In view of equation 3.6, the following result is satisfied with a high probability:

q̄^1([1 : m]) ≈ s^1 = s,  q̄_{m+1} = … = q̄_{m1} = 0. (3.7)

Thus, if ||ΔA^0||_1 is sufficiently small, we can estimate the original sources efficiently by solving the problem (P5), provided that the sources can be recovered by solving the problem (P1). Combining this with the discussion in section 2.3, we have the following theorem:

Theorem 6. If the source vector is sufficiently sparse, the estimated matrix contains a submatrix sufficiently close to the mixing matrix, and the additive noise is sufficiently small, then the source vector can be recovered by estimating the mixing matrix and solving the problem (P2), even if the number of sources is unknown.

3.3 Preprocessing by Wavelet Packets Transformation and Blind Separation of Dense Sources. Generally, sources cannot be recovered directly using sparse factorization if they are not sufficiently sparse. In this section, a blind separation algorithm based on wavelet packets preprocessing is presented for dense sources.
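Steps 1 and 4 of the algorithm below rest on a wavelet packets analysis/synthesis pair: decompose each mixture into tree-node coefficients, work on the coefficients, and invert the transform. A minimal sketch of that round trip using PyWavelets (assumed available; the wavelet 'db4' and depth 3 are illustrative choices, not prescribed by the letter):

```python
import numpy as np
import pywt

rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(1024))   # a dense (non-sparse) test signal

# Step 1: decompose into a depth-3 wavelet packet tree.
wp = pywt.WaveletPacket(data=x, wavelet='db4', mode='symmetric', maxlevel=3)
nodes = wp.get_level(3, order='natural')   # leaf nodes; coefficients in node.data

# (In the algorithm, the mixing matrix and the source coefficients would be
# estimated from these node coefficients; here we only check invertibility.)

# Step 4: rebuild the signal from the leaf coefficients.
wp2 = pywt.WaveletPacket(data=None, wavelet='db4', mode='symmetric', maxlevel=3)
for node in nodes:
    wp2[node.path] = node.data
x_rec = wp2.reconstruct(update=False)[:len(x)]
print(np.max(np.abs(x_rec - x)))
```

The printed reconstruction error should be at machine precision, which is what lets the separation be carried out entirely on the (sparser) node coefficients without losing signal content.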
Algorithm Steps 2
Step 1. Transform the time-domain signals x_i (the ith row of X, i = 1, …, n) to time-frequency signals by a wavelet packets transformation, and make sure that the n wavelet packets trees have the same structure.
Step 2. Select the nodes of the wavelet packets trees whose coefficients are as sparse as possible. The selected nodes of different trees should have the same indices. Based on these coefficient vectors, estimate the mixing matrix Ā using Algorithm Outline 1 presented in section 2.4. If there is noise in the mixing model, we can estimate the mixing matrix several times and then take the mean of the estimated matrices.
Step 3. Based on the estimated mixing matrix Ā and the coefficients of all nodes obtained in step 1, estimate the coefficients of all the nodes of the wavelet packets trees of the sources by solving the linear programming problems of equation 2.4.
Step 4. Reconstruct the sources using the inverse wavelet packets transformation. Among the reconstructed sources, m signals are estimates of the original sources up to a permutation and a scaling.
Step 5. Denoise the estimated sources using the wavelet transformation if noise exists in the mixtures.
End

Remark 2.
1. We find in simulations that the coefficients of the wavelet transformation or the Fourier transformation are often not sufficiently sparse to estimate the mixing matrix and the sources. Thus, the wavelet packets transformation is used in this letter.
2. From simulation example 4, we find that the denoising step should be carried out last. If we apply denoising preprocessing to the mixture signals before blind separation, we fail to obtain good results.

4 Simulation Examples

The simulation results presented in this section are divided into five categories. Example 1 is concerned with the recoverability of sparse sources using a linear programming method. Example 2 considers blind separation of sparse images of faces; in this example, six observable signals are mixtures of 10 sparse images of faces. Example 3 considers BSS in the context of wavelet packets preprocessing and sparse representation; eight speech sources and five observable mixtures are analyzed in the simulation. A noisy mixing model is considered in example 4 with the same sources as in example 3. Finally, in example 5, a 14-channel EEG data matrix is analyzed using the sparse component analysis approach presented in this letter.

4.1 Example 1. In this example, consider the mixing model

x_0 = A s_0, (4.1)

where A ∈ R^{n×m} is a known mixing matrix with n ≤ m, s_0 ∈ R^m is a source vector, and x_0 ∈ R^n is the mixture vector. Based on the known mixture vector and mixing matrix, the source vector is estimated by solving the following linear programming problem:

min Σ_{i=1}^m |s_i|, s.t. A s = x_0. (4.2)
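Problem 4.2 can be handed to any linear programming solver after the standard substitution s = u − v with u, v ≥ 0, so that Σ|s_i| = Σu_i + Σv_i. A minimal sketch (Gaussian data and seed are illustrative choices, not from the letter; SciPy's linprog is the solver):

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(A, x):
    """min ||s||_1 s.t. A s = x, as an LP in (u, v) with s = u - v, u, v >= 0."""
    n, m = A.shape
    c = np.ones(2 * m)                 # objective: sum(u) + sum(v) = ||s||_1
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=x, bounds=(0, None))
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(1)
n, m, k = 10, 15, 2                    # k nonzero entries, as in the early loops
A = rng.standard_normal((n, m))
s0 = np.zeros(m)
s0[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
x0 = A @ s0

s_hat = l1_solve(A, x0)
print(np.max(np.abs(s_hat - s0)))      # small when the source is recovered
```

For k this small relative to n, the estimated vector should coincide with the true source, matching the ratios reported in the first simulation below.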
The task of this example is to determine whether the estimated source vector is equal to the true source vector. There are three simulations in this example.

In the first simulation, n and m are fixed at 10 and 15, respectively. Fifteen loop experiments are performed, each containing 1000 optimization experiments. In each of the 1000 experiments of the first loop experiment, the mixing matrix A and the source vector s_0 are selected randomly; however, s_0 has only one nonzero entry. After the 1000 optimization experiments, the fraction of source vectors estimated correctly is calculated. The kth loop experiment is carried out similarly, except that the source vector has k nonzero entries. The first subplot of Figure 4 shows the ratio curve with respect to the number k of nonzero entries of the source vector, obtained from the 15 loop experiments. The source can be estimated correctly when k = 1, 2, and the ratios are greater than 0.95 when k ≤ 5.

The second simulation contains 11 loop experiments. All source vectors have five nonzero entries, that is, k = 5 and m = 15. The dimension n of the mixture vectors varies from 5 to 15. As above, each loop experiment contains 1000 experiments, and the fraction of correctly estimated source vectors is calculated. The second subplot of Figure 4 shows the ratio curve; it is evident that when n ≥ 10, the source can be estimated correctly with probability higher than 0.95.

In the third simulation, 15 loop experiments are conducted. The number k of nonzero entries of the source vectors and the dimension n of the mixture vectors are fixed at 5 and 10, respectively. The source number m varies from 11 to 25. The other experimental parameters and calculations are the same as in the other two simulations. The third subplot of Figure 4 shows the ratio curve with respect to the source number m. From this simulation, we see that for fixed k and n, the ratio decreases as m increases.
Figure 4: (Left) First subplot: curve of ratios measuring whether the source vectors are estimated correctly in the first simulation of example 1. (Middle) Second subplot: curve of ratios for source vector estimation in the second simulation. (Right) Third subplot: curve of ratios for source vector estimation in the third simulation.
4.2 Example 2. In this example, we consider the linear mixing model 3.1 with nonnegative sources and a nonnegative mixing matrix. The source matrix S is composed of 10 sparse images of faces; the nonnegative 6 × 10 mixing matrix is selected randomly, and every column is normalized. Figure 5 shows the 10 sparse face images in the subplots of the first and second rows and the six mixtures in the subplots of the third and fourth rows.

Using algorithm 1, we obtain an estimated 6 × 14 nonnegative mixing matrix, denoted Ā. To guarantee that the linear programming problems 2.3 with additional nonnegativity constraints are feasible, we use the matrix [Ā, E] as the basis matrix, where E is the 6 × 6 identity matrix. Solving the linear programming problem 2.3, we obtain 20 outputs, of which 15 are shown in Figure 6. Obviously, the first 10 outputs are the recovered sources.

4.3 Example 3. Consider the linear mixing model 3.1, where the source matrix S is composed of the eight speech signals shown in Figure 7. A 5 × 8 mixing matrix is selected randomly, and every column is normalized. Only five mixtures are observable, shown in the subplots of the first row of Figure 8.

Using Algorithm Steps 2, an estimated 5 × 10 dimensional mixing matrix and 10 estimated sources are obtained. The subplots in the second and third rows of Figure 8 show the 10 estimated signals, from which we can see that all eight sources have been recovered favorably. The other two estimated sources, which are close to zero, can be seen as a kind of estimation error. There are several potential methods for reducing this kind of error: develop algorithms that are "better" than the K-means clustering algorithm at estimating the mixing matrix, find a more suitable transformation such that the signals are as sparse as possible in the transformed domain, and increase
Figure 5: Ten subplots of the first two rows: 10 sparse images of faces. Six subplots in the third and fourth rows: six mixtures. All are from simulation example 2.
the number of sensors if possible, among others. Generally, if the sources overlap to some degree in the analyzed domain, this kind of error cannot be eliminated even if the overcomplete mixing matrix is known.

4.4 Example 4. Consider the noisy mixing model, equation 3.2, where the source matrix S is composed of the same eight speech signals as in example 3 (see Figure 7). The 5 × 8 mixing matrix is selected randomly, and every column is normalized. Using Algorithm Steps 2, an estimated 5 × 9 dimensional mixing matrix and nine estimated sources are obtained. The subplots in the second and third rows of Figure 9 show the nine estimated signals, from which we can see that all eight sources have been recovered favorably; the last estimated signal is close to zero.

Recently, many researchers have used independent component analysis (ICA) in EEG data analysis and have obtained many promising results (e.g., Makeig et al., 2002). Compared with the general class of ICA algorithms based on the independence principle, sparse representation has two obvious advantages in EEG data analysis: (1) several sources may be dependent, and some may be nonstationary; and (2)
Figure 6: Blind sparse source separation results of example 2: 15 outputs, of which the first 10 are the recovered sources.
Figure 7: Eight sources in example 3.
the source number can be larger than the number of EEG channels. Thus, we believe that sparse representation is an alternative and very promising approach to EEG data analysis. In the following, we present an example of EEG data analysis based on sparse representation.

4.5 Example 5. In this example, a set of EEG data is analyzed using the sparse component analysis (SCA) approach set out in this letter. The data matrix
Figure 8: Blind source separation results of example 3. (Top) Five observable mixtures. (Middle, bottom) Estimated signals of which eight are recovered sources; two other signals are close to zero.
X ∈ R^{10×20,000} was obtained from 10 channels (filtered in the 1–50 Hz range), and the sampling rate was 1000 Hz. We assume that the 10 EEG signals are linear mixtures of brain sources, artifacts, and noise; that is, they are generated according to model 3.1.

Using Algorithm Steps 2, an estimated 10 × 18 dimensional mixing matrix and 18 estimated signals are obtained, including brain sources, artifacts, and noise (there may also be several new mixtures). Figure 10 shows the 10 filtered EEG mixture signals from 3 seconds of recording, and Figure 11 shows their power spectral density estimates. The 18 estimated components are shown in Figure 12, with their power spectral densities shown in Figure 13.

From Figures 10 and 11, we can see that all EEG channel signals contain strong α components over long time intervals and that many of the power spectral density functions are very similar. This can be viewed as evidence that these EEG channel signals contain common components.

From Figures 12 and 13, we can see that the α component still appears in many estimated components, but over shorter time intervals; for example, s08 contains α components in the time interval (0, 1), and s12 contains an α component in (1.8, 2.4). We can also see that the power spectral density of s2 has a peak around 20 Hz, and s4 contains mainly low-frequency components
Figure 9: Blind source separation results of example 4. (Top) Five observable mixtures. (Middle, bottom) Estimated signals, of which eight are recovered sources and the last is close to zero.
of less than 5 Hz and around 10 Hz, as does s11. The time courses of these three components are very distinctive.

We also calculated the correlation coefficient matrices of the EEG channel data and of the estimated components, denoted R^x and R^s, respectively. We found that R^x_{i,j} ∈ (0.14, 1] (the median of |R^x_{i,j}| is 0.49). The correlation coefficients of the estimated components are considerably lower (the median of |R^s_{i,j}| is 0.26) than for the raw EEG data. There are many pairs of components with small correlation coefficients, for example, R^s_{2,4} = 0.03, R^s_{2,11} = 0.006, R^s_{3,4} = 0.0018, and so forth. Furthermore, we found that the higher-order correlation coefficients of these pairs are also very small (e.g., the fourth-order correlation coefficients of the pairs s2 and s4, and s2 and s1, are 0.018 and 0.003, respectively, and the median of the absolute values of the fourth-order correlations of all components is 0.20). We would like to emphasize that although the independence principle was not used, many pairs of components are almost independent.

5 Concluding Remarks

Sparse representation of a data matrix and its application to blind source separation were discussed in this letter. A data matrix can be represented
Figure 10: One second of EEG data (1000 sample points) obtained from 10 channels (filtered from 1 to 50 Hz) and analyzed in example 5.
by the product of two matrices, one called the basis matrix and the other the coefficient matrix (or solution). Sparse representation means that the coefficients are sparse. In this article, the l1 norm is used as the sparsity measure, whereas the l0 norm sparsity measure is considered for comparison and for the recoverability analysis of BSS. For a given basis matrix, although there generally exist infinitely many solutions (factorizations), the sparse solution with minimum l1 norm is proved to be unique with probability one, and it can be obtained using a standard linear programming algorithm. The l1 norm solution has several important advantages: there are many strong tools (e.g., interior point algorithms) for solving linear programming problems, and the l1 norm solution is almost always unique and is robust to noise. The situation is different for the l0 norm solution, and for these reasons we think that the l1 norm sparsity measure is better than the l0 norm sparsity measure. However, the l0 norm solutions are the sparsest, and it is natural to hope that the l1 norm solution is one of the l0 norm solutions. From the equivalence analysis of the l1 norm solution and the l0 norm solution presented in this letter, it is evident that if a data vector (observed vector) is generated from a sufficiently sparse source vector, then with a high probability, the l1 norm solution is equal to the l0 norm solution,
Figure 11: Power spectral density estimations for the 10 EEG channel signals in example 5.
which is equal to the source vector. This result still holds in the presence of additive noise, and it serves as a recoverability result for blind sparse source separation.

Among the different basis matrices, there should be a best basis matrix, for which the corresponding coefficient matrix is the sparsest. How to find this best basis matrix remains an open problem. In this letter, the basis matrix is composed of the cluster centers of the normalized data vectors, as in Zibulevsky et al. (2000). The reasoning behind this kind of basis derives from two considerations: the corresponding coefficient matrix can obviously become sparse, and the basis matrix can represent the mixing matrix in BSS with sparse sources.

This kind of construct employing sparse representation can be used in BSS, especially in cases in which there are fewer sensors than sources and the source number is unknown. Our analysis shows that this approach is robust both to additive noise and to estimation error in the mixing matrix. If the sources are sufficiently sparse, blind separation can be carried out directly, even if they overlap to some degree. Otherwise, preprocessing by a wavelet packets transformation is necessary, and the blind separation is implemented in the time-frequency domain. Four simulation examples and an EEG data analysis example support the validity and performance of the algorithm detailed in this letter.
Analysis of Sparse Representation and Blind Source Separation
Figure 12: Eighteen estimated signals ($s_{01}$–$s_{18}$, plotted over 0–3 sec) obtained using the SCA approach proposed in this letter, as detailed in example 5.
We applied sparse representation to EEG data analysis in this article and obtained some results. Our EEG data analysis results are clearly preliminary; more extensive investigations will appear in forthcoming articles. Compared with general ICA algorithms based on the independence principle, sparse representation offers two obvious advantages in EEG data analysis: (1) it remains applicable when some sources are mutually dependent or nonstationary, and (2) the number of sources can exceed the number of EEG channels. Further work includes the study of algorithms for estimating the best basis matrix of the sparse representation, estimation of the probability that sources with different distributions can be recovered, and application extensions, specifically to image processing, visual computation, and EEG data analysis.

Appendix A: Proof of Theorem 1

First, define a set of matrices in $R^{n\times m}$: $G_1 = \{B \in R^{n\times m} \mid \text{there exists at least one } n \times n \text{ square submatrix of } B \text{ that is rank deficient}\}$. Obviously, the measure of $G_1$ is zero.
Figure 13: Welch power spectral density estimates (dB/Hz versus frequency in Hz) for the 18 estimated signals in example 5.
In the following, suppose that $B \in R^{n\times m} \setminus G_1$, that is, all the $n \times n$ square submatrices of $B$ are of full rank. Without loss of generality, we consider only a column vector of $X$, denoted by $x$. The sparse solution is obtained by solving the optimization problem

$$\min J_1 = \sum_{i=1}^{m} |s_i|, \quad \text{subject to } x = s_1 b_1 + \cdots + s_m b_m. \tag{A.1}$$
Take an $n \times n$ submatrix of $B$, denoted $\tilde{B} = [b_{l_1}, \ldots, b_{l_n}]$, and set the index set $\{j_1, \ldots, j_{m-n}\} = \{1, \ldots, m\} \setminus \{l_1, \ldots, l_n\}$. In equation A.1, solutions of the constraint equations can be represented
by

$$[s_{l_1}, \ldots, s_{l_n}]^T = \tilde{B}^{-1}[x - s_{j_1} b_{j_1} - \cdots - s_{j_{m-n}} b_{j_{m-n}}], \tag{A.2}$$

which is denoted by $[\tilde{x}_1, \ldots, \tilde{x}_n]^T - [\tilde{b}_{1j_1}, \ldots, \tilde{b}_{nj_1}]^T s_{j_1} - \cdots - [\tilde{b}_{1j_{m-n}}, \ldots, \tilde{b}_{nj_{m-n}}]^T s_{j_{m-n}}$, where $s_{j_1}, \ldots, s_{j_{m-n}}$ are arbitrary. Then $J_1$ becomes

$$J_1 = \sum_{i=1}^{n} |\tilde{x}_i - \tilde{b}_{ij_1} s_{j_1} - \cdots - \tilde{b}_{ij_{m-n}} s_{j_{m-n}}| + |s_{j_1}| + \cdots + |s_{j_{m-n}}|.$$
The possible minimum points of $J_1$ lie in the following two sets:

1. $D_1 = \{(s_1, \ldots, s_m)^T \mid (s_1, \ldots, s_m)^T$ satisfies the constraints of equation A.1, $J_1$ is differentiable there, and $\partial J_1/\partial s_{j_1} = \cdots = \partial J_1/\partial s_{j_{m-n}} = 0\}$.

2. $D_2 = \{(s_1, \ldots, s_m)^T \mid (s_1, \ldots, s_m)^T$ has at least one $s_i = 0\}$. Note that the function $J_1$ is not differentiable at the points of $D_2$.

For points of the first class, we have

$$\frac{\partial J_1}{\partial s_{j_1}} = \operatorname{sign}(s_{j_1}) - \operatorname{sign}(s_{l_1})\tilde{b}_{1j_1} - \cdots - \operatorname{sign}(s_{l_n})\tilde{b}_{nj_1}, \quad \ldots, \quad \frac{\partial J_1}{\partial s_{j_{m-n}}} = \operatorname{sign}(s_{j_{m-n}}) - \operatorname{sign}(s_{l_1})\tilde{b}_{1j_{m-n}} - \cdots - \operatorname{sign}(s_{l_n})\tilde{b}_{nj_{m-n}}. \tag{A.3}$$
Thus, if

$$[1 \pm \tilde{b}_{1j_1} \pm \cdots \pm \tilde{b}_{nj_1}, \; \ldots, \; 1 \pm \tilde{b}_{1j_{m-n}} \pm \cdots \pm \tilde{b}_{nj_{m-n}}]^T \neq 0, \tag{A.4}$$
then $D_1$ is an empty set, and $J_1$ has minima only in the second class of points, $D_2$. Choosing a $p \in \{1, \ldots, m-n\}$ and setting $s_{j_p} = 0$, we obtain a new optimization problem:

$$\min \bar{J}_1 = \min\left[\sum_{i=1}^{n} |\tilde{x}_i - \tilde{b}_{ij_1} s_{j_1} - \cdots - \tilde{b}_{ij_{p-1}} s_{j_{p-1}} - \tilde{b}_{ij_{p+1}} s_{j_{p+1}} - \cdots - \tilde{b}_{ij_{m-n}} s_{j_{m-n}}| + |s_{j_1}| + \cdots + |s_{j_{p-1}}| + |s_{j_{p+1}}| + \cdots + |s_{j_{m-n}}|\right],$$

subject to

$$[s_{l_1}, \ldots, s_{l_n}]^T = \tilde{B}^{-1}[x - s_{j_1} b_{j_1} - \cdots - s_{j_{p-1}} b_{j_{p-1}} - s_{j_{p+1}} b_{j_{p+1}} - \cdots - s_{j_{m-n}} b_{j_{m-n}}]. \tag{A.5}$$
Similarly, we reach the conclusion: if

$$[1 \pm \tilde{b}_{1j_1} \pm \cdots \pm \tilde{b}_{nj_1}, \; \ldots, \; 1 \pm \tilde{b}_{1j_{p-1}} \pm \cdots \pm \tilde{b}_{nj_{p-1}}, \; 1 \pm \tilde{b}_{1j_{p+1}} \pm \cdots \pm \tilde{b}_{nj_{p+1}}, \; \ldots, \; 1 \pm \tilde{b}_{1j_{m-n}} \pm \cdots \pm \tilde{b}_{nj_{m-n}}]^T \neq 0, \tag{A.6}$$
then the objective function in equation A.5 has minima only at nondifferentiable points. That is, the minimum points of the objective function in equation A.5 have at least two zero entries, one of which is $s_{j_p} = 0$. Note that for $p = 1, \ldots, m-n$, we obtain $m-n$ conditions of the form of equation A.6. If all $m-n$ of these conditions and condition A.4 are satisfied, then the minimum points of the objective function in equation A.5 have at least two zero entries. Repeating the procedure above $m-n$ times, we obtain the optimization problem
$$\min \sum_{i=1}^{n} |s_{l_i}|, \quad \text{subject to } x = s_{l_1} b_{l_1} + \cdots + s_{l_n} b_{l_n}. \tag{A.7}$$
Note that the constraint equations in equation A.7 are determined and have a unique solution.

Define the set $G_2(B) = \{B \in R^{n\times m} \mid \text{there exists a square submatrix } \tilde{B} \text{ of } B \text{ such that at least one vector } [1 \pm \tilde{b}_{1i_1} \pm \cdots \pm \tilde{b}_{1i_k}, \ldots, 1 \pm \tilde{b}_{ni_1} \pm \cdots \pm \tilde{b}_{ni_k}]^T = 0, \text{ where } \{i_1, \ldots, i_k\} \text{ is a subset of } \{j_1, \ldots, j_{m-n}\}, \; 1 \le k \le m-n\}$, and let $G_0 = G_2(B) \cup G_1$. Then $G_0$ is a zero-measure set in $R^{n\times m}$.

Reconsidering equation A.1, if $B \in R^{n\times m} \setminus G_0$, then all possible solutions of this equation can be obtained by choosing a suitable submatrix of $B$ and solving the determined constraint equations of equation A.7. Since the number of such submatrices of $B$ is $C_n^m$, equation A.1 has at most finitely many solutions, and the number of nonzero entries of each such solution is at most $n$.

Furthermore, we prove that the solution is unique. Suppose that $x$ has the sparse representation

$$x = s_{i_1} b_{i_1} + \cdots + s_{i_L} b_{i_L}, \tag{A.8}$$
where $|s_{i_1}| + \cdots + |s_{i_L}|$ is a global minimum of equation A.1. It follows from the discussion above that $L \le n$; thus $b_{i_1}, \ldots, b_{i_L}$ are independent. If $x$ has another sparse factorization based on $B$, there are two cases: (1) the sparse factorization of $x$ is also based on the basis vectors $b_{i_1}, \ldots, b_{i_L}$; and (2) the sparse factorization of $x$ is based on basis vectors different from those in equation A.8.
Now consider the first case. We have

$$x = s'_{i_1} b_{i_1} + \cdots + s'_{i_L} b_{i_L}, \tag{A.9}$$

where $[s'_{i_1}, \ldots, s'_{i_L}]^T \neq [s_{i_1}, \ldots, s_{i_L}]^T$ and $\sum_{k=1}^{L} |s'_{i_k}| = \sum_{k=1}^{L} |s_{i_k}|$. Thus, we have

$$(s_{i_1} - s'_{i_1}) b_{i_1} + \cdots + (s_{i_L} - s'_{i_L}) b_{i_L} = 0. \tag{A.10}$$
This contradicts the statement that $b_{i_1}, \ldots, b_{i_L}$ are independent. For the second case, suppose that $x$ has a sparse factorization different from equation A.8:

$$x = s'_{l_1} b_{l_1} + \cdots + s'_{l_P} b_{l_P}, \tag{A.11}$$

and

$$|s'_{l_1}| + \cdots + |s'_{l_P}| = |s_{i_1}| + \cdots + |s_{i_L}|. \tag{A.12}$$
From equations A.8 and A.11, we have

$$x = \beta[s_{i_1} b_{i_1} + \cdots + s_{i_L} b_{i_L}] + (1-\beta)[s'_{l_1} b_{l_1} + \cdots + s'_{l_P} b_{l_P}] = r_{k_1} b_{k_1} + \cdots + r_{k_Q} b_{k_Q}, \tag{A.13}$$

where $\beta \in [0, 1]$, $b_{k_1}, \ldots, b_{k_Q}$ are all the distinct basis vectors in $\{b_{i_1}, \ldots, b_{i_L}, b_{l_1}, \ldots, b_{l_P}\}$, and $r_{k_1}, \ldots, r_{k_Q}$ are the corresponding coefficients. From equations A.12 and A.13, it can be proved that

$$|r_{k_1}| + \cdots + |r_{k_Q}| \le \beta(|s_{i_1}| + \cdots + |s_{i_L}|) + (1-\beta)(|s'_{l_1}| + \cdots + |s'_{l_P}|) = |s_{i_1}| + \cdots + |s_{i_L}|. \tag{A.14}$$
Thus, the representation in equation A.13 is also a sparse representation, and $\sum_{l=1}^{Q} |r_{k_l}| = \sum_{j=1}^{L} |s_{i_j}|$. By choosing different $\beta$ in equation A.13, it can be seen that there are infinitely many sparse representations of $x$. This contradicts the fact, shown above, that equation A.1 has at most finitely many solutions.

From the discussion above, it follows that for a given basis matrix $B \in R^{n\times m} \setminus G_0$, the sparse solution of equation 1.1 is unique. Note that the measure of $G_0$ is zero. From the proof above, it can also be seen that for a given basis $B \in R^{n\times m} \setminus G_0$, the solution of equation 2.1 has at most $n$ nonzero entries. Thus, theorem 1 is proved.
Appendix B: Proof of Lemma 2

1. Set $\alpha_k^n = 1 - \beta(k, n, n)$, which is the probability $P(\|s_n\|_1 \ge \|s_n^{(i_1,\ldots,i_k)}\|_1)$. Now consider the probability $\alpha_{k+1}^n = P(\|s_n\|_1 \ge \|s_n^{(i_1,\ldots,i_{k+1})}\|_1)$. The solution $s_n^{(i_1,\ldots,i_{k+1})}$ can be obtained in two steps. First, using $k$ columns selected randomly to replace the $k$ columns with indices $(i_1, \ldots, i_k)$ of $A_n$ in equation 2.6, equation 2.7 and the corresponding solution $s_n^{(i_1,\ldots,i_k)}$ are obtained. Second, using one column selected randomly to replace the $i_{k+1}$-th column of $A_n^{(i_1,\ldots,i_k)}$ in equation 2.7, the corresponding solution $s_n^{(i_1,\ldots,i_k,i_{k+1})}$ is obtained. Thus, we have

$$\begin{aligned}
\alpha_{k+1}^n &= P(\|s_n\|_1 \ge \|s_n^{(i_1,\ldots,i_{k+1})}\|_1)\\
&= P(\{\|s_n\|_1 \ge \|s_n^{(i_1,\ldots,i_k)}\|_1\} \cap \{\|s_n^{(i_1,\ldots,i_k)}\|_1 \ge \|s_n^{(i_1,\ldots,i_{k+1})}\|_1\})\\
&\quad + P(\{\|s_n\|_1 \ge \|s_n^{(i_1,\ldots,i_k)}\|_1\} \cap \{\|s_n^{(i_1,\ldots,i_k)}\|_1 < \|s_n^{(i_1,\ldots,i_{k+1})}\|_1\} \cap \{\|s_n^{(i_1,\ldots,i_{k+1})}\|_1 \le \|s_n\|_1\})\\
&\quad + P(\{\|s_n\|_1 < \|s_n^{(i_1,\ldots,i_k)}\|_1\} \cap \{\|s_n\|_1 \ge \|s_n^{(i_1,\ldots,i_{k+1})}\|_1\})\\
&\approx \alpha_k^n \mu_{k+1}^n + \alpha_k^n (1 - \mu_{k+1}^n)\alpha_{k+1}^n,
\end{aligned} \tag{B.1}$$
where $P(\|s_n^{(i_1,\ldots,i_k)}\|_1 \ge \|s_n^{(i_1,\ldots,i_{k+1})}\|_1)$ is denoted by $\mu_{k+1}^n$, and we have supposed that $P(\{\|s_n\|_1 < \|s_n^{(i_1,\ldots,i_k)}\|_1\} \cap \{\|s_n^{(i_1,\ldots,i_{k+1})}\|_1 \le \|s_n\|_1\}) \approx 0$. Under the condition $\|s_n\|_1 < \|s_n^{(i_1,\ldots,i_k)}\|_1$, if $\|s_n^{(i_1,\ldots,i_{k+1})}\|_1 \le \|s_n\|_1$, then the $i_{k+1}$-th column in equation 2.7 must be very close to $x$ in direction. Since the column is selected randomly, this probability can be viewed as zero. Thus, we have

$$\alpha_{k+1}^n \approx \frac{\mu_{k+1}^n}{1 - \alpha_k^n (1 - \mu_{k+1}^n)} \alpha_k^n. \tag{B.2}$$

Define the function $f(\mu_{k+1}^n) = \frac{\mu_{k+1}^n}{1 - \alpha_k^n (1 - \mu_{k+1}^n)}$. Obviously, $f(1) = 1$, and it is not difficult to find that

$$\frac{df}{d\mu_{k+1}^n} = \frac{1 - \alpha_k^n}{(1 - \alpha_k^n (1 - \mu_{k+1}^n))^2} > 0;$$

hence $f$ is monotonically increasing. Noting that $\mu_{k+1}^n \le 1$, we have $f \le 1$, and

$$\alpha_{k+1}^n \le \alpha_k^n, \tag{B.3}$$

that is, $\beta(k+1, n, n) \ge \beta(k, n, n)$. Thus, the first result of the lemma holds.
2. Suppose that the sequence $\{\alpha_n^n\}$ does not tend to 0; then there exists an $\epsilon_0 > 0$ such that the sequence has a subsequence, assumed to be $\{\alpha_{i_k}^{i_k}\}$, satisfying $\alpha_{i_k}^{i_k} > \epsilon_0$. Since the subsequence is bounded above by 1 and below by 0, it has a convergent subsequence, assumed without loss of generality to be itself. That is, $\lim_{k\to+\infty} \alpha_{i_k}^{i_k} = p_0 \ge \epsilon_0$.

In view of B.2 and B.3,

$$\alpha_{i_k}^{i_k} \approx \frac{\mu_{i_k}^{i_k}}{1 - \alpha_{i_k-1}^{i_k}(1 - \mu_{i_k}^{i_k})}\, \alpha_{i_k-1}^{i_k} \le \frac{\mu_{i_k}^{i_k}}{1 - \alpha_1^{i_k}(1 - \mu_{i_k}^{i_k})}\, \alpha_{i_k-1}^{i_k} \le \prod_{j=2}^{i_k} \frac{\mu_j^{i_k}}{1 - \alpha_1^{i_k}(1 - \mu_j^{i_k})}\, \alpha_1^{i_k}. \tag{B.4}$$
If there are two positive constants $\mu_0$ and $\alpha_0$ with $\mu_0, \alpha_0 < 1$ such that

$$\mu_j^{i_k} \le \mu_0, \quad j = 1, \ldots, i_k, \; k = 1, 2, \ldots, \tag{B.5}$$

$$\alpha_1^{i_k} \le \alpha_0, \quad k = 1, 2, \ldots, \tag{B.6}$$

then

$$\alpha_{i_k}^{i_k} \le \prod_{j=2}^{i_k} \frac{\mu_j^{i_k}}{1 - \alpha_0(1 - \mu_j^{i_k})}\, \alpha_0 \le \prod_{j=2}^{i_k} \frac{\mu_0}{1 - \alpha_0(1 - \mu_0)}\, \alpha_0 \le \left[\frac{\mu_0}{1 - \alpha_0(1 - \mu_0)}\right]^{i_k - 1} \alpha_0. \tag{B.7}$$

In view of equation B.7, we have $\lim_{k\to+\infty} \alpha_{i_k}^{i_k} = 0$, which contradicts the definition of the sequence $\{\alpha_{i_k}^{i_k}\}$. Thus, the two inequalities B.5 and B.6 cannot be satisfied simultaneously. That is, either the sequence $\{\mu_j^{i_k},\; j = 1, \ldots, i_k,\; k = 1, 2, \ldots\}$ or the sequence $\{\alpha_1^{i_k}\}$ has a subsequence that tends to 1. This is impossible in practice. Thus, the sequence $\{\alpha_n^n\}$ tends to 0, that is, $\lim_{n\to+\infty} \beta(n, n, n) = 1$.
This lemma is proved.
Appendix C: Proof of Lemma 3

1. Set $\bar{\alpha}_k = 1 - P(k, n, n)$, which is the probability $P(\|s_n\|_1 \ge \|s_n^k\|_1)$. Similarly to equation B.1, we have

$$\begin{aligned}
\bar{\alpha}_{k+1} &= P(\|s_n\|_1 \ge \|s_n^{k+1}\|_1)\\
&= P(\{\|s_n\|_1 \ge \|s_n^k\|_1\} \cap \{\|s_n^k\|_1 \ge \|s_n^{k+1}\|_1\})\\
&\quad + P(\{\|s_n\|_1 \ge \|s_n^k\|_1\} \cap \{\|s_n^k\|_1 < \|s_n^{k+1}\|_1\} \cap \{\|s_n^{k+1}\|_1 < \|s_n\|_1\})\\
&\quad + P(\{\|s_n\|_1 < \|s_n^k\|_1\} \cap \{\|s_n\|_1 \ge \|s_n^{k+1}\|_1\})\\
&\approx \bar{\alpha}_k \bar{\alpha}_1^k + \bar{\alpha}_k (1 - \bar{\alpha}_1^k)\bar{\alpha}_{k+1},
\end{aligned} \tag{C.1}$$

where we denote $P(\|s_n^k\|_1 \ge \|s_n^{k+1}\|_1)$ by $\bar{\alpha}_1^k$. Thus, we have

$$\bar{\alpha}_{k+1} \approx \frac{\bar{\alpha}_1^k}{1 - \bar{\alpha}_k(1 - \bar{\alpha}_1^k)}\, \bar{\alpha}_k. \tag{C.2}$$
As in the proof of lemma 2, we can prove that $\bar{\alpha}_{k+1} \le \bar{\alpha}_k$. Thus, the first result of the lemma holds.

2. Noting that $P(n, n, n) = \beta(n, n, n)$, $\lim_{n\to+\infty} P(n, n, n) = 1$ follows directly from lemma 2.

Now consider $P(n-k, n, n)$, defined as $P(\|s_n\|_1 < \|s_n^{n-k}\|_1)$. The solution $s_n^{n-k}$ can be obtained in two steps. First, using $n$ columns selected randomly to replace the $n$ columns of $A_n$ in equation 2.6, equation 2.7 and the corresponding solution $s_n^n \,(= s_n^{(1,\ldots,n)})$ are obtained. Second, using $k$ columns of $A_n$ in equation 2.6 to replace $k$ columns of $A_n^{(1,\ldots,n)}$ in equation 2.7, the corresponding solution $s_n^{(i_1,\ldots,i_{n-k})}$ can be obtained. Repeat this process until the solution $s_n^{n-k}$ is obtained, which satisfies $\|s_n^{n-k}\|_1 = \min\{\|s_n^{(i_1,\ldots,i_{n-k})}\|_1, \; 1 \le i_1 < \cdots < i_{n-k} \le n\}$. Thus, we have

$$\begin{aligned}
P(n-k, n, n) &= P(\|s_n\|_1 < \|s_n^{n-k}\|_1)\\
&\ge P(\{\|s_n\|_1 < \|s_n^n\|_1\} \cap \{\|s_n\|_1 < \|s_n^{n-k}\|_1\})\\
&= P(n, n, n)\, P(\{\|s_n\|_1 < \|s_n^{n-k}\|_1\} \mid \{\|s_n\|_1 < \|s_n^n\|_1\}). 
\end{aligned} \tag{C.3}$$
That $\|s_n\|_1 \ge \|s_n^{n-k}\|_1$ under the condition $\{\|s_n\|_1 < \|s_n^n\|_1\}$ implies that there exists at least one column of $A_n$ that is sufficiently close to (or
sufficiently opposite to) $x$ in direction. Set $\bar{x} = x/\|x\|_2$, and let $\alpha_0 > 0$ be a positive constant. If $|a_i^k - \bar{x}^k| < \alpha_0$ is satisfied for $k = 1, \ldots, n$, we say that the vector $a_i$ and $x$ are sufficiently close in direction. Define the probability $P(|a_i^k - \bar{x}^k| < \alpha_0) = p_0 > 0$; then

$$P(|a_i^k - \bar{x}^k| < \alpha_0, \; k = 1, \ldots, n) = P(\|a_i - \bar{x}\|_\infty < \alpha_0) = p_0^n,$$

$$P\left(\bigcup_{i=1}^{n} \{\|a_i - \bar{x}\|_\infty < \alpha_0\}\right) \le n p_0^n. \tag{C.4}$$
Thus,

$$P(\{\|s_n\|_1 < \|s_n^{n-k}\|_1\} \mid \{\|s_n\|_1 < \|s_n^n\|_1\}) = 1 - P(\{\|s_n\|_1 \ge \|s_n^{n-k}\|_1\} \mid \{\|s_n\|_1 < \|s_n^n\|_1\}) \ge 1 - n p_0^n \to 1. \tag{C.5}$$
In view of equations C.3 and C.5 and the fact that $\lim_{n\to+\infty} P(n, n, n) = 1$, we have $\lim_{n\to+\infty} P(n-k, n, n) = 1$.
This lemma is proved.
Appendix D: Proof of Theorem 3

1. $P(1, l, n), \ldots, P(l, l, n)$ can be viewed as $P(n-l+1, n, n), \ldots, P(n, n, n)$, respectively. From lemma 3, the result holds.

2. $P(1, 2, n), \ldots, P(1, n, n)$ can be viewed as $P(n-1, n, n), \ldots, P(1, n, n)$, respectively. From lemma 3, we have $1 \ge P(1, 2, n) \ge \cdots \ge P(1, n, n)$. Now consider the probability $P(1, 1, n)$. $l = 1$ implies that $s_l$ has only one nonzero entry, assumed to be $s_1$; thus $x = s_1 a_1$ and $\|s_l\|_1 = |s_1|$. For an $n \times n$ matrix $A_1^{(1)}$ selected randomly, the solution $s_1^{(1)}$ of equation 2.7 satisfies $\|s_1^{(1)}\|_1 > \|s_1\|_1$, provided that $A_1^{(1)}$ does not contain the column vector $a_1$. Thus, $P(1, 1, n) = 1$, and the second result holds.

3. For $1 \le k \le l$ and fixed $l$, $\lim_{n\to+\infty} P(k, l, n) = \lim_{n\to+\infty} P(n-l+k, n, n) = 1$ from lemma 3.

4. In view of the first result in this theorem, we have

$$P(\|s_l\|_1 \le \|\bar{s}_l\|_1) = P\big(\|s_l\|_1 \le \min_{k=1,\ldots,l}\{\min\{\|s_l^{(i_1,\ldots,i_k)}\|_1, \; 1 \le i_1 < \cdots < i_k \le l\}\}\big) \ge P(1, l, n)\,P(2, l, n)\cdots P(l, l, n) \ge (P(1, l, n))^l. \tag{D.1}$$
From the third result of this theorem and equation D.1, we have $\lim_{n\to+\infty} P(\|s_l\|_1 \le \|\bar{s}_l\|_1) = 1$ for fixed $l$. Thus, the fourth result holds, and the theorem is proved.
Appendix E: Proof of Theorem 4

1. $P(1, l, n, m), P(2, l, n, m), \ldots, P(l, l, n, m)$ can be viewed as $P(n-l+1, n, n, m), P(n-l+2, n, n, m), \ldots, P(n, n, n, m)$. The remaining proof is similar to that of the first result in lemma 3.

2. This result can be proved as in the proof of the second result in theorem 3.

3. First, we prove that $\lim_{n\to+\infty} P(l, l, n, m) = 1$. Choose $n$ column vectors of $\bar{A}_{m-l}$, arbitrarily assumed to be $\bar{a}_1, \ldots, \bar{a}_n$, to replace the $l$ columns of $A_l$, and obtain the corresponding solution $s^{(1,\ldots,l,1,\ldots,n)}$ of equation 2.9. Using $j$ ($1 \le j \le \min\{n, m-n-l\}$) column vectors of $\{\bar{a}_{n+1}, \ldots, \bar{a}_{m-l}\}$ to replace $j$ columns of $\{\bar{a}_1, \ldots, \bar{a}_n\}$, we can obtain a set of new equations and its solution. For the different choices, many sets of equations and their solutions can be obtained, among which $\hat{s}^{(j)}$ denotes the solution with minimum 1-norm. We have

$$\begin{aligned}
P(l, l, n, m) &= P(\|s_l\|_1 < \min\{\|s^{(1,\ldots,l,j_1,\ldots,j_n)}\|_1, \; 1 \le j_1 < \cdots < j_n \le m-l\})\\
&\ge P(\{\|s_l\|_1 < \|s^{(1,\ldots,l,1,\ldots,n)}\|_1\} \cap \{\|\hat{s}^{(j)}\|_1 > \|s_l\|_1, \; j = 1, \ldots, \min\{n, m-n-l\}\})\\
&= P(l, l, n)\, P(\{\|\hat{s}^{(j)}\|_1 > \|s_l\|_1, \; j = 1, \ldots, \min\{n, m-n-l\}\} \mid \{\|s_l\|_1 < \|s^{(1,\ldots,l,1,\ldots,n)}\|_1\})\\
&\ge P(l, l, n) \prod_{j=1}^{m-n-l} P(\{\|\hat{s}^{(j)}\|_1 > \|s_l\|_1\} \mid \{\|s_l\|_1 < \|s^{(1,\ldots,l,1,\ldots,n)}\|_1\}).
\end{aligned} \tag{E.1}$$
Noting that under the condition $\|s_l\|_1 < \|s^{(1,\ldots,l,1,\ldots,n)}\|_1$, $\|\hat{s}^{(j)}\|_1 \le \|s_l\|_1$ implies that there exists at least one vector of $\{\bar{a}_{n+1}, \ldots, \bar{a}_{m-l}\}$ whose direction is sufficiently close to $x$ (or sufficiently opposite to it). As in the proof of case 2 in lemma 3, it can be proved that this probability tends to zero as $n \to +\infty$. Thus, $\lim_{n\to+\infty} P(\{\|\hat{s}^{(j)}\|_1 > \|s_l\|_1\} \mid \{\|s_l\|_1 < \|s^{(1,\ldots,l,1,\ldots,n)}\|_1\}) = 1$. In view of equation E.1, the third result in theorem 3, and the fact that $m-n-l$ is fixed, we have $\lim_{n\to+\infty} P(l, l, n, m) = 1$.
$\lim_{n\to+\infty} P(k, l, n, m) = 1$ can be proved using logic similar to that in the proof of case 2 in lemma 3.

4. For fixed $k$, $l$, and $n$, letting $m$ tend to $+\infty$ implies that there are infinitely many column vectors. We can then choose $n-l+k$ vectors that are sufficiently close to $x$ in direction, such that the 1-norm of the solution of equation 2.9 is smaller than $\|s_l\|_1$. Thus, $\lim_{m\to+\infty} P(k, l, n, m) = 0$.
5. From theorem 2, if $\hat{s}_l \neq s^1$, then $\|s^1\|_0 = n$ with probability one. In view of the definition of $P(k, l, n, m)$ and the first result in this theorem, we have

$$\begin{aligned}
P(\hat{s}_l \neq s^1) &= P(\|s_l\|_1 > \|s^1\|_1)\\
&= 1 - P(\|s_l\|_1 \le \|s^1\|_1)\\
&\le 1 - P(1, l, n, m)\,P(2, l, n, m) \cdots P(l, l, n, m)\\
&\le 1 - (P(1, l, n, m))^l,
\end{aligned} \tag{E.2}$$

that is, $P(\hat{s}_l = s^1) \ge (P(1, l, n, m))^l$. Furthermore, from case 3 of this theorem, we have $\lim_{n\to+\infty} P(\hat{s}_l = s^1) = 1$
for fixed $l$ and $m-n$. Thus, the fifth result holds.

6. The result can be proved as in the proof of the fifth result, using equation 2.14. Theorem 4 is proved.

Acknowledgments

The initial part of this work was partially supported by the Excellent Young Teachers Program of MOE, PRC.

References

Bofill, P., & Zibulevsky, M. (2001). Underdetermined blind source separation using sparse representations. Signal Processing, 81(11), 2353–2362.
Chen, S., Donoho, D. L., & Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1), 33–61.
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. Chichester: Wiley. (Revised and corrected edition.)
Kreutz-Delgado, K., Murray, J. F., Rao, B. D., Engan, K., Lee, T. W., & Sejnowski, T. J. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15, 349–396.
Donoho, D. L., & Elad, M. (2003). Maximal sparsity representation via $l^1$ minimization. Proc. Natl. Acad. Sci., 100, 2197–2202.
Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13(11), 2517–2532.
Jang, G. J., & Lee, T. W. (2003). A probabilistic approach to single channel source separation. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, T. W., Lewicki, M. S., Girolami, M., & Sejnowski, T. J. (1999). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6(4), 87–90.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Makeig, S., Westerfield, M., Jung, T. P., Enghoff, S., Townsend, J., Courchesne, E., & Sejnowski, T. J. (2002). Dynamic brain sources of visual evoked responses. Science, 295, 690–694.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Olshausen, B. A., Sallee, P., & Lewicki, M. S. (2001). Learning sparse image codes using a wavelet pyramid architecture. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Zibulevsky, M., Kisilev, P., Zeevi, Y. Y., & Pearlmutter, B. A. (2001). Blind source separation via multinode sparse representation. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Zibulevsky, M., & Pearlmutter, B. A. (2001). Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13, 863–882.
Zibulevsky, M., Pearlmutter, B. A., Bofill, P., & Kisilev, P. (2000). Blind source separation by sparse decomposition in a signal dictionary. In S. J. Roberts & R. M. Everson (Eds.), Independent component analysis: Principles and practice. Cambridge: Cambridge University Press.
Received March 31, 2003; accepted October 21, 2003.
LETTER
Communicated by Erkki Oja
Minimax Mutual Information Approach for Independent Component Analysis

Deniz Erdogmus
[email protected].edu
Kenneth E. Hild II [email protected].edu
Yadunandana N. Rao [email protected].edu
José C. Príncipe
Computational NeuroEngineering Laboratory, Electrical & Computer Engineering Department, University of Florida, Gainesville, FL 32611, U.S.A.
Minimum output mutual information is regarded as a natural criterion for independent component analysis (ICA) and is used as the performance measure in many ICA algorithms. Two common approaches in information-theoretic ICA algorithms are the minimum mutual information and maximum output entropy approaches. In the former approach, we substitute some form of probability density function (pdf) estimate into the mutual information expression, and in the latter we incorporate the source pdf assumption in the algorithm through the use of nonlinearities matched to the corresponding cumulative density functions (cdf). Alternative solutions to ICA use higher-order cumulant-based optimization criteria, which are related to either one of these approaches through truncated series approximations for densities. In this article, we propose a new ICA algorithm motivated by the maximum entropy principle (for estimating signal distributions). The optimality criterion is the minimum output mutual information, where the estimated pdfs are from the exponential family and are approximate solutions to a constrained entropy maximization problem. This approach yields an upper bound for the actual mutual information of the output signals; hence the name minimax mutual information ICA algorithm. In addition, we demonstrate that for a specific selection of the constraint functions in the maximum entropy density estimation procedure, the algorithm relates strongly to ICA methods using higher-order cumulants.

Neural Computation 16, 1235–1252 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Independent component analysis (ICA) deals with the problem of finding a set of directions such that when the input random vector $x$ is projected
1236
D. Erdogmus, K. Hild, Y. Rao, and J. Principe
on these directions, the projected random variables are as independent as possible. As a natural measure of independence between random variables, mutual information is commonly used in this framework. Shannon's definition of the mutual information between $n$ random variables $Y_1, \ldots, Y_n$, whose joint probability density function (pdf) is $f_Y(y)$ and whose marginal pdfs are $f_1(y^1), \ldots, f_n(y^n)$, respectively, is given by (Cover & Thomas, 1991)

$$I(Y) = \int_{-\infty}^{\infty} f_Y(y) \log \frac{f_Y(y)}{\prod_{o=1}^{n} f_o(y^o)} \, dy, \tag{1.1}$$
where the vector $y$ is composed of the entries $y^o$, $o = 1, \ldots, n$. The mutual information is related to the marginal entropies and the joint entropy of these random variables through (Cover & Thomas, 1991)

$$I(Y) = \sum_{o=1}^{n} H(Y^o) - H(Y), \tag{1.2}$$
where the marginal and joint entropies are defined as (Cover & Thomas, 1991)

$$H(Y^o) = -\int_{-\infty}^{\infty} f_o(y^o) \log f_o(y^o) \, dy^o, \tag{1.3}$$

$$H(Y) = -\int_{-\infty}^{\infty} f_Y(y) \log f_Y(y) \, dy. \tag{1.4}$$
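The identity in equation 1.2 can be checked numerically in a case where all quantities have closed forms. The sketch below uses the standard textbook expression for the differential entropy of a multivariate Gaussian, $H = \frac{1}{2}\log((2\pi e)^n \det \Sigma)$ (not taken from this letter); the covariance values are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_entropy(cov):
    # Differential entropy of an n-dimensional Gaussian:
    # H = 0.5 * log((2*pi*e)^n * det(cov)).
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))

# Two correlated Gaussian outputs Y = (Y^1, Y^2) with unit variances.
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

# Equation 1.2: I(Y) = sum_o H(Y^o) - H(Y). For this Gaussian case it
# reduces to -0.5 * log(det(Sigma)), which is positive and vanishes
# only when Y^1 and Y^2 are uncorrelated.
I = sum(gaussian_entropy(Sigma[o, o]) for o in range(2)) - gaussian_entropy(Sigma)
```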
Minimization of output mutual information is "the canonical contrast for source separation," as Cardoso states (Cardoso & Souloumiac, 1993). Many researchers agree with this comment (Yang & Amari, 1997; Hyvarinen, 1999a; Almeida, 2000). However, three of the best-known methods for ICA, JADE (Cardoso & Souloumiac, 1993), Infomax (Bell & Sejnowski, 1995), and FastICA (Hyvarinen, 1999b), use the diagonalization of cumulant matrices, maximization of output entropy, and fourth-order cumulants, respectively. The difficulty encountered with information-theoretic measures is the estimation of the underlying density of the output signals (or, in the case of Infomax, an accurate guess of the independent source densities). Algorithms that do not use robust pdf estimates suffer from sensitivity to samples.

One commonly taken approach in designing information-theoretic ICA algorithms is the use of some form of polynomial expansion to approximate the pdf of the signals. Commonly used polynomial expansions include Gram-Charlier, Edgeworth, and Legendre, where the pdf estimate is a truncated polynomial evaluated in the vicinity of a maximum entropy density (Comon, 1994; Amari, Cichocki, & Yang, 1996; Hyvarinen, 1998; Hyvarinen, 1999a; Erdogmus, Hild, & Principe, 2001). Even the higher-order cumulant-based contrasts can be understood in this framework (Cardoso, 1999). Since these infinite polynomial expansions must be truncated to keep computational requirements at a minimum, errors are naturally introduced in these approaches. Another density estimation method used in the minimum mutual information (MMI) context is Parzen windowing. Hild, Erdogmus, and Principe (2001) combine the Parzen window pdf estimation method (Parzen, 1962) with Renyi's mutual information (Renyi, 1970) to derive the Mermaid algorithm, which uses the topology proposed by Comon (1994; Hild et al., 2001). Alternative algorithms using Parzen windowing include the quadratic information divergence approach (Xu, Principe, Fisher, & Wu, 1998) and Pham's sum-of-marginal-Shannon's-entropy approach (Pham, 1996). In addition, the use of orthonormal basis functions in pdf estimation could be pursued for ICA (Girolami, 2002). However, such pdf estimates might become invalid when truncated (i.e., take negative values and fail to integrate to one).

Alternative techniques that do not minimize mutual information include second-order methods that achieve source separation through decorrelation of the outputs (Weinstein, Feder, & Oppenheim, 1993; Wu & Principe, 1997; Parra & Spence, 2000; Pham, 2001), nonlinear principal component analysis (NPCA) approaches (Oja, 1999), maximization of higher-order auto-statistics (Simon, Loubaton, Vignat, Jutten, & d'Urso, 1998), cancellation of higher-order cross-statistics (Comon, 1996; Cardoso, 1998; Hyvarinen, 1999a; Sun & Douglas, 2001), nongaussianity measures like negentropy (Girolami & Fyfe, 1997; Torkkola, 1999; Wu & Principe, 1999a), maximum likelihood techniques, which are parametric by definition (Girolami, 1997; Wu & Principe, 1999b; Karvanen, Eriksson, & Koivunen, 2000), and finally maximum entropy methods (Amari, 1997; Torkkola, 1996; Principe & Xu, 1999).
In this letter, we take the minimum output mutual information approach in order to develop an efficient and robust ICA algorithm based on a density estimate stemming from Jaynes's maximum entropy principle. The commonly used whitening-rotation scheme, also described by Comon (1994), will be assumed, where the orthonormal portion of the separation matrix (the rotation stage) is parameterized using Givens angles (Golub & Van Loan, 1993). In this framework, approximations to the maximum entropy pdf estimates that are "consistent to the largest extent with the available data and least committed with respect to unseen data" (Jaynes, 1957) will be used. Under the specific choice of polynomial moments as the constraint functions in the maximum entropy principle, the resulting criterion and the associated algorithm turn out to be related to kurtosis and other higher-order cumulant methods.

2 The Topology and the Cost Function

Suppose that the random vector $z$ is generated by a linear mixture of the form $z = Hs$, where the components of $s$ are independent. Assume that
the mixture is square with size $n$ and that the source vector $s$ is zero mean with identity covariance. In that case, it is well known that the original independent sources can be obtained from $z$ through a two-stage process: spatial whitening to generate an uncorrelated but not necessarily independent mixture $x = Wz$, followed by a coordinate-system rotation in the $n$-dimensional mixture space to determine the independent components, $y = Rx$ (Comon, 1994; Cardoso, 1999; Hild et al., 2001). The whitening matrix $W$ is determined solely by the second-order statistics of the mixture. Specifically, $W = \Lambda^{-1/2}\Phi^T$, where $\Lambda$ is the diagonal eigenvalue matrix and $\Phi$ is the corresponding orthonormal eigenvector matrix of the mixture covariance matrix $\Sigma = E[zz^T]$, assuming that the observations are zero mean. The rotation matrix $R$, which is restricted to be orthonormal by definition, is optimized through minimization of the mutual information between the output signals (Comon, 1994; Hild et al., 2001). Considering equation 1.2 and the fact that the joint entropy is invariant under rotations, the mutual information simplifies to the sum of marginal output entropies for this topology:

$$J(\Theta) = \sum_{o=1}^{n} H(Y^o). \tag{2.1}$$
In equation 2.1, the vector $\Theta$ contains the Givens angles, which parameterize the rotation matrix (Golub & Van Loan, 1993). Accordingly, an $n$-dimensional rotation matrix is parameterized by $n(n-1)/2$ parameters $\theta_{ij}$, where $i = 1, \ldots, n-1$ and $j = i+1, \ldots, n$. The rotation matrix $R_{ij}(\theta_{ij})$ is constructed by starting with an $n \times n$ identity matrix and replacing the $(i,i)$th, $(i,j)$th, $(j,i)$th, and $(j,j)$th entries with $\cos\theta_{ij}$, $-\sin\theta_{ij}$, $\sin\theta_{ij}$, and $\cos\theta_{ij}$, respectively. The total rotation matrix is then found as the product of these two-dimensional rotations:

$$R(\Theta) = \prod_{i=1}^{n-1} \prod_{j=i+1}^{n} R_{ij}(\theta_{ij}). \tag{2.2}$$
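A minimal sketch of the two-stage topology described above: the whitening matrix $W = \Lambda^{-1/2}\Phi^T$ computed from the sample covariance, and the total rotation $R(\Theta)$ assembled as the ordered product of two-dimensional Givens rotations in equation 2.2. The function names, the dictionary holding the angles, and the toy mixture are all our own illustrative choices.

```python
import numpy as np

def whitening_matrix(Z):
    # W = Lambda^{-1/2} Phi^T from the eigendecomposition of the
    # sample covariance of the (zero-mean) mixture.
    lam, Phi = np.linalg.eigh(np.cov(Z))
    return np.diag(lam ** -0.5) @ Phi.T

def total_rotation(theta, n):
    # R(Theta) = prod_{i<j} R_ij(theta_ij), equation 2.2: each R_ij is the
    # identity with entries (i,i), (i,j), (j,i), (j,j) replaced by
    # cos, -sin, sin, cos of theta_ij.
    R = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            Rij = np.eye(n)
            c, s = np.cos(theta[(i, j)]), np.sin(theta[(i, j)])
            Rij[i, i], Rij[i, j], Rij[j, i], Rij[j, j] = c, -s, s, c
            R = R @ Rij
    return R

rng = np.random.default_rng(1)
Z = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 500))   # toy mixture z = Hs
W = whitening_matrix(Z)
theta = {(0, 1): 0.3, (0, 2): -0.5, (1, 2): 1.0}                  # n(n-1)/2 = 3 angles
R = total_rotation(theta, 3)
```

By construction, `W` makes the sample covariance of the whitened mixture the identity, and `R` is orthonormal for any choice of angles, so the rotation stage leaves the whitened statistics intact while the angles are optimized.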
The described whitening-rotation scheme using the Givens-angle parameterization of the rotation matrix has been utilized in ICA algorithms by many researchers, and efficient ways of handling the optimization of these parameters have been studied. In particular, the Jacobi iteration approach for sweeping the Givens angles, which splits the high-dimensional optimization problem into a sequence of one-dimensional problems, has attracted great interest (Comon, 1994).

3 The Maximum Entropy Principle

Jaynes's maximum entropy principle states that in order to determine the pdf estimate that best fits the known data without committing extensively to
unseen data, one must maximize the entropy of the estimated distribution under some constraints. The reason is that the entropy of a pdf is related to the uncertainty of the associated random variable. In addition, the optimality properties of density estimates obtained using generalized maximum entropy principles have been discussed previously (Shore & Johnson, 1980; Kapur & Kesavan, 1992). The constrained entropy maximization problem is defined as follows:

$$\max_{p_{\bar{X}}(x)} H = -\int_{-\infty}^{\infty} p_{\bar{X}}(x) \log p_{\bar{X}}(x) \, dx, \quad \text{subject to } E_{\bar{X}}[f_k(X)] = \alpha_k = E_X[f_k(X)], \quad k = 1, \ldots, m. \tag{3.1}$$
It can be shown using calculus of variations that the solution to this problem is given by (Cover & Thomas, 1991)

$$p_X(x) = C(\lambda) \exp\left(\sum_{k=1}^{m} \lambda_k f_k(x)\right), \tag{3.2}$$
where $\lambda = [\lambda_1, \ldots, \lambda_m]^T$ is the Lagrange multiplier vector and $C(\lambda)$ is the normalization constant. The constants $\alpha_k$ are prespecified or, in the case of adaptive ICA, determined from the data. The Lagrange multipliers need to be solved for simultaneously from the constraints. This is not an easy task in the case of continuous random variables: analytical results do not exist, and numerical techniques are impractical for arbitrary constraint functions because of the infinite range of the definite integrals involved. In order to get around these problems, we take a different approach. Consider, for example, the $i$th constraint equation,

$$\alpha_i = E_X[f_i(X)] = \int_{-\infty}^{\infty} f_i(x) p_X(x) \, dx. \tag{3.3}$$
Applying integration by parts with the following definitions,

$$u = p_X(x), \qquad du = \left(\sum_{k=1}^{m} \lambda_k f_k'(x)\right) p_X(x)\, dx, \qquad v = \int f_i(x)\, dx = F_i(x), \qquad dv = f_i(x)\, dx, \tag{3.4}$$

where $f_k'(x)$ is the derivative of the constraint function, we obtain

$$\alpha_i = F_i(x)\, p_X(x)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} F_i(x) \left(\sum_{k=1}^{m} \lambda_k f_k'(x)\right) p_X(x)\, dx. \tag{3.5}$$
D. Erdogmus, K. Hild, Y. Rao, and J. Principe
If the functions $f_i(x)$ are selected such that their integrals $F_i(x)$ do not diverge faster than the decay rate of the exponential pdf $p_X(x)$, then the first term on the right-hand side of equation 3.5 goes to zero. For example, if the constraint functions were selected as the moments of the random variable, this condition would be satisfied, since $F_i(x)$ will become a polynomial and $p_X(x)$ decays exponentially. This yields

$$\alpha_i = -\sum_{k=1}^{m} \lambda_k \int_{-\infty}^{\infty} F_i(x)\, f_k'(x)\, p_X(x)\, dx = -\sum_{k=1}^{m} \lambda_k E_X[F_i(X)\, f_k'(X)] = -\sum_{k=1}^{m} \lambda_k \beta_{ik}. \tag{3.6}$$
Note that the coefficients $\beta_{ik}$ can be estimated using the samples of $X$ by approximating the expectation operators by sample means.¹ Finally, introducing the vector $\alpha = [\alpha_1, \ldots, \alpha_m]^T$ and the matrix $\beta = [\beta_{ik}]$, the Lagrange multipliers are given by the following solution of the linear system of equations shown in equation 3.6:

$$\lambda = -\beta^{-1} \alpha. \tag{3.7}$$
The approach presented above provides a computationally simple way of finding the coefficients of the estimated pdf of the data directly from the samples, once $\alpha$ and $\beta$ are estimated using sample means. Besides being an elegant way to find the Lagrange multipliers of the constrained entropy maximization problem using only a linear system of equations, the proposed approach has an additional advantage. Since sample mean estimates are utilized in the evaluation of $\beta$, the pdf obtained with the corresponding Lagrange multiplier values will satisfy additional consistency conditions with the samples, besides the constraints normally imposed in the problem definition. These extra conditions satisfied by the determined pdf will be of the form $E[F_i(X)\, f_k'(X)] = (1/N) \sum_{j=1}^{N} F_i(x_j)\, f_k'(x_j)$ for $i = 1, \ldots, m$ and $k = 1, \ldots, m$. In order to understand this effect better, consider the choice $f_k(x) = x^k$ for the constraint functions. In that case, $\beta_{ik} = k \alpha_{i+k}/(i+1)$, since $F_i(x) = x^{i+1}/(i+1)$ and $f_k'(x) = k x^{k-1}$. Consequently, the first $2m$ moments of the pdf estimate given by equation 3.2 are consistent with those of the samples. However, it should be noted that the pdf possesses the maximum entropy among all pdfs that have the same first $m$ moments.

¹ The expectation is over the maximum entropy distribution, but using the sample mean approximates these values by equivalently taking the expectation over the actual data distribution. This estimation becomes more accurate as the actual density of the samples approaches the corresponding maximum entropy distribution. However, the irrelevance of the accuracy of this approximation for the operation of the algorithm will become clear in the following discussion.
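The estimation procedure of equations 3.6 and 3.7 can be sketched in a few lines of code. The following Python fragment is our illustration, not part of the paper: it uses the moment constraints $f_k(x) = x^k$, for which $F_i(x) = x^{i+1}/(i+1)$ and $f_k'(x) = k x^{k-1}$, so that $\beta_{ik} = k\alpha_{i+k}/(i+1)$; the function name is ours.

```python
import numpy as np

def fit_maxent_lambdas(x, m):
    """Estimate the Lagrange multipliers of the maximum entropy pdf
    p(x) = C(lam) * exp(sum_k lam_k * x^k) from samples, using the
    moment constraints f_k(x) = x^k (a sketch of equations 3.6-3.7).

    alpha_i = sample mean of x^i (the constraint values),
    beta_ik = E[F_i(X) f_k'(X)] = k * alpha_{i+k} / (i + 1),
    and lam = -beta^{-1} alpha.
    """
    x = np.asarray(x, dtype=float)
    # sample moments alpha_1 ... alpha_{2m}; alpha[i] holds alpha_{i+1}
    alpha = np.array([np.mean(x ** i) for i in range(1, 2 * m + 1)])
    # beta_ik = k * alpha_{i+k} / (i + 1), with 1-based indices i, k = 1..m
    beta = np.array([[k * alpha[i + k - 1] / (i + 1)
                      for k in range(1, m + 1)]
                     for i in range(1, m + 1)])
    lam = -np.linalg.solve(beta, alpha[:m])
    return alpha[:m], beta, lam
```

For standard gaussian samples with m = 2, the recovered multipliers approach λ = (0, −1/2), the exact exponent of the zero-mean, unit-variance gaussian.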
4 Gradient Update for the Givens Angles

In order to find the optimal ICA solution according to criterion 2.1, using the entropy estimate described in the previous section, a gradient descent update rule for the Givens angles can be employed. The derivative of $H(Y_o)$ with respect to $\theta_{pq}$ is

$$\frac{\partial H(Y_o)}{\partial \theta_{pq}} = -\sum_{k=1}^{m} \lambda_k^o \frac{\partial \alpha_k^o}{\partial \theta_{pq}}. \tag{4.1}$$
The derivation is provided in the appendix. Here, $\lambda^o$ is the parameter vector for the pdf of the oth output signal and $\alpha_k^o$ is the value of the kth constraint for the pdf of the oth output. The former is found by solving the linear equations in equation 3.7, and the latter is easily determined from the corresponding output samples using

$$\alpha_k^o = \frac{1}{N} \sum_{j=1}^{N} f_k(y_{o,j}), \tag{4.2}$$
where $y_{o,j}$ is the jth sample at the oth output for the current angles. The derivative of equation 4.2 with respect to the angle $\theta_{pq}$ is given by

$$\frac{\partial \alpha_k^o}{\partial \theta_{pq}} = \frac{1}{N} \sum_{j=1}^{N} f_k'(y_{o,j}) \frac{\partial y_{o,j}}{\partial \theta_{pq}} = \frac{1}{N} \sum_{j=1}^{N} f_k'(y_{o,j})\, x_j^T \left(\frac{\partial R}{\partial \theta_{pq}}\right)_{o:}^T, \tag{4.3}$$
where the subscript in $R_{o:}$ and $(\partial R/\partial \theta_{pq})_{o:}$ denotes the oth row of the corresponding matrix. The derivative of the rotation matrix with respect to $\theta_{pq}$ is also calculated from

$$\frac{\partial R}{\partial \theta_{pq}} = \left(\prod_{i=1}^{p-1} \prod_{j=i+1}^{n} R^{ij}(\theta_{ij})\right) \left(\prod_{j=p+1}^{q-1} R^{pj}(\theta_{pj})\right) \frac{\partial R^{pq}(\theta_{pq})}{\partial \theta_{pq}} \left(\prod_{j=q+1}^{n} R^{pj}(\theta_{pj})\right) \left(\prod_{i=p+1}^{n-1} \prod_{j=i+1}^{n} R^{ij}(\theta_{ij})\right). \tag{4.4}$$
The overall update rule for the Givens angles is the sum of contributions from each output:

$$\Theta_{t+1} = \Theta_t - \eta \sum_{o=1}^{n} \frac{\partial H(Y_o)}{\partial \Theta}, \tag{4.5}$$
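As a concrete illustration of the parameterization in equation 4.4, the Python sketch below (our own, not from the paper) builds the rotation matrix as an ordered product of Givens factors and obtains each $\partial R/\partial \theta_{pq}$ by replacing the corresponding factor with its derivative. The 0-based pair indexing, the fixed factor ordering, and the helper names are illustrative assumptions.

```python
import numpy as np

def givens(n, p, q, theta):
    """n x n Givens rotation R^{pq}(theta) in the (p, q) plane (0-based)."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[p, p] = c; R[q, q] = c
    R[p, q] = -s; R[q, p] = s
    return R

def givens_deriv(n, p, q, theta):
    """Derivative of R^{pq}(theta) with respect to theta."""
    D = np.zeros((n, n))
    c, s = np.cos(theta), np.sin(theta)
    D[p, p] = -s; D[q, q] = -s
    D[p, q] = -c; D[q, p] = c
    return D

def rotation_and_grads(thetas, n):
    """R(theta) = ordered product over pairs (p, q), p < q, of R^{pq};
    dR/dtheta_pq is the same product with the (p, q) factor replaced by
    its derivative (the structure of equation 4.4)."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    factors = [givens(n, p, q, thetas[(p, q)]) for (p, q) in pairs]
    R = np.linalg.multi_dot(factors) if len(factors) > 1 else factors[0]
    grads = {}
    for idx, (p, q) in enumerate(pairs):
        mats = list(factors)
        mats[idx] = givens_deriv(n, p, q, thetas[(p, q)])
        grads[(p, q)] = np.linalg.multi_dot(mats) if len(mats) > 1 else mats[0]
    return R, grads
```

A finite-difference check confirms that each analytic derivative matches the numerical slope of the corresponding angle.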
where $\eta$ is a small step size. The computational complexity could be reduced by instead following the Jacobi iteration approach (Golub & Van Loan, 1993; Comon, 1994; Cardoso, 1994).

5 Discussion on the Algorithm

We have described an ICA algorithm where the selected exponential density estimate for the outputs is motivated by Jaynes's maximum entropy principle. Due to the difficulty of solving for the Lagrange multipliers analytically, we proposed an approximate numerical solution, which replaces the expectation operator over the maximum entropy distribution by a sample mean over the data distribution. This approximation causes the estimated cost function and the gradient update to deviate from theory. In this section, we will show that this deviation is not critical to the operation of the proposed algorithm. In addition, we will provide an argument on how to choose the constraint functions $f_k(\cdot)$ in the formulation, as well as demonstrate how the criterion reduces to one based simply on higher-order moments and cumulants for a specific choice of constraint functions.

Consider the gradient update given in equation 4.1 in vector form, $\partial H(Y_o)/\partial \theta_{pq} = -(\lambda^o)^T (\partial \alpha^o/\partial \theta_{pq})$, and recall from equation 3.7 that $\lambda^o = -(\beta^o)^{-1} \alpha^o$. Thus, the gradient contribution from the oth output channel is $\partial H(Y_o)/\partial \theta_{pq} = (\alpha^o)^T (\beta^o)^{-T} (\partial \alpha^o/\partial \theta_{pq})$, where the superscript $-T$ denotes the inverse-transpose matrix operation. This can be observed to be the gradient of entropy with respect to the Givens angles, where the entropy estimate is obtained using the exponential density estimate $p_o(\xi) = C(\lambda^o) \exp\left(\sum_{k=1}^{m} \lambda_k^o f_k(\xi)\right)$ for the output signal $Y_o$. This entropy estimate is found to be $H(Y_o) = -\log C(\lambda^o) - (\lambda^o)^T \alpha^o = -\log C(\lambda^o) + (\alpha^o)^T (\beta^o)^{-T} \alpha^o$. Therefore, the proposed algorithm is understood to be essentially minimizing the following cost function:

$$J(\theta) = \sum_{o=1}^{n} H(Y_o) = \sum_{o=1}^{n} \left(-\log C(\lambda^o) + (\alpha^o)^T (\beta^o)^{-T} \alpha^o\right). \tag{5.1}$$
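The per-output entropy term summed in equation 5.1 can be evaluated numerically. In the sketch below (our illustration, not from the paper), the multipliers are obtained as in equations 3.6 and 3.7 for moment constraints, and the normalization constant $C(\lambda)$ is computed by brute-force numerical integration on a finite grid, which is an implementation choice of ours:

```python
import numpy as np

def maxent_entropy(x, m=2):
    """Entropy estimate H(Y) = -log C(lam) - lam^T alpha for the maximum
    entropy density fitted to the first m sample moments (the per-output
    term summed in equation 5.1). The normalization C(lam) is obtained by
    a simple Riemann sum on [-10, 10], an assumption of this sketch."""
    x = np.asarray(x, dtype=float)
    alpha = np.array([np.mean(x ** i) for i in range(1, 2 * m + 1)])
    beta = np.array([[k * alpha[i + k - 1] / (i + 1)
                      for k in range(1, m + 1)]
                     for i in range(1, m + 1)])
    lam = -np.linalg.solve(beta, alpha[:m])
    grid = np.linspace(-10.0, 10.0, 4001)
    unnorm = np.exp(sum(l * grid ** (k + 1) for k, l in enumerate(lam)))
    Z = unnorm.sum() * (grid[1] - grid[0])  # so C(lam) = 1 / Z
    return np.log(Z) - lam @ alpha[:m]
```

For standard gaussian samples with m = 2, the estimate approaches the true gaussian differential entropy, (1/2) log(2πe) ≈ 1.42.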
Our brief investigation of the effect of the selected constraint functions on the separation performance of the algorithm revealed that the simple moments of the form $f_k(x) = x^k$ yield significantly better solutions.² Therefore, this choice of constraints becomes particularly interesting for further analysis. One motivation for using these constraint functions is

² We have performed Monte Carlo comparisons using the following heuristically selected alternative constraint functions: $f_k(x) = \arctan(kx)$, $f_k(x) = |x|^{1/k}$, $f_k(x) = \tan(kx)$, and $f_k(x) = e^{-k|x|}$. The cost functions associated with these alternatives exhibited numerous local minima, which hindered the performance of the gradient-descent algorithm greatly. The performance surface corresponding to the moment constraints was found to be much smoother.
the asymptotic properties of the exponential density estimates of the form $p(\xi) = \exp\left(\lambda_0 + \sum_{k=1}^{m} \lambda_k \xi^k\right)$. Consider continuous infinite-support distributions that could be approximated by the following Taylor series expansion: $\log q(\xi) = \sum_{k=0}^{\infty} \frac{\xi^k}{k!} \left.\frac{\partial^k \log q(\xi)}{\partial \xi^k}\right|_{\xi_*}$. It is known that if the Taylor series expansion exists and the infinite summation converges uniformly, then the exponential density of order m converges uniformly as the order $m \to \infty$ (Barndorff-Nielsen, 1978; Crain, 1974). In addition, since the family of exponential distributions forms a linear manifold with orthogonality properties in the space of natural parameters, the maximum entropy distribution is the orthogonal projection of the infinite-dimensional density onto the m-dimensional linear manifold (Amari, 1985; Crain, 1974).

Besides the desirable asymptotic convergence properties of the exponential family of density estimates, the selection of moments as constraint functions results in gradient updates that are simply gradients of higher-order moments with respect to the Givens angles. Specifically, for this selection of constraints, the gradient of the cost function in equation 5.1 becomes

$$\frac{\partial J(\theta)}{\partial \theta_{pq}} = \sum_{o=1}^{n} \frac{\partial H(Y_o)}{\partial \theta_{pq}} = \sum_{o=1}^{n} (\alpha^o)^T (\beta^o)^{-T} \frac{\partial \alpha^o}{\partial \theta_{pq}}$$

$$= \sum_{o=1}^{n} [\alpha_1^o \cdots \alpha_m^o]
\begin{bmatrix} 2 & & 0 \\ & \ddots & \\ 0 & & m+1 \end{bmatrix}
\begin{bmatrix} \alpha_2^o & \cdots & \alpha_{m+1}^o \\ \vdots & \ddots & \vdots \\ \alpha_{m+1}^o & \cdots & \alpha_{2m}^o \end{bmatrix}^{-1}
\begin{bmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1/m \end{bmatrix}
\begin{bmatrix} \partial \alpha_1^o/\partial \theta_{pq} \\ \vdots \\ \partial \alpha_m^o/\partial \theta_{pq} \end{bmatrix}, \tag{5.2}$$
where $\alpha_k^o = \frac{1}{N} \sum_{j=1}^{N} y_{o,j}^k$. Consider the simple case where $m = 4$, the constraint functions are the moments, and the source distributions (and the particular samples in the finite-sample case) are symmetric.³ In this case, the odd moments vanish, and also, due to the whitening, the second-order moments are fixed to unity.
³ In the finite-sample case, the odd sample moments need not become zero. If we extend the observation sample set by including the $-x$ samples (so that the total set consists of $\{x_k, -x_k\}$, $k = 1, \ldots, N$), the odd sample moments of the extended data set become zero. This corresponds to modifying the source distributions to become symmetric, yet all the measurements are mixtures of the new symmetric sources through the same mixing matrix H.
Thus, the gradient in equation 5.2 becomes 0
2
1 0 ®4o B 6 n 0 ®4o 0 X B[0 1 0 ® o ] diag.2; : : : ; 5/ 6 @ J.µ / o B 4 4 D 0 ®6o ® B 4 @µpq @ oD1 0 ®6o 0 £ diam.1; : : : ; 1=m/[0 0 0 @®4o =@µpq ]T D
n X .5.®4o /2 ¡ 3®6o / @®4o : 4.®4o ®8o ¡ .®6o /2 / @µpq oD1
3¡1 1 0 C ®6o 7 7 C 05 C C A ®8o
(5.3)
The directional contribution from each output to the gradient is along the gradient of the kurtosis of that output. The sign and the magnitude are controlled by the term $(5(\alpha_4^o)^2 - 3\alpha_6^o)/(\alpha_4^o \alpha_8^o - (\alpha_6^o)^2)$.⁴ In order to demonstrate how the sign of this term detects sub- or supergaussianity to adjust the update direction accordingly, we present an evaluation of this quantity for the generalized gaussian family in Figure 1 (sample approximations with 30,000 samples are used for each distribution). Since the cost function in equation 5.1 is minimized, the negative-gradient direction is used, so the updates using the gradient in equation 5.3 try to minimize the kurtosis for subgaussian signals and maximize the kurtosis for supergaussian signals (as expected). In general, the overall gradient includes terms up to $\partial \alpha_m^o/\partial \theta_{pq}$; therefore, the update direction is determined not only by the gradients of the output kurtosis but also by the gradients of higher-order moments.

According to the analysis above, for the choice of moment constraints, Minimax ICA becomes an auto-cumulant method (identical to Fast ICA, in principle, when moments up to order 4 are considered). Other cumulant methods, for instance JADE (Cardoso, 1999), consider cross-cumulants of the outputs. If, at the density estimation stage, we employed the maximum entropy principle to obtain an estimate of the joint output distribution, which could then be used to find the marginal output distributions, the resulting algorithm would also involve updates based on the cross-moments or cross-cumulants of the outputs. This extension to cross-cumulants will be demonstrated and studied in a future article.

6 Numerical Results

In this section, we investigate the performance of the proposed Minimax ICA algorithm. Comparisons with other benchmark ICA algorithms will

⁴ Notice that the constants 3 and 5 in this expression correspond to the fourth- and sixth-order moments of a zero-mean, unit-variance gaussian distribution. Consequently, if the output distribution approaches a unit-variance gaussian, the numerator approaches zero.
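The sign behavior of this coefficient can be checked numerically. The Python sketch below is our illustration (the function name and the sample-based normalization are our choices): it evaluates the factor from equation 5.3 on whitened samples, and comes out positive for a subgaussian (uniform) input and negative for a supergaussian (Laplacian) one, consistent with the update directions described above.

```python
import numpy as np

def kurtosis_direction_coefficient(y):
    """The factor (5*a4^2 - 3*a6) / (4*(a4*a8 - a6^2)) from equation 5.3,
    where a_k is the k-th sample moment of the whitened output; its sign
    indicates sub- (positive) versus supergaussianity (negative)."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()  # enforce the zero-mean, unit-variance assumption
    a4, a6, a8 = (np.mean(y ** k) for k in (4, 6, 8))
    return (5 * a4 ** 2 - 3 * a6) / (4 * (a4 * a8 - a6 ** 2))
```

For a unit-variance gaussian the numerator $5 \cdot 3^2 - 3 \cdot 15$ is exactly zero, so the coefficient vanishes, as noted in footnote 4.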
Figure 1: The coefficient $(5(\alpha_4^o)^2 - 3\alpha_6^o)/(\alpha_4^o \alpha_8^o - (\alpha_6^o)^2)$ evaluated for the generalized gaussian family over a range of the parameter $\beta$. The generalized gaussian family includes the Laplacian ($\beta = 1$), gaussian ($\beta = 2$), and uniform ($\beta \to \infty$) densities as special cases. (Plot axes: coefficient value versus the generalized gaussian parameter $\beta$ from 1 to 3.)
be provided. The performance measure that will be used is the commonly used average signal-to-interference ratio (SIR),

$$\mathrm{SIR(dB)} = \frac{1}{n} \sum_{o=1}^{n} 10 \log_{10} \frac{\max_k (O_{ok}^2)}{O_{o:} O_{o:}^T - \max_k (O_{ok}^2)}, \tag{6.1}$$
where $O$ is the overall matrix after separation, that is, $O = RWH$ (Hild et al., 2001).⁵ In each row, the source corresponding to the maximum entry is assumed to be the main signal at that output. The averaging is done over the dB values of each row to eliminate the possibility that one very good row would become dominant in the SIR calculation. Although for arbitrarily selected $O$ matrices the above SIR measure could have some problems (e.g., if two rows of $O$ are identical with one dominant entry, SIR would be large, but source separation would not have been achieved, since two outputs would yield the same source signal), such occurrences are prevented by the selected topology, that is, the whitening-rotation scheme. As long as the

⁵ The subscript $o:$ denotes the oth row, and $ok$ indicates the corresponding entry.
Figure 2: Average performance of the Minimax ICA algorithm using the sample moments as constraints evaluated for combinations of sample size and number of constraints.
mixing matrix is invertible and all sources have nonzero power, the overall matrix will be full rank, and this problem is avoided.

Our first case study investigates the effect of sample size and the number of constraints on the performance of the Minimax ICA algorithm. In this experiment, we use a 2 × 2 mixture, where the sources are zero-mean, independent gaussian and uniformly distributed random variables. For each combination of the number of constraints and sample size (m, N), selected from the sets {4, 5, 6, 7, 8} and {100, 200, 500, 750, 1000}, respectively, we performed 100 Monte Carlo simulations. In each run, a mixing matrix H is selected randomly (each entry uniform in [−1, 1]), and new independent and identically distributed (iid) samples are generated. The constraint functions are selected as the moments. The average SIR levels obtained for each combination are shown in Figure 2. As expected, regardless of the number of constraints, increasing the number of training samples increases performance. In addition, the worst performance is obtained for the combination m = 8, N = 100. This is also expected, since the estimation of higher-order moments requires a larger sample set for robust estimation. As a consequence, the performance of the Minimax ICA algorithm is sensitive to sample size. However, we note that the performance increases as the number of constraints (the order of moments) is increased (under the condition that the number of samples is sufficient to estimate these higher-order statistics accurately).
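For reference, the SIR measure of equation 6.1 can be written in a few lines of Python (our sketch; the function name is ours):

```python
import numpy as np

def average_sir_db(O):
    """Average signal-to-interference ratio of equation 6.1 for an
    overall matrix O = R W H (rotation x whitening x mixing). Each row's
    dominant entry is taken as the recovered source; the rest is
    interference. The average is taken over the per-row dB values."""
    O = np.asarray(O, dtype=float)
    sirs = []
    for row in O:
        peak = np.max(row ** 2)          # power of the dominant source
        interference = row @ row - peak  # power of all the other sources
        sirs.append(10 * np.log10(peak / interference))
    return np.mean(sirs)
```

For a nearly diagonal overall matrix with off-diagonal leakage of 0.01 per row, each row contributes 10 log10(1/10^-4) = 40 dB.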
In conclusion, if the training set is small, one must use a small number of constraints (lower-order moments), and if the training set is large, one can increase the number of constraints (the order of moments) without compromising robustness.

In the second case study, we conduct Monte Carlo simulations to compare the performance of five algorithms based on the minimum output mutual information contrast: Minimax ICA (with four moment constraints), JADE (Cardoso, 1999), Comon's minimum mutual information algorithm (MMI) (Comon, 1994), Fast ICA (Hyvarinen, 1999b), and the minimum Renyi's mutual information algorithm, Mermaid (Hild et al., 2001). Although we initially aimed to include Yang and Amari's minimum mutual information method (1997), the preliminary results obtained with it were not competitive with the five methods listed. Among the five approaches considered, Mermaid was the computationally most expensive, with requirements that increase as O(N²) with the number of samples (Hild et al., 2001), followed by Minimax ICA, whose complexity increases as O(Nm) with the number of samples (N) and the number of moment constraints (m). In each run, N samples of a source vector composed of one gaussian, one Laplacian (supergaussian), and one uniformly (subgaussian) distributed entry were generated. A 3 × 3 random mixing matrix, whose entries are selected from the interval [−1, 1], is also fixed. The mixed signals are then fed into the five algorithms. Evaluation of the overall performance is done by considering the average final SIR value obtained by each algorithm after convergence is achieved. The averaging is done over 100 runs, and the results are summarized in Figure 3. According to these batch adaptation results, Minimax ICA and JADE perform identically for a very small number of samples, outperforming the other three methods. However, as the number of samples increases, Minimax ICA takes more advantage of this and yields better separation solutions.
The example presented here is a particularly difficult case, since all three types of distributions (gaussian, supergaussian, and subgaussian) are represented in the sources. Experiments (not shown here) performed with the same types of source distributions (with not more than one gaussian source) showed that the performance of all algorithms increases significantly and the differences between performances become much smaller, especially for small training sets.

7 Conclusions

Jaynes's maximum entropy principle has found successful applications in many areas, starting with statistical physics and including the problem of spectral estimation. It basically states that one should use the probabilistic model that best describes the observed data, yet commits minimally to any possible unseen data. In order to achieve this, the density estimates are obtained through a constrained entropy maximization procedure. The constraints ensure that the final density is consistent with the current data.
Figure 3: Average SIR (dB) obtained by Minimax ICA (dot), JADE (diamond), Comon MMI (circle), Fast ICA (cross), and Mermaid (square) versus the number of training samples N.
On the other hand, entropy maximization guarantees the selection of the model with maximum uncertainty, that is, the model that is least committed to unseen data.

In this letter, we proposed a novel ICA algorithm that is based on Jaynes's maximum entropy principle in the pdf estimation step. The commonly used whitening-rotation topology is borrowed from the literature, whereas the criterion used, minimum output mutual information, is considered to be the natural information-theoretic measure for ICA. We have shown that the Lagrange multipliers of the maximum entropy pdf can be easily estimated from the samples by solving a system of linear equations. In addition, the gradient of the output mutual information with respect to the rotation angles, which characterize the rotation matrix, turned out to have a simple expression in terms of these Lagrange multipliers.

Numerical case studies have shown that the sample moments form a useful set of constraint functions that result in a smooth cost function, free of local minima, and with accurate solutions. Although the specific alternative nonlinear constraint functions investigated in this letter resulted in surfaces with many local minima, for some applications the designer might have a priori knowledge about the form of the constraints that the data might satisfy. The selection of these functions provides some freedom to the designer in that respect.

Comparisons with benchmark algorithms like Comon's MMI, Fast ICA, and JADE in problems involving mixed-kurtosis sources (gaussian, subgaussian, and supergaussian) showed that the average performance of Minimax ICA is equal to that of JADE for small sample sets and gets better with an increasing number of samples. In simulations not reported here, we have observed that where all sources are uniformly distributed (or have densities with light tails), Fast ICA provides very good results that compete with Minimax ICA. Since Minimax ICA specifically looks for the maximum entropy density estimates, the performance could degrade in some extreme situations, especially if the actual densities do not belong to the exponential family arising from the maximum entropy assumption.

Future investigation will focus on extensions of the Minimax ICA algorithm to cross-cumulants by incorporating the joint moments of the outputs into the estimation. In addition, we will determine the effect of using series expansions other than Taylor's (which leads to the moment constraints). This will allow the algorithm to extend beyond the exponential family of sources and the cumulant-based weight updates.

Appendix

We present the step-by-step derivation of the derivative of the output marginal entropy with respect to one of the Givens angles:

$$\begin{aligned}
\frac{\partial H(Y_o)}{\partial \theta_{pq}} &= -\frac{\partial}{\partial \theta_{pq}} \int_{-\infty}^{\infty} p_o(\xi) \log p_o(\xi)\, d\xi \\
&= -\int_{-\infty}^{\infty} \left[\frac{\partial p_o(\xi)}{\partial \theta_{pq}} \log p_o(\xi) + p_o(\xi)\, \frac{\partial p_o(\xi)/\partial \theta_{pq}}{p_o(\xi)}\right] d\xi \\
&= -\int_{-\infty}^{\infty} \frac{\partial p_o(\xi)}{\partial \theta_{pq}} \log p_o(\xi)\, d\xi - \int_{-\infty}^{\infty} \frac{\partial p_o(\xi)}{\partial \theta_{pq}}\, d\xi \\
&= -\int_{-\infty}^{\infty} \frac{\partial p_o(\xi)}{\partial \theta_{pq}} \left(\log C(\lambda^o) + \sum_{k=1}^{m} \lambda_k^o f_k(\xi)\right) d\xi - \int_{-\infty}^{\infty} \frac{\partial p_o(\xi)}{\partial \theta_{pq}}\, d\xi \\
&= -\left(1 + \log C(\lambda^o)\right) \frac{\partial}{\partial \theta_{pq}} \int_{-\infty}^{\infty} p_o(\xi)\, d\xi - \sum_{k=1}^{m} \lambda_k^o \frac{\partial}{\partial \theta_{pq}} \int_{-\infty}^{\infty} p_o(\xi)\, f_k(\xi)\, d\xi \\
&= -\sum_{k=1}^{m} \lambda_k^o \frac{\partial \alpha_k^o}{\partial \theta_{pq}},
\end{aligned}$$

where the first term vanishes because $\int_{-\infty}^{\infty} p_o(\xi)\, d\xi = 1$ for all angles, yielding equation 4.1.
Acknowledgments We thank the anonymous reviewers for their valuable comments and suggestions. This work is partially supported by NSF grant ECS-0300340. Parts of this article were presented at the ICA 2003 conference at Nara, Japan (Erdogmus, Hild, Rao, & Principe, 2003).
References

Almeida, L. B. (2000). Linear and nonlinear ICA based on mutual information. In Proceedings of AS-SPCC'00 (pp. 117–122). Piscataway, NJ: IEEE Press.
Amari, S. I. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. I. (1997). Neural learning in structured parameter spaces—natural Riemannian gradient. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 127–133). Cambridge, MA: MIT Press.
Amari, S. I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barndorff-Nielsen, O. E. (1978). Information and exponential families in statistical theory. New York: Wiley.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J. F. (1994). On the performance of orthogonal source separation algorithms. In Proceedings of EUSIPCO'94 (pp. 776–779). London: EURASIP.
Cardoso, J. F. (1998). Blind signal separation: Statistical principles. In Proceedings of IEEE (pp. 2009–2025). Piscataway, NJ: IEEE Press.
Cardoso, J. F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11, 157–192.
Cardoso, J. F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings F Radar and Signal Processing, 140, 362–370.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Comon, P. (1996). Contrasts for multichannel blind deconvolution. IEEE Signal Processing Letters, 3, 209–211.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crain, B. R. (1974). Estimation of distributions using orthogonal expansions. Annals of Statistics, 2, 454–463.
Erdogmus, D., Hild II, K. E., & Principe, J. C. (2001). Independent component analysis using Renyi's mutual information and Legendre density estimation. In Proceedings of IJCNN'01 (pp. 2762–2767). Piscataway, NJ: IEEE Press.
Erdogmus, D., Hild II, K. E., Rao, Y. N., & Principe, J. C. (2003). Independent component analysis using Jaynes' maximum entropy principle. In Proceedings of ICA'03 (pp. 385–390). Available on-line: http://www.kecl.ntt.co.jp/icl/signal/ica2003/cdrom/index.htm.
Girolami, M. (1997). Symmetric adaptive maximum likelihood estimation for noise cancellation and signal estimation. Electronics Letters, 33, 1437–1438.
Girolami, M. (2002). Orthogonal series density estimation and the kernel eigenvalue problem. Neural Computation, 14, 669–688.
Girolami, M., & Fyfe, C. (1997). Kurtosis extrema and identification of independent components: A neural network approach. In Proceedings of ICASSP'97 (pp. 3329–3332). Piscataway, NJ: IEEE Press.
Golub, G., & Van Loan, C. (1993). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Hild II, K. E., Erdogmus, D., & Principe, J. C. (2001). Blind source separation using Renyi's mutual information. IEEE Signal Processing Letters, 8, 174–176.
Hyvarinen, A. (1998). New approximations of differential entropy for independent component analysis and projection pursuit. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 273–279). Cambridge, MA: MIT Press.
Hyvarinen, A. (1999a). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128.
Hyvarinen, A. (1999b). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10, 626–634.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620–630.
Kapur, J., & Kesavan, H. (1992). Entropy optimization principles and applications. New York: Academic Press.
Karvanen, J., Eriksson, J., & Koivunen, V. (2000). Maximum likelihood estimation of ICA model for wide class of source distributions. In Proceedings of NNSP'00 (pp. 445–454). Piscataway, NJ: IEEE Press.
Oja, E. (1999). The nonlinear PCA learning rule in independent component analysis. In Proceedings of ICA'99 (pp. 143–148).
Parra, L., & Spence, C. (2000). Convolutive blind separation of non-stationary sources. IEEE Transactions on Speech and Audio Processing, 46, 320–327.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.
Pham, D. T. (1996). Blind separation of instantaneous mixture sources via an independent component analysis. IEEE Transactions on Signal Processing, 44, 2768–2779.
Pham, D. T. (2001). Blind separation of instantaneous mixture of sources via the gaussian mutual information criterion. Signal Processing, 81, 855–870.
Principe, J. C., & Xu, D. (1999). Information theoretic learning using Renyi's quadratic entropy. In Proceedings of ICA'99 (pp. 407–412).
Renyi, A. (1970). Probability theory. Amsterdam: North-Holland.
Shore, J. E., & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26, 26–37.
Simon, C., Loubaton, P., Vignat, C., Jutten, C., & d'Urso, G. (1998). Blind source separation of convolutive mixtures by maximizing of fourth-order cumulants: The non-IID case. In Proceedings of ICASSP'98 (pp. 1584–1588). Piscataway, NJ: IEEE Press.
Sun, X., & Douglas, S. C. (2001). Adaptive paraunitary filter banks for contrast-based multichannel blind deconvolution. In Proceedings of ICASSP'01 (pp. 2753–2756). Piscataway, NJ: IEEE Press.
Torkkola, K. (1996). Blind separation of delayed sources based on information maximization. In Proceedings of NNSP'96 (pp. 1–10). Piscataway, NJ: IEEE Press.
Torkkola, K. (1999). Blind separation for audio signals—are we there yet? In Proceedings of ICA'99 (pp. 239–244).
Weinstein, E., Feder, M., & Oppenheim, A. (1993). Multi-channel signal separation by decorrelation. IEEE Transactions on Speech and Audio Processing, 1, 405–413.
Wu, H. C., & Principe, J. C. (1997). A unifying criterion for blind source separation and decorrelation: Simultaneous diagonalization of correlation matrices. In Proceedings of NNSP'97 (pp. 465–505). Piscataway, NJ: IEEE Press.
Wu, H. C., & Principe, J. C. (1999a). A gaussianity measure for blind source separation insensitive to the sign of kurtosis. In Proceedings of NNSP'99 (pp. 58–66). Piscataway, NJ: IEEE Press.
Wu, H. C., & Principe, J. C. (1999b). Generalized anti-Hebbian learning for source separation. In Proceedings of ICASSP'99 (pp. 1073–1076). Piscataway, NJ: IEEE Press.
Xu, D., Principe, J. C., Fisher, J., & Wu, H. C. (1998). A novel measure for independent component analysis. In Proceedings of ICASSP'98 (pp. 1161–1164). Piscataway, NJ: IEEE Press.
Yang, H. H., & Amari, S. I. (1997). Adaptive online learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9, 1457–1482.

Received December 13, 2002; accepted November 7, 2003.
LETTER
Communicated by C. Lee Giles
Improving Generalization Capabilities of Dynamic Neural Networks

Miroslaw Galicki
[email protected]
Institute of Medical Statistics, Computer Sciences and Documentation, Friedrich Schiller University, Jena, Germany, and Department of Management, University of Zielona Góra, Zielona Góra, Podgórna 50, Poland
Lutz Leistritz
[email protected]
Institute of Medical Statistics, Computer Sciences and Documentation, Friedrich Schiller University, Jena, Germany
Ernst Bernhard Zwick
[email protected]
Department of Pediatric Orthopedics, Karl-Franzens-University, Graz, Austria
Herbert Witte
[email protected]
Institute of Medical Statistics, Computer Sciences and Documentation, Friedrich Schiller University, Jena, Germany
This work addresses the problem of improving the generalization capabilities of continuous recurrent neural networks. The learning task is transformed into an optimal control framework in which the weights and the initial network state are treated as unknown controls. A new learning algorithm based on a variational formulation of Pontryagin's maximum principle is proposed. Under reasonable assumptions, its convergence is discussed. Numerical examples are given that demonstrate an essential improvement of the generalization capabilities after the learning process of a dynamic network.

1 Introduction

Artificial neural networks are now commonly used in many practical tasks, such as the classification, identification and control, and fault diagnostics of dynamic processes. Due to the nature of these processes, it seems more reasonable to apply recurrent networks, which are, in fact, (nonlinear) dynamic systems. A general problem for a neural network in all of these tasks is to find an underlying mapping that correspondingly relates their input-output

Neural Computation 16, 1253–1282 (2004) © 2004 Massachusetts Institute of Technology
M. Galicki, L. Leistritz, E. Zwick, and H. Witte
spaces based on a given (usually finite) set, called a learning set, of input-output pairs. As a rule, the learning pairs represent some functions of time (trajectories). It is well known that the problem of finding such a mapping is ill posed (Tikhonov & Arsenin, 1977; Morozov, 1984) because its solution is not unique. Consequently, some criterion reflecting a network performance, which in turn influences the realization quality of the above tasks, should be introduced in order to obtain a unique solution. The capability of a neural network to generalize seems to be of utmost importance from both a theoretical and practical point of view. The generalization property is strongly related to the bias and variance, the components of the generalization error (Geman, Bienenstock, & Doursat, 1992). In the case of neural networks, these components depend in general on the training algorithm used that finds an optimal solution with respect to the weights, a network structure (an optimal balance between bias and variance), and the power of the learning set. In this context, several approaches may be distinguished, although they refer for the most part to feedforward neural networks, which are static systems. An exhaustive overview of the algorithms modifying a network structure during the training of feedforward networks may be found in the work of Doering, Galicki, and Witte (1997). In the case of recurrent networks, the recurrent cascade correlation method has been proposed by Fahlman (1991) to tackle the problem of network parameter reduction. As was proved in Giles, Chen, and Sun (1995), it is unable to learn some practical tasks.
Recently, several training algorithms for dynamic networks with fixed structures have been proposed that find constant optimal weights using, for example, Lyapunov-based stability theory (Fabri & Kadirkamanathan, 1996), an error function of integral type (Pearlmutter, 1995; Cohen, Saad, & Marom, 1997), the expectation-maximization algorithm (Ma & Ji, 1998), or a concept from the theory of stochastic learning automata (Sundareshan & Condarcure, 1998; see Galicki, Leistritz, & Witte, 1999, for a broader review of the learning algorithms). Nevertheless, these procedures influence only variations of the bias. Another approach to the improvement of the generalization performance of a neural network is provided by the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971). Roughly speaking, the VC dimension determines conditions under which the bias decreases faster than the variance increases (Abu-Mostafa, 1989). Unfortunately, the resulting worst-case estimate of the number of training pairs required to train a neural network is very large (Holden & Niranjan, 1995). As recurrent networks are Turing machines (Kilian & Siegelmann, 1996), the corresponding VC dimension may take arbitrarily large values for arbitrarily long input signals (DasGupta & Sontag, 1996; Hammer, 1997). An alternative means of controlling the generalization capability is the use of regularization techniques, which are based on the addition of a penalty term, called a regularizer, to the error function. The penalty term influences variations of the network variance. To the best of our knowledge, such techniques have been applied
Improving Generalization Capabilities of Dynamic Neural Networks
1255
for the most part to static (feedforward) neural networks. In this context, three approaches may be distinguished. The first is based on using a quadratic weight decay regularizer (Hinton, 1987; Czernichow, 1997; Taniguchi & Tresp, 1997; Hansen & Rasmussen, 1994). Empirical evidence indicates that a regularizer of this form can lead to an improvement in the generalization of feedforward networks. Utilizing the concept of small perturbations of the training data, Wu and Moody (1996) and Leung, Tsoi, and Chan (2001) have proposed an improved version of the weight decay regularizer. Moreover, this type of regularizer was analyzed for discrete recurrent networks in Wu and Moody (1996). The next approach involves the network information criterion (NIC) (Murata, Yoshizawa, & Amari, 1994; Billings & Zheng, 1995), a generalization of Akaike's information criterion (AIC) to neural networks, in which the penalty term is proportional to the number of weights. As variations of the error function might be much greater than the added penalty term, the NIC may result in a poor generalization capability of a feedforward network. The third approach, extensively studied by Simard, Victorri, Le Cun, and Denker (1991), Bishop (1991, 1995b), and Girosi, Jones, and Poggio (1995), is based on adding to the error function a regularizer that measures the discrepancy between the actual and desired directional derivatives of some selected transformations (Simard et al., 1991) or that penalizes the curvature of the feedforward network mapping (Bishop, 1991, 1995b; Girosi, Jones, & Poggio, 1995). In our opinion, this approach seems to be very efficient since it directly influences variations of the variance (directional derivative and curvature-driven methods yield network mappings with lower mapping variance). Nevertheless, a suitable choice of the penalty parameter that controls the bias-variance trade-off is very difficult in practice.
As can be seen from the cited literature, very little has been written on improving the generalization capabilities of continuous dynamic neural networks, although a solution to this problem has a crucial effect on the performance of many practical tasks. The purpose of this work is to propose a step toward the construction of learning algorithms satisfying this requirement. The generalized dynamic neural networks (GDNNs), first introduced and theoretically analyzed by Leistritz, Galicki, and Witte (1998) and Galicki, Leistritz, and Witte (1999), seem to be suitable in such a case. They are characterized by time-dependent weights. Introducing time-dependent weights results in an overparameterization of GDNNs. However, this property allows us to define a larger number of tasks to be performed simultaneously, thus reducing (or limiting) network parameter redundancy. More formally, our aim is to construct a learning algorithm producing a small global error between the output trajectories and the desired (target) ones, which, in addition, results in better generalization capabilities of GDNNs. As is known from the theory of static
neural networks, the minimization of their structures leads to a decrease of the variance. However, it has been numerically shown by Galicki et al. (1999) that a GDNN of theoretically minimal size has worse generalization capabilities than a fully connected one. On the other hand, a weight decay term only indirectly influences the network mapping variance. Consequently, our approach is based on using either the sensitivity or the curvature of a GDNN's mapping as the regularizer added to the standard error function. These terms directly influence variations of the neural network mapping and hence its variance, too. As opposed to static networks, the dynamic (time-varying) curvature or sensitivity is not directly given. An adjoint system of ordinary differential equations (ODEs) is introduced to determine the network sensitivity or curvature. In order to fully specify these equations, initial conditions must be found during the training of the dynamic network. By analogy to the static regularization theory, we both extend and generalize it to the dynamic case using the adjoint system of ODEs. Nevertheless, the determination of a penalty parameter may still be difficult in such a case. However, GDNNs offer another possibility for tackling this problem. As is known, they possess a very useful property: Galicki et al. (1999) have shown that it is possible to find network weights that produce an arbitrarily small value of the error functional. This property makes it possible to formulate our task as the minimization of the curvature or sensitivity norm of a GDNN's mapping subject to a given accuracy of learning. We immediately notice that this optimization problem does not involve a penalty parameter, whose determination is difficult in practice. Thus, the learning tasks defined above may be treated as a problem of dynamic programming with state constraints in which the weights and the initial network state are treated as unknown controls.
In this case, it is very difficult to use Pontryagin's maximum principle (in its classical form; Pontryagin, Boltyansky, Gamkrelidze, & Mischenko, 1961) because it presents only the positive form for the control. In fact, the strong variation algorithms based on the maximum principle (Srochko, 1986) are not oriented to optimal control problems involving state inequality constraints. Sakawa and Shindo (1980) have designed an algorithm that handles state variable inequality constraints by using a prediction technique. Controls are computed by minimizing the penalized Hamiltonian. Strend and Balden (1990) have shown the ineffectiveness of this algorithm in the case of bang-bang control problems. The convergence proofs of many implementable algorithms are based on generating a sequence of controls. However, an optimal control problem with state inequality constraints may not have a limit control (in a class of measurable controls), although optimal trajectories for such a problem may exist (Filippov, 1962). It is therefore natural to attempt to employ other techniques to solve the learning tasks defined above. The variational formulation of Pontryagin's maximum principle is used here for this purpose, which makes it possible to handle state constraints efficiently. The technique of the convergence proof of the proposed algorithm is also discussed. It is different from those presented above and is based on searching for limit network trajectories whose existence is theoretically ensured. The article is organized as follows. Section 2 presents the learning task in terms of an optimal control problem. Section 3 describes how to employ the variational formulation of Pontryagin's maximum principle to determine optimal weights. Some numerical examples of classification and time-series prediction problems, confirming the theoretical results, are presented in section 4. Finally, concluding remarks are drawn in section 5.

2 Problem Formulation

A neural network of the Hopfield type is considered whose state evolves according to the following differential and algebraic equations,

\dot{y} = f(y, w, u), \qquad z = g(y),    (2.1)
where \mathbb{R}^N \ni y = (y_1, \ldots, y_N)^T is the state vector of the network; N denotes the number of all neurons; w = (w_{(i-1)N+j} : 1 \le i, j \le N), where w_{(i-1)N+j} represents the weight of the connection from neuron i to neuron j; u = u(t) = (u_1, \ldots, u_N)^T stands for the vector of external inputs to the network; f = (f_1, \ldots, f_N)^T with f_i(y, w, u) = -y_i + h\left( \sum_{j=1}^{N} w_{(i-1)N+j}\, y_j \right) + u_i, i = 1 : N; t \in [0, T]; T is a fixed time of network evolution; y(0) = A = (A_1, \ldots, A_N)^T denotes the initial network state; h(\cdot) is a strictly increasing neural activation function; g : \mathbb{R}^N \to \mathbb{R}^L, where g(y) stands for a given mapping that represents the outputs of the network; and L \le N is the number of neurons producing the outputs. The mapping g usually takes the following form (the states of neurons 1, \ldots, L present the outputs of the network):

g(y) = (y_1, \ldots, y_L)^T.    (2.2)
In order to simplify further considerations, the time constants of the neurons as well as their synaptic delays will be omitted from now on. A natural prerequisite during the recognition of multiple classes is that the initial state of the network (given a priori or determined during the learning) should be the same for all the external inputs. Nevertheless, the specification (e.g., by the user) of a poor initial state may considerably slow the convergence rate of the learning process or even lead to convergence difficulties (see Galicki et al., 2000, for learning with an unknown initial state). Hence, the learning task for multiple continuous trajectories is now to modify the weight vector w and the initial state y(0) so that the output trajectories z_p(t) corresponding to the external inputs u_p(t) = (u_{1,p}(t), \ldots, u_{N,p}(t))^T follow differentiable desired (target) trajectories d_p(t) \in \mathbb{R}^L, where p = 1 : P and P \ge 1 denotes the given number of training pairs.
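As a concrete illustration of the dynamics in equation 2.1, the following sketch integrates a small Hopfield-type network with forward Euler. The function names (`simulate_gdnn`, `w_fn`, `u_fn`), the choice h = tanh, and the integration scheme are our assumptions for illustration, not part of the authors' method.

```python
import numpy as np

def simulate_gdnn(w_fn, u_fn, A, T=1.0, dt=1e-3, h=np.tanh, L=2):
    """Euler-integrate the Hopfield-type dynamics of equation 2.1:
       dy_i/dt = -y_i + h(sum_j W_ij(t) y_j) + u_i(t),  y(0) = A,
    with time-dependent weights W(t) (a GDNN) and output z = (y_1,...,y_L)."""
    y = np.asarray(A, dtype=float).copy()
    n_steps = int(round(T / dt))
    zs = np.empty((n_steps, L))
    for k in range(n_steps):
        t = k * dt
        W, u = w_fn(t), u_fn(t)           # N x N weights and N-vector input at time t
        y = y + dt * (-y + h(W @ y) + u)  # one forward Euler step of equation 2.1
        zs[k] = y[:L]                     # output mapping g(y) = (y_1,...,y_L)
    return zs

# toy usage: three neurons, constant weights and input, two output coordinates
N = 3
W0 = 0.1 * np.ones((N, N))
z = simulate_gdnn(lambda t: W0, lambda t: np.ones(N), A=np.zeros(N), T=1.0)
```

A time-varying `w_fn` (e.g., a polynomial in t) turns the same routine into a GDNN simulator.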
As is known, in many practical applications of dynamic neural networks the available external inputs in the learning set are often noisy. If the error functional equals zero for the learning data, the capability of the network to generalize, that is, to produce an acceptable output when a novel external input is presented, will often be poor. This fact is a consequence of the rapid variations in the network mapping that are generally needed to decrease the bias (i.e., to interpolate the noisy data). In order to improve the generalization capability of a dynamic neural network, it seems necessary to control the bias-variance trade-off or, in other words, to approximate the learning set by the network mapping rather than to interpolate it. This requirement implies that, in general, the error functional for the learning data no longer equals zero. Nevertheless, approximation prevents large variations of the network variance. One constructive way to limit these variations is to add a penalty term to the error functional whose role is to produce large values for the rapid variations in the network mapping that are characteristic of poor generalization. Techniques that exploit such types of constraints in learning problems of static neural networks (e.g., perceptrons, radial basis functions) are commonly termed regularization (Girosi et al., 1995; Bishop, 1995a). Regularization is frequently used in a number of other fields for tackling ill-posed problems (Tikhonov & Arsenin, 1977). To the best of our knowledge, using regularization with network curvature smoothing for training dynamic neural networks has not been addressed so far in the literature, although it is important from both a theoretical and practical point of view. By analogy to the static regularization theory, the following performance index may be introduced in the dynamic case,

J = \frac{1}{2} \int_0^T \sum_{p=1}^{P} \left( \|z_p - d_p\|^2 + \lambda \left\| \frac{\partial^2 z_p}{\partial u_p^2} \right\|^2 \right) dt,    (2.3)

where the first term in the brackets stands for the instantaneous error function and the second is a scaled square norm of the network curvature; \frac{\partial^2 z_p}{\partial u_p^2} = \left[ \frac{\partial^2 z_{p,l}}{\partial u_{p,k}\, \partial u_{p,j}} \right]_{1 \le j,k \le N;\ 1 \le l \le L}; \|\cdot\| denotes the standard Euclidean norm; and
λ is the regularization parameter that controls the compromise between the degree of smoothness of the network mapping and its closeness to the data. Therefore, λ is directly related to the degree of generalization that is enforced. Minimizing the term \left\| \frac{\partial^2 z_p(t)}{\partial u_p(t)^2} \right\|^2 in equation 2.3 forces the derivative with respect to the external input u_p(t) of the network mapping (its sensitivity) to be locally invariant with respect to small perturbations of u_p(t) (for a fixed time instant t). Moreover, a small initial network sensitivity value (e.g., equal to zero, as is assumed in all the computer simulations) additionally results in approximately local invariance of the network output for small variations of u_p(t) at t. Furthermore, using a curvature-driven term instead of a sensitivity one makes it possible to better approximate the learning set by the network mapping. The term \frac{\partial^2 z_p}{\partial u_p^2} in equation 2.3 may be computed based on equation 2.2. It equals

\frac{\partial^2 z_p}{\partial u_p^2} = \frac{\partial^2 g}{\partial u_p\, \partial y_p} \frac{\partial y_p}{\partial u_p} + \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2},

where

\frac{\partial y_p}{\partial u_p} = \left[ \frac{\partial y_{p,i}}{\partial u_{p,j}} \right]_{1 \le i,j \le N}; \qquad \frac{\partial^2 y_p}{\partial u_p^2} = \left[ \frac{\partial^2 y_{p,i}}{\partial u_{p,k}\, \partial u_{p,j}} \right]_{1 \le i,j,k \le N}.

On account of the fact that \frac{\partial^2 g}{\partial u_p\, \partial y_p} = 0, we obtain

\frac{\partial^2 z_p}{\partial u_p^2} = \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2},

and hence the performance index, equation 2.3, takes the form

J = \frac{1}{2} \int_0^T \sum_{p=1}^{P} \left( \|z_p - d_p\|^2 + \lambda \left\| \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2} \right\|^2 \right) dt.    (2.4)
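Once the output, target, and curvature trajectories have been sampled on a time grid, the performance index 2.4 reduces to a Riemann sum; a minimal sketch (the array shapes and the name `performance_index` are our assumptions):

```python
import numpy as np

def performance_index(z, d, curv, lam, dt):
    """Riemann-sum approximation of equation 2.4:
       J = 0.5 * int_0^T sum_p ( ||z_p - d_p||^2 + lam * ||curvature_p||^2 ) dt.
    z, d have shape (P, n_steps, L); curv has shape (P, n_steps, L, N, N) and
    holds the second derivatives of the outputs with respect to the inputs."""
    err = np.sum((z - d) ** 2, axis=-1)         # ||z_p - d_p||^2 at every time step
    reg = np.sum(curv ** 2, axis=(-3, -2, -1))  # squared norm of the curvature tensor
    return 0.5 * dt * float(np.sum(err + lam * reg))

# toy check: unit error over two steps of length 0.5, zero curvature
z = np.ones((1, 2, 1))
d = np.zeros((1, 2, 1))
curv = np.zeros((1, 2, 1, 3, 3))
J = performance_index(z, d, curv, lam=1.0, dt=0.5)
```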
Consequently, the computation of J from equation 2.4 requires the knowledge of the second derivative \frac{\partial^2 y_p}{\partial u_p^2} as a function of time. It may be found as follows. First, let us differentiate equation 2.1 with respect to u_p (assuming that the corresponding partial derivatives are continuous). The result is

\frac{d}{dt} \left( \frac{\partial y_p}{\partial u_p} \right) = \frac{df}{du_p},    (2.5)

where

\frac{df}{du_p} = \left[ \frac{df_i}{du_{p,j}} \right] = \frac{\partial f}{\partial y_p} \frac{\partial y_p}{\partial u_p} + \frac{\partial f}{\partial u_p}.

Differentiating the last equation by u_p for the second time, we obtain the adjoint system of ODEs,

\frac{d}{dt} \left( \frac{\partial^2 y_p}{\partial u_p^2} \right) = \frac{d^2 f}{du_p^2},    (2.6)
where

\frac{d^2 f}{du_p^2} = \left[ \frac{d^2 f_i}{du_{p,k}\, du_{p,j}} \right] = \left( \frac{\partial y_p}{\partial u_p} \right)^T \frac{\partial^2 f}{\partial y_p^2} \frac{\partial y_p}{\partial u_p} + \frac{\partial^2 f}{\partial u_p\, \partial y_p} \frac{\partial y_p}{\partial u_p} + \frac{\partial f}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2} + \frac{\partial^2 f}{\partial y_p\, \partial u_p} \frac{\partial y_p}{\partial u_p} + \frac{\partial^2 f}{\partial u_p^2}.
Let us note that the solution of equation 2.6 is possible only when the term \frac{\partial y_p}{\partial u_p} is known as a function of time. It may be found based on equation 2.5. In order to fully specify the differential equations 2.5 and 2.6, initial values for the derivatives \frac{\partial y_p}{\partial u_p} and \frac{\partial^2 y_p}{\partial u_p^2}, that is,

\frac{\partial y_p}{\partial u_p}(0) = B, \qquad \frac{\partial^2 y_p}{\partial u_p^2}(0) = C,    (2.7)
where B = [B_{ij}]_{1 \le i,j \le N} and C = [C_{ijk}]_{1 \le i,j,k \le N} denote some constant matrices, should be given. However, they are not known in practice. Therefore, an additional task arises, which consists of determining the initial values 2.7 when learning the dynamic network 2.1. A potential disadvantage of applying criterion 2.4 is that it requires knowledge of the regularization parameter λ. This may be cumbersome from a practical point of view since this parameter is problem dependent. Therefore, an alternative performance index may be proposed in the form

J = \frac{1}{2} \int_0^T \sum_{p=1}^{P} \left\| \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2} \right\|^2 dt,    (2.8)
which reflects only the generalization capability of the network. Thus, the learning task may now be formulated more compactly and formally, as follows. Minimize

J(w, A, B, C) = \frac{1}{2} \int_0^T \sum_{p=1}^{P} C_p\, dt,    (2.9)

where C_p = \|z_p - d_p\|^2 + \lambda \left\| \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2} \right\|^2 for the optimization problem 2.4 through 2.7 and C_p = \left\| \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2} \right\|^2 in the case of learning task 2.5 through 2.8, subject to the constraints

\dot{y}_p = f(y_p, w, u_p), \qquad \frac{d}{dt} \left( \frac{\partial y_p}{\partial u_p} \right) = \frac{df}{du_p}, \qquad \frac{d}{dt} \left( \frac{\partial^2 y_p}{\partial u_p^2} \right) = \frac{d^2 f}{du_p^2},    (2.10)

y_p(0) = A, \qquad \frac{\partial y_p}{\partial u_p}(0) = B, \qquad \frac{\partial^2 y_p}{\partial u_p^2}(0) = C,    (2.11)
and

\frac{1}{2} \int_0^T \sum_{p=1}^{P} D_p\, dt \le 0,    (2.12)

where D_p = 0 for the optimization problem 2.4 through 2.7 and D_p = \|z_p - d_p\|^2 - \frac{2\epsilon}{TP} for the optimization task 2.5 through 2.8, respectively; ε > 0 is a small given number (the accuracy of learning). The inequality 2.12 takes into account (in the nontrivial case) the condition of tracking the desired trajectories d_p by the neural network outputs (a conventional task most commonly used in the literature). Let us note that giving some value for ε is more natural than searching for a compromise (i.e., λ) between the terms \|z_p - d_p\|^2 and \left\| \frac{\partial g}{\partial y_p} \frac{\partial^2 y_p}{\partial u_p^2} \right\|^2 in performance index 2.4. In addition, the generalization error is also proportional to ε for finite time sequences (see, e.g., DasGupta & Sontag, 1996, and Hammer, 1997, for details). One constructive method of determining the weight vector w that fulfills the constraint 2.12 has been proposed by Galicki et al. (1999). Expressions 2.9 through 2.12 present a relatively complex learning problem whose solution by means of classical networks (with constant weights) seems to be difficult (or even impossible). The reason is that the network should now accomplish more tasks than one simultaneously, that is, minimize the proper performance index 2.9 while maintaining the conventional task given by equation 2.12. Therefore, more parameter redundancy is required for the network to fulfill all the constraints 2.10 through 2.12. Generalized dynamic neural networks (GDNNs), first introduced by Leistritz, Galicki, and Witte (1998) and Galicki et al. (1999), offer a possibility of tackling this problem. They are defined by introducing time-dependent weights into a Hopfield-type network. As was shown by Galicki et al. (1999), it is possible to find a weight vector w such that the constraint 2.12 is satisfied for each ε ≥ 0. This property makes the learning algorithm that is presented in the next section computationally efficient.
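Equations 2.5 and 2.7 define a forward sensitivity system that can be integrated alongside the network state. A minimal scalar sketch (one neuron, h = tanh, forward Euler; the function name and step sizes are our assumptions), with a finite-difference check that the integrated sensitivity really is dy/du:

```python
import numpy as np

def state_and_sensitivity(w, u, T=1.0, dt=1e-4, y0=0.0, s0=0.0):
    """Jointly integrate the scalar analogue of equations 2.1 and 2.5:
       dy/dt = -y + tanh(w*y) + u
       ds/dt = (df/dy)*s + df/du = (-1 + w*(1 - tanh(w*y)**2))*s + 1,
    where s = dy/du is the network sensitivity with s(0) = 0 (the B of eq. 2.7)."""
    y, s = y0, s0
    for _ in range(int(round(T / dt))):
        dfdy = -1.0 + w * (1.0 - np.tanh(w * y) ** 2)   # df/dy at the current state
        y_new = y + dt * (-y + np.tanh(w * y) + u)      # Euler step for the state
        s = s + dt * (dfdy * s + 1.0)                   # Euler step for the sensitivity
        y = y_new
    return y, s

# finite-difference check: perturb u and compare (y2 - y1)/eps with s
w, u, eps = 0.5, 1.0, 1e-5
y1, s = state_and_sensitivity(w, u)
y2, _ = state_and_sensitivity(w, u + eps)
```

The sensitivity recursion above is exactly the derivative of the Euler map with respect to u, so the finite-difference quotient matches s up to O(eps).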
The time-dependent weights may be treated as controls. Thus, dependences 2.9 through 2.12 formulate the learning task as an optimal control problem. The fact that there exist constraints imposed on the coordinates of the vectors y_p (state equality and inequality constraints) makes its solution difficult. An approach based on the application of the variational formulation of Pontryagin's maximum principle will be proposed to solve the optimization problem.

3 Learning by Smoothing the Network Mapping

Before presenting the algorithm, all the variables y_p, \frac{\partial y_p}{\partial u_p}, and \frac{\partial^2 y_p}{\partial u_p^2} will be transformed into a vector form, which is more convenient for further considerations. Therefore, the following state vectors are introduced:

x_p = (x_{p,1}, \ldots, x_{p,N+N^2+N^3})^T = \left( y_{p,1}, \ldots, y_{p,N},\ \frac{\partial y_{p,1}}{\partial u_{p,1}}, \ldots, \frac{\partial y_{p,N}}{\partial u_{p,N}},\ \frac{\partial^2 y_{p,1}}{\partial u_{p,1}\, \partial u_{p,1}}, \ldots, \frac{\partial^2 y_{p,N}}{\partial u_{p,N}\, \partial u_{p,N}} \right)^T

and

x_0 = (A_1, \ldots, A_N, B_{11}, \ldots, B_{NN}, C_{111}, \ldots, C_{NNN})^T, \qquad p = 1 : P.
Using the quantities x_0 and x_p and taking into account expressions 2.1, 2.5, and 2.6, the dynamic equations of the neural network may be written in an extended state space as

\dot{x}_p = F(x_p, w, u_p),    (3.1)

where

F = \left( f_1, \ldots, f_N,\ \frac{df_1}{du_{p,1}}, \ldots, \frac{df_N}{du_{p,N}},\ \frac{d^2 f_1}{du_{p,1}\, du_{p,1}}, \ldots, \frac{d^2 f_N}{du_{p,N}\, du_{p,N}} \right)^T.
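The extended state vector x_p is just the state, sensitivity, and curvature arrays flattened and stacked; a small sketch of the packing and its inverse (the function names are ours):

```python
import numpy as np

def pack_state(y, dy_du, d2y_du2):
    """Stack the state, sensitivity, and curvature into the extended state
    vector x_p of dimension N + N^2 + N^3 that equation 3.1 evolves."""
    return np.concatenate([np.ravel(y), np.ravel(dy_du), np.ravel(d2y_du2)])

def unpack_state(x, N):
    """Inverse of pack_state: recover y, dy/du, and d2y/du2 from x_p."""
    y = x[:N]
    dy_du = x[N:N + N**2].reshape(N, N)
    d2y_du2 = x[N + N**2:].reshape(N, N, N)
    return y, dy_du, d2y_du2

N = 3
x = pack_state(np.arange(3.0), np.zeros((N, N)), np.ones((N, N, N)))
y, s, c = unpack_state(x, N)
```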
Consequently, the optimal control problem given by expressions 2.9 through 2.12 is transformed into an equivalent functional form, as follows. Minimize

J(w, x_0) = \int_0^T C\, dt,    (3.2)

where

C = \sum_{p=1}^{P} \left( \frac{1}{2} \|(x_{p,1}, \ldots, x_{p,L})^T - d_p\|^2 + \frac{1}{2} \lambda \sum_{m=N+N^2+1}^{N+N^2+LN^2} (x_{p,m})^2 \right)

for the optimization problem 2.4 through 2.7 and C = \sum_{p=1}^{P} \frac{1}{2} \sum_{m=N+N^2+1}^{N+N^2+LN^2} (x_{p,m})^2 in the case of learning task 2.5 through 2.8, subject to the constraints

\dot{x}_p = F(x_p, w, u_p),    (3.3)

x_p(0) = x_0,    (3.4)

G(w, x_0) \le 0,    (3.5)
where G = 0 for the optimization problem 2.4 through 2.7 and G = \frac{1}{2} \int_0^T \sum_{p=1}^{P} D_p(x_p)\, dt with D_p = \|(x_{p,1}, \ldots, x_{p,L})^T - d_p\|^2 - \frac{2\epsilon}{TP} in the case of learning task 2.5 through 2.8, respectively. Due to the state constraints 3.4 and 3.5, the application of the variational formulation of Pontryagin's maximum principle (Fedorenko, 1978) is proposed here to solve the optimal control problem 3.2 through 3.5 (see, e.g., Galicki, 1998, for details regarding the variational formulation of Pontryagin's maximum principle). This formulation requires a specification of the initial weight vector w^0 = w^0(t), t ∈ [0, T]. However, w^0 does not necessarily satisfy the state inequality constraint 3.5. The initial trajectories x_p^0(t) are obtained by solving equation 3.3 for each u_p, p = 1 : P. By assumption, w^0 and x_0 do not minimize the performance index, equation 3.2. The use of the variational formulation of Pontryagin's maximum principle requires incrementation of functionals 3.2 and 3.5. Therefore, it is assumed that the weight vector w^0 = w^0(t) and the initial state x_0 are perturbed by small variations δw = δw(t) = (δw_m : 1 ≤ m ≤ N²) and δx_0 = (δx_{1,0}, \ldots, δx_{N+N^2+N^3,0})^T, where \|\delta w\|_\infty = \max_{t \in [0,T]} \max_{1 \le m \le N^2} |\delta w_m| \le \rho and \|\delta x_0\|_\infty = \max_{1 \le i \le N+N^2+N^3} |\delta x_{i,0}| \le \xi; ρ and ξ are given small numbers ensuring the correctness of the presented method. According to the theory of small perturbations (Pontryagin et al., 1961) and on account of the fact that δx_p(0) = δx_0, the value of the functional J for the perturbed weight vector w^0 + δw and initial state x_0 + δx_0 may be expressed to first order (by neglecting the higher-order terms o(δw, δx_0)) by the following dependence,

J(w^0 + \delta w, x_0 + \delta x_0) = J(w^0, x_0) + \delta J(w^0, x_0, \delta w, \delta x_0),    (3.6)
where

\delta J(w^0, x_0, \delta w, \delta x_0) = \sum_{p=1}^{P} \left( \int_0^T \left\langle (F_w(p,t))^T \Psi_p,\ \delta w \right\rangle dt + \left\langle \Psi_p(0),\ \delta x(0) \right\rangle \right)

is the Fréchet differential of the functional J(\cdot, \cdot), and the \Psi_p(\cdot) stand for conjugate mappings, which are determined by the solution of the Cauchy problem:

\dot{\Psi}_p + (F_{x_p}(t))^T \Psi_p = -C_{x_p}(t); \qquad \Psi_p(T) = 0;

F_w(p,t) = \left( \frac{\partial F}{\partial w} \right)_{x_p = x_p^0,\ w = w^0}; \qquad F_{x_p}(t) = \left( \frac{\partial F}{\partial x_p} \right)_{x_p = x_p^0,\ w = w^0}; \qquad C_{x_p}(t) = \left( \frac{\partial C}{\partial x_p} \right)_{x_p = x_p^0}.
The value of the functional determined by the left-hand side of equation 3.5 for the weights w^0 + δw and the perturbed initial state x_0 + δx_0 is given to first order as follows,

G(w^0 + \delta w, x_0 + \delta x_0) = G(w^0, x_0) + \delta G(w^0, x_0, \delta w, \delta x_0),    (3.7)

where \delta G(w^0, x_0, \delta w, \delta x_0) = \sum_{p=1}^{P} \left( \int_0^T \left\langle (F_w(p,t))^T \Phi_p,\ \delta w \right\rangle dt + \left\langle \Phi_p(0),\ \delta x(0) \right\rangle \right) is the Fréchet differential of the functional G(\cdot, \cdot); the mappings \Phi_p(\cdot) are computed based on the solution of the following differential equations:

\dot{\Phi}_p + (F_{x_p}(t))^T \Phi_p = -D_{x_p}(t); \qquad \Phi_p(T) = 0; \qquad D_{x_p}(t) = \left( \frac{\partial D_p}{\partial x_p} \right)_{x_p = x_p^0}.
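The conjugate mappings above are obtained by integrating a linear adjoint ODE backward in time from the terminal condition Ψ_p(T) = 0. A simple backward-Euler sketch (the function name and array layout are our assumptions):

```python
import numpy as np

def adjoint_solve(Fx, Cx, dt):
    """Integrate the conjugate (adjoint) system backward in time:
       dPsi/dt = -Fx(t)^T Psi - Cx(t),  Psi(T) = 0,
    on a uniform grid; Fx has shape (n_steps, n, n), Cx has shape (n_steps, n)."""
    n_steps, n = Cx.shape
    Psi = np.zeros((n_steps, n))          # Psi[-1] = 0 is the terminal condition
    for k in range(n_steps - 1, 0, -1):   # Euler sweep from t = T down to t = 0
        dPsi = -Fx[k].T @ Psi[k] - Cx[k]
        Psi[k - 1] = Psi[k] - dt * dPsi
    return Psi

# sanity check on a scalar system with Fx = 0, Cx = 1: Psi(t) = T - t
n_steps, dt = 101, 0.01
Psi = adjoint_solve(np.zeros((n_steps, 1, 1)), np.ones((n_steps, 1)), dt)
```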
For properly selected variations δw and δx_0, the Fréchet differentials of the functionals 3.2 and 3.5 approximate the increments of these functionals with any desired accuracy. Consequently, the variational formulation of Pontryagin's maximum principle takes the following form,

J(w^0 + \delta w^*, x_0 + \delta x_0^*) = J(w^0, x_0) + \min_{(\delta w, \delta x_0)} \{ \delta J(w^0, x_0, \delta w, \delta x_0) \},    (3.8)

subject to the constraints

G(w^0, x_0) + \delta G(w^0, x_0, \delta w^*, \delta x_0^*) \le 0, \qquad \|\delta w^*\|_\infty \le \rho, \qquad \|\delta x_0^*\|_\infty \le \xi.    (3.9)
The assumption of nonoptimality of the weight vector w^0 and initial network state x_0 implies the existence of sufficiently small variations δw* and δx_0* such that J(w^0 + δw*, x_0 + δx_0*) < J(w^0, x_0). Consequently, the task of minimizing functional 3.8 is well posed, and the variations δw*, δx_0* are the results of its solution. A finite-dimensional approximation of the optimal control problem defined by dependences 3.8 and 3.9 seems to be very effective for solving it numerically. The process of approximation for the weight variation δw may be accomplished, for example, by using quasi-constant or quasi-linear functions or splines. This article looks for a sequence of weight variations in a class of polynomials,

\delta w(t) = V C(t),    (3.10)

where V \in \mathbb{R}^{N^2 \times M} denotes the matrix of coefficients to be found; M is a given number of basis functions (accuracy of computations); and C(t) stands for the M-dimensional vector of specified basis functions (e.g., Chebyshev polynomials scaled to [0, T]). Let us note that M could be optimized (implying a structural optimization), resulting in an increase of computational efficiency (i.e., finding a minimal number of basis functions giving a satisfactory minimum of functional 3.5). For simplicity, M is assumed to be given. Consequently, only the weight variation (coefficient matrix V) and the initial network state variation δx_0 are subject to the minimization process. In this way, the determination of the optimal variations δw* and δx_0* from dependences 3.8 and 3.9 is reduced to solving a linear programming problem with the following cost function,

J(w^0 + \delta w, x_0 + \delta x_0) = J(w^0, x_0) + \int_0^T \left\langle \sum_{p=1}^{P} (F_w(p,t))^T \Psi_p,\ V C(t) \right\rangle dt + \left\langle \sum_{p=1}^{P} \Psi_p(0),\ \delta x_0 \right\rangle,    (3.11)

with respect to V and δx_0, subject to the constraints

G(w^0, x_0) + \int_0^T \left\langle \sum_{p=1}^{P} (F_w(p,t))^T \Phi_p,\ V C(t) \right\rangle dt + \left\langle \sum_{p=1}^{P} \Phi_p(0),\ \delta x_0 \right\rangle \le 0, \qquad \|V\|_\infty \le \rho', \qquad \|\delta x_0\|_\infty \le \xi,    (3.12)
where ρ' is a given small number (step length). After the determination of the optimal variations δw*, δx_0* from the minimization task 3.11–3.12, the computations are repeated for the revised quantities w^1 = w^0 + δw* and x_0^1 = x_0 + δx_0*. A sequence of pairs (w^k, x_0^k), k = 0, 1, \ldots, with x_0^0 = x_0, may thus be obtained as a result of solving the iterative approximation scheme 3.11–3.12. Each element of this sequence corresponds to a state trajectory x_p^k(\cdot). The continuity of D_p with respect to x_p and the inequality \max_{k,p} \int_0^T D_p(x_p^k)\, dt < \infty yield the following relation: \sum_{p=1}^{P} \int_0^T D_p(x_p^*)\, dt \le 0, where x_p^* stands for a limit trajectory whose existence is theoretically ensured. Consequently, the convergence of this sequence to an optimal solution (in general, local) follows from the work of Galicki et al. (1999). In order to run the minimization procedure 3.11–3.12, the initial weights w^0 and state vector x_0 should be specified. As is known, GDNNs render it possible to efficiently find weights that fulfill the state inequality constraint 2.12. Next, the initial vectors x_0 and w^0 are iteratively modified using the variations δx_0, δw(t) = VC(t) in order to carry out a proper minimization of the performance index 3.11. The generality of the iterative scheme 3.11–3.12 carries with it a penalty: somewhat larger computations may be required during the learning process. They may be significantly reduced by taking into account the square norm of the network sensitivity instead of the mapping curvature. The role of the sensitivity term is also to smooth out rapid variations in the network mapping. As compared to the curvature, the network sensitivity is more local (the first derivative is used instead of the second one), and thus the generalization capability of the network may become somewhat worse for a small learning set, as the numerical results presented in the next section show. Nevertheless, application of the sensitivity measure as the penalty term leads to the minimization of the following performance index:

J = \frac{1}{2} \int_0^T \sum_{p=1}^{P} \left( \|z_p - d_p\|^2 + \lambda \left\| \frac{\partial g}{\partial y_p} \frac{\partial y_p}{\partial u_p} \right\|^2 \right) dt.    (3.13)
Let us note that minimizing the term \left\| \frac{\partial g}{\partial y_p} \frac{\partial y_p}{\partial u_p} \right\|^2 forces the network output to be locally invariant with respect to small perturbations of the external input u_p(t) at time instant t. In such a case, the state vector x_p and the network mapping F take the following form,

x_p = (x_{p,1}, \ldots, x_{p,N+N^2})^T = \left( y_{p,1}, \ldots, y_{p,N},\ \frac{\partial y_{p,1}}{\partial u_{p,1}}, \ldots, \frac{\partial y_{p,N}}{\partial u_{p,N}} \right)^T,    (3.14)

where x_0 = (A_1, \ldots, A_N, B_{11}, \ldots, B_{NN})^T, and

F = \left( f_1, \ldots, f_N,\ \frac{df_1}{du_{p,1}}, \ldots, \frac{df_N}{du_{p,N}} \right)^T.    (3.15)
Let us note that using performance index 3.13 results in a significant decrease of the size of the state vector (by N³). Consequently, the iterative scheme 3.11–3.12 is now reduced as follows. Minimize

J(w^0, x_0) + \int_0^T \left\langle \sum_{p=1}^{P} (F_w(p,t))^T \Psi_p,\ V C(t) \right\rangle dt + \left\langle \sum_{p=1}^{P} \Psi_p(0),\ \delta x_0 \right\rangle    (3.16)

with respect to V and δx_0, subject to the constraints

\|V\|_\infty \le \rho', \qquad \|\delta x_0\|_\infty \le \xi.    (3.17)
As is easy to see, the above linear programming problem may be solved analytically, which significantly reduces the computational cost. Eliminating the regularization parameter λ from equation 3.13 leads (similarly as in the case of network curvature) to the minimization of the following performance index,

J = \frac{1}{2} \int_0^T \sum_{p=1}^{P} \left\| \frac{\partial g}{\partial y_p} \frac{\partial y_p}{\partial u_p} \right\|^2 dt,    (3.18)

subject to the constraints

y_p(0) = A, \qquad \frac{\partial y_p}{\partial u_p}(0) = B,    (3.19)

\frac{1}{2} \int_0^T \sum_{p=1}^{P} \|z_p - d_p\|^2\, dt \le \epsilon.    (3.20)
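The remark that the box-constrained linear program 3.16–3.17 is solvable analytically can be made concrete: with no coupling inequality, a linear objective over a box is minimized componentwise at the boundary. A small sketch (the function name `analytic_lp_step` and the toy gradient values are ours; the gradients stand in for the integrals of ⟨(F_w)^T Ψ_p, VC(t)⟩ in 3.16):

```python
import numpy as np

def analytic_lp_step(grad_V, grad_x0, rho_prime, xi):
    """Minimize <grad_V, V> + <grad_x0, dx0> subject only to the box constraints
    ||V||_inf <= rho' and ||dx0||_inf <= xi (the LP 3.16-3.17).  Each component
    is independent, so the minimizer sits on the box boundary:
       V = -rho' * sign(grad_V),   dx0 = -xi * sign(grad_x0).
    Components with zero gradient are left at zero (any box value is optimal)."""
    return -rho_prime * np.sign(grad_V), -xi * np.sign(grad_x0)

# toy gradients of the linearized cost with respect to V and delta_x0
gV = np.array([[0.3, -1.2], [0.0, 2.0]])
gx = np.array([-0.5, 0.1])
V, dx0 = analytic_lp_step(gV, gx, rho_prime=0.05, xi=0.01)
```

The curvature variant 3.11–3.12 adds the single coupling inequality from G, which is why it needs a general linear programming solver instead.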
Application of the iterative minimization scheme 3.11–3.12 with x_p and F defined by equations 3.14 and 3.15 yields an optimal solution. It follows from the considerations carried out that a relatively general method for training a continuous-time recurrent neural network subject to good generalization is proposed here. However, the generality of our approach carries with it a penalty: somewhat larger computations are required to find an optimal solution. In such a context, determination of the time-dependent network curvature \frac{\partial^2 y_p}{\partial u_p^2} requires O(N³) solutions of ODEs, which would seemingly mean that the proposed method is appropriate only for small problems. Nevertheless, the time-dependent weights result in an overparameterization of such networks. Consequently, this property gives the possibility of reducing the network structure (decreasing the number of neurons N) even for the relatively complex problems considered in section 4. Moreover, as the numerical simulations carried out in the next section show, the sensitivity term \frac{\partial y_p}{\partial u_p} (requiring only O(N²) solutions of ODEs, which significantly reduces the computational burden) gives similar results to those obtained by minimizing the norm of the network curvature. However, minimizing the norm of the network curvature needs smaller training sets (which may sometimes be practically desirable, especially in the medical field) as compared to minimizing the norm of the sensitivity. This property seems to be obvious since the second derivative carries more information about the (local) network behavior than the first derivative.

4 Computer Examples

The purpose of this section is to show, on the basis of several chosen tasks, the essential improvement of the generalization capability of GDNNs trained with the algorithms presented in section 3. The particular forms of the performance indexes used in the computer examples are dictated by the necessity of making it possible to compare the obtained numerical results of the methods proposed in section 3 with training methods without regularization (classical performance index) known from the literature.
4.1 One-Class Task. The first task is the training of a circular trajectory by means of a fully connected GDNN with three neurons. The network state evolution is governed by the following differential equations,

\dot{y} = \begin{pmatrix} \dot{y}_1 \\ \dot{y}_2 \\ \dot{y}_3 \end{pmatrix}
= \begin{pmatrix} -y_1 + \arctan\left(\sum_{j=1}^{3} w_j y_j\right) \\
-y_2 + \arctan\left(\sum_{j=1}^{3} w_{j+3} y_j\right) \\
-y_3 + \arctan\left(\sum_{j=1}^{3} w_{j+6} y_j\right) + u_3 \end{pmatrix},   (4.1)

with the initial network state y(0) = (0, 0, 0)^T and a network evolution time of T = 12\pi. The network output z is defined by the first two coordinates:

z(t) = \begin{pmatrix} y_1(t) \\ y_2(t) \end{pmatrix}.

Evidently, y(0) is a fixed point of equation 4.1 if no external input (u_3(t) = 0) is provided to the network. That is why a nonzero input u_3 is necessary in order to produce the trajectory

d(t) = \begin{pmatrix} \cos(t) \\ \sin(t) \end{pmatrix}.

Consequently, a constant input function u_3(t) = 1 was used in this example. In fact, the condition u_3(t) \neq 0, 0 \leq t \leq \delta, for an arbitrary \delta > 0, is sufficient for the solution of the learning task (Galicki et al., 1999). In other words, the principal task of the external input is to thrust the system 4.1 away from y(0). Starting with a common initial weight vector w^0(t) = 0 for all t, nine GDNNs were trained: the first one (net1) by means of the classical performance index (\lambda = 0); four networks (net2–net5) by means of the performance index, equation 3.13, including a first-order regularization term with \lambda \in \{0.02, 0.05, 0.1, 0.2\}; and four further networks (net6–net9) by means of
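The dynamics of equation 4.1 are easy to simulate directly. The sketch below uses plain Euler integration and a constant weight vector (the trained networks of this example actually use time-dependent weights w(t), which are not reproduced here); it only illustrates the two facts used in the text: y(0) is a fixed point when u_3 = 0, and a nonzero u_3 thrusts the state away from it.

```python
import numpy as np

def simulate_gdnn(w, u3, T=12 * np.pi, dt=1e-3):
    """Euler-integrate the three-neuron GDNN of equation 4.1.
    w  : (9,) weight vector (constant here; the trained nets use w(t))
    u3 : external input function of t, applied to the third neuron only
    Returns the state trajectory; the network output is z = y[:, :2]."""
    W = np.asarray(w, float).reshape(3, 3)   # rows hold w1..w3, w4..w6, w7..w9
    steps = int(T / dt)
    y = np.zeros(3)                          # initial state y(0) = (0, 0, 0)^T
    out = np.empty((steps, 3))
    for k in range(steps):
        drive = np.arctan(W @ y)
        drive[2] += u3(k * dt)               # input enters the third neuron only
        y = y + dt * (-y + drive)
        out[k] = y
    return out

# u3 = 0 leaves the network at its fixed point; u3 = 1 drives it away.
rest = simulate_gdnn(np.zeros(9), lambda t: 0.0)
driven = simulate_gdnn(np.zeros(9), lambda t: 1.0)
```

With zero weights the first two state coordinates never move; only the externally driven third coordinate leaves the origin, which is exactly why learning must shape w(t) to turn this drive into the circular target trajectory.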
Improving Generalization Capabilities of Dynamic Neural Networks
Figure 1: Performance partitions for trained networks: (A) the classical error, (B) sensitivity, and (C) curvature.
the performance index, equation 2.4, including a second-order regularization term with \lambda \in \{0.02, 0.05, 0.1, 0.2\}. Defining the performance contributions

J_0 = \int_{2\pi}^{12\pi} \|z - d\|^2 \, dt, \qquad
J_1 = \int_{2\pi}^{12\pi} \left\| \frac{\partial g}{\partial y} \frac{\partial y}{\partial u} \right\|^2 dt, \qquad
J_2 = \int_{2\pi}^{12\pi} \left\| \frac{\partial g}{\partial y} \frac{\partial^2 y}{\partial u^2} \right\|^2 dt,

one can see the expected influences of the regularization techniques in Figure 1. It can be seen from this figure that the regularization technique does not influence the error J_0 in dependence on \lambda. Moreover, all networks have similar approximation properties. In contrast, the smoothing terms J_1 and J_2 show a direct dependence on the regularization constant. Moreover, a regularization with J_1 leads to a decrease of J_2, and vice versa, in this example. This observation is typical in the presence of a single training pattern (P = 1) but cannot be affirmed for P > 1. In order to evaluate the generalization capabilities of the networks, the network outputs with respect to disturbed external inputs were considered.
At first, gaussian noise with different variances was added to u_3, where

u_{3,\sigma_a} = u_3(t) + \sigma_a \cdot X(t), \quad X(t) \sim N(0, 1), \quad \sigma_a = 0.02, 0.04, \ldots, 2

denotes the noisy input. Second, the input u_3 was modified by random Chebyshev polynomials of order 11 (scaled to [0, 12\pi]). That is, the perturbed input was given by

u_{3,\sigma_b} = u_3(t) + \sigma_b \cdot Y(t), \quad Y(t) = \frac{\sum_{i=0}^{10} \eta_i C_i(t)}{\left\| \sum_{i=0}^{10} \eta_i C_i \right\|_2}, \quad \sigma_b = 0.02, 0.04, \ldots, 2,

with a set of independent, identically uniformly distributed random variables \eta_i \sim U([-1, 1]). At each \sigma, 100 realizations of both types of perturbation were used as external inputs and provided to all nine trained networks. The mean L_2 distances between the network response and the target are given in Figure 2. Obviously, the regularization with J_1 or J_2 improves the generalization capabilities, where the regularization with the first-order term J_1 is more favorable due to its better cost-value ratio.
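A realization of the random-polynomial perturbation \sigma_b Y(t) can be generated as follows. The grid size and the discrete L2 normalization are implementation choices for this sketch, not taken from the text.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def random_perturbation(sigma_b, T=12 * np.pi, n_grid=4096, seed=None):
    """One realization of sigma_b * Y(t): Y is a random combination of the
    Chebyshev polynomials C_0..C_10 rescaled to [0, T], with coefficients
    eta_i ~ U([-1, 1]), normalized to unit (discrete) L2 norm on [0, T]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, T, n_grid)
    x = 2.0 * t / T - 1.0                  # map [0, T] onto the Chebyshev domain [-1, 1]
    eta = rng.uniform(-1.0, 1.0, size=11)  # eta_0 .. eta_10
    y = cheb.chebval(x, eta)               # sum_i eta_i C_i(t)
    dt = t[1] - t[0]
    y /= np.sqrt(np.sum(y**2) * dt)        # discrete L2 normalization over [0, T]
    return t, sigma_b * y

t, pert = random_perturbation(0.5, seed=0)
u3_noisy = 1.0 + pert                      # perturbed version of the constant input u3 = 1
```

By construction the perturbation has L2 norm \sigma_b regardless of which coefficients are drawn, so the two disturbance families (gaussian noise and random polynomials) are compared at matched magnitudes.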
4.2 Two-Classes Task. The second example deals with the classification of external inputs, which are defined as solutions of the one-dimensional van der Pol equation,

\ddot{u}_\mu = \mu \cdot (1 - u_\mu^2) \dot{u}_\mu - u_\mu,   (4.2)
for any real number \mu. By means of the parameter \mu, we define two classes \omega_1 and \omega_2 distributed according to the class conditional density functions (see Figure 3)

\varphi_1(\mu) = \frac{1}{1.2 \sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{\mu}{1.2}\right)^2}

and

\varphi_2(\mu) = \frac{1}{2} \cdot \frac{1}{0.8 \sqrt{2\pi}} \left( e^{-\frac{1}{2}\left(\frac{\mu - 2.4}{0.8}\right)^2} + e^{-\frac{1}{2}\left(\frac{\mu + 2.4}{0.8}\right)^2} \right).

Assuming equal a priori probabilities for the classes \omega_1 and \omega_2, the optimal thresholds for the discrimination are given by the points of intersection of \varphi_1 and \varphi_2 (Fukunaga, 1990). They are approximately equal to \pm 1.558 and \pm 7.082. However, for our further considerations, the values \pm 7.082 are disregarded, since the corresponding contribution to the classifier error is less than 10^{-8}. By this setup, we have two classes of functions whose distribution is implicitly known via the parameter \mu. The question now is whether a GDNN
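The inner thresholds can be checked numerically: under equal priors, the Bayes-optimal thresholds are the points where the two class conditional densities intersect. A minimal sketch using bisection (the bracket [0.5, 3.0] is an assumption chosen to contain the inner positive intersection):

```python
import math

def phi1(mu):
    """Class conditional density of omega_1: N(0, 1.2^2)."""
    return math.exp(-0.5 * (mu / 1.2) ** 2) / (1.2 * math.sqrt(2 * math.pi))

def phi2(mu):
    """Class conditional density of omega_2: equal mixture of N(+-2.4, 0.8^2)."""
    g = lambda c: math.exp(-0.5 * ((mu - c) / 0.8) ** 2)
    return 0.5 * (g(2.4) + g(-2.4)) / (0.8 * math.sqrt(2 * math.pi))

def intersection(lo, hi, tol=1e-9):
    """Bisect for a root of phi1 - phi2, i.e. a Bayes threshold under equal priors."""
    f = lambda m: phi1(m) - phi2(m)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(lo) * f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

threshold = intersection(0.5, 3.0)   # close to 1.558; by symmetry, -threshold is the mirror one
```

The computed value agrees with the approximately \pm 1.558 quoted above; the outer intersections near \pm 7.082 would be found the same way with a wider bracket.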
Figure 2: Mean L_2 distances between network outputs and the target (n = 100). (A) Noisy input u_{3,\sigma_a}. (B) Random polynomials u_{3,\sigma_b}. The networks are arranged according to ascending J_1 or J_2, respectively.
is able to discriminate the two classes without any preprocessing or estimation of the underlying parameter \mu, and which sample size is sufficient to solve this problem with the different performance indexes. First, a random training sample of size 84 (42 from \omega_1, 42 from \omega_2) was generated. Furthermore, in order to investigate the influence of the sample size, 10 subsamples of sizes 12, 20, \ldots, 84 were selected. For each subsample, three GDNNs were trained: one by means of the classical performance index (\lambda = 0), one by means of the performance index 3.13 (first-order regularization) with \lambda = 5 \cdot 10^{-2}, and one by means of the performance index 2.4 (second-order regularization), where \lambda = 25 \cdot 10^{-4}. The common topology for all fully connected GDNNs with three neurons is given by

\dot{y} = \begin{pmatrix} \dot{y}_1 \\ \dot{y}_2 \\ \dot{y}_3 \end{pmatrix}
= \begin{pmatrix} -y_1 + \arctan\left(\sum_{j=1}^{3} w_j y_j\right) \\
-y_2 + \arctan\left(\sum_{j=1}^{3} w_{j+3} y_j\right) \\
-y_3 + \arctan\left(\sum_{j=1}^{3} w_{j+6} y_j\right) + u_\mu \end{pmatrix},

with y(0) = (0, 0, 0)^T and a network evolution time of T = 10. The network input u_\mu is defined with starting conditions u_\mu(0) = 2 and \dot{u}_\mu(0) = 0. The GDNN output z is defined by the first coordinate. The networks were
Figure 3: Class conditional densities. The solid line stands for the class \omega_1 and the dashed one for \omega_2. The dotted lines indicate the optimal thresholds assuming equal a priori probabilities for both classes.
forced to produce an output d_1(t) = 0.1 \cdot t if an input u_\mu with \mu \in \omega_1 was provided, or d_2(t) = -0.1 \cdot t in the case that \mu \in \omega_2 holds. From this construction, the determination of the classifier decision rule is straightforward:

\int_0^{10} (z - d_1)^2 \, dt < \int_0^{10} (z - d_2)^2 \, dt \;\Longrightarrow\; \omega_1,

\int_0^{10} (z - d_1)^2 \, dt > \int_0^{10} (z - d_2)^2 \, dt \;\Longrightarrow\; \omega_2.
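On a sampled network output, this integral decision rule reduces to comparing two discretized L2 distances. A minimal sketch (the function name and the sampling grid are illustrative, not from the text):

```python
import numpy as np

def classify(z, t):
    """Nearest-target decision rule for a sampled GDNN output z(t).
    Returns 1 (class omega_1) or 2 (class omega_2), whichever ramp target
    +-0.1*t the output is closer to in the discretized L2 sense."""
    dt = t[1] - t[0]
    e1 = np.sum((z - 0.1 * t) ** 2) * dt   # distance to d1(t) =  0.1 t
    e2 = np.sum((z + 0.1 * t) ** 2) * dt   # distance to d2(t) = -0.1 t
    return 1 if e1 < e2 else 2

t = np.linspace(0.0, 10.0, 1001)
assert classify(0.08 * t, t) == 1          # output rising toward d1 -> omega_1
assert classify(-0.02 * t, t) == 2         # output drifting negative -> omega_2
```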
The estimated classifier errors for all GDNNs are depicted in Figure 4. It can be stated that even a small sample size of n = 36 is sufficient for the construction of a GDNN-based classifier, independent of the error functional used for the network training. Smaller sample sizes lead to the well-known overfitting effect when the classical performance index is used. In contrast, a smoothing of the network mapping can reduce the overfitting and results in high-performance classifiers for a sample size of 28. For smaller samples, the regularization cannot eliminate the overfitting, but it strongly reduces it and, for this reason, the classifier errors too. The higher value for n = 20
Figure 4: Estimated classifier errors for all trained GDNNs. The dotted lines indicate the Bayes error (\approx 0.1702) estimated directly from \varphi_1 and \varphi_2 for equal a priori probabilities.
is most probably attributable to the local convergence of the training procedure, although the second-order regularization term seems to be the most capable tool for small samples. For sufficiently large sample sizes, there is no need to work with penalty terms, subject to the condition that an appropriate network topology was chosen. However, the regularization does not influence the classifiers negatively.

4.3 Real-Life Example. The last example is a classification task on gait analysis data of children with cerebral palsy, a neuromuscular disorder that occurs in about 3 of 1000 live births. A permanent but not progressive brain lesion is frequently associated with spasticity and deficiencies in voluntary movement control. An excessive plantar flexion at the ankle joint is frequently observed in ambulatory children with cerebral palsy. The deformity results in an awkward gait and interferes with balance during standing and walking. Whenever this excessive flexion becomes rigid, it represents an indication for orthopedic surgery to lengthen the calf musculature or its tendon. To date, the rigidity of the calf muscles has had to be tested under general anesthesia, because spasticity and increased muscle tone in
patients with cerebral palsy make physical examination difficult and any clinical test unreliable. While treatment options for the two types of deformity have been studied extensively, the distinction between dynamic tightness and a fixed contracture remains a clinical challenge. To date, the evaluation of ankle dorsiflexion under general anesthesia and muscle relaxation still represents the most accurate but most often infeasible clinical method to assess the quality of this deformity. Instrumented gait analysis has contributed considerably to the understanding of gait dysfunction in cerebral palsy. The technique has also proven to be a valuable tool to objectively document gait changes after surgical interventions at the calf for gait improvement. However, as a diagnostic tool for discriminating between dynamic tightness and fixed contracture of the triceps surae, instrumented gait analysis is not yet established. Instrumented gait analysis provides a tool to assess human walking three-dimensionally and to document gait abnormalities in children with spastic disorders. The system uses a video-based six-camera motion-capturing device and up to 18 reflective markers that are placed at various landmarks of the lower limbs. These markers are then tracked in space to provide motion paths of the involved limb segments during walking. Finally, kinematic segment parameters and angle changes at joints connecting individual segments can be computed. Data acquisition consists of recording at least five valid walking cycles while patients are asked to walk at a self-selected speed over an 8-meter walkway. The aim of this example was to investigate whether GDNNs are able to differentiate between rigid (the first class) and flexible (the second class) plantar flexion deformity in children with spastic cerebral palsy.
The signals used as external inputs were given by the time courses of the ankle angle u_{aa}, ankle moment u_{am}, knee angle u_{ka}, and knee moment u_{km} (see Figure 5) during the walking of 56 children (22 rigid, 34 flexible). The corresponding GDNN is defined by

\dot{y} = \begin{pmatrix} \dot{y}_1 \\ \dot{y}_2 \\ \dot{y}_3 \\ \dot{y}_4 \\ \dot{y}_5 \end{pmatrix}
= \begin{pmatrix}
-y_1 + \arctan\left(\sum_{j=1}^{5} w_j y_j\right) \\
-y_2 + \arctan\left(\sum_{j=1}^{5} w_{j+5} y_j\right) + u_{aa}/20 \\
-y_3 + \arctan\left(\sum_{j=1}^{5} w_{j+10} y_j\right) + u_{am} \\
-y_4 + \arctan\left(\sum_{j=1}^{5} w_{j+15} y_j\right) + u_{ka}/40 \\
-y_5 + \arctan\left(\sum_{j=1}^{5} w_{j+20} y_j\right) + u_{km}
\end{pmatrix},

with y(0) = (0, 0, 0, 0, 0)^T, a network evolution time of T = 1, and target functions d_p(t) \in \{0.5 \cdot t, -0.5 \cdot t\}, where d_p(t) = 0.5 \cdot t was taken for the first ("rigid") class and d_p(t) = -0.5 \cdot t for the second ("flexible") one. The GDNN output z is defined by the first coordinate, z = y_1. From a practical point of view, it is suitable to discriminate between the neurons provided with external inputs and the neurons defining the network output. Consequently, the network topology used in this example was chosen to be as small as possible (four input neurons and one output). The time
Figure 5: The patterns belonging to the (A) first and (B) second class, respectively.
Table 1: Performance Indexes for Different Network Trainings.

GDNN Type   Performance Index                               Constraint
0           \sum_{p=1}^{P} \int_0^1 \|z_p - d_p\|^2 \, dt   (none)
1           Equation 3.18                                   Equation 3.20, \epsilon = \frac{5}{2} \cdot P \cdot 10^{-3}
2           Equation 2.8                                    Equation 3.20, \epsilon = \frac{5}{2} \cdot P \cdot 10^{-3}
courses of the angles and moments were of different scales. For technical reasons, the angles had to be scaled by 20 and 40, respectively. In order to assess the generalization capabilities, the complete data set was randomly split into training sets of sizes 23, 18, and 13 and corresponding independent test sets of sizes 33, 38, and 43. For each training set, three GDNNs were trained according to the performance indexes given in Table 1. Table 2 depicts the classifier performances in dependence on the different training sets and the various performance indexes used for the network training.

Table 2: Estimated Classification Rates for All Training Modes.

            Training Set Size
GDNN Type   23               18               13
0           81.8% (n = 33)   84.2% (n = 38)   69.8% (n = 43)
1           84.8% (n = 33)   86.8% (n = 38)   83.7% (n = 43)
2           84.8% (n = 33)   86.8% (n = 38)   81.4% (n = 43)

Similar to the previous example, the smoothing of the network mapping improves the generalization capabilities. Furthermore, the method proposed in section 3 performed better than the classical training algorithm in all cases. As in the first example, the smoothing by means of the first-order term is more favorable due to its better cost-value ratio. For comparison, the clinical examination without anesthesia results in a classification rate of 39.3%, and a subjective chart analysis of gait analysis data by an experienced clinical expert yielded 80.4% correct classifications. We have seen that the smoothing of the network mapping can improve the generalization capabilities of GDNNs. Thereby, it is irrelevant whether the regularization technique or the constraint learning is applied. However, from a practical point of view, it is considerably easier to fix the bound \epsilon than the regularization constant \lambda. The reason is that one has to know the orders of magnitude of the error term J_0 and the regularization term J_1 or J_2 in order to find an appropriate compromise between learning quality and network mapping smoothing. In the case of constraint learning, only information about J_0 is necessary. For example, this information can easily be extracted from a few classical training runs.

4.4 Time-Series Prediction. The purpose of this section is to show, on a chosen test bed of time-series prediction problems (sunspot prediction and chemical process control), the performance of GDNNs trained with the algorithm proposed in section 3.

4.4.1 Wolfe's Sunspot Data. A data set that has been frequently (probably most often) investigated for time-series prediction is the set of Wolfe's sunspot data. The time series consists of the average number of sunspots observed per year from 1700 to the present. In order to maintain comparability with other approaches, samples recorded until 1920 were used for training, and data from the years 1921–1955 (test set 1) and 1956–1979 (test set 2) were used for testing.
GDNN-based predictors were constructed by fully connected GDNNs consisting of six neurons with the common neural activation function \alpha(z) = (1 + e^{-z})^{-1} and constant weights. Let \kappa_{SP} be the sunspot time series. Then
Table 3: Root-Mean-Squared Errors for Different Training Performance Indexes.

                            Test Set 1   Test Set 2
Classical training          0.0688       0.1404
Smoothed network mapping    0.0642       0.1292
With regularization         0.0774       0.1518
the network inputs for the prediction of \kappa_{SP}(t) were defined by

u(t) = \left( \kappa_{SP}(t-1), \kappa_{SP}(t-2), \kappa_{SP}(t-4), \kappa_{SP}(t-6), \kappa_{SP}(t-8) \right)^T / 95.1 - 1,

and the desired network output was given by d(t) = \kappa_{SP}(t) / 190.2. In order to investigate the influence of the constraint learning, three performance indexes were used for the network training: the classical one, the first-order term (see equation 3.18) for smoothing the network mapping, and the first-order regularization term according to equation 3.13 with \lambda = 10^{-3}. Table 3 depicts the prediction error as a root-mean-squared error between the GDNN-based prediction and Wolfe's sunspot data. As can be seen from Table 3, regularization with the penalty term 3.13 worsens the performance of the network as compared to minimization of the pure sensitivity term 3.18 with constraint 2.12 (we ran the method presented in this article with other values of \lambda, but they did not yield better results). Moreover, our results are comparable with the best results known from the literature (Rementeria & Olabe, 2000; Wan, 1997; Azoff, 1993).
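The lagged input construction and the error measure of Table 3 can be sketched as follows; `make_dataset` and `rmse` are hypothetical helper names, and the toy series in the usage line stands in for the actual yearly sunspot counts.

```python
import numpy as np

def make_dataset(series):
    """Build input/target pairs for the sunspot predictor described above.
    series : 1-D array of yearly sunspot numbers, series[t] per consecutive year.
    Inputs are the lagged values at t-1, t-2, t-4, t-6, t-8, scaled by 95.1 and
    shifted; targets are the current value scaled by 190.2."""
    lags = (1, 2, 4, 6, 8)
    u, d = [], []
    for t in range(max(lags), len(series)):
        u.append([series[t - l] for l in lags])
        d.append(series[t])
    return np.asarray(u) / 95.1 - 1.0, np.asarray(d) / 190.2

def rmse(pred, target):
    """Root-mean-squared error, the score reported in Table 3."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(target)) ** 2)))

u_train, d_train = make_dataset(np.arange(20.0))  # toy series, just to show the shapes
```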
and the desired network output was given by d.t/ D ·SP .t/=190:2: In order to investigate the inuence of the constraint learning, three performance indexes were used for the network training: the classical one, the rst-order term (see equation 3.18) for smoothing the network mapping, and the rst-order regularization term according to equation 3.13 with ¸ D 10¡3 . Table 3 depicts the prediction error as a root-mean-squared error between the GDNN-based prediction and Wolfe’s sunspot data. As is seen from Table 3, a regularization with the penalty term 3.13 worsens the performance of the network as compared to minimization of pure sensitivity term 3.18 with constraint 2.12 (we ran the method presented in this article with other values for ¸, but they did not yield better results). Moreover, our results are comparable with the best results known from the literature (Rementeria & Olabe, 2000; Wan, 1997; Azoff, 1993). 4.4.2 Continuous Stirred Tank Reactor. The next example is concerned with a chemical process modeling problem: the modeling of pH in a continuous stirred tank reactor (CSTR) (Bhat & McAvoy, 1990). The considered CSTR has two input streams—one containing sodium hydroxide (NaOH) and the other one acetic acid (HAC). Every stream has a ow rate Fk and a concentration Ck ; k D 1; 2, where the index 1 stands for HAC, and index 2 points to the values of NaOH. Let ³1 be the notation for the total acetate concentration [HAC + AC¡ ], and let ³2 be the concentration of NaC . Assuming that the acid-base equilibrum and electroneutrality relationships hold, the total acetate balance and the sodium balance are given by Bhat and McAvoy (1992): V ¢ ³P1 D F1 C1 ¡ .F1 C F2 / ¢ ³1
Table 4: Parameters and Initial Values for CSTR.

Variable     Meaning                          Value
V            Tank volume                      1000 liter
F_1          Flow rate of acid                81 l/min
F_2          Flow rate of base                515 l/min
C_1          Concentration of acid in F_1     0.32 mole/liter
C_2          Concentration of base in F_2     0.05 mole/liter
K_a          Acid equilibrium constant        1.76 \cdot 10^{-5}
K_w          Water equilibrium constant       1.00 \cdot 10^{-14}
\zeta_1(0)   Initial concentration of acid    0.0435 mole/liter
\zeta_2(0)   Initial concentration of base    0.0432 mole/liter
and

V \dot{\zeta}_2 = F_2 C_2 - (F_1 + F_2) \zeta_2.

The pH is calculated from the previous equations via

[H^+]^3 + (K_a + \zeta_2) [H^+]^2 + (K_a \zeta_2 - K_a \zeta_1 - K_w) [H^+] - K_a K_w = 0

and pH = -\log [H^+], with the process parameters given in Table 4 (steady-state operating point; Bhat & McAvoy, 1990). The objective is the prediction of pH in dependence on changes in the flow rate of the sodium stream F_2. For comparison, the same parameters as in Sridhar, Bartlett, and Seagrave (1998) were selected. The sampling time was 0.2 min, and the training set was designed over 1 hour with uniformly distributed random flow rates F_2 within 10% of the steady-state operating point. Three further hours were used for testing. Two GDNN-based predictors were constructed by fully connected GDNNs consisting of 20 neurons with the common neural activation function 10 \cdot \arctan and constant weights. Let \kappa_{pH} be the pH time series. Then \kappa_{pH}(t - \Delta t), \kappa_{pH}(t - 2\Delta t), \kappa_{pH}(t - 3\Delta t), \kappa_{pH}(t - 4\Delta t) as well as F_2(t - \Delta t), F_2(t - 2\Delta t), F_2(t - 3\Delta t), F_2(t - 4\Delta t) served as inputs for the prediction of \kappa_{pH}(t), where \Delta t = 0.2. Again, the classical performance index and the first-order term 3.18 for smoothing the network mapping were used for the different network trainings. The classical error criterion results in a mean squared error of 0.030, which equals the value in Sridhar et al. (1998). The constraint training by means of equation 3.18 improved the generalization to a mean squared error of 0.028, a slightly better result than the reference value from Sridhar et al. (1998).
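The charge-balance cubic has exactly one positive root (its coefficient signs change once), so the pH can be computed by a simple bracketed search. A minimal sketch with the constants of Table 4 as defaults; the bracket [0, 1] mole/liter is an assumption that safely contains any physical [H^+]:

```python
import math

def ph_from_balances(zeta1, zeta2, Ka=1.76e-5, Kw=1.00e-14):
    """Solve the cubic
        [H+]^3 + (Ka + z2)[H+]^2 + (Ka*z2 - Ka*z1 - Kw)[H+] - Ka*Kw = 0
    for its single positive root by bisection and return pH = -log10[H+]."""
    f = lambda h: ((h + Ka + zeta2) * h + Ka * (zeta2 - zeta1) - Kw) * h - Ka * Kw
    lo, hi = 0.0, 1.0                        # f(0) = -Ka*Kw < 0 < f(1)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return -math.log10(0.5 * (lo + hi))

# Initial concentrations from Table 4: a slightly acid-rich acetate buffer.
ph0 = ph_from_balances(0.0435, 0.0432)
```

With zeta1 = zeta2 = 0, the cubic factors as (h^2 - Kw)(h + Ka), giving [H^+] = 10^{-7} and pH = 7, which is a convenient sanity check for the solver.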
5 Concluding Remarks

In this work, a new learning algorithm for GDNNs has been presented and tested on nontrivial classification and time-series prediction tasks. The learning procedure was formulated as an optimal control problem with state-dependent constraints. The convergence of the proposed approximation scheme to an optimal solution was also discussed. Galicki et al. (1999) have shown that GDNNs possess many advantages as compared with classical neural networks (constant weights). Nevertheless, the problem of improving the generalization capability by means of the mapping curvature of a dynamic network has not been addressed previously, although it is crucial for the normal functioning (i.e., after learning) of the neural network. The simulation results have unambiguously shown that the application of mapping curvature requires a significantly smaller training set than the sensitivity measure to achieve good generalization. The generality of the presented approach carries a penalty: a somewhat larger number of computations is required to train a GDNN subject to performance index 2.4 or 2.9 (a linear programming problem has to be solved in each iteration in the case of the nontrivial state constraint 2.12). However, in our opinion, the capability of good generalization is the most important property when applying GDNNs to many practical dynamic tasks. Finally, this work is a first attempt at constructing such algorithms using the dynamic network curvature.

6 Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG, WI 1166/4-1, Ga 652/1-1).

References

Abu-Mostafa, Y. S. (1989). The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Comp., 1, 312–317.
Azoff, E. M. (1993). Reducing error in neural network time series forecasting. Neural Computation and Applications, 1, 140–247.
Bhat, N. V., & McAvoy, T. J. (1990).
Use of neural nets for dynamic modelling and control of chemical process systems. Computers and Chemical Engineering, 14, 573–583.
Bhat, N. V., & McAvoy, T. J. (1992). Determining model structure for neural models by network stripping. Computers and Chemical Engineering, 16, 271–281.
Billings, S., & Zheng, G. L. (1995). Radial basis function network configuration using genetic algorithms. Neural Networks, 8, 877–890.
Bishop, Ch. (1991). Improving the generalization properties of radial basis function neural networks. Neural Comp., 3, 579–588.
Bishop, Ch. (1995a). Training with noise is equivalent to Tikhonov regularization. Neural Comp., 7, 108–116.
Bishop, Ch. (1995b). Neural networks for pattern recognition. New York: Oxford University Press.
Cohen, B., Saad, D., & Marom, E. (1997). Efficient training of recurrent neural network with time delays. Neural Networks, 10, 51–59.
Czernichow, T. (1997). A double gradient algorithm to optimize regularization. In Proc. Artificial Neural Networks–ICANN'97 (pp. 289–294). Berlin: Springer-Verlag.
DasGupta, B., & Sontag, E. D. (1996). Sample complexity for learning recurrent perceptron mappings. IEEE Trans. Information Theory, 42, 1479–1487.
Doering, A., Galicki, M., & Witte, H. (1997). Structure optimization of neural networks with the A*-algorithm. IEEE Trans. Neural Networks, 8, 1434–1445.
Fabri, S., & Kadirkamanathan, V. (1996). Dynamic structure of neural networks for stable adaptive control of nonlinear systems. IEEE Trans. Neural Networks, 7, 1151–1166.
Fahlman, S. (1991). The recurrent cascade-correlation architecture. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 190–196). San Mateo, CA: Morgan Kaufmann.
Fedorenko, R. R. (1978). Approximate solutions of optimal control problems. Moscow: Nauka. (In Russian)
Filippov, A. F. (1962). On certain questions in the theory of optimal control. SIAM J. Control, 1, 76–84.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego, CA: Academic Press.
Galicki, M. (1998). The planning of robotic optimal motions in the presence of obstacles. Int. J. Robot. Res., 17, 248–259.
Galicki, M., Leistritz, L., & Witte, H. (1999). Learning continuous trajectories in recurrent neural networks with time-dependent weights. IEEE Trans. Neural Networks, 10, 741–756.
Galicki, M., Leistritz, L., & Witte, H. (2000). Improved learning of multiple continuous trajectories with initial network state. In Proc. 2000 Int.
Joint Conference on Neural Networks, IJCNN'2000 (pp. 15–20). Los Alamitos, CA: IEEE Computer Society.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Comp., 4, 1–58.
Giles, C. L., Chen, D., & Sun, G. Z. (1995). Constructive learning of recurrent neural networks: Limitations of recurrent cascade correlation and a simple solution. IEEE Trans. Neural Networks, 6, 829–836.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Comp., 7, 219–269.
Hammer, B. (1997). Generalization of Elman networks. In Proc. Artificial Neural Networks–ICANN'97 (pp. 409–414). Berlin: Springer-Verlag.
Hansen, L. K., & Rasmussen, C. E. (1994). Pruning from adaptive regularization. Neural Comput., 6, 1223–1232.
Hinton, G. E. (1987). Learning translation invariant recognition in massively parallel networks. In J. W. de Bakker, A. J. Nijman, & P. C. Treleaven (Eds.), Proc.
PARLE Conf. Parallel Architectures and Languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Holden, S. B., & Niranjan, M. (1995). On the practical applicability of VC dimension bounds. Neural Comp., 7, 1265–1288.
Kilian, J., & Siegelmann, H. T. (1996). The dynamic universality of sigmoidal neural networks. Information and Computation, 128, 48–56.
Leistritz, L., Galicki, M., & Witte, H. (1998). Training continuous trajectories by means of dynamic neural networks with time-dependent weights. In Proc. Int. ICSC/IFAC Symp. Neural Networks (pp. 591–596). Vienna, Austria.
Leung, C. S., Tsoi, A. C., & Chan, L. W. (2001). Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Networks, 12, 1314–1332.
Ma, S., & Ji, C. (1998). Fast training of recurrent networks based on the EM algorithm. IEEE Trans. Neural Networks, 9, 11–26.
Morozov, V. A. (1984). Methods for solving incorrectly posed problems. Berlin: Springer-Verlag.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion – determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Pearlmutter, B. A. (1995). Gradient calculation for dynamic recurrent neural networks: A survey. IEEE Trans. Neural Networks, 6, 1212–1228.
Pontryagin, L. S., Boltyansky, V. G., Gamkrelidze, R. V., & Mischenko, E. F. (1961). Mathematical theory of optimal processes. Moscow: Nauka. (In Russian)
Rementeria, S., & Olabe, X. B. (2000). Predicting sunspots with a self-configuring neural system. In Proc. of the 8th Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2000). Madrid, Spain.
Sakawa, Y., & Shindo, Y. (1980). On global convergence of an algorithm for optimal control. IEEE Trans. Automatic Control, 25, 1149–1153.
Simard, P., Victorri, B., Le Cun, Y., & Denker, J. (1991). Tangent Prop – A formalism for specifying selected invariances in an adaptive network.
In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 895–903). San Mateo, CA: Morgan Kaufmann.
Sridhar, D. V., Bartlett, E. B., & Seagrave, R. C. (1998). Information theoretic subset selection for neural network models. Computers and Chemical Engineering, 22, 613–626.
Srochko, V. A. (1986). Application of maximum principle for numerical solution of optimal control problems with terminal state constraints. Kibernetika, 1, 73–77. (In Russian)
Strend, S., & Balden, J. (1990). A comparison of constrained optimal control algorithms. In V. Utkin & V. Jaaksoo (Eds.), Prep. 11th IFAC World Congress (pp. 192–198). Tallinn, Estonia: Technical University.
Sundareshan, M., & Condarcure, T. (1998). Recurrent neural networks training by a learning automaton approach for trajectory learning and control system design. IEEE Trans. Neural Networks, 9, 354–368.
Taniguchi, M., & Tresp, V. (1997). Combining regularized networks. In Proc. Artificial Neural Networks–ICANN'97 (pp. 349–354). Berlin: Springer-Verlag.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: W. H. Winston.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability Applicat., 16, 264–280.
Wan, E. A. (1997). Combining fossil and sunspot data: Committee predictions. In Proc. of the 1997 International Conference on Neural Networks (ICNN '97). Berlin: Springer-Verlag.
Wu, L., & Moody, J. (1996). A smoothing regularizer for feedforward and recurrent neural networks. Neural Comput., 8, 461–489.

Received November 26, 2002; accepted November 3, 2003.
LETTER
Communicated by Todd Leen
A Modified Algorithm for Generalized Discriminant Analysis

Wenming Zheng
wenming [email protected]
Li Zhao [email protected]
Cairong Zou
[email protected]

Engineering Research Center of Information Processing and Application, Southeast University, Nanjing, Jiangsu 210096, People's Republic of China
Generalized discriminant analysis (GDA) is an extension of classical linear discriminant analysis (LDA) from the linear domain to a nonlinear domain via the kernel trick. However, in the previous algorithm of GDA, the solutions may suffer from the degenerate eigenvalue problem (i.e., several eigenvectors with the same eigenvalue), which makes them not optimal in terms of discriminant ability. In this letter, we propose a modified algorithm for GDA (MGDA) to solve this problem. The MGDA method aims to remove the degeneracy of GDA and find the optimal discriminant solutions, which maximize the between-class scatter in the subspace spanned by the degenerate eigenvectors of GDA. Theoretical analysis and experimental results on the ORL face database show that the MGDA method achieves better performance than the GDA method.

1 Introduction

Linear discriminant analysis (LDA) (Duda & Hart, 1973) is a well-known feature extraction method in pattern recognition. It finds the set of optimal vectors that map the high-dimensional samples onto a low-dimensional feature space, where the ratio of the between-class scatter to the within-class scatter of the projected samples is maximized and the projected samples are well separated. Although the LDA method works for linear problems, it fails in nonlinear cases. Baudat and Anouar (2000) extended the LDA method from the linear to the nonlinear domain using the kernel trick (Vapnik, 1995; Schölkopf, Smola, & Müller, 1998) and then presented the generalized discriminant analysis (GDA) method. GDA aims to find the optimal discriminant nonlinear features for the training samples when they are not linearly separable. This is implemented by mapping the input space into a high-dimensional (or even infinite-dimensional) feature space using a nonlinear kernel function and performing the feature extraction using LDA in this feature space, thus producing nonlinear features in the input space.
Neural Computation 16, 1283–1297 (2004) © 2004 Massachusetts Institute of Technology
1284
W. Zheng, L. Zhao, and C. Zou
However, using the Baudat–Anouar algorithm (Baudat & Anouar, 2000), one may face a degenerate eigenvalue problem. This occurs especially in the case of the so-called small sample size (SSS) problem (Chen, Liao, Ko, Lin, & Yu, 2000), where the most discriminant eigenvectors of GDA correspond to the same eigenvalue (i.e., degeneracy of eigenvectors; Schiff, 1968). In this letter, we modify the Baudat–Anouar algorithm and propose a robust and efficient algorithm to overcome the degenerate eigenvalue problem. The proposed algorithm aims to find the discriminant eigenvectors that maximize the between-class scatter in the subspace spanned by the degenerate eigenvectors of GDA and thus removes the degeneracy of the solutions. In the next section, we introduce a theorem used for this purpose. Then we review the GDA method. In section 3, we propose the MGDA method and develop the formulation. Section 4 is devoted to the experiments on the ORL face database. The discussion and conclusion are given in the last section.

2 Related Work

2.1 Related Theorems. Suppose that $S_B$, $S_W$, and $S_T$ are the between-class scatter matrix, the within-class scatter matrix, and the total scatter matrix of the training samples, respectively. Let $I$ be the identity matrix.

Theorem 1 (Duchene & Leclercq, 1988). Let $\varphi_1, \ldots, \varphi_r$ be the first $r$ discriminant eigenvectors of the Foley–Sammon optimal set of discriminant vectors (FSODV; Foley & Sammon, 1975). Then the $(r+1)$th discriminant direction $\varphi_{r+1}$ of FSODV is the eigenvector corresponding to the maximum eigenvalue of the eigenequation $P S_B \varphi = \lambda S_W \varphi$, where

$$P = I - D^T (D S_W^{-1} D^T)^{-1} D S_W^{-1},$$
$$D = [\varphi_1, \varphi_2, \ldots, \varphi_r]^T.$$
Theorem 2 (Jin, Yang, Hu, & Lou, 2001). Suppose that $\varphi_1, \ldots, \varphi_r$ are the first $r$ discriminant eigenvectors of the statistically uncorrelated optimal discriminant vectors (UODV; Jin et al., 2001). Then the $(r+1)$th discriminant direction $\varphi_{r+1}$ of UODV is the eigenvector corresponding to the maximum eigenvalue of the eigenequation $P S_B \varphi = \lambda S_W \varphi$, where

$$P = I - S_T D^T (D S_T S_W^{-1} S_T D^T)^{-1} D S_T S_W^{-1},$$
$$D = [\varphi_1, \varphi_2, \ldots, \varphi_r]^T.$$
In fact, theorems 1 and 2 can be extended to a more general form, which we give as theorem 3. (The proof is given in the appendix.)
Modified Algorithm for Generalized Discriminant Analysis
Theorem 3. Suppose that $B$ and $R$ are positive semidefinite matrices and $V$ is a positive definite matrix. The discriminant criterion is defined as

$$F(\varphi) = \frac{\varphi^T B \varphi}{\varphi^T V \varphi}. \tag{2.1}$$

Let $\varphi_1$ be the discriminant eigenvector that maximizes $F(\varphi)$. Suppose that $\varphi_i$ $(i = 1, 2, \ldots, r \geq 1)$ have been obtained. Let $\varphi_{r+1}$ be the $(r+1)$th discriminant eigenvector that maximizes $F(\varphi)$ under the following constraints:

$$\varphi_{r+1}^T R \varphi_i = 0, \quad i = 1, 2, \ldots, r. \tag{2.2}$$

Then $\varphi_{r+1}$ is the eigenvector corresponding to the largest eigenvalue of the following eigenequation:

$$P B \varphi = \lambda V \varphi, \tag{2.3}$$

where

$$P = I - R D^T (D R V^{-1} R D^T)^{-1} D R V^{-1}, \tag{2.4}$$
$$D = [\varphi_1, \varphi_2, \ldots, \varphi_r]^T. \tag{2.5}$$
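As a numerical illustration of theorem 3 (our own sketch, with a hypothetical helper name, not code from this letter), the constrained maximizer can be computed directly from equations 2.3 to 2.5:

```python
# Sketch: next constrained discriminant direction via P B phi = lambda V phi.
import numpy as np
from scipy.linalg import eig

def next_direction(B, V, R, prev):
    """B, R: positive semidefinite; V: positive definite;
    prev: list of previously found directions phi_1, ..., phi_r."""
    n = B.shape[0]
    if prev:
        D = np.vstack(prev)                            # rows are phi_i^T (eq. 2.5)
        Vinv = np.linalg.inv(V)
        mid = np.linalg.inv(D @ R @ Vinv @ R @ D.T)
        P = np.eye(n) - R @ D.T @ mid @ D @ R @ Vinv   # eq. 2.4
    else:
        P = np.eye(n)                                  # no constraints yet
    evals, evecs = eig(P @ B, V)                       # generalized eigenproblem, eq. 2.3
    phi = evecs[:, np.argmax(evals.real)].real
    return phi / np.linalg.norm(phi)
```

One can check numerically that, for any nonzero eigenvalue, a returned direction automatically satisfies the constraints of equation 2.2 against all previous directions.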
2.2 GDA Formulation in Kernel Space. Suppose that $X$ is an $n$-dimensional sample set with $N$ elements. Let $X_l$ denote the $l$th class subset of $X$, so that $X = \cup_{l=1}^{c} X_l$, where $c$ is the number of classes. The cardinality of the subset $X_l$ is denoted by $N_l$; thus, we have $\sum_{l=1}^{c} N_l = N$. Let $X$ be mapped into a Hilbert space $F$ through a nonlinear mapping function $\Phi$:

$$\Phi: X \to F, \quad x \mapsto \Phi(x). \tag{2.6}$$
The between-class scatter matrix $S_B^\Phi$, the within-class scatter matrix $S_W^\Phi$, and the total scatter matrix $S_T^\Phi$ in $F$ are given as follows:

$$S_B^\Phi = \sum_{i=1}^{c} N_i (u_i^\Phi - u^\Phi)(u_i^\Phi - u^\Phi)^T, \tag{2.7}$$
$$S_W^\Phi = \sum_{i=1}^{c} \sum_{j=1}^{N_i} (\Phi(x_i^j) - u_i^\Phi)(\Phi(x_i^j) - u_i^\Phi)^T, \tag{2.8}$$
$$S_T^\Phi = \sum_{i=1}^{c} \sum_{j=1}^{N_i} (\Phi(x_i^j) - u^\Phi)(\Phi(x_i^j) - u^\Phi)^T, \tag{2.9}$$
where $x^T$ represents the transpose of the vector $x$, $\Phi(x_i^j)$ is the $j$th sample of the $i$th class, $u_i^\Phi$ is the mean of the $i$th class samples, and $u^\Phi$ is the mean of all samples in $F$:

$$u_i^\Phi = \frac{1}{N_i} \sum_{j=1}^{N_i} \Phi(x_i^j), \qquad u^\Phi = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} \Phi(x_i^j). \tag{2.10}$$
GDA aims to find the eigenvalues $\lambda$ and eigenvectors $\omega$ that are solutions of the equation

$$S_B^\Phi \omega = \lambda S_T^\Phi \omega. \tag{2.11}$$

The largest eigenvalue of equation 2.11 gives the maximum of the following quotient of the inertia:

$$\lambda = \frac{\omega^T S_B^\Phi \omega}{\omega^T S_T^\Phi \omega}. \tag{2.12}$$
Because the eigenvectors are linear combinations of elements of $F$, there exist coefficients $\alpha_{pq}$ $(p = 1, \ldots, c;\ q = 1, \ldots, N_p)$ such that

$$\omega = \sum_{p=1}^{c} \sum_{q=1}^{N_p} \alpha_{pq} \bigl(\Phi(x_p^q) - u^\Phi\bigr). \tag{2.13}$$
Assume that a kernel function $k(x_i, x_j)$ can be expressed as a dot product on the Hilbert space $F$:

$$k_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = (\Phi(x_i))^T \Phi(x_j), \tag{2.14}$$

where $\langle \Phi(x_i), \Phi(x_j) \rangle$ stands for the dot product of $\Phi(x_i)$ and $\Phi(x_j)$. For given classes $p$ and $q$, this kernel function can be expressed as

$$(k_{ij})_{pq} = (\Phi(x_p^i))^T \Phi(x_q^j). \tag{2.15}$$
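Equation 2.14 is what makes the method practical: the kernel values are computed directly, without ever forming $\Phi$. A brief sketch (our own helper, not the authors' code) using the polynomial kernel employed later in section 4:

```python
# Sketch: Gram matrix k_ij = (<x_i, x_j>)^d of equation 2.14, computed
# from the kernel alone, never from the explicit feature map Phi.
import numpy as np

def polynomial_gram(X, d=2):
    """X: (N, n) data matrix; returns the N x N kernel matrix."""
    return (X @ X.T) ** d
```

For $d = 2$ and two-dimensional inputs, the implicit map is $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and dot products of these explicit features reproduce the Gram matrix exactly.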
Let $W = (W_l)_{l=1,\ldots,c}$ be an $N \times N$ block diagonal matrix, where $W_l$ is an $N_l \times N_l$ matrix with all terms equal to $1/N_l$. Let $M = (m_{ij})_{i=1,\ldots,N;\ j=1,\ldots,N}$ be an $N \times N$ matrix with all terms equal to $1/N$, and let

$$\Phi(X) = [\Phi(x_1^1) \cdots \Phi(x_1^{N_1}) \cdots \Phi(x_c^1) \cdots \Phi(x_c^{N_c})]. \tag{2.16}$$
Then equation 2.13 can be expressed as

$$\omega = \Phi(X)(I - M)\alpha, \tag{2.17}$$

where $\alpha = [\alpha_{11} \cdots \alpha_{1N_1} \cdots \alpha_{c1} \cdots \alpha_{cN_c}]^T$. Let $K$ be the $N \times N$ matrix defined blockwise on the class elements:

$$K = (K_{pq})_{p=1,\ldots,c;\ q=1,\ldots,c}, \tag{2.18}$$

where $K_{pq}$ is an $N_p \times N_q$ matrix in the feature space $F$:

$$K_{pq} = (k_{ij})_{pq}, \quad i = 1, \ldots, N_p;\ j = 1, \ldots, N_q. \tag{2.19}$$
Using the notation above, equations 2.11, 2.12, and 2.18 are equivalent to the following expressions, respectively:

$$B\alpha = \lambda T \alpha, \tag{2.20}$$
$$\lambda = \frac{\alpha^T B \alpha}{\alpha^T T \alpha}, \tag{2.21}$$
$$K = (\Phi(X))^T \Phi(X), \tag{2.22}$$

where

$$B = (I - M)^T K (W - M)(W - M)^T K (I - M), \tag{2.23}$$
$$T = (I - M)^T K (I - M)(I - M)^T K (I - M). \tag{2.24}$$
Baudat and Anouar (2000) give a method to solve eigenequation 2.20 using an eigenvector decomposition (for more details on the algorithm, see Baudat & Anouar, 2000). Although that method solves for the eigenvectors and the corresponding eigenvalues of eigenequation 2.20 very well, it has a weakness that has not yet been overcome: the degeneracy of the eigenvectors (i.e., several eigenvectors with the same eigenvalue). This problem occurs in many cases, such as the small sample size problem. Denote the null space of $S_W^\Phi$ by $S_W^\Phi(0)$. In the small sample size case, suppose the solution $\omega \in S_W^\Phi(0)$. Then we have $S_W^\Phi \omega = 0$. Note that $S_T^\Phi = S_W^\Phi + S_B^\Phi$; thus, we have

$$\omega^T S_T^\Phi \omega = \omega^T S_W^\Phi \omega + \omega^T S_B^\Phi \omega = \omega^T S_B^\Phi \omega. \tag{2.25}$$

From equations 2.17 and 2.25, we have

$$\alpha^T T \alpha = \alpha^T B \alpha \;\Rightarrow\; \frac{\alpha^T B \alpha}{\alpha^T T \alpha} = 1. \tag{2.26}$$

From equation 2.26, we obtain that all the eigenvectors in the null space of the within-class scatter matrix share the same maximal eigenvalue ($= 1$).
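The argument of equations 2.25 and 2.26 can be checked numerically. The linear (input-space) analogue below is our own illustration: with fewer samples than dimensions, every direction in the null space of the within-class scatter attains the same maximal Rayleigh quotient of 1.

```python
# Illustration: under the small sample size condition, all directions in
# null(S_W) share the Rayleigh quotient w^T S_B w / w^T S_T w = 1.
import numpy as np

rng = np.random.default_rng(0)
n, c, per = 10, 3, 2                  # dimension 10, 3 classes, 2 samples each
X = rng.normal(size=(c * per, n))     # N = 6 << n: small sample size
y = np.repeat(np.arange(c), per)

mu = X.mean(axis=0)
S_W = np.zeros((n, n)); S_B = np.zeros((n, n))
for k in range(c):
    Xk = X[y == k]; mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)
    d = (mk - mu)[:, None]
    S_B += per * (d @ d.T)
S_T = S_B + S_W

# Orthonormal basis of null(S_W) via SVD; all these directions give ratio 1
U, s, _ = np.linalg.svd(S_W)
null = U[:, s < 1e-10]
ratios = [(w @ S_B @ w) / (w @ S_T @ w) for w in null.T]
```

Here the null space has dimension $n - (N - c) = 10 - 3 = 7$, and every basis vector yields a quotient of 1: the degenerate solutions the MGDA method is designed to disambiguate.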
3 Modified Algorithm for GDA

The GDA method provides an efficient technique to calculate the discriminant eigenvectors in the feature space $F$. However, in many cases, such as the small sample size problem, some of the eigenvectors solved by GDA may be degenerate. Suppose that $\alpha_k^{(1)}, \ldots, \alpha_k^{(t)}$ $(t > 1)$ are the degenerate eigenvectors of eigenequation 2.20 sharing the same eigenvalue $\lambda_k$, that is,

$$B \alpha_k^{(i)} = \lambda_k T \alpha_k^{(i)}, \quad i = 1, \ldots, t. \tag{3.1}$$

From equations 2.11 and 2.17, equation 3.1 is equivalent to the expression

$$S_B^\Phi \omega_k^{(i)} = \lambda_k S_T^\Phi \omega_k^{(i)}, \quad i = 1, \ldots, t, \tag{3.2}$$

where

$$\omega_k^{(i)} = \Phi(X)(I - M)\alpha_k^{(i)}, \quad i = 1, \ldots, t. \tag{3.3}$$
Thus, $\omega_k^{(i)}$ $(i = 1, \ldots, t)$ are eigenvectors corresponding to the same eigenvalue $\lambda_k$ of eigenequation 2.11. Let $\Omega$ denote the subspace spanned by the eigenvectors $\omega_k^{(i)}$ $(i = 1, \ldots, t)$. Then any vector in $\Omega$ is an eigenvector of eigenequation 2.11 corresponding to the eigenvalue $\lambda_k$. Thus, the solutions of GDA may be unstable with respect to changes in the training data (model variance) and are probably not optimal in terms of discriminant ability. The MGDA method overcomes this problem by restricting attention to the subspace $\Omega$ in order to find the eigenvectors with the best discriminant ability. Suppose that $\tilde\omega_k^{(i)} \in \Omega$ $(i = 1, \ldots, t)$ are the eigenvectors of eigenequation 2.11 with the best discriminant ability, where

$$\tilde\omega_k^{(i)} = \sum_{j=1}^{t} \gamma_{ij}\, \omega_k^{(j)}, \quad i = 1, \ldots, t. \tag{3.4}$$

Let $A = [\alpha_k^{(1)} \cdots \alpha_k^{(t)}]$ and $\gamma_i = [\gamma_{i1} \cdots \gamma_{it}]^T$. Then equation 3.4 can be rewritten as

$$\tilde\omega_k^{(i)} = \Phi(X)(I - M)A\gamma_i, \quad i = 1, \ldots, t. \tag{3.5}$$

According to the physical meaning of discriminant analysis (Fukunaga, 1990), the projections of the training samples onto the solutions $\tilde\omega_k^{(i)}$ $(i = 1, \ldots, t)$ should have maximal between-class scatter in order to achieve better discriminant ability. To do this, we should define new discriminant
criteria to replace the previous discriminant criterion (see equation 2.12). Such a criterion can be expressed as

$$J_1(\omega) = \frac{\omega^T S_B^\Phi \omega}{\omega^T \omega}, \quad \omega \in \Omega. \tag{3.6}$$

Using the new criterion, the discriminant eigenvectors are generated as follows. The first eigenvector $\tilde\omega_k^{(1)}$ is the one that maximizes $J_1(\omega)$ in $\Omega$. Suppose that the first $r$ discriminant eigenvectors $\tilde\omega_k^{(i)} = \Phi(X)(I - M)A\gamma_i$ $(i = 1, 2, \ldots, r < t)$ have been obtained. Then the $(r+1)$th eigenvector $\tilde\omega_k^{(r+1)}$ is the one that maximizes $J_1(\omega)$ in $\Omega$ under the following orthogonality constraints:

$$(\tilde\omega_k^{(r+1)})^T \tilde\omega_k^{(i)} = 0, \quad i = 1, 2, \ldots, r. \tag{3.7}$$
From equation 3.5, finding $\tilde\omega_k^{(1)}$ is equivalent to finding the coefficient vector $\gamma_1$ that maximizes $J_2(\gamma)$, where

$$J_2(\gamma) = \frac{\gamma^T \tilde B \gamma}{\gamma^T R \gamma}, \tag{3.8}$$
$$\tilde B = A^T B A, \tag{3.9}$$
$$R = A^T (I - M)^T K (I - M) A. \tag{3.10}$$

From discriminant analysis (Duda & Hart, 1973; Fukunaga, 1990), $\gamma_1$ is the eigenvector of the following generalized eigenequation corresponding to the largest eigenvalue:

$$\tilde B \gamma = \lambda R \gamma. \tag{3.11}$$
By the same method, finding the $(r+1)$th eigenvector $\tilde\omega_k^{(r+1)}$ that maximizes $J_1(\omega)$ in $\Omega$ under the orthogonality constraints of equation 3.7 is equivalent to finding the coefficient vector $\gamma_{r+1}$ that maximizes $J_2(\gamma)$ under the following constraints:

$$\gamma_{r+1}^T R \gamma_i = 0, \quad i = 1, 2, \ldots, r. \tag{3.12}$$

From theorem 3, we obtain that $\gamma_{r+1}$ is the eigenvector corresponding to the largest eigenvalue of the following eigenequation:

$$P \tilde B \gamma = \lambda R \gamma, \tag{3.13}$$
where

$$P = I - R D^T (D R D^T)^{-1} D, \tag{3.14}$$
$$D = [\gamma_1, \gamma_2, \ldots, \gamma_r]^T. \tag{3.15}$$

The coefficient $\gamma_i$ $(i = 1, \ldots, t)$ is normalized by requiring that the corresponding eigenvector $\tilde\omega_k^{(i)}$ $(i = 1, \ldots, t)$ be normalized in $F$, that is,

$$(\tilde\omega_k^{(i)})^T \tilde\omega_k^{(i)} = 1. \tag{3.16}$$

Using equations 2.22 and 3.5, we have

$$\gamma_i^T A^T (I - M)^T K (I - M) A \gamma_i = 1. \tag{3.17}$$
Thus, the coefficient $\gamma_i$ is divided by $\sqrt{\gamma_i^T A^T (I - M)^T K (I - M) A \gamma_i}$ in order to obtain the normalized eigenvector $\tilde\omega_k^{(i)}$.

The MGDA procedure can be summarized in the following steps:

1. Compute the discriminant vectors using the Baudat–Anouar algorithm. If no degeneracy occurs, the algorithm finishes and the solutions of MGDA equal those of GDA; otherwise, go to step 2.
2. Select the discriminant vectors $\alpha_k^{(i)}$ $(i = 1, \ldots, t)$ corresponding to the same eigenvalue $\lambda_k$, where $t$ is the number of degenerate vectors.
3. Compute the matrices $\tilde B$ and $R$ (see equations 3.9 and 3.10).
4. Compute the eigenvector $\gamma_1$ from eigenequation 3.11.
5. Compute the eigenvectors $\gamma_i$ $(i = 2, \ldots, t)$ from eigenequation 3.13.
6. Compute the eigenvectors $\tilde\omega_k^{(i)}$ from $\gamma_i$ $(i = 1, \ldots, t)$ (see equation 3.5) and normalize them (see equation 3.17).

4 Experiments

We test the MGDA method on the Olivetti Research Lab (ORL) face database in Cambridge (http://www.cam-orl.co.uk/facedatabase.html). The ORL database contains 40 distinct subjects, each with 10 different images taken at different times, with the lighting varying slightly. All the images are taken against a dark homogeneous background, and the subjects are in an upright, frontal position, with tolerance for some tilting and rotation. Figure 1 shows the 10 face images of one subject in the database. The original face images are all of size 112 × 92 pixels with 256 gray levels. For each image, we apply a two-level wavelet transform (Chien & Wu, 2002) to obtain a low-pass image of size 28 × 23 pixels. We then normalize the intensity values of the low-pass image with a linear function. After that, the
Figure 1: Ten face images of one subject in the ORL face database.
dimension of the image vector is 644. The mean and standard deviation of the kurtosis of the images are 2.08 and 0.39, respectively. The polynomial and gaussian kernels used in the experiments are defined as follows (Baudat & Anouar, 2000):

Polynomial kernel: $k(x, y) = (\langle x, y \rangle)^d$, where $d$ is the polynomial degree.

Gaussian kernel: $k(x, y) = \exp(-\|x - y\|^2 / \sigma)$, where the parameter $\sigma$ has to be chosen.

4.1 Face Recognition. Two face recognition experiments based on the nearest-neighbor classifier are performed. The first is similar to that done by Yang (2002) and aims to compare the performance of GDA and MGDA with that of other methods in face recognition. We use the leave-one-out strategy for this experiment: to classify a face image, we remove it from the whole image set, and the discriminant vectors are computed using the training set of the remaining 399 images. The test image and the training images are then projected into a reduced space using the computed discriminant vectors of GDA and MGDA, respectively. Table 1 shows the experimental results. We can see from Table 1 that the MGDA method achieves an error rate as low as that of the kernel Fisherface method (1.25%), which is the lowest error rate among the methods reported by Yang (2002). We also see that the MGDA method achieves better performance than the GDA method in this example. The second example also compares the performance of GDA and MGDA in face recognition. The 10 images per subject are randomly partitioned into five training images and five test images, for a total of 200 training images and 200 test images. There is no overlap between the two sets. Two trials are performed by swapping the training and test sets. We treat each test as a Bernoulli random test and the average test error
Table 1: Performance of Various Systems.

Method | Reduced Space | Error Rate (%)
Eigenface (Yang, 2002) | 40 | 2.50 (10/400)
Fisherface (Yang, 2002) | 39 | 1.50 (6/400)
Support vector machine (Yang, 2002) | NA | 3.00 (12/400)
Kernel eigenface (Yang, 2002) | 40 | 2.00 (8/400)
Kernel Fisherface (Yang, 2002) | 39 | 1.25 (5/400)
GDA (polynomial kernel with d = 2) | 39 | 2.25 (9/400)
MGDA (polynomial kernel with d = 2) | 39 | 1.5 (6/400)
GDA (gaussian kernel with σ = 10,000) | 39 | 1.5 (6/400)
MGDA (gaussian kernel with σ = 10,000) | 39 | 1.25 (5/400)
Table 2: Average Test Error Rate for GDA and MGDA.

Method | Average Test Error Rate (%) | Standard Deviation
LDA | 10.0 | 6.0
MLDA | 5.0 | 4.3589
GDA^a | 4.75 | 4.2541
MGDA^a | 3.0 | 3.4117
GDA^b | 4.25 | 4.0345
MGDA^b | 2.75 | 3.2707

a: Polynomial kernel with d = 2 is used. b: Gaussian kernel with σ = 10,000 is used.
rate as the probability of wrong classification over all 400 tests. Table 2 shows the average test error rate and the standard deviation for each method, where MLDA is the particular case of GDA using the polynomial kernel with degree $d = 1$. As can be seen from Table 2, MGDA achieves a lower average test error rate and standard deviation than GDA over all the tests.

4.2 Stability Test. This experiment compares the performance stability of GDA and MGDA. We select five images randomly from each subject as training samples and use the other five images as test images; thus, we have 200 training images and 200 test images. The gaussian kernel with $\sigma = 10{,}000$ and the nearest-neighbor classifier are used in this experiment. Suppose that $\omega_i$ and $\tilde\omega_i$ $(i = 1, \ldots, 39)$ are the 39 discriminant eigenvectors computed by GDA and MGDA, respectively. Let $Z_{GDA} = [\omega_1, \ldots, \omega_{39}]$ and $Z_{MGDA} = [\tilde\omega_1, \ldots, \tilde\omega_{39}]$. We reorder the training images in each subject and then recompute the projection matrices $Z_{GDA}$ and $Z_{MGDA}$. Ten trials are repeated in this experiment. In each trial, we record the values of $\Psi(Z_{GDA})$ and $\Psi(Z_{MGDA})$, where $\Psi(Z) = \operatorname{tr}(Z^T S_B^\Phi Z)$. We also use the 200 test images to perform face recognition using the projection matrices $Z_{GDA}$ and $Z_{MGDA}$, respectively. Figure 2 shows the experimental results of the 10 trials. From Figure 2, we can see that the values of $\Psi(Z_{MGDA})$ and the corresponding test error rate are stable over all 10 trials for MGDA. However, this is not the case for GDA: the values of $\Psi(Z_{GDA})$
Figure 2: Comparison of the stability of GDA and MGDA. (a) The value of $\Psi(Z_{GDA})$ in each trial. (b) The value of $\Psi(Z_{MGDA})$ in each trial. (c) The test error rates of GDA and MGDA in each trial.
vary slightly over the trials, and the test error rates also change slightly in some trials.

5 Discussion and Conclusion

The GDA method generalizes the LDA method to nonlinear discriminant analysis using the kernel trick. Baudat and Anouar (2000) provide an algebraic formulation and an eigenvalue resolution that can give an exact solution for GDA, even though some points, such as the choice of kernel function, require further investigation. However, further study of GDA shows that the resolution method for GDA that Baudat and Anouar (2000) developed may suffer from instability and inaccuracy due to the degeneracy of the solutions, especially for the small sample size
problem, where the most discriminant eigenvectors in the null space of the within-class scatter matrix share the same maximal eigenvalue ($= 1$). In this article, we have developed a modified algorithm for GDA (MGDA) to overcome this problem. The performance of MGDA is the same as that of GDA when no degeneracy occurs. When degeneracy occurs, our theoretical analysis and the experiments on the ORL face database show that the MGDA method performs well and is superior to the GDA method. In the first example in section 4.1, we saw that the MGDA method matches the performance of the kernel Fisherface method. In fact, the kernel Fisherface algorithm developed by Yang (2002) and the GDA algorithm of Baudat and Anouar (2000) are two different algorithms based on the same criterion (equation 2.12) for solving discriminant analysis problems in feature space. However, neither method overcomes the possible degeneracy problem of the solutions. Thus, the best performance of MGDA could exactly equal that of kernel Fisherfaces.

Appendix: Proof of Theorem 3

Proof. Let $\tilde\varphi_{r+1}$ be the direction of $\varphi_{r+1}$, where $\tilde\varphi_{r+1}$ satisfies the following constraint:

$$\tilde\varphi_{r+1}^T V \tilde\varphi_{r+1} = 1. \tag{A.1}$$
Thus, from equation 2.2, we have

$$\tilde\varphi_{r+1}^T R \varphi_i = 0, \quad i = 1, \ldots, r. \tag{A.2}$$
In order to compute $\tilde\varphi_{r+1}$, we use the method of Lagrange multipliers to transform criterion 2.1, including all the constraints:

$$L(\tilde\varphi_{r+1}) = \tilde\varphi_{r+1}^T B \tilde\varphi_{r+1} - \lambda \bigl[\tilde\varphi_{r+1}^T V \tilde\varphi_{r+1} - 1\bigr] - \sum_{i=1}^{r} \mu_i\, \tilde\varphi_{r+1}^T R \varphi_i, \tag{A.3}$$
where $\lambda$ and $\mu_i$, $i = 1, 2, \ldots, r$, are Lagrange multipliers. The optimization is performed by setting the partial derivative of $L(\tilde\varphi_{r+1})$ with respect to $\tilde\varphi_{r+1}$ equal to zero:

$$2B\tilde\varphi_{r+1} - 2\lambda V \tilde\varphi_{r+1} - \sum_{i=1}^{r} \mu_i R \varphi_i = 0. \tag{A.4}$$
Multiplying equation A.4 from the left by $\tilde\varphi_{r+1}^T$ and using equation A.2, we obtain

$$\lambda = \frac{\tilde\varphi_{r+1}^T B \tilde\varphi_{r+1}}{\tilde\varphi_{r+1}^T V \tilde\varphi_{r+1}} = F(\tilde\varphi_{r+1}). \tag{A.5}$$
Thus, $\lambda$ represents the value of $F(\tilde\varphi_{r+1})$ to be maximized. Multiplying equation A.4 from the left by $\varphi_j^T R V^{-1}$, $j = 1, 2, \ldots, r$, we obtain a set of $r$ expressions,

$$2\varphi_j^T R V^{-1} B \tilde\varphi_{r+1} - \sum_{i=1}^{r} \mu_i\, \varphi_j^T R V^{-1} R \varphi_i = 0, \quad j = 1, 2, \ldots, r, \tag{A.6}$$
or, in another form,

$$2\varphi_1^T R V^{-1} B \tilde\varphi_{r+1} - \sum_{i=1}^{r} \mu_i\, \varphi_1^T R V^{-1} R \varphi_i = 0,$$
$$2\varphi_2^T R V^{-1} B \tilde\varphi_{r+1} - \sum_{i=1}^{r} \mu_i\, \varphi_2^T R V^{-1} R \varphi_i = 0,$$
$$\cdots$$
$$2\varphi_r^T R V^{-1} B \tilde\varphi_{r+1} - \sum_{i=1}^{r} \mu_i\, \varphi_r^T R V^{-1} R \varphi_i = 0, \tag{A.7}$$
that is,

$$2\begin{bmatrix} \varphi_1^T \\ \varphi_2^T \\ \vdots \\ \varphi_r^T \end{bmatrix} R V^{-1} B \tilde\varphi_{r+1} - \begin{bmatrix} \varphi_1^T \\ \varphi_2^T \\ \vdots \\ \varphi_r^T \end{bmatrix} R V^{-1} R\, [\varphi_1\ \varphi_2\ \cdots\ \varphi_r] \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_r \end{bmatrix} = 0. \tag{A.8}$$
Let

$$\mu = [\mu_1\ \mu_2\ \cdots\ \mu_r]^T. \tag{A.9}$$
Considering the matrix notation 2.5, the previous set of $r$ equations, A.8, can be represented as a single matrix relation:

$$D R V^{-1} R D^T \mu = 2 D R V^{-1} B \tilde\varphi_{r+1}. \tag{A.10}$$

Thus, we obtain

$$\mu = 2 (D R V^{-1} R D^T)^{-1} D R V^{-1} B \tilde\varphi_{r+1}. \tag{A.11}$$
It is obvious that the following equation holds:

$$\sum_{i=1}^{r} \mu_i R \varphi_i = R D^T \mu. \tag{A.12}$$

Thus, equation A.4 can be written as

$$2B\tilde\varphi_{r+1} - 2\lambda V \tilde\varphi_{r+1} - R D^T \mu = 0. \tag{A.13}$$
Substituting equation A.11 into A.13, we have

$$2B\tilde\varphi_{r+1} - 2\lambda V \tilde\varphi_{r+1} - R D^T \bigl[2 (D R V^{-1} R D^T)^{-1} D R V^{-1} B \tilde\varphi_{r+1}\bigr] = 0, \tag{A.14}$$

or, in another form,

$$B\tilde\varphi_{r+1} - R D^T (D R V^{-1} R D^T)^{-1} D R V^{-1} B \tilde\varphi_{r+1} = \lambda V \tilde\varphi_{r+1}, \tag{A.15}$$

that is,

$$\bigl[I - R D^T (D R V^{-1} R D^T)^{-1} D R V^{-1}\bigr] B \tilde\varphi_{r+1} = \lambda V \tilde\varphi_{r+1}. \tag{A.16}$$

Considering the matrix notation 2.4, equation A.16 can be expressed as

$$P B \tilde\varphi_{r+1} = \lambda V \tilde\varphi_{r+1}. \tag{A.17}$$
Because $\tilde\varphi_{r+1}$ is the direction of $\varphi_{r+1}$, we therefore obtain

$$P B \varphi_{r+1} = \lambda V \varphi_{r+1}. \tag{A.18}$$

Besides, it is noted that $\varphi_{r+1}$ is a unit vector. Thus, we have

$$\varphi_{r+1}^T \varphi_{r+1} = 1. \tag{A.19}$$

From equations A.18 and A.19, we know that $\varphi_{r+1}$ is the eigenvector corresponding to the largest eigenvalue $\lambda$ of eigenequation A.18.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Chen, L. F., Liao, H. Y. M., Ko, M. T., Lin, J. C., & Yu, G. J. (2000). A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33, 1713–1726.
Chien, J. T., & Wu, C. C. (2002). Discriminant waveletfaces and nearest feature classifiers for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1644–1649.
Duchene, J., & Leclercq, A. (1988). An optimal transformation for discriminant and principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10, 978–983.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Foley, D. H., & Sammon, J. W., Jr. (1975). An optimal set of discriminant vectors. IEEE Transactions on Computers, C-24, 281–289.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. Orlando, FL: Academic Press.
Jin, Z., Yang, J. Y., Hu, Z. S., & Lou, Z. (2001). Face recognition based on the uncorrelated discriminant transformation. Pattern Recognition, 34, 1405–1416.
Schiff, L. (1968). Quantum mechanics (3rd ed.). New York: McGraw-Hill.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Vapnik, V. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag.
Yang, M. H. (2002). Kernel eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods. In Proceedings of the Fifth International Conference on Automatic Face and Gesture Recognition (pp. 215–220). Los Alamitos, CA: IEEE Computer Society.

Received January 31, 2003; accepted November 7, 2003.
LETTER
Communicated by Robert Tibshirani
Stability-Based Validation of Clustering Solutions

Tilman Lange
[email protected]
Volker Roth [email protected] Swiss Federal Institute of Technology (ETH) Zurich, Institute for Computational Science, CH-8092 Zurich, Switzerland
Mikio L. Braun
[email protected]
Institut für Informatik III, Rheinische Friedrich-Wilhelms-Universität Bonn, 53117 Bonn, Germany
Joachim M. Buhmann [email protected] Swiss Federal Institute of Technology (ETH) Zurich, Institute for Computational Science, CH-8092 Zurich, Switzerland
Data clustering describes a set of frequently employed techniques in exploratory data analysis for extracting "natural" group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with respect to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.

Neural Computation 16, 1299–1323 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

One of the fundamental tools for understanding and interpreting data when prior knowledge of the underlying distribution is missing is the automatic organization of the data into groups or clusters. Methods for data clustering provide core techniques for exploratory data analysis and have been
1300
T. Lange, V. Roth, M. Braun, and J. Buhmann
successfully applied in numerous domains, such as computer vision, computational biology, and data mining (Jain, Murty, & Flynn, 1999; Gordon, 1999; Buhmann, 1995; Jain & Dubes, 1988). Adopting a machine learning perspective, clustering belongs to the class of unsupervised learning problems. A cluster analysis typically requires several steps:

1. The objects to be clustered should be represented by informative features. The features often require some preprocessing, such as standardization. Furthermore, a suitable similarity measure has to be selected. The choices made here largely predetermine what a clustering algorithm can achieve in the best case.

2. The type of group structure has to be specified in mathematical terms. This choice is essentially made by selecting a clustering algorithm that encodes a model for the data. Clustering algorithms reflect the structural bias of a grouping principle, for example, that clusters are spherically distributed rather than elongated.

3. The user has to assess the validity of the resulting grouping solution; that is, the user has to infer the "correct" number of clusters k as well as the interpretation of the groups. In the absence of prior knowledge, this question represents a particularly hard and important part of the analysis.

The main topic of this article is the reliable inference of an appropriate number of clusters, or model order. The notion of cluster validation refers to concepts and methods for the quantitative and objective evaluation of the output of a clustering algorithm (Jain et al., 1999; Gordon, 1999; Jain & Dubes, 1988). A general principle for cluster validation should be applicable to every clustering algorithm and should not be restricted to a specific group of clustering methods. Adopting this point of view, a model validation scheme should avoid relying on additional assumptions about the group structure in the data that have not been captured by the clustering algorithm.
Otherwise, it could easily be biased and consequently would lack objectivity. In this article, such a general principle is proposed by introducing the notion of the stability of clustering solutions. Our approach to cluster validation requires that solutions be similar for two different data sets that have been generated by the same (probabilistic) source. Hence, the replicability of clustering solutions is essentially assessed, an approach also taken by several other methods for cluster validation in general and for estimating the number of clusters in particular. Some methods (Ben-Hur, Elisseeff, & Guyon, 2002; Levine & Domany, 2001) cluster two nondisjoint data sets in order to measure the similarity of the clustering solutions obtained for the intersection of both data sets (see section 3). In section 2, we point out why this approach could be biased. A different approach was pioneered in early work by Breckenridge (1989): the basic idea is to measure the agreement of clustering solutions generated by a clustering algorithm and by a
classifier trained using a second (clustered) data set. Breckenridge's work did not lead to a specific implementable procedure, in particular not for model order selection. However, his study suggests the usefulness of such an approach for the purpose of validation. Our method, as well as Dudoit and Fridlyand's (2002) Clest and Tibshirani, Walther, Botstein, and Brown's (2001) Prediction Strength, builds on Breckenridge's ideas (see sections 2 and 3 for a detailed description of these methods). The methods described so far can be classified as essentially model free. In contrast to these techniques, model-based approaches (e.g., Smyth, 1998; Fraley & Raftery, 1998) rely on a probabilistic formulation of the clustering problem or implicitly encode assumptions on the type of group structure. Although not explicitly formulated in a probabilistic way, the general validation methodology proposed in Yeung, Haynor, and Ruzzo (2001), as well as the Gap Statistic (Tibshirani, Walther, & Hastie, 2001; see section 3), is model dependent, since both methods assume compact clusters tightly packed around cluster centroids. Our notion of stability is introduced and developed in section 2 as the expected dissimilarity of clustering solutions, which can serve as a confidence measure for data partitioning. Section 3 starts with a short account of the different validation methods used in the experimental study of simulated data sets (see section 3.2) and of gene expression levels measured by DNA microarrays (see section 3.3). The cluster analysis of the data from molecular biology addresses the problem of tumor (sub)type or class discovery. From the perspective of model order selection, it is necessary to determine how many subtypes can be reliably identified in a given data set.

2 The Stability Measure

The concept of stability as a paradigm for assessing solutions to clustering problems is mathematically formalized in this section.
Based on this formalism, the number of clusters is estimated from the stability analysis of clustering solutions. Let $X = (X_1, \ldots, X_n) \in \mathcal{X}^n$ be a finite data set. The problem of clustering is to find a partition of this data set into $k$ disjoint classes or clusters. A solution is formally represented by a vector of labels $Y = (Y_1, \ldots, Y_n)$, where $Y_i \in L := \{1, \ldots, k\}$ and $Y_i = \nu$ if $X_i$ is assigned to cluster $\nu$. A clustering algorithm $A_k$ constructs such a solution: $Y := A_k(X)$.¹ We assume that the data set has been drawn from a probabilistic source. This assumption is particularly reasonable in the context of experimental biological data.

¹ The proposed stability method relies on partitions of the object set as solutions to clustering problems. However, since rooted trees are nested sequences of partitions, the proposed stability approach can be straightforwardly adapted for validating hierarchical groupings.
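Concretely, any procedure that returns such a label vector can play the role of $A_k$ in this notation. A minimal sketch (our own illustration, not the authors' choice of algorithm) using Lloyd-style k-means:

```python
# Sketch: a clustering algorithm A_k(X) returning a label vector Y,
# with Y_i in {0, ..., k-1}, via Lloyd's k-means iterations.
import numpy as np

def kmeans_labels(X, centers, iters=50):
    """X: (n, d) data; centers: (k, d) initial cluster centers."""
    centers = centers.astype(float).copy()
    for _ in range(iters):
        # Assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster is empty)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Running such an algorithm on two samples drawn from the same source yields two label vectors whose agreement the stability measure developed in this section quantifies.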
Strictly speaking, model selection in clustering consists of two steps. First, a grouping algorithm has to be selected. Once this choice is made, the model order remains to be determined for the given data set in the second step. In this contribution, we mainly address the second issue and assume that the user has fixed a clustering algorithm for the application in mind. Thus, in the strict sense, we provide only a partial answer to the problem of cluster model selection. However, due to the lack of a general formal objective in clustering, the question of identifying an appropriate cluster algorithm is ill posed. Put differently, providing a definitive answer to this question would amount to resolving the clustering problem itself. In spite of this general problem, our experiments indicate that if one grouping strategy exhibits much higher stability than a second one, then the first method is likely to be more appropriate for the data set than the alternative. In case two methods achieve equally high scores, our validation scheme does not prefer one method over the other. This ambiguity, however, cannot be resolved by statistical validation schemes, since the different methods potentially emphasize different aspects of the data. Taking such an abstract viewpoint, we propose that our approach, although primarily designed for solving the model order selection problem, might also help to decide which model is more appropriate in cases where no prior knowledge from the application domain can guide the selection of a clustering algorithm. The model order selection problem in cluster analysis amounts to finding a suitable number of clusters $k$. In general, there might exist more than one useful answer to this question. An indispensable requirement, however, is the robustness of a clustering solution; that is, the result of a cluster analysis should be reproducible on other data sets drawn from the same source and should not depend too sensitively on the sample set at hand.
This means, for example, that in the context of novel tumor class identification, a second set of tissues originating from the same tumor classes should be clustered in a similar manner. The stability of the clustering solution varies with the number of clusters that are inferred. For instance, inferring too many clusters leads to arbitrary splits of the data, and the solution is influenced heavily by sample fluctuations. Inferring too few clusters might also lead to unstable solutions, since the lack of degrees of freedom forces the algorithm to ambiguously mix structures that should be kept separate (the five-gaussians example in section 3 provides an example of this). Following these considerations, we define the "correct" number of clusters as the one whose clustering solutions have maximal stability. The notion of stability that we propose here is based on the average dissimilarity of solutions computed on two different data sets. We first need to derive a measure of dissimilarity. Note that a labeling $Y := A_k(X)$ is defined only with regard to a data set $X$. Therefore, solutions $Y := A_k(X)$ and $Y' := A_k(X')$ are not directly comparable, since they depend on different (often disjoint) data sets $X$ and $X'$. For the purpose of assessing the similarity of clustering solutions, a mechanism has to be devised that
Stability-Based Validation of Clustering Solutions
renders these solutions comparable. Such a mechanism represents a fundamental difference to the approaches of Levine and Domany (2001) and Ben-Hur et al. (2002), which operate on data sets with overlap.

2.1 Transfer by Prediction. Supervised classification learns a function φ that assigns each element of a space X to one of k classes, based on a labeled training data set. The function φ, which is inferred from the training data, is called a classifier. The data set X together with its clustering solution Y := A_k(X) can be considered as a training data set used to construct a classifier. This classifier φ, trained on (X, Y), predicts a label φ(X') for a new object X'. By applying φ to each object X'_i in a new set X' = (X'_1, ..., X'_n), we obtain a labeling for X'. For our purpose, we consider the predicted labeling φ(X') := (φ(X'_i))_{i ≤ n} as the extension of the clustering solution Y of the data set X to the data set X'. These predicted labels can be compared with those generated by the clustering algorithm, that is, with A_k(X'). Hence, the labelings A_k(X) and A_k(X') are made comparable by using a classifier φ as a solution transfer mechanism. The type of classifier largely influences the stability measurement and therefore has to be selected with care, as will be discussed below. After a classifier φ has been chosen, the comparison of solutions on the same data set can be addressed.

2.2 Comparing Clustering Solutions. We have derived two solutions φ(X') and Y' for the same data set X'. A natural distance measure for comparing the two labeling vectors φ(X') and Y' is their normalized Hamming distance,

    d(\phi(X'), Y') := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\phi(X'_i) \neq Y'_i\},    (2.1)

where 1{φ(X'_i) ≠ Y'_i} = 1 if φ(X'_i) ≠ Y'_i and zero otherwise. In the context of supervised classification, this distance between class assignments of an object is called the 0-1 loss function (see, e.g., Duda, Hart, & Stork, 2000). The empirical average of the 0-1 loss over n samples is called the empirical misclassification risk. Hence, equation 2.1 can be interpreted as the empirical misclassification risk of φ with regard to the test data set (X', Y' = A_k(X')). The empirical misclassification risk, equation 2.1, compares two sets of labels that are not necessarily in natural correspondence. Contrary to classification problems, the labeling of the predicted clusters φ(X'_i) on the second data set has to correspond to the optimized clusters Y'_i only up to an unknown permutation π ∈ S_k of the indices, where S_k is the set of all permutations of the elements in L. For example, two partitionings of the data set X' might be structurally equivalent although the labelings φ(X') and Y' are represented differently (i.e., d(φ(X'), Y') ≠ 0). For instance, for k = 2, the cluster labeled 1 in the first solution might correspond to the one labeled
T. Lange, V. Roth, M. Braun, and J. Buhmann
2 in the second solution, and vice versa. This ambiguity poses an intrinsic problem in unsupervised learning and arises from the fact that there exists no label information prescribed for the clusters (in contrast to classification). To overcome the nonuniqueness of representation, we modify the dissimilarity such that the label indices in one solution are optimally permuted to maximize the agreement between the two solutions under comparison, that is,

    d_{S_k}(\phi(X'), Y') := \min_{\pi \in S_k} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\pi(\phi(X'_i)) \neq Y'_i\}.    (2.2)
Technically, the minimization can be performed in time O(n + k^3) by using the Hungarian method (Kuhn, 1955) for minimum weighted bipartite matching, which is guaranteed to find the globally optimal π ∈ S_k. The procedure requires time O(n) for setting up a weight matrix; the matching itself can be performed in time O(k^3). Thus, the computation of the dissimilarity remains tractable. So far, a measure of disagreement d_{S_k} between solutions generated from two different data sets has been derived. The measure quantifies the fraction of differently labeled points and can be understood as the empirical misclassification risk of φ with respect to the "label generator" A_k. Our measure of disagreement is intuitive, since it matches clusters of one solution with those of the other solution in order to maximize the overlap of the respective clusters. Fixing this specific dissimilarity measure represents a significant difference to Clest and the Prediction Strength method (see section 3.1 for a discussion).

2.3 Stability as the Average Distance Between Partitions. Since we consider the data to be drawn from a probability distribution, we are interested in the average distance between solutions, where the expectation is taken with regard to pairs of independent data sets X, X' of size n, that is,

    S(A_k) := E_{X,X'} \, d_{S_k}(\phi(X'), Y').    (2.3)

We call S the stability index of the clustering algorithm A_k. It is the average dissimilarity of clustering solutions with regard to the distribution of the data. Hence, the smaller the value of the index S(A_k) ∈ [0, 1], the more stable are the solutions. The stability index S(A_k) provides a measure of the reproducibility of clustering solutions obtained from a probabilistic data source with the clustering algorithm A_k. Conceptually, the expected disagreement between solutions, introduced here as a quality measure, is essentially the quantity of interest in all methods building on the stability idea.

2.4 On the Choice of the Classifier. By selecting an inappropriate classifier, one can artificially increase the discrepancy between solutions, as
Figure 1: Effect of a wrong predictor. Using the training data on the left, generated with path-based clustering (Fischer, Zöller, & Buhmann, 2001), a nearest centroid classifier (right plot) cannot reproduce the target labeling. In this case, the disagreement between two solutions is artificially increased by the inappropriate predictor. Label indices are represented by symbols in the plots.
illustrated in Figure 1. Choosing a predictor for the stability assessment requires care, a fact disregarded by Dudoit and Fridlyand (2002) as well as by Tibshirani, Walther, Botstein et al. (2001). Since we want to measure the stability inherent in the data distribution, the influence of the classifier should be minimized; that is, we select the φ* ∈ C with minimal misclassification error,

    \phi^* := \arg\min_{\phi \in C} E_{X,X'} \left[ \min_{\pi \in S_k} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\pi(\phi(X'_i)) \neq Y'_i\} \right].    (2.4)

Note that the minimization over all predictors is safe and cannot lead to overfitting or trivial stability, because φ* minimizes the misclassification on every independent test data set (X', Y') for all training data sets (X, Y) (φ* plays the role here of the Bayes optimal predictor with regard to the label source A_k(X)). Minimizing S(A_k) is an important requirement. Consider the predictor that outputs the training labels for objects from the training data set and a constant label (say, 1) for all other, previously unobserved objects. Clearly, minimizing the training error would not suffice, since overfitting would remain undetected. Thus, the expected prediction error with regard to the label source A_k(X') = Y' has to be minimized. Due to the lack of a general finite sample size theory for the accuracy of classifiers, the identification of the optimal classifier by analytical means seems to be unattainable. Therefore, we have to resort to potentially suboptimal classifiers in practical applications. Note, however, that a poorly chosen predictor can only worsen the stability. Equation 2.3 measures the mismatch between clustering and prediction strategy. Intuitively, this mismatch is minimized by mimicking the clustering algorithm. This intuition,
which is confirmed by our experiments, allows us to devise natural, good predictors in practice. For clustering algorithms that minimize a cost function H (such as the squared error in k-means clustering or the negative log-likelihood in mixture modeling), the grouping strategy can be mimicked if the predictor uses the least-cost-increase criterion: given the training data (X, Y), assign a new datum X_new to the class Y_new for which the cost increase H((X, X_new), (Y, Y_new)) − H(X, Y) becomes minimal. The same strategy is applicable for agglomerative algorithms; the working principle of these methods is the least-cost-increase strategy, since they merge the two closest clusters. For k-means, this strategy leads to the nearest centroid classifier up to (negligible) O(1/n) corrections. For single linkage, the nearest-neighbor classifier becomes the classifier of choice. In general, there exist algorithms that cannot easily be understood as minimizers of a cost function H (e.g., CLICK by Sharan & Shamir, 2000). For these cases, one can safely resort to K-nearest-neighbor classification to transfer solutions, which is asymptotically Bayes optimal, at least for metric data.

2.5 Stability for Model Order Selection. To apply our notion of stability to model order selection, some additional considerations are necessary. Note that the range of possible stability values S(A_k) depends on the number of clusters k. This dependency arises from the fact that the similarity measure for partitions scales with the number of classes. A misclassification rate of approximately 0.5 for k = 2 essentially means that the predictor is as poor as a random predictor. The same misclassification rate for k = 100, however, is far better than random guessing, which has an asymptotic error of 0.99. Hence, stability indices are not directly comparable for different values of k, but comparability is mandatory to use the stability index for model order selection.
The strategy employed here represents one possible way to achieve comparability while keeping the stability measure interpretable. We stress, however, that there does not exist a generally accepted "correct" strategy for normalization and, thereby, scale adjustment. Note that S(A_k) ≤ 1 − 1/k: for stability index values above this bound, there exists a label permutation π so that the index after permuting drops below the bound (more formally, d_{S_k}(φ(X'), Y') ≤ E_{Π ∈ S_k}[d(Π ∘ φ(X'), Y')] = 1 − 1/k, assuming a uniform distribution over all k! permutations). The random labeling algorithm R_k that assigns an object to class ν with probability 1/k asymptotically achieves this index value (i.e., for n → ∞). By normalizing the empirical misclassification rate of the clustering algorithm, S(A_k), with the asymptotic misclassification rate of random labelings, S(R_k), the stability index values are brought to the same scale, and thereby comparability is achieved. Hence, we set

    \bar{S}(A_k) := \frac{S(A_k)}{S(R_k)}.    (2.5)
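The bound and the normalization can be checked numerically. The sketch below estimates S(R_k) by comparing pairs of uniform random labelings; the brute-force permutation matching and the sampling parameters are illustrative choices of ours (for larger k, the Hungarian method would replace the enumeration):

```python
import itertools
import numpy as np

def min_permuted_disagreement(a, b, k):
    """Brute-force version of equation 2.2 (fine for small k)."""
    return min(np.mean(np.array([p[x] for x in a]) != b)
               for p in itertools.permutations(range(k)))

def random_baseline(k, n=1000, pairs=50, seed=0):
    """Estimate S(R_k): average distance between pairs of uniform random
    k-labelings; approaches 1 - 1/k as n grows."""
    rng = np.random.default_rng(seed)
    d = [min_permuted_disagreement(rng.integers(0, k, n),
                                   rng.integers(0, k, n), k)
         for _ in range(pairs)]
    return float(np.mean(d))

for k in (2, 3, 4):
    print(k, round(random_baseline(k), 3))  # each close to 1 - 1/k
```

Because the minimum over permutations never exceeds the average over permutations, every estimate stays below 1 − 1/k, matching the bound stated above.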
S̄ is the stability of A_k relative to the stability achieved by random label guessing. Note that the stability measure is not defined for k = 1. We assume that the data contain nontrivial structure. In our view, the question of whether the data contain structure at all (i.e., k = 1 or k > 1?) should be addressed in advance, for example, by an application-specific test of unimodality. Inclusion of such tests into validity measures can lead to unreliable results, as demonstrated in the experiments. However, the proposed stability index S̄ can be considered as a measure of confidence in solutions with k > 1. Unusually high instability for k > 1 supports the conclusion that only the trivial but uninformative solution (i.e., k = 1) contains "reliable" structure.

2.6 Using Stability in Practice. Ideally, one evaluates S̄ for several k and chooses the number of clusters k for which a minimum of S̄ is obtained. In practical applications, usually only one data set X is available. However, for evaluating S̄, an expected value (with regard to two data sets) has to be estimated using the given data. We propose here a simple subsampling scheme for emulating two independent data sets: iteratively split the data into two halves, and compare the solutions obtained for these halves. Compute estimates of S̄ for different k and identify the k* with the smallest estimate (in case of nonunique minima, the largest k is selected). This k* is chosen as the estimated number of clusters. A partition of the whole data set can be obtained by applying the clustering algorithm with the chosen k*. The steps of the procedure are summarized in Table 1. A few comments are necessary to explain the subsampling scheme. The data should be split into two disjoint subsets, because an overlap could otherwise already determine the group structure. This statistical dependence would lead to an undesirable, artificially induced stability.
Table 1: Overview of the Algorithm for Stability-Based Model Order Selection.

Repeat for each k ∈ {k_min, ..., k_max}:
  1. Repeat the following steps for r splits of the data to get an estimate Ŝ(A_k) by averaging:
     1a. Split the given data set into two halves X, X' and apply A_k to both.
     1b. Use (X, A_k(X)) to train the classifier φ and compute φ(X').
     1c. Calculate the distance of the two solutions φ(X') and A_k(X') for X' (see equation 2.2).
  2. Sample s random k-labelings, compare pairs of these, and compute the empirical average of the dissimilarities to estimate Ŝ(R_k).
  3. Normalize each Ŝ(A_k) with Ŝ(R_k) to get an estimate of S̄(A_k).
Return k̂ = argmin_k S̄(A_k) as the estimated number of clusters.

Hence, the use of bootstrapping can be dangerous as well, in that it can lead to artificially low
disagreement between solutions. This is one of the reasons that the methods of Levine and Domany (2001) and Ben-Hur et al. (2002) can produce unreliable results (see section 3). Furthermore, the data sets should have (approximately) equal size so that an algorithm can find similar structure in both data sets. If there are too few samples in one of the two sets, the group structure might no longer be visible to a clustering algorithm. Hence, the option in Clest (see section 3) to use subsets of significantly different sizes can influence the assessment in a negative way and can render the obtained statistics useless in the worst case. Instead of the splitting scheme devised above, one could also fit a generative model to the data from which new (noisified) data sets can be resampled.

3 Experimental Evaluation

In this section, we provide empirical evidence for the usefulness of our approach to model order selection. Using toy data sets, we can study the behavior of the stability measure under well-controlled conditions. By applying our method to gene expression data, we demonstrate the competitive performance of the proposed stability index under real-world conditions. Preliminary results have been presented in Lange et al. (2003). For the experiments, a deterministic annealing variant of k-means (Rose, Gurewitz, & Fox, 1992) and path-based clustering (Fischer et al., 2001) optimized via an agglomerative heuristic are employed. Averaging is performed over r = s = 20 resamples and for 2 ≤ k ≤ 10. As discussed in section 2, the predictor should match the grouping principle. Therefore, we use (in conjunction with the stability index) the nearest centroid classifier for k-means and a variant of a nearest-neighbor classifier for path-based clustering that can be considered as a combination of single linkage and pairwise clustering (cf. Fischer et al., 2001; Hofmann & Buhmann, 1997). Concerning the classifier training, see Hastie, Tibshirani, and Friedman (2001).
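The procedure of Table 1 can be sketched end to end. Plain k-means with k-means++ seeding (rather than the deterministic annealing variant used in the paper), the brute-force permutation matching, and the toy data below are our simplifying assumptions:

```python
import itertools
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm with k-means++ seeding; stands in for A_k."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):                     # k-means++ initialization
        d2 = ((X[:, None] - np.array(C)[None]) ** 2).sum(-1).min(1)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    C = np.array(C)
    for _ in range(iters):
        y = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.stack([X[y == c].mean(0) if np.any(y == c) else C[c]
                      for c in range(k)])
    return y, C

def d_perm(a, b, k):
    """Equation 2.2 by brute force over label permutations (small k only)."""
    return min(np.mean(np.array([p[x] for x in a]) != b)
               for p in itertools.permutations(range(k)))

def normalized_stability(X, k, r=10, seed=0):
    """Steps 1-3 of Table 1 for one value of k."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(r):
        idx = rng.permutation(n)
        A, B = X[idx[: n // 2]], X[idx[n // 2:]]
        yA, C = kmeans(A, k, seed=int(rng.integers(1 << 30)))
        yB, _ = kmeans(B, k, seed=int(rng.integers(1 << 30)))
        # Nearest-centroid prediction transfers the solution on A to B
        # (the predictor that mimics k-means, section 2.4).
        yT = ((B[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        scores.append(d_perm(yT, yB, k))
    # Step 2: random-labeling baseline; step 3: normalization (eq. 2.5).
    base = np.mean([d_perm(rng.integers(0, k, n // 2),
                           rng.integers(0, k, n // 2), k) for _ in range(10)])
    return float(np.mean(scores) / base)

# Hypothetical data: three well-separated gaussians, as in section 3.2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in ((0, 0), (4, 0), (2, 3))])
scores = {k: normalized_stability(X, k) for k in range(2, 7)}
print(min(scores, key=scores.get))
```

On such well-separated data the index is near zero for k = 3 and markedly larger for the merged (k = 2) and split (k > 3) solutions, so the minimum identifies the correct model order.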
We empirically compare our method with the Gap Statistic, Clest, Tibshirani's Prediction Strength, Levine and Domany's figure of merit, and the model explorer algorithm by Ben-Hur et al. (2002). These methods are described next, with an emphasis on differences and potential shortcomings. The results of this study are summarized in Tables 2 and 3, where k̂ denotes the estimated number of clusters.

3.1 Competing Methods

3.1.1 The Gap Statistic. Recently, the Gap Statistic has been proposed as a method for estimating the number of clusters (Tibshirani, Walther, & Hastie, 2001). It is not a model-free validation method, since it encodes assumptions about the group structure to be extracted. The Gap Statistic relies on the total sum of within-class dissimilarities.
Table 2: Estimated Model Orders for the Toy Data Sets.

Data Set               | Stability Method | Gap Statistic | Clest   | Prediction Strength | Levine's FOM  | Model Explorer | "True" Number k
3 gaussians            | k̂ = 3           | k̂ = 3        | k̂ = 3  | k̂ = 3              | k̂ = 2, 3     | k̂ = 2, 3      | k = 3
5 gaussians            | k̂ = 5           | k̂ = 1        | k̂ = 5  | k̂ = 5              | k̂ = 5        | k̂ = 5         | k = 5
3 rings, k-means       | k̂ = 7           | k̂ = 1        | k̂ = 7  | k̂ = 1              | k̂ = 8        | —              | k = 3
3 rings, path-based    | k̂ = 3           | k̂ = 1        | k̂ = 1  | k̂ = 1              | k̂ = 2, 3, 4  | k̂ = 2, 3      | k = 3
3 spirals, k-means     | k̂ = 6           | k̂ = 1        | k̂ = 10 | k̂ = 1              | k̂ = 6        | k̂ = 6         | k = 3
3 spirals, path-based  | k̂ = 3           | k̂ = 1        | k̂ = 6  | k̂ = 1              | k̂ = 2, 3     | k̂ = 3, 6      | k = 3
Table 3: Estimated Model Orders for the Gene Expression Data Sets.

Data Set                | Stability Method | Gap Statistic | Clest  | Prediction Strength | Levine's FOM  | Model Explorer | "True" Number k
Golub et al. (1999)     | k̂ = 3           | k̂ = 10       | k̂ = 3 | k̂ = 2              | k̂ = 2, 8, 10 | k̂ = 2, 9      | k ∈ {3, 2}
Alizadeh et al. (2000)  | k̂ = 2           | k̂ = 4        | k̂ = 1 | k̂ = 1              | k̂ = 2        | k̂ = 2         | k = 3
For a given number of clusters k, data set X, and clustering solution Y = A_k(X), it considers

    W_k := \sum_{1 \le \nu \le k} (2 n_\nu)^{-1} \sum_{i,j :\, Y_i = Y_j = \nu} D_{ij}.    (3.1)

Here, D_ij denotes the dissimilarity between X_i and X_j, and n_ν := |{i | Y_i = ν}| the number of objects assigned to cluster ν by the labeling Y. If X_i, X_j ∈ R^d and D_ij = ||X_i − X_j||^2, then W_k corresponds to the squared-error criterion optimized by the k-means algorithm. The Gap Statistic investigates the relationship between log(W_k) for different values of k and the expectation of log(W_k) under a suitable null reference distribution, through the definition of the gap:

    \mathrm{gap}_k := E^*[\log(W_k)] - \log(W_k).    (3.2)

Here, E^* denotes the expectation under the null reference distribution (as in Tibshirani, Walther, & Hastie, 2001). A possible reference distribution is the uniform distribution on the smallest hyper-rectangle that contains the
original data. In practice, the expected value is estimated by drawing B samples from the null distribution (B = 20 in the experiments); hence

    \widehat{\mathrm{gap}}_k := \frac{1}{B} \sum_{b} \log(W^*_{kb}) - \log(W_k),    (3.3)

where W^*_{kb} is the total within-cluster scatter for the bth data set drawn from the null reference distribution. Now let std_k denote the standard deviation of the sampled log(W^*_{kb}), and set s_k := std_k \sqrt{1 + 1/B}. The Gap Statistic then selects the smallest number of clusters k for which the gap, corrected for the standard deviation, is large:

    \hat{k} := \min \{ k \mid \widehat{\mathrm{gap}}_k \ge \widehat{\mathrm{gap}}_{k+1} - s_{k+1} \}.    (3.4)
Since the Gap Statistic relies on W_k in equation 3.1, it presupposes spherically distributed clusters. Hence, it contains a structural bias that should be avoided. The null reference distribution is a free parameter of the method, which essentially determines when to vote for no structure in the data (k̂ = 1). The experiments in Tibshirani, Walther, and Hastie (2001), as well as the 5 gaussians example in section 3.2, demonstrate a sensitivity to the choice of the baseline distribution.

3.1.2 Clest. An approach related to ours has been proposed by Dudoit and Fridlyand (2002); we mainly comment on the differences here. A given data set is repeatedly split into two nondisjoint sets. The sizes of the data sets are free parameters of the method. As we already pointed out in section 2, very unbalanced splitting schemes can lead to unreliable results. After clustering both data sets, a predictor is trained on one data set and tested on the other. The predictor is a free parameter of Clest; no guidance is given in Dudoit and Fridlyand (2002) concerning its choice. Unfortunately, the subsequent steps are largely determined by the reliability of the prediction step. A similarity measure for partitions is used in order to measure the similarity of the predicted labels to the cluster labels. The measure itself is again a free parameter, in contrast to the method proposed here. In the experiments, the Fowlkes and Mallows (FM) index (Fowlkes & Mallows, 1983) was used. Given a fixed k, the median t_k of the similarity statistics obtained for B splits of the data is compared with those obtained for B_0 data sets drawn from a null reference distribution. The difference d_k := t_k − t⁰_k between the average median statistic under the null hypothesis, t⁰_k := (1/B_0) Σ_b t_{k,b}, and the observed statistic t_k, together with p_k := |{b | t_{k,b} ≥ t_k}| B_0^{-1} as a measure of variation in the baseline samples, is used to select candidate numbers of
clusters. The number of clusters k̂ is chosen by bounding p_k and d_k, yielding two additional free parameters d_min and p_max. Dudoit and Fridlyand select

    \hat{k} := \begin{cases} 1, & \text{if } K^- = \emptyset \\ \min K^-, & \text{otherwise} \end{cases}    (3.5)

for K^- := {2 ≤ k ≤ M | p_k ≤ p_max, d_k ≥ d_min}. The whole procedure is repeated for k ∈ {2, ..., M}, where M is some predefined upper bound on the number of clusters. Note that the set K^- is essentially determined by the bounds on p_k and d_k; they can be chosen badly so that K^- is always empty, for example. We conclude that Clest comes with a large number of parameters that have to be set by the user. At the same time, little guidance is given on how to select reasonable parameter values in practice. This lack of parameter specification poses a severe practical problem, since the obtained statistics are of little value for poor parameter selections. In contrast, our method gives guidance concerning the most important parameters. Due to the large degrees of freedom in Clest, we consider it a conceptual framework rather than a fully specified algorithm. For the experiments, we set B = B_0 = 20 and p_max = d_min = 0.05, and choose a simple linear discriminant analysis classifier as the predictor, following the choice in Dudoit and Fridlyand (2002).

3.1.3 Prediction Strength. This recently proposed method (Tibshirani, Walther, Botstein et al., 2001) is related to ours. Again, the data are repeatedly split into two parts, and the nearest class centroid classifier is employed for prediction. The similarity of clustering solutions is assessed by essentially measuring the intersection of the two clusters in both solutions that match worst. A threshold on the average similarity score is used to estimate the number of clusters: the largest k is selected for which the average similarity is above the user-specified threshold. From a conceptual point of view as well as in practice, this procedure has two severe drawbacks: (1) it is reasonably applicable to squared-error clustering only, due to the use of the nearest centroid predictor, and (2) the similarity measure employed for comparing partitions can trivially drop to zero for large k.
In particular, the latter point severely limits the applicability of the Prediction Strength method. In the experiments, we averaged over 20 splits, and the threshold was set to 0.9.

3.1.4 Model Explorer Algorithm. This method (Ben-Hur et al., 2002) clusters nondisjoint data sets: given a data set of size n, two subsamples of size ⌈fn⌉ are generated, where f ∈ (0.5, 1) (f := 0.8 in the experiments). The solutions obtained for these subsamples are compared at the intersection of the sets. The similarity measure for partitions is a free parameter of the method (in the experiments, the FM index was used). To estimate k, the experimental section of Ben-Hur et al. (2002) suggests looking for
"a jump in"

    P\{S_k > \eta\} \approx \frac{1}{r} \sum_{j=1}^{r} \mathbf{1}\{s_{j,k} > \eta\},    (3.6)
where S_k is the similarity score of two k-partitions and s_{j,k} denotes the empirically measured similarity for the jth (out of r) subsamples, for some fixed η. We choose η = 0.9 in the experiments. Looking for a "jump" in the distribution is not a well-defined criterion for determining a number of clusters k, as it is qualitative in nature. This vagueness can result in situations where no decision can be made. Furthermore, the Model Explorer algorithm can be biased (toward smaller k) due to the overlapping data sets, where the data common to both sets can potentially feign stability. As for the other methods, 20 pairs of subsamples were used in the experimental evaluation.

3.1.5 Levine and Domany's Resampling Approach. This method (described here in a simplified way) creates r subsamples of size ⌈fn⌉, where f ∈ [0, 1], from the original data (Levine & Domany, 2001). Solutions are computed for the full data and for the resamples. The authors define a figure of merit M that assesses the average similarity of the solutions obtained on the subsamples with the one obtained on the full sample. The matrix T ∈ {0, 1}^{n×n} with T_ij := 1{i ≠ j and i, j are in the same cluster}, where i, j ∈ {1, ..., n}, is called the cluster connectivity matrix; it is a different representation of a clustering solution. The resampling results in a matrix T for the full data and r matrices T^{(1)}, ..., T^{(r)} of size ⌈fn⌉ × ⌈fn⌉ for the subsamples. For the parameter k, the authors define their figure of merit as

    M(k) := \frac{1}{r} \sum_{\rho=1}^{r} \frac{\sum_{i \in J_\rho} \sum_{j \in N_{\rho,i}} \delta_{T_{ij},\, T^{(\rho)}_{ij}}}{\sum_{i' \in J_\rho} |N_{\rho,i'}|},    (3.7)

where J_ρ is the set of samples in the ρth resample and N_{ρ,i}, i ∈ J_ρ, defines a neighborhood between samples. In the original article, the neighborhood definition was left as a free parameter. The criterion we have chosen for our experiments is the κ-mutual nearest neighbor neighborhood definition (see Levine, 1999). M(k) measures the extent to which the grouping computed on a subsample agrees with the solution on the full data set; M(k) = 1 indicates perfect agreement. The authors suggest choosing the parameter(s) k for which local maxima of M are observed. Several maxima can occur, and hence it is not clear how to choose a single number of clusters in that case; in the experiments, all of them are taken into consideration. We have set f = 2/3, r = 20, and κ = 20.
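The connectivity-matrix representation behind equation 3.7 can be sketched as follows. The all-pairs neighborhood is our simplifying assumption (Levine and Domany use a κ-mutual-nearest-neighbor definition), and the labelings are toy inputs rather than real clustering output:

```python
import numpy as np

def connectivity(y):
    """Cluster connectivity matrix: T_ij = 1 iff i != j and y_i == y_j."""
    T = (y[:, None] == y[None, :]).astype(int)
    np.fill_diagonal(T, 0)
    return T

def figure_of_merit(y_full, subsample_labelings, subsample_indices):
    """Simplified M(k) of equation 3.7: average agreement of co-membership
    between the full-data solution (restricted to each subsample) and the
    subsample solutions. Neighborhoods N_{rho,i} are taken to be all other
    points of the subsample, a simplification of the original definition."""
    scores = []
    for y_sub, idx in zip(subsample_labelings, subsample_indices):
        T_full = connectivity(y_full[idx])
        T_sub = connectivity(y_sub)
        off_diag = ~np.eye(len(idx), dtype=bool)
        scores.append(np.mean(T_full[off_diag] == T_sub[off_diag]))
    return float(np.mean(scores))

# Co-membership is invariant to label permutation: these two labelings
# describe the same partition, so the figure of merit is 1.
y_full = np.array([0, 0, 1, 1])
y_sub = np.array([1, 1, 0, 0])
print(figure_of_merit(y_full, [y_sub], [np.arange(4)]))  # 1.0
```

Working on the connectivity matrix sidesteps the label-permutation problem that equation 2.2 solves by explicit matching, which is why this representation needs no Hungarian step.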
3.2 Experiments Using Toy Data. The first data set consists of three fairly well separated point clouds generated from three gaussian distributions (25 points were drawn from each of the first and second, and 50 points from the third); it has also been used by Dudoit and Fridlyand (2002), Tibshirani, Walther, Botstein et al. (2001), and Tibshirani, Walther, and Hastie (2001). For some k, for example, k = 5 in Figure 2B, the variance in the stability over different resamples is relatively high. This effect can be explained by the model mismatch: for k = 5, the clustering of the three classes depends highly on the subset selected in the resampling. We conclude that, apart from the absolute value of the stability index, additional information about the fit can be obtained from the distribution of the stability index values over the resampled subsets. For this data set, all methods under comparison are able to infer the "true" number of clusters k = 3. Figures 2A and 2B show the clustered data set and the proposed stability index. For k = 2, the
Figure 2: Results of the stability index on the toy data (see section 3.2). (A) Clustering of the 3 gaussians data for k = 3. (B) The stability index for the 3 gaussians data set with k-means. (C) Clustering of the 5 gaussians data with the estimated k = 3. (D) The stability index for the 5 gaussians data set with k-means.
stability is relatively high due to the hierarchical structure of the data set, which enables stable merging of the two smaller subclusters with 25 data points each. The second proof-of-concept data set consists of n = 800 points in R^2 drawn from 5 gaussians (400 from the center one and 100 from each of the four outer gaussians; see Figure 2C). All methods here estimate the correct number of clusters, k = 5, except for the Gap Statistic. In this example, the uniformity hypothesis chosen for the Gap Statistic is too strict: groupings of uniform data sets drawn from the baseline distribution have costs very similar to those of the actual data. Hence, there is no large gap for k > 1. Figure 2D shows the stability index for this data set. Note the high variability and the poor stability index values for k < 5. This high instability is caused by the merging of clusters that do not belong together; which clusters are merged depends solely on the current noise realization (i.e., on the current subsample). In the three-ring data set (depicted in Figures 3A and 3C), which was also used by Levine and Domany (2001), three ring-shaped clusters can be naturally distinguished. These clusters obviously violate the modeling assumption of k-means of spherically distributed clusters. With k = 7, k-means is able to identify the inner circle as a cluster; thus, the stability for this number of clusters is highest (see Figure 3B). Clest infers k̂ = 7, and Levine's FOM suggests k̂ = 8, while the Gap Statistic and Prediction Strength estimate k̂ = 1. The Ben-Hur et al. (2002) method does not lead to any interpretable result. Applying the proposed stability estimator with path-based clustering to the same data set yields the highest stability for k = 3, the "correct" number of clusters (see Figures 3C and 3D). Here, most of the other methods fail and estimate k̂ = 1. The Gap Statistic fails here because it directly incorporates the assumption of spherically distributed data.
Similarly, the Prediction Strength measure and Clest (in the form used here) rely on classifiers that support only linear decision boundaries, which obviously cannot discriminate between the three ring-shaped clusters. In all these cases, a basic requirement for a validation scheme is violated: it should not incorporate additional assumptions about the group structure in a data set that go beyond the assumptions of the clustering principle employed. Levine's figure of merit as well as Ben-Hur's method infer multiple numbers of clusters. Apart from this observation, it is noteworthy that the stability of k-means is significantly worse than that achieved with path-based clustering. This stability ranking indicates that the latter is the preferred choice for this data set. Again, the stability can be considered a confidence measure for solutions: experiments on uniform data sets, for example, lead to very large values of the stability index (for all k > 1), rendering solutions on such data questionable. The spiral arm data set (shown in Figures 4A and 4C) consists of three clusters formed by three spiral arms. Note that the assumption of compact
Figure 3: Results of the stability index on the toy data (see section 3.2). (A) k-means clustering solution on the full three-ring data set for k = 7. (B) The stability index for the three-ring data set with k-means clustering. (C) Clustering solution on the full data set for k = 3 with path-based clustering. (D) The stability index for the three-ring data set with path-based clustering.
clusters is violated again. With k-means, the stability measure overestimates the "true" number of clusters, since k̂ = 6 is returned (see Figure 4B). Most of the other methods also fail in this case, estimating k̂ = 1, except for Clest, Levine's FOM, and the Model Explorer algorithm. Clest favors the 10-cluster solution; note, however, that this k̂ is returned only because the maximum number of clusters considered is 10. Levine's FOM and Ben-Hur's method both suggest k = 6, as our method does. When path-based clustering is employed, the "correct" number of clusters k = 3 is inferred by the stability-based approach (see Figure 4D). In this case, however, most competitors again fail to provide a useful estimate; only the methods by Levine and Domany (2001) and Ben-Hur et al. (2002) estimate k = 3, among other numbers of clusters. Again, the minimum stability index values for path-based clustering are significantly below the index values for k-means. This again indicates that the
Figure 4: Results of the stability index on the toy data (see section 3.2). (A) Clustering solution on the full spiral arm data set for k = 6 with k-means. (B) The stability index for the three-spirals data set with k-means clustering. (C) Clustering solution on the full spiral arm data set for k = 3 with path-based clustering. (D) The stability index for this data set.
stability index quantifies the reliability of clustering solutions and, hence, provides useful information for model (order) selection and validation of model assumptions.

3.3 Analysis of Gene Expression Data Sets. With the advent of microarrays, the expression levels of thousands of genes can be simultaneously monitored in routine biomedical experiments (Lander, 1999). Today, the huge amount of data arising from microarray experiments poses a major challenge for gaining biological or medical insight. Cluster analysis has turned out to be a useful and widely employed exploratory technique (see, e.g., Tamayo et al., 1999; Eisen, Spellman, Botstein, & Brown, 1998; Shamir & Sharan, 2001) for uncovering natural patterns of gene expression: for example, it is used to find groups of similarly expressed genes. Genes that are clustered together in this way are candidates for being coregulated and hence
are likely to be functionally related. Several studies have investigated their data using such an approach (e.g., Iyer et al., 1999). Another application domain is that of class discovery. Recently, several authors have investigated how novel tumor classes can be identified based exclusively on gene expression data (Golub et al., 1999; Alizadeh et al., 2000; Bittner et al., 2000). A fully Bayesian approach to class discovery is proposed in Roth and Lange (in press). Viewed as a clustering problem, a partition of the arrays is sought that collects samples of the same disease type in one class. Such a partitioning can be used in subsequent steps to identify indicative or explanatory genes for the different disease types. The final goal of many such studies is to detect marker genes that can be used to reliably classify and identify new cases. Due to the random nature of expression measurements, the main problem remains to assess and interpret the clustering solutions. We reinvestigate here the data sets studied by Golub et al. (1999) and Alizadeh et al. (2000), using the stability method to determine a suitable model order. Ground-truth information on the correct number of clusters is available for both data sets, which have also been reanalyzed by Dudoit and Fridlyand (2002).

3.3.1 Clustering of Leukemia Samples. Golub et al. (1999) studied in their analysis the problem of classifying acute leukemias. They used self-organizing maps (SOMs; see Duda et al., 2000) for unsupervised cancer classification. Nevertheless, the important question of inferring an appropriate model order remains unaddressed in their article, since a priori knowledge is used to select the number of clusters k. In practice, however, such knowledge is often not available. Acute leukemias can be roughly divided into two groups, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Furthermore, ALL can be subdivided into B-cell ALL and T-cell ALL. Golub et al. (1999) used a data set of 72 leukemia samples (25 AML, 47 ALL, of which 38 are B-cell ALL samples).2 For each sample, gene expression was monitored using Affymetrix expression arrays. Following the authors of the leukemia study, three preprocessing steps are applied to the expression data. First, all expression levels are restricted to the interval [100, 16,000], with values outside this interval being set to the boundary values. Second, genes are excluded for which the quotient of maximum and minimum expression levels across samples is smaller than 5 or the difference between maximum and minimum expression is smaller than 500. Finally, the data were log10-transformed. An additional step standardizes samples so that they have zero mean and unit variance across genes. The resulting data set consisted of 3571 genes and 72 samples.
2 Available on-line at http://www-genome.wi.mit.edu/cancer/.
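The preprocessing pipeline just described can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not code from the study; the function name, argument defaults, and the genes × samples matrix orientation are assumptions:

```python
import numpy as np

def preprocess_expression(X, floor=100.0, ceil=16000.0,
                          min_ratio=5.0, min_diff=500.0):
    """Sketch of the preprocessing steps described above.

    X: raw expression matrix, genes (rows) x samples (columns).
    """
    # 1. Restrict expression levels to [floor, ceil].
    X = np.clip(X, floor, ceil)
    # 2. Exclude genes with max/min < min_ratio or max - min < min_diff.
    gmax, gmin = X.max(axis=1), X.min(axis=1)
    X = X[(gmax / gmin >= min_ratio) & (gmax - gmin >= min_diff)]
    # 3. Log-transform.
    X = np.log10(X)
    # 4. Standardize each sample to zero mean, unit variance across genes.
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Applied to data prepared as in the leukemia study, steps 1–4 correspond exactly to the clipping, gene filtering, log transform, and sample standardization listed above.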
Figure 5: Stability index for the leukemia data set.
For the purpose of cluster analysis, the feature set was additionally reduced by retaining only the 100 genes with the highest variance across samples, because genes with low variance across samples are unlikely to be informative for grouping. This step is adopted from Dudoit and Fridlyand (2002). The final data set consists of 100 genes and 72 samples. Cluster analysis has been performed with k-means in conjunction with the nearest centroid classifier. Figure 5 shows the stability curve of the analysis for 2 ≤ k ≤ 10. For k = 3, we obtain the lowest empirical stability index. We expect that clustering with k = 3 separates AML, B-cell ALL, and T-cell ALL samples from each other. Figure 6 (bottom) shows the resulting labeling for this case. With respect to the known ground-truth labels, 91.6% of the samples (66 samples) are correctly classified (the bipartite matching is used again to map the clusters onto the ground-truth labeling). Note that similar stability is achieved for k = 2. Hence, we cluster the data set again for k = 2 and compare the result with the ALL-AML labeling of the data. The result is shown in Figure 6 (top). Here, 86.1% of the samples (62 samples) are correctly identified. The Gap Statistic overestimates the "true" number of clusters, while Prediction Strength does not provide any useful information, returning k̂ = 1. Clest infers the same number of clusters as our method does. The Model Explorer algorithm reveals a bias toward a smaller number of clusters, while Levine's FOM generates several local minima on this data set (not including k = 3). We conclude that our method is able to infer biologically relevant model orders. Furthermore, it suggests the model order for which a high accuracy is achieved. Had the true labeling been unknown, this reanalysis demonstrates that the different cancer classes could have been discovered based exclusively on the expression data by combining our model order selection principle with k-means clustering.
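The bipartite matching used above to align cluster labels with the ground truth (the Hungarian method; Kuhn, 1955) can be scored as follows. This is a minimal sketch with an illustrative function name, using SciPy's assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_agreement(true_labels, pred_labels):
    """Fraction of samples matching the ground truth after optimally
    relabeling the predicted clusters (maximum-weight bipartite matching)."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    k = int(max(t.max(), p.max())) + 1
    # counts[i, j] = number of samples with true label i, predicted label j.
    counts = np.zeros((k, k), dtype=int)
    np.add.at(counts, (t, p), 1)
    # Maximum-weight matching = minimum-cost assignment on negated counts.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(t)
```

With this score, an arbitrary permutation of cluster indices never changes the reported agreement, which is what accuracy figures such as the 91.6% and 86.1% above require.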
Figure 6: Results of the cluster analysis of the leukemia data set (100 genes, 72 samples) for the most stable model orders k = 2 (top) and k = 3 (bottom). The data are arranged according to the ground-truth labeling (the vertical bar labeled "True") for both k. The vertical bar "Pred" indicates the predicted cluster labels.
3.3.2 Clustering of Lymphoma Samples. Alizadeh et al. (2000) measured gene expression patterns for three different lymphoid malignancies: diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and chronic lymphocytic leukemia (CLL). They utilized a special microarray chip, the lymphochip, a cDNA microarray carrying a set of genes that is of importance in the context of lymphoid diseases. The study in Alizadeh et al. (2000) produced measurements for 4682 genes and 62 samples (42 samples of DLBCL, 11 samples of CLL, 9 samples of FL). The data set contained missing values, which were set to 0. After that, the data were standardized as above. Furthermore, the 200 genes with the highest variance across samples were selected for further processing, leading to a data set of 62 samples and 200 genes. Cluster analysis has been performed with k-means and the nearest centroid rule. The resulting stability curve is depicted in Figure 7. We estimate
Figure 7: Stability index for the lymphoma data set.
Figure 8: The lymphoma data set extracted from Alizadeh et al. (2000), together with a 2-means clustering solution and ground-truth information. Here, k-means almost perfectly separates DLBCL from CLL and FL. The bar labeled "Pred" indicates the cluster labels, while "True" refers to the ground-truth label information.
here k̂ = 2, since the stability index attains its lowest value for this number of clusters. Note that this contradicts the ground-truth number of clusters, k = 3. A closer look at the solutions (see Figure 8) reveals, however, that the split of the data produced by k-means with k = 2 is biologically plausible, since it separates DLBCL samples from FL and CLL samples. Furthermore, a 3-means solution does not separate DLBCL, FL, and CLL but instead splits the DLBCL cluster, generating two clusters consisting of FL, CLL, and DLBCL samples and one cluster of DLBCL samples only. In total, we get an agreement with the ground-truth labeling of over 98% for k = 2, but of only ≈ 57% for k = 3. Hence, our method
estimated the k for which the most reliable grouping with k-means can be achieved. Clest and Ben-Hur's method achieve the same result as we do. Levine's FOM leads to estimates of k = 2 and k = 9. The Prediction Strength method suggests an uninformative one-cluster solution, while the Gap Statistic proposes k = 4; the corresponding k-means grouping again mixes the three different classes in the same clusters. Again, we conclude that we find a biologically plausible partitioning of the data in an unsupervised way. Furthermore, the stability index has selected the number of clusters for which the most consistent data partition with regard to the ground truth is extracted by k-means clustering.
4 Conclusion

The concept of stability for assessing the quality of clustering solutions was introduced and successfully applied to the model order selection problem in unsupervised learning. The stability measure quantifies the expected dissimilarity of two solutions, where one solution is extended by a predictor to the other. In contrast to other approaches, the important role of the predictor is made explicit in the concept of cluster stability. Under the assumption that the labels produced by the clustering algorithm are the true labels, the disagreement can be interpreted as the attainable misclassification risk for two independent data sets from the same source. Normalizing the stability measure by the stability costs of a random predictor allows us to assess the suitability of different numbers of clusters k for a given data set and clustering algorithm in an objective way. To estimate the stability costs in practice, an empirical estimator is used that emulates independent samples by resampling.

Experiments were conducted on simulated and on well-known microarray data sets (Alizadeh et al., 2000; Golub et al., 1999). On the toy data, the stability method demonstrated its competitive performance under well-controlled conditions. Furthermore, we have pointed out where and why competing methods fail or provide unreliable answers. Concerning the gene expression data sets, our reanalysis shows that the stability index is a suitable technique for identifying reliable partitions of the data; the final groupings are in accordance with biological prior knowledge. We conclude that our validation scheme leads to reliable results and is therefore appropriate for assessing the quality of clustering solutions, in particular in biological applications, where prior knowledge is rarely available.
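As a rough illustration of the procedure summarized above (not the authors' implementation), the stability estimator can be sketched as follows. The toy k-means routine, all names, and the use of (k − 1)/k as the random-predictor normalization are simplifying assumptions:

```python
import numpy as np
from itertools import permutations

def stability_index(X, k, n_splits=20, seed=0):
    """Sketch: split the data in half, cluster both halves, transfer one
    solution to the other half with a nearest-centroid predictor, and
    average the relabeling-minimized disagreement, normalized by the
    expected disagreement of a random k-label predictor."""
    rng = np.random.default_rng(seed)

    def kmeans(data):
        # Tiny Lloyd's algorithm for illustration; a library implementation
        # with restarts would normally be used instead.
        centers = data[rng.choice(len(data), k, replace=False)]
        for _ in range(100):
            labels = ((data[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            new = np.array([data[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return labels, centers

    disagreements = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        half_a, half_b = X[idx[:len(X) // 2]], X[idx[len(X) // 2:]]
        _, centers_a = kmeans(half_a)
        labels_b, _ = kmeans(half_b)
        # Nearest-centroid predictor extends the solution on half A to half B.
        pred_b = ((half_b[:, None, :] - centers_a[None]) ** 2).sum(-1).argmin(1)
        # Disagreement, minimized over relabelings of one solution.
        disagreements.append(min(np.mean(np.array(p)[labels_b] != pred_b)
                                 for p in permutations(range(k))))
    return float(np.mean(disagreements)) / (1.0 - 1.0 / k)
```

Values near 0 then indicate that independent halves of the data yield essentially the same partition, mirroring the stability rankings reported in section 3.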
Acknowledgments

This work has been supported by the German Research Foundation, grants Buh 914/4 and Buh 914/5.
References

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, A., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., & Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing 2002 (pp. 6–17). Singapore: World Scientific.

Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Sampas, N., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Carpten, J., Gillanders, E., Leja, D., Dietrich, K., Beaudry, C., Berens, M., Alberts, D., & Sondak, V. (2000). Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(3), 536–540.

Breckenridge, J. (1989). Replicating cluster analysis: Method, consistency and validity. Multivariate Behavioral Research, 24, 147–161.

Buhmann, J. M. (1995). Data clustering and learning. In M. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 278–282). Cambridge, MA: MIT Press.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley.

Dudoit, S., & Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7). Available on-line: http://genomebiology.com/2002/317/research/0036.

Eisen, M., Spellman, P., Botstein, D., & Brown, P. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.

Fischer, B., Zöller, T., & Buhmann, J. M. (2001). Path based pairwise data clustering with application to texture segmentation. In M. A. T. Figueiredo, J. Zerubia, & A. K. Jain (Eds.), Energy minimization methods in computer vision and pattern recognition (LNCS). Berlin: Springer-Verlag.

Fowlkes, E., & Mallows, C. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553–584.

Fraley, C., & Raftery, A. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal, 41(8), 578–588.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., & Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Gordon, A. D. (1999). Classification (2nd ed.). London: Chapman & Hall.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE PAMI, 19(1), 1–14.

Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J., Jr., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., & Brown, P. O. (1999). The transcriptional program in the response of human fibroblasts to serum. Science, 283(1), 83–87.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Upper Saddle River, NJ: Prentice Hall.

Jain, A. K., Murty, M., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 265–323.

Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Res. Logist. Quart., 2, 83–97.

Lander, E. (1999). Array of hope. Nature Genetics Supplement, 21, 3–4.

Lange, T., Braun, M., Roth, V., & Buhmann, J. (2003). Stability-based model selection. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 617–624). Cambridge, MA: MIT Press.

Levine, E. (1999). Unsupervised estimation of cluster validity—methods and applications. Master's thesis, Weizmann Institute of Science.

Levine, E., & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13, 2573–2593.

Rose, K., Gurewitz, E., & Fox, G. (1992). Vector quantization and deterministic annealing. IEEE Trans. Inform. Theory, 38(4), 1249–1257.

Roth, V., & Lange, T. (in press). Bayesian class discovery in microarray datasets. IEEE Transactions on Biomedical Engineering.

Shamir, R., & Sharan, R. (2001). Algorithmic approaches to clustering gene expression data. In T. Jiang, T. Smith, Y. Xu, & M. Q. Zhang (Eds.), Current topics in computational biology. Cambridge, MA: MIT Press.

Sharan, R., & Shamir, R. (2000). CLICK: A clustering algorithm with applications to gene expression analysis. In ISMB'00 (pp. 307–316). Menlo Park, CA: AAAI Press.

Smyth, P. (1998).
Model selection for probabilistic clustering using cross-validated likelihood (Tech. Rep. 98-09). Irvine, CA: Information and Computer Science, University of California, Irvine. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., & Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. PNAS, 96, 2907–2912. Tibshirani, R., Walther, G., Botstein, D., & Brown, P. (2001). Cluster validation by prediction strength (Tech. Rep.). Stanford, CA: Statistics Department, Stanford University. Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters via the gap statistic. J. Royal Statist. Soc. B, 63(2), 411–423. Yeung, K., Haynor, D., & Ruzzo, W. (2001). Validating clustering for gene expression data. Bioinformatics, 17(4), 309–316. Received May 15, 2003; accepted November 24, 2003.
NOTE
Communicated by Kechen Zhang
Bayesian Estimation of Stimulus Responses in Poisson Spike Trains
Sidney R. Lehky
[email protected]
Cognitive Brain Mapping Laboratory, RIKEN Brain Science Institute, Saitama 351-0198, Japan, and Laboratory of Brain and Cognition, National Institute of Mental Health, Bethesda, MD 20892, U.S.A.
A Bayesian method is developed for estimating neural responses to stimuli, using likelihood functions incorporating the assumption that spike trains follow either pure Poisson statistics or Poisson statistics with a refractory period. The Bayesian and standard estimates of the mean and variance of responses are similar and asymptotically converge as the size of the data sample increases. However, the Bayesian estimate of the variance of the variance is much lower. This allows the Bayesian method to provide more precise interval estimates of responses. Sensitivity of the Bayesian method to the Poisson assumption was tested by conducting simulations perturbing the Poisson spike trains with noise. This did not affect Bayesian estimates of mean and variance to a significant degree, indicating that the Bayesian method is robust. The Bayesian estimates were less affected by the presence of noise than estimates provided by the standard method.

1 Introduction

The goal here is to use Bayesian methods to improve estimates of neural responses. The Bayesian analysis accomplishes this by incorporating additional information about spike train statistics into the analysis relative to standard methods, which simply calculate moments (e.g., mean, variance) of the data. In this analysis, we incorporate the assumption that spike counts are Poisson distributed. There have been previous applications of Bayesian methods to neural data (Brown et al., 1998; Martignon et al., 2000; Sanger, 1996; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998). However, the focus of those studies was determining the most likely stimulus given a set of responses within a neural population. Here we are not concerned with determining the most likely stimulus, but rather move the analysis to a lower level and infer the probability distribution of the response firing rate to one particular stimulus, given observations of spike counts within fixed time periods.
The emphasis is on practical data analysis methods rather than theoretical issues in neural coding.

Neural Computation 16, 1325–1343 (2004) © 2004 Massachusetts Institute of Technology
The Bayes formulation for the case at hand is

\[
\mathrm{prob}(\lambda \mid n, T) = \frac{\mathrm{prob}(n \mid \lambda, T)\,\mathrm{prob}(\lambda, T)}{\mathrm{prob}(n, T)},
\tag{1.1}
\]

where n is the observed spike count during time interval T and λ is the estimated spike rate. Omitting the normalization factor prob(n, T) in the denominator leaves the proportionality:

\[
\mathrm{prob}(\lambda \mid n, T) \propto \mathrm{prob}(n \mid \lambda, T)\,\mathrm{prob}(\lambda, T), \qquad \text{posterior} \propto \text{likelihood} \times \text{prior}.
\tag{1.2}
\]
2 The Likelihood Function

The experimental situation we envisage is that a particular stimulus is presented during multiple trials. Within each trial, there is a prestimulus period, when activity of the neuron is at spontaneous level, and a stimulus period. Spike counts from the two periods are used to form likelihood functions for the spontaneous and stimulus firing rates, λ_spont and λ_stim. From these, the likelihood function of the response λ_stim − λ_spont is derived. The Poisson probability density function (pdf) we assume gives the distribution of spike count n during period T, given mean spike count λT:

\[
\mathrm{prob}(n \mid \lambda, T) = \frac{(\lambda T)^n}{n!}\, e^{-\lambda T}.
\tag{2.1}
\]
What we actually need for a Bayesian estimate, however, is the likelihood of λ, which involves keeping the same equation but now treating n as a constant and λ as a variable, opposite to the situation in equation 2.1. This change in perspective converts equation 2.1 into a gamma-distribution-shaped likelihood function of λ, with shape parameter n + 1 and scale parameter 1/T:

\[
\mathrm{prob}(n \mid \lambda, T) = \mathrm{likelihood}(\lambda \mid n, T) = \frac{(\lambda T)^n}{\Gamma(n+1)}\, e^{-\lambda T}
\tag{2.2a}
\]
\[
= C \lambda^n e^{-\lambda T}.
\tag{2.2b}
\]
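Under a flat prior, a product of such gamma-shaped likelihoods over N trials yields a gamma posterior for λ with shape Σᵢ nⁱ + 1 and scale 1/(NT) (see section 6.1). A small numerical sketch with invented spike counts:

```python
import numpy as np
from scipy.stats import gamma

counts = np.array([7, 5, 9, 6, 8])   # invented spike counts from N = 5 trials
T = 1.0                              # trial duration (s)
N = len(counts)

# Flat prior x product of per-trial gamma-shaped likelihoods (eq. 2.2a)
# => gamma posterior with shape sum(n_i) + 1 and scale 1/(N T).
posterior = gamma(a=counts.sum() + 1, scale=1.0 / (N * T))

mmse_estimate = posterior.mean()         # posterior mean (MMSE)
map_estimate = counts.sum() / (N * T)    # posterior mode (MAP / ML)
```

Here the MMSE estimate is 7.2 spikes/s and the MAP estimate 7.0 spikes/s; the two converge as NT grows.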
(Multiplying this likelihood function by a constant factor T would normalize it to a gamma pdf.) In equation 2.2b, all the multiplicative constants are collected into C. The precise value of C is unimportant for likelihood calculations, as we are interested in relative rather than absolute values of the function. Assuming the prestimulus and stimulus spike rates are independent, the joint likelihood of λ_spont and λ_stim is formed by multiplying their individual likelihoods:

\[
\mathrm{likelihood}_i(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}} \mid n^i_{\mathrm{stim}}, n^i_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C_i\, \lambda_{\mathrm{stim}}^{n^i_{\mathrm{stim}}}\, \lambda_{\mathrm{spont}}^{n^i_{\mathrm{spont}}}\,
e^{-(\lambda_{\mathrm{stim}} T_{\mathrm{stim}} + \lambda_{\mathrm{spont}} T_{\mathrm{spont}})}
\tag{2.3}
\]
(for the ith stimulus presentation). All multiplicative constants are grouped into C_i, whose value again is not important. The joint likelihood for N stimulus trials is then found by multiplying the likelihood functions for the individual trials:

\[
\mathrm{likelihood}(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= \prod_{i=1}^{N} \left[ C_i\, \lambda_{\mathrm{stim}}^{n^i_{\mathrm{stim}}}\, \lambda_{\mathrm{spont}}^{n^i_{\mathrm{spont}}}\,
e^{-(\lambda_{\mathrm{stim}} T_{\mathrm{stim}} + \lambda_{\mathrm{spont}} T_{\mathrm{spont}})} \right].
\tag{2.4}
\]
3 Prior Probability Distribution

We define two prior distributions as examples. One possibility, the agnostic assumption, is that all response magnitudes are equally likely. This is represented by a two-dimensional uniform distribution,

\[
\mathrm{priorProb}_{\mathrm{Ag}}(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}}) =
\begin{cases}
\dfrac{1}{a^2} & 0 \le \lambda_{\mathrm{spont}}, \lambda_{\mathrm{stim}} \le a, \\
0 & \text{otherwise},
\end{cases}
\tag{3.1}
\]
with a defining a biophysically plausible range of firing rates. Another possibility, the skeptical assumption, is that there is no stimulus response; that is, λ_spont and λ_stim are on average identical. Given the long-term mean spontaneous spike rate λ̄_spont, this distribution, analogous to equation 2.3, is

\[
\mathrm{priorProb}_{\mathrm{Sk}}(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}} \mid \bar{\lambda}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C\, \lambda_{\mathrm{stim}}^{\bar{\lambda}_{\mathrm{spont}} T_{\mathrm{stim}}}\,
\lambda_{\mathrm{spont}}^{\bar{\lambda}_{\mathrm{spont}} T_{\mathrm{spont}}}\,
e^{-(\lambda_{\mathrm{stim}} T_{\mathrm{stim}} + \lambda_{\mathrm{spont}} T_{\mathrm{spont}})},
\tag{3.2}
\]
where C is the normalizing constant. Besides these example prior distributions, others may be suggested by the particular aspects of an experiment.

4 Posterior Probability Distribution

The posterior probability of the firing rates is proportional to the product of the prior probability (see equation 3.1 or 3.2) and the joint likelihood function (see equation 2.4):

\[
\mathrm{postProb}(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C \left[ \mathrm{priorProb}(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}})\,
\mathrm{likelihood}(\lambda_{\mathrm{stim}}, \lambda_{\mathrm{spont}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}}) \right],
\tag{4.1}
\]

in which C normalizes the area under postProb(λ_stim, λ_spont).
What we require, however, rather than the joint pdf of λ_spont and λ_stim, is the distribution of the response λ_resp = λ_stim − λ_spont. This is accomplished by first changing variables from prob(λ_stim, λ_spont) to prob(λ_stim − λ_spont, λ_stim + λ_spont), which involves a rotation and dilation of the coordinate axes. Then prob(λ_stim − λ_spont, λ_stim + λ_spont) is integrated along the λ_stim + λ_spont axis to give the marginal distribution prob(λ_stim − λ_spont). For convenience, we relabel the transformed variables as follows:

\[
\lambda_{\mathrm{sum}} = \lambda_{\mathrm{stim}} + \lambda_{\mathrm{spont}}, \qquad
\lambda_{\mathrm{dif}} = \lambda_{\mathrm{stim}} - \lambda_{\mathrm{spont}}.
\tag{4.2}
\]
Under the transformed variables, the likelihood function for a single trial (analogous to equation 2.3) becomes

\[
\mathrm{likelihood}_i(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}} \mid n^i_{\mathrm{stim}}, n^i_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C_i\, (\lambda_{\mathrm{sum}} + \lambda_{\mathrm{dif}})^{n^i_{\mathrm{stim}}}\,
(\lambda_{\mathrm{sum}} - \lambda_{\mathrm{dif}})^{n^i_{\mathrm{spont}}}\,
e^{-\frac{1}{2}\left[(\lambda_{\mathrm{sum}} + \lambda_{\mathrm{dif}}) T_{\mathrm{stim}} + (\lambda_{\mathrm{sum}} - \lambda_{\mathrm{dif}}) T_{\mathrm{spont}}\right]}.
\tag{4.3}
\]
The joint likelihood over N stimulus trials is then found by taking the product over individual trials, analogous to equation 2.4. The agnostic prior, equation 3.1, remains essentially unchanged under the transformed coordinates because it is a constant. The transformed equation for the skeptical prior, equation 3.2, becomes

\[
\mathrm{priorProb}_{\mathrm{Sk}}(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}} \mid \bar{\lambda}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C\, (\lambda_{\mathrm{sum}} + \lambda_{\mathrm{dif}})^{\bar{\lambda}_{\mathrm{spont}} T_{\mathrm{stim}}}\,
(\lambda_{\mathrm{sum}} - \lambda_{\mathrm{dif}})^{\bar{\lambda}_{\mathrm{spont}} T_{\mathrm{spont}}}\,
e^{-\frac{1}{2}\left[(\lambda_{\mathrm{sum}} + \lambda_{\mathrm{dif}}) T_{\mathrm{stim}} + (\lambda_{\mathrm{sum}} - \lambda_{\mathrm{dif}}) T_{\mathrm{spont}}\right]}.
\tag{4.4}
\]
Given the transformed likelihood and prior probability functions, the posterior probability is

\[
\mathrm{postProb}(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C \left[ \mathrm{priorProb}(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}})
\times \mathrm{likelihood}(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}}) \right],
\tag{4.5}
\]
where again C normalizes the distribution. Integrating equation 4.5 with respect to λ_sum takes us from postProb(λ_sum, λ_dif) to postProb(λ_dif):

\[
\mathrm{postProb}(\lambda_{\mathrm{dif}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})
= C \int_{-\infty}^{\infty} \mathrm{priorProb}(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}})\,
\mathrm{likelihood}(\lambda_{\mathrm{sum}}, \lambda_{\mathrm{dif}} \mid \vec{n}_{\mathrm{stim}}, \vec{n}_{\mathrm{spont}}, T_{\mathrm{stim}}, T_{\mathrm{spont}})\, d\lambda_{\mathrm{sum}}.
\tag{4.6}
\]
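The marginalization in equation 4.6 can be approximated on a discrete grid. The sketch below assumes a flat prior and invented spike counts, and collects the joint posterior mass into bins of λ_dif = λ_stim − λ_spont:

```python
import numpy as np

# Invented spike counts over N = 3 trials.
n_stim, n_spont = np.array([12, 15, 11]), np.array([4, 6, 5])
T_stim = T_spont = 1.0
N = len(n_stim)

lam = np.linspace(1e-3, 40.0, 400)                 # grid for each rate
L_stim, L_spont = np.meshgrid(lam, lam, indexing="ij")

# Joint log posterior under a flat prior (product of eq. 2.3 over trials).
logp = (n_stim.sum() * np.log(L_stim) - N * T_stim * L_stim
        + n_spont.sum() * np.log(L_spont) - N * T_spont * L_spont)
p = np.exp(logp - logp.max())                      # unnormalized posterior

# Marginalize onto lambda_dif by histogramming the grid mass.
dif = (L_stim - L_spont).ravel()
hist, edges = np.histogram(dif, bins=200, weights=p.ravel(), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
post_mean_dif = (centers * hist).sum() / hist.sum()
```

Since the two rates are a posteriori independent gamma variables here, the marginal's mean should match the difference of the individual posterior means, which serves as a check on the grid computation.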
This is the final result we seek: the posterior probability distribution of the stimulus response.

5 Refractory Periods

Thus far, the analysis has been carried out for a pure Poisson process. A significant perturbation away from this ideal is caused by the existence of a refractory period. The refractory period acts to regularize a spike train, reducing the variance of spike counts within a fixed period, and therefore the variance of observed firing rates. Given mean interspike interval t̄_ISI and refractory period t_ref, the fractional variance relative to a pure Poisson process is

\[
\nu = \frac{\mathrm{Var}(\lambda)_{\mathrm{ref}}}{\mathrm{Var}(\lambda)_{\mathrm{pure}}}
= \left( \frac{\bar{t}_{\mathrm{ISI}} - t_{\mathrm{ref}}}{\bar{t}_{\mathrm{ISI}}} \right)^2.
\tag{5.1}
\]

(By definition of t_ref, t̄_ISI < t_ref is impossible.) Since t̄_ISI = T/n,

\[
\nu = \left( 1 - t_{\mathrm{ref}}\, \frac{n}{T} \right)^2.
\tag{5.2}
\]
To take the refractory period into account, therefore, the variance of the likelihood function (as well as of nonuniform priors) should be reduced by the refractory period correction factor ν. A modification to the likelihood function, equation 2.2b, that approximates this transform is

\[
\mathrm{likelihood}(\lambda \mid n, T, \nu) =
\begin{cases}
C\, \lambda^{\frac{n+1}{\nu} - 1}\, e^{-\lambda T / \nu} & 0 < \nu \le 1, \\[4pt]
\delta\!\left( \lambda - \dfrac{n+1}{T} \right) & \nu = 0,
\end{cases}
\tag{5.3}
\]

where δ is the Dirac delta function. This approximation leads to small biases in the estimates of spike statistics. Because the purpose of this study is to quantify properties of the Bayesian method, an exact refractory period correction was carried out numerically in the simulations described below. Using a likelihood function and prior corrected for refractory periods, the rest of the analysis proceeds as before.

6 Mean, Variance, and Variance of Variance of Response Estimates

The equations that follow are derived for the situations of a pure Poisson process and a pure Poisson process plus refractory period. A uniform prior is assumed. In a later section, we present simulations that examine what happens when the pure Poisson process plus refractory period is further perturbed by adding noise.
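A quick numerical check of the refractory correction defined in equations 5.2 and 5.3 (the values of n, T, and t_ref below are invented): the modified likelihood keeps its mean at (n + 1)/T while its variance shrinks by the factor ν.

```python
import numpy as np

def refractory_likelihood(lam, n, T, t_ref):
    """Unnormalized likelihood of eq. 5.3 for 0 < nu <= 1 (sketch)."""
    nu = (1.0 - t_ref * n / T) ** 2             # eq. 5.2
    return lam ** ((n + 1) / nu - 1.0) * np.exp(-lam * T / nu)

# Compare moments for a pure Poisson trial and one with a 2 ms refractory
# period (n = 40 spikes in T = 1 s).
lam = np.linspace(0.0, 100.0, 20001)[1:]
var = {}
for t_ref in (0.0, 0.002):
    w = refractory_likelihood(lam, n=40, T=1.0, t_ref=t_ref)
    mean = (lam * w).sum() / w.sum()
    var[t_ref] = ((lam - mean) ** 2 * w).sum() / w.sum()
```

The ratio var[0.002] / var[0.0] comes out at (1 − 0.002 · 40)² ≈ 0.846, as equation 5.2 predicts.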
Response statistics can be presented as a function of either the observed spike counts n or the spike density parameter r (which is known in the case of simulations). The relation between these two variables is given by the formula for the average spike count per trial:

\[
\frac{\sum_{i=1}^{N} n^i}{N} = r T.
\tag{6.1}
\]
6.1 Mean

6.1.1 Bayesian. Under the Bayesian analysis, if spike counts are Poisson distributed, then spike rates have a gamma-distribution-shaped likelihood function (see equation 2.2a). Given a set of N spike trains with spike counts n^i, i = 1, …, N, and duration T, the posterior probability formed by multiplying together a set of such likelihood functions with a uniform prior and the appropriate normalization factor gives a gamma distribution with shape parameter \(\sum_{i=1}^{N} n^i + 1\) and scale parameter 1/(NT). Given this posterior probability distribution, the optimal location estimate of λ can be defined as either the mode or the mean of the distribution, depending on the optimality criterion (Kay, 1993). Using the mean gives the minimum mean square error (MMSE) estimate, while using the mode gives the maximum a posteriori (MAP) estimate, which for a uniform prior is also the maximum likelihood (ML) estimate. The MMSE estimate leads to an expected spike rate of

\[
E(\lambda)_{\mathrm{bay}} = \frac{\sum_{i=1}^{N} n^i + 1}{NT} = r + \frac{1}{NT},
\tag{6.2}
\]

while the MAP estimate is

\[
\mathrm{mode}(\lambda)_{\mathrm{bay}} = \frac{\sum_{i=1}^{N} n^i}{NT} = r.
\tag{6.3}
\]
The trade-off between the two is lower bias versus lower mean square error. The estimated value of the response λ_resp = λ_stim − λ_spont is then either

\[
E(\lambda_{\mathrm{resp}})_{\mathrm{bay}}
= \frac{\sum_{i=1}^{N} n^i_{\mathrm{stim}} + 1}{N T_{\mathrm{stim}}} - \frac{\sum_{i=1}^{N} n^i_{\mathrm{spont}} + 1}{N T_{\mathrm{spont}}}
= r_{\mathrm{stim}} - r_{\mathrm{spont}} + \frac{1}{N} \left( \frac{1}{T_{\mathrm{stim}}} - \frac{1}{T_{\mathrm{spont}}} \right)
\tag{6.4}
\]

or

\[
\mathrm{mode}(\lambda_{\mathrm{resp}})_{\mathrm{bay}}
= \frac{\sum_{i=1}^{N} n^i_{\mathrm{stim}}}{N T_{\mathrm{stim}}} - \frac{\sum_{i=1}^{N} n^i_{\mathrm{spont}}}{N T_{\mathrm{spont}}}
= r_{\mathrm{stim}} - r_{\mathrm{spont}}.
\tag{6.5}
\]
In the discussion below, we shall assume the MMSE estimate.

6.1.2 Standard. Under the standard analysis, the mean firing rate is

\[
E(\lambda)_{\mathrm{std}} = \frac{\sum_{i=1}^{N} n^i}{NT} = r.
\tag{6.6}
\]

The expected value of λ_resp then becomes

\[
E(\lambda_{\mathrm{resp}})_{\mathrm{std}}
= \frac{\sum_{i=1}^{N} n^i_{\mathrm{stim}}}{N T_{\mathrm{stim}}} - \frac{\sum_{i=1}^{N} n^i_{\mathrm{spont}}}{N T_{\mathrm{spont}}}
= r_{\mathrm{stim}} - r_{\mathrm{spont}}.
\tag{6.7}
\]
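The relation between the Bayesian MMSE estimate (equation 6.4) and the standard estimate (equation 6.7) can be checked on simulated Poisson counts; the rates and period durations below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
r_stim, r_spont = 20.0, 5.0        # invented true rates (spikes/s)
T_stim, T_spont = 0.5, 1.0         # unequal periods, so the estimates differ

for N in (5, 500):
    n_stim = rng.poisson(r_stim * T_stim, N)
    n_spont = rng.poisson(r_spont * T_spont, N)
    std_est = n_stim.sum() / (N * T_stim) - n_spont.sum() / (N * T_spont)
    bay_est = ((n_stim.sum() + 1) / (N * T_stim)
               - (n_spont.sum() + 1) / (N * T_spont))
    gap = bay_est - std_est        # exactly (1/N)(1/T_stim - 1/T_spont)
```

The gap is 0.2 spikes/s at N = 5 but only 0.002 at N = 500, illustrating the asymptotic convergence noted in the text.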
Thus, the Bayesian MMSE estimate and the standard estimate of the response differ (except for the special case T_stim = T_spont), with the two estimates asymptotically converging as the sample size NT increases. The Bayesian MAP estimate is identical to the standard estimate.

6.2 Variance

6.2.1 Bayesian. Given a posterior probability that is a gamma distribution with shape parameter \(\sum_{i=1}^{N} n^i + 1\) and scale parameter 1/(NT), the expected variance of the firing rate is

\[
E(\mathrm{Var}(\lambda))_{\mathrm{bay}} = \frac{\sum_{i=1}^{N} n^i + 1}{(NT)^2} = \frac{r}{NT} + \frac{1}{(NT)^2}.
\tag{6.8}
\]

The variance of the response λ_resp is then

\[
E(\mathrm{Var}(\lambda_{\mathrm{resp}}))_{\mathrm{bay}}
= \frac{\sum_{i=1}^{N} n^i_{\mathrm{stim}} + 1}{(N T_{\mathrm{stim}})^2} + \frac{\sum_{i=1}^{N} n^i_{\mathrm{spont}} + 1}{(N T_{\mathrm{spont}})^2}
= \frac{1}{N} \left( \frac{r_{\mathrm{stim}}}{T_{\mathrm{stim}}} + \frac{r_{\mathrm{spont}}}{T_{\mathrm{spont}}} \right)
+ \frac{1}{N^2} \left( \frac{1}{T_{\mathrm{stim}}^2} + \frac{1}{T_{\mathrm{spont}}^2} \right).
\tag{6.9}
\]
The effect of the refractory period on expected variance can be approximated if we define, following equation 5.2, a mean refractory period correction factor:

\[
\bar{\nu} = \left( 1 - t_{\mathrm{ref}}\, \frac{\sum_{i=1}^{N} n^i}{NT} \right)^2 = (1 - r\, t_{\mathrm{ref}})^2.
\tag{6.10}
\]

Using this, the corrected equation 6.9 becomes

\[
E(\mathrm{Var}(\lambda_{\mathrm{resp}}))_{\mathrm{bay}}
= \frac{\bar{\nu}_{\mathrm{stim}} \left( \sum_{i=1}^{N} n^i_{\mathrm{stim}} + \bar{\nu}_{\mathrm{stim}} \right)}{(N T_{\mathrm{stim}})^2}
+ \frac{\bar{\nu}_{\mathrm{spont}} \left( \sum_{i=1}^{N} n^i_{\mathrm{spont}} + \bar{\nu}_{\mathrm{spont}} \right)}{(N T_{\mathrm{spont}})^2}
= \frac{1}{N} \left( \frac{\bar{\nu}_{\mathrm{stim}} r_{\mathrm{stim}}}{T_{\mathrm{stim}}} + \frac{\bar{\nu}_{\mathrm{spont}} r_{\mathrm{spont}}}{T_{\mathrm{spont}}} \right)
+ \frac{1}{N^2} \left( \frac{\bar{\nu}_{\mathrm{stim}}^2}{T_{\mathrm{stim}}^2} + \frac{\bar{\nu}_{\mathrm{spont}}^2}{T_{\mathrm{spont}}^2} \right).
\tag{6.11}
\]
Although it can be convenient to use \bar{\nu} on pooled data rather than doing full trial-by-trial calculations with different values of \nu for each trial, this introduces biases in calculated refractory-corrected variances, as well as variances of variances.

6.2.2 Standard. Given spike counts n that are Poisson distributed, Var(n) = \sum_{i=1}^{N} n^i / N. We also have the mean firing rate E(\lambda) as a function of n, equation 6.6. Combining these two relations gives the variance of \lambda from its mean, leading to

E(Var(\lambda))_{std} = \frac{\sum_{i=1}^{N} n^i}{(NT)^2} = \frac{r}{NT}.   (6.12)
Response variance is then

E(Var(\lambda_{resp}))_{std} = \frac{\sum_{i=1}^{N} n^i_{stim}}{(NT_{stim})^2} + \frac{\sum_{i=1}^{N} n^i_{spont}}{(NT_{spont})^2} = \frac{1}{N}\left(\frac{r_{stim}}{T_{stim}} + \frac{r_{spont}}{T_{spont}}\right).   (6.13)
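The relation between the two expected variances can be seen numerically (a sketch with our own function names, evaluating equations 6.9 and 6.13 with the spike counts at their expected values Σ n^i = rNT; note that the sample-based standard estimate itself requires at least two trials, though the closed form is defined for any N):

```python
# Bayesian (eq. 6.9) vs. standard (eq. 6.13) expected response variance.
# The Bayesian value exceeds the standard one by the 1/N^2 term, so the
# two converge as the number of trials N grows.

def var_bayes(r_stim, r_spont, T_stim, T_spont, N):
    return (r_stim / T_stim + r_spont / T_spont) / N + \
           (1 / T_stim**2 + 1 / T_spont**2) / N**2

def var_std(r_stim, r_spont, T_stim, T_spont, N):
    return (r_stim / T_stim + r_spont / T_spont) / N

for N in (1, 2, 5, 10):
    print(N, var_std(24, 8, 0.5, 0.5, N), var_bayes(24, 8, 0.5, 0.5, N))
# N = 2 gives 32.0 vs. 34.0 and N = 10 gives 6.4 vs. 6.48,
# the values appearing in Table 1 below.
```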
Bayesian Estimation of Neural Responses
1333
Bayesian response variances (see equation 6.9) are larger than those calculated using the standard method (see equation 6.13), although they converge for large sample sizes. With correction for refractory period:

E(Var(\lambda_{resp}))_{std} = \frac{\bar{\nu}_{stim}\sum_{i=1}^{N} n^i_{stim}}{(NT_{stim})^2} + \frac{\bar{\nu}_{spont}\sum_{i=1}^{N} n^i_{spont}}{(NT_{spont})^2} = \frac{1}{N}\left(\frac{\bar{\nu}_{stim}\, r_{stim}}{T_{stim}} + \frac{\bar{\nu}_{spont}\, r_{spont}}{T_{spont}}\right).   (6.14)
6.3 Variance of the Variance. Variance of the variance is important because it determines the precision of an interval estimate of a parameter.

6.3.1 Bayesian. From equation 6.8, we have the firing rate variance as a function of n. Again, since spike counts are Poisson distributed, Var(n) = \sum_{i=1}^{N} n^i / N. Combining these two relations gives the variance of the variance:

Var(Var(\lambda))_{bay} = \frac{\sum_{i=1}^{N} n^i}{(NT)^4} = \frac{r}{(NT)^3}.   (6.15)
Response variance of variance is

Var(Var(\lambda_{resp}))_{bay} = \frac{1}{N^4}\left(\frac{\sum_{i=1}^{N} n^i_{stim}}{T_{stim}^4} + \frac{\sum_{i=1}^{N} n^i_{spont}}{T_{spont}^4}\right) = \frac{1}{N^3}\left(\frac{r_{stim}}{T_{stim}^3} + \frac{r_{spont}}{T_{spont}^3}\right),   (6.16)
and with correction for refractory period:

Var(Var(\lambda_{resp}))_{bay} = \frac{1}{N^4}\left(\frac{\bar{\nu}_{stim}^2\sum_{i=1}^{N} n^i_{stim}}{T_{stim}^4} + \frac{\bar{\nu}_{spont}^2\sum_{i=1}^{N} n^i_{spont}}{T_{spont}^4}\right) = \frac{1}{N^3}\left(\frac{\bar{\nu}_{stim}^2\, r_{stim}}{T_{stim}^3} + \frac{\bar{\nu}_{spont}^2\, r_{spont}}{T_{spont}^3}\right).   (6.17)
6.3.2 Standard. For mean spike count per trial \sum_{i=1}^{N} n^i / N not close to zero, the firing rate estimate \lambda is approximately normally distributed. Performing the subtraction \lambda_{resp} = \lambda_{stim} - \lambda_{spont} provides an even better approximation of normality. A normally distributed variable has variances that are chi-square distributed with N - 1 degrees of freedom, and the variance of the variance is

Var(Var(\lambda))_{std} = \frac{2\,(E(Var(\lambda)))^2}{N - 1}.   (6.18)
Substituting the value of E(Var(\lambda)) in equation 6.12 gives

Var(Var(\lambda))_{std} = \frac{2\left(\sum_{i=1}^{N} n^i\right)^2}{(N - 1)(NT)^4} = \frac{2r^2}{(N - 1)(NT)^2}.   (6.19)
The response variance of variance is

Var(Var(\lambda_{resp}))_{std} = \frac{2}{(N - 1)N^4}\left(\frac{\sum_{i=1}^{N} n^i_{stim}}{T_{stim}^2} + \frac{\sum_{i=1}^{N} n^i_{spont}}{T_{spont}^2}\right)^2 = \frac{2}{(N - 1)N^2}\left(\frac{r_{stim}}{T_{stim}} + \frac{r_{spont}}{T_{spont}}\right)^2,   (6.20)
and corrected for refractory period:

Var(Var(\lambda_{resp}))_{std} = \frac{2}{(N - 1)N^4}\left(\frac{\bar{\nu}_{stim}\sum_{i=1}^{N} n^i_{stim}}{T_{stim}^2} + \frac{\bar{\nu}_{spont}\sum_{i=1}^{N} n^i_{spont}}{T_{spont}^2}\right)^2 = \frac{2}{(N - 1)N^2}\left(\frac{\bar{\nu}_{stim}\, r_{stim}}{T_{stim}} + \frac{\bar{\nu}_{spont}\, r_{spont}}{T_{spont}}\right)^2.   (6.21)
Firing rate statistics calculated from the above equations for a pure Poisson process (without refractory period) are presented in Table 1, for parameters r_stim = 24 Hz, r_spont = 8 Hz, and T_stim = T_spont = 0.5 sec. Table 1 shows identical values for E(λ) under both methods, as expected for the special case T_stim = T_spont. Bayesian variances E(Var(λ)) are slightly larger than standard estimates but converge as the number of trials increases. Finally, the Bayesian method gives much smaller values for the variance of the variance Var(Var(λ)) than the standard analysis.

7 Estimating Neural Responses for Spike Trains with a Refractory Period

The analysis is now applied to the simulated data shown in Figure 1. At the bottom is a raster plot of spike trains from 20 trials in a simulated experiment.
Table 1: Stimulus Response Statistics for a Pure Poisson Process.

             E(λ_resp)              E(Var(λ_resp))         Var(Var(λ_resp))
Trials   Standard   Bayesian    Standard   Bayesian    Standard   Bayesian
  1       16.0       16.0        –          72.0        –          256
  2       16.0       16.0        32.0       34.0        2048       32.0
  5       16.0       16.0        12.8       13.1        81.9       2.05
 10       16.0       16.0        6.40       6.48        9.10       0.256

Notes: These were calculated using parameters r_stim = 24 Hz, r_spont = 8 Hz, and T_stim = T_spont = 0.5 sec. Bayesian estimates are based on a uniform prior and use the MMSE method.
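The Var(Var(λ_resp)) columns of Table 1 can be checked against the closed forms (a sketch of ours, evaluating equations 6.16 and 6.20 with the spike counts at their expected values Σ n^i = rNT):

```python
# Check the Table 1 variance-of-variance columns from equations 6.16 and
# 6.20, with counts at their expected values; function names are ours.
r_stim, r_spont, T = 24.0, 8.0, 0.5

def varvar_bayes(N):                                   # equation 6.16
    return (r_stim / T**3 + r_spont / T**3) / N**3

def varvar_std(N):                                     # equation 6.20, N >= 2
    return 2 * (r_stim / T + r_spont / T)**2 / ((N - 1) * N**2)

print(varvar_bayes(1))                 # 256.0
for N in (2, 5, 10):
    print(N, varvar_std(N), varvar_bayes(N))
# Reproduces Table 1: (2048, 32.0), (81.9, 2.05), (9.10, 0.256).
```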
Figure 1: Simulated spike trains used to demonstrate the Bayesian analysis. At bottom is a raster plot of 20 trials (Trial vs. Time in msec); the top shows the peristimulus time histogram (Spikes/sec). Spike counts are Poisson distributed, but with a 3 ms refractory period included. The timescale has been shifted by the latency so that zero coincides with the onset of stimulus response.
The spikes are Poisson distributed but include a 3 ms refractory period. In each trial, there is a 0.5 sec prestimulus period with spontaneous activity of 8 Hz, followed by a 0.5 sec stimulus period in which activity rises to 24 Hz. Activity then returns to spontaneous. The timescale has been translated by the stimulus latency so that the stimulus period starts at zero.

The response distribution is found by starting with a prior distribution and successively multiplying in the likelihood function for each trial. For example, the first trial has 4 spikes in the prestimulus period and 20 spikes in the stimulus period. Substituting n^1_spont = 4, n^1_stim = 20, T^1_stim = 0.500, and T^1_spont = 0.500 into the refractory-period corrected analog of equation 4.3 gives the likelihood(λ_sum, λ_dif) for that trial. Following incorporation of data from the final trial, the distribution postProb(λ_sum, λ_dif) is integrated along λ_sum to give the response distribution, postProb(λ_dif). (For the discussion below, λ_dif shall be relabeled λ_resp.)

Figure 2 illustrates the evolution of the response distribution as data from successive trials are included, starting from two prior distributions. Also shown are the conventional means and standard errors of the response. No standard error bar is shown for the Trial = 1 panel, because the conventional method does not allow calculating that statistic from a single trial. The first panel plots the priors themselves (0th trial).

We can obtain a better idea of the relation between the Bayesian and conventional estimates by comparing statistics from multiple replications of the experiment in Figure 1, determined for a uniform prior. Results of the simulations are given in Table 2, based on 10,000 experiment replications with N stimulus trials per experiment, and with r_stim = 24 Hz, r_spont = 8 Hz, and T_stim = T_spont = 0.5 sec, where r is the spike density parameter used to generate the spike trains.
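The paper's trial-by-trial accumulation runs over the joint distribution of (λ_sum, λ_dif) with refractory corrections. As a simplified illustration (a sketch of ours that omits both of those features), the same logic for a single Poisson rate reduces to a conjugate gamma update, which reproduces the gamma posterior of section 6.2.1 (shape Σ n^i + 1, scale 1/(NT)) when started from a flat prior:

```python
# Simplified, single-rate version of the trial-by-trial Bayesian update
# (illustration only; the paper's analysis is over (lambda_sum, lambda_dif)
# with refractory-period corrections).

def update(posterior, n_spikes, T):
    """Conjugate gamma update for a Poisson likelihood."""
    shape, rate = posterior
    return (shape + n_spikes, rate + T)

posterior = (1.0, 0.0)                      # flat (uniform) prior on the rate
trials = [(20, 0.5), (11, 0.5), (14, 0.5)]  # (spike count, epoch duration)
for n, T in trials:
    posterior = update(posterior, n, T)

shape, rate = posterior
print(shape, rate)      # 46.0 1.5, i.e. gamma(sum(n) + 1, scale 1/(N*T))
print(shape / rate)     # posterior mean (MMSE) firing rate estimate
```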
Table 2 shows E(λ_resp), the mean response over 10,000 replications; E(Var(λ_resp)), the mean of the 10,000 response variances; and Var(Var(λ_resp)), the variance of the 10,000 response variances. Under these simulations, E(λ_resp) is identical under both methods, as expected for the special case T_stim = T_spont, and equal to r_stim − r_spont. Values of λ_resp are approximately normally distributed in both cases.
Table 2: Stimulus Response Statistics from Simulations Using Homogeneous Poisson Spike Trains with a 3 Msec Refractory Period Added.

             E(λ_resp)              E(Var(λ_resp))         Var(Var(λ_resp))
Trials   Standard   Bayesian    Standard   Bayesian    Standard   Bayesian
  1       16.0       16.0        –          63.2        –          136
  2       16.0       16.0        28.8       29.9        1590       16.7
  5       16.0       16.0        11.6       11.6        67.9       1.10
 10       16.0       16.0        5.80       5.72        7.54       0.140

Note: Simulation parameters are as given in Table 1.
Figure 2: Evolution of the Bayesian probability distribution of responses (λ_resp = λ_stim − λ_spont) as data from increasing numbers of trials accumulate (panels for Trial = 0, 1, 2, 5, 10, and 20; relative probability density vs. λ_resp in spikes/sec). The first panel (Trial 0) shows two prior distributions: the uniform distribution (dotted line) and a distribution that assumes stimulus and spontaneous firing rates are identical (dashed line). Solid vertical and horizontal lines in the panels indicate means and standard errors calculated in the standard way. All curves have been normalized to a height of one.
Figure 3: Distributions of response variances for standard and Bayesian methods (probability density vs. Var(λ_resp) in (spikes/sec)^2), derived from 25,000 replications of an experiment with five stimulus trials per experiment (N = 5). The probability density functions were estimated from the simulated data by kernel smoothing. Although both Var(λ_resp) distributions have almost identical E(Var(λ_resp)) values (see Table 2), the Bayesian method produces a much smaller Var(Var(λ_resp)). The distribution produced by the standard method is chi square with N − 1 = 4 degrees of freedom (see equation 7.1), while the Bayesian method leads to approximately normally distributed variances.
Moving to E(Var(λ_resp)), the variances for both methods are smaller than those for a pure Poisson process (comparing Table 2 with Table 1). This reflects the presence of a 3 msec refractory period in the simulated spike trains that served as data. Bayesian variances are slightly larger than the standard ones, with the values converging as the number of trials N increases.

Although both methods produce almost identical values of E(Var(λ_resp)), we see in Table 2 that Var(Var(λ_resp)) is substantially smaller for the Bayesian method (as was the case calculated for a pure Poisson process in Table 1). This relationship can also be seen in the plot of the distribution of Var(λ_resp) for N = 5 trials (see Figure 3). As λ_resp is normally distributed, under the standard analysis the distribution of Var(λ_resp) is chi-square distributed with m = N − 1 degrees of freedom:

prob(Var(\lambda) \mid m, \sigma) = \frac{(Var(\lambda))^{m/2 - 1}\exp\left(-\frac{Var(\lambda)}{2\sigma^2}\right)}{2^{m/2}\,\Gamma(m/2)\,\sigma^m}

E(Var(\lambda)) = m\sigma^2

Var(Var(\lambda)) = 2m\sigma^4.   (7.1)
Combining expressions for E(Var(λ)) in equations 6.14 and 7.1, a calculated value of σ can be derived:

\sigma = \sqrt{\frac{\bar{\nu}_{stim}\, r_{stim}}{(N - 1)NT_{stim}} + \frac{\bar{\nu}_{spont}\, r_{spont}}{(N - 1)NT_{spont}}}.   (7.2)

For the N = 5 simulations shown in Figure 3, it can be determined that the distribution for the standard method does indeed follow a chi-square distribution, with a least-squares fit of σ = 1.70, compared to a calculated value of σ = 1.68. The value of Var(Var(λ_resp)) observed in these simulations was 67.9, compared to 64.0 calculated from equation 6.19. The bias in variance of variance calculations arises because of the use of an average refractory period correction factor ν̄ rather than trial-by-trial values of ν.

In contrast, the Bayesian method, despite also having a normally distributed response estimate, produces a Var(λ_resp) that is not chi square but approximately normally distributed. The reason is that under the Bayesian method, Var(λ_resp) is not calculated from the sample of λ_resp estimates, but rather from the width of the likelihood functions. For the N = 5 example, the observed Var(Var(λ_resp)) in the simulations was 1.10, lower than the value of 1.60 calculated from equation 6.17, again showing a bias in the calculated estimate caused by doing refractory period corrections on pooled data rather than on a trial-by-trial basis.

8 Effects of Noise on Bayesian Estimates

The Bayesian analysis assumes that spike trains (either spontaneous or stimulus) form homogeneous Poisson processes, in which firing rates do not vary with time within each epoch. In reality, the data will deviate from this. Here we examine the robustness of the Bayesian statistics by testing how well the method works when inhomogeneous Poisson spike trains serve as inputs. Two cases are considered: fast-fluctuation noise, in which spike rates vary rapidly relative to the duration of a single trial, and slow-fluctuation noise, in which spike rates are constant within a trial but vary randomly from trial to trial.
In both cases, we retain the presence of a 3 msec refractory period. Fast-fluctuation noise consisted of Poisson spike trains whose randomly varying instantaneous spike density parameter was described by low-pass filtered gaussian white noise. The white-noise-perturbed spike density was set with means r̄_spont = 8 Hz and r̄_stim = 24 Hz and standard deviations of r(t) equal to 0.5 r̄. The noise was passed through a Butterworth low-pass filter:

\frac{A_{out}}{A_{in}} = \frac{1}{\left[1 + (f/f_c)^{2n}\right]^{1/2}}, \quad n = 4, \quad f_c = 20\ \mathrm{Hz}.   (8.1)
Since spike rate fluctuations on average integrated to zero within each trial, they did not change the spike count but merely rearranged their timing, making the spike train more "bursty." Fano factors remained at 1.0 over the timescale of a complete trial, unchanged from homogeneous Poisson trains.

For slow-fluctuation noise, both spontaneous and stimulus spike rates were constant within a trial, but changed randomly from trial to trial according to a log-normal distribution:

prob(r \mid \mu, \sigma) =
\begin{cases}
\dfrac{1}{r\sigma\sqrt{2\pi}}\, e^{-(\ln(r) - \mu)^2 / (2\sigma^2)} & r > 0 \\
0 & r \le 0
\end{cases}.   (8.2)
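The slow-fluctuation rates can be sampled directly (a sketch using Python's standard library; the parameter values are those given in the text, chosen so the mean rates stay near 8 and 24 Hz):

```python
# Slow-fluctuation noise model of equation 8.2: each trial's rate is a
# log-normal draw. Parameters from the text: spontaneous mu = 2.02,
# sigma = 0.35; stimulus mu = 3.15, sigma = 0.22.
import random

random.seed(1)

def draw_rates(mu, sigma, n_trials):
    return [random.lognormvariate(mu, sigma) for _ in range(n_trials)]

spont = draw_rates(2.02, 0.35, 100_000)
stim = draw_rates(3.15, 0.22, 100_000)

# The log-normal mean is exp(mu + sigma^2 / 2): ~8.0 Hz and ~23.9 Hz here.
print(sum(spont) / len(spont))
print(sum(stim) / len(stim))
```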
Parameters for the spontaneous firing rate were μ = 2.02, σ = 0.35, and for the stimulus rate, μ = 3.15, σ = 0.22. Those values satisfied two criteria. First, the mean spike densities were the same as used previously, r̄_spont = 8 and r̄_stim = 24, and second, Fano factors were approximately 2.0 in both cases.

Using noise-perturbed spike trains, stimulus response statistics were calculated as before. Results for the two noise conditions are given in Tables 3 and 4, to be compared with results for homogeneous Poisson trains in Table 2. Comparison of Tables 2 and 3 shows that fast-fluctuation noise has no effect on either Bayesian or standard estimates, not surprising given that on extended timescales, spike count is conserved under this noise.

Slow-fluctuation noise produces no significant change in Bayesian estimates of E(Var(λ_resp)), as a comparison of Tables 2 and 4 shows. On the other hand, standard estimates of E(Var(λ_resp)) were increased substantially. Thus, under this noise condition, Bayesian variance estimates become more accurate than standard ones (closer to the noise-free estimates), in addition to being more precise (smaller variance of the variance).

In the slow-fluctuation noise case, we have spike trains that appear non-Poisson by one criterion, a Fano factor of two, yet the estimates of a Poisson-

Table 3: Response Statistics Using Inhomogeneous Poisson Spike Trains with a 3 Msec Refractory Period and Whose Firing Rates Are Perturbed with Fast-Fluctuation Noise.
             E(λ_resp)              E(Var(λ_resp))         Var(Var(λ_resp))
Trials   Standard   Bayesian    Standard   Bayesian    Standard   Bayesian
  1       16.1       16.1        –          63.1        –          133
  2       16.0       16.0        28.8       29.9        1574       16.8
  5       16.0       16.0        11.6       11.6        68.2       1.11
 10       16.0       16.0        5.82       5.72        7.64       0.138

Notes: Fast-fluctuation noise varies rapidly relative to the duration of a stimulus trial, as defined in equation 8.1. Simulation parameters are as given in Table 1.
Table 4: Response Statistics Using Inhomogeneous Poisson Spike Trains with a 3 Msec Refractory Period and Whose Firing Rates Are Perturbed with Slow-Fluctuation Noise.

             E(λ_resp)              E(Var(λ_resp))         Var(Var(λ_resp))
Trials   Standard   Bayesian    Standard   Bayesian    Standard   Bayesian
  1       16.2       16.2        –          62.8        –          207
  2       15.9       15.9        46.6       29.8        4520       26.7
  5       16.0       16.0        18.9       11.5        201        1.76
 10       16.1       16.1        9.36       5.71        21.4       0.225

Notes: Slow-fluctuation noise is slow relative to the duration of a stimulus trial, as defined in equation 8.2. Simulation parameters are as given in Table 1.
Table 5: Example of Bayesian Estimation Applied to Actual Data.

             E(λ_resp)              E(Var(λ_resp))         Var(Var(λ_resp))
Trials   Standard   Bayesian    Standard   Bayesian    Standard   Bayesian
  5       48.1       47.9        89.4       47.8        2230       19.7

Notes: The data consisted of 10 trials recorded from a unit in macaque striate cortex, presented with grating stimuli. The Fano factor was 1.7. Bayesian estimation was applied to 50 replications of a bootstrap resampling of the data, with 5 of the 10 trials chosen at random with replacement for each resampling.
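The resampling scheme described in the Table 5 notes can be sketched as follows (illustrative trial indices only; the recorded spike data themselves are not reproduced here):

```python
# Bootstrap resampling as in the Table 5 notes: 50 replications, each
# drawing 5 of the 10 recorded trials at random with replacement.
import random

random.seed(0)
trials = list(range(10))               # indices of the 10 recorded trials
replications = [
    [random.choice(trials) for _ in range(5)]
    for _ in range(50)
]
print(len(replications))               # 50 resamples
print(all(len(r) == 5 for r in replications))
```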
based model are little affected, as spike generation is still Poisson at a local timescale. The insensitivity of Bayesian estimates to either type of noise perturbation away from a homogeneous Poisson process indicates the robustness of the method.

As a final item on noise-perturbed spike trains, we apply the Bayesian method to actual data recorded from monkey striate cortex (see Table 5). In this case, E(λ_resp) is slightly different for the two methods because T_stim ≠ T_spont (see equation 6.4). The pattern of E(Var(λ_resp)) values for the actual data appears to follow those of the simulations with slow-fluctuation noise added, in which Bayesian variances are substantially smaller than standard variances. We noted previously that under the slow-fluctuation noise condition, the smaller Bayesian variance estimates are more accurate than the standard estimates.

9 Discussion

The Bayesian and standard methods provide similar estimates for the mean E(λ_resp) and variance E(Var(λ_resp)) of the response firing rate for parameters typical of experimental conditions. The estimates of both methods converge
as the data sample size increases, by increasing either the number of trials or the duration of each trial, so that the Bayesian model is asymptotically unbiased. Given a uniform prior, the Bayesian estimate is a maximum likelihood estimate and therefore asymptotically efficient, given certain regularity conditions (Kay, 1993). That is, the Bayesian E(Var(λ_resp)) estimate asymptotically approaches the Cramér-Rao bound. As the Bayesian estimate also asymptotically approaches the standard estimate, the implication is that the standard estimate is, at minimum, also asymptotically efficient.

The Bayesian method provides much smaller estimates of the variance of the variance Var(Var(λ_resp)) than the standard method, despite having almost identical values of expected variance E(Var(λ_resp)). That is because the two methods compute variances using entirely different procedures. Under the standard method, the variance of the firing rate Var(λ) is computed directly from the sample of λ itself by plugging those values into the standard formula for variance, without making assumptions about the distribution of λ. Under the Bayesian method, statistics are calculated indirectly from the data in a two-step process, by first fitting a model (the likelihood model) and then determining statistics from the model curve fit rather than directly from the data.

We can identify three benefits to using the Bayesian analysis relative to the standard analysis. The first is smaller values for the variance of the variance. This allows more precise interval estimates of a parameter, which in turn improves the precision of tests of significance. The second is reduced sensitivity of variance estimates to perturbations in the data caused by noise. This can be seen by comparing Tables 2 and 4. The final benefit is the ability to estimate response variance after only a single trial.

Underlying the Bayesian analysis was the assumption that spike trains have Poisson statistics.
Sensitivity to this assumption was tested by perturbing spike trains away from Poisson statistics with two kinds of noise, whose fluctuations were fast or slow relative to the timescale of a single trial. Despite the noise perturbations, Bayesian estimates of response mean and variance were not significantly affected. This indicates that the Bayesian method is robust. Given the benefits of Bayesian analysis listed above and the robustness of the method, it appears to be a promising candidate for practical data analysis.
Acknowledgments

I thank Keiji Tanaka and Leslie Ungerleider for their support. Portions of this work were completed while I was visiting the Sloan-Schwartz Center for Theoretical Neurobiology at the Salk Institute.
References

Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.

Kay, S. M. (1993). Fundamentals of statistical signal processing: Estimation theory. Upper Saddle River, NJ: Prentice Hall.

Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Comput., 12, 2621–2653.

Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. J. Neurophysiol., 76, 2790–2793.

Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.

Received April 16, 2003; accepted December 3, 2003.
NOTE
Communicated by John Platt
Comments on "A Parallel Mixture of SVMs for Very Large Scale Problems"

Xiaomei Liu
[email protected]
Department of Computer Science and Engineering, University of Notre Dame, South Bend, IN 46556, U.S.A.

Lawrence O. Hall
[email protected]
Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, U.S.A.

Kevin W. Bowyer
[email protected]
Department of Computer Science and Engineering, University of Notre Dame, South Bend, IN 46556, U.S.A.
Collobert, Bengio, and Bengio (2002) recently introduced a novel approach to using a neural network to provide a class prediction from an ensemble of support vector machines (SVMs). This approach has the advantage that the required computation scales well to very large data sets. Experiments on the Forest Cover data set show that this parallel mixture is more accurate than a single SVM, with 90.72% accuracy reported on an independent test set. Although this accuracy is impressive, their article does not consider alternative types of classifiers. We show that a simple ensemble of decision trees results in a higher accuracy, 94.75%, and is computationally efficient. This result is somewhat surprising and illustrates the general value of experimental comparisons using different types of classifiers.

1 Introduction

Support vector machines (SVMs) are most directly used for two-class classification problems (Vapnik, 1995). They have been shown to have high accuracy in a number of problem domains, for example, character recognition (Decoste & Scholkopf, 2002). The training time required by SVMs is an issue for large data sets. Collobert, Bengio, and Bengio (2002) introduced a method using a mixture of SVMs created in parallel to address the issue of scaling to large data sets. Results were reported on an example data set, the Forest Cover Type data set from the UC Irvine repository (Blake & Merz,

Neural Computation 16, 1345–1351 (2004)  © 2004 Massachusetts Institute of Technology
1346
X. Liu, L. Hall, and K. Bowyer
1998). The training data were converted to a two-class problem. It was shown that the mixture of SVMs is more accurate than a single SVM, it is faster to train, and it has the potential to scale to large data sets. The accuracy from training on 100,000 examples is shown to be 90.72% on an unseen test set of 50,000 examples. The parallel mixture of SVMs was not compared with other classifiers.

In this comment, we compare an ensemble of randomized C4.5 decision trees (Dietterich, 1998) to the parallel mixture of SVMs and, perhaps contrary to our expectations and others', find that the ensemble of decision trees results in a more accurate classifier. Further, decision trees scale reasonably well with large data sets (Chawla et al., 2003). This result seems to reinforce the idea that it is always useful to compare a classifier to other approaches.

In the next section, we briefly discuss the two ensemble classifiers compared. Section 3 provides the details of our experiments and our experimental results. Section 4 presents the discussion and our conclusion.

2 Background

2.1 SVM and a Parallel Mixture of SVMs. The SVM was introduced by Vapnik (1995). SVM classifiers are used to solve problems of two-class classification. The learning and training time for an SVM is high: it is at least quadratic in the number of training patterns. In order to decrease the time cost of SVMs, Collobert et al. (2002) proposed a parallel mixture of SVMs. They use only part of the training set for each SVM, so the time cost is decreased significantly. It is conjectured that the time cost of a parallel mixture of SVMs is subquadratic with the number of training patterns for large-scale problems. They claim that the performance of a parallel mixture of SVMs is at least as good as a single SVM and show it to be so.

2.2 Decision Trees. A decision tree (DT) is a tree-structured classifier.
Each internal node of the DT contains a test on an attribute of the example to be classified, and the example is sent down a branch according to the attribute value. Each leaf node of the decision tree has the class value of the majority class for the training examples that ended up at the leaf. A DT typically builds axis-parallel class boundaries. Pruning is a useful method to decrease overfitting for individual DTs, but is not so useful for ensembles of decision trees.

The randomized C4.5 algorithm (Dietterich, 1998) was based on the C4.5 (Quinlan, 1993) algorithm. The main idea of randomized C4.5 is to modify the strategy of choosing a test at a node. When choosing an attribute and test at an internal node, C4.5 selects the best one based on the gain ratio. In the randomized C4.5 algorithm, the m best splits are calculated (m is a positive constant with the default value 20), and one of them is chosen randomly with a uniform probability distribution. When calculating the m best tests, it
Table 1: Class Distribution of the Forest Cover Type Data Set.

Class Label   Meaning              Number of Records
1             Spruce/Fir           211,840
2             Lodgepole Pine       283,301
3             Ponderosa Pine       35,754
4             Cottonwood/Willow    2747
5             Aspen                9493
6             Douglas-fir          17,367
7             Krummholz            20,510
is not required that they be from different attributes. In an extreme situation, the m candidate tests may be from the same attribute.

3 The Forest Cover Type Data Set

The original Forest Cover Type data set (Blackard & Dean, 1999) contains 581,012 instances.^1 For each instance, there are 54 features. There are seven class labels numbered from 1 to 7. The distribution of the seven classes is not even. Table 1 shows the class distribution.

Since an SVM is most directly used for two-class classification, the original seven-class data set was transformed into a two-class data set in the experiments of Collobert et al. (2002). The problem became to differentiate the majority class (class 2) from the other classes. Since we are going to compare the performance of a DT ensemble with the parallel mixture of SVMs, we use the same two-class version of the Forest Cover Type data set in our experiment.

We downloaded the data sets.^2 They normalized the original forest cover data by dividing each of the 10 continuous features or attributes by the maximum value in the training set. There were 50,000 patterns in the testing set, 10,000 patterns in the validation set, and 400,000 patterns in the training set. We used the downloaded testing set and validation set as our testing set and validation set, accordingly. However, we did not actually tune our ensemble based on the validation set, so for us, it serves as a second test set. We used the first 100,000 patterns in the downloaded training set as our training set. These are the exact data sets used in Collobert et al. (2002).
^1 The data set is available on-line at http://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info.
^2 Available on-line at http://www.idiap.ch/collober/forest.tgz.
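The preprocessing described in this section can be sketched as follows (hypothetical data and helper names of our own; the actual files are the downloaded ones cited above):

```python
# Sketch of the data preparation: collapse the seven cover types to a
# binary target (class 2 vs. the rest) and scale each continuous feature
# by its maximum over the training set. Data here are illustrative.

def to_binary_labels(labels):
    """1 if the record is class 2 (Lodgepole Pine), else 0."""
    return [1 if y == 2 else 0 for y in labels]

def scale_by_train_max(train_col, test_col):
    """Divide a continuous feature by its maximum in the training set."""
    m = max(train_col)
    return [v / m for v in train_col], [v / m for v in test_col]

print(to_binary_labels([2, 1, 5, 2, 7]))        # [1, 0, 0, 1, 0]
train, test = scale_by_train_max([10.0, 40.0, 20.0], [50.0, 5.0])
print(train)                                    # [0.25, 1.0, 0.5]
print(test)                                     # [1.25, 0.125]
```

Note that the test column is scaled by the training-set maximum, so test values can exceed 1, matching the normalization described in the text.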
4 Experimental Results

We used the USFC4.5 software, which is based on C4.5 release 8 (Quinlan, 1993) and was modified by researchers at the University of South Florida (Eschrich, 2003), for the DT experiments.

4.1 An Ensemble of 50 Randomized C4.5 Trees on the Full Training Set. Typically, a randomized C4.5 ensemble would consist of 200 decision trees. To compare with the 50 SVMs, we restricted our ensemble to 50 decision trees. Each tree was built on the whole training set of size 100,000. Since there were no differences in the data set of each individual tree, we used randomized C4.5 to create each tree to generate a diverse set of trees for the ensemble (Banfield, Hall, Bowyer, & Kegelmeyer, 2003). A random choice from among the top 20 tests was used to build the trees. The trees in the ensemble were unpruned. The ensemble prediction was obtained by unweighted voting, so the class with the most votes from individual classifiers was the prediction of the ensemble.

As shown in Table 2, the ensemble accuracy on the training data set was 99.81%, on the validation set it was 94.85%, and on the testing set it was 94.75%. We also list the minimum, maximum, and average accuracy of the 50 individual DTs included in the ensemble in Table 2. The test set accuracy compares favorably with the 90.72% accuracy of the parallel mixture of SVMs.

4.2 An Ensemble of 100 C4.5 Trees on Half of the Training Set. To get an idea of how much the randomized C4.5 was helping the classification accuracy, we created an ensemble of 100 trees, each built on one-half of the training data. Each tree was trained on a randomly selected 50,000 examples from the 100,000-example training set. It is not guaranteed that each instance appears exactly 50 times in the training sets of the 100 DTs. Since each training data set is clearly unique, we built a standard C4.5 decision tree on each of them. The trees were not pruned. Each of our trees was built on 25 times more training data than the SVM.
However, only 100,000 unique examples are used. Each tree can be built in 1 CPU minute.

The ensemble performance on the testing set was 92.76%, the minimum single tree performance is 86.23%, the maximum single tree performance
Table 2: Accuracy of Dietterich's Randomized C4.5 on the Forest Cover Type Data Set, 50 Trees.

                  Ensemble   Minimum   Maximum   Average
Training set      99.81%     97.27%    97.73%    97.51%
Validation set    94.85      88.62     89.92     89.42
Testing set       94.75      88.81     89.66     89.25
Table 3: Comparison of Accuracy Between Randomized C4.5 DTs and a Parallel Mixture of SVMs on the Forest Cover Type Data Set.

                  Randomized C4.5 DTs   Parallel Mixture of SVMs
Training set      99.81%                94.09%
Validation set    94.85                 91.00
Testing set       94.75                 90.72
is 87.60%, and the average single tree performance is 87.10%. So the SVM mixture was outperformed by an ensemble of plain C4.5 DTs, with each tree grown on half the training data.

5 Discussion and Conclusion

According to the results reported in Collobert et al. (2002), the best performance of their parallel mixture of SVMs (using 150 hidden units and 50 SVMs) on the training set was 94.09%, on the validation set it was around 91% (estimated from Figure 4 in Collobert et al., 2002), and on the testing set it was 90.72%. In our experiments, the ensemble of 50 randomized DTs had much better performance. As shown in Table 3, its accuracy on the training set was 99.81%, on the validation set it was 94.85%, and on the testing set it was 94.75%. We did build an ensemble of 200 trees, but the accuracies were only slightly greater. The ensemble becomes good quite quickly.

Each SVM in the ensemble of classifiers was built from a disjoint data set of 2000 examples. The high accuracy obtained from such small training data sets and the scalability of the algorithm are impressive. Each SVM classifier used significantly fewer training examples than the 100,000 or 50,000 used for each DT in the ensemble. The accuracy of the DT ensemble with half the size of the data was 2% less than with all training examples. Clearly, the DT accuracy will decline with fewer examples. However, below, we show some timings indicating that a DT ensemble can likely be built in time comparable to or less than the SVM ensemble.

As to the running time, according to Collobert et al. (2002), it needs 237 minutes when using 1 CPU, and 73 minutes when using 50 CPUs. We ran our experiments on a single computer. The CPU time to create the ensemble of 50 random C4.5 DTs is approximately 108 minutes. We used a Pentium III system, and each processor had a 1 GHz clock speed. Since it is an ensemble, it could be created in parallel, with each processor starting with a different random seed.
The CPU time to build each tree is approximately 2 minutes. The parallel time would then be on the order of 2 minutes plus communication time.
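The scheme discussed above — each ensemble member trained independently on its own disjoint shard of the data, predictions combined by vote — can be sketched in Python. This is an illustrative toy, not the authors' implementation: the base learner here is a trivial nearest-centroid classifier standing in for a C4.5 decision tree, and all names are made up. Because the members share no data or state, each could be trained on a separate processor with a different random seed, which is exactly what makes the parallel timing estimate above plausible.

```python
# Illustrative sketch: disjoint-shard ensemble with majority voting.
# The nearest-centroid "learner" is a stand-in for a C4.5 decision tree.
from collections import Counter
from statistics import mean

def train_centroid(shard):
    """Return per-class feature centroids for one disjoint shard of (x, y) pairs."""
    by_class = {}
    for x, y in shard:
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_class.items()}

def predict_one(model, x):
    """Predict the class whose centroid is closest to x."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist(model[y]))

def ensemble_predict(models, x):
    """Majority vote over all ensemble members."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Disjoint shards, as in the experiments above: each member sees only its slice.
data = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)] * 3
shards = [data[i::3] for i in range(3)]
models = [train_centroid(s) for s in shards]
print(ensemble_predict(models, (0.05, 0.1)))  # votes agree on class 0
```

Since the members are independent, replacing the list comprehension that builds `models` with a process pool would parallelize training with no other changes.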
X. Liu, L. Hall, and K. Bowyer
Further, an ensemble of 100 trees, each created on 50,000 randomly selected examples, was still 2% more accurate than the ensemble of SVMs. Each of these trees could be built in approximately 1 minute of CPU time in parallel. Our experiments show that for the two-class Forest Cover Type data set, an ensemble of DTs has very good predictive accuracy. This advantage is not limited to the two-class Forest Cover Type data set: we also did some experiments on the original seven-class Cover Type data set using a single DT, and its performance is promising too, much better than that of a single feedforward backpropagation neural network in both accuracy and speed. The comparative results provided here underscore the need to compare classifier results with other types of classifiers even when it seems that the answer would be known a priori. For a given data set, most people would guess that an SVM would be much better than a DT. So if one designs a classifier that is even better than a single SVM, intuitively it seems unnecessary to compare with classical approaches with known limits, such as DTs. We are certain that a parallel mixture of SVMs will outperform DTs on some training data sets, but not this one. As noted above, DTs result in good classification accuracy on the Forest Cover data set. They are both faster to construct than SVMs on this data set and more accurate.

Acknowledgments

This work was supported in part by the U.S. Department of Energy through the Sandia National Laboratories ASCI VIEWS Data Discovery Program, contract number DE-AC04-76DO00789, as well as the U.S. Navy, Office of Naval Research, under grant number N00014-02-1-0266.

References

Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2003). In T. Windeatt & F. Roli (Eds.), Multiple classifier systems: Fourth International Workshop, MCS (pp. 306–316). New York: Springer-Verlag. Blackard, J. A., & Dean, D. J. (1999).
Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24, 131–151. Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html. Chawla, N. V., Moore, Jr., T. E., Hall, L. O., Bowyer, K. W., Kegelmeyer, W. P., & Springer, C. (2003). Distributed learning with bagging-like performance. Pattern Recognition Letters, 24(1–3), 455–471.
A Parallel Mixture of SVMs
Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14, 1105–1114. Decoste, D., & Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1–3), 161–190. Dietterich, T. G. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2), 139–158. Eschrich, S. (2003). Learning from less: A distributed method for machine learning. Unpublished doctoral dissertation, University of South Florida. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann. Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag. Received October 2, 2003; accepted January 8, 2004.
LETTER
Communicated by Edward White
Automated Algorithms for Multiscale Morphometry of Neuronal Dendrites Christina M. Weaver [email protected] Department of Biomathematical Sciences and Computational Neurobiology and Imaging Center, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
Patrick R. Hof [email protected] Fishberg Research Center for Neurobiology, Kastor Neurobiology of Aging Laboratories, Computational Neurobiology and Imaging Center, and Advanced Imaging Program, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
Susan L. Wearne [email protected] Department of Biomathematical Sciences, Fishberg Research Center for Neurobiology, and Computational Neurobiology and Imaging Center, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
W. Brent Lindquist [email protected] Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, U.S.A.
We describe the synthesis of automated neuron branching morphology and spine detection algorithms to provide multiscale three-dimensional morphological analysis of neurons. The resulting software is applied to the analysis of a high-resolution (0.098 μm × 0.098 μm × 0.081 μm) image of an entire pyramidal neuron from layer III of the superior temporal cortex in the rhesus macaque monkey. The approach provides a highly automated, complete morphological analysis of the entire neuron; each dendritic branch segment is characterized by several parameters, including branch order, length, and radius as a function of distance along the branch, as well as by the locations, lengths, shape classification (e.g., mushroom, stubby, thin), and density distribution of spines on the branch. Results for this automated analysis are compared to published results obtained by other computer-assisted manual means.
Neural Computation 16, 1353–1383 (2004) © 2004 Massachusetts Institute of Technology
1 Introduction

Ever since the pioneering work of Ramón y Cajal (1893) and Tanzi (1893), there have been indications that the processes of learning and memory may require physical changes in neuronal anatomy (see Wallace, Hawrylak, & Greenough, 1991, for review). There is a strong correlation between the functional complexity of a neuron and its morphological complexity (Hübener, Schwarz, & Bolz, 1990; Scheibel et al., 1985; Scheibel, Conrad, Perdue, Tomiyasu, & Wechsler, 1990). As a result, many studies have investigated the morphological variation between neurons located in different regions of the cortex (Elston & Rosa, 1997, 1998; Elston, Tweedale, & Rosa, 1999a, 1999b, 1999c; Elston, 2000; Elston & Rockland, 2002; Jacobs, Driscoll, & Shall, 1997; Jacobs et al., 2001; Nimchinsky, Sabatini, & Svoboda, 2002). Dendritic branching and spine morphology vary throughout the life cycle of an animal (Anderson, Classey, Condé, Lund, & Lewis, 1995; Duan et al., 2003; Harris, Jensen, & Tsao, 1992; Page et al., 2002), and can be influenced by environmental conditions (Bartesaghi, Severi, & Guidi, 2003; Kolb, Gibb, & Gorny, 2003; Seib & Wellman, 2003) and neurological disorders (DeFelipe & Fariñas, 1992; Glantz & Lewis, 2000; Irwin et al., 2001, 2002; Nimchinsky, Oberlander, & Svoboda, 2001; Purpura, 1974). There have been several demonstrations of linkage between the morphology of a neuron and its physiological and electrotonic properties (Baer & Rinzel, 1991; Gilbert & Wiesel, 1992; Holmes, 1989; Krichmar, Nasuto, Scorcioni, Washington, & Ascoli, 2002; Mainen & Sejnowski, 1996; Rumberger, Schmidt, Lohmann, & Hoffmann, 1998; Stratford, Mason, Larkman, Major, & Jack, 1989; Surkis, Peskin, Tranchina, & Leonard, 1998; Wilson, 1984, 1988). These experimental and theoretical results motivate an interest in obtaining detailed information regarding the morphology of a neuron.
Current morphological analysis typically involves a significant component of computer-assisted manual labor in which a human operator performs crucial structure recognition decisions. As a result, such analysis can be very time-consuming and is susceptible to fatigue- and habituation-related bias (Anderson et al., 1995; Elston & Rosa, 1997; Feldman & Peters, 1979). The goal of our work is to contribute to the development of a newer generation of software tools that provide greater structural recognition capability, drastically reducing the human labor component. Koh, Lindquist, Zito, Nimchinsky, and Svoboda (2002) described a software package for detection of dendritic spines in images obtained from laser scanning microscopy and benchmarked it against manual recognition. In this article, we combine the spine detection algorithms with dendritic tracing algorithms (Koh, 2001) to produce a package capable of morphometry on an entire neuron, imaged at high resolution (0.1 μm), with vastly reduced user interaction. Automated methods for tracing dendritic arbors have been described in Al-Kofahi et al. (2002), Koh (2001), and Cohen, Roysam, and Turner (1994). Al-Kofahi et al. (2002) generate seed points to find probable locations of dendritic segments, and then use predetermined templates of variable size to find local dendritic boundaries and trace dendritic processes. They take advantage of two-dimensional (2D) projections of three-dimensional (3D) image stacks to locate the seed points, resulting in an efficient trace of dendritic segments, including the basic connectivity of these segments. Koh and Cohen et al. utilize skeletonization (medial axis) construction to automate arbor tracing. Both use the medial axis to generate a graph-theoretic representation of a neuron, from which branch segments are identified. Cohen et al. do not discuss how to handle loops in the medial axis structure that cause ambiguity when determining, for example, branch order and dendritic length from the soma. The arbor tracing method utilized by Koh et al. distinctly resolves such loops using minimum kink angle and branch length criteria. Automated methods for dendritic spine detection have also been demonstrated (Herzog et al., 1997; Koh et al., 2002; Rusakov & Stewart, 1995; Watzel, Braun, Hess, Scheich, & Zuschratter, 1995). The medial axis construction has again been the most popular technique for dendritic spine detection. Rusakov and Stewart construct the medial axis of 2D images of dendritic fragments, using manual supervision to distinguish dendritic segments from spine segments. A stereological procedure is used to estimate the 3D spine length distribution from 2D images of the same fragment at different tilt angles. Watzel et al. demonstrate a 3D medial axis-based algorithm to extract the centerline from images containing a single dendrite and identify "spurs" on the medial axis as potential spines. True spines are distinguished from medial axis artifacts using a minimum-length criterion. Herzog et al. use a parametric model of cylinders with hemispherical ends to fit the shape of dendrites and spines appearing in an image.
Thin-necked and short spines are hard to detect with this method and have to be added manually. Koh et al. construct a modified medial axis to locate dendritic branch centerlines and detect spines as geometric surface protrusions relative to the centerlines. The algorithms described in this study are implemented as part of the 3DMA-Neuron software package (Weaver & Lindquist, 2003). To date, the dendritic spine analysis in 3DMA-Neuron has been utilized (Koh et al., 2002) only on images detailing subregions of the dendritic arbor, while the branching morphology analysis has been used (Koh, 2001; Maravall, Shepherd, Koh, Lindquist, & Svoboda, in press) on images whose resolution is too low to permit accurate measurement of dendritic spines. We present here an integration of these two algorithms, demonstrating both branching morphology and fine-scale structure analysis of a single, full-neuron cell image. The neuron analyzed is a pyramidal neuron from layer III of the rhesus macaque monkey superior temporal cortex. Extracted morphological parameters from the automated analysis are compared with published results obtained by other means. This article is divided into the following sections. The biological preparation, imaging details, and image preprocessing techniques are presented
briefly in section 2. The algorithms used to analyze the neuron morphology are presented in section 3. Morphometry results, including comparison with measurements available from the literature, are presented in section 4. Final discussion of the image analysis follows in section 5.
2 Materials and Methods

2.1 Tissue Preparation and Imaging. Tissue sections of 200 to 400 μm thickness were obtained from the neocortex of a rhesus macaque monkey, aged about 22 years, as described elsewhere (Duan, Wearne, Morrison, & Hof, 2002; Duan et al., 2003). All experimental protocols were conducted according to the National Institutes of Health guidelines for animal research and were approved by the Institutional Animal Care and Use Committee at Mount Sinai School of Medicine. Sections were mounted on nitrocellulose paper and immersed in phosphate-buffered saline (PBS). The sections were incubated with 4,6-diamidino-2-phenylindole (DAPI; Sigma, St. Louis, MO) for at least 20 minutes to reveal the cellular architecture under ultraviolet (UV) excitation. Neurons were then impaled with micropipettes and loaded with Lucifer Yellow (LY; Molecular Probes, Eugene, OR) in distilled H2O under a DC current of 3 to 8 nA until the dye had filled distal processes (10–12 minutes) and no further loading was observed (DeLima, Voigt, & Morrison, 1990; Duan et al., 2002; Nimchinsky, Hof, Young, & Morrison, 1996). Generally, up to 15 cells were injected per slice, with injections sufficiently separated to avoid overlapping of the dendritic trees. After neuron labeling, sections were fixed again in 4% paraformaldehyde and 0.125% glutaraldehyde in PBS for 4 hours at 4°C, then washed and stored in PBS. Sections were mounted on uncoated slides, and loaded neurons were imaged with a TCS-SP (UV) confocal laser scanning microscope (Leica Microsystems, Heidelberg, Germany) equipped with confocal detector channels having 8-bit photomultiplier tubes. The pyramidal neuron imaged for this study is from layer III of the superior temporal cortex (STC) and was shown to project ipsilaterally to the prefrontal cortex (Duan et al., 2003). A 100× oil immersion objective lens with numerical aperture (NA) of 1.4 was used to collect the image, which was scanned at a resolution of 1024 × 1024 voxels.
Voxel dimensions are 0.098 μm × 0.098 μm in plane. Optical sections were collected every 0.081 μm via a scan stage with a 40 nm step size. As the working distance of the high-NA oil immersion lens is limited to 100 μm, only 1000 to 1200 optical sections could be collected per column. After imaging an entire stack through-focus, the stage was refocused and translated either in the z (optical axis) direction, to collect another stack from the remainder of the column, or in the xy-plane, to collect from an adjacent column. Sufficient overlap between columns was allowed to ensure accurate assembly of the full neuron montage. Scanning continued until all areas containing the neuron were imaged.
Algorithms for Morphometry of Neuronal Dendrites
1357
2.2 Deconvolution, Alignment, and Integration of Image Stacks. Each imaged stack was deconvolved separately using the blind-deconvolution-based package AutoDeblur (AutoQuant Imaging, Inc.). Typically (Turner et al., 1997, 2000), the optical point spread function smears the image along the optical axis (z-direction) about three times as much as in the x- and y-directions (see Figure 1a). While deconvolution greatly reduces this spread (see Figure 1b), the effect is still noticeable. Inspection shows that compression by voxel averaging of 2 → 1 in the z-direction nearly eliminates the spread effect (see Figure 1c). We perform this z-compression once the image stacks are integrated (as described next) into a single volume. The imaged stacks were processed into a single volume using the Volume Integration and Alignment System (VIAS) (Rodriguez et al., 2003). VIAS provides the user with an interface by which adjacent stacks can be aligned with one another. 2D projections of a stack, rather than the whole stack, are utilized in the alignment, greatly reducing the computer memory requirements of the program. Working with the xy-projections of two adjacent stacks, an approximate alignment is determined manually via a mouse "click-and-drag" operation. This is followed by an automated alignment that minimizes the L1 norm of the intensity differences in the overlapping region of the projection, using a relatively small neighborhood of offsets around the approximate manual alignment. Completion of this autoalignment step finalizes horizontal registration of the two stacks. Vertical registration is done by performing an automated alignment step using the xz- (or yz-) projections of the stack with the y (or x) coordinate of the alignment fixed. By considering adjacent stacks one at a time, all stacks can be registered by a displacement vector relative to the first.
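The automated refinement step just described — exhaustively searching a small neighborhood of offsets around the manual guess for the one minimizing the L1 norm of intensity differences over the overlap — can be sketched as follows. This is an illustrative 2D sketch, not VIAS code; the function names, the search radius, and the toy images are assumptions.

```python
# Illustrative sketch of L1-minimizing offset search around a manual guess.
def l1_overlap(a, b, dx, dy):
    """Mean absolute intensity difference between a and b when b is read at offset (dx, dy)."""
    h, w = len(a), len(a[0])
    total, count = 0, 0
    for y in range(h):
        for x in range(w):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:  # only the overlapping region counts
                total += abs(a[y][x] - b[yy][xx])
                count += 1
    return total / count if count else float("inf")

def refine_alignment(a, b, approx, radius=2):
    """Exhaustive search in a (2*radius+1)^2 neighborhood of the manual guess."""
    ax, ay = approx
    candidates = [(ax + dx, ay + dy)
                  for dx in range(-radius, radius + 1)
                  for dy in range(-radius, radius + 1)]
    return min(candidates, key=lambda off: l1_overlap(a, b, off[0], off[1]))

# b is a copy of a's bright square shifted by one pixel in x.
a = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
b = [[0, 0, 0, 0], [0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0]]
print(refine_alignment(a, b, (0, 0)))  # recovers the true offset (1, 0)
```

The same search applied to xz- or yz-projections, with one coordinate held fixed, would correspond to the vertical registration step.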
With all stacks registered, a global coordinate system can be assigned that covers all stacks, and a minimum-size rectangular volume defined that just contains all stacks. Fluorescence intensity values from individual stacks are then assigned to each appropriate voxel position in the enclosing volume. If voxels from two or more stacks overlap in position, the average intensity value is assigned. Under the voxel resolution used here, the entire neuron volume (4,826 × 5,609 × 597 voxels) is too large (25 Gbytes) to store in memory on any single-processor machine. Thus, the entire neuron volume is assembled slice-wise; each slice is output into a separate file before the next slice is assembled. A 2D projection of the final result of the VIAS procedure for this neuron is shown in Figure 2.

3 Image Analysis

For brevity, we shall refer to the entire z-compressed VIAS neuron image as If. The morphology of If is analyzed in a three-step approach. The first step involves local fine-scale analysis of the structure of dendritic branch segments, including spines. The second step determines the global dendritic branching morphology, including assignment of labels for each dendritic
1358
C. Weaver, P. Hof, S. Wearne, and W. Lindquist
Figure 1: xy- (top) and xz-projections (bottom) of a subregion from a single stack after (a) confocal imaging and (b) deconvolution by AutoDeblur. (c) The region in (b) after compression by a factor of 2 in the z (optical) direction.
Algorithms for Morphometry of Neuronal Dendrites
1359
Figure 2: xy-projection of the entire neuron after stack alignment and integration by the VIAS system. Individual tiles manually selected for spine detection analysis are also indicated.
branch. In the third step, the fine-scale morphology measurements from step 1 must be registered with the dendritic branch labeling established in step 2.

3.1 Analysis of Fine-Scale Morphology. For the fine-scale analysis, computer memory limitations require dividing If into nonoverlapping tiles. For this analysis, If was manually divided into 31 tiles (see Figure 2), with
no tile larger than 532³ voxels, a limit imposed by the 1 Gbyte memory limitation of the Linux-based PCs on which the software was implemented. Note that tiles are chosen to avoid regions containing no neuron structure. Fine structure analysis is performed separately for each tile; consequently, fine structure analysis proceeds without global label information for the dendritic segments present in the tile. The analysis sequence for the fine structure in a single tile consists of segmentation; skeletonization of the neuron phase and reduction of the skeleton to one consisting of midlines for all dendrite candidates; dendritic radius determination; spine detection as protrusions relative to the dendritic midline; and morphological characterization of each spine. Dendrite midlines are also referred to as backbones, and the reduced skeleton composed of midlines is referred to as the backbone skeleton. These algorithms have been described in detail in Koh et al. (2002); for brevity, we comment only on changes incorporated for, and issues specific to, this entire-neuron study. As mentioned, If was acquired as a union of separately imaged, overlapping stacks. During imaging, microscope parameters were adjusted for each separate stack to balance visibility of spines and thin dendritic processes against fluorescent brightness of thicker processes. This produces intensity variations throughout If. Hence, a segmentation method that is locally adaptive is preferable to one employing global parameters. We employ a method (Oh & Lindquist, 1999) based on indicator kriging, which (using information in the local spatial neighborhood) performs a maximum likelihood estimation of the probability that any given voxel belongs to the neuron phase. The method requires that a subpopulation of voxels of each phase be positively identified first. This subpopulation is identified by conservatively establishing a window of image intensity values delimited by two thresholds.
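The two-threshold seeding — voxels above the upper threshold seeded as neuron, voxels below the lower threshold seeded as background, the rest left for the kriging step — can be sketched as follows. This is purely illustrative: a simple majority vote over already-seeded 6-connected neighbors stands in for indicator kriging, which actually performs a maximum likelihood estimate from the local spatial neighborhood; all names and the toy intensities are assumptions.

```python
# Illustrative two-threshold seeding; a neighbor vote stands in for kriging.
def seed_phases(intensity, lower, upper):
    """Return a dict: voxel -> 'neuron', 'background', or 'unknown'."""
    phase = {}
    for voxel, value in intensity.items():
        if value > upper:
            phase[voxel] = "neuron"        # confidently neuron
        elif value < lower:
            phase[voxel] = "background"    # confidently background tissue
        else:
            phase[voxel] = "unknown"       # left for the kriging step
    return phase

def classify_unknown(phase):
    """Stand-in for kriging: majority vote of 6-connected seeded neighbors."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    result = dict(phase)
    for (x, y, z), p in phase.items():
        if p != "unknown":
            continue
        votes = [phase.get((x + dx, y + dy, z + dz)) for dx, dy, dz in offsets]
        n, b = votes.count("neuron"), votes.count("background")
        result[(x, y, z)] = "neuron" if n > b else "background"
    return result

intensity = {(0, 0, 0): 200, (1, 0, 0): 120, (2, 0, 0): 30}
print(classify_unknown(seed_phases(intensity, lower=50, upper=150)))
```

The tile-to-tile manual adjustment described below corresponds to varying `lower` and `upper` per tile.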
Intensity values above the upper threshold are assumed to be positively identified as neuron, and those below the lower threshold are assumed to be background tissue. The classification of the remaining voxels is then performed by indicator kriging. The intensity variations in If necessitated tile-to-tile manual adjustment of these two threshold values. The determination of dendritic branch radius was not discussed explicitly in Koh et al. (2002); however, it follows directly from the skeletonization procedure discussed there. Skeletonization naturally associates a radius with each voxel comprising the dendrite's midline; thus, the radius measure is generated as a function of position along any dendritic branch segment. Technically, the radius measured is one-half of the shortest diameter of the cross-sectional area of the branch at the voxel in question. A major thrust of the study by Koh et al. (2002) was to benchmark the spine detection algorithm against manual identification standards. The spine detection algorithms have false-positive and false-negative detection rates that primarily reflect the amount of fluorescent cellular debris located near neuron tissue and also reflect nonspine dendrite surface structures such as incipient filopodia. False detection rates are controlled by a user-set
length parameter that determines both the maximum spine length allowed and the maximum distance an (apparently disconnected) spine head can lie from a dendritic backbone. As dendritic segments and spines are detected while examining only single tiles, final confirmation of all dendrite and spine candidates so detected is reserved until the fine-scale analysis of all tiles has been collated and registered with the global dendrite morphology.

3.2 Analysis of Dendritic Branching Morphology. Memory requirements to store If also preclude an analysis of the dendritic branching morphology at fine-scale resolution. Let Si denote the backbone skeleton obtained from the fine-scale analysis of the neural tissue in the ith tile. Let S be the union of the Si over all tiles. S forms a fine-scale description of the backbone skeleton of the neuron; this skeleton preserves the geometry of the branching of the dendritic arbors. To reduce the memory requirements needed to store the volume required to analyze the dendritic branching of S, a coarsened image Sc is obtained by a simple contraction operation. Let v be a voxel in S with global coordinates (xv, yv, zv). (Because the image is digital, the coordinates are integers.) Define the coarsened global position of v as ([xv/f], [yv/f], [zv/f]), where f is a positive integer coarsening factor, and the [ ] operation denotes "integer part of." Qualitatively, the effect of this contraction procedure is to replace each string of f voxels in the skeleton S by a single voxel (of f times the physical dimension in each direction). The set of coarsened voxels produces a coarsened skeleton Sc, which can be stored in a coarsened volume Ic. The branching morphology analysis is performed on Sc in Ic. The choice of the value for f represents a compromise between memory reduction and loss of accuracy of the branching structure due to a change in topology between S and Sc.
If two dendritic processes are locally separated by less than f voxels in S, they will be locally merged under coarsening, producing a branch point in Sc that has no correspondence in S. For this neuron, a coarsening factor of f = 5 was used; effectively, the skeleton of the dendritic tree was discretized into coarse segments of length 0.5 μm rather than the fine-scale length of 0.1 μm. Since the individual pieces Si of the skeleton were determined from individual tiles, the union skeleton S, and hence its coarsened version Sc, contains disconnected pieces corresponding to cellular debris that could not be positively identified as such by examining only individual tiles, as well as spurious segments at tile boundaries that can be resolved only once skeleton pieces from neighboring tiles are connected. The true neuron skeleton is therefore identifiable as the largest connected piece of Sc; any disconnected skeletal pieces are deleted as corresponding to cellular debris. Finally, the skeleton Sc is trimmed as described in Weaver (2003) to remove spurious, tile-boundary-related segments and to retain only a 26-connected skeletal structure.
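The contraction operation defined in section 3.2 can be sketched directly: each skeleton voxel at coordinates (xv, yv, zv) maps to ([xv/f], [yv/f], [zv/f]), so distinct fine voxels that share a coarse cell merge into one. A minimal sketch (assuming nonnegative integer coordinates, as the text states):

```python
# Illustrative sketch of the skeleton contraction with coarsening factor f.
def coarsen(skeleton, f):
    """Map a set of fine-scale voxel coordinates to the coarsened skeleton."""
    # Integer division implements the "integer part of" operation [ ]
    # for the nonnegative integer coordinates used here.
    return {(x // f, y // f, z // f) for (x, y, z) in skeleton}

# A straight 10-voxel run along x collapses to 2 coarse voxels at f = 5.
fine = {(x, 0, 0) for x in range(10)}
print(sorted(coarsen(fine, 5)))  # [(0, 0, 0), (1, 0, 0)]
```

The topology hazard discussed above is visible in this representation: two fine voxels from different processes that land in the same coarse cell become a single voxel of the coarse skeleton, creating a merge with no counterpart in S.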
In the trimmed skeleton (which we still refer to as Sc), a (generally complicated) compact region of the backbone structure Sc corresponds to the soma of the neuron. Using the graphical user interface (GUI) provided, the software allows the region of the soma to be delineated manually by simple click-and-drag operations (Koh, 2001). The backbone structure of Sc that lies inside the soma region is then omitted from further analysis. The remainder of the backbone structure of Sc corresponds to the dendritic trees and axon. Axon collaterals in the image are not yet handled by our software, which currently assumes that all structures leaving the soma are dendritic. To handle axon collaterals, provision is made in the GUI to manually inspect and reject axons using mouse point-and-click operations. During this manual rejection procedure, any click on a dendritic midline will remove it and the entire centrifugal subtree of midlines for which it is the root. After manual removal of the axon, using the GUI, a point-and-click operation is used to identify one (or more) trees that comprise the apical arbor. The software assumes all other trees comprise the basal arbor. We refer to each dendritic backbone segment exiting the soma region as the root of a dendritic tree. Using breadth-first spanning trees generated in the centrifugal direction starting from each root, the dendritic branch segments in each tree are automatically labeled according to a centrifugal nomenclature (Uylings, Ruiz-Marcos, & van Pelt, 1986). Each branch has a unique label of the form B.t.l.i, where B = a, b is the arbor label (apical or basal); t = 1, 2, 3, ... denotes the tree label; l = 1, 2, 3, ... is the branch order; and i is an index from 1 to nla (or nlb), where nla (resp. nlb) is the total number of apical (resp. basal) branches at branch order l. Any looping structures in the arbor skeleton must be resolved prior to labeling.
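The centrifugal labeling can be sketched as a breadth-first traversal from each root, with loops assumed already resolved. This is an illustrative sketch, not the 3DMA-Neuron implementation; the dict-of-children tree representation and all names are assumptions.

```python
# Illustrative breadth-first centrifugal labeling producing B.t.l.i labels.
from collections import deque, defaultdict

def label_tree(children, root, arbor, tree_id):
    """children: dict mapping a branch segment to its centrifugal child segments."""
    labels = {}
    per_order = defaultdict(int)   # running index i within each branch order l
    queue = deque([(root, 1)])     # segments exiting the soma have branch order 1
    while queue:
        seg, order = queue.popleft()
        per_order[order] += 1
        labels[seg] = f"{arbor}.{tree_id}.{order}.{per_order[order]}"
        for child in children.get(seg, []):
            queue.append((child, order + 1))
    return labels

# One basal tree: the root splits into two daughters; one daughter splits again.
children = {"r": ["d1", "d2"], "d2": ["d3", "d4"]}
print(label_tree(children, "r", arbor="b", tree_id=1))
```

Running each root of the apical and basal arbors through such a traversal yields the unique B.t.l.i label for every branch segment.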
We utilize minimum kink angle and branch length criteria in doing so (Koh, 2001).

3.3 Global Registration of the Fine-Scale Results. The fine-scale analysis allows only local labeling of all dendritic segments and spines found in each tile. Global labeling of the fine-scale results requires registration of the tile analysis with the dendritic mapping analysis of Sc (and hence of S). This registration is done with the same contraction operation used to produce Sc. Let D denote the (fine-scale) midline of a (presumed) dendritic segment analyzed in a single tile. (D may, in fact, correspond to cellular debris or to an extraneous skeleton segment at a tile edge.) Let Dc be its coarsened representation, obtained by applying the contraction operation of section 3.2 to D. We attempt to register Dc with one (or more) dendrite midline segments in Sc, allowing for the possibility of incomplete voxel alignments. Overlaps (voxels having the same coordinates) between Dc and midline segments of Sc are recorded. If any overlap is found, alignment is checked as follows. Assume Dc overlaps with branch midline segment B. If segment Dc is shorter than B, Dc is further divided into three contiguous segments
Dc1, Dc2, Dc3 of effectively equal length. If overlaps between Dci and B exist for at least two of the three segments, Dc is considered to align with B. (If B is shorter than Dc, B is divided into three segments, and overlaps of each segment with Dc are checked.) If no overlapping, aligning branch in Sc is found for Dc, it is assumed to correspond to cellular debris or a tile boundary artifact, and it, and any spine candidates that may have been associated with it, are deleted. Since tile boundaries may cut through dendritic branches, a single branch B in Sc may register with a sequence of dendritic segments from different tiles. In this case, the fine-scale data for each dendritic segment are registered appropriately along B. A final level of manual control is inserted at the end of the global registration step. The GUI provides a facility by which the user can scan 2D projected images of each fine-scale tile showing the final results of the automated identification and global registration of dendritic segments and spines. Via mouse click, the user can manually remove dendritic segments and spines deemed to be false-positive signals. False-positive signals in a small subregion of the image are illustrated in Figure 3. The two false positives, marked C and D, require manual intervention.

3.4 Morphological Measurements. Recorded for each dendritic branch segment are its global label, total length, running diameter, number of spines, spine density, mean spine length, and numbers of spines of each classification type. Recorded for each spine are the global label of the dendritic branch segment on which it lies; its global position on the dendritic backbone; its length, volume, and head and neck diameters; and its shape classification (thin, mushroom, stubby) (Harris et al., 1992; Nimchinsky et al., 2002). Implementation of shape classification is discussed in Koh et al. (2002).
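The segment-alignment test of section 3.3 — split the shorter of the two coarsened midlines into three contiguous near-equal pieces and accept alignment when at least two of the three pieces share a voxel with the other midline — can be sketched as follows. The ordered-list-of-voxels representation and all names are illustrative assumptions, not the package's data structures.

```python
# Illustrative sketch of the two-of-three-subsegments alignment check.
def thirds(segment):
    """Split an ordered voxel list into three contiguous near-equal pieces."""
    n = len(segment)
    a, b = n // 3, 2 * n // 3
    return [segment[:a], segment[a:b], segment[b:]]

def aligns(dc, branch):
    """True if the two midlines overlap in at least two of three sub-segments."""
    shorter, longer = (dc, branch) if len(dc) <= len(branch) else (branch, dc)
    longer_set = set(longer)
    hits = sum(1 for piece in thirds(shorter) if any(v in longer_set for v in piece))
    return hits >= 2

branch = [(x, 0, 0) for x in range(9)]       # a coarsened midline in Sc
good = [(x, 0, 0) for x in range(6)]         # overlaps branch in all thirds
debris = [(x, 5, 5) for x in range(6)]       # no overlap at all
print(aligns(good, branch), aligns(debris, branch))  # True False
```

A tile segment whose `aligns` test fails against every branch of Sc would be discarded as cellular debris or a tile boundary artifact, as the text describes.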
Spine volume is currently an estimation, determined as the sum of the volumes of all voxels comprising the spine. This method of estimating spine volume has not yet been verified for accuracy. The accuracy of such an estimation depends strongly on the relative sizes of the voxel volume versus the microscope focal volume. (The focal volume for this image is estimated as 0.139 μm × 0.139 μm × 0.236 μm.) The dendritic branching morphology is output in SWC format (Cannon, Turner, Pyapali, & Wheal, 1998). More detailed output, including summary details about the spine shapes on each branch, is generated in NeuroML format (Goddard et al., 2001, 2002). These formats allow the data to be easily incorporated into compartmental modeling software such as NEURON (Hines, 1994) and GENESIS (Bower & Beeman, 1994).

4 Results

For the entire neuron studied here, execution of the three morphology analysis steps required 26 hours of computing time on a Pentium III 500 MHz
Figure 3: xy-projections of a subregion showing (a) raw data and (b) dendritic backbones and spines (colored black) detected by the automated algorithms. Indicated are instances of cellular debris (C) and of dendritic segment tips (D), which have been falsely identified as spines and must be manually deleted.
PC running RedHat Linux 6.2. (This does not include deconvolution time.) Most of this time was spent in the segmentation (34%) and spine detection (53%) algorithms. Employing a coarsening factor of f = 5 produced a few topology differences between S and Sc. In Sc, one axon collateral merged at two points with two separate dendritic branches of the apical arbor. Since axons were deleted, however, this had no consequences. In the basal arbor, there are a couple of places where branches merge under contraction. No fine-scale information is lost during registration, but branch labels are based on the topology of Sc. Manual intervention for the analysis of this neuron consisted of the following. During the dendritic branching determination step, all axon collaterals in the image were deleted. During the final tile-by-tile visual review, of the 2646 spines detected on the entire neuron, 76 were manually removed as definitely false positive.
Our primary interests are to validate the results of the automated analysis of this neuron against manual analysis of the same data and to compare these automated results with published results based on computer-aided manual analysis of similar neurons. We discuss the dendritic and spine morphometry results separately.

4.1 Dendritic Morphometry. The basal arbor of the neuron studied consisted of nine dendritic trees; the apical arbor consisted of a single tree. Following convention (DeFelipe & Fariñas, 1992; Feldman, 1984), the segments of the apical tree were manually reclassified into a primary shaft (assigned branch order 1), oblique subtrees, and terminal tuft branches. There were 10 oblique subtrees and 2 branches in the terminal tuft. To validate our automated dendrite tracing algorithms, a limited manual analysis of the dendritic branches was performed on the neuron image. The manual analysis was performed by looking at the principal 2D projections of individual tiles in the image. Manual determinations of the number of dendritic segments (NDS), number of dendritic branch points (NBP), and maximum branch order (MBO), for the basal and apical arbors separately, are compared with the automated determinations in Table 1. The automated results identify three fewer dendritic segments in the apical arbor and two more segments in the basal arbor. This is then reflected in the manual versus automatic differences observed in NBP and MBO. Figure 4 displays the automated and manual differences in branch count as a function of branch order. Two mechanisms are responsible for the differences seen in the apical arbor, where branch count differences occur in three places. The first difference occurs along the apical shaft, where the manual interpretation is that the parent shaft splits into three daughter segments (one of which is labeled as the continuation of the shaft).
The automated algorithm, however, detects the parent shaft splitting into two daughters, with one of these daughter branches (the continuation of the shaft) proceeding for only a very short length before splitting further into two branches (one of which is again labeled as the continuation of the shaft). The other two places are near branch points where the manual method identifies an oblique dendrite parent branch splitting into two daughters. In both cases, a break occurs in the parent branch just before the branch point. The automated method cannot bridge the break and concludes that the dendrite trace terminates at the break. The reason for these breakages is discussed below. Note that these two mechanisms oppose each other; compared to the automated methods, one leads to fewer manual counts, the other to more manual counts. Differences in counts observed on the basal arbor are due to these two mechanisms, and a third: topological differences between S and Sc induced by close dendritic processes. Table 1 presents comparisons on NDS, NBP, MBO as well as total dendritic length (TDL), and average dendritic segment length (DSL) with published results on pyramidal neurons in rhesus macaque (Duan et al., 2003),
Table 1: Comparison of Automated and Manual Results from This Study with Published Results on Similar Pyramidal Neurons in Monkeys and Humans.

Species                   NDS           TDL (mm)      NS             SD (µm⁻¹)
Apical
  Rhesus macaque^a        41            1.86          1219           0.65
  Rhesus macaque^b        44            --            --             --
  Rhesus macaque^c        30.0 (3.7)    1.92 (0.15)   791.3 (51.7)   0.41 (0.02)
  Patas monkey^d          28.3 (7.0)    1.76 (0.21)   738 (63)       0.42 (0.01)
  Long-tailed macaque^e   32.8 (2.4)    1.85 (0.13)   --             --
Basal
  Rhesus macaque^a        57            2.05          1278           0.62
  Rhesus macaque^b        55            --            --             --
  Rhesus macaque^c        51.8 (4.1)    2.82 (0.16)   1078 (70.5)    0.38 (0.03)
  Patas monkey^d          48.8 (2.7)    2.45 (0.32)   955 (80)       0.39 (0.03)
  Long-tailed macaque^e   42.6 (2.0)    2.67 (0.15)   --             --
  Human^f                 49.0 (1.0)    3.30 (0.10)   903 (57)       0.21 (0.01)

Species                   NBP           MBO           DSL (µm)       SDS (µm⁻¹)
Apical
  Rhesus macaque^a        20            12            45.4 (6.3)     0.729 (0.090)
  Rhesus macaque^b        21            11            --             --
  Long-tailed macaque^e   15.5 (1.2)    9.10 (0.53)   58.5 (2.5)     0.586 (0.030)
Basal
  Rhesus macaque^a        24            6             36.0 (3.5)     0.545 (0.060)
  Rhesus macaque^b        23            6             --             --
  Long-tailed macaque^e   18.6 (0.9)    5.77 (0.25)   64.8 (1.9)     0.633 (0.024)

Notes: All values represent mean (S.E.M.) per neuron except for SDS, which is per dendritic segment. NDS: number of dendritic segments; TDL: total dendritic length; NS: number of spines; SD: spine density; NBP: number of branch points; MBO: maximum branch order; DSL: dendrite segment length; SDS: spine density per segment. ^a This study, automated. ^b This study, manual. ^c Duan et al. (2003). ^d Page et al. (2002). ^e Soloway et al. (2002). ^f Jacobs et al. (2001).
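The SD column of Table 1 can be cross-checked against the NS and TDL columns, since spine density per micrometer is simply spine count divided by total dendritic length (converted from mm to µm). A small sketch of that arithmetic, with a hypothetical helper name:

```python
# Consistency check on Table 1: spine density (SD, per micrometer) should
# equal number of spines (NS) divided by total dendritic length (TDL),
# with TDL converted from mm to micrometers. Values are the automated
# results from Table 1; the function name is illustrative.

def spine_density_per_um(ns: int, tdl_mm: float) -> float:
    return ns / (tdl_mm * 1000.0)

apical = spine_density_per_um(1219, 1.86)  # apical arbor, automated
basal = spine_density_per_um(1278, 2.05)   # basal arbor, automated
print(round(apical, 3), round(basal, 3))
```

This gives roughly 0.655 and 0.623 per µm; Table 1 reports 0.65 and 0.62, the small apical difference being attributable to the table's TDL and SD each being independently rounded.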
patas (Page et al., 2002), and long-tailed macaque (Soloway, Pucak, Melchitzky, & Lewis, 2002) monkeys, and in humans (Jacobs et al., 1997, 2001). Except as noted below, all neurons in the comparison studies are STC layer III pyramidal neurons having ipsilateral projections to the prefrontal cortex area 46. In Duan et al., the results are averaged over 19 neurons from two adult male and three adult female macaque monkeys, ages 24 to 25 years. In Page et al., the results are averaged over three neurons from three adult (21–25 years) patas monkeys. The three neurons are from either layer III or layer V. In Soloway et al., individual results are averages of between 29 and 56 neurons having ipsilateral projections into prefrontal areas 9 or 46. The neurons were taken from young adult long-tailed macaque monkeys. Approximately 75% of the neurons were from the supragranular layers (II, III) of the STC; the remainder were from the infragranular layers (V, VI). In Jacobs et al., the results are averages over supragranular pyramidal neurons from Brodmann's area 22 in neurologically normal human subjects, of mean age 30 (±17) years. One hundred neurons were studied: 10 neurons each from 10 subjects. The projections of these neurons are unknown. Only basal arbors were examined in the Jacobs et al. (2001) study.

Figure 4: Number of dendritic segments of each branch order for the neuron shown in Figure 2 for the (a) oblique branches of the apical arbor and (b) basal arbors, as determined by manual and automated methods. Apical oblique branch orders are numbered beginning with order 2; the apical shaft is order 1.

For the apical arbor, both manual and automated counts of NDS are higher than reported in the literature, with the consequence that NBP and MBO are also high. For the basal arbor, our results for NDS and NBP are slightly high (though reasonable agreement exists on NDS with the study of Duan et al.), agreement of MBO is very good, and our results for TDL and DSL are low. Figures 5a and 5b compare TDL as a function of branch order for the apical and basal arbors against data reported in Duan et al. (2003). There is unfortunately a difference in definition of branch order. In Duan et al., if a parent of order n splits into two daughters of roughly equal diameter and equal angle of divergence from the parent, both daughters are assigned order n + 1. However, if one daughter continues relatively straight with respect to the parent and is judged to be of roughly equal diameter to the parent, that daughter branch remains order n, whereas the other(s) become n + 1. In the algorithm employed in our automated routines, for the trees in the basal arbor, all daughter branches of a branch order n parent are assigned order
Figure 5: Total dendritic length (a,b) and Sholl analysis (c,d) versus branch order for the apical (a,c) and basal (b,d) arbors obtained for the neuron shown in Figure 2, compared to results from Duan et al. (2003) for macaque monkeys of similar age. The Duan et al. plots show mean values; error bars denote ± S.E.M. Sholl spheres were analyzed at radius increments of 20 µm.
n + 1. For the apical arbor, however, we have manually reclassified branch segments (and hence branch order) as either apical shaft, apical oblique, or terminal tuft. Thus, until the terminal tuft is reached, if an apical shaft parent branch (order 1) splits, one daughter is identified as (the continuation of the) apical shaft (branch order 1), and the other daughter(s) are identified as apical oblique branches (branch order 2). When an apical oblique branch of order n splits, all daughters are assigned order n + 1. Consequently, we observe reasonable agreement for TDL versus branch order with the results of Duan et al. (2003) for the apical arbor (where there is some consistency in
how branch order is assigned) but very little agreement on the basal arbor data. This comparison appears to emphasize the importance of the choice of branch labeling scheme. To avoid the branch order designation differences, Figures 5c and 5d compare Sholl plots (Sholl, 1953) for the apical and basal arbors against data reported in Duan et al. (2003). The results are now qualitatively similar. The major differences consist of a higher peak and a more rapid “tailing off” in the Sholl plots for our neuron. Comparison of our Figure 2 with Figure 3 of Duan et al. (2003) does show qualitative differences between our neuron and representative examples of those studied in Duan et al. Thus, the differences are partly a consequence of the fact that our sample consists of a single neuron. However, the more rapid drop-off in the dendritic density at higher radius (and the low value measured for the TDL in Table 1) of both arbors results from breakages in dendrites. Breakage is caused by two mechanisms. The first is very slight rotations of the specimen mount that occurred when different stacks were imaged during the period of time over which the entire neuron image was collected. Such a rotation is demonstrated in Figure 6. Rotations prevent exact alignment of all dendritic segments crossing between two neighboring tiles, with the result that branch breakage occurs, preventing automated branch tracing across the break. Breakage also occurs where intensity variation in thin processes due to nonuniform dye filling results in breaks in the segmented image. Such errors tend to occur in thin axon collaterals and near the tips of dendritic branches, leading to the observed rapid loss of dendritic length with radius. Figure 7 shows the final reconstruction of dendrites and spines for this neuron. Arrows indicate where dendrite truncation occurred due to these two mechanisms. Filling and segmentation errors are in the majority.
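The two branch-order labeling conventions contrasted above can be made concrete with a small sketch: in the scheme used here for basal arbors every daughter is incremented, whereas in the Duan et al. (2003)-style scheme a daughter judged to continue the parent keeps the parent's order. The toy tree and all names below are illustrative assumptions, not the authors' data structures.

```python
# Two branch-order labeling schemes on a toy dendritic tree.
# Each node maps to its list of children; 'continues' marks the daughter
# judged to continue its parent (used only by the continuation scheme).

tree = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [], "a2": [],
}
continues = {"a": True}  # daughter 'a' continues 'root'

def label_all_daughters(tree, node="root", order=1, out=None):
    """Scheme used here for basal arbors: every daughter gets order n + 1."""
    out = {} if out is None else out
    out[node] = order
    for child in tree[node]:
        label_all_daughters(tree, child, order + 1, out)
    return out

def label_continuation(tree, node="root", order=1, out=None):
    """Duan et al.-style: a continuing daughter keeps the parent's order."""
    out = {} if out is None else out
    out[node] = order
    for child in tree[node]:
        child_order = order if continues.get(child) else order + 1
        label_continuation(tree, child, child_order, out)
    return out

print(label_all_daughters(tree))  # orders: root=1, a=2, b=2, a1=3, a2=3
print(label_continuation(tree))   # orders: root=1, a=1, b=2, a1=2, a2=2
```

The same tree thus acquires different per-order totals under the two schemes, which is why the TDL-versus-order comparisons in Figures 5a and 5b are sensitive to the labeling convention.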
Figure 8 presents our measurements on mean dendrite diameter as a function of branch order. To our knowledge, data on comparable species are not available in the literature. These data should be relatively unaffected by branch truncation errors, as such errors will reduce sample size and perhaps result in slight overestimation of average diameter at higher branch order values. Dendrite diameter decreases sharply across low branch orders and continues to taper gradually throughout the dendritic arbor. These observations are consistent with dendritic morphology reported elsewhere (DeLima et al., 1990; Feldman, 1984).

4.2 Dendritic Spine Analysis. The spine detection and characterization algorithms used here have been shown (Koh et al., 2002) to be consistent with manual analysis. The lower signal-to-noise ratio in our data as compared to those of Koh et al., where the imaging involved two-photon laser-scanning microscopy of cells tagged with fluorescent proteins via ballistic transfection, motivated a manual count of a sampling of spines contained in our data. We performed a manual spine count on two distinct tiles shown (in
Figure 6: xy and xz projective views of a region in Figure 2. Arrows indicate boundaries between imaged regions joined after tile registration. A very small rotation in the right-hand tile prevents complete alignment of dendritic segments. The misalignment is most noticeable in the xz projection.
Figure 7: Final reconstruction of the neuron dendrites and spines in Figure 2. Open arrows indicate places where dendrites are truncated due to segmentation breakage; solid arrows indicate where minor rotations in the image tiles prevent accurate tile registration, with the consequence of dendrite breakage.
xy-projection) in Figure 9—one from the proximal section of one of the basal trees, in which all dendrites and spines are clearly visible, and one from a distal section of the apical tree, in which some of the dendrites and spines appear faint. Spines appearing in the xy-projection were counted manually. To distinguish cellular debris from true spines, the xz- and yz-projections were inspected as well. Table 2 summarizes the results. The 90% agreement
Figure 8: Mean diameter (in µm) of the apical and basal arbors at each branch order, for the neuron of Figure 2.
in spines counted in common for the proximal tile is consistent with results reported in Koh et al. (2002). The lower agreement for the distal tile reflects reduced performance, probably in both methods, when spines and dendrites appear faint. Table 1 compares our automated results on (per neuron) average number of spines and spine density with results from Duan et al. (2003), Page et al. (2002), and Jacobs et al. (2001). Table 1 also summarizes the comparison of spine density per dendrite segment with the results of Soloway et al. (2002). Our total spine numbers and spine densities are higher, except for the per-segment density in the basal arbors of the neurons studied in Soloway et al.

Table 2: Comparison of Automated and Manual Spine Counts for Two Tiles Located Proximally and Distally to the Soma.

Tile       Average   Common      Automated Only   Manual Only
Proximal   228       205 (90%)   31 (14%)         15 (6.6%)
Distal     111       88 (79%)    24 (22%)         22 (20%)

Notes: Average: average of automated and manual counts. Common: number of spines identified by both methods. Automated only: additional spines identified by automated method only. Manual only: additional spines identified by manual method only. Percentages are relative to the average number of counts.

Figure 9: xy-projections of tiles (a) proximal to and (b) distal from the soma used for comparison of manual and automated spine counts.

Figures 10a and 10b compare total spine count as a function of branch order for the apical and basal arbors, respectively, with the results from Duan et al. Direct comparison is difficult due to the different methods in assignment of branch orders, as discussed for the TDL results of Figures 5a and 5b. Based on that discussion, we would expect better agreement for the apical arbor (except for possibly a more rapid fall-off at higher branch order due to the truncation effects discussed above) than for the basal. Indeed we see greater qualitative agreement for the apical arbor, though with higher spine counts recorded for the automated method.
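The agreement percentages in Table 2, each reported relative to the average of the automated and manual totals, reduce to simple arithmetic. A sketch (the function is hypothetical, the counts are the proximal-tile values from Table 2):

```python
# Reproducing the Table 2 agreement percentages, which are reported
# relative to the average of the automated and manual totals.

def agreement(common, auto_only, manual_only):
    automated = common + auto_only   # total spines found by the automated method
    manual = common + manual_only    # total spines found by the manual count
    avg = (automated + manual) / 2
    pct = {name: round(100 * n / avg, 1)
           for name, n in [("common", common),
                           ("automated_only", auto_only),
                           ("manual_only", manual_only)]}
    return avg, pct

# Proximal tile from Table 2: 205 common, 31 automated-only, 15 manual-only.
avg, pct = agreement(205, 31, 15)
print(avg, pct)  # average 228.0; ~90% of spines identified in common
```

The same call with the distal-tile counts (88, 24, 22) reproduces the table's average of 111 and 79% common agreement.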
Figure 10: Morphological measurements on the spines obtained for the neuron shown in Figure 2 compared to results from Duan et al. (2003) for macaque monkeys of similar age. (a, b) Spine count for the apical and basal arbors, respectively. (c, d) Spine density for the apical and basal arbors, respectively. Plots show mean ± S.E.M. values.
Comparison of spine density with branch order is shown in Figures 10c and 10d. Note that spine density results will not be as affected by truncated dendrite branches, though differences in branch order numbering schemes will play some role. For the apical arbor, our data suggest a decrease in spine density with branch order, whereas the data of Duan et al. (2003) suggest a relatively constant density with branch order. For the basal arbor, both data
Algorithms for Morphometry of Neuronal Dendrites
1375
sets suggest an increase in spine density with branch order, though the rise is faster in our neuron. This difference can be explained by the difference in branch order labeling methods. If spine densities increase on daughter branches relative to their common parent, then labeling the parent and one daughter with the same branch order number, as in Duan et al. (2003), will average densities, lowering the rate of increase seen with branch order. We note the results of DeLima et al. (1990) and Feldman (1984), which report increasing spine density with branch order until distal branches are reached, where the density then declines. The automatic method detects many more spines (and therefore finds higher spine densities) than published manual methods. It has been documented (Anderson et al., 1995; Elston & Rosa, 1997; Feldman & Peters, 1979) that many current methods tend to underestimate spine counts. We note in particular that spine counts in the study of Duan et al. (2003) are obtained by examining 2D slices separated by 1 µm increments, whereas the automated spine analysis examines the entire data set in 3D at 0.1 µm resolution. Manual analysis of a subset of the data presented here reveals that automated and manual spine identifications can agree on 90% of the spines. Disagreement on remaining identifications can be attributed to small spines, which are difficult to identify manually; small cellular debris near dendrites, which cannot be clearly distinguished from spines; and segmentation artifacts in the automated method. Spine length and shape analyses have been reported in rodents (Harris & Stevens, 1988; Harris et al., 1992; Irwin et al., 2002; Nimchinsky et al., 2001; Wilson, Groves, Kitai, & Linder, 1983) but not in primates. We therefore present our results for spines on the apical shaft, apical oblique branches, and basal dendritic branches with no comparison against previous literature.
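The averaging effect of the continuation-style labeling described above can be illustrated numerically. The lengths and spine counts below are invented for the illustration; only the mechanism is from the text.

```python
# Numeric illustration (invented values) of why continuation-style branch
# labeling flattens the apparent rise of spine density with branch order:
# a continuing daughter's higher density is pooled into its parent's order.

parent = {"order": 2, "length_um": 50.0, "spines": 20}    # density 0.4 /um
daughter = {"order": 3, "length_um": 50.0, "spines": 40}  # density 0.8 /um

# Scheme used here: separate orders keep the two densities distinct.
d2 = parent["spines"] / parent["length_um"]
d3 = daughter["spines"] / daughter["length_um"]
print(d2, d3)  # 0.4 0.8 -- density doubles from order 2 to order 3

# Continuation scheme: the daughter shares order 2 with its parent, so the
# order-2 density becomes the length-weighted pooled value.
pooled = ((parent["spines"] + daughter["spines"])
          / (parent["length_um"] + daughter["length_um"]))
print(pooled)  # 0.6 -- the apparent increase across orders is damped
```

This is consistent with the observation that the rise of spine density with branch order appears faster under the all-daughters-incremented scheme than in the Duan et al. (2003) data.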
Figure 11 presents histograms of the spine length, shape classification, and volume results. Spines are categorized according to dendrite type and distance from the soma. For each category, the Shapiro-Wilk statistic (Shapiro & Wilk, 1965) was used to test the hypothesis that the spine length distributions are normal. The Shapiro-Wilk statistic rejects this hypothesis for every category, except for apical shaft spines less than 150 µm from the soma and for apical oblique spines less than 75 µm from the soma. Shape classification frequencies are uniformly distributed within each distance-dendrite type category. Mushroom shapes dominate, and thin spines are the least frequent. As noted in section 3.4, our volume estimation procedure, based on straight voxel counts, has not yet been validated. Based on exponential fits to the histograms, the results are consistent with an exponentially decreasing volume distribution for apical shaft spines 150 to 225 µm from the soma, for apical oblique spines more than 75 µm from the soma, and for basal spines. We have also analyzed the dependence on branch order of branch length, branch diameter, spine density, mean spine length, and mushroom spine
Figure 11: Measured distributions for spine length, shape, and volume obtained from the neuron shown in Figure 2. Spines are categorized according to dendrite branch type and distance from the soma.
shape fraction from our data using a one-way analysis of variance to test the hypothesis of no change of the dependent variable with branch order. This was done separately for data on the apical oblique and the basal dendrites. A post hoc Duncan's multiple-range test was applied to detect branch order groups with significantly different measures for each variable tested. Because this design analyzed one dependent measure at a time, a Bonferroni correction with α = 0.0125 was applied to ensure an experiment-wise α of 0.05. For the apical oblique dendrites, the analysis of variance revealed no significant variation in any of these measures with branch order. For the basal dendrites, the analysis of variance revealed significant variation in dendrite diameter [F(5,51) = 9.33, p < 0.0001] and spine density [F(5,51) = 10.02, p < 0.0001] with branch order. Duncan's test revealed no significant variation in branch diameter with branch order for dendrites in the basal arbor with branch order two and higher. The mean values for the basal dendrites
(see Figure 8) do, however, suggest a trend to smaller diameter. Duncan's test revealed a significant increase in spine density between proximal (one and two) and distal (four and higher) branch orders for dendrites in the basal arbor.

5 Discussion

Our results suggest that a next generation of automated tracing software containing feature recognition is a feasible, near-term goal. With respect to dendritic arbor tracing, the major hurdle is one of truncated branch traces due to breaks. The ability of the human eye to utilize nonlocal information to bridge a gap is a skill that is a challenge to duplicate using fast computer algorithms, where speed is usually attained by choosing to depend on only local information. The problem of tracing across segmentation-induced breaks has been satisfactorily handled in 2D tracing of neurite outgrowth from explants (Ford-Holevinski, Dahlberg, & Agranoff, 1986; Weaver, Pinezich, Lindquist, & Vazquez, 2003) and should be adaptable to segmentation-induced breaks in tracing dendrites in 3D. It may also be adaptable to handling breaks induced by minor rotation effects, though a resolution of this problem may more appropriately lie in data acquisition techniques.

There remain manual steps still to be automated. The retiling needed in section 3.1 is one such step. We believe we can construct such an algorithm that will produce nonoverlapping rectangular tiles that avoid coverage of regions devoid of neural processes, are relatively balanced as to the volume of neural processes each contains (load balancing), and minimize the number of dendritic segments cut by tile boundaries. It will be unlikely, at least in the near future, that manual intervention can be completely eliminated. (The reasonable goal is to reduce manual intervention to a minimum time involvement.) As mentioned, the GUI provides the ability to manually eliminate spine and dendrite identifications judged to be false-positive by a human observer.
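For reference, the one-way ANOVA F statistic and Bonferroni-corrected threshold used in the spine analysis of section 4.2 can be sketched in a few lines. The data below are toy branch-order groups, not the study's measurements, and the function name is hypothetical.

```python
# Pure-Python sketch of a one-way ANOVA F statistic, plus the Bonferroni
# threshold used above (alpha = 0.05 / 4 = 0.0125 for four dependent
# measures tested one at a time). Toy data only.

def one_way_anova_F(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    F = (ss_between / df_between) / (ss_within / df_within)
    return F, (df_between, df_within)

alpha_per_test = 0.05 / 4  # 0.0125, giving an experiment-wise alpha of 0.05

# Three hypothetical branch-order groups of spine densities (per um):
F, dof = one_way_anova_F([[0.30, 0.40, 0.35],
                          [0.50, 0.55, 0.60],
                          [0.70, 0.75, 0.80]])
print(round(F, 1), dof)  # a large F on (2, 6) degrees of freedom
```

The F value would then be compared against the F distribution at the Bonferroni-corrected per-test α, as in the [F(5,51), p < 0.0001] results reported above.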
It would be useful to include the possibility for manual false-negative corrections. Intensity variations between imaged tiles distal from and proximal to the soma result from adjustment of microscope settings to optimize the contrast of dendrite and spine intensities with that of the background. Such intensity variation throughout the image required manual adjustment of the two threshold parameters in the segmentation algorithm applied to each tile in the fine-scale analysis. We expect that this parameter adjustment can be automated by directly exploiting recorded microscope settings.

The major effort in this article is to demonstrate the applicability of automated dendrite and spine detection software and to validate results against current state-of-the-art, manual-aided analysis. We have made a determined effort to control comparison between similar cell types from the same species and age range. However, there remain a number of factors potentially contributing to disagreement between our automated results and results from the literature reported here. These include (1) a single-neuron sample size; (2) primate species variation; (3) age variation (the data in Soloway et al. (2002) are from young adult monkeys); (4) cells from different cortical layers (Page et al., 2002; Soloway et al., 2002); (5) staining method: intracellular injection (this study, Duan et al., 2002; Duan et al., 2003; Elston et al., 1999b; Page et al., 2002; Soloway et al., 2002) versus Golgi stain (Jacobs et al., 2001); (6) the method of labeling branch order; (7) lower manual spine counts, which can occur when spines hidden behind dendrites lie undetected, with many more spines detected by the 3D automated method; and (8) habituation-related biases in manual counting.

This work also demonstrates the potential for access to morphological measures not commonly accessed. Due to the involved manual reconstruction required, detailed morphology (length, volume, shape) of dendritic spines or detailed determinations of dendritic radius are not readily available in other primate studies. Automated analysis removes this restriction. We note finally that the morphological output from automated algorithms can be put into a form compatible with input required by compartmental modeling software, allowing data to be incorporated into a very detailed compartmental model with no additional manipulation.

Acknowledgments

We thank H. Duan for access to her macaque data (Duan et al., 2003). This work is supported by the NSF Division of Mathematical Sciences, grant DMS-0107893 (C.M.W., W.B.L.), NSF Division of Biological Infrastructure, grant DBI-0305799 (C.M.W.), and NIH grants MH58911 (P.R.F.), DC04632, MH06073 and RR16754 (S.L.W.). We thank S. Henderson for help with the confocal microscopy.

References

Al-Kofahi, K. A., Lasek, S., Szarowski, D. H., Pace, C. J., Nagy, G., Turner, J. N., & Roysam, B. (2002). Rapid automated three-dimensional tracing of neurons from confocal image stacks. IEEE Trans. Information Tech. Biomed., 6, 171–187.
Anderson, S. A., Classey, J. D., Condé, F., Lund, J. S., & Lewis, D. A. (1995). Synchronous development of pyramidal neuron dendritic spines and parvalbumin-immunoreactive chandelier neuron axon terminals in layer III of monkey prefrontal cortex. Neurosci., 67, 7–22.
Baer, S. M., & Rinzel, J. (1991). Propagation of dendritic spikes mediated by excitable spines: A continuum theory. J. Neurophysiol., 65, 874–890.
Bartesaghi, R., Severi, S., & Guidi, S. (2003). Effects of early environment on pyramidal neuron morphology in field CA1 of the guinea-pig. Neuroscience, 116, 715–732.
Bower, J. M., & Beeman, D. (1994). The book of GENESIS: Exploring realistic neural models with the GEneral NEural SImulation system. Berlin: TELOS/Springer-Verlag.
Cannon, R. C., Turner, D. A., Pyapali, G. K., & Wheal, H. V. (1998). An online archive of reconstructed hippocampal neurons. J. Neurosci. Methods, 84, 49–54.
Cohen, A. R., Roysam, B., & Turner, J. N. (1994). Automated tracing and volume measurements of neurons from 3-D confocal fluorescence microscopy data. J. Microsc., 171, 103–114.
DeFelipe, J., & Fariñas, I. (1992). The pyramidal neuron of the cerebral cortex: Morphological and chemical characteristics of the synaptic inputs. Prog. Neurobiol., 39, 563–607.
DeLima, A. D., Voigt, T., & Morrison, J. H. (1990). Morphology of the cells within the inferior temporal gyrus that project to the prefrontal cortex in the macaque monkey. J. Comp. Neurol., 296, 159–172.
Duan, H., Wearne, S. L., Morrison, J. H., & Hof, P. R. (2002). Quantitative analysis of the dendritic morphology of corticocortical projection neurons in the macaque monkey association cortex. Neuroscience, 114, 349–359.
Duan, H., Wearne, S. L., Rocher, A. B., Macedo, A., Morrison, J. H., & Hof, P. R. (2003). Age-related dendritic and spine changes in corticocortically projecting neurons in macaque monkeys. Cereb. Cortex, 13, 950–961.
Elston, G. N. (2000). Pyramidal cells of the frontal lobe: All the more spinous to think with. J. Neurosci., 20, RC95.
Elston, G. N., & Rockland, K. S. (2002). The pyramidal cell of the sensorimotor cortex of the macaque monkey: Phenotypic variation. Cereb. Cortex, 12, 1071–1078.
Elston, G. N., & Rosa, M. G. (1997). The occipitoparietal pathway of the macaque monkey: Comparison of pyramidal cell morphology in layer III of functionally related cortical visual areas. Cereb. Cortex, 7, 432–452.
Elston, G. N., & Rosa, M. G. (1998). Morphological variation of layer III pyramidal neurones in the occipitotemporal pathway of the macaque monkey visual cortex. Cereb. Cortex, 8, 278–294.
Elston, G. N., Tweedale, R., & Rosa, M. G. (1999a). Cellular heterogeneity in cerebral cortex: A study of the morphology of pyramidal neurones in visual areas of the marmoset monkey. J. Comp. Neurol., 415, 33–51.
Elston, G. N., Tweedale, R., & Rosa, M. G. (1999b). Cortical integration in the visual system of the macaque monkey: Large-scale morphological differences in the pyramidal neurons in the occipital, parietal and temporal lobes. Proc. R. Soc. Lond. Ser. B, 266, 1367–1374.
Elston, G. N., Tweedale, R., & Rosa, M. G. (1999c). Supragranular pyramidal neurones in the medial posterior parietal cortex of the macaque monkey: Morphological heterogeneity in subdivisions of area 7. NeuroReport, 10, 1925–1929.
Feldman, M. L. (1984). Morphology of the neocortical pyramidal neuron. In E. G. Jones & A. Peters (Eds.), Cerebral cortex (Vol. 1, pp. 123–200). New York: Plenum Press.
Feldman, M. L., & Peters, A. (1979). Technique for estimating total spine numbers on Golgi-impregnated dendrites. J. Comp. Neurol., 118, 527–542.
Ford-Holevinski, T., Dahlberg, T. A., & Agranoff, B. W. (1986). A microcomputer-based image analyzer for quantifying neurite outgrowth. Brain Res., 368, 339–346.
Gilbert, C. D., & Wiesel, T. N. (1992). Receptive-field dynamics in adult primary visual cortex. Nature, 356, 150–152.
Glantz, L. A., & Lewis, D. A. (2000). Decreased dendritic spine density on prefrontal cortical pyramidal neurons in schizophrenia. Arch. Gen. Psychiatry, 57, 65–73.
Goddard, N. H., Beeman, D., Cannon, R., Cornelis, H., Gewaltig, M. O., Hood, G., Howell, F., Rogister, P., De Schutter, E., Shankar, K., & Hucka, M. (2002). NeuroML for plug and play neuronal modeling. Neurocomputing, 44, 1077–1081.
Goddard, N. H., Hucka, M., Howell, F., Cornelis, H., Shankar, K., & Beeman, D. (2001). Towards NeuroML: Model description methods for collaborative modelling in neuroscience. Philos. Trans. R. Soc. B, 356, 1209–1228.
Harris, K. M., Jensen, F. E., & Tsao, B. (1992). Three-dimensional structure of dendritic spines and synapses in rat hippocampus (CA1) at postnatal day 15 and adult ages: Implications for the maturation of synaptic physiology and long-term potentiation. J. Neurosci., 12, 2685–2705.
Harris, K. M., & Stevens, J. K. (1988). Dendritic spines of rat cerebellar Purkinje cells: Serial electron microscopy with reference to their biophysical characteristics. J. Neurosci., 8, 4455–4469.
Herzog, A., Krell, G., Michaelis, B., Wang, J., Zuschratter, W., & Braun, K. (1997). Restoration of three-dimensional quasi-binary images from confocal microscopy and its application to dendritic trees. In C. J. Cogswell, J. Conchello, & T. Wilson (Eds.), Three-dimensional microscopy: Image acquisition and processing IV, Proceedings of SPIE. Bellingham, WA: SPIE.
Hines, M. L. (1994). The NEURON simulation program. In J. Skrzypek (Ed.), Neural network simulation environments (pp. 147–163). Norwell, MA: Kluwer.
Holmes, W. R. (1989). The role of dendritic diameter in maximizing the effectiveness of synaptic inputs. Brain Res., 478, 127–137.
Hübener, M., Schwarz, C., & Bolz, J. (1990). Morphological types of projection neurons in layer 5 of cat visual cortex. J. Comp. Neurol., 301, 655–674.
Irwin, S. A., Idupulapati, M., Gilbert, M. E., Harris, J. B., Chakravarti, A. B., Rogers, E. J., Crisostomo, R. A., Larsen, B. P., Mehta, A., Alcantara, C. J., Patel, B., Swain, R. A., Weiler, I. J., Oostra, B. A., & Greenough, W. T. (2002). Dendritic spine and dendritic field characteristics of layer V pyramidal neurons in the visual cortex of fragile-X knockout mice. Am. J. Med. Genet., 111, 140–146.
Irwin, S. A., Idupulapati, M., Harris, J. B., Crisostomo, R. A., Larsen, B. P., Kooy, F., Willems, P. J., Cras, P., Kozlowski, P. B., Swain, R. A., Weiler, I. J., & Greenough, W. T. (2001). Abnormal dendritic spine characteristics in the temporal and visual cortices of patients with fragile-X syndrome: A quantitative examination. Am. J. Med. Genet., 98, 161–167.
Algorithms for Morphometry of Neuronal Dendrites
1381
Jacobs, B., Driscoll, L., & Schall, M. (1997). Life-span dendritic and spine changes in areas 10 and 18 of human cortex: A quantitative Golgi study. J. Comp. Neurol., 386, 661–680. Jacobs, B., Schall, M., Prather, M., Kapler, E., Driscoll, L., Baca, S., Jacobs, J., Ford, K., Wainwright, M., & Treml, M. (2001). Regional dendritic and spine variation in human cerebral cortex: A quantitative Golgi study. Cereb. Cortex, 11, 558–571. Koh, Y. Y. (2001). Automated recognition algorithmsfor neural studies. Unpublished doctoral dissertation, Stony Brook University. Koh, I. Y. Y., Lindquist, W. B., Zito, K., Nimchinsky, E. A., & Svoboda, K. (2002). An image analysis algorithm for dendritic spines. Neural Comput., 14, 1283– 1310. Kolb, B., Gibb, R., & Gorny, G. (2003). Experience-dependent changes in dendritic arbor and spine density in neocortex vary qualitatively with age and sex. Neurobiol. Learn. Mem., 79, 1–10. Krichmar, J. L., Nasuto, S. J., Scorcioni, R., Washington, S. D., & Ascoli, G. A. (2002). Effects of dendritic morphology on CA3 pyramidal cell electrophysiology: A simulation study. Brain Res., 941, 11–28. Mainen, Z. F., & Sejnowski, T. J. (1996). Inuence of dendritic structure on ring pattern in model neocortical neurons. Nature, 382, 363–366. Maravall, M., Shepherd, G. M. G., Koh, I. Y. Y., Lindquist, W. B., & Svoboda, K. (in press). Experience-dependent changes in dendritic morphology of layer 2/3 pyramidal neurons during a critical period for developmental plasticity in rat barrel cortex. Cerebral Cortex. Nimchinsky, E. A., Hof, P. R., Young, W. G., & Morrison, J. H. (1996). Neurochemical, morphologic, and laminar characterization of cortical projection neurons in the cingulated motor areas of the macaque monkey. J. Comp. Neurol., 374, 136–160. Nimchinsky, E. A., Oberlander, A. M., & Svoboda, K. (2001). Abnormal development of dendritic spines in FMR1 knock-out mice. J. Neurosci., 21, 5139– 5146. Nimchinsky, E. A., Sabatini, B. L., & Svoboda, K. (2002). 
Structure and function of dendritic spines. Annu. Rev. Physiol., 64, 313–352. Oh, W. & Lindquist, W. B. (1999). Image thresholding by indicator kriging. IEEE Trans. Pattern Anal. Mach. Intell., 21, 590–602. Page, T. L., Einstein, M., Duan, H., He, Y., Flores, T., Rolshud, D., Erwin, J. M., Wearne, S. L., Morrison, J. H., & Hof, P. R. (2002). Morphological alterations in neurons forming corticocortical projections in the neocortex of aged Patas monkeys. Neurosci. Lett., 317, 37–41. Purpura, D. P. (1974). Dendritic spine dysgenesis and mental-retardation. Science, 186, 1126–1128. Ramon ´ y Cajal, S. (1893). Neue Darstellung vom histologischen Bau des Centralnervensystems. Arch. Anat. Entwickl. [Anatomisches Abteilung des Archiv fur ¨ Anatomie und Physiologie], 319–428. Rodriguez, A., Ehlenberger, D., Kelliher, K., Einstein, M., Henderson, S. C., Morrison, J. H., Hof, P. R., & Wearne, S. L. (2003). Automated reconstruction
1382
C. Weaver, P. Hof, S. Wearne, and W. Lindquist
of three-dimensional neuronal morphology from laser scanning microscopy images. Methods, 30, 94–105. Rumberger, A., Schmidt, M., Lohmann, H., & Hoffmann, K. P. (1998). Correlation of electrophysiology, morphology, and functions in corticotectal and corticopretectal projection neurons in rat visual cortex. Exp. Brain Res., 119, 375–390. Rusakov, D. A., & Stewart, M. G. (1995). Quantication of dendritic spine populations using image analysis and a tilting disector. J. Neurosci. Methods, 60, 11–21. Scheibel, A. B., Conrad, T., Perdue, S., Tomiyasu, U., & Wechsler, A. (1990). A quantitative study of dendrite complexity in selected areas of the human cerebral cortex. Brain Cogn., 12, 85–101. Scheibel, A. B., Paul, L. A., Fried, I., Forsythe, A. B., Tomiyasu, U., Wechsler, A., Kao, A., & Slotnick, J. (1985). Dendritic organization of the anterior speech area. Exp. Neurol., 87, 109–117. Seib, L. M., & Wellman, C. L. (2003). Daily injections alter spine density in rat medial prefrontal cortex. Neurosci. Lett., 337, 29–32. Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3–4), 591–611. Sholl, D. A. (1953). Dendritic organization in the neurons of the visual and motor cortices of the cat. J. Anat., 87, 387–406. Soloway, A. S., Pucak, M. L., Melchitzky, D. S., & Lewis, D. A. (2002). Dendritic morphology of callosal and ipsilateral projection neurons in monkey prefrontal cortex. Neuroscience, 109, 461–471. Stratford, K., Mason, A., Larkman, A., Major, G., & Jack, J. (1989). The modeling of pyramidal neurons in the visual cortex. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The computing neuron (pp. 296–321). Reading, MA: AddisonWesley. Surkis, A., Peskin, C. S., Tranchina, D., & Leonard, C. S. (1998). Recovery of cable properties through active and passive modeling of subthreshold membrane responses from laterodorsal tegmental neurons. J. Neurophysiol., 80, 2593– 2607. Tanzi, E. (1893). 
I fatti i le induzioni nell’odierna istologia del sistema nervoso. Riv. Sper. Freniatr. Med. Leg. Alien. Ment., 19, 419–422. Turner, J. N., Ancin, H., Becker, D. E., Szarowski, D. H., Holmes, M., O’Connor, N., Wang, M., Holmes, T., & Roysam, B. (1997). Automated image analysis technologies for biological 3D light microscopy. Int. J. Imag. Syst. Tech., 8, 240–254. Turner, J. N., Shain, W., Szarowski, D. H., Lasek, S., Dowell, N., Sipple, B., Can, A., Al-Kofahi, K., & Roysam, B. (2000). Three-dimensional light microscopy: Observation of thick objects. J. Histotechnol., 23, 205–217. Uylings, H. B. M., Ruiz-Marcos, A., & van Pelt, J. (1986). The metric analysis of three-dimensional dendritic tree patterns: A methodological review. J. Neurosci. Methods, 18, 127–151. Wallace, C., Hawrylak, N., & Greenough, W. T. (1991). Studies of synaptic structural modications after long-term potentiation and kindling: Context
Algorithms for Morphometry of Neuronal Dendrites
1383
for a molecular morphology. In M. Baudry & J. L. Davis (Eds.), Long-term potentiation: A debate of current issues (pp. 189–332). Cambridge, MA: MIT Press. Watzel, R., Braun, K., Hess, A., Scheich, H., & Zuschratter, W. (1995). Detection of dendritic spines in 3-dimensional images. Bielefeld: Springer-Verlag. Weaver, C. M. (2003). Automated morphology of neural cells. Unpublished doctoral dissertation, Stony Brook University. Weaver, C. M., & Lindquist, W. B. (2003). 3DMA-neuron users manual. Available on-line at: http://www.ams.sunysb.edu/»lindquis/3dma neuron/ 3dma neuron.html. Weaver, C. M., Pinezich, J. D., Lindquist, W. B., & Vazquez, M. E. (2003). An algorithm for neurite outgrowth reconstruction. J. Neurosci. Methods, 124, 197–205. Wilson, C. J. (1984). Passive cable properties of dendritic spines and spiny neurons. J. Neurosci., 4, 281–297. Wilson, C. J. (1988). Cellular mechanisms controlling the strength of synapses. J. Electron Microsc. Tech., 10, 293–313. Wilson, C. J., Groves, P. M., Kitai, S. T., & Linder, J. C. (1983). Three-dimensional structure of dendritic spines in the rat neostriatum. J. Neurosci., 3, 383–398. Received November 24, 2003; accepted January 7, 2004.
LETTER
Communicated by Misha Tsodyks
Computing and Stability in Cortical Networks

Peter E. Latham
[email protected]

Sheila Nirenberg
[email protected]
Department of Neurobiology, University of California at Los Angeles, Los Angeles, CA 90095-1763, U.S.A.
Cortical neurons are predominantly excitatory and highly interconnected. In spite of this, the cortex is remarkably stable: normal brains do not exhibit the kind of runaway excitation one might expect of such a system. How does the cortex maintain stability in the face of this massive excitatory feedback? More importantly, how does it do so during computations, which necessarily involve elevated firing rates? Here we address these questions in the context of attractor networks—networks that exhibit multiple stable states, or memories. We find that such networks can be stabilized at the relatively low firing rates observed in vivo if two conditions are met: (1) the background state, where all neurons are firing at low rates, is inhibition dominated, and (2) the fraction of neurons involved in a memory is above some threshold, so that there is sufficient coupling between the memory neurons and the background. This allows "dynamical stabilization" of the attractors, meaning feedback from the pool of background neurons stabilizes what would otherwise be an unstable state. We suggest that dynamical stabilization may be a strategy used for a broad range of computations, not just those involving attractors.

1 Introduction

Attractor networks—networks that exhibit multiple stable states—have served as a key theoretical model for several important computations, including associative memory (Hopfield, 1982, 1984), working memory (Amit & Brunel, 1997a; Brunel & Wang, 2001), and the vestibulo-ocular reflex (Seung, 1996), and determining whether these models apply to real biological networks is an active area of experimental research (Miyashita & Hayashi, 2000; Aksay, Gamkrelidze, Seung, Baker, & Tank, 2001; Ojemann, Schoenfield-McNeill, & Corina, 2002; Naya, Yoshida, & Miyashita, 2003).
A definitive determination, however, has been difficult, mainly because attractors cannot be observed directly; instead, inferences must be made about their existence by comparing experimental data with model predictions.

Neural Computation 16, 1385–1412 (2004) © 2004 Massachusetts Institute of Technology
To make these comparisons, it is necessary to have realistic models. Construction of such models has proved difficult because of what we refer to as the stability problem. The stability problem arises primarily because cortical networks are highly recurrent—a typical neuron in the cortex receives input from 5,000 to 10,000 others, most of which are nearby (Braitenberg & Schüz, 1991). While this high connectivity undoubtedly provides immense computational power, it also has a downside: it can lead to instabilities in the form of runaway excitation. For example, even mild electrical stimulation applied periodically can eventually lead to seizures (McIntyre, Poulter, & Gilby, 2002), and epilepsy, a sign of intrinsic instability, occurs in 0.5 to 1% of the human population (Bell & Sander, 2001; Hauser, 1997).

The severity of the stability problem lies in the fact that recurrent connections in attractor networks have to be strong enough to allow activity in the absence of input, but not so strong that the activity can occur spontaneously. Moreover, in areas where attractor networks are thought to exist, such as prefrontal cortex (Fuster & Alexander, 1971; Wilson, Scalaidhe, & Goldman-Rakic, 1993; Freedman, Riesenhuber, Poggio, & Miller, 2001) and inferior temporal cortex (Fuster & Jervey, 1982; Miyashita & Chang, 1988), the firing rates associated with attractor states are not much higher than those associated with background activity. The two differ by only a few Hz, with typical background rates ranging from 1 to 10 Hz and attractor rates ranging from 10 to 20 Hz (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995). This small difference makes a hard stability problem even harder, as there is almost no barrier preventing spontaneous jumps to attractors.
Indeed, in previous attempts to build realistic models of attractor networks (Amit & Brunel, 1997a; Wang, 1999; Brunel, 2000; Brunel & Wang, 2001), fine-tuning of parameters was required to support attractors at the low rates seen in vivo. This was mainly because firing rates in those models were effectively set by single neuron saturation—by the fact that neurons simply cannot fire for extended periods above a maximum rate. Here we propose a solution to the fine-tuning problem, one that allows attractors to exist at low rates over a broad range of parameters. The basic idea is to use natural interactions between the attractors and the background to limit firing rates, so that rates are set by network rather than single neuron properties. Limiting firing rates in this way may be a general computational strategy, not one used just by attractor networks. Thus, quantifying this mechanism in the context of attractor networks may serve as a general model for how cortical circuits carry out a broad range of computations while avoiding instabilities.

2 Reduced Model System

To understand qualitatively the properties of attractor networks, we analyze a model system in which the neurons are described by their firing rates. For simplicity, we start with a network that exhibits two stable equilibria: a
background state, where all neurons fire at low rates, and a memory state, where a subpopulation of neurons fires at an elevated rate. Our results, however, apply to multiattractor networks—networks that exhibit multiple memories—and we will consider such networks below. The goal of the analysis in this section is to obtain a qualitative understanding of attractor networks, in particular, how they can fire at low rates without destabilizing the background. We will then use this qualitative understanding to guide network simulations of multiattractor networks with spiking neurons (containing up to 50 attractors), and use those simulations to verify the results obtained from the reduced model.

We construct our reduced model attractor network in two steps. First, we build a randomly connected network of excitatory and inhibitory neurons that fire at a few Hz on average. Second, we pick out a subpopulation of the excitatory neurons (which we refer to as memory neurons) and strengthen the connections among them (see Figure 1). If the parameters are chosen properly, this strengthening will produce a network in which the memory neurons can fire at either an elevated rate or the background rate, resulting in a bistable network.
Figure 1: Schematic of network architecture. Excitatory and inhibitory neurons are synaptically coupled via the connectivity matrix J. The inhibitory neurons are homogeneous; the excitatory neurons break into two populations, memory and nonmemory. The coupling among memory neurons is higher than among nonmemory neurons by a factor of β(1 − f); the coupling from nonmemory to memory neurons is lower by a factor of βf. The memory neurons make up a fraction f of the excitatory population, so the decrease in coupling from nonmemory to memory neurons ensures that the total synaptic drive to both memory and nonmemory neurons is the same.
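The normalization claim in the caption can be checked numerically. The sketch below is an illustration, not the paper's implementation: the population size, coding fraction, and β value are arbitrary choices, and ξ is given exactly fN_E ones so the cancellation is exact rather than merely on average. It builds the extra memory-coupling term of equation 2.1a and verifies that each of its rows sums to zero.

```python
import numpy as np

N_E, f, beta = 1000, 0.2, 10.0   # population size, coding fraction, and
                                 # coupling strength: illustrative values only

# Memory-selection vector xi: the paper draws xi_i = 1 with probability f;
# here we use exactly f*N_E ones so the cancellation below is exact.
xi = np.zeros(N_E)
xi[: int(f * N_E)] = 1.0

# Extra excitatory-to-excitatory coupling implementing Figure 1:
# dJ[i, j] = beta / (N_E * f * (1 - f)) * xi_i * (xi_j - f)
dJ = beta / (N_E * f * (1 - f)) * np.outer(xi, xi - f)

# Memory -> memory entries are strengthened, nonmemory -> memory entries are
# weakened, and every row sums to zero: total synaptic drive is unchanged.
assert dJ[0, 0] > 0 > dJ[0, -1]       # rows 0..f*N_E-1 are the memory neurons
assert np.allclose(dJ.sum(axis=1), 0.0)
```

Rows belonging to nonmemory neurons (ξ_i = 0) are identically zero, so only the input to memory neurons is rearranged, never the total.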
To analyze the stability and computational properties of this network, we assume that, in equilibrium, the firing rate of a neuron is a function of only the firing rates of the neurons presynaptic to it. Then, near equilibrium, we expect Wilson and Cowan–like equations (Wilson & Cowan, 1972) in which the firing rates obey first-order dynamics. Augmenting the standard Wilson and Cowan equations to allow increased connectivity among a subpopulation of neurons (see Figure 1 and appendix A), we have

$$\tau_E \frac{d\nu_{Ei}}{dt} + \nu_{Ei} = \phi_E\left( J_{EE}\nu_E - J_{EI}\nu_I + \frac{\beta}{N_E f(1-f)}\,\xi_i \sum_j (\xi_j - f)\,\nu_{Ej} \right) \tag{2.1a}$$

$$\tau_I \frac{d\nu_I}{dt} + \nu_I = \phi_I\left( J_{IE}\nu_E - J_{II}\nu_I \right). \tag{2.1b}$$
Here, τE and τI are the excitatory and inhibitory time constants, νEi is the firing rate of the ith excitatory neuron (i = 1, …, NE), νE and νI are the average firing rates of the excitatory and inhibitory neurons, respectively, the Js are the average coupling coefficients among the excitatory and inhibitory populations, φE and φI are the average excitatory and inhibitory gain functions, respectively, β is the effective strength of the coupling among the memory neurons, NE is the number of excitatory neurons, and ξ is a random binary vector: ξi = 1 with probability f and 0 with probability 1 − f. The factor ξj − f ensures postsynaptic normalization: on average, the total synaptic strength to both memory and nonmemory neurons is the same. The ξ-dependent term in equation 2.1a is a standard one for constructing attractor networks (Hopfield, 1982; Tsodyks & Feigel'man, 1988; Buhmann, 1989).

The gain functions, φE and φI, play a key role in this analysis. These functions, which hide all the single neuron properties, have a natural interpretation: each one corresponds to the average firing rate of a population of neurons as a function of the average firing rates of the neurons presynaptic to it. They thus have standard shapes when plotted versus νE: they look like f-I curves—firing rate versus injected current (McCormick, Connors, Lighthall, & Prince, 1985)—that have been smoothed at low firing rates (Brunel & Sergi, 1998; Tiesinga, José, & Sejnowski, 2000; Fourcaud & Brunel, 2002; Brunel & Latham, 2003). A generic gain function is shown in Figure 2. This curve does not correspond to any particular neuron or class of neurons—it is just illustrative. There are two important aspects to its shape: (1) it has a convex region, and (2) the transition from convex to concave occurs at firing rates well above 10 to 20 Hz on the output side (that is, the transition occurs when φ ≫ 10–20 Hz). These properties are typical of both model neurons (Brunel & Sergi, 1998; Tiesinga et al., 2000; Fourcaud & Brunel, 2002; Brunel & Latham, 2003) and real neurons (McCormick et al., 1985; Chance, Abbott, & Reyes, 2002), and in the next two sections we will see how they connect to the problem of robust, low firing-rate attractors.
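As a concrete stand-in for the curve in Figure 2, one can use a smoothed sigmoid. The functional form and parameter values below are illustrative assumptions (the paper does not commit to a specific gain function); the check confirms the two shape properties just listed: a convex region at low rates, and a convex-to-concave transition whose output here is 50 Hz, well above 10 to 20 Hz.

```python
import numpy as np

def phi(x, v_max=100.0, theta=20.0, sigma=5.0):
    """Illustrative gain function; saturates at v_max ~ 100 Hz.
    All parameters are assumptions, not fit to any neuron."""
    return v_max / (1.0 + np.exp(-(x - theta) / sigma))

x = np.linspace(0.0, 60.0, 601)
d2 = np.diff(phi(x), 2)               # discrete second derivative

# Convex below the inflection at x = theta, concave above it
# (a small band around the inflection is excluded for numerical safety) ...
assert np.all(d2[x[1:-1] < 19.0] > 0) and np.all(d2[x[1:-1] > 21.0] < 0)
# ... and the transition occurs at an output of v_max/2 = 50 Hz >> 10-20 Hz.
assert abs(phi(20.0) - 50.0) < 1e-9
```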
Figure 2: Generic gain function versus average excitatory firing rate, νE. There are three distinct regimes: First, at small νE, the mean current induced by presynaptic firing is too low to cause a postsynaptic neuron to fire, so postsynaptic activity is due to fluctuations in the current. This is the noise-dominated regime, and here φ is convex. Second, for slightly larger νE, φ becomes approximately linear or slightly concave. Finally, for large enough νE, the firing rate saturates—at around 100 Hz for typical cortical pyramidal cells (McCormick et al., 1985). This gain function is a schematic and does not correspond to any particular neuron model.
Although equations 2.1a and 2.1b have a fairly broad range of applicability, they are not all encompassing. In particular, they leave out the possibility of instability via synchronous oscillations (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Hansel & Mato, 2001), and they ignore the effects of higher-order moments of the firing rate (Amit & Brunel, 1997a; Latham, 2002). We thus verify all predictions using large-scale simulations of synaptically coupled neurons.

As indicated in Figure 1, the network described by equations 2.1a and 2.1b contains a preferentially connected subpopulation. We are interested in determining under what conditions the network supports two states: a "memory" state in which this subpopulation fires at an elevated rate compared to the background, and a background state in which the subpopulation fires at the same rate as the background. Since it is the difference between the attractor and background firing rates that is important, we define a variable, m, that is proportional to this difference; for convenience, we let
the proportionality constant be 1/(1 − f):

$$m \equiv \frac{1}{1-f}\left[ \frac{1}{N_E f}\sum_i \xi_i \nu_{Ei} - \frac{1}{N_E}\sum_i \nu_{Ei} \right]. \tag{2.2}$$
The first expression inside the brackets is the average firing rate of the subpopulation; the second is the average firing rate of the whole population, νE. Thus, the average firing rate of the subpopulation is νE + (1 − f)m, and the background state corresponds to m = 0.

Intuitively, we expect that network dynamics should be governed by three variables: the firing rate associated with the memory state, m, the average excitatory firing rate, νE, and the average inhibitory rate, νI. In fact, if we average equations 2.1a and 2.1b over the index i, we can derive differential equations for these variables. As shown in appendix A, the averaged equations are

$$\tau_E \frac{dm}{dt} + m = \phi_E(J_{EE}\nu_E - J_{EI}\nu_I + \beta m) - \phi_E(J_{EE}\nu_E - J_{EI}\nu_I) \tag{2.3a}$$

$$\tau_E \frac{d\nu_E}{dt} + \nu_E = (1-f)\,\phi_E(J_{EE}\nu_E - J_{EI}\nu_I) + f\,\phi_E(J_{EE}\nu_E - J_{EI}\nu_I + \beta m) \tag{2.3b}$$

$$\tau_I \frac{d\nu_I}{dt} + \nu_I = \phi_I(J_{IE}\nu_E - J_{II}\nu_I). \tag{2.3c}$$
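As a quick sanity check on the definition in equation 2.2 (the firing rates below are random placeholders, and ξ is given exactly fN_E ones so the identity is exact rather than holding only on average):

```python
import numpy as np

rng = np.random.default_rng(0)
N_E, f = 1000, 0.2

xi = np.zeros(N_E)
xi[: int(f * N_E)] = 1.0                  # exactly f*N_E memory neurons
nu = rng.uniform(1.0, 20.0, size=N_E)     # placeholder firing rates (Hz)

nu_E = nu.mean()                                      # population average
m = ((xi * nu).sum() / (N_E * f) - nu_E) / (1 - f)    # equation 2.2

# The subpopulation's average rate is nu_E + (1 - f)*m, as stated in the text.
assert np.isclose(nu[xi == 1].mean(), nu_E + (1 - f) * m)
```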
To simplify our analysis, we adopt the effective response function approach of Mascaro and Amit (1999), which can be implemented by taking the limit of fast inhibition, τI → 0. The main caveat to this limit is that we may overestimate the stability of equilibria: equilibria can be stable when τI = 0 but unstable when τI is above some threshold (Wilson & Cowan, 1972; van Vreeswijk & Sompolinsky, 1996; Latham, Richmond, Nelson, & Nirenberg, 2000a; Hansel & Mato, 2001). With τI = 0, we can solve equation 2.3c for νI in terms of νE. This allows us to write νI = νI(νE), where the function νI(νE) is determined implicitly from equation 2.3c with τI = 0. Replacing νI with νI(νE) in equations 2.3a and 2.3b, we find that all the νE-dependence on the right-hand side of these equations can be lumped into the single expression JEE νE − JEI νI(νE). We denote this −γ(νE), so that

$$\gamma(\nu_E) \equiv -J_{EE}\nu_E + J_{EI}\nu_I(\nu_E). \tag{2.4}$$
With νI eliminated, we are left with just two equations:

$$\tau_E \frac{dm}{dt} + m = \phi_E(-\gamma(\nu_E) + \beta m) - \phi_E(-\gamma(\nu_E)) \equiv \Delta\phi_E(\nu_E, m) \tag{2.5a}$$

$$\tau_E \frac{d\nu_E}{dt} + \nu_E = \phi_E(-\gamma(\nu_E)) + f\,\Delta\phi_E(\nu_E, m). \tag{2.5b}$$
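The elimination of νI can be carried out numerically. In the sketch below, the gain function and coupling values are illustrative assumptions; νI(νE) is obtained by solving equation 2.3c with τI = 0 by bisection (the residual ν − φ(JIE νE − JII ν) is increasing in ν, so the root is unique), and γ then follows from equation 2.4.

```python
import numpy as np

def phi(x, v_max=100.0, theta=20.0, sigma=5.0):
    """Illustrative gain function (an assumption, not the paper's)."""
    return v_max / (1.0 + np.exp(-(x - theta) / sigma))

J_EE, J_EI, J_IE, J_II = 1.0, 1.0, 2.0, 1.0   # illustrative couplings

def nu_I_of(nu_E):
    """Solve nu_I = phi(J_IE*nu_E - J_II*nu_I), i.e. eq. 2.3c with tau_I = 0.
    nu - phi(...) is increasing in nu, so bisection on [0, v_max] finds the
    unique root."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid - phi(J_IE * nu_E - J_II * mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma(nu_E):
    """Equation 2.4: gamma(nu_E) = -J_EE*nu_E + J_EI*nu_I(nu_E)."""
    return -J_EE * nu_E + J_EI * nu_I_of(nu_E)

# The implicit equation is satisfied to numerical precision.
v = nu_I_of(10.0)
assert abs(v - phi(J_IE * 10.0 - J_II * v)) < 1e-9
```

With γ(νE) in hand, −γ(νE) is the effective input appearing in equations 2.5a and 2.5b.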
These equations are similar to ones derived previously by Brunel and colleagues (Amit & Brunel, 1997a; Brunel, 2000; Brunel & Wang, 2001).

2.1 The Sparse Coding Limit

The question we are interested in is: under what conditions do equations 2.5a and 2.5b admit two stable solutions—one with m = 0 (the background state) and one with m ≠ 0 (the memory state)? Before we answer this question in general, let us consider the sparse coding limit, f → 0, as this limit is relatively simple and it allows us to make contact with previous work. When f = 0, equations 2.5a and 2.5b decouple. In this regime, we can solve equation 2.5b for the excitatory firing rate, then solve equation 2.5a for m. Let us assume that equation 2.5b has a stable solution, meaning the network admits a stable background state, and focus on equation 2.5a. This equation is best solved graphically, by plotting ΔφE(νE, m) versus m and looking for intersections of this plot with the 45 degree line; those intersections correspond to equilibria (see Figure 3). Stability can be read off the
Figure 3: Gain functions and m-equilibria. (A) ΔφE(νE, m) versus m for different values of β. Points where the curves cross the 45 degree line are equilibria. In the sparse coding limit, f → 0, an equilibrium is stable if the slope of the curve is less than one and unstable otherwise. Dotted line: weak coupling (small β). The only equilibrium is at the background, and it is stable. Solid line: intermediate coupling. Two more equilibria have appeared. The one with intermediate m is unstable; the one with large m is stable. Dashed line: large coupling. The background has become destabilized, and the memory is easily activated without input. Since ΔφE(νE, m) is essentially an average single-neuron f-I curve, it saturates at the maximum firing rate of single neurons, ≈100 Hz. Thus, the upper equilibrium occurs at high rate. (B) To reduce the firing rate of the upper equilibrium, one must operate in a very narrow parameter regime: small changes in network parameters would either rotate the curve counterclockwise, which would destabilize the background and/or move the equilibrium to high rate, or rotate it clockwise, which would eliminate the memory state. (C) Blowup of the region below 15 Hz in B.
plots by looking at the slope: an equilibrium with slope less than 1 is stable and one with slope greater than 1 is unstable. The number and stability of the equilibria are determined by β. If β is small—weak coupling among the neurons in the subpopulation—there is only one equilibrium, at m = 0, and it is stable (see Figure 3A, dotted line). This makes sense, as weak coupling should not have much effect on network behavior. As β increases, so does the slope of ΔφE(νE, m), and eventually a new pair of equilibria appears (see Figure 3A, solid line). Of these, the one at higher firing rate (larger m) is stable, since its slope is less than 1, and the one at lower, but nonzero, rate is unstable, since its slope is greater than 1. Finally, for large enough β, the unstable equilibrium slides past zero, at which point the equilibrium at m = 0, and thus the background, becomes unstable (see Figure 3A, dashed line).

The intermediate β regime, where the network can support both a memory and a stable background, is of the most interest to us. It is in this regime that the network can actually compute, in the sense that input to the network controls which state it is in (memory or background). The low and high β regimes are not so interesting, however, at least if the goal is to construct a network that can store memories. If β is too low, the network is stuck in the background state; if it is too high, the network can easily jump spontaneously from the background to the memory state.

As Figure 3A shows, it is not hard to build a network that can exhibit two states—so long as one is willing to allow firing rates near saturation, meaning at a healthy fraction of 100 Hz. It is much more difficult to build a network in which the memory state exhibits low firing rates—in the 10 to 20 Hz range, as is observed experimentally (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995).
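The three regimes can be reproduced numerically. Everything below is an illustrative assumption (the sigmoidal gain, the background input of zero, and the three β values were chosen only so that each lands in one regime): equilibria are the solutions of ΔφE(m) = m, found as sign changes on a grid, and the background is stable when the slope of ΔφE at m = 0 is below 1.

```python
import numpy as np

def phi(x, v_max=100.0, theta=20.0, sigma=5.0):
    """Illustrative gain function; parameters are assumptions."""
    return v_max / (1.0 + np.exp(-(x - theta) / sigma))

def positive_equilibria(beta, c=0.0):
    """Roots m > 0 of Delta_phi(m) = m, with Delta_phi(m) = phi(c + beta*m)
    - phi(c), located as sign changes on a fine grid."""
    m = np.linspace(1e-6, 300.0, 300001)
    g = phi(c + beta * m) - phi(c) - m
    return m[np.nonzero(np.diff(np.sign(g)))[0]]

def slope_at_zero(beta, c=0.0, h=1e-4):
    """d(Delta_phi)/dm at m = 0; the background is stable if this is < 1."""
    return beta * (phi(c + h) - phi(c - h)) / (2 * h)

# Weak coupling: no equilibria besides the background (Fig. 3A, dotted line).
assert len(positive_equilibria(0.2)) == 0 and slope_at_zero(0.2) < 1
# Intermediate coupling: an unstable and a stable memory equilibrium appear,
# while the background stays stable (solid line).
assert len(positive_equilibria(1.0)) == 2 and slope_at_zero(1.0) < 1
# Strong coupling: the background itself is destabilized (dashed line).
assert slope_at_zero(3.0) > 1
```

Consistent with the text, the stable memory equilibrium in the intermediate regime sits near saturation with these assumed parameters, well above 10 to 20 Hz.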
This is because low firing rates require too many bends in ΔφE(νE, m) in too small a firing-rate range, where "too small" is relative to the saturation firing rate of ≈100 Hz (see Figures 3B and 3C). These qualitative arguments were quantified in networks of leaky integrate-and-fire neurons using both analytic techniques and numerical simulations (Brunel, 2000; Brunel & Wang, 2001). For attractors that fired at about 15 Hz, the coupling, β, among the subpopulations had to be tuned to within 1% to ensure that both the background and memories were stable. At slightly higher firing rates of 25 to 40 Hz, the tuning was somewhat more forgiving, 3% to 5%. It should be pointed out, however, that these networks were more robust to other parameters: multiple attractors were supported at reasonably low rates (20–40 Hz) when external input varied over a 40% range and the strengths of different receptor types (AMPA, NMDA, and GABA) were varied over a range of 5% to 15%.

2.2 Beyond the Sparse Coding Limit

What these results indicate is that parameters need to be finely tuned for attractor networks to exist at low rates, at least in the f → 0 limit. When f is finite, however, the picture
changes, since the memories (m) and background (νE) couple (see equation 2.5). This means that the slope of ΔφE(νE, m) with respect to m is no longer the sole factor determining stability; instead, stability depends on the interaction between m and νE. Consequently, an equilibrium in which the slope of ΔφE(νE, m) is greater than 1 can be stable. For example, the equilibrium on the solid curve in Figure 3A that occurs at about 15 Hz, which is unstable in the f → 0 limit, could become stable when f > 0. If this were to happen, attractors could exist robustly at low rate.

We refer to the regime in which the attractors are stable even though the slope of ΔφE(νE, m) is greater than 1 as the dynamically stabilized regime. This is because an equilibrium that would be unstable when one considers only the time evolution of the memory is dynamically stabilized by feedback from the background. Networks that operate in this regime have been investigated by Sompolinsky and colleagues (Rubin & Sompolinsky, 1989; Golomb, Rubin, & Sompolinsky, 1990). Although the firing rates of the attractors in those networks were low, the networks were not realistic in an important respect: they did not exhibit a background state in which all neurons fired at about the same rate; instead, the only stable states were ones in which a subpopulation of the neurons was active. Our goal here is to overcome this problem and build a network that both operates in the dynamically stabilized regime, and thus at low rates, and supports a low firing-rate background.

To do this, we must evaluate the stability of the equilibria in m-νE space. The network equilibria are found by setting dm/dt and dνE/dt to zero in equations 2.5a and 2.5b and solving the resulting algebraic equations. With the time derivatives set to zero, the solutions to these equations are curves in m-νE space. These curves are referred to as nullclines, and their intersections correspond to equilibria.
Constructing nullclines from the properties of φE is straightforward (Latham et al., 2000a) but tedious. We thus skip the details and simply plot them; once we have made the plots, it is relatively easy to see that they have the correct qualitative shape. One possible set of nullclines is shown in Figure 4A, with the black curve corresponding to the νE-nullcline (the solution to equation 2.5b with dνE/dt = 0) and the gray curves corresponding to the m-nullcline (the solution to equation 2.5a with dm/dt = 0; note that there are two pieces—a smooth curve and a vertical line). The shape of the νE-nullcline is relatively easy to understand: it is an increasing function of m, reflecting the fact that as m gets larger, there is more excitatory drive to the network. This can also be seen from equation 2.5b, where νE is coupled to m through the term f ΔφE(νE, m), and ΔφE(νE, m) is an increasing function of m.

The m-nullcline is a little more complicated, primarily because it consists of two pieces rather than one. To understand its shape, we must reexamine Figure 3A. This figure shows three plots of ΔφE(νE, m) versus m. These plots correspond to three values of β; however, because νE as well as β affects ΔφE(νE, m), they could just as easily have corresponded to three values of
νE. In fact, had we fixed β and varied νE, we would have obtained a set of curves qualitatively similar to the ones shown in Figure 3A. In other words, depending on the value of νE, we would have seen three distinct regimes: (1) one equilibrium at m = 0 (dotted line in Figure 3A), (2) one equilibrium at m = 0 and two at m > 0 (solid line), and (3) one equilibrium at m = 0, one at m < 0, and one at m > 0 (dashed line). These three regimes are reflected in the m-nullcline in Figure 4A: when νE is large, there is a single equilibrium at m = 0 (regime 1); when νE is intermediate, there is an additional pair of equilibria at m > 0 (regime 2); and when νE is small, one of the equilibria in that pair becomes negative (regime 3). The order in which the regimes appear in Figure 4A—one equilibrium at zero when νE is large, two positive ones when νE is slightly smaller,
Figure 4: Nullclines in various parameter regimes. The gray curve, including the vertical line, is the nullcline for equation 2.5a; the black curve is the nullcline for equation 2.5b. The solid branches of the gray curves are stable at fixed ν_E; the dashed branches are unstable at fixed ν_E. The black arrows indicate typical trajectories; they are derived from equation 2.5. (A) ΔΦ_E(ν_E, m) is a decreasing function of ν_E; f is reasonably large. (B) ΔΦ_E(ν_E, m) is an increasing function of ν_E; f is reasonably large. (C) ΔΦ_E(ν_E, m) is a decreasing function of ν_E; f ≪ 1. (D) ΔΦ_E(ν_E, m) is an increasing function of ν_E; f ≪ 1.
Computing and Stability in Cortical Networks
and a negative one when ν_E is smaller still, corresponds to a particular dependence of ΔΦ_E(ν_E, m) on ν_E. Examining Figure 3A, we see that this progression corresponds to a dependence in which ΔΦ_E(ν_E, m) decreases as ν_E increases. Thus, for Figure 4A to correctly describe the m-nullcline, the network parameters must be such that ΔΦ_E(ν_E, m) is a decreasing function of ν_E. Below, we derive explicit conditions under which this happens. First, however, we examine its consequences.

The black and gray nullclines in Figure 4 intersect in three places, and thus exhibit three equilibria. One is at m = 0, and two are at m > 0. The equilibrium at m = 0 corresponds to the background state; the other two are candidates for memory states. To determine which are stable, we have plotted a few typical trajectories (black arrows), which are derived from equation 2.5. These trajectories tell us that the stable equilibria occur at m = 0 and at the right-most intersection. Importantly, the stable equilibrium at m > 0 occurs on the unstable branch of the m-nullcline, in the dynamically stabilized regime. (We can tell that this branch is unstable because at fixed ν_E, the trajectories point away from it; flow is to the right below the m-nullcline and to the left above it. The ν_E-nullcline, on the other hand, consists of one stable branch, since flow is toward it at fixed m. This reflects the fact that the background is always stable at fixed m, independent of the firing rate of the memory state.) The unstable branches, which are drawn with dashed lines, correspond to points where the slope of ΔΦ_E(ν_E, m) is greater than one. That the memory state lives on the unstable branch of the m-nullcline is important, because the unstable branch corresponds to the intermediate equilibrium shown in Figure 3A and can thus occur at low firing rate.
All previous work in realistic networks that we know of (Amit & Brunel, 1997a; Brunel, 2000; Brunel & Wang, 2001) puts the memory state on the stable branch of the m-nullcline, the part with slope less than one. The stable branch corresponds to the upper equilibrium in Figure 3A, and so tends to occur at high firing rate, at some reasonable fraction of the saturation firing rate for single neurons. It may seem counterintuitive to have a stable equilibrium on the unstable branch, but it is well known that this can happen (Rinzel & Ermentrout, 1989). The only caveat is that there is no guarantee of stability: the equilibrium may become unstable via a Hopf bifurcation (Marsden & McCracken, 1976), leading to oscillations. In fact, oscillations were occasionally observed, mainly near or above the stability boundary in Figure 7. The fact that the network did not oscillate in most of the parameter regime explored was, we believe, because the background is strongly stable (there is strong attraction to the ν_E-nullcline in Figure 4).

What this analysis shows is that two conditions are necessary for building an attractor network in which the memory states occur on the unstable branch of the m-nullcline, and thus fire at low rates: ΔΦ_E(ν_E, m) must be a decreasing function of ν_E, and the fraction, f, of neurons involved in a memory must be large enough to give the ν_E-nullcline significant curvature.
What happens if either of these conditions is violated? Figure 4B shows the nullclines when ΔΦ_E(ν_E, m) is an increasing function of ν_E; in this regime, the m-nullcline turns upside down. A low-firing-rate equilibrium still exists, but it is unstable, as indicated by the black arrows. The stable memory state in this regime is at high firing rate, near single-neuron saturation, too high to be consistent with experiment. Figures 4C and 4D show nullclines when f ≪ 1. Again, the only stable memory states are near saturation, and thus at high rate.

Is the condition that ΔΦ_E(ν_E, m) be a decreasing function of ν_E satisfied for realistic networks? In other words, is it reasonable to expect ∂ΔΦ_E(ν_E, m)/∂ν_E < 0? To answer this, we use equation 2.5a to write

$$\frac{\partial \Delta\Phi_E(\nu_E, m)}{\partial \nu_E} = -\gamma'(\nu_E)\bigl[\Phi_E'(-\gamma(\nu_E) + \beta m) - \Phi_E'(-\gamma(\nu_E))\bigr], \tag{2.6}$$
where a prime after a function denotes a derivative with respect to its argument. The term in brackets on the right-hand side of equation 2.6 is typically positive for a memory lying on the unstable branch of the m-nullcline; this is because the unstable branch corresponds to the intermediate equilibrium in Figure 3A, where Φ_E is generally convex. Thus, the sign of ∂ΔΦ_E(ν_E, m)/∂ν_E is determined solely by the sign of γ'(ν_E). For typical cortical networks, it is likely that γ'(ν_E) > 0. That is because cortical networks operate in the high-gain regime, in which one excitatory action potential is capable of causing more than one excitatory action potential somewhere else in the network (Abeles, 1991; Matsumura, Chen, Sawaguchi, Kubota, & Fetz, 1996). The only way to stabilize such a system is to provide strong feedback from excitatory to inhibitory neurons, so that the inhibitory response to small increases in excitation dominates over the excitation (van Vreeswijk & Sompolinsky, 1996; Amit & Brunel, 1997b; Brunel & Wang, 2001). In terms of our network variables, this means that d(J_EI ν_I(ν_E))/dν_E must be greater than d(J_EE ν_E)/dν_E, which implies, via equation 2.4, that γ'(ν_E) > 0. Thus, the condition ∂ΔΦ_E(ν_E, m)/∂ν_E < 0 is naturally satisfied in cortical networks.

Finally, we point out that a necessary condition for a low-firing-rate memory and a stable background state is that the ν_E-nullcline intersect the m = 0 line above the exchange-of-stability point (the point where the two branches of the m-nullcline intersect), and then intersect twice more on the unstable branch of the m-nullcline. A sufficient condition for this to happen is that the slope of the ν_E-nullcline is less than the slope of the m-nullcline at m = 0. If that condition is satisfied, then β can be increased until the ν_E-nullcline is sufficiently far above the exchange-of-stability point to guarantee two intersections on the unstable branch.
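The sign argument around equation 2.6 can be checked numerically. In the sketch below, Φ is a generic convex increasing function and γ is linear with γ' > 0; both are illustrative stand-ins, not the paper's gain functions.

```python
import math

# With Delta_Phi(nu, m) = Phi(-gamma(nu) + beta*m) - Phi(-gamma(nu)),
# equation 2.6 gives d(Delta_Phi)/d(nu) = -gamma'(nu) * [positive bracket],
# so a convex increasing Phi and gamma' > 0 force the derivative negative.
beta, m = 1.0, 0.5

def Phi(x):
    # softplus: convex and increasing everywhere (a stand-in gain function)
    return math.log1p(math.exp(x))

def gamma(nu):
    # effective inhibition with gamma'(nu) = 0.5 > 0 (a stand-in)
    return 0.5 * nu

def delta_phi(nu):
    return Phi(-gamma(nu) + beta * m) - Phi(-gamma(nu))

eps = 1e-6
deriv = (delta_phi(1.0 + eps) - delta_phi(1.0 - eps)) / (2.0 * eps)
# deriv < 0, as the argument predicts
```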
The slopes of the two nullclines, which can be determined by implicitly differentiating equations 2.5a and 2.5b, are given by

$$\left.\frac{d\nu_E}{dm}\right|_{m\text{-nullcline}} = \frac{1}{\gamma'}\,\frac{\beta\Phi_E'(-\gamma + \beta m) - 1}{\Phi_E'(-\gamma + \beta m) - \Phi_E'(-\gamma)} \tag{2.7a}$$
$$\left.\frac{d\nu_E}{dm}\right|_{\nu_E\text{-nullcline}} = \frac{\beta f\,\Phi_E'(-\gamma + \beta m)}{1 + \gamma'\bigl[(1 - f)\Phi_E'(-\gamma) + f\,\Phi_E'(-\gamma + \beta m)\bigr]}. \tag{2.7b}$$
To calculate the slope of the m-nullcline at m = 0, it is necessary to take the m → 0 limit of equation 2.7a; that will ensure that we do not pick up the vertical piece of the m-nullcline, which has infinite slope. Using the fact that βΦ_E'(-γ + βm) = 1 at the exchange-of-stability point and Taylor expanding the numerator and denominator in equation 2.7a around m = 0, we see that the slope of the m-nullcline is equal to β/γ' at m = 0. A simple calculation then tells us that the slope of the ν_E-nullcline at m = 0 is a factor of f/(1 + 1/γ'Φ_E') smaller. This ensures that for some range of β, the nullclines will exhibit the set of equilibria shown in Figure 4A.

We can now use Figure 4 to understand why the two features listed in the beginning of section 2 (convexity over some range and transition to concavity at firing rates well above 10–20 Hz) are necessary if attractor networks are to exhibit robust, low-firing-rate equilibria in the dynamically stabilized regime. First, if gain functions are not convex over some finite range of m, then they will intersect the 45 degree line with a slope that is less than 1 (disregarding the trivial intersection at m = 0), which eliminates the possibility of operating in the dynamically stabilized regime (see Figure 5). Second, if the transition from a convex to a concave gain function occurs at an output rate that is less than, say, 20 Hz, then it would be possible for the equilibria in Figures 4B to 4D to occur at firing rates less than 20 Hz, which would imply that a stable, low-firing-rate equilibrium can exist on the stable branch of the m-nullcline, and thus not in the dynamically stabilized regime.
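The Taylor expansion behind the β/γ' slope can be written out in one line. Expanding equation 2.7a to first order in m, with Φ_E' and Φ_E'' evaluated at −γ, and using βΦ_E'(−γ) = 1 at the exchange-of-stability point:

```latex
\lim_{m\to 0}\left.\frac{d\nu_E}{dm}\right|_{m\text{-nullcline}}
= \frac{1}{\gamma'}\,
  \frac{\beta\bigl[\Phi_E' + \Phi_E''\,\beta m\bigr] - 1}
       {\bigl[\Phi_E' + \Phi_E''\,\beta m\bigr] - \Phi_E'}
= \frac{1}{\gamma'}\,\frac{\beta^{2}\Phi_E''\,m}{\beta\,\Phi_E''\,m}
= \frac{\beta}{\gamma'}\,.
```

The factor βΦ_E' − 1 in the numerator vanishes at the exchange-of-stability point, which is what leaves the ratio of the first-order terms.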
Figure 5: Concave gain functions. When the gain functions have no convex region, they intersect the 45 degree line (marked with arrows) with slope less than one, indicating that the network operates on the stable branch of the m-nullcline. Thus, the only way for the network to operate in the dynamically stabilized regime is for the gain functions to be convex for some range of m (see Figure 3).
The above analysis tells us what is possible in principle; to see what is possible in practice, and to determine whether attractor networks can operate at low rates in a reasonably large parameter range, we now turn to simulations.
3 Simulations

To verify the reduced model described above and to determine the range of parameters that supports multiple attractors, we performed simulations with a large network of synaptically coupled, spiking neurons. Network connectivity was based on the firing-rate model given in equation 2.1, with two enhancements to make it more realistic. First, we used random, sparse background connectivity rather than uniform, all-to-all connectivity. Second, we allowed multiple attractors (multiple memories) rather than just a single memory. To implement multiple memories, we included in the connectivity matrix a term proportional to β Σ_{μ=1}^{p} ξ_i^μ(ξ_j^μ − f), where p is the number of memories and the ξ^μ are uncorrelated, random binary vectors with a fraction f of their components equal to one (Hopfield, 1982; Tsodyks & Feigel'man, 1988; Buhmann, 1989). This term is a natural extension of the one given in equation 2.1a. A detailed description of the network is provided in appendix B.

We are interested not only in whether the network can exhibit memories at low rates, but also in whether it can do so without fine-tuning parameters. Ideally, we would like to explore the whole parameter space. However, there are 17 parameters (see Table 1 in appendix B), making this prohibitively time-consuming. Instead, we chose a network with reasonable parameters (e.g., synaptic and membrane time constants and PSP size; Table 1), and then varied two parameters over a broad range. The ones we varied were VPSP_EE, the excitatory-to-excitatory EPSP (excitatory postsynaptic potential) size, and β, the increase in EPSP size among the neurons contained in each of the p memories. At a range of points in this two-dimensional space, we checked how many memories could be embedded out of the p memories we attempted to embed and whether any memories were spontaneously activated. A run (a simulation at a particular set of parameters) lasted 12 seconds.
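The memory term β Σ_μ ξ_i^μ(ξ_j^μ − f) can be sketched in a few lines. Sizes and β below are illustrative placeholders, not the simulation's values; the 1/(N_E f(1 − f)) normalization is the one used in appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)

N_E = 1000   # number of excitatory neurons (illustrative, not the paper's 10,000)
p = 5        # number of memories (illustrative)
f = 0.1      # fraction of neurons per memory
beta = 0.2   # memory-term strength (arbitrary units here)

# p uncorrelated random binary patterns: xi[mu, i] is 1 with probability f.
xi = (rng.random((p, N_E)) < f).astype(float)

# Memory term: beta * sum_mu xi_i^mu (xi_j^mu - f),
# normalized by N_E f (1 - f) as in appendix B.
A = beta * (xi.T @ (xi - f)) / (N_E * f * (1.0 - f))
```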
The first 5 seconds consisted of background activity, in which the neurons fired at low average rate. At 5 seconds, all neurons in one of the memories received a 100 ms barrage of EPSPs. After 2 seconds, the same neurons received a second 100 ms barrage, this time of inhibitory postsynaptic potentials. The network then ran for another 5 seconds at background. A successful run, one in which the desired memory was activated and deactivated at the right times and no other memories were spontaneously activated, is shown in Figure 6A. Note that the firing rate during a memory is relatively low, about 14 Hz. Thus, for at least one set of parameters, an attractor can exist at low rate.
[Figure 6 panels A–C: firing rate (Hz, 0–18) versus time (s, 0–12).]
Figure 6: Simulations in which a set of memory neurons (black line) was activated at t = 5 seconds (first vertical dashed line) and deactivated at t = 7 seconds (second vertical dashed line). The gray line in all plots is the background firing rate. (A) The activated memory is successfully embedded for the full 2 seconds. (B) The activated memory lasts for only 1.5 seconds, indicating that the attractor is not stable. (C) A spurious memory (dashed line) is spontaneously activated, indicating that the background is not stable. Parameters are given in Table 1, with VPSP_EE = 0.48 mV, β = 0.21 mV, and f = 0.1. Onset times for the memories were 50 to 100 ms, consistent with what is observed in vivo (Wilson et al., 1993; Tomita, Ohbayashi, Nakahara, Hasegawa, & Miyashita, 1999; Naya et al., 2003).
Figure 7: Summary of simulations. (A) Number of memories successfully embedded out of 50, the number attempted. (B) Average firing rate of memory neurons. The black line in both plots is the stability boundary: below it, the background is stable, meaning no spurious memories were activated; above it, the background is unstable. For EPSPs ranging from 0.2 to 0.5 mV, the network supports multiple attractors and a stable background for values of β varying over a 15% to 25% range. Scale bar is shown to the right. Parameters are given in Table 1, with f = 0.1.
Not all runs were successful, of course. Two things can go wrong: the desired memory might not stay active for the whole 2 seconds (see Figure 6B), or a spurious memory might become spontaneously active (see Figure 6C). To determine whether a particular set of parameters can support multiple memories without any of them becoming unstable, we activated and deactivated, one at a time, each of the p memories. A particular memory was considered to be successfully embedded if it stayed active for the requisite 2 seconds, and a particular set of parameters was considered to be stable if no spurious memories were activated in any of the p runs.

The results of simulations with p = 50 are summarized in Figure 7. Figure 7A shows the number of memories that were successfully embedded versus VPSP_EE and β. The black line in this figure is the stability boundary: the background state is stable in the region below it, meaning no spurious memories were activated (see Figure 6A); it is unstable in the region above it, meaning at least one spurious memory was activated (see Figure 6C). Figure 7B shows the average firing rate of the memory neurons for those memories that were successfully embedded. As predicted by the reduced model, the firing rate is uniformly low, never exceeding 15 Hz in the stable regime. The background rate (not shown) was ~0.1 to 0.2 Hz, lower than what is seen in vivo. For these network parameters, 50 memories was close to capacity: increasing p to 75 drastically reduced the size of the stable regime.
Consistent with the analysis of the reduced model, the parameter regime that supports multiple memories and a stable background is large: for EPSPs ranging from 0.2 to 0.5 mV, the region with multiple (> 45) memories extended about 0.04 mV below the stability boundary. If we consider the dynamic range of β to lie between zero and the stability boundary, then this corresponds to a parameter range for β of 15% to 25%.

Although the low firing rates of the memory neurons and the large parameter regime that supports multiple memories are consistent with the reduced model introduced above, they are not a direct confirmation of it. What we would like to know is whether the equilibria we observed in the simulations really do lie on the unstable branch of the m-nullcline, as in Figure 4A, or whether they in fact lie on the stable one, as in Figures 4B to 4D. One way to find out is to manipulate the nullclines, and thus the equilibrium firing rates, and then check to see if the firing rates change as predicted by the reduced model. A convenient manipulation is one in which the ν_E-nullcline is raised or lowered while the m-nullcline remains fixed. This is because raising the ν_E-nullcline lowers the average firing rate during the activation of a memory only when the equilibrium is on the unstable branch of the m-nullcline (see Figure 4A and the inset in Figure 8); if the equilibrium is on the stable branch, raising the ν_E-nullcline increases the average firing rate (see Figures 4B–4D). Examining equation 2.5, we see that the ν_E-nullcline can be raised without affecting the m-nullcline by increasing f, the fraction of neurons involved in a memory. The prediction from the reduced model, then, is that the background firing rate during the activation of a memory should decrease as f increases.
This prediction is verified in Figure 8, thus providing strong evidence that the reduced model really does explain the simulations and that the simulations operate in a regime corresponding to the nullclines shown in Figure 4A. Note also that the range of f that supports stable memories is large, ~30%, providing further evidence for the robustness of the network.

4 Discussion

The question we address in this article is: how can cortical networks, which are highly interconnected and dominated by excitatory neurons, carry out computations and yet maintain stability? We answered this question in the context of attractor networks, networks that support multiple stable states, or memories. We chose attractor networks for two reasons. First, they are thought to underlie several computations in the brain, including associative memory (Hopfield, 1982, 1984), working memory (Amit & Brunel, 1997a; Brunel & Wang, 2001), and the vestibulo-ocular reflex (Seung, 1996). Consequently, they are an active area of experimental research (Miyashita & Hayashi, 2000; Aksay et al., 2001; Ojemann et al., 2002; Naya et al., 2003). Second, the stability problem in attractor networks is especially severe, so
Figure 8: Average firing rate during the activation of a memory, ν_E, as a function of the fraction of neurons involved in a memory, f, for two parameter sets. Black curve: VPSP_EE = 0.40 mV, β = 0.18 mV; gray curve: VPSP_EE = 0.20 mV, β = 0.21 mV; other parameters are given in Table 1. Averages are over all 50 memories; error bars are standard deviations. As indicated in the inset, the prediction of our model is that the firing rate, ν_E, should drop as f increases and drives the ν_E-nullcline up (the upper ν_E-nullcline corresponds to larger f; see the text). This prediction is verified by the decreasing firing rate versus f.
if we can solve this problem for attractor networks, we should be able to solve it for networks implementing other kinds of computations.

In previous models of realistic attractor networks, the strategy was to operate on the stable branch of the m-nullcline (Amit & Brunel, 1997a; Wang, 1999; Brunel, 2000; Brunel & Wang, 2001), where firing rates are set essentially by the saturation rates of single neurons (see Figure 4C). Those rates are ~100 Hz, making the experimentally observed 10 to 20 Hz (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995) hard to reach without fine-tuning parameters. Here we showed, using a reduced, two-variable model and simulations with large networks of spiking neurons, that attractor networks can operate on the unstable branch of the m-nullcline, and thus at low rates.

There are two key components to this model. The first is strong coupling between excitatory and inhibitory cells. This coupling, which is essential in networks with strong recurrent excitatory connections like those found in the cortex, means that any upward fluctuation in excitatory firing rate is matched by a larger upward fluctuation in inhibitory rate. (Here "larger" has a technical definition: it means that γ'(ν_E) > 0; see equations 2.4 and 2.6.) This makes the background state effectively inhibitory, a fact that is somewhat counterintuitive but extremely valuable, as it allows the background to act as a stabilizing pool. The second component is strong coupling between
the attractors and the background state. Since the background is effectively inhibitory, this coupling can dynamically stabilize what would otherwise be an unstable state, and thus allow operation on the unstable branch of the m-nullcline.

The approach we used, in which the network operates on the unstable branch of the m-nullcline, is fundamentally different from approaches in which networks operate on the stable branch. In particular, the latter require fine-tuning of network parameters to keep neurons from firing near saturation (see Figure 3C); the former does not. Importantly, the fine-tuning problem associated with operation on the stable branch cannot be eliminated simply by including effects like synaptic and spike-frequency adaptation, or by adding inhibition. This is because the fine-tuning problem is associated with the structure of the nullclines in Figure 4 and the fact that neurons saturate at about 100 Hz (even with spike-frequency adaptation taken into account), neither of which is changed by these manipulations.

To verify the predictions of the reduced model, we performed large-scale simulations with networks of spiking neurons. We found that ~50 overlapping memories could be embedded in a network of 10,000 neurons. As predicted, the firing rates of the neurons in the attractor were low, between 10 and 15 Hz on average, approximately what is seen in vivo. More important, the network was stable over a broad range of parameters: background EPSP amplitude could vary by ~100%, the connection strength among the neurons involved in a memory could vary by ~25%, and the fraction of neurons involved in a memory could vary by ~30%, all without degrading network operation (see Figures 7 and 8).

4.1 Implications for Memory Storage. An important outcome of our model is that it predicts that the fraction of neurons involved in a memory, f, must be above some threshold.
Otherwise, the feedback between the memory and the background would be too weak to stabilize activity at low rates, as can be seen by comparing Figures 4A and 4C. As shown by Gardner (1988), Tsodyks and Feigel'man (1988), Buhmann (1989), and Treves, Skaggs, and Barnes (1996), the maximum number of memories that can be embedded in a network is ~0.2K/|f log f|, where K is the number of connections per neuron. This relation, combined with our result that f must be above some minimum, means that there is a limit to the number of memories that can be stored in any one network. This precludes the possibility of increasing the number of memories by adding neurons to a network while keeping the number of neurons in a memory fixed: this manipulation would decrease f and thus increase the maximum number of allowed memories, but as f becomes too small, it would ultimately make the network unable to operate at low rates. If we take f = 0.1 (Fuster & Jervey, 1982) and K = 5000 (Braitenberg & Schüz, 1991), then the maximum number of memories that can be stored in any strongly interconnected network in the brain is ~4000. This has major implications for how we store memories. Humans, for example, can remember significantly more than 4000 items, even within a particular category (words, for example). Thus, our model implies that memory must be highly distributed: local networks store at most thousands of items each, and during recall, those items must be dynamically linked to form a complete memory.

4.2 Biological Plausibility of Our Model. Our simulations contained garden-variety neurons and synapses: simple point neurons and fast synapses. This allowed us to cleanly separate behavior due to collective interactions from behavior due to single-neuron properties. However, it leaves open the possibility that more realistic neurons and synapses might change our findings, especially the size of the parameter range in which the network is stable. Probably the most important parameters for stability are the excitatory and inhibitory synaptic time constants, with long excitatory time constants being stabilizing and long inhibitory time constants destabilizing (Tegner, Compte, & Wang, 2002). In our networks, we used the same time constant for both (3 ms). More realistic would be a longer inhibitory time constant (Salin & Prince, 1996; Xiang, Huguenard, & Prince, 1998) and a mix of long (NMDA) and short (AMPA) excitatory time constants (Andrasfalvy & Magee, 2001). Fortunately, such a mix has an overall stabilizing effect (Wang, 1999). Thus, we expect that with more realistic synapses, the parameter regime that supports a large number of memories should, in fact, be larger than the one we found. Because slow excitatory synapses reduce fluctuations, they may increase the number of memories that can be embedded. This is important, since the number of memories in our network, 50, was smaller than what we expect of realistic networks, and much smaller than the theoretical maximum of 2000 for our network (derived using the above formula, 0.2K/|f log f|, with K = 2500 and f = 0.1).
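Both capacity estimates quoted in the text are one-line computations. The quick arithmetic check below uses the text's values of f and K; note that the formula gives roughly 4300 and 2200, which the text rounds to ~4000 and ~2000.

```python
import math

def p_max(K, f):
    # Maximum number of memories: ~ 0.2 K / |f log f|
    return 0.2 * K / abs(f * math.log(f))

# Brain-wide estimate: K = 5000 connections per neuron, f = 0.1.
print(round(p_max(5000, 0.1)))  # roughly 4300, the "~4000" quoted above

# Our simulated network: K = 2500, f = 0.1.
print(round(p_max(2500, 0.1)))  # roughly 2200, the "theoretical maximum of 2000"
```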
The background firing rate in our simulations was ~0.1 to 0.2 Hz, lower than the 1 to 10 Hz observed in vivo (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995). These low rates can be traced to the fact that our network consisted of two pools of neurons, one near threshold and one well below threshold, with the latter pool firing at a very low rate. The below-threshold pool was needed to ensure that the average gain function, Φ_E, was convex at the background firing rate, which is necessary for memories to exist (see Figure 3). With other methods for achieving convex gain functions, such as coherent external input or nonlinearities in the current-to-firing-rate transformation (especially the nonlinearity provided by persistent neurons; Egorov, Hamam, Fransen, Hasselmo, & Alonso, 2002), it is likely that a larger background rate could be supported. A nonlinear current-to-firing-rate transformation is especially helpful: when we included even a slight nonlinearity, the background rate increased to about 1 Hz and the capacity of the network increased to 100 memories (data
not shown). Alternatively, it is possible that our network was consistent with real cortex, and the firing rates in vivo are overestimated. This could happen because experimental recordings are, by necessity, biased toward neurons that fire at detectable rates, whereas we averaged firing rates over all neurons in the network, some of which rarely fired. This possibility is strengthened by the recent finding that, based on energy considerations, average firing rates in the cortex are likely to be less than 1 Hz (Lennie, 2003).

4.3 Summary. These results demonstrate that dynamical stabilization can be used to embed multiple, overlapping, low-firing-rate attractors over a broad range of network parameters. This opens the door to detailed comparison between theory and experiment, and should help resolve the question of whether attractor networks really do exist in the brain. In addition, dynamical stabilization can be used for all kinds of networks, not just those involving attractors, and may turn out to be a general computational principle.

Appendix A: Derivation of Reduced Model Equations

In large neuronal networks in which neurons are firing asynchronously, it is reasonable to model neurons by their firing rates (Amit & Brunel, 1997a). While this approach is unlikely to capture the full temporal network dynamics (Treves, 1993), it is useful for studying equilibria. Moreover, near an equilibrium, we expect the firing rates to obey first-order dynamics. Thus, a firing-rate model that uses linear summation followed by a nonlinearity would be described by the equations

$$\tau_E\frac{d\nu_{Ei}}{dt} + \nu_{Ei} = \Phi_{Ei}\!\left(\sum_j J_{ij}^{EE}\nu_{Ej} - \sum_j J_{ij}^{EI}\nu_{Ij}\right) \tag{A.1a}$$

$$\tau_I\frac{d\nu_{Ii}}{dt} + \nu_{Ii} = \Phi_{Ii}\!\left(\sum_j J_{ij}^{IE}\nu_{Ej} - \sum_j J_{ij}^{II}\nu_{Ij}\right), \tag{A.1b}$$
where ν_Ei and ν_Ii are the firing rates of individual excitatory and inhibitory neurons, respectively, Φ_Ei and Φ_Ii are their gain functions, and J_ij determines the connection strength from neuron j to neuron i.

The behavior of the network described by equation A.1 depends critically on connectivity. If the connectivity is purely random, then the network can be described qualitatively by the Wilson and Cowan model (Wilson & Cowan, 1972). However, if the connectivity among a subpopulation of excitatory neurons is strengthened, as it is in this study, we need to augment the Wilson and Cowan model. Ignoring firing-rate fluctuations, an approximation that is valid qualitatively (Amit & Brunel, 1997a; Latham, 2002), we can heuristically construct such an augmented model by letting

$$J_{ij}^{EE} \to N_E^{-1} J^{EE} + \beta\,[N_E f(1-f)]^{-1}\,\xi_i(\xi_j - f) \tag{A.2a}$$

$$J_{ij}^{LM} \to N_M^{-1} J^{LM}, \qquad LM = EI,\ IE,\ II. \tag{A.2b}$$
In these expressions, N_E and N_I are the number of excitatory and inhibitory neurons, respectively, and ξ is a random binary vector: ξ_i is 1 with probability f and 0 with probability 1 − f. If we apply the replacements given in equation A.2 to equation A.1, average equation A.1b over the index i, define Φ_I(x) ≡ N_I^{-1} Σ_i Φ_Ii(x), and let Φ_Ei → Φ_E, then equation A.1 turns into equation 2.1. Replacing Φ_Ei with Φ_E was done for convenience only: it simplifies the resulting equations without detracting from the main result.

To derive equations 2.3a and 2.3b from equation 2.1, we must average Φ_E and ξ_iΦ_E over the index. To perform these averages, we first use the definition of m, equation 2.2, to simplify the argument of Φ_E; this definition implies that Φ_E = Φ_E(J_EE ν_E − J_EI ν_I + βmξ_i). We then note that for any function F(ξ_i),

$$N_E^{-1}\sum_i F(\xi_i) = N_E^{-1}\sum_i (1-\xi_i)F(\xi_i) + N_E^{-1}\sum_i \xi_i F(\xi_i)$$
$$= N_E^{-1}\sum_i (1-\xi_i)F(0) + N_E^{-1}\sum_i \xi_i F(1)$$
$$= (1-f)F(0) + f F(1). \tag{A.3}$$

The second line follows because ξ_i is either 0 or 1; the third (which is strictly valid only in the N_E → ∞ limit) because ξ_i averages to f. Similarly,

$$(f N_E)^{-1}\sum_i \xi_i F(\xi_i) = (f N_E)^{-1}\sum_i \xi_i F(1) = F(1). \tag{A.4}$$
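The averaging identities A.3 and A.4 are easy to confirm numerically on a random binary vector. In the quick check below, F, N_E, and f are arbitrary choices; A.3 holds up to sampling error and A.4 up to the finite-N_E fluctuation of the empirical f.

```python
import numpy as np

rng = np.random.default_rng(1)
N_E, f = 200_000, 0.1   # arbitrary large N_E and the text's f

# Random binary vector: xi_i = 1 with probability f.
xi = (rng.random(N_E) < f).astype(float)

def F(x):
    # any function; on binary input only F(0) and F(1) matter
    return 2.0 * x + 3.0

lhs_A3 = np.mean(F(xi))                    # N_E^{-1} sum_i F(xi_i)
rhs_A3 = (1.0 - f) * F(0.0) + f * F(1.0)   # equation A.3
lhs_A4 = np.sum(xi * F(xi)) / (f * N_E)    # equation A.4: -> F(1) as N_E -> inf
```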
Equations A.3 and A.4, along with the definition of m given in equation 2.2, can be used to derive equations 2.3a and 2.3b.

Appendix B: Simulation Details

The network we simulate consists of N_E excitatory and N_I inhibitory quadratic integrate-and-fire neurons (Ermentrout & Kopell, 1986; Ermentrout, 1996; Gutkin & Ermentrout, 1998; Brunel & Latham, 2003). The membrane potential of the ith neuron, V_i, and the conductance change at neuron i in response to a spike at neuron j, s_ij, evolve according to

$$\tau\frac{dV_i}{dt} = \frac{(V_i - V_r)(V_i - V_t)}{V_t - V_r} + V_{0i} - (V_i - E_E)\sum_{j\in E} J_{ij}\,s_{ij}(t) - (V_i - E_I)\sum_{j\in I} J_{ij}\,s_{ij}(t) \tag{B.1a}$$

$$\frac{ds_{ij}}{dt} = -\frac{s_{ij}}{\tau_s} + \sum_l \delta(t - t_{jl}). \tag{B.1b}$$
Here τ is the cell time constant, V_r and V_t are the nominal resting and threshold voltages, V_{0i} is the product of the applied current and the membrane resistance (in units of voltage), J_ij is the (dimensionless) connection strength from cell j to cell i, E_E and E_I are the excitatory and inhibitory reversal potentials, respectively, the notation j ∈ M means sum over only those cells of type M, δ(·) is the Dirac δ-function, and t_jl is the lth spike emitted by neuron j.

To mimic the heterogeneity seen in cortical neurons, we let V_{0i}, which determines the distance from resting membrane potential to threshold, have a range of values. This range is captured by the distributions

$$P_E(V_0) = 0.75\,\frac{\exp[-(V_0 - 1.5)^2/2(0.5)^2]}{[2\pi(0.5)^2]^{1/2}} + 0.25\,\frac{\exp[-(V_0 - 3.75)^2/2(1.0)^2]}{[2\pi(1.0)^2]^{1/2}}$$

$$P_I(V_0) = \frac{1}{4.5}\times\begin{cases}1 & 0.5 < V_0 < 5.0\\ 0 & \text{otherwise},\end{cases}$$

where P_E(V_0) and P_I(V_0) are the distributions for excitatory and inhibitory neurons, respectively. In the absence of synaptic drive, the distance between resting membrane potential and threshold is (V_t − V_r)[1 − 4V_0/(V_t − V_r)]^{1/2} (see equation B.1a). Since we use V_t − V_r = 15 mV (see Table 1), V_0 = 3.75 corresponds to a resting membrane potential that is equal to threshold, while V_0 = 1.5 corresponds to a resting membrane potential about 12 mV below threshold. Neurons for which V_0 > 3.75 are endogenously active: they fire repeatedly without input. The distribution P_E(V_0) tells us that the excitatory neurons consist of two pools: one with resting membrane potential well below threshold, and one with resting membrane potential near threshold. About half the neurons in the latter pool are above threshold, and thus endogenously active. In realistic networks, this endogenous activity could be due to external input, intrinsic single-neuron properties (Latham, Richmond, Nirenberg, & Nelson, 2000b), or a combination of the two. While endogenously active cells greatly facilitate the existence of a low-firing-rate background state (Latham et al., 2000a), they are not absolutely necessary for it (Hansel & Mato, 2001).

A spike is emitted from neuron j whenever V_j reaches +∞, at which point it is reset to −∞. To attain the values ±∞ in our numerical integration, we make the change of variables V = (V_r + V_t)/2 + (V_t − V_r) tan θ and integrate θ instead of V.
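In the θ representation the dynamics stay bounded, which is what makes the numerical integration possible. Below is a minimal sketch of a single, synapse-free neuron; the θ equation follows by substituting the change of variables into equation B.1a with the synaptic terms dropped, and the parameter values are illustrative, not Table 1's.

```python
import math

# Single quadratic integrate-and-fire neuron, no synapses, integrated in the
# theta representation V = (Vr + Vt)/2 + (Vt - Vr) tan(theta).  The spike at
# V -> +infinity becomes theta = pi/2; the reset to V -> -infinity becomes
# theta = -pi/2.  Substituting into equation B.1a (synaptic terms dropped):
#   tau dtheta/dt = sin^2(theta) + [V0/(Vt - Vr) - 1/4] cos^2(theta).
tau = 0.010            # membrane time constant (s); illustrative
Vr, Vt = -65.0, -50.0  # nominal rest and threshold (mV); illustrative
V0 = 4.0               # drive; V0 > (Vt - Vr)/4 = 3.75 => endogenously active

def dtheta_dt(theta):
    s, c = math.sin(theta), math.cos(theta)
    return (s * s + (V0 / (Vt - Vr) - 0.25) * c * c) / tau

theta, dt, spikes = -math.pi / 2.0, 1e-5, 0
for _ in range(100_000):          # 1 s of simulated time, forward Euler
    theta += dt * dtheta_dt(theta)
    if theta >= math.pi / 2.0:    # V reached +infinity: spike...
        theta = -math.pi / 2.0    # ...and reset V to -infinity
        spikes += 1
# with these values the neuron fires repeatedly at a few Hz
```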
P. Latham and S. Nirenberg
The connectivity matrix, J_ij, consists of two parts. One is random, corresponding to background connectivity before learning; the other is structured, corresponding to memories. In addition, we impose sparse connectivity by making the connection probability less than 1. We thus write

J_ij = g(c_ij (W_ij + A_ij)),

where c_ij is 1 with probability c_connect and zero otherwise, W_ij corresponds to the background connectivity, A_ij corresponds to the structured connectivity, and g is a clipping function chosen so that 0 ≤ g(x) ≤ g_max; for convenience, we choose g(x) to be threshold linear and truncated at g_max: g(x) = max(x, 0) − max(x − g_max, 0). The clipping function ensures that a particular connection strength neither violates Dale's law by falling below zero nor exceeds physiological levels by becoming too large. The random part of the connectivity matrix, W, is chosen to produce realistic postsynaptic potentials. Specifically, we let

W_ij = w_ij · V_PSP^{LM} / V_M,

where neuron i is of type L, neuron j is of type M (L, M = E, I), w_ij is a random variable uniformly distributed between 1 − √3 and 1 + √3 (so that its variance is 1), and

V_M ≡ (E_M − V_r) / { (τ/τ_s) exp[ln(τ/τ_s)/(τ/τ_s − 1)] }.

With this choice for the connection matrix, a neuron of type L will exhibit postsynaptic potentials on the order of V_PSP^{LM} when a neuron of type M fires, assuming that V_0 = 0 for the neuron of type L (Latham et al., 2000a). The structured part of the connectivity matrix is the natural multimemory extension of the matrix given in equation 2.1a, except with a slightly different normalization,

A_ij = (β/V_E) / (N_E f(1 − f)) · Σ_{μ=1}^{p} ξ_i^μ (ξ_j^μ − f),

where p is the number of memories and the ξ^μ are uncorrelated, random binary vectors: ξ_i^μ is 1 with probability f and 0 with probability 1 − f. Only excitatory neurons participate in the memories. The factor 1/V_E is included so that β has units of voltage. The parameters in the model are listed in Table 1. All were fixed throughout the simulation except for β and V_PSP^{EE}, which were varied to explore robustness (see Figure 7), and f, which was 0.1 except in Figure 8, where it ranged from 0.085 to 0.115.
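To make the construction of J_ij concrete, here is a small-scale sketch of assembling J = g(c ∘ (W + A)). This is our own illustration, not the authors' code: the population sizes are scaled down, β and V_PSP^{EE} are fixed at single values inside their quoted ranges, and the uniform weight range 1 ± √3 (unit variance) is an assumption about the garbled text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down populations (the paper uses 8000 E and 2000 I neurons)
NE, NI = 400, 100
N = NE + NI                      # excitatory neurons first, then inhibitory
c_connect = 0.25
tau, tau_s = 10.0, 3.0           # membrane and synaptic time constants (ms)
Vr = -65.0                       # resting potential (mV)
E_rev = {"E": 0.0, "I": -80.0}   # reversal potentials (mV)
# Target PSP sizes (mV), keyed (postsynaptic type L, presynaptic type M)
V_PSP = {("E", "E"): 0.5,        # E -> E (0.2-0.8 mV range; we pick 0.5)
         ("I", "E"): 1.0,        # E -> I
         ("E", "I"): -1.5,       # I -> E
         ("I", "I"): -1.5}       # I -> I
gmax, beta, p, f = 2.5, 0.2, 5, 0.1

def V_norm(M):
    """V_M from the text: (E_M - V_r) / ((tau/tau_s) exp[ln(tau/tau_s)/(tau/tau_s - 1)])."""
    k = tau / tau_s
    return (E_rev[M] - Vr) / (k * np.exp(np.log(k) / (k - 1.0)))

types = ["E"] * NE + ["I"] * NI

# Background part: W_ij = w_ij * V_PSP^{LM} / V_M, with w uniform around 1
w = rng.uniform(1 - np.sqrt(3), 1 + np.sqrt(3), size=(N, N))
scale = np.array([[V_PSP[(types[i], types[j])] / V_norm(types[j])
                   for j in range(N)] for i in range(N)])
W = w * scale

# Structured (memory) part: p sparse binary patterns over excitatory cells only
xi = (rng.random((p, NE)) < f).astype(float)
A = np.zeros((N, N))
A[:NE, :NE] = (beta / V_norm("E")) / (NE * f * (1 - f)) * (xi.T @ (xi - f))

# Sparse connectivity mask and clipping g(x) = max(x, 0) - max(x - gmax, 0)
c = (rng.random((N, N)) < c_connect).astype(float)
x = c * (W + A)
J = np.maximum(x, 0.0) - np.maximum(x - gmax, 0.0)
print(J.shape, float(J.min()), float(J.max()))
```

Note that in this formulation every J_ij ends up in [0, g_max]: inhibition enters through the reversal potentials (V_M is negative for M = I, as is V_PSP for inhibitory connections), so the clipping enforces Dale's law exactly as described in the text.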
Computing and Stability in Cortical Networks
Table 1: Parameters Used in the Simulations.

    Excitatory neurons (N_E)        8000
    Inhibitory neurons (N_I)        2000
    c_connect (E)                   0.25
    c_connect (I)                   0.25
    τ                               10 ms
    τ_s                             3 ms
    V_r                             −65 mV
    V_t                             −50 mV
    E_E                             0 mV
    E_I                             −80 mV
    V_PSP^E (E → E)                 0.2–0.8 mV
    V_PSP^E (E → I)                 1.0 mV
    V_PSP^I (I → E, I → I)          −1.5 mV
    g_max                           2.5 mV
    β                               0.08–0.28 mV
    p                               50
    f                               0.085–0.115
    Time step                       0.5 ms
Acknowledgments

We thank Nicolas Brunel and Alex Pouget for insightful comments on the manuscript. This work was supported by NIMH Grant R01 MH62447.

References

Abbott, L., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Aksay, E., Gamkrelidze, G., Seung, H., Baker, R., & Tank, D. (2001). In vivo intracellular recording and perturbation of persistent activity in a neural integrator. Nature Neurosci., 4, 184–193.
Amit, D., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Andrasfalvy, B., & Magee, J. (2001). Distance-dependent increase in AMPA receptor number in the dendrites of adult hippocampal CA1 pyramidal neurons. J. Neurosci., 21, 9151–9159.
Bell, G., & Sander, J. (2001). The epidemiology of epilepsy: The size of the problem. Seizure, 10, 306–314.
Braitenberg, V., & Schüz, A. (1991). Anatomy of the cortex. Berlin: Springer-Verlag.
Brunel, N. (2000). Persistent activity and the single-cell frequency-current curve in a cortical network model. Network: Computation in Neural Systems, 11, 261–280.
Brunel, N., & Latham, P. (2003). Firing rate of the noisy quadratic integrate-and-fire neuron. Neural Comput., 15, 2281–2306.
Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic current dynamics. J. Theor. Biol., 195, 87–95.
Brunel, N., & Wang, X. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85.
Buhmann, J. (1989). Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A, 40, 4145–4148.
Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Egorov, A., Hamam, B., Fransen, E., Hasselmo, M., & Alonso, A. (2002). Graded persistent activity in entorhinal cortex neurons. Nature, 420, 173–178.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comput., 14, 2057–2110.
Freedman, D., Riesenhuber, M., Poggio, T., & Miller, E. (2001). Categorical representation of visual stimuli in the primate prefrontal cortex. Science, 291, 312–316.
Fuster, J., & Alexander, G. (1971). Neuron activity related to short-term memory. Science, 173, 652–654.
Fuster, J., & Jervey, J. (1982). Neuronal firing in the inferotemporal cortex of the monkey in a visual memory task. J. Neurosci., 2, 361–375.
Gardner, E. (1988). The space of interactions in neural network models. J. Phys. A: Math. Gen., 21, 257–270.
Gerstner, W., & van Hemmen, L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett., 71, 312–315.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Gutkin, B., & Ermentrout, B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comput., 10, 1047–1065.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hauser, W. (1997). Incidence and prevalence. In J. Engel & T. A. Pedley (Eds.), Epilepsy: A comprehensive textbook (Vol. 1, pp. 47–58). Philadelphia: Lippincott-Raven.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci., 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded responses have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci., 81, 3088–3092.
Latham, P. (2002). Associative memory in realistic neuronal networks. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Latham, P., Richmond, B., Nelson, P., & Nirenberg, S. (2000a). Intrinsic dynamics in neuronal networks. I. Theory. J. Neurophysiol., 83, 808–827.
Latham, P., Richmond, B., Nirenberg, S., & Nelson, P. (2000b). Intrinsic dynamics in neuronal networks. II. Experiment. J. Neurophysiol., 83, 828–835.
Lennie, P. (2003). The cost of cortical computation. Curr. Biol., 13, 493–497.
Marsden, J., & McCracken, M. (1976). The Hopf bifurcation and its applications. Berlin: Springer-Verlag.
Mascaro, M., & Amit, D. (1999). Effective neural response function for collective population states. Network, 10, 351–373.
Matsumura, M., Chen, D., Sawaguchi, T., Kubota, K., & Fetz, E. (1996). Synaptic interactions between primate precentral cortex neurons revealed by spike-triggered averaging of intracellular membrane potentials in vivo. J. Neurosci., 16, 7757–7767.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
McIntyre, D., Poulter, M., & Gilby, K. (2002). Kindling: Some old and some new. Epilepsy Res., 50, 79–92.
Miyashita, Y., & Chang, H. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Miyashita, Y., & Hayashi, T. (2000). Neural representation of visual objects: Encoding and top-down activation. Curr. Opin. Neurobiol., 10, 187–194.
Nakamura, K., & Kubota, K. (1995). Mnemonic firing of neurons in the monkey temporal pole during a visual recognition memory task. J. Neurophysiol., 74, 162–178.
Naya, Y., Yoshida, M., & Miyashita, Y. (2003). Forward processing of long-term associative memory in monkey inferotemporal cortex. J. Neurosci., 23, 2861–2871.
Ojemann, G., Schoenfield-McNeill, J., & Corina, D. (2002). Anatomic subdivisions in human temporal cortical neuronal activity related to recent verbal memory. Nature Neurosci., 5, 64–71.
Rinzel, J., & Ermentrout, G. (1989). Models for excitable cells and networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 137–173). Cambridge, MA: MIT Press.
Rubin, N., & Sompolinsky, H. (1989). Neural networks with low local firing rates. Europhys. Lett., 10, 465–470.
Salin, P. A., & Prince, D. A. (1996). Spontaneous GABAA receptor-mediated inhibitory currents in adult rat somatosensory cortex. J. Neurophysiol., 75, 1573–1588.
Seung, H. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci., 93, 13339–13344.
Tegner, J., Compte, A., & Wang, X. (2002). The dynamical stability of reverberatory neural circuits. Biol. Cybern., 87, 471–481.
Tiesinga, P., José, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Tomita, H., Ohbayashi, M., Nakahara, K., Hasegawa, I., & Miyashita, Y. (1999). Top-down signal from prefrontal cortex in executive control of memory retrieval. Nature, 401, 699–703.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Treves, A., Skaggs, W., & Barnes, C. (1996). How much of the hippocampus can be explained by functional constraints? Hippocampus, 6, 666–674.
Tsodyks, M., & Feigel'man, M. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Wang, X. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.
Wilson, F., Scalaidhe, S., & Goldman-Rakic, P. (1993). Dissociation of object and spatial processing domains in primate prefrontal cortex. Science, 260, 1955–1958.
Wilson, H., & Cowan, J. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Xiang, Z., Huguenard, J., & Prince, D. (1998). GABAA receptor-mediated currents in interneurons and pyramidal cells of rat visual cortex. J. Physiol., 506, 715–730.

Received July 17, 2003; accepted November 21, 2003.
LETTER
Communicated by Peter Latham
Real-Time Computation at the Edge of Chaos in Recurrent Neural Networks Nils Bertschinger [email protected] Institute for Theoretical Computer Science, Technische Universitaet Graz, A-8010 Graz, Austria
Thomas Natschläger [email protected] Software Competence Center Hagenberg, A-4232 Hagenberg, Austria
Depending on the connectivity, recurrent networks of simple computational units can show very different types of dynamics, ranging from totally ordered to chaotic. We analyze how the type of dynamics (ordered or chaotic) exhibited by randomly connected networks of threshold gates driven by a time-varying input signal depends on the parameters describing the distribution of the connectivity matrix. In particular, we calculate the critical boundary in parameter space where the transition from ordered to chaotic dynamics takes place. Employing a recently developed framework for analyzing real-time computations, we show that only near the critical boundary can such networks perform complex computations on time series. Hence, this result strongly supports conjectures that dynamical systems that are capable of doing complex computational tasks should operate near the edge of chaos, that is, the transition from ordered to chaotic dynamics.
1 Introduction

A central issue in the field of computer science is to analyze the computational power of a given system. Unfortunately, there exists no full agreement regarding the meaning of abstract concepts such as computation and computational power. In the following, we mean by a computation simply an algorithm, system, or circuit that assigns outputs to inputs. The computational power of a given system can be assessed by evaluating the complexity and diversity of associations of inputs to outputs that can be implemented on it. Most work regarding the computational power of neural networks is devoted to the analysis of artificially constructed networks (see, e.g., Siu, Roychowdhury, & Kailath, 1995). Much less is known about the computational

Neural Computation 16, 1413–1436 (2004) © 2004 Massachusetts Institute of Technology
N. Bertschinger and T. Natschläger
capabilities¹ of networks with a biologically more adequate connectivity structure. Often, randomly connected networks (with regard to a particular distribution) are considered a suitable model for biological neural systems (see, e.g., van Vreeswijk & Sompolinsky, 1996). Obviously, not only the dynamics of such random networks depends on the connectivity structure but also their computational capabilities. Hence, it is of particular interest to analyze the relationship between the dynamical behavior of a system and its computational capabilities. It has been proposed that extensive computational capabilities are achieved by systems whose dynamics is neither chaotic nor ordered but somewhere in between order and chaos. This has led to the idea of "computation at the edge of chaos." Early evidence for this hypothesis was reported by Langton (1990) and Packard (1988). Langton based his conclusion on data gathered from a parameterized survey of cellular automaton behavior. Packard used a genetic algorithm to evolve a particular class of complex rules. He found that as evolution proceeded, the population of rules tended to cluster near the critical region, the region between order and chaos, identified by Langton. Similar experiments involving boolean networks have been conducted by Kauffman (1993). The main focus of his experiments was to determine the transition from ordered to chaotic behavior in the space of a few parameters controlling the network structure. In fact, the results of numerous numerical simulations suggested a sharp transition between the ordered and chaotic regime. Later, these results were confirmed theoretically by Derrida and others (Derrida & Pomeau, 1986; Derrida & Weisbuch, 1986).
They used ideas from statistical physics to develop an accurate mean-field theory for random boolean networks (Derrida & Pomeau, 1986; Derrida & Weisbuch, 1986) and for Ising-spin models (networks of threshold elements) (Derrida, 1987), which allows determining the critical parameters analytically. Because of the physical background, this theory focused on the autonomous dynamics of the system, that is, its relaxation from an initial state (the input) to some terminal state (the output) without any external influences. However, such "off-line" computations (whose main inspiration comes from their similarity to batch processing by Turing machines) do not adequately describe the input-to-output relation of systems like animals or autonomous robots, which must react in real time to a continuously changing stream of sensory input. Such computations are more adequately described as mappings, or filters, from a time-varying input signal to a time-varying output signal. In this article, we will focus on time-series computations that shift the emphasis toward on-line computations and anytime algorithms (i.e.,
¹ Usually one talks only about the computational power of a dedicated computational model, which we have not yet defined. We prefer the term computational capabilities in combination with a randomly constructed system.
algorithms that output at any time the currently best guess of an appropriate response), and away from off-line computations. (See also Maass, Natschläger, & Markram, 2002.) Our purpose is to investigate how the computational capabilities of randomly connected recurrent neural networks in the domain of real-time processing and the type of dynamics of the network are related to each other. In particular, we will develop answers to the following questions:

• Are there general dynamical properties of input-driven recurrent networks that support a large diversity of complex real-time computations, and if yes,

• Can such recurrent networks be utilized to build arbitrarily complex filters, that is, mappings from input time series to output time series?

A similar study using leaky integrate-and-fire neurons, which was purely based on computer simulation, was conducted in Maass, Natschläger, and Markram (2002). The rest of the letter is organized as follows. After defining the network model in section 2, we extend the mean-field theory developed in Derrida (1987) to the case of input-driven networks. In particular, we will determine in section 3 where the transition from ordered to chaotic dynamics occurs in the space of a few connectivity parameters. In section 4, we propose a new complexity measure, which can be calculated using the mean-field theory developed in section 3 and which serves as a predictor for the computational capability. Finally, we investigate in section 5 the relationship between network dynamics and the computational capabilities in the time-series domain.

2 Network Dynamics

To analyze real-time computations on time series in recurrent neural networks, we consider input-driven recurrent networks consisting of N threshold gates with states x_i ∈ {−1, +1}, i = 1, …, N. Each node i receives nonzero incoming weights w_ij from exactly K randomly chosen units j ∈ {1, …, N}.
Such a nonzero connection weight w_ij is randomly drawn from a gaussian distribution with zero mean and variance σ². Furthermore, the network is driven by an external input signal u(·), which is applied to each threshold gate. Hence, in summary, the update of the network state x(t) = (x₁(t), …, x_N(t)) is given by

x_i(t) = Θ( Σ_{j=1}^{N} w_ij · x_j(t − 1) + u(t) ),    (2.1)

which is applied for all neurons i ∈ {1, …, N} in parallel and where Θ(h) = +1 if h ≥ 0 and Θ(h) = −1 otherwise.
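The update rule in equation 2.1 is easy to simulate directly. The sketch below is our own illustration (the function names are not from the letter): it builds a random network with exactly K nonzero gaussian weights per unit and iterates the parallel dynamics under the binary input signal used throughout the letter (ū + 1 with probability r, ū − 1 otherwise).

```python
import numpy as np

def make_network(N, K, sigma2, rng):
    """Weight matrix with exactly K nonzero gaussian entries per row
    (zero mean, variance sigma2), as in section 2."""
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    return W

def step(W, x, u):
    """One parallel update: x_i(t) = Theta(sum_j w_ij x_j(t-1) + u(t))."""
    return np.where(W @ x + u >= 0.0, 1, -1)

def run(W, x0, inputs):
    """Iterate the dynamics and collect the state trajectory."""
    x = x0.copy()
    states = []
    for u in inputs:
        x = step(W, x, u)
        states.append(x.copy())
    return np.array(states)

rng = np.random.default_rng(0)
N, K = 250, 4
W = make_network(N, K, sigma2=1.0, rng=rng)
x0 = rng.choice([-1, 1], size=N)
# binary input: u_bar + 1 with probability r, u_bar - 1 otherwise
u_bar, r = 0.0, 0.5
inputs = np.where(rng.random(100) < r, u_bar + 1, u_bar - 1)
states = run(W, x0, inputs)
print(states.shape)  # (100, 250)
```

Plotting `states` as a black-and-white raster reproduces the kind of state-evolution pictures shown in the top row of Figure 1.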
In the following, we consider a randomly drawn binary input signal u(·): at each time step, u(t) assumes the value ū + 1 with probability r and the value ū − 1 with probability 1 − r. Note that the input bias ū can also be considered the mean threshold of the nodes. Similar to autonomous systems, these networks can exhibit very different types of dynamics, ranging from completely ordered all the way to chaotic, depending on the connectivity structure. Informally, we will call a network chaotic if arbitrarily small differences in an (initial) network state x(0) are highly amplified and do not vanish. In contrast, a totally ordered network forgets immediately about an (initial) network state x(0), and the current network state x(t) is determined to a large extent by the current input u(t) (see section 3 for a precise definition of order and chaos). The top row of Figure 1 shows typical examples of ordered, critical, and chaotic dynamics, while the lower panel (phase plot) indicates which values of the model or system parameters correspond to a given type of dynamics.² We will concentrate on the three system parameters K, σ², and ū, where K (number of incoming connections per neuron) and σ² (variance of the nonzero weights) determine the connectivity structure, whereas ū describes a statistical property (the mean or bias) of the input signal u(·). A phase transition occurs if varying such a parameter leads to a qualitative change in the network's behavior, like switching from ordered to chaotic dynamics. An example can be seen in Figure 1, where increasing the variance σ² of the weights leads to chaotic behavior. We refer to the transition from the ordered to the chaotic regime as the critical line.

3 Finding the Critical Line

To define the chaotic and the ordered phase of a discrete dynamical system, Derrida and Pomeau (1986) proposed the following approach: consider two (initial) network states with a certain (normalized) Hamming distance.
These states are mapped to their corresponding following states defined by the network dynamics, and the change in the Hamming distance is observed. If the distance tends to grow, this is a sign of chaos, whereas if the distance decreases, this is a signature of order (Derrida & Pomeau, 1986; Derrida & Weisbuch, 1986; Derrida, 1987). This is closely related to the computation of the Lyapunov exponent in a continuous system (see, e.g., Luque & Sole, 2000), since this approach also emphasizes the idea of considering a system as chaotic if it is very sensitive to differences in its initial conditions. We will apply the same basic approach to define the chaotic and ordered regime for an input-driven network: two (initial) states are mapped to their corresponding following states with the same input in each case while the change in the Hamming distance is observed. Again, if the distances tend to increase (decrease), this is a sign for chaos (order).

² A Mathematica (www.mathematica.com) notebook that contains the code to reproduce most of the figures in this letter is available on-line at: http://www.igi.tugraz.at/ tnatschl/edge-of-chaos.
Figure 1: Networks of randomly connected threshold gates can exhibit ordered, critical, and chaotic dynamics (see the text for a definition of order and chaos). In the upper row, examples of the time evolution of the network state x(t) are shown (black: x_i(t) = +1, white: x_i(t) = −1) for three different networks with parameters taken from the ordered, critical, and chaotic regime, respectively. Parameters: K = 4, N = 250, and σ² and ū as indicated in the phase plot at the bottom of the figure.
Following closely the arguments by Derrida (1987), we develop a mean-field theory (i.e., we consider the limit N → ∞), which allows us to analyze the time evolution of the (normalized) Hamming distance d(t) = |{i : x_{1,i}(t) ≠ x_{2,i}(t)}|/N between two network states x₁(t) and x₂(t), where u(·) is the input signal (x_{l,i}(t) denotes the ith component of the network state vector x_l(t), l = 1, 2). As shown in appendix A, the Hamming distance d(t) between two trajectories at time t is related to the Hamming distance d(t+1; u) at time t+1 (given that u is the input in that time step) through the equation

d(t+1; u) = Σ_{c=0}^{K} (K choose c) · d(t)^c · (1 − d(t))^{K−c} · P_BF(c, u),    (3.1)
where P_BF(c, u) is the probability that exactly c bit-flips out of the K inputs to a unit will cause a different output of that unit in the two trajectories (see appendix A for details).³ In contrast to previous studies (Derrida, 1987; Derrida & Weisbuch, 1986), the Hamming distance d(t+1; u) in equation 3.1 depends on the input u, which leads to the new states at time t+1. If one assumes a certain distribution of the input signal, one can calculate the expected value of d(t+1; u). Let us consider the case where u = ū + 1 and u = ū − 1 with probability r and 1 − r, respectively. This input distribution leads to the equation

d_fade(t+1) := FADE(d(t)) := r · d(t+1; ū+1) + (1 − r) · d(t+1; ū−1)    (3.2)

for the time evolution of the Hamming distance. The map defined by equation 3.2 determines whether a network is in the ordered phase. If for all initial values d(0) ∈ [0, 1] the Hamming distance d_fade(t) converges to zero for t → ∞, we know that any state differences will die out, and the network is in the ordered phase. If, on the other hand, a small difference d(t) > 0 is amplified and never dies out, the network is in the chaotic phase. Whether the Hamming distance will converge to zero can easily be seen in a so-called Derrida plot, where d_fade(t+1) = FADE(d(t)) is plotted versus d(t). A network is in the ordered phase iff d*_fade = 0 is the only stable fixed point d*_fade = FADE(d*_fade) of equation 3.2. See Figure 2A for examples of Derrida plots.⁴ We also compared the theoretical predictions of equation 3.2 to simulations of networks. The results are shown in Figure 2B. Even though the theory assumes an infinite network size and that the weights are randomly drawn at each time step (the annealing approximation; Derrida & Pomeau, 1986), a good correspondence with simulation results is already obtained for networks containing a few hundred nodes and with the weights randomly drawn prior to the simulation (see appendix A for a more detailed discussion of the annealing approximation). The stability can be checked by analyzing the slope α of the Derrida plot at d(t) = 0. The fixed point d*_fade = 0 is stable if |α| < 1. As shown in appendix B, the slope α is given as

α = ∂d_fade(t+1)/∂d(t) |_{d(t)=0} = K · (r · P_BF(1, ū+1) + (1 − r) · P_BF(1, ū−1)).
³ Note that in the appendix, d(t+1; u) and P_BF(c, u) are denoted as d(t+1; u, u) and P_BF(c, u, u), respectively.
⁴ Equation 3.2 was solved numerically to yield the Derrida plots.
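The convergence (or not) of the Hamming distance can also be checked empirically, in the spirit of the simulations behind Figure 2B: run two copies of the same network on an identical input stream, starting from states at distance d(0), and track d(t). Below is a minimal single-run sketch of our own (the letter averages over 50 runs, which smooths the curves):

```python
import numpy as np

def hamming_distance_evolution(N=250, K=4, sigma2=1.0, u_bar=0.0, r=0.5,
                               d0=0.5, T=40, seed=0):
    """Run two copies of one random threshold-gate network on the SAME
    input, starting from states that differ in a fraction d0 of the
    components, and record the normalized Hamming distance d(t)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    x1 = rng.choice([-1, 1], size=N)
    x2 = x1.copy()
    flip = rng.choice(N, size=int(d0 * N), replace=False)
    x2[flip] *= -1  # initial Hamming distance = d0
    dist = [np.mean(x1 != x2)]
    for _ in range(T):
        u = u_bar + 1 if rng.random() < r else u_bar - 1
        x1 = np.where(W @ x1 + u >= 0, 1, -1)
        x2 = np.where(W @ x2 + u >= 0, 1, -1)
        dist.append(np.mean(x1 != x2))
    return dist

d = hamming_distance_evolution()
print(d[0], d[-1])
```

In the ordered phase d(t) decays to zero; in the chaotic phase it settles near the nonzero fixed point of equation 3.2 (e.g., roughly 0.196 for K = 4, σ² = 1, ū = 0 according to Figure 2A).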
Figure 2: (A) Derrida plots. The Derrida plots show the dependence of d_fade(t+1) on d(t). Shown are Derrida plots for networks with parameters ū = 0, σ² = 1, r = 0.5, and three different values of K, as denoted in the figure. For each plot, the numerical value of the stable fixed point d*_fade = FADE(d*_fade), that is, the intersection with the identity function (dashed line), is given. (B) Comparison of the mean-field theory and numerical simulations. For each of the considered values of K, we compare the theoretical evolution of the Hamming distance obtained by iterating equation 3.2 (solid line) to results from numerical simulations (crosses). Simulation results are averages over 50 runs of the same network of N = 250 threshold gates where in each run, the input u(·) was randomly generated with "rate" r = Pr{u(t) = ū + 1}.
Hence, the stability of the fixed point d*_fade = 0 depends only on the probability P_BF(1, ·) that a single changed state component leads to a change of the output. Therefore, the so-called critical line |α| = 1, where the phase transition from ordered to chaotic behavior occurs, is given by

r · P_BF(1, ū + 1) + (1 − r) · P_BF(1, ū − 1) = 1/K.    (3.3)
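The quantity P_BF(1, u) entering equation 3.3 is, by definition, the probability that flipping a single one of a unit's K inputs changes the unit's output. Its closed form is derived in appendix A (not reproduced in this section), but the definition translates directly into a Monte Carlo estimate. The sketch below is our own and only approximates the critical-line criterion:

```python
import numpy as np

def p_bf_1(K, sigma2, u, n_samples=200_000, seed=1):
    """Monte Carlo estimate of P_BF(1, u): the probability that flipping
    one of the K inputs of a threshold gate changes its output
    Theta(w . x + u), with gaussian weights (mean 0, variance sigma2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(sigma2), size=(n_samples, K))
    x = rng.choice([-1, 1], size=(n_samples, K))
    h = (w * x).sum(axis=1) + u
    h_flip = h - 2 * w[:, 0] * x[:, 0]   # flip the first input component
    return np.mean((h >= 0) != (h_flip >= 0))

def alpha(K, sigma2, u_bar, r=0.5):
    """Estimated slope of the Derrida map at d(t) = 0; |alpha| = 1 is critical."""
    return K * (r * p_bf_1(K, sigma2, u_bar + 1)
                + (1 - r) * p_bf_1(K, sigma2, u_bar - 1))

a2 = alpha(K=2, sigma2=1.0, u_bar=0.0)
a8 = alpha(K=8, sigma2=1.0, u_bar=0.0)
print(a2, a8)
```

With σ² = 1 and ū = 0, the estimate stays below 1 for K = 2 (ordered, consistent with the remark below that K ≤ 2 networks cannot be made chaotic) and exceeds 1 for K = 8 (chaotic, consistent with Figure 2A).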
This criterion is related to criticality criteria that are derived using a single bit-flip approximation directly (Rohlf & Bornholdt, 2002). The corresponding phase diagram is shown in Figure 3. For fixed values of ū and σ², more incoming connections K tend to make the network more
Figure 3: The critical line. Shown is the critical line for parameters σ² and ū in dependence of K (denoted in the figure). As before, we set r = 0.5. Note that the ordered (chaotic) regime is to the left (right) of the critical line.
chaotic. Note that for K = 1 and K = 2, the network cannot be made chaotic (see appendix B for details). Ordered dynamics can be achieved by means of smaller internal connection weights (smaller σ²), which will increase the impact of the input signal, or by biasing the network toward −1 or +1 by changing the mean ū of the input signal. This second way to achieve ordered dynamics is analogous to using biased boolean functions in random boolean networks (Kauffman, 1993). Since the state encoding of the networks is symmetric, using a positive or negative mean ū is equivalent. Usually a higher bias is expected to give more ordered dynamics. However, for high values of K, this is not true for ū near ±1, which is an effect due to the binary input signals that are used here. Note that the ordered phase can be described by using the notion of fading memory (Boyd & Chua, 1985). Informally, a network has fading memory if the current state x(t) depends just on the values of its input u(t′) from some finite time window t′ ∈ [t − T, t] into the past (Maass, Natschläger,
& Markram, 2002).⁵ A slight reformulation of this property (Jaeger, 2002) shows that it is equivalent to the requirement that all state differences vanish (i.e., being in the ordered phase). This property is called the echo state property in Jaeger (2002). Fading memory plays an important role in the "liquid state machine" framework (Maass, Natschläger, & Markram, 2002), called echo state networks in Jaeger (2002), which we will employ (and discuss in more detail) in section 5 to show that networks operating near the critical line support complex computations. Intuitively speaking, in a network with fading memory, a state x(t) is fully determined by a finite history u(t − T), u(t − T + 1), …, u(t − 1), u(t) of the input u(·).⁶ This would in principle allow an appropriate readout function to deduce the recent input, or any function of it, from the network state. If, on the other hand, the network does not have fading memory (i.e., is in the chaotic regime), a given network state x(t) also contains "spurious" information about the initial conditions, and hence it is hard, or even impossible, to deduce any features of the recent input.

4 Separation as a Predictor for Computational Power

A further dynamical property that is especially important if a network is to be useful for computations on input time series is the separation property (see Maass, Natschläger, & Markram, 2002): any two different input time series that should produce different outputs (with regard to the readout function) drive the recurrent network into two (significantly) different states. From the point of view of the readout function, it is clear that different outputs can be produced only for different network states. Hence, only if different input signals separate the network state (i.e., different inputs result in different states) is it possible for a readout function to respond differently.
Furthermore, it is desirable that the separation (distance between network states) increase with the difference of the input signals. However, a simple separation measurement cannot be directly related to the computational power, since chaotic networks separate even minor differences in the input to a very high degree. This is because any single differing bit of the network state (eventually caused by the differing inputs) results in drastically different trajectories but is not a direct consequence of the different inputs. Therefore, we will propose in this section the so-called
⁵ More formally, a network is said to have the fading memory property if the following holds: for all ε > 0 and all initial conditions x₁(0), x₂(0) and input signals u₁(·), u₂(·), there exist δ > 0 and T ∈ ℕ such that ‖u₁ − u₂‖_input < δ ⇒ ‖x₁(T) − x₂(T)‖_state < ε. Here, x₁(T), x₂(T) denote the states that are obtained by running the network T time steps on initial conditions x₁(0), x₂(0) and input signals u₁(·), u₂(·), respectively. ‖·‖_input and ‖·‖_state are norms that turn the space of input signals and network states into compact vector spaces.
⁶ More precisely, there exists a function E such that x(t) = E(u(t − T), u(t − T + 1), …, u(t − 1), u(t)) for some finite T ∈ ℕ.
N. Bertschinger and T. Natschläger
network-mediated separation (NM-separation for short), which aims to capture the portion of the separation that is due to the differences in the input signals u1(·) and u2(·). As before, we measure differences of network states by their Hamming distance. The distance between input signals u1(·) and u2(·) is defined by

b = lim_{T→∞} (1/T) Σ_{t=0}^{T} (1 − δ(u1(t), u2(t))),

which measures the average fraction of time steps at which the input signals differ, where δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise. b can also be interpreted as the probability that u1(t) and u2(t) are different, that is, b = Pr{u1(t) ≠ u2(t)}. The mean-field theory we have developed (see appendix A) can easily be extended to calculate the state distance (separation) that results from applying inputs u1(·) and u2(·) with a mean distance of b (r denotes the "rate" r = Pr{u1(t) = ū + 1} of u1(·)):

dsep(t + 1) = SEP(d(t))
(4.1)
= b · (r · d(t + 1; ū + 1, ū − 1) + (1 − r) · d(t + 1; ū − 1, ū + 1))
+ (1 − b) · (r · d(t + 1; ū + 1, ū + 1) + (1 − r) · d(t + 1; ū − 1, ū − 1)).
(4.2)
As in section 3, we numerically solved equation 4.2 for the stable fixed point d*_sep = SEP(d*_sep), that is, the separation that results from long runs with inputs u1(·) and u2(·) at a distance of b. The part of this separation that is caused by the input distance b, and not by the distance of some initial states, is then given by d*_sep − d*_fade, since d*_fade measures the state distance that is caused by differences in the initial states and that remains even after long runs with identical inputs (see section 3). Note that d*_fade is always zero in the ordered phase and nonzero in the chaotic phase. Furthermore, we also have to consider the immediate separation dinp that is expected from the given input distribution (determined by ū and b). This term can be estimated as

dinp := b · (2 · q(ū, r) − 1)²,
where q(ū, r) is the fraction of nodes that "copy the input" (see appendix C for details).7 The immediate separation dinp is especially high for input-driven networks (low values of σ² and values of q(ū, r) close to 1.0). Obviously, this immediate separation cannot be utilized by any readout function that needs access to information about the input a few time steps ago, because the network state x(t) is dominated to a very large extent by the current input at time t. More informally, input-driven networks do not have any memory of recent inputs.
7 A unit copies the input if it outputs +1 if the input has the value ū + 1 and outputs −1 if the input has the value ū − 1.
Real-Time Computation at the Edge of Chaos
Since we want the NM-separation to be a predictor of computational power, we have to correct d*_sep − d*_fade by the immediate separation dinp in order to take the memory loss of input-driven networks into account. Hence, a suitable measure of the network-mediated separation due to input differences is given by

NMsep = d*_sep − d*_fade − b · (2 · q(ū, r) − 1)².
(4.3)
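The fixed points entering equation 4.3 can also be estimated empirically instead of via the mean-field map. The sketch below (our own illustration, not the authors' code) runs a random threshold network once on two input streams differing in a fraction b of bits from the same initial state, and once on identical inputs from two different initial states, giving single-run Monte Carlo estimates of d*_sep and d*_fade. The parameters are the critical-regime values of Figure 4 (K = 4, ū = 0.4, r = 0.5, b = 0.1, σ² = 0.5); N = 200 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

N, K = 200, 4                               # N is an arbitrary choice
sigma2, u_bar, r, b = 0.5, 0.4, 0.5, 0.1    # critical-regime values of Figure 4

pre = np.array([rng.choice(N, size=K, replace=False) for _ in range(N)])
w = rng.normal(0.0, np.sqrt(sigma2), size=(N, K))

def step(x, u):
    h = (w * x[pre]).sum(axis=1) + u
    return np.where(h >= 0, 1, -1)

def final_distance(u1, u2, x1, x2):
    """Normalized Hamming distance after a long run."""
    for t in range(len(u1)):
        x1, x2 = step(x1, u1[t]), step(x2, u2[t])
    return float(np.mean(x1 != x2))

T = 400
bits1 = rng.choice([-1, 1], size=T, p=[1 - r, r])
flips = rng.random(T) < b                   # flip a fraction b of the bits
u1 = u_bar + bits1
u2 = u_bar + np.where(flips, -bits1, bits1)

x0a = rng.choice([-1, 1], size=N)
x0b = rng.choice([-1, 1], size=N)

# d*_sep: same initial state, input streams at distance b
d_sep = final_distance(u1, u2, x0a, x0a)
# d*_fade: different initial states, identical input streams
d_fade = final_distance(u1, u1, x0a, x0b)

print(d_sep, d_fade, d_sep - d_fade)        # uncorrected separation
```

Averaging over many networks and input realizations, and subtracting the immediate-separation term b · (2q − 1)² from appendix C, would give the full NM-separation estimate.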
In Figures 4A and 4B, the uncorrected separation d*_sep − d*_fade and the NM-separation are plotted as functions of the input distance b for three representative parameter settings (the same as in Figure 1). As expected, both measures increase with b. Note that for ordered dynamics, the uncorrected separation assumes values similar to (or even higher than) those for critical parameters; this is because the network is very much input driven (see above). In contrast, the NM-separation, which is corrected by the immediate separation dinp, is highest for critical parameters. In Figures 4C and 4D, the NM-separation resulting from an input difference of b = 0.1 is shown as a function of the network parameters ū and σ². It can be clearly seen that the separation peaks at the critical line. Because of the computational importance of the separation property, this also suggests that the computational capabilities of the networks will peak at the onset of chaos, which is confirmed in the next section.

5 Real-Time Computations at the Edge of Chaos

To assess the computational power of a network, we make use of the so-called liquid state machine framework proposed by Maass et al. (2002) and independently by Jaeger (2002), who called it "echo state networks." They put forward the idea that any complex time-series computation can be implemented by a system that consists of two conceptually different parts: a dynamical system with rich dynamics followed by a memoryless readout function (the basic architecture is depicted in Figure 5). This idea is based on the following observation: if one excites a sufficiently complex recurrent circuit with an input signal and looks at a later time at the current internal state of the circuit, then this state is likely to hold a substantial amount of information about recent inputs.
In order to implement a specific task, called the target filter in the following, it is enough that the readout function be able to extract the relevant information from the network state. This amounts to a classical pattern recognition problem, since the temporal dynamics of the input stream has been transformed by the recurrent circuit into a high-dimensional spatial pattern. Following this line of argument, the construction of a liquid state machine that implements an arbitrary filter can be decomposed into two steps: (1) choose a proper general-purpose recurrent network, and (2) train a readout function to map the network state to the desired output.
Figure 4: NM-separation assumes high values on the critical line. (A) The uncorrected separation d*_sep − d*_fade is shown for ordered (σ² = 0.1), critical (σ² = 0.5), and chaotic (σ² = 5) dynamics as a function of the input distance b (K = 4, ū = 0.4, r = 0.5). (B) Same as A but for the NM-separation. (C) The gray-coded image shows the NM-separation in dependence on σ² and ū for K = 4, r = 0.5, and b = 0.1 (white: low values; dark gray: high values). The solid line marks the critical line. (D) Same as C but for K = 8.
This approach is potentially successful if the general-purpose network encodes the relevant features of the input signal in the network state in such a way that the readout function can easily extract them. In the context of the network model considered in this article, we have already investigated in depth two properties that support such a requirement: fading memory (i.e., ordered dynamics) and (network-mediated) separation. However, it is not yet clear whether these properties are enough to ensure that the information can be extracted easily. In fact, up to now, it can be proven only
Figure 5: Setup used to train a linear classifier for a given computational task. The input u(·) drives the dynamic reservoir (the randomly connected recurrent network in our case), while a memoryless readout function C, which has access to the full network state x(t), computes the actual output v(t) = C(x(t)). Only the weights w ∈ R^N, w0 ∈ R of the linear classifier are adjusted such that the actual network output v(t) = Θ(w · x(t) + w0) is as close as possible to the target values y(t) = (F u)(t). The signals shown are an example of the performance of a trained network on a one-step-delayed XOR task, that is, y(t) = XOR(u(t − 1), u(t − 2)).
(Maass et al., 2002) that any time-invariant filter with fading memory can be approximated to any desired degree of precision if (1) the recurrent network has fading memory, (2) the recurrent network has the separation property, and (3) the readout is able to approximate any given continuous function that maps current states to current outputs. Hence, in a worst-case scenario, it may happen that the readout function must be arbitrarily complex to implement the desired target filter. However, we will show that near the critical line, the network encodes the information in such a way that a simple linear classifier C(x(t)) = Θ(w · x(t) + w0) suffices to implement a broad range of complex nonlinear functions.8
To assess the computational power in a principled way, networks with different parameters were tested on a delayed three-bit parity task. Here, the desired output signal y_τ(t) is given by PARITY(u(t − τ), u(t − τ − 1), u(t − τ − 2)) for increasing delays τ ≥ 0. The task is quite complex for the networks considered here, since the parity function is not linearly separable and
8 In order to train the network for a given target filter F, only the parameters w ∈ R^N, w0 ∈ R of the linear classifier are adjusted such that the actual network output v(t) is as close as possible to the target values y(t) = (F u)(t). The weights w ∈ R^N, w0 ∈ R are determined using standard linear regression in an off-line fashion.
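As a concrete illustration of the setup in Figure 5, the sketch below (our own minimal example, not the authors' code) drives a random threshold network with a binary input stream and trains only a linear readout, via the least-squares (pseudo-inverse) solution mentioned in note 8, on the one-step-delayed XOR task y(t) = XOR(u(t − 1), u(t − 2)); with ±1-coded bits, XOR(p, q) is the product −p·q. The parameter values are assumptions in the spirit of the text (near-critical, unbiased).

```python
import numpy as np

rng = np.random.default_rng(2)

N, K = 250, 4                   # N as in Figure 6; other values assumed
sigma2, u_bar = 0.5, 0.0        # near-critical, unbiased network

pre = np.array([rng.choice(N, size=K, replace=False) for _ in range(N)])
w_net = rng.normal(0.0, np.sqrt(sigma2), size=(N, K))

def run(bits):
    """Return the state trajectory; states[t] is the state after input bits[t]."""
    x = rng.choice([-1, 1], size=N)
    states = []
    for s in bits:
        h = (w_net * x[pre]).sum(axis=1) + (u_bar + s)
        x = np.where(h >= 0, 1, -1)
        states.append(x)
    return np.array(states)

bits = rng.choice([-1, 1], size=3000)
states = run(bits)

# Readout at states[t] should output XOR of the two most recent inputs,
# bits[t] and bits[t-1]; in +/-1 coding, XOR(p, q) = -p*q.
X = states[1:]
y = -(bits[1:] * bits[:-1])

X_tr, y_tr, X_te, y_te = X[:2000], y[:2000], X[2000:], y[2000:]

# Train only the linear classifier v = Theta(w . x + w0) by least squares.
A = np.hstack([X_tr, np.ones((len(X_tr), 1))])
w_out = np.linalg.lstsq(A, y_tr, rcond=None)[0]

A_te = np.hstack([X_te, np.ones((len(X_te), 1))])
pred = np.where(A_te @ w_out >= 0, 1, -1)
acc = float(np.mean(pred == y_te))
print("delayed-XOR test accuracy:", acc)
```

Note that the recurrent weights are never trained; only the 251 readout parameters are fitted, exactly in the spirit of the two-step construction described above.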
it requires memory. Hence, to achieve good performance, it is necessary that a state x(t) contain information about several input bits u(t'), t' < t, in a nonlinearly transformed form, such that a linear classifier C is sufficient to perform the nonlinear computation of the parity task.
In Figure 6A, the mutual information MI(v_τ, y_τ)9 measured on a test set between the network output v_τ(·) trained with delay τ and the target signal y_τ(·) is shown for increasing delays τ (cf. Natschläger & Maass, forthcoming).10 Following Jaeger (2002), we define a memory capacity MC = Σ_{τ=0}^{∞} MI(v_τ, y_τ). In Figure 6B, the dependence of MC on the parameters of the randomly drawn networks is shown. Each data point shows the mean over 10 different networks with the given parameters.11 The highest performance is clearly achieved for parameter values close to the critical line, where the phase transition occurs. This has been noted before (Langton, 1990; Packard, 1988). In Packard (1988), for example, cellular automata were evolved to solve one particular task. Even though there is evidence that evolution will find solutions near the critical line, the specific dynamics selected are highly task specific. For this and other reasons, edge-of-chaos interpretations of these results were criticized (Mitchell, Hraber, & Crutchfield, 1993). In contrast, the networks used here are not optimized for any specific task; only a linear readout function is trained to extract the task-specific information that is already present in the state of the system. Furthermore, each network is evaluated on many different tasks, such as the 3-bit parity of increasingly delayed input values. Therefore, a network that is specifically designed for a single task will not show good performance in this setup.

9 The mutual information MI(v, y) (in bits) between two signals v(·) and y(·) is defined as MI = Σ_{v'} Σ_{y'} p(v', y') log2 [ p(v', y') / (p(v') p(y')) ], where the sums are over all possible values ({−1, +1} in our case) of v(·) and y(·), p(v', y') = Pr{v(t) = v' ∧ y(t) = y'} is the joint probability, and p(v') = Pr{v(t) = v'} and p(y') = Pr{y(t) = y'} are the marginal distributions. Note that all these probabilities can reliably be estimated simply by counting, since v(·) and y(·) are binary signals in our case.
10 Note that for each delay τ, a separate classifier is trained. For training a single linear classifier, a training set {⟨x_l, y_l⟩}, l = 1, …, 9000, where the x_l are network states and the y_l ∈ {+1, −1} are the target values given by the considered target filter, was constructed as follows. The considered network was run from 10 randomly drawn initial states on randomly drawn input signals of length 5000 and rate r = 0.5. The first 500 states were discarded, and then the network state of every fifth time step was added with the corresponding target value y_l to the training set. The weights w0 ∈ R, w = ⟨w1, …, wN⟩ ∈ R^N of the linear classifier were then determined by standard linear regression, that is, by minimizing the squared error Σ_l (w0 + w · x_l − y_l)² via the pseudo-inverse solution. The performance of the trained classifier was then measured on a test set. For this, the network was run again from 10 initial states for 2000 time steps. The initial states and input signals were again randomly drawn and different from the ones used during training. As before, the first 500 states were discarded.
11 Standard deviations are not shown because they are quite small (< 0.5 bit of MC). The performance differences across the critical line are therefore highly significant.
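Because v(·) and y(·) are binary, the estimator in note 9 reduces to counting joint occurrences. Here is a minimal self-contained version (our own illustration, not the authors' code):

```python
from collections import Counter
from math import log2

def mutual_information(v, y):
    """MI in bits between two equal-length sequences of binary (+/-1) values,
    estimated by counting, as in note 9."""
    n = len(v)
    joint = Counter(zip(v, y))      # joint counts of (v', y') pairs
    pv, py = Counter(v), Counter(y) # marginal counts
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pv[a] / n) * (py[b] / n)))
    return mi

# Perfectly dependent balanced signals carry 1 bit; independent ones 0 bits.
v1 = [-1, +1] * 500
print(mutual_information(v1, v1))   # -> 1.0

v2 = [-1, -1, +1, +1] * 250
y2 = [-1, +1, -1, +1] * 250
print(mutual_information(v2, y2))   # -> 0.0
```

Summing such estimates over delays τ gives the memory capacity MC defined above.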
Figure 6: (A) Memory curve for the three-bit parity task. Shown is the mutual information MI(v, y) between the classifier output v(·) and the target function y(t) = PARITY(u(t − τ), u(t − τ − 1), u(t − τ − 2)) on a test set (see the text for details) for various delays τ. Parameters: N = 250, K = 4, σ² = 0.2, ū = 0, and r = 0.5. (B) The gray-coded image (an interpolation between the data points marked with open diamonds) shows the performance of trained networks in dependence of the parameters σ² and ū for the same task as in A. Performance is measured as the memory capacity MC = Σ_τ MI(τ), that is, the "area" under the memory curve. Remaining parameters as in A.
These considerations suggest the following hypothesis regarding the computational function of generic recurrent neural circuits: to serve as a general-purpose temporal integrator and simultaneously as a kernel (i.e., a nonlinear projection into a higher-dimensional space) that facilitates the subsequent linear readout of information whenever it is needed. The network size N determines the dimension of the space into which the input signal is nonlinearly transformed by the network and in which the linear classifier operates. It is expected that MC increases with N due to the enhanced discrimination power of the readout function. Hence, it is worthwhile to investigate how the computational power (in terms of MC) scales with the network size N (see Figure 7). Interestingly, the steepest increase of MC with increasing N is observed for critical parameters (almost perfect logarithmic scaling). In contrast, at noncritical parameters, the performance on the delayed 3-bit parity task grows only slightly with increasing network size.
To further explore the idea of computation at the edge of chaos, the above simulations were repeated for different values of K and different tasks. The networks were trained on a delayed 1-bit (actually just a delay line), a delayed 3-bit, and a delayed 5-bit parity task, as well as on 50 randomly drawn boolean functions of the last five inputs, that is, y(t) = f(u(t), u(t − 1), u(t − 2), u(t − 3), u(t − 4)) for a randomly drawn boolean function f:
Figure 7: Computational capabilities scale with network size. Shown is the performance on the 3-bit parity task in dependence on the network size N (mean MC ± standard deviation over 10 randomly drawn networks for each parameter setting). The remaining parameters were chosen as K = 4, r = 0.5, ū = 0.4, and σ² = 0.1 (ordered), σ² = 0.5 (critical), and σ² = 5 (chaotic).
f: {−1, +1}⁵ → {−1, +1}. Average results over 10 randomly drawn networks for each parameter setting are shown in Figure 8. In almost all cases, the best performance is achieved close to the critical line. Only for the simplest task considered, a delay line, is the best performance achieved in the chaotic regime. In the case of K = 2, the network can never be made chaotic by increasing σ², but the performance still peaks at some intermediate value of σ². The reasons remain to be uncovered. It is also notable that even though the best performance is usually achieved close to the critical line, the performance varies considerably along it. Especially at high values of K, performance drops significantly if the network is stabilized by using large biases ū. Hence, it is advantageous to use unbiased networks (i.e., ū = 0). This is probably related to the fact that for ū = 0, the entropy of the network states is highest. Normalizing MC by an entropy bound obtained from the mean activity12 suggests that it indeed
12 For a given mean activity a = Pr{xi(t) = +1}, the maximum entropy of a single unit is H(a) = −a log a − (1 − a) log(1 − a). Note that 0 ≤ H(a) ≤ 1 and H(0.5) = 1. Hence, an upper bound for the entropy of the network states is N · H(a), which reaches its maximum for a = 0.5. To compensate for the entropy drop, we considered the scaled quantity MC · N · H(0.5)/(N · H(a)) = MC/H(a).
Figure 8: Performance of trained networks for various parameters and different tasks of increasing complexity. The performance (as measured in Figure 6B) is shown in dependence of the parameters σ² and ū for K = 2, 4, 8 (left to right) and for the 1-, 3-, and 5-bit parity tasks as well as for an average over 50 randomly drawn boolean functions (top to bottom).
explains a large amount of the performance drop (not shown). Only for the randomly drawn boolean functions do slightly biased networks show a significantly higher performance, because unbiased networks cannot implement symmetric functions (about half of the randomly drawn functions will be symmetric), since they are antisymmetric devices (compare Jaeger, 2002).

6 Discussion

We developed a mean-field theory for input-driven networks, which allows an accurate determination of the position of the transition line between
ordered and chaotic dynamics with respect to the parameters controlling the network connectivity and the input statistics. To our knowledge, this is the first time that networks that receive a time-varying input signal have been considered in this context. Additionally, a clear correlation between the network dynamics and the computational power for real-time time-series processing was established. We found that near the critical line, the type of network considered here supports a broad class of complex computations. These results provide further evidence for the idea of computation at the edge of chaos (Langton, 1990). In the context of the questions raised in the introduction, our findings point to the following answers:

• Dynamics near the critical line are a general property of input-driven dynamical systems that support complex real-time computations.
• Such systems can be used by training a simple readout function to approximate an arbitrarily complex filter.

Furthermore, this suggests that to exhibit sophisticated information-processing capabilities, an adaptive system should stay close to critical dynamics. Since the dynamics is also influenced by the statistics of the input signal (e.g., the mean, as shown here), these results indicate a new role for plasticity rules: to stabilize the dynamics at the critical line. Such rules would then implement what could be called dynamical homeostasis, which could be achieved in an unsupervised manner. Examples of such rules, which adjust the connectivity K toward the critical line in threshold networks, can be found in Bornholdt and Rohlf (2000) and Bornholdt and Rohlf (2003). Such a system then actively adjusts its dynamics toward criticality, so that it shows self-organized criticality (Bak, Tang, & Wiesenfeld, 1987, 1988). The rule described in Bornholdt and Rohlf (2003) is also interesting from a biological point of view since it is based on local Hebb-like correlations.
In the context of artificial neural networks, it is shown in Dauce, Quoy, Cessac, Doyon, and Samuelides (1998) that Hebbian learning drives the dynamics from the chaotic into the ordered regime. Hence, combining task-specific optimization provided by (supervised) learning rules (Hertz, Krogh, & Palmer, 1991) with a more general adjustment of the dynamics would allow a system to gather specific information while self-organizing its dynamics toward criticality in order to be able to react flexibly to other incoming signals. We believe that this viewpoint will be a fruitful avenue for further exploring the mysteries of brain functioning. By combining self-organized criticality with learning, it might be possible to uncover some of the principles underlying the impressive adaptability exhibited by brains.
Appendix A: Time Evolution of the Hamming Distance d(t)

To determine the time evolution of the Hamming distance, we consider a randomly drawn set of weights wij and the following scenario:

1. The state x1(t) is mapped via equation 2.1 to the state x1(t + 1), where u1 is given as input.
2. The state x2(t) is mapped by the same network to the state x2(t + 1), where u2 is given as input.

We will derive an equation that relates the Hamming distance d(t) at time t between the two states x1(t) and x2(t) to the Hamming distance d(t + 1; u1, u2) at time t + 1 between the two states x1(t + 1) and x2(t + 1), given the two inputs u1 and u2. Assume that out of the K inputs xj(t) to a given unit i, exactly c (0 ≤ c ≤ K) differ between the states x1(t) and x2(t), that is, c = |D| with D = {j : x1,j(t) ≠ x2,j(t) ∧ wij ≠ 0}. The resulting internal activations at time t + 1 of a unit i in the two cases can be written as (indices i and t omitted for brevity)

h1 = Σ_j wij x1,j + u1 = a + b + u1
h2 = Σ_j wij x2,j + u2 = a − b + u2,
where a = Σ_{j∈E} wij x1,j is the sum over the K − c state components that are equal (E = {j : x1,j(t) = x2,j(t) ∧ wij ≠ 0}) and b = Σ_{j∈D} wij x1,j is the sum over the c differing state components (D = {j : x1,j(t) ≠ x2,j(t) ∧ wij ≠ 0}) (cf. Derrida, 1987).
In the following, we apply the annealed approximation introduced by Derrida and others (Derrida & Pomeau, 1986). The basic assumption of the annealed approximation in our context is that at each time step, the whole weight matrix is randomly regenerated. This has the effect that correlations between units that would otherwise build up during previous time steps need not be considered. Although this is a rather radical assumption, the predictions obtained for the quenched case (weight matrix generated once) are quite accurate (see Figure 2). This is due to the fact that if the number of inputs to a unit is small compared to the network size (K ≪ ln N), then correlations between units build up very slowly and can be neglected in the limit N → ∞ (Derrida, 1987; Derrida & Weisbuch, 1986).
We calculate the probability PBF(c, u1, u2) that the states x1,i(t + 1) and x2,i(t + 1) of a node i at time t + 1 differ, given that exactly c of the K inputs to unit i differ and u1 and u2 are the external inputs to the nodes, independent of i. Note that PBF(c, u1, u2) is equal to the probability that h1
and h2 have different signs:

PBF(c, u1, u2) = Pr{x1,i(t + 1) ≠ x2,i(t + 1)}
= Pr{Θ(h1) ≠ Θ(h2)}
= Pr{h1 ≥ 0 ∧ h2 < 0} + Pr{h1 < 0 ∧ h2 ≥ 0}
= Pr{a ≥ −b − u1 ∧ a < b − u2} + Pr{a < −b − u1 ∧ a ≥ b − u2}.   (A.1)

Under the approximations made, a Hamming distance of d(t) can be interpreted as the probability that a given component of the two states x1(t) and x2(t) differs. Then the probability that exactly c of the K inputs to a given node i differ is given by the binomial distribution C(K, c) · d(t)^c · (1 − d(t))^(K−c), where C(K, c) denotes the binomial coefficient. Averaging over all possible values of c, we arrive at the following formula relating the Hamming distance at one time step to that at the next:

d(t + 1; u1, u2) = Σ_{c=0}^{K} C(K, c) · d(t)^c · (1 − d(t))^(K−c) · PBF(c, u1, u2).   (A.2)
For weights drawn from a gaussian distribution with zero mean, a and b are normally distributed with zero mean and variances σa² = (K − c)σ² and σb² = cσ², respectively, and PBF can be calculated as the integral over the area in the (a, b) plane corresponding to the constraints defined in equation A.1:

PBF(c, u1, u2) = ∫_{−∞}^{∞} φ(b; 0, cσ²) ( ∫_{min{b−u2, −b−u1}}^{max{b−u2, −b−u1}} φ(a; 0, (K − c)σ²) da ) db   (A.3)

for 0 < c < K, where φ(x; μ, σ²) = (1/√(2πσ²)) e^(−(1/2)(x−μ)²/σ²) denotes the gaussian density function. Note that in the cases c = 0 and c = K, the equation for PBF(c, u1, u2) collapses to a single integral:

PBF(0, u1, u2) = ∫_{min{−u2, −u1}}^{max{−u2, −u1}} φ(a; 0, Kσ²) da
PBF(K, u1, u2) = 1 − ∫_{min{−u1, u2}}^{max{−u1, u2}} φ(b; 0, Kσ²) db.

This can easily be concluded by considering equation A.1 and noting that c = 0 ⇒ b = 0 and c = K ⇒ a = 0. For brevity, we write d(t; u) := d(t; u, u) and PBF(c, u) := PBF(c, u, u) in the main text.
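The integrals in equation A.3 need not be evaluated in closed form to use the map A.2: since a ~ N(0, (K − c)σ²) and b ~ N(0, cσ²), PBF(c, u1, u2) is simply the probability that h1 = a + b + u1 and h2 = a − b + u2 have different signs. The sketch below (our own illustration, not the authors' code) estimates PBF by Monte Carlo and iterates equation A.2 for the single branch u1 = u2 = ū + 1; the parameter values K = 4, σ² = 5, ū = 0.4 are chaotic-regime values used in the figures.

```python
import random
from math import comb, sqrt

random.seed(3)

def p_bf(c, u1, u2, K, sigma2, samples=10000):
    """Monte Carlo estimate of the bit-flip probability of equations A.1/A.3."""
    sa = sqrt((K - c) * sigma2)   # std of a (sum over equal components)
    sb = sqrt(c * sigma2)         # std of b (sum over differing components)
    flips = 0
    for _ in range(samples):
        a, b = random.gauss(0.0, sa), random.gauss(0.0, sb)
        if (a + b + u1 >= 0) != (a - b + u2 >= 0):  # sign(h1) != sign(h2)
            flips += 1
    return flips / samples

def hamming_map(d, u1, u2, K, sigma2):
    """One application of equation A.2."""
    return sum(comb(K, c) * d**c * (1 - d)**(K - c) * p_bf(c, u1, u2, K, sigma2)
               for c in range(K + 1))

K, sigma2, u_bar = 4, 5.0, 0.4    # chaotic-regime values from the figures
d = 0.5
for _ in range(20):               # iterate toward a fixed point
    d = hamming_map(d, u_bar + 1, u_bar + 1, K, sigma2)
print("fixed-point distance estimate:", d)
```

Note that p_bf(0, u, u, …) returns exactly 0, in agreement with the determinism argument used in appendix B.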
Appendix B: Calculation of the Criticality Criterion

The derivative ∂d_fade(t + 1; u1, u2)/∂d at d = 0 can be calculated as follows (we write P(c) as short for PBF(c, u1, u2)):

∂/∂d [ Σ_{c=0}^{K} C(K, c) d^c (1 − d)^(K−c) P(c) ] |_{d=0}
= ∂/∂d [ (1 − d)^K P(0) ] |_{d=0} + ∂/∂d [ K d (1 − d)^(K−1) P(1) ] |_{d=0}
  + Σ_{c=2}^{K} ∂/∂d [ C(K, c) d^c (1 − d)^(K−c) P(c) ] |_{d=0}
= ( −K (1 − d)^(K−1) P(0) ) |_{d=0} + ( K ((1 − d)^(K−1) − d(K − 1)(1 − d)^(K−2)) P(1) ) |_{d=0} + 0
= K (P(1) − P(0)).
Since the dynamics of the network is deterministic, the same input never leads to an output bit flip, that is, PBF(0, u, u) = 0. Together with

d_fade(t + 1) = r · d(t + 1; ū + 1, ū + 1) + (1 − r) · d(t + 1; ū − 1, ū − 1),

we obtain

∂d_fade(t + 1)/∂d(t) |_{d(t)=0} = K · (r · PBF(1, ū + 1, ū + 1) + (1 − r) · PBF(1, ū − 1, ū − 1)).

Investigation of the special cases c = 1, K = 1, u1 = u2 = u and c = 1, K = 2, u1 = u2 = u will show that for K = 1 and K = 2, the network is always in the ordered phase, regardless of the parameters σ² and ū. For c = 1, K = 1, and u1 = u2 = u, we get

PBF(1, u, u) = 1 − ∫_{min{−u, u}}^{max{−u, u}} φ(b; 0, σ²) db < 1.
Hence, for the criticality criterion, equation 3.3, we obtain

r · PBF(1, ū + 1, ū + 1) + (1 − r) · PBF(1, ū − 1, ū − 1) < 1 = 1/K,

since both PBF terms are strictly less than 1 and 0 ≤ r ≤ 1, so the criterion cannot be reached.
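For K = 1, the bound above can be evaluated in closed form with the gaussian CDF. The snippet below (our own illustration) computes PBF(1, u, u) = 1 − [Φ(max(−u, u); 0, σ²) − Φ(min(−u, u); 0, σ²)] and checks numerically, for a few sample parameter values, that the K = 1 criterion stays below 1, i.e., that such networks are ordered.

```python
from math import erf, sqrt

def Phi(x, var):
    """Gaussian CDF with mean 0 and variance var."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0 * var)))

def p_bf_K1(u, sigma2):
    """PBF(1, u, u) for K = 1: mass outside [min(-u, u), max(-u, u)]."""
    return 1.0 - (Phi(max(-u, u), sigma2) - Phi(min(-u, u), sigma2))

K, r = 1, 0.5
ordered_everywhere = True
for u_bar in (0.0, 0.4, 1.0):
    for sigma2 in (0.1, 0.5, 5.0):
        crit = K * (r * p_bf_K1(u_bar + 1, sigma2)
                    + (1 - r) * p_bf_K1(u_bar - 1, sigma2))
        ordered_everywhere &= crit < 1.0
print("K = 1 ordered for all tested parameters:", ordered_everywhere)
```
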
For c = 1, K = 2, and u1 = u2 = u, we get

PBF(1, u, u) = ∫_{−∞}^{∞} db φ(b; 0, σ²) ∫_{min{b−u, −b−u}}^{max{b−u, −b−u}} da φ(a; 0, σ²).
Note that this is an integral over the product density φ(b; 0, σ²) · φ(a; 0, σ²), with equal variance in each dimension. Taking into account the area of
integration, which is spanned by the two lines b − u and −b − u and covers half of the (a, b) plane, it is clear that at most half of the probability mass (in the case of u = 0) is covered by that integral; that is, PBF(1, u, u) < 0.5 for u ≠ 0 and PBF(1, 0, 0) = 0.5. Hence, the criticality criterion reads

r · PBF(1, ū + 1, ū + 1) + (1 − r) · PBF(1, ū − 1, ū − 1) < 0.5 = 1/K,

and since 0 ≤ r ≤ 1, it again cannot be satisfied.
Appendix C: Correction Term for the NM-Separation

We first calculate the fraction q(ū, r) of neurons that copy the input. A unit i copies the input u(t) = ū + β(t), with the bit β(t) ∈ {−1, +1}, if its output has the same value as β(t), that is, xi(t) = β(t). q(ū, r) is given as follows:

q(ū, r) := r · Pr{ Σ_j wij xj(t − 1) + ū + 1 ≥ 0 } + (1 − r) · Pr{ Σ_j wij xj(t − 1) + ū − 1 < 0 },
where r = Pr{u(t) = ū + 1}. The sum Σ_j wij xj(t − 1) is normally distributed with zero mean and variance Kσ², which yields

q(ū, r) = r · Φ(ū + 1; 0, Kσ²) + (1 − r) · Φ(−ū + 1; 0, Kσ²),

where Φ(x; μ, σ²) denotes the gaussian cumulative density ∫_{−∞}^{x} (1/√(2πσ²)) e^(−(1/2)(z−μ)²/σ²) dz.
As the next step, we calculate the probability that for two given inputs u1 and u2 with b := Pr{u2 ≠ u1}, the outputs x1 = Θ(Σ_j wij xj(t − 1) + u1) and x2 = Θ(Σ_j wij xj(t − 1) + u2) of a unit i are different. Pr{x1 ≠ x2} can be interpreted as the Hamming distance that is generated immediately by the input:

Pr{x1 ≠ x2} = Pr{u1 = u2} · (Pr{u1 is copied and u2 is not copied} + Pr{u1 is not copied and u2 is copied})
+ Pr{u1 ≠ u2} · (Pr{u1 is copied and u2 is copied} + Pr{u1 is not copied and u2 is not copied}).

This can be simplified to (arguments (ū, r) dropped for the sake of brevity)

Pr{x1 ≠ x2} = 2(1 − b)q(1 − q) + b(q² + (1 − q)²).
Since we want an estimate of the Hamming distance due to the difference b of the inputs, we correct the above term by Pr{x1 ≠ x2}|_{b=0} and use the result as the correction term introduced in section 4:

Pr{x1 ≠ x2} − Pr{x1 ≠ x2}|_{b=0} = 2(1 − b)q(1 − q) + b(q² + (1 − q)²) − 2q(1 − q)
= b(2q − 1)².

Acknowledgments
We thank Alexander Kaske for stimulating discussions and helpful comments on the manuscript of this article.

References

Bak, P., Tang, C., & Wiesenfeld, K. (1987). Self-organized criticality: An explanation of 1/f noise. Physical Review Letters, 59(4), 381–384.
Bak, P., Tang, C., & Wiesenfeld, K. (1988). Self-organized criticality. Physical Review A, 38(1), 364–374.
Bornholdt, S., & Rohlf, T. (2003). Self-organized critical neural networks. Physical Review E, 67, 066118.
Bornholdt, S., & Rohlf, T. (2000). Topological evolution of dynamical networks: Global criticality from local dynamics. Physical Review Letters, 84(26), 6114–6117.
Boyd, S., & Chua, L. O. (1985). Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, 32, 1150.
Dauce, E., Quoy, M., Cessac, B., Doyon, B., & Samuelides, M. (1998). Self-organization and dynamics reduction in recurrent networks: Stimulus presentation and learning. Neural Networks, 11, 521–533.
Derrida, B. (1987). Dynamical phase transition in non-symmetric spin glasses. J. Phys. A: Math. Gen., 20, 721–725.
Derrida, B., & Pomeau, Y. (1986). Random networks of automata: A simple annealed approximation. Europhys. Lett., 1, 45–52.
Derrida, B., & Weisbuch, G. (1986). Evolution of overlaps between configurations in random Boolean networks. J. Physique, 47, 1297.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Cambridge, MA: Perseus Books.
Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach (GMD Tech. Rep. 159). Berlin: German National Research Center for Information Technology.
Kauffman, S. A. (1993). The origins of order: Self-organization and selection in evolution. New York: Oxford University Press.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42.
Luque, B., & Sole, R. (2000). Lyapunov exponents in random boolean networks. Physica A, 284, 33–45.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P. T., & Crutchfield, J. P. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Natschläger, T., & Maass, W. (2004). Information dynamics and emergent computation in recurrent circuits of spiking neurons. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Rohlf, T., & Bornholdt, S. (2002). Criticality in random threshold networks: Annealed approximation and beyond. Physica A, 310, 245–259.
Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Upper Saddle River, NJ: Prentice Hall.
van Vreeswijk, C. A., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.

Received October 30, 2003; accepted January 7, 2004.
LETTER
Communicated by Shun-ichi Amari
Information Geometry of U-Boost and Bregman Divergence

Noboru Murata
[email protected]
School of Science and Engineering, Waseda University, Shinjuku, Tokyo 169-8555, Japan
Takashi Takenouchi [email protected] Department of Statistical Science, Graduate University of Advanced Studies, Minato, Tokyo 106-8569, Japan
Takafumi Kanamori [email protected] Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Meguro, Tokyo 152-8552, Japan
Shinto Eguchi [email protected] Institute of Statistical Mathematics, Japan, and Department of Statistical Science, Graduate University of Advanced Studies, Minato, Tokyo 106-8569, Japan
We aim at an extension of AdaBoost to U-Boost, in the paradigm of building a stronger classification machine from a set of weak learning machines. A geometric understanding of the Bregman divergence defined by a generic convex function U leads to the U-Boost method in the framework of information geometry extended to the space of the finite measures over a label set. We propose two versions of U-Boost learning algorithms by taking account of whether the domain is restricted to the space of probability functions. In the sequential step, we observe that the two adjacent classifiers and the initial classifier are associated with a right triangle in the scale via the Bregman divergence, called the Pythagorean relation. This leads to a mild convergence property of the U-Boost algorithm, as seen in the expectation-maximization algorithm. Statistical discussions of consistency and robustness elucidate the properties of the U-Boost methods based on a stochastic assumption for training data.

Neural Computation 16, 1437–1481 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

In the past decade, several novel developments for classification and pattern recognition have been made, mainly along the lines of statistical learning theory
(see, e.g., McLachlan, 1992; Bishop, 1995; Vapnik, 1995; Hastie, Tibshirani, & Friedman, 2001). Several important approaches have been proposed and implemented in feasible computational algorithms. One promising direction is boosting, which is a method of combining many learning machines trained by simple learning algorithms. Theoretical research on boosting began with a question by Kearns and Valiant (1988): "Can a weak learner which is a bit better than random guessing be boosted into an arbitrarily accurate strong learner?" The first interesting answer was given by Schapire (1990), who proved that it is possible to construct an accurate machine by combining three machines trained on different examples, which are sequentially sampled and filtered by the previously trained machines. Intuitively speaking, the key idea of the boosting algorithm is to sort important and unimportant examples according to whether machines are good or weak at predicting them. The procedures are summarized in the following three types:

• Filtering: New examples are sampled and filtered by the previously trained machines so that as many difficult examples as easy examples are collected (Schapire, 1990; Domingo & Watanabe, 2000).

• Resampling: Examples are resampled from the given examples repeatedly so that difficult examples are chosen with high probability (Freund, 1995).

• Reweighting: Given examples are weighted so that difficult examples severely affect the error (Freund & Schapire, 1997; Friedman, Hastie, & Tibshirani, 2000).

In this letter, we focus on the reweighting method, including AdaBoost (Freund & Schapire, 1997). Lebanon and Lafferty (2001) give a geometric consideration of the extended Kullback-Leibler divergence, which leads to a close relation between AdaBoost and logistic discrimination. The main objective in this letter is to propose a class of boosting algorithms, U-Boost, which is naturally derived from the Bregman divergence.
This proposal provides an extension of the geometry discussed by Lebanon and Lafferty and elucidates the fact that the Bregman divergence associates the normalized with the unnormalized U-Boost from the viewpoint of information geometry. There are several different purposes for classification problems. The extension from AdaBoost to U-Boost can correspond to these different requirements, including robustness to mislabeling and to outliers in the feature vectors. In this context, MadaBoost and Eta-Boost in the class of U-Boost are introduced and discussed. This letter is organized as follows. In section 2, we give the structure of the U-Boost algorithm proposed here. In section 3, we introduce the Bregman divergence in order to give a statistical framework for boosting algorithms, discuss some of its properties in the sense of information geometry, and give a geometrical understanding of the U-Boost algorithm. In section 4, we
Table 1: Relationship Between Classifiers and Decision Functions.

Classifier | Decision Function
voting classifier | combined decision function
discuss the consistency, efficiency, and robustness of the algorithm based on the statistical consideration given in the previous section and provide some illustrative examples with numerical simulations. The last section sets out our concluding remarks and future work.

2 U-Boost Algorithm

Let us consider a classification problem where, for a given feature vector x, the corresponding label y is predicted. Hereafter we assume that the feature vector x belongs to some space 𝒳 and the corresponding label y to a finite set 𝒴. We note that the problem is mainly written in the case of multiclass discrimination; however, we consider the case where y is the binary label with values −1 and 1 for the intuitive examples in this article. Let h be a classifier, where h is a set-valued function; that is, for a given feature vector x, h returns a subset C of class labels,

h: x ∈ 𝒳 ↦ C ⊂ 𝒴.   (2.1)

For example, if the possible class labels for a feature vector x are a, b, and c in 𝒴, then h returns the set of labels h(x) = {a, b, c}. Next, we define a decision function on 𝒳 × 𝒴 from the classifier h as

f(x, y) = 1 if y ∈ h(x), and 0 otherwise;   (2.2)

namely,

f(x, y) = I(y ∈ h(x)),   (2.3)

where I is the indicator function defined by

I(A) = 1 if A is true, and 0 otherwise.   (2.4)
In the following discussion, we identify a classifier h with a decision function f through the above correspondence. For a set of decision functions f_t(x, y) (t = 1, …, T), let us define a combined decision function with confidence rates α_t ∈ ℝ (t = 1, …, T) as

F(x, y) = Σ_{t=1}^{T} α_t f_t(x, y).   (2.5)

Note that F is a real-valued function; it is no longer an indicator function. By using the combined decision function, a voting classifier for a set of classifiers h_t(x) (t = 1, …, T) is defined as

H(x) = argmax_{y∈𝒴} F(x, y).   (2.6)
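The correspondence between set-valued classifiers, decision functions, and the voting classifier (equations 2.2, 2.5, and 2.6) can be sketched directly. In the following minimal Python sketch, the label set, the two classifiers, and the confidence rates are hypothetical examples, not taken from the text:

```python
# Sketch of decision functions (eq. 2.2), the combined decision
# function (eq. 2.5), and the voting classifier (eq. 2.6).
# The label set and the classifiers h1, h2 are hypothetical examples.

Y = ["a", "b", "c"]  # finite label set

# set-valued classifiers h: x -> subset of Y
h1 = lambda x: {"a", "b"} if x < 0 else {"c"}
h2 = lambda x: {"a"} if x < 1 else {"b", "c"}

def decision(h):
    """f(x, y) = I(y in h(x))."""
    return lambda x, y: 1.0 if y in h(x) else 0.0

fs = [decision(h1), decision(h2)]
alphas = [0.7, 0.3]  # confidence rates (hypothetical)

def F(x, y):
    """Combined decision function F(x, y) = sum_t alpha_t f_t(x, y)."""
    return sum(a * f(x, y) for a, f in zip(alphas, fs))

def H(x):
    """Voting classifier H(x) = argmax_y F(x, y)."""
    return max(Y, key=lambda y: F(x, y))
```

Here F is real-valued even though each f_t is an indicator, and H simply votes with the confidence rates as weights.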
Hence, we follow the diagram in Table 1 for the general discussion of classification problems. Let U be a convex function on ℝ with positive derivative U′, and let us fix a set ℱ of decision functions. Here, ℱ is usually taken to be a set of boosting stumps, decision trees, or neural networks. In this formulation of classifiers and decision functions, the U-Boost algorithm over ℱ is described as follows.

U-Boost Algorithm.

Input: Given n examples {(x_i, y_i); x_i ∈ 𝒳, y_i ∈ 𝒴, i = 1, …, n}.

Initialize: D_1(i, y) = 1/(n(|𝒴| − 1)), i = 1, …, n, and F_0(x, y) = 0, where |𝒴| is the cardinality of 𝒴.

Do for t = 1, …, T:

Step 1: Define the error of a decision function f (classifier h) under the distribution D_t as

ε_t(f) = Σ_{i=1}^{n} Σ_{y≠y_i} [(f(x_i, y) − f(x_i, y_i) + 1)/2] D_t(i, y).

Then select a decision function such that

f_t(x, y) = argmin_{f∈ℱ} ε_t(f).

Step 2: Calculate the confidence rate α_t as

α_t = argmin_α Σ_{i=1}^{n} Σ_{y∈𝒴} U(F_{t−1}(x_i, y) − F_{t−1}(x_i, y_i) + α(f_t(x_i, y) − f_t(x_i, y_i))).

Step 3: Update the combined decision function by

F_t(x, y) = F_{t−1}(x, y) + α_t f_t(x, y)

and the distribution D_t by

D_{t+1}(i, y) ∝ U′(F_t(x_i, y) − F_t(x_i, y_i)),

where D_{t+1}(i, y) is normalized so that Σ_{i=1}^{n} Σ_{y≠y_i} D_{t+1}(i, y) = 1.

Output: The final decision by the majority vote as follows:

H(x) = argmax_{y∈𝒴} F_T(x, y) = argmax_{y∈𝒴} Σ_{t=1}^{T} α_t f_t(x, y).
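The three steps above can be sketched end to end. The following is a minimal Python sketch for binary labels 𝒴 = {−1, +1} with U = exp (the AdaBoost case); the toy data set, the pool of decision stumps, and the ternary search used for the step-2 minimization are illustrative choices, not part of the letter's specification.

```python
import math

# Minimal sketch of the U-Boost loop for binary labels Y = {-1, +1}.
# U = exp recovers AdaBoost; the data and the stump pool are hypothetical.
U = math.exp        # U-function
u = math.exp        # its derivative U'

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # toy sample
ys = [-1, -1, 1, 1, 1, -1]
n = len(xs)
Y = [-1, 1]

# weak learners: decision stumps predicting s * sign(x - theta)
stumps = [(theta, s) for theta in (-1.5, -0.75, 0.0, 0.75, 1.5) for s in (-1, 1)]

def f(stump, x, y):
    """Decision function f(x, y) = I(y equals the stump's prediction)."""
    theta, s = stump
    pred = s if x > theta else -s
    return 1.0 if y == pred else 0.0

def eps(stump, D):
    """Step 1: weighted error; for binary labels the only y != y_i is -y_i."""
    return sum((f(stump, xs[i], -ys[i]) - f(stump, xs[i], ys[i]) + 1) / 2 * D[i]
               for i in range(n))

def obj(stump, Fm, a):
    """Step-2 objective as a function of the candidate confidence rate a."""
    return sum(U(Fm[i][y] - Fm[i][ys[i]]
                 + a * (f(stump, xs[i], y) - f(stump, xs[i], ys[i])))
               for i in range(n) for y in Y)

D = [1.0 / n] * n                              # D_1(i, y) = 1/(n(|Y| - 1))
Fm = [{-1: 0.0, 1: 0.0} for _ in range(n)]     # F_0 = 0, tabulated on the sample
model = []

for t in range(5):
    ft = min(stumps, key=lambda st: eps(st, D))        # step 1
    lo, hi = 0.0, 10.0                                 # step 2: ternary search
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if obj(ft, Fm, m1) < obj(ft, Fm, m2):
            hi = m2
        else:
            lo = m1
    alpha = (lo + hi) / 2
    model.append((alpha, ft))
    for i in range(n):                                 # step 3: update F_t ...
        for y in Y:
            Fm[i][y] += alpha * f(ft, xs[i], y)
    w = [u(Fm[i][-ys[i]] - Fm[i][ys[i]]) for i in range(n)]  # ... and D_{t+1}
    D = [wi / sum(w) for wi in w]

def H(x):
    """Output: H(x) = argmax_y sum_t alpha_t f_t(x, y)."""
    return max(Y, key=lambda y: sum(a * f(st, x, y) for a, st in model))
```

Because U = exp here, the first selected confidence rate agrees with the closed form α₁ = (1/2) log((1 − ε₁)/ε₁) of the AdaBoost special case below; swapping in another convex U with positive derivative changes only the step-2 objective and the step-3 reweighting.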
In the U-Boost algorithm, the dependency on U appears only in steps 2 and 3. When the U-function is equal to the exponential function, the above algorithm reduces to the AdaBoost algorithm, and steps 2 and 3 are simplified:

Step 2′: α_t = (1/2) log[(1 − ε_t(f_t))/ε_t(f_t)].

Step 3′: D_{t+1}(i, y) ∝ exp{F_t(x_i, y) − F_t(x_i, y_i)}.

In the above description, the best choice of classifier f_t at time t is given. In practice, we do not have to use the optimally chosen classifier, and in general we can adopt a classifier chosen by some weak learning algorithm, as in the common setup of the boosting method. Also, we can assume ε_t(f) ≤ 1/2 without loss of generality, because when ε_t(f) > 1/2, we can use the reversed output 1 − f instead; then the error becomes ε_t(1 − f) < 1/2. We have a fundamental property of the U-Boost algorithm,

ε_{t+1}(f_t) = 1/2   (∀t = 1, 2, …, T − 1),
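For binary labels, the step-2 objective under U(z) = eᶻ collapses, up to an additive constant and overall scale, to ε e^α + (1 − ε) e^{−α}, where ε is the weighted error of the selected learner, so the closed form of step 2′ can be verified numerically. The error value below is a hypothetical toy number:

```python
import math

# Check that for U(z) = exp(z), minimizing the step-2 objective over alpha
# reproduces the closed form of step 2': alpha = (1/2) log((1 - eps)/eps).
# eps_t is a hypothetical weighted error of the selected weak learner.

eps_t = 0.2

def objective(a):
    # wrong examples contribute weight * exp(+a), correct ones weight * exp(-a)
    return eps_t * math.exp(a) + (1 - eps_t) * math.exp(-a)

# numeric minimization by ternary search on [0, 10]
lo, hi = 0.0, 10.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if objective(m1) < objective(m2):
        hi = m2
    else:
        lo = m1
alpha_numeric = (lo + hi) / 2
alpha_closed = 0.5 * math.log((1 - eps_t) / eps_t)
```

The agreement holds at any round, since the step-3 weights make the current distribution D_t proportional to exp of the accumulated margin differences.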
for any convex function U. It will be discussed in detail and proved in a subsequent section. This means that the reweighting of U-Boost is commonly organized to assign the best selected decision function f_t of the tth stage the least favorable weight distribution over the example set in the (t + 1)th stage. It is remarkable that this error property is common to all U-Boost algorithms. We will present a geometric perspective for the U-Boost algorithm by introducing the Bregman U-divergence. In the discussion, the U-Boost algorithm is seen as giving the sequential minimization algorithm of an empirical loss defined by

L_U(F) = (1/n) Σ_{i=1}^{n} Σ_{y∈𝒴} U(F(x_i, y) − F(x_i, y_i)),

as observed in step 2.

3 Geometrical Structure of U-Boost

The AdaBoost algorithm can be regarded as a procedure of optimizing an exponential loss with an additive model (Friedman et al., 2000):

L(F) = (1/n) Σ_{i=1}^{n} Σ_{y∈𝒴} exp(F(x_i, y) − F(x_i, y_i)),

where

F(x, y) = Σ_{t=1}^{T} α_t f_t(x, y).

By adopting different loss functions, several variations of AdaBoost are proposed, such as LogitBoost (Friedman et al., 2000) and MadaBoost (Domingo & Watanabe, 2000), where the loss functions

L(F) = (1/n) Σ_{i=1}^{n} Σ_{y∈𝒴} φ(F(x_i, y) − F(x_i, y_i)),

LogitBoost: φ(z) = log(1 + exp(z)),
MadaBoost: φ(z) = z + 1/2 for z ≥ 0, and (1/2) exp(2z) otherwise,
are used instead of the exponential loss. In a subsequent discussion, we will focus on robustness to outliers in the feature vector and the class label. It will be shown that MadaBoost is the most robust in the class of logistic consistent methods. The U-Boost algorithm is a natural extension of those based on loss functions. For constructing algorithms, the notion of the loss function is useful, because the various algorithms are derived based on gradient descent and line search methods. Also, the loss function controls the confidence of the decision, which is characterized by the margin (Schapire, Freund, Bartlett, & Lee, 1998). However, statistical properties such as consistency and efficiency are not apparent, because the relationship between loss functions and the distributions realized by voting classifiers has been unclear so far. Lebanon and Lafferty (2001) first pointed out the duality of the AdaBoost algorithm in the space of distributions and discussed its geometrical structure from the viewpoint of linear programming.
In this section, we extend their geometrical consideration by introducing the Bregman divergence. The relationship between the Bregman divergence and boosting has been discussed by Kivinen and Warmuth (1999) and Collins, Schapire, and Singer (2000). Here we define a form of the Bregman divergence that is suited for statistical inference. Then we discuss special subspaces associated with Bregman divergences. Based on the geometrical framework of the Bregman divergence introduced in this section, we will consider the roles of loss functions in boosting algorithms and discuss some statistical properties of the algorithms.

3.1 Space of Finite Measures and Bregman Divergence. Let us consider the space of all the positive finite measures over 𝒴 conditioned by x ∈ 𝒳,

M = { m(y|x) | Σ_{y∈𝒴} m(y|x) < ∞ (a.e. x) },   (3.1)

and the conditional probability densities as its subspace,

P = { m(y|x) | Σ_{y∈𝒴} m(y|x) = 1 (a.e. x) }.   (3.2)

For given examples {(x_i, y_i); i = 1, …, n}, let

p̃(y|x) = I(y = y_i) if x = x_i, and 1/|𝒴| otherwise   (3.3)

be the empirical conditional probability density of y for given x, where |𝒴| is the cardinality of 𝒴. Here we assume the consistent data assumption (Lebanon & Lafferty, 2001), where a unique label y_i is given to each input x_i. If multiple labels are given to an input x_i, we can use

p̃(y|x) = [Σ_{i=1}^{n} I(x = x_i, y = y_i)] / [Σ_{i=1}^{n} I(x = x_i)] if x = x_i, and 1/|𝒴| otherwise,

and the discussion in this article can be extended in a straightforward manner. We also define the empirical marginal density of x by

μ̃(x) = (1/n) Σ_{i=1}^{n} δ(x − x_i),   (3.4)

where δ(x) is the delta function that represents a point mass on the origin. The Bregman divergence is a pseudo-distance for measuring the discrepancy between two functions. We define the Bregman divergence between two conditional measures as follows.
Definition 1 (Bregman divergence). Let U be a strictly convex function on ℝ; then its derivative u = U′ is a monotone function, which has the inverse function ξ = u^{−1}. For p(y|x) and q(y|x) in M, let us define the Bregman cross-entropy under the marginal density μ(x) with respect to U by

H_U(p, q; μ) = ∫_𝒳 Σ_{y∈𝒴} {U(ξ(q(y|x))) − p(y|x) ξ(q(y|x))} μ(x) dx,   (3.5)

and the Bregman divergence from p to q is defined by

D_U(p, q; μ) = H_U(p, q; μ) − H_U(p, p; μ).   (3.6)

In the following, the Bregman divergence D_U(p, q; μ) associated with the convex function U is called the U-divergence for short, and the entropy functions H_U(p, q; μ) and H_U(p, p; μ) are referred to as the U-cross-entropy and the U-entropy, respectively. For notational simplicity, if the context is clear, we omit x and y from functions and abbreviate μ(x)dx to dμ, for example,

D_U(p, q; μ) = ∫ Σ [{U(ξ(q)) − U(ξ(p))} − p{ξ(q) − ξ(p)}] dμ.

Also, we drop μ from H_U and D_U if there is no ambiguity, writing H_U(p, q) and D_U(p, q). A popular form of the Bregman divergence (see, e.g., Kivinen & Warmuth, 1999; Collins et al., 2000) is

D_U(f, g) = ∫ d(f(z), g(z)) dν(z),

where f and g are one-dimensional real-valued functions of z, ν(z) is a certain measure on z, and d is the difference at g between the function U and the tangent line at (f, U(f)):

d(f, g) = U(g) − {u(f)(g − f) + U(f)}.   (3.7)

In the definition (see equation 3.6), densities are first mapped by ξ; then the form 3.7 is applied, and the meaning of d is easily understood from Figure 1. Note that from the convexity of U, d is obviously nonnegative; therefore, the Bregman divergence is nonnegative, and D_U(p, q) = 0 holds if and only if p(y|x) = q(y|x) (a.e. x). Also note that the Bregman divergence is in general not symmetric with respect to p and q, as is easily seen; therefore, it is not a distance. It is also closely related with the potential duality. Let us define the dual function of U by the Legendre transformation,

U*(ζ) = sup_θ {ζθ − U(θ)}.
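The pointwise gap of equation 3.7 is easy to compute and check numerically. The sketch below uses the exponential and square-type U-functions (introduced in example 1 below) at a few arbitrary test points; by convexity every gap is nonnegative, and for the square type d(f, g) reduces to (1/2)(g − f)²:

```python
import math

# Pointwise Bregman gap d(f, g) = U(g) - {u(f)(g - f) + U(f)}:
# the height of U at g above the tangent line drawn at (f, U(f)).
# The two U-functions come from example 1; the test points are arbitrary.

def d(f, g, U, u):
    return U(g) - (u(f) * (g - f) + U(f))

U_exp, u_exp = math.exp, math.exp                 # exponential type
U_sq = lambda z: 0.5 * (z + 1) ** 2               # square type (beta = 1)
u_sq = lambda z: z + 1

points = [(-1.3, 0.4), (0.0, 0.0), (2.0, -0.5)]
gaps_exp = [d(f, g, U_exp, u_exp) for f, g in points]
gaps_sq = [d(f, g, U_sq, u_sq) for f, g in points]
```

For the square type, expanding the definition gives d(f, g) = (1/2)(g − f)², which is the familiar squared-distance special case of the Bregman gap.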
Figure 1: Bregman divergence. U: ℝ → ℝ is convex, u = U′, and ξ = (U′)^{−1}; note that u(ξ(p)) = p. The gap d(ξ(p), ξ(q)) is the height of U(ξ(q)) above the tangent line p(ξ(q) − ξ(p)) + U(ξ(p)) drawn at ξ(p).
Then d is written with U and U* simply as

d(f, g) = U*(u(f)) + U(g) − u(f) g.

The advantage of the form 3.6 is that it allows us to plug in the empirical distribution directly. Since the Bregman divergence is nonnegative, that is, D_U(p, q) ≥ 0, the U-cross-entropy is bounded below by the U-entropy as

H_U(p, q) ≥ H_U(p, p).

Now let us consider a problem in which q is optimized with respect to D_U(p, q) for fixed p. Neglecting the terms that do not depend on q, the problem is simplified as

argmin_q D_U(p, q) = argmin_q H_U(p, q).   (3.8)

In H_U(p, q), the distribution p appears only for taking the expectation of ξ(q), and therefore the empirical distributions p̃ and μ̃ can be used without any difficulty as

H_U(p̃, q; μ̃) = (1/n) Σ_{i=1}^{n} [ Σ_{y∈𝒴} U(ξ(q(y|x_i))) − ξ(q(y_i|x_i)) ].   (3.9)
Hence, the optimal distribution for these examples with respect to the U-divergence is written as

q̃ = argmin_q H_U(p̃, q; μ̃).

This is equivalent to the well-known relationship between maximum likelihood estimation and the minimization of the Kullback-Leibler (KL) divergence. Related discussions can be found in Eguchi and Kano (2001), in which the divergences are derived based on the pseudo-likelihood.

Example 1 (U-functions). The following are important examples of the convex function U (see Figure 2):

Exponential (Kullback-Leibler divergence):
U(z) = exp(z),  u(z) = exp(z),  ξ(z) = log(z).

β-type (β-divergence):
U(z) = (βz + 1)^{(β+1)/β} / (β + 1),  u(z) = (βz + 1)^{1/β},  ξ(z) = (z^β − 1)/β.

Square-type (β = 1):
U(z) = (1/2)(z + 1)²,  u(z) = z + 1,  ξ(z) = z − 1.

η-type (η-divergence):
U(z) = (1 − η) exp(z) + ηz,  u(z) = (1 − η) exp(z) + η,  ξ(z) = log[(z − η)/(1 − η)].

MadaBoost type:
U(z) = z + 1/2 for z ≥ 0, and (1/2) exp(2z) for z < 0;
u(z) = 1 for z ≥ 0, and exp(2z) for z < 0;
ξ(z) = (1/2) log(z)  (z ≤ 1).
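The triples (U, u, ξ) of example 1 can be checked for internal consistency: u should match the derivative of U, and ξ should invert u, that is, u(ξ(z)) = z. The sketch below verifies both numerically for the strictly convex cases (MadaBoost is omitted, since its U is not strictly convex); the parameter values β = 0.5 and η = 0.2 and the test points are arbitrary choices:

```python
import math

# Consistency checks for the (U, u, xi) triples of example 1.
# beta = 0.5, eta = 0.2 and the test points are arbitrary; the test
# points exceed eta so the eta-type xi(z) = log((z - eta)/(1 - eta)) is defined.

beta, eta = 0.5, 0.2

triples = {
    "exponential": (math.exp, math.exp, math.log),
    "beta-type": (lambda z: (beta * z + 1) ** ((beta + 1) / beta) / (beta + 1),
                  lambda z: (beta * z + 1) ** (1 / beta),
                  lambda z: (z ** beta - 1) / beta),
    "square-type": (lambda z: 0.5 * (z + 1) ** 2,
                    lambda z: z + 1,
                    lambda z: z - 1),
    "eta-type": (lambda z: (1 - eta) * math.exp(z) + eta * z,
                 lambda z: (1 - eta) * math.exp(z) + eta,
                 lambda z: math.log((z - eta) / (1 - eta))),
}

def num_deriv(F, z, h=1e-6):
    # central difference approximation of F'(z)
    return (F(z + h) - F(z - h)) / (2 * h)

max_deriv_err = max(abs(num_deriv(U, z) - u(z))
                    for U, u, xi in triples.values() for z in (0.5, 0.9, 1.4))
max_inverse_err = max(abs(u(xi(z)) - z)
                      for U, u, xi in triples.values() for z in (0.5, 0.9, 1.4))
```

Both errors should be at the level of numerical round-off, confirming u = U′ and ξ = u^{−1} for each listed family.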
Note that the MadaBoost U-function is not strictly convex, and hence ξ(z) is not well defined for z > 1. Although it is peculiar as a U-function, it performs an important role in discussing robustness.

Figure 2: Examples of U-functions, shown for KL (AdaBoost), β-type (β = 0.5), η-type (η = 0.2), and MadaBoost. (a) Shapes of U-functions U(z). (b) Derivatives of U-functions, u(z).

The divergence of β-type has been employed for independent component analysis from the viewpoint of robustness, while the divergence of η-type is shown to improve AdaBoost in the case of mislabeling (see Minami & Eguchi, 2002; Takenouchi & Eguchi, 2004). Note that β-divergence and η-divergence coincide with the KL divergence when β → 0 and η → 0, respectively. There are several proposals for statistical divergences, such as the f-divergence and the α-divergence (Amari, 1985). The class of f-divergences is totally different from that of U-divergences except for the KL divergence. See Figure 3 for a schematic map. Also note that the U-entropy H_U is a generalization of the Shannon entropy. For example, the Tsallis entropy is derived from the β-type U-function (Amari & Nagaoka, 2000).

3.2 Pythagorean Relation and Orthogonal Foliation. Let us define the inner product of functions of x ∈ 𝒳 and y ∈ 𝒴 under μ(x) by

⟨f, g⟩_μ = ∫_𝒳 Σ_{y∈𝒴} f(x, y) g(x, y) μ(x) dx,

and define that f and g are orthogonal if ⟨f, g⟩_μ = 0. Then the Pythagorean relation for the Bregman divergence is stated as follows (see Figure 4):

Theorem 1 (Pythagorean relation). Let p, q, and r be in M. If p − q and ξ(r) − ξ(q) are orthogonal under μ, the relation

D_U(p, r; μ) = D_U(p, q; μ) + D_U(q, r; μ)   (3.10)

holds.
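Theorem 1 can be checked numerically. The sketch below uses the exponential U (ξ = log), a three-element label set with a single x (so μ integrates out), and an arbitrary direction f with a perturbation w chosen orthogonal to it; all the specific numbers are hypothetical:

```python
import math

# Numerical check of the Pythagorean relation (theorem 1) for U = exp.
# q, the direction f, and the orthogonal perturbation w are arbitrary
# choices with <p - q, f> = 0 and xi(r) - xi(q) = 0.3 f.

def D(p, q):
    # U-divergence for U = exp (xi = log): sum p log(p/q) - p + q,
    # the generalized KL divergence on positive measures
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

q = [0.5, 0.3, 0.2]
f = [1.0, -1.0, 0.5]
r = [qi * math.exp(0.3 * fi) for qi, fi in zip(q, f)]   # xi(r) = xi(q) + 0.3 f
w = [0.5, 0.5, 0.0]                                     # <w, f> = 0
p = [qi + 0.2 * wi for qi, wi in zip(q, w)]             # p - q orthogonal to f

lhs = D(p, r)
rhs = D(p, q) + D(q, r)
```

The identity holds exactly because, as in the proof below, D(p, r) − D(p, q) − D(q, r) = ⟨p − q, ξ(q) − ξ(r)⟩, which vanishes under the orthogonality of p − q and ξ(r) − ξ(q).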
Figure 3: Schematic picture of statistical divergences: the class of f-divergences (including the α-divergence) and the class of Bregman U-divergences (including the β-type and η-type) intersect at the KL divergence.
Figure 4: Pythagorean relation for Bregman divergence.
Proof. For any conditional measures p, q, and r,

D_U(p, r; μ) − D_U(p, q; μ) − D_U(q, r; μ)
= ∫_𝒳 Σ_{y∈𝒴} (q(y|x) − p(y|x))(ξ(r(y|x)) − ξ(q(y|x))) μ(x) dx
= ⟨p − q, ξ(q) − ξ(r)⟩_μ   (3.11)

holds by definition. From the orthogonality of p − q and ξ(r) − ξ(q), the right-hand side of equation 3.11 vanishes, and this proves the relation.

A special case of theorem 1 associated with the KL divergence is given in Amari (1985). Note that in theorem 1, the orthogonality is defined between p − q and ξ(r) − ξ(q). The form ξ(q) is rewritten as q(y|x) = u(ξ(q(y|x))), and ξ(q) is called the U-representation of q. In the following discussion, the U-representation plays a key part. Now we consider subspaces feasible for the nature of the Bregman divergence.

Definition 2 (U-flat subspace). With a fixed q_0 ∈ M and a set of decision functions ℱ = {f_λ(x, y); λ ∈ Λ}, where Λ is a certain finite set of indices, a set of conditional measures written in the form

Q_U(q_0, ℱ) = { q ∈ M | q(y|x) = u( ξ(q_0(y|x)) + Σ_{λ∈Λ} α_λ f_λ(x, y) ), α_λ ∈ ℝ }   (3.12)

is called a U-flat subspace. In other words, Q_U consists of functions such that

ξ(q) − ξ(q_0) = Σ_{λ∈Λ} α_λ f_λ(x, y),

which means Q_U is a subspace in M including q_0 and spanned by ℱ. Therefore, the dimension of the subspace Q_U in the space M is |Λ|, the cardinality of Λ. We remark that ℱ can consist not only of decision functions but of arbitrary functions on 𝒳 × 𝒴 in general; the flat structure and the geometrical properties discussed later hold for such general functions. Because of the linear structure of the U-representation, an important property of Q_U follows. Let q_1 and q_2 be in Q_U; then for any numbers β_1 and β_2, any linear combination of the U-representations of q_1 and q_2 belongs to Q_U again; that is,

u( ξ(q_0) + β_1(ξ(q_1) − ξ(q_0)) + β_2(ξ(q_2) − ξ(q_0)) ) ∈ Q_U

holds; therefore, we call Q_U(q_0, ℱ) a U-flat subspace. In this relation, ξ plays the same role as the logarithm for the exponential family in statistics, which is called an e-flat subspace in information geometry. Next let us consider another flat subspace in M.
Definition 3 (m-flat subspace). A subspace in M that passes through a point p_0 ∈ M and is orthogonal to Q_U, defined by

T(p_0; μ, ℱ) = { p ∈ M | ∫_𝒳 Σ_{y∈𝒴} (p(y|x) − p_0(y|x)) f_λ(x, y) μ(x) dx = 0, ∀λ }
             = { p ∈ M | ⟨p − p_0, f_λ⟩_μ = 0, ∀λ },   (3.13)

is called an m-flat subspace. In the following, we omit p_0, μ, and ℱ from T depending on the context. An important property of T is again its flat structure. For any p_1 and p_2 in T(p_0), and for any positive numbers β_1 and β_2, we observe

p_0 + β_1(p_1 − p_0) + β_2(p_2 − p_0) ∈ T(p_0),

because of the linearity of the inner product,

β_1⟨p_1 − p_0, f_λ⟩_μ + β_2⟨p_2 − p_0, f_λ⟩_μ = 0, ∀λ.

Hence, this structure is called mixture flat (m-flat). Note that the codimension of the subspace T in the space M is |Λ| by its definition. As a special case, let us take q* ∈ Q_U and consider an m-flat subspace T(q*) at q* ∈ Q_U. By the above definitions, Q_U and T(q*) are orthogonal at q*, and

⟨p − q*, ξ(q) − ξ(q*)⟩_μ = 0, ∀p ∈ T(q*), ∀q ∈ Q_U

holds (see Figure 5). The set of m-flat subspaces at all the points in Q_U, {T(q); q ∈ Q_U}, covers the whole space M as

∪_{q∈Q_U} T(q) = M,   T(q) ∩ T(q′) = ∅ if q ≠ q′.

The set {T(q); q ∈ Q_U} is called an orthogonal foliation of Q_U in M, and each subspace T(q) is called a leaf; that is, M is decomposed into leaves orthogonal to Q_U (see Figure 6). Based on the notion of the orthogonal foliation, we can prove the following theorem (see Figure 7).

Theorem 2. Two optimization problems,

minimize D_U(p, q_0; μ) with respect to p ∈ T(p_0) for fixed q_0,   (3.14)
Figure 5: Geometrical structure of Q and T .
Figure 6: Orthogonal foliation derived from the Bregman divergence.
Figure 7: A geometrical interpretation of the dual optimization problems in the U-Boost algorithm.
and

minimize D_U(p_0, q; μ) with respect to q ∈ Q_U(q_0) for fixed p_0,   (3.15)

give the same solution:

q* = argmin_{p∈T} D_U(p, q_0; μ) = argmin_{q∈Q_U} D_U(p_0, q; μ),   (3.16)
which is the intersection of Q_U and T.

Proof. First, we should note that the two subspaces Q_U(q_0) and T(p_0) intersect each other at one point q*,

{q*} = T(p_0) ∩ Q_U(q_0),

because of the relationship between dim Q_U and codim T. From the Pythagorean relation of the U-divergence, for any p ∈ T(p_0) and q ∈ Q_U(q_0),

D_U(p, q) = D_U(p, q*) + D_U(q*, q)

holds. As a consequence,

D_U(p, q_0) = D_U(p, q*) + D_U(q*, q_0) for any p ∈ T   (3.17)

holds; therefore, we observe

D_U(p, q_0) ≥ D_U(q*, q_0) for any p ∈ T,

because of the positivity of the U-divergence, D_U(p, q*) ≥ 0. This means the point q* ∈ T is the closest to q_0 in T. In the same way, we observe

D_U(p_0, q) ≥ D_U(p_0, q*) for any q ∈ Q_U,

which shows that q* is the closest to p_0 in Q_U. These prove the statement of the theorem.

As we discuss later, the one-dimensional U-flat subspace from q_0 to q* is orthogonal to the m-flat subspace T(p_0), and hence q* is said to be the U-projection of q_0 to T(p_0). Simultaneously, q* is the m-projection of p_0 to Q (cf. Amari & Nagaoka, 2000).

3.3 U-Models for Classification. Our purpose in the classification problem is to estimate an element in a certain set of conditional measures that minimizes the U-divergence from the empirical distribution constructed from the given examples,

q* = argmin_q H_U(p̃, q; μ̃).
In this section, we consider a special subspace, the U-model, produced from a set of decision functions ℱ = {f_t(x, y); t = 1, …, T} in the form

Q_U(q_0, ℱ; b) = { q ∈ M | q(y|x) = u( ξ(q_0(y|x)) + Σ_{t=1}^{T} α_t f_t(x, y) − b(x, α) ) },   (3.18)

where α = (α_t ∈ ℝ; t = 1, …, T) and b is an auxiliary function of x and α, which imposes an additional condition on Q_U. Note that in the following, α is used as a parameter for specifying an element of Q_U. Intuitively speaking, Q_U is spanned by the basis ℱ in terms of the U-representation, and b is introduced to overcome the inconvenience caused by the fact that the basis is restricted to decision functions. For a conditional measure q in this subspace Q_U, the empirical U-cross-entropy is simply written as

L_U^{emp}(q) = H_U(p̃, q; μ̃) = (1/n) Σ_{i=1}^{n} [ Σ_{y∈𝒴} U(ξ(q(y|x_i))) − ξ(q(y_i|x_i)) ],
where

ξ(q(y|x)) = ξ(q_0(y|x)) + Σ_{t=1}^{T} α_t f_t(x, y) − b(x, α),

which we call the empirical U-loss, or simply the U-loss. Taking account of the classification rule invariance discussed in the next section, the function b constraining the U-model must be determined based on computational convenience or statistical validity. From a statistical point of view, we consider two specific cases, an empirical model and a normalized model, in the following sections.

3.3.1 Invariance of Classification Rule. Let us use the U-representation for denoting a conditional measure as ξ(q(y|x)), or equivalently q(y|x) = u(ξ(q(y|x))); ξ(q) is referred to later as a discriminate function. For the classification task, we use the Bayes rule, where the estimate of the corresponding label for a given input x is given by the maximum value of q(y|x), which is realized by using the discriminate function as

ŷ = argmax_{y∈𝒴} ξ(q(y|x)) = argmax_{y∈𝒴} q(y|x),

because ξ is monotonic. There are two equivalent conditions for discriminate functions for the Bayes rule.

Shift invariance. Let b(x) be an arbitrary function of x. The classification rule associated with u(ξ(q) − b) is equivalent to that with q,

argmax_{y∈𝒴} ξ(q(y|x)) = argmax_{y∈𝒴} {ξ(q(y|x)) − b(x)}.

Scale invariance. Let c(x) be an arbitrary positive function of x. The classification rule associated with c(x)q(y|x) is equivalent to that with q(y|x),

argmax_{y∈𝒴} ξ(c(x)q(y|x)) = argmax_{y∈𝒴} ξ(q(y|x)),

because of the monotonicity of ξ. These invariance properties are closely related to obtaining a probability density p ∈ P from a measure m ∈ M by U-projection and m-projection, respectively. In the next two sections, we focus on the quantity defined by

Δ(q, q*) = D_U(p, q) − D_U(p, q*) − D_U(q*, q)
         = ∫_𝒳 Σ_{y∈𝒴} (ξ(q(y|x)) − ξ(q*(y|x)))(q*(y|x) − p(y|x)) μ(x) dx
         = ⟨ξ(q) − ξ(q*), q* − p⟩_μ,   (3.19)
for a fixed conditional density p(y|x) ∈ P and a density μ(x). By using the above invariance, we consider the conditions under which Δ(q, q*) vanishes in order to utilize the Pythagorean relation.

3.3.2 Empirical U-Model. Suppose q*(y|x) = c(x)p(y|x) or, equivalently, ξ(q*(y|x)) = ξ(c(x)p(y|x)), which implies that the classification rule associated with ξ(q*) is equivalent to the Bayes rule for p by scale invariance. Then for any measure q, we observe

Δ(q, q*) = ∫_𝒳 (c(x) − 1) Σ_{y∈𝒴} (ξ(q(y|x)) − ξ(q*(y|x))) p(y|x) μ(x) dx.

Let D_m be a set of discriminate functions with the same conditional expectation under p,

D_m = { q ∈ M | Σ_{y∈𝒴} ξ(q(y|x)) p(y|x) = Σ_{y∈𝒴} ξ(q*(y|x)) p(y|x) (a.e. x) }.

Then Δ(q, q*) = 0 for any q ∈ D_m, and hence the Pythagorean relation

D_U(p, q) = D_U(p, q*) + D_U(q*, q)

holds. This means that q* gives the minimum of D_U(p, q) among q ∈ D_m,

q* = argmin_{q∈D_m} D_U(p, q).

In other words, D_U chooses the Bayes optimal rule for p in D_m, which consists of discriminate functions with the same conditional expectation. From the shift invariance property, ξ(q) − b gives the same classification rule as ξ(q); therefore, by adopting an appropriate b, we can introduce a simple constraint on D_m as

D_m = { q ∈ M | Σ_{y∈𝒴} ξ(q(y|x)) p(y|x) = Σ_{y∈𝒴} ξ(q_0(y|x)) p(y|x) },

where

ξ(q(y|x)) = ξ(q_0(y|x)) + Σ_{t=1}^{T} α_t f_t(x, y) − b(x, α).

In this case, b is simply given by

b(x, α) = Σ_{t=1}^{T} Σ_{y∈𝒴} α_t f_t(x, y) p(y|x).

Under the empirical distributions μ̃ and p̃, the above condition for D_m is satisfied by taking

b(x, α) = Σ_{t=1}^{T} α_t f̃_t(x),
where f̃_t is the conditional expectation under p̃:

f̃_t(x) = Σ_{y∈𝒴} p̃(y|x) f_t(x, y) = f_t(x_i, y_i) if x = x_i, and (1/|𝒴|) Σ_{y∈𝒴} f_t(x, y) otherwise.

Then we define the empirical U-model depending on the empirical conditional density p̃ by

Q_U^{emp}(q_0, ℱ) = { q ∈ M | ξ(q(y|x)) = ξ(q_0(y|x)) + Σ_{t=1}^{T} α_t (f_t(x, y) − f̃_t(x)) }.   (3.20)

Note that the empirical U-model depends on p̃ and has a U-flat structure by its definition,

Q_U^{emp} = Q_U(q_0, {f_t − f̃_t; t = 1, …, T}).

Hence, the geometrical properties discussed in the previous section hold between Q_U^{emp} and its orthogonal m-flat subspaces,

T(q) = { p ∈ M | ⟨p − q, f_t − f̃_t⟩_μ̃ = 0, ∀t },  q ∈ Q_U^{emp},   (3.21)

under the empirical distribution μ̃(x). For the empirical U-model, the U-loss for q ∈ Q_U^{emp} under p̃ and μ̃ is written as

L_U(q) = (1/n) Σ_{i=1}^{n} [ Σ_{y∈𝒴} U( ξ(q_0(y|x_i)) + Σ_{t=1}^{T} α_t (f_t(x_i, y) − f_t(x_i, y_i)) ) − ξ(q_0(y_i|x_i)) ].   (3.22)

As a special case in which ξ(q_0) = 0, L_U(q) depends only on the combined decision function F,

F(x, y) = Σ_{t=1}^{T} α_t f_t(x, y),

and in this case, as stated in section 2, the U-loss is denoted by

L_U^{emp}(F) = (1/n) Σ_{i=1}^{n} Σ_{y∈𝒴} U(F(x_i, y) − F(x_i, y_i)).   (3.23)
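The empirical U-loss of equation 3.23 makes the robustness contrast discussed earlier concrete: a single example with a large positive difference F(x_i, y) − F(x_i, y_i) (a badly violated margin, as produced by an outlier or a mislabel) dominates the exponential loss but enters the MadaBoost loss only linearly. The margin-difference values below are hypothetical:

```python
import math

# Empirical U-loss (eq. 3.23) for two U-functions on hypothetical
# margin differences F(x_i, y) - F(x_i, y_i): the exponential U
# penalizes a single outlier far more heavily than the MadaBoost U.

def U_exp(z):
    return math.exp(z)

def U_mada(z):
    return z + 0.5 if z >= 0 else 0.5 * math.exp(2 * z)

# hypothetical values of F(x_i, y) - F(x_i, y_i) for the wrong labels;
# the one large positive value plays the role of an outlier
diffs = [-2.0, -1.5, -1.0, -2.5, 6.0]

loss_exp = sum(U_exp(z) for z in diffs) / len(diffs)
loss_mada = sum(U_mada(z) for z in diffs) / len(diffs)
```

Since the step-3 reweighting is proportional to U′, the same contrast governs how much weight each algorithm concentrates on such an example in the next round.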
3.3.3 Normalized U-Model. Next let us consider a set of discriminate functions that are shifted from a fixed function ξ(q_0(y|x)) by arbitrary functions b(x),

D_s = { q ∈ M | ξ(q(y|x)) = ξ(q_0(y|x)) − b(x) }.

By shift invariance, the classification rule associated with any q ∈ D_s is equivalent to that with q_0. Also, for any ξ(q) = ξ(q_0) − b and ξ(q*) = ξ(q_0) − b*,

Δ(q, q*) = ∫_𝒳 (b(x) − b*(x)) Σ_{y∈𝒴} (q*(y|x) − p(y|x)) μ(x) dx

holds. Therefore, if q* ∈ P, that is, Σ_{y∈𝒴} q*(y|x) = 1 (a.e. x), then Δ = 0 and the Pythagorean relation

D_U(p, q) = D_U(p, q*) + D_U(q*, q)

holds. This means that q* is the closest point from p in D_s, all the elements of which give the same classification rule:

q* = argmin_{q∈D_s} D_U(p, q).

As a consequence, the minimization of D_U(p, q) in D_s is equivalent to introducing the normalizing factor, so that

Σ_{y∈𝒴} q*(y|x) = 1.

Namely, q* is restricted within the set of conditional probability densities P. We define the normalized U-model by

Q_U^{norm}(q_0, ℱ) = { q ∈ P | ξ(q(y|x)) = ξ(q_0(y|x)) + Σ_{t=1}^{T} α_t f_t(x, y) − φ(x, α) },   (3.24)

where φ is the normalization term chosen so that Σ_{y∈𝒴} q(y|x) = 1 is satisfied, which might depend on q_0 and ℱ implicitly. Note that because of the normalization term φ, Q_U^{norm} is not a U-flat subspace of M. However, the Pythagorean relation for the normalized U-model holds by restricting to P, as follows. First, let us define the orthogonal m-flat subspace in M at q_α ∈ Q_U^{norm} by

T(q_α) = { p ∈ M | ⟨p − q_α, f_t − φ′_t(α)⟩ = 0, ∀t },
1458
N. Murata, T. Takenouchi, T. Kanamori, and S. Eguchi
Figure 8: Orthogonal condition of the normalized U-model.
where the subscript $\alpha$ of $q_\alpha$ is a parameter specifying an element of $Q_U$ and
$$\phi_t'(x,\alpha) = \frac{\partial \phi(x,\alpha)}{\partial \alpha_t}.$$
In this case, the orthogonality of $Q_U^{\mathrm{norm}}$ and $T(q_\alpha)$ is locally defined with the tangent of $Q_U^{\mathrm{norm}}$ at $q_\alpha$ (see Figure 8):
$$\Bigl\langle p - q_\alpha,\; \frac{\partial}{\partial\alpha_t}\xi(q_\alpha) \Bigr\rangle_\mu = \langle p - q_\alpha,\; f_t - \phi_t'(\alpha) \rangle_\mu = 0.$$
Knowing that
$$\sum_{y\in\mathcal Y} (p(y|x) - q(y|x)) = 1 - 1 = 0, \quad \forall p, q \in \mathcal P,$$
and that for any function $g$ that does not depend on $y$,
$$\langle p - q,\; g \rangle_\mu = \int_{\mathcal X} \sum_{y\in\mathcal Y} (p(y|x) - q(y|x))\,g(x)\,\mu(x)\,dx = 0, \quad \forall p, q \in \mathcal P,$$
we observe that for $p \in \mathcal P$,
$$\langle p - q_\alpha,\; f_t - \phi_t'(\alpha) \rangle_\mu = \langle p - q_\alpha,\; f_t \rangle_\mu, \quad \forall t,$$
and hence the orthogonal m-flat subspace in $\mathcal P$ is written as
$$T(q) \cap \mathcal P = \{\, p \in \mathcal P \mid \langle p - q,\; f_t \rangle_\mu = 0,\ \forall t \,\}.$$
Let $q, r \in Q_U^{\mathrm{norm}}$ be given by
$$\xi(q) = \xi(q_0) + \sum_{t=1}^{T}\alpha_t f_t(x,y) - \phi(x,\alpha), \qquad \xi(r) = \xi(q_0) + \sum_{t=1}^{T}\alpha_t' f_t(x,y) - \phi(x,\alpha').$$
Then for any $p \in T(q)\cap\mathcal P$, we observe
$$\langle p - q,\; \xi(q) - \xi(r) \rangle_\mu = \Bigl\langle p - q,\; \sum_{t=1}^{T}(\alpha_t - \alpha_t') f_t - (\phi(\alpha) - \phi(\alpha')) \Bigr\rangle_\mu = 0.$$
Therefore, the Pythagorean relation
$$D_U(p, r) = D_U(p, q) + D_U(q, r)$$
holds for the normalized U-model in $\mathcal P$. In this case, the empirical U-loss becomes
$$L_U(q) = \frac{1}{n}\sum_{i=1}^{n}\Biggl[\,\sum_{y\in\mathcal Y} U\Bigl(\xi(q_0(y|x_i)) + \sum_{t=1}^{T}\alpha_t f_t(x_i,y) - \phi(x_i,\alpha)\Bigr) - \Bigl\{\xi(q_0(y_i|x_i)) + \sum_{t=1}^{T}\alpha_t f_t(x_i,y_i) - \phi(x_i,\alpha)\Bigr\}\Biggr]. \qquad (3.25)$$
In the case of $\xi(q_0) = 0$, $L_U(q)$ is written in terms of $F$ and $\phi$, and $\phi$ is determined from $F$. Therefore, we denote it by
$$L_U^{\mathrm{norm}}(F) = \frac{1}{n}\sum_{i=1}^{n}\Biggl[\,\sum_{y\in\mathcal Y} U(F(x_i,y) - \phi(x_i,\alpha)) - \{F(x_i,y_i) - \phi(x_i,\alpha)\}\Biggr]. \qquad (3.26)$$
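For the exponential U-function, where the normalization term has the closed form $\phi(x) = \log\sum_y \exp F(x,y)$, the normalized U-loss of equation 3.26 reduces (up to a constant) to the multiclass logistic log-loss. The sketch below is our own illustration of this special case; the function name is hypothetical.

```python
import numpy as np

def normalized_U_loss_exp(F, y_true):
    """Equation 3.26 for U(z) = exp(z), where phi(x) = log sum_y exp(F(x,y))
    makes q(y|x) a probability; the first term then sums to 1 exactly,
    so the loss equals 1 + mean negative log-likelihood.

    F : (n, K) array of combined decision-function values, y_true : (n,) labels.
    """
    n = F.shape[0]
    phi = np.log(np.exp(F).sum(axis=1))           # phi(x_i, alpha)
    inner = np.exp(F - phi[:, None]).sum(axis=1)  # = 1 by construction
    return np.mean(inner - (F[np.arange(n), y_true] - phi))

F = np.array([[1.0, -1.0], [0.0, 2.0]])
y = np.array([0, 1])
loss = normalized_U_loss_exp(F, y)
```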
3.4 Geometrical Understanding of U-Boost. In the following, we treat the algorithm for constructing the combined decision function sequentially and give a unified description of the U-Boost algorithm from a geometrical point of view. We consider a one-dimensional expansion of the U-model given by
$$Q_U(q_{t-1}, f_t; b_t) = \{\, q \mid \xi(q(y|x)) = \xi(q_{t-1}(y|x)) + \alpha f_t(x,y) - b_t(x,\alpha),\ \alpha \in \mathbb R \,\}, \qquad (3.27)$$
where $b_t$ is the auxiliary function for the U-model,
$$b_t(x,\alpha) = \begin{cases} \alpha \tilde f_t(x), & \text{(empirical model)}, \\ \phi_t(x,\alpha), & \text{(normalized model)}, \end{cases} \qquad (3.28)$$
and its derivative at $\alpha = 0$ is
$$b_t'(x) = \begin{cases} \tilde f_t(x), & \text{(empirical model)}, \\ \phi_t'(x,\alpha)\big|_{\alpha=0}, & \text{(normalized model)}. \end{cases} \qquad (3.29)$$
Geometrical Description of the U-Boost Algorithm.

Input: Given $n$ examples $\{(x_i, y_i);\ x_i \in \mathcal X,\ y_i \in \mathcal Y,\ i = 1,\dots,n\}$.

Initialize: $q_0(y|x)$. (In the usual case, set $\xi(q_0) = 0$ for simplicity.)

Do for $t = 1, \dots, T$:

Step 1: Select a classifier $h_t$, based on the corresponding decision function $f_t$, so that $f_t - b_t'$ points in the direction of $\tilde p - q_{t-1}$ as much as possible, that is,
$$\text{maximize } \langle \tilde p - q_{t-1},\; f_t - b_t' \rangle_{\tilde\mu}.$$

Step 2: Construct a one-dimensional U-model $Q_t = Q_U(q_{t-1}, f_t; b_t)$ by equation 3.27. Then find $\alpha_t$, which minimizes $D_U(\tilde p, q; \tilde\mu)$, namely,
$$\alpha_t = \operatorname*{argmin}_{q \in Q_t} L_U(q) = \operatorname*{argmin}_{q \in Q_t} \sum_{i=1}^{n}\Biggl[\,\sum_{y\in\mathcal Y} U(\xi(q(y|x_i))) - \xi(q(y_i|x_i))\Biggr].$$

Step 3: Update
$$q_t(y|x) = u(\xi(q_{t-1}(y|x)) + \alpha_t f_t(x,y) - b_t(x,\alpha_t)),$$
or equivalently,
$$\xi(q_t(y|x)) = \xi(q_0(y|x)) + \sum_{k=1}^{t} \{\alpha_k f_k(x,y) - b_k(x,\alpha_k)\}.$$

Output: The final decision by the majority vote,
$$H(x) = \operatorname*{argmax}_{y\in\mathcal Y} \sum_{t=1}^{T} \alpha_t f_t(x,y).$$
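The loop above can be sketched for the binary empirical model with $U(z) = \exp(z)$ (i.e., AdaBoost) and one-dimensional threshold stumps. This is our own simplified illustration, not the article's implementation: the step 1 selection is realized as weighted-error minimization (which, by the error rate discussion below, corresponds to the inner-product criterion), and step 2 uses the closed-form line search valid for this U-function.

```python
import numpy as np

def u_boost_binary(x, y, T=10):
    """Sequential U-Boost sketch, binary empirical model with U(z) = exp(z),
    weak classifiers h(x) = s * sign(x - c) (threshold stumps on 1-D inputs).
    x : (n,) inputs, y : (n,) labels in {-1, +1}.
    """
    n = len(x)
    Fmargin = np.zeros(n)                  # -y_i * sum_k alpha_k h_k(x_i)
    stumps = []
    for t in range(T):
        w = np.exp(Fmargin); w /= w.sum()  # weights from the current U-loss
        # Step 1: pick the stump with the smallest weighted error.
        best = None
        for c in np.unique(x):
            for s in (+1.0, -1.0):
                h = s * np.sign(x - c + 1e-12)
                err = w[h != y].sum()
                if best is None or err < best[0]:
                    best = (err, s, c)
        err, s, c = best
        err = np.clip(err, 1e-12, 1 - 1e-12)
        # Step 2: closed-form line search for U = exp.
        alpha = 0.5 * np.log((1 - err) / err)
        # Step 3: update the margins (equivalently, xi(q_t)).
        h = s * np.sign(x - c + 1e-12)
        Fmargin += -y * alpha * h
        stumps.append((alpha, s, c))
    def H(xq):  # Output: majority vote
        return np.sign(sum(a * s * np.sign(xq - c + 1e-12) for a, s, c in stumps))
    return H

x = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([-1, -1, 1, 1, 1])
H = u_boost_binary(x, y, T=5)
```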
A geometrical interpretation of the algorithm is shown in Figure 9. In step 1, a strategy for selecting a decision function is described. Since ft is a base of the next one-dimensional U-model that originates from the
Figure 9: Geometrical interpretation of the optimization procedure of the U-Boost algorithm.
current estimate $q_{t-1}$, the tangent $f_t - b_t'$ should be as parallel as possible to $\tilde p - q_{t-1}$ for the next estimate $q_t$ to come close to $\tilde p$. For the empirical U-model, $\langle \tilde p - q_{t-1},\; f_t - b_t' \rangle_{\tilde\mu}$ is rewritten as
$$\langle \tilde p - q_{t-1},\; f_t - \tilde f_t \rangle_{\tilde\mu} = -\langle q_{t-1},\; f_t - \tilde f_t \rangle_{\tilde\mu},$$
and for the normalized U-model,
$$\langle \tilde p - q_{t-1},\; f_t - \phi_t' \rangle_{\tilde\mu} = \langle \tilde p - q_{t-1},\; f_t \rangle_{\tilde\mu} = \tilde f_t - \langle q_{t-1},\; f_t \rangle_{\tilde\mu} \qquad (\text{because } \tilde f_t = \langle q_{t-1},\, \tilde f_t \rangle_{\tilde\mu})$$
$$= -\langle q_{t-1},\; f_t - \tilde f_t \rangle_{\tilde\mu}.$$
Therefore, for both models, step 1 is equivalent to
$$\text{minimize } \langle q_{t-1},\; f_t - \tilde f_t \rangle_{\tilde\mu}. \qquad (3.30)$$
This translation is related to the error rate property discussed later. The optimization procedure for obtaining $q_t$ in step 2 is geometrically interpreted as the m-projection from $\tilde p$ onto $Q_t$, as shown in Figure 10. For a U-model $Q_t$, we can consider an orthogonal foliation $\{T_t(q)\}$ given by
$$T_t(q) = \{\, p \in M \text{ or } \mathcal P \mid \langle p - q,\; f_t - b_t' \rangle_{\tilde\mu} = 0 \,\}, \quad q \in Q_t, \qquad (3.31)$$
Figure 10: Relationship between the m-projection and the foliation of the U-model in the U-Boost algorithm.

where the foliation is considered in $M$ for the empirical model and in $\mathcal P$ for the normalized model. Then we can find a leaf $T_t(q^*)$ that passes through the empirical distribution $\tilde p$, and the optimal model is determined by $q_t = q^*$. The concrete expression of step 2 is written as follows for each model:

Empirical U-model:
$$\alpha_t = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n}\sum_{y\neq y_i} U\bigl(F_{t-1}(x_i,y) - F_{t-1}(x_i,y_i) + \alpha\{f_t(x_i,y) - f_t(x_i,y_i)\}\bigr), \qquad (3.32)$$

Normalized U-model:
$$\alpha_t = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n}\Biggl[\,\sum_{y\in\mathcal Y} U\bigl(F_{t-1}(x_i,y) + \alpha f_t(x_i,y) - \phi_t(x_i,\alpha)\bigr) - \bigl\{F_{t-1}(x_i,y_i) + \alpha f_t(x_i,y_i) - \phi_t(x_i,\alpha)\bigr\}\Biggr]. \qquad (3.33)$$
The above sequential procedure from $f_1$ through $f_T$ without recursion is not in general equivalent to the parallel procedure, which directly gives the optimal point $q^*$ for given $\mathcal F = \{f_t;\ t = 1,\dots,T\}$, that is, the intersection
Figure 11: The sequential and parallel procedures coincide when the U-models $\{Q_t\}$ are orthogonal to each other.
of $Q(\mathcal F)$ and $T(\mathcal F)$. As intuitively shown in Figure 11, the sequential procedure without recursion coincides with the parallel procedure in the special case in which the $Q_t$'s are orthogonal to each other. However, if we use $\{f_t;\ t=1,\dots,T\}$ repeatedly, the following property suggests that the solution of the sequential update with recursion approaches the optimal solution of the parallel procedure. By simple calculation, we observe that for the sequence of densities $\{q_t;\ t = 0, 1, \dots\}$ defined by the normalized or unnormalized U-Boost algorithm, the relation
$$L_U(q_{t+1}) - L_U(q_t) = -D_U(q_{t+1}, q_t) \qquad (3.34)$$
holds. This shows that as long as $q_t$ and $q_{t+1}$ are different, namely, $D_U(q_{t+1}, q_t) > 0$, the U-loss decreases monotonically. This property is closely related to that of the expectation-maximization (EM) algorithm for obtaining the maximum likelihood estimator. (See Amari, 1995, for geometric considerations of the EM algorithm.)

3.4.1 Note on Binary Classification. In this section, we summarize some characteristics of binary classification problems, and in the following, we suppose the initial measure is simplified as $\xi(q_0) = 0$. A classifier $h(x)$ outputs either $\{+1\}$ or $\{-1\}$ in the binary case; therefore, $h$ can be regarded as a real-valued function, and the corresponding decision
function $f$ is written as
$$f(x,y) = \frac{y h(x) + 1}{2}. \qquad (3.35)$$
In this case, the combined decision function satisfies
$$F_t(x_i, y) - F_t(x_i, y_i) = \begin{cases} 0, & \text{if } y = y_i, \\ -y_i \displaystyle\sum_{k=1}^{t} \alpha_k h_k(x_i), & \text{if } y \neq y_i. \end{cases} \qquad (3.36)$$
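Equations 3.35 and 3.36 can be checked numerically: with $f = (yh+1)/2$ and $y = -y_i$ in the binary case, the difference of the combined decision functions collapses to $-y_i \sum_k \alpha_k h_k(x)$. The sketch below is our own illustration with arbitrary values.

```python
import numpy as np

def f_from_h(h_val, y):
    """Equation 3.35: decision function built from a binary classifier output."""
    return (y * h_val + 1) / 2

# Check equation 3.36 at a fixed x: for y != y_i (so y = -y_i),
# F_t(x, y) - F_t(x, y_i) = -y_i * sum_k alpha_k h_k(x).
alpha = np.array([0.5, 0.8, 0.3])
h_vals = np.array([1.0, -1.0, 1.0])   # h_1(x), h_2(x), h_3(x) at this x
y_i = 1.0
F = lambda y: np.sum(alpha * f_from_h(h_vals, y))
diff = F(-y_i) - F(y_i)
```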
The U-loss for the empirical model is written as
$$L_U(q) = \frac{1}{n}\sum_{i=1}^{n}\sum_{y\in\mathcal Y} U\bigl(F_{t-1}(x_i,y) - F_{t-1}(x_i,y_i) + \alpha(f_t(x_i,y) - f_t(x_i,y_i))\bigr)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\Biggl[U(0) + U\Bigl(-y_i\Bigl(\sum_{k=1}^{t-1}\alpha_k h_k(x_i) + \alpha h_t(x_i)\Bigr)\Bigr)\Biggr].$$
Therefore, $\alpha_t$ in step 2 is given by
$$\alpha_t = \operatorname*{argmin}_{\alpha} L_U(q) = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} U\Bigl(-y_i\Bigl(\sum_{k=1}^{t-1}\alpha_k h_k(x_i) + \alpha h_t(x_i)\Bigr)\Bigr). \qquad (3.37)$$
In the case of the KL divergence, where $U(z) = \exp(z)$, this procedure is equivalent to AdaBoost:
$$\alpha_t = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} \exp\Bigl(-y_i\Bigl(\sum_{k=1}^{t-1}\alpha_k h_k(x_i) + \alpha h_t(x_i)\Bigr)\Bigr)$$
(cf. Lebanon & Lafferty, 2001). Also, for $U(z) = \exp(z)$, the constraint for the normalized model is described as
$$\sum_{y\in\mathcal Y} q(y|x) = \sum_{y\in\mathcal Y} \exp\bigl(\log(q_{t-1}(y|x)) + \alpha f_t(x,y) - \phi_t(x,\alpha)\bigr) = 1.$$
The normalization term $\phi_t$ is solved as
$$\phi_t(x,\alpha) = \log\Bigl(\sum_{y\in\mathcal Y} q_{t-1}(y|x)\exp(\alpha f_t(x,y))\Bigr),$$
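For a fixed weak classifier $h_t$, the AdaBoost line search in equation 3.37 has the familiar closed form $\alpha_t = \tfrac12 \log\frac{1-\epsilon_t}{\epsilon_t}$, with $\epsilon_t$ the weighted error of $h_t$. This closed form is not spelled out in the text, so the sketch below (our own illustration, hypothetical names) checks it against a brute-force search over $\alpha$.

```python
import numpy as np

def adaboost_alpha(margins, h, y):
    """Closed-form minimizer of eq. 3.37 with U(z) = exp(z).

    margins : (n,) array of -y_i * sum_{k<t} alpha_k h_k(x_i)
    h, y    : (n,) arrays with values in {-1, +1}
    """
    w = np.exp(margins); w /= w.sum()   # AdaBoost weights at step t
    eps = w[h != y].sum()               # weighted error of h_t
    return 0.5 * np.log((1 - eps) / eps)

y = np.array([1, 1, 1, -1, -1, 1, -1, 1.0])
h = np.array([1, -1, 1, -1, 1, 1, -1, -1.0])
margins = np.array([0.3, -0.2, 0.1, 0.5, -0.4, 0.0, 0.2, -0.1])

loss = lambda a: np.sum(np.exp(margins - a * y * h))
grid = np.linspace(-3, 3, 20001)
a_grid = grid[np.argmin([loss(a) for a in grid])]
a_closed = adaboost_alpha(margins, h, y)
```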
and $\alpha_t$ is given by
$$\begin{aligned}
\alpha_t &= \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} \log \frac{\sum_{y\in\mathcal Y} q_{t-1}(y|x_i)\exp(\alpha f_t(x_i,y))}{q_{t-1}(y_i|x_i)\exp(\alpha f_t(x_i,y_i))} \\
&= \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} \log \frac{\sum_{y\in\mathcal Y} \exp(F_{t-1}(x_i,y) + \alpha f_t(x_i,y))}{\exp(F_{t-1}(x_i,y_i) + \alpha f_t(x_i,y_i))} \\
&= \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} \log \sum_{y\in\mathcal Y} \exp\bigl(F_{t-1}(x_i,y) - F_{t-1}(x_i,y_i) + \alpha(f_t(x_i,y) - f_t(x_i,y_i))\bigr) \\
&= \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} \log\Bigl(1 + \exp\Bigl(-y_i\Bigl(\sum_{k=1}^{t-1}\alpha_k h_k(x_i) + \alpha h_t(x_i)\Bigr)\Bigr)\Bigr).
\end{aligned}$$
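The last equality in this chain can be verified numerically in the binary case: the normalized exp-model objective $\sum_i [\log\sum_y e^{F(x_i,y)} - F(x_i,y_i)]$ equals the logistic loss $\sum_i \log(1 + e^{-y_i G_i})$, where $G_i$ is the combined score $\sum_k \alpha_k h_k(x_i)$. The following sketch is our own illustration with arbitrary values (label-independent shifts in $F$ cancel, so only $\pm G/2$ is used).

```python
import numpy as np

y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
G = np.array([0.7, -1.2, 0.1, -0.4, 2.0])   # G_i = sum_k alpha_k h_k(x_i)

# Normalized exp-model objective: since f = (y h + 1)/2, F(x, y') equals
# y' G / 2 plus a label-independent shift that cancels in the objective.
F_pos = G / 2     # F(x, +1) up to the shift
F_neg = -G / 2    # F(x, -1) up to the same shift
F_true = np.where(y > 0, F_pos, F_neg)
lhs = np.sum(np.log(np.exp(F_pos) + np.exp(F_neg)) - F_true)

# Logistic-loss form of the derivation's last line.
rhs = np.sum(np.log(1 + np.exp(-y * G)))
```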
This representation is equivalent to the U-Boost for the empirical model with $U(z) = \log(1+\exp(z))$, which shows that the normalized U-Boost associated with $U(z) = \exp(z)$ carries out the same procedure as LogitBoost (Friedman et al., 2000). We also note that the above discussion is an example in which the U-Boost algorithm with the empirical and normalized models and the same U-function results in different solutions.

4 Statistical Properties of U-Boost

4.1 Error Rate Property. One of the important characteristics of the AdaBoost algorithm is the evolution of its weighted error rates; that is, the classifier $h_t$ at step $t$ shows the worst performance, equivalent to a random guess, under the distribution at the next step $t+1$. Let us define the weight at step $t+1$ by
$$D_{t+1}(i, y) = \frac{q_t(y|x_i)}{Z_{t+1}}, \qquad (4.1)$$
where $Z_{t+1}$ is a normalization constant defined by
$$Z_{t+1} = \sum_{i=1}^{n}\sum_{y\neq y_i} q_t(y|x_i). \qquad (4.2)$$
Then define the weighted error of the classifier $h$ with its decision function $f$ by
$$\epsilon_{t+1}(f) = \sum_{i=1}^{n}\sum_{y\neq y_i} \frac{f(x_i,y) - f(x_i,y_i) + 1}{2}\, D_{t+1}(i, y), \qquad (4.3)$$
that is,
$$2\epsilon_{t+1}(f) - 1 = \sum_{i=1}^{n}\sum_{y\neq y_i} \bigl(f(x_i,y) - f(x_i,y_i)\bigr)\, D_{t+1}(i,y) \propto \langle q_t,\; f - \tilde f \rangle_{\tilde\mu}, \qquad (4.4)$$
where the meaning of the factor in the weight is interpreted by considering the following four cases:
$$\frac{f(x_i,y) - f(x_i,y_i) + 1}{2} = \begin{cases}
0, & \text{if } y_i \in h(x_i) \text{ and } y \notin h(x_i), \\
1/2, & \text{if } y_i \in h(x_i) \text{ and } y \in h(x_i), \\
1/2, & \text{if } y_i \notin h(x_i) \text{ and } y \notin h(x_i), \\
1, & \text{if } y_i \notin h(x_i) \text{ and } y \in h(x_i).
\end{cases} \qquad (4.5)$$
Intuitively speaking, the first case is that $h_t$ is correct for $y$ and $y_i$; the second and third are the cases where $h_t$ is partially correct or partially wrong; and the last is the case where $h_t$ is wrong, because the correct classification rule for $x_i$ is to output $\{y_i\}$:
$$\epsilon_{t+1}(h) = \sum_{\substack{1\le i\le n \\ y_i \notin h(x_i),\ y \in h(x_i)}} D_{t+1}(i,y) + \sum_{\substack{1\le i\le n \\ y_i \in h(x_i),\ y \in h(x_i),\ y \neq y_i}} \frac{1}{2} D_{t+1}(i,y) + \sum_{\substack{1\le i\le n \\ y_i \notin h(x_i),\ y \notin h(x_i),\ y \neq y_i}} \frac{1}{2} D_{t+1}(i,y). \qquad (4.6)$$
Note that $f(x_i,y) - f(x_i,y_i)$ vanishes when $y = y_i$ regardless of whether $h(x_i)$ is correct; hence, we omit $q_t(y_i|x_i)$ from the weighted error; that is, only the weights of labels different from the given examples are defined, as in AdaBoost.M2 (Freund & Schapire, 1996). Also note that the correct rate is written as
$$1 - \epsilon_{t+1}(f) = \sum_{\substack{1\le i\le n \\ y_i \in h(x_i),\ y \notin h(x_i)}} D_{t+1}(i,y) + \sum_{\substack{1\le i\le n \\ y_i \in h(x_i),\ y \in h(x_i),\ y \neq y_i}} \frac{1}{2} D_{t+1}(i,y) + \sum_{\substack{1\le i\le n \\ y_i \notin h(x_i),\ y \notin h(x_i),\ y \neq y_i}} \frac{1}{2} D_{t+1}(i,y), \qquad (4.7)$$
and that the second and third terms on the right-hand side are the same as in the error rate. Then we can prove the following interesting property of the error rate of the normalized and empirical U-Boost algorithms:

Theorem 3. The U-Boost algorithm updates the distribution into the least favorable one at each step, which means
$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (4.8)$$
Proof. By differentiating the U-loss for the empirical model with respect to $\alpha$, we see that $\alpha_t$ satisfies
$$\sum_{i=1}^{n}\sum_{y\in\mathcal Y} \bigl(f_t(x_i,y) - f_t(x_i,y_i)\bigr)\, u\bigl(\xi(q_{t-1}(y|x_i)) + \alpha_t (f_t(x_i,y) - f_t(x_i,y_i))\bigr) = 0; \qquad (4.9)$$
by using the definition
$$q_t(y|x_i) = u\bigl(\xi(q_{t-1}(y|x_i)) + \alpha_t (f_t(x_i,y) - f_t(x_i,y_i))\bigr),$$
the above equation is rewritten as
$$\langle f_t - \tilde f_t,\; q_t \rangle_{\tilde\mu} = 0.$$
Similarly, by differentiating the U-loss for the normalized model, $\alpha_t$ satisfies
$$-\sum_{i=1}^{n} \bigl(f_t(x_i,y_i) - \phi_t'(x_i,\alpha_t)\bigr) + \sum_{i=1}^{n}\sum_{y\in\mathcal Y} \bigl(f_t(x_i,y) - \phi_t'(x_i,\alpha_t)\bigr)\, u\bigl(\xi(q_{t-1}(y|x_i)) + \alpha_t f_t(x_i,y) - \phi_t(x_i,\alpha_t)\bigr) = 0. \qquad (4.10)$$
Using the definition of $q_t(y|x_i)$ and the constraint $\sum_{y\in\mathcal Y} q_t(y|x_i) = 1$, the above equation is rewritten as
$$-\sum_{i=1}^{n} \bigl(f_t(x_i,y_i) - \phi_t'(x_i,\alpha_t)\bigr) + \sum_{i=1}^{n}\sum_{y\in\mathcal Y} \bigl(f_t(x_i,y) - \phi_t'(x_i,\alpha_t)\bigr)\, q_t(y|x_i)$$
$$= -\langle f_t - \phi_t',\; \tilde p \rangle_{\tilde\mu} + \langle f_t - \phi_t',\; q_t \rangle_{\tilde\mu} = \langle f_t - \phi_t',\; q_t - \tilde p \rangle_{\tilde\mu}$$
$$= \langle f_t,\; q_t - \tilde p \rangle_{\tilde\mu} \qquad (\phi_t' \text{ does not depend on } y)$$
$$= \langle f_t,\; q_t \rangle_{\tilde\mu} - \tilde f_t = \langle f_t - \tilde f_t,\; q_t \rangle_{\tilde\mu} \qquad (\langle 1,\, q_t \rangle_{\tilde\mu} = 1)$$
$$= 0.$$
Therefore, for both models, the optimality condition for $q_t$ is written in terms of the decision function and its conditional expectation as
$$\langle f_t - \tilde f_t,\; q_t \rangle_{\tilde\mu} = 0.$$
This condition is interpreted as
$$\sum_{i=1}^{n}\sum_{y\in\mathcal Y} \bigl(f_t(x_i,y) - f_t(x_i,y_i)\bigr)\, q_t(y|x_i) = \sum_{\substack{1\le i\le n \\ y_i \notin h_t(x_i),\ y \in h_t(x_i)}} q_t(y|x_i) - \sum_{\substack{1\le i\le n \\ y_i \in h_t(x_i),\ y \notin h_t(x_i)}} q_t(y|x_i) = 0,$$
that is,
$$\sum_{\substack{1\le i\le n \\ y_i \notin h_t(x_i),\ y \in h_t(x_i)}} q_t(y|x_i) = \sum_{\substack{1\le i\le n \\ y_i \in h_t(x_i),\ y \notin h_t(x_i)}} q_t(y|x_i). \qquad (4.11)$$
This means that the accumulated probabilities of correctly classified examples and wrongly classified examples are balanced under $q_t$. By imposing the above relation on error rate 4.6 and correct rate 4.7, we observe
$$\epsilon_{t+1}(h_t) = 1 - \epsilon_{t+1}(h_t),$$
which concludes
$$\epsilon_{t+1}(h_t) = \frac{1}{2}. \qquad (4.12)$$
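Theorem 3 can be checked numerically for AdaBoost: after updating the weights with the optimal $\alpha_t$, the weighted error of the chosen classifier under the new weights is exactly $1/2$. The sketch below is our own binary-case illustration.

```python
import numpy as np

# One exact AdaBoost step: after the weight update with the optimal alpha_t,
# the weighted error of h_t becomes exactly 1/2 (theorem 3).
y = np.array([1, 1, -1, -1, 1, -1.0])
h = np.array([1, -1, -1, 1, 1, -1.0])   # h_t, wrong on examples 1 and 3
w = np.ones(len(y)) / len(y)            # current weights D_t

eps = w[h != y].sum()                   # weighted error under D_t (= 1/3 here)
alpha = 0.5 * np.log((1 - eps) / eps)   # optimal step size

w_next = w * np.exp(-alpha * y * h)     # unnormalized updated weights
w_next /= w_next.sum()
eps_next = w_next[h != y].sum()         # weighted error under D_{t+1}
```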
In this way, the U-Boost algorithm is constructed so as to update the distribution into the one least favorable for the present step. A set of decision functions defined by
$$R(\tilde p, q_t) = \Bigl\{\, f \,\Bigm|\, \sum_{i=1}^{n}\sum_{y\in\mathcal Y} \bigl(f(x_i,y) - f(x_i,y_i)\bigr)\, q_t(y|x_i) = 0 \,\Bigr\} \qquad (4.13)$$
Figure 12: Relationship between "random guess" and the tangent space of Q.
can be regarded as the "random guesses" associated with the weighted error. As seen in the above proof, the condition on $R$ is equivalent to
$$\langle \tilde p - q_t,\; f - b'\big|_{\alpha=0} \rangle_{\tilde\mu} = 0, \qquad (4.14)$$
and hence the set of "random guesses," which is the set of the "worst" classifiers, is included in the tangent space of $Q_t$ at $q_t$. Therefore, step 1 of the U-Boost algorithm requires that the next classifier be chosen from outside the random-guess set (see Figure 12).

4.2 Consistency and Bayes Rule Equivalence. Using the basic properties of the Bregman divergence, we can show the consistency of the U-loss as follows:

Theorem 4. Let $p(y|x)$ be the true conditional distribution and $F(x,y)$ be the minimizer of the U-loss $H_U(q)$ with $q = u(F)$. The classification rule given by $F$ is Bayes optimal:
$$\hat y(x) = \operatorname*{argmax}_{y\in\mathcal Y} F(x,y) = \operatorname*{argmax}_{y\in\mathcal Y} p(y|x).$$
Proof. From the property of the Bregman divergence,
$$D_U(p, q) = 0 \iff p(y|x) = q(y|x) \ (\text{a.e. } x), \qquad (4.15)$$
and the equivalence relation 3.8, the minimizer of the U-cross-entropy $H_U(p, q; \mu)$ with respect to $q$ is given by
$$F(x,y) = \operatorname*{argmin}_{F} H_U(p, u_F) = \xi(p(y|x)).$$
The statement follows from the monotonicity of $\xi$:
$$\operatorname*{argmax}_{y\in\mathcal Y} p(y|x) = \operatorname*{argmax}_{y\in\mathcal Y} \xi(p(y|x)).$$
In the U-Boost algorithm, $F(x,y)$ is chosen from a class of functions that are linear combinations of $f_t(x,y)$, $t = 1,\dots,T$, with some bias functions $q_0$ and $b$. In the case that the true distribution is not in the considered U-model, which happens in practical cases, U-Boost cannot achieve Bayes optimality, and the closest point in the model is chosen in the sense of the U-loss. If the number of functions $T$ is sufficiently large and the functions $f_t$, $t = 1,\dots,T$, are chosen from a sufficiently varied set of decision functions, the U-model can approximate the true distribution well. This depends on the richness of the decision functions, which form the basis of the discriminant function $F$. For a discussion of the richness of linear combinations of simple functions, see, for example, Barron (1993) and Murata (1996).

For the binary case in particular, we can show the following interesting relationship between the U-function and the log-likelihood ratio. In a binary classification problem, as shown in equation 3.37, the objective function of the U-loss is simplified as
$$\int_{\mathcal X} \sum_{y\in\{\pm 1\}} U(-yF(x))\, p(y|x)\, \mu(x)\, dx, \qquad (4.16)$$
where $q$ and $F$ are linked as $q = u(yF(x))$, and the discriminant function $F$ gives the classification rule
$$y = \begin{cases} +1, & F(x) > 0, \\ -1, & F(x) < 0. \end{cases}$$

Theorem 5. The minimizer of the U-cross-entropy gives the Bayes optimal rule, that is,
$$\{\, x \mid F(x) > 0 \,\} = \Bigl\{\, x \Bigm| \log\frac{p(+1|x)}{p(-1|x)} > 0 \,\Bigr\}.$$
Moreover, if
$$\log\frac{u(z)}{u(-z)} = 2z \qquad (4.17)$$
holds, $F$ coincides with the log-likelihood ratio,
$$F(x) = \frac{1}{2}\log\frac{p(+1|x)}{p(-1|x)}.$$

Proof. By the usual variational arguments, the minimizer of equation 4.16 satisfies
$$\int_{\mathcal X} \bigl(p(+1|x)\, u(-F(x)) - p(-1|x)\, u(F(x))\bigr)\, \Delta(x)\, \mu(x)\, dx = 0$$
for any function $\Delta(x)$. Hence,
$$\log\frac{p(+1|x)}{p(-1|x)} = \log\frac{u(F(x))}{u(-F(x))} \quad (\text{a.e. } x),$$
knowing that for any convex function $U$,
$$\rho(z) = \log\frac{u(z)}{u(-z)}$$
is monotonically increasing and satisfies $\rho(0) = 0$. This directly shows the first part of the theorem, and by imposing $\rho(z) = 2z$, the second part is proved.

The last part of the theorem agrees with the results in Eguchi and Copas (2001, 2002). The U-functions for AdaBoost, LogitBoost, and MadaBoost satisfy condition 4.17.

4.3 Asymptotic Covariance. To see the efficiency of the U-model, we investigate the asymptotic covariance of $\alpha$ in this section. Let us consider the generic U-model in the form of
$$q(y|x) = u\Bigl(\xi(q_0(y|x)) + \sum_{t=1}^{T}\alpha_t f_t(x,y) - b(x,\alpha)\Bigr),$$
parameterized by $\alpha = (\alpha_t;\ t = 1,\dots,T)$, and consider the case that the auxiliary function $b$ does not depend on the data. Let $p(y|x)$ be the true
conditional distribution. The optimal point $q^*$ in the U-model is given by
$$\alpha^* = \operatorname*{argmin}_{\alpha} H_U(p, q; \mu) = \operatorname*{argmin}_{\alpha} \int_{\mathcal X} \sum_{y\in\mathcal Y} \bigl\{ U(\xi(q)) - p\,\xi(q) \bigr\}\, d\mu, \qquad (4.18)$$
and for given $n$ examples, the estimate of $\alpha$ is given by
$$\hat\alpha = \operatorname*{argmin}_{\alpha} L_U(q) = \operatorname*{argmin}_{\alpha} H_U(\tilde p, q; \tilde\mu) = \operatorname*{argmin}_{\alpha} \int_{x\in\mathcal X} \sum_{y\in\mathcal Y} \bigl\{ U(\xi(q)) - \tilde p\,\xi(q) \bigr\}\, d\tilde\mu, \qquad (4.19)$$
in abstract form. When $n$ is sufficiently large, the covariance of $\hat\alpha$ with respect to all possible sample sets is given as follows:

Theorem 6. The asymptotic covariance of $\hat\alpha$ is
$$\mathrm{Cov}(\hat\alpha) = \frac{1}{n}\, H^{-1} G H^{-1} + o\Bigl(\frac{1}{n}\Bigr), \qquad (4.20)$$
where $H$ and $G$ are $T\times T$ matrices defined by
$$H = \frac{\partial^2}{\partial\alpha\,\partial\alpha^\top} H_U(p, q^*; \mu) = \int_{\mathcal X} \frac{\partial^2}{\partial\alpha\,\partial\alpha^\top}\, r(x, \alpha^*)\, \mu(x)\, dx,$$
$$G = \int_{\mathcal X} \sum_{y\in\mathcal Y} \frac{\partial}{\partial\alpha}\bigl(U(\xi(q^*)) - \xi(q^*)\bigr)\, \frac{\partial}{\partial\alpha^\top}\bigl(U(\xi(q^*)) - \xi(q^*)\bigr)\, p\, d\mu = \int_{\mathcal X} \sum_{y\in\mathcal Y} \Bigl(\frac{\partial}{\partial\alpha} r(x,\alpha^*) - f(x,y)\Bigr)\Bigl(\frac{\partial}{\partial\alpha} r(x,\alpha^*) - f(x,y)\Bigr)^{\!\top} p(y|x)\, \mu(x)\, dx,$$
where $r$ is the function of $x$ defined by
$$r(x,\alpha) = \sum_{y\in\mathcal Y} U\Bigl(\xi(q_0(y|x)) + \sum_{t=1}^{T}\alpha_t f_t(x,y) - b(x,\alpha)\Bigr) + b(x,\alpha),$$
and $f = (f_1, f_2, \dots, f_T)^\top$. The proof follows from the usual asymptotic arguments (see Murata, Yoshizawa, & Amari, 1994, for example).
When the true distribution is included in the U-model, that is, $p = q^*$, the asymptotic covariance of LogitBoost becomes
$$\mathrm{Cov}(\hat\alpha) = \frac{1}{n}\, I^{-1} + o\Bigl(\frac{1}{n}\Bigr),$$
where $I$ is the Fisher information matrix of the logistic model. This means that LogitBoost attains the Cramér-Rao bound asymptotically, that is, LogitBoost is asymptotically efficient. In general, the asymptotic covariance of U-Boost algorithms is inferior to the Cramér-Rao bound; hence, from this point of view, U-Boost is not efficient. However, instead of efficiency, some U-Boost algorithms show robustness, as discussed in the next section.

The expected U-loss of $q_t$ estimated with given $n$ examples is asymptotically given by
$$E(L_U(q_t)) = E(H_U(\tilde p, q_t; \tilde\mu)) = H_U(p, q^*; \mu) + \frac{1}{2n}\operatorname{tr} H^{-1} G + o\Bigl(\frac{1}{n}\Bigr), \qquad (4.21)$$
where $E$ is the expectation over all possible sample sets (cf. Murata et al., 1994).

4.4 Robustness of U-Boost. In this section, we examine the robustness of U-Boost for the binary classification problem. First, we consider the robustness condition for U-functions; then we discuss the robustness of the U-Boost algorithm.

4.4.1 Most B-Robust U-Function. Let us consider the statistical model with one parameter $\alpha$,
$$Q^{\mathrm{norm}}(q_0, \{yh(x)\}) = \{\, q \in \mathcal P \mid \log q(y|x,\alpha) = \log q_0(y|x) + \alpha y h(x) - b(x,\alpha),\ \alpha \in \mathbb R \,\}, \qquad (4.22)$$
where $q_0 \in \mathcal P$ and $h(x)$ takes $+1$ or $-1$. Let us define the log-likelihood ratio of $q_0$ by
$$F(x) = \frac{1}{2}\log\frac{q_0(+1|x)}{q_0(-1|x)}.$$
We define an estimator of $\alpha$ with the U-function as
$$\alpha_U(q\mu) = \operatorname*{argmin}_{\alpha} \int_{\mathcal X} \sum_{y\in\mathcal Y} q(y|x)\, U\bigl(-y(F(x) + \alpha h(x))\bigr)\, \mu(x)\, dx,$$
where $q\mu$ is the joint distribution of $x$ and $y$. As discussed in the previous section, when the U-function satisfies condition 4.17, the estimator defined by the
U-function satisfies $\alpha_U(q_\alpha\mu) = \alpha$ for any $q_\alpha \in Q^{\mathrm{norm}}(q_0, \{yh(x)\})$. We measure the robustness of the estimator by the gross error sensitivity (Hampel, Rousseeuw, Ronchetti, & Stahel, 1986),
$$\gamma(U, q_0) = \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2}, \qquad (4.23)$$
where
$$\tilde p_\epsilon = (1-\epsilon)\, q_0\mu + \epsilon\, \delta(\tilde x, \tilde y)$$
and $\delta(\tilde x, \tilde y)$ is the probability distribution with a point mass at $(\tilde x, \tilde y)$. The gross error sensitivity measures the worst influence that a small amount of contamination can have on the estimate. The estimator that minimizes the gross error sensitivity is called the most B-robust estimator. For the choice of a robust U-function, we show the following theorem.

Theorem 7. The U-function that derives the MadaBoost algorithm minimizes the gross error sensitivity among the U-functions with property 4.17.

Proof. By a brief calculation, the gross error sensitivity of the estimator is written as
$$\gamma(U, q_0) = \sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x))^2 \Biggl( \int_{\mathcal X} \sum_{y\in\mathcal Y} u'(-F(x)y)\, q_0(y|x)\, \mu(x)\, dx \Biggr)^{\!-2}. \qquad (4.24)$$
Knowing that the conditional probability is written with $F$ as
$$q_0(y|x) = \frac{1}{1 + \exp(-2yF(x))} = \frac{u(yF(x))}{u(F(x)) + u(-F(x))},$$
and that $u$ satisfies
$$\frac{u'(z)}{u(z)} + \frac{u'(-z)}{u(-z)} = 2$$
from the consistency condition on $u$,
$$\log\frac{u(z)}{u(-z)} = 2z,$$
we get
$$u'(-F(x))\, q_0(+1|x) + u'(F(x))\, q_0(-1|x) = \frac{u'(-F(x))\, u(F(x)) + u'(F(x))\, u(-F(x))}{u(F(x)) + u(-F(x))} = \frac{2\, u(F(x))\, u(-F(x))}{u(F(x)) + u(-F(x))} = 2\, u(F(x))\, q_0(-1|x).$$
By imposing the above relation, we obtain an expression for the gross error sensitivity in terms of the function $u$ alone:
$$\gamma(U, q_0) = \sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x))^2 \Bigl( 2\int_{\mathcal X} u(F(x))\, q_0(-1|x)\, \mu(x)\, dx \Bigr)^{\!-2}. \qquad (4.25)$$
From this formula, we find that the gross error sensitivity diverges if $u(yF(x))$ is not bounded while the integral of $u(F(x))$ with respect to $q_0(-1|x)\mu(x)$ is bounded. Therefore, we focus on the case that $u$ is bounded. Without loss of generality, we can suppose
$$\sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x))^2 = 1,$$
because multiplying the U-function by a positive value does not change the estimator. To minimize the gross error sensitivity, we need to find a U-function that maximizes
$$\int_{\mathcal X} u(F(x))\, q_0(-1|x)\, \mu(x)\, dx.$$
The pointwise maximization of $u(z)$ under the conditions
$$u(-z) = u(z)\, e^{-2z} \qquad \text{and} \qquad \sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x)) = 1$$
leads to
$$u(z) = \begin{cases} 1, & z \ge 0, \\ \exp(2z), & z < 0, \end{cases}$$
and this coincides with the MadaBoost U-function.
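The consistency condition 4.17 and the boundedness that distinguishes MadaBoost can be checked numerically. The sketch below is our own illustration; the particular scalings of $u$ for AdaBoost and LogitBoost are chosen here so that $\rho(z) = \log(u(z)/u(-z)) = 2z$ holds, which is an assumption of this sketch rather than a normalization stated in the text.

```python
import numpy as np

# u = U' for three boosting losses, each scaled so that
# rho(z) = log(u(z)/u(-z)) = 2z (condition 4.17) holds.
u_ada   = lambda z: np.exp(z)                        # AdaBoost
u_logit = lambda z: 1.0 / (1.0 + np.exp(-2 * z))     # LogitBoost (this scaling)
u_mada  = lambda z: np.minimum(1.0, np.exp(2 * z))   # MadaBoost (bounded)

rho = lambda u, z: np.log(u(z) / u(-z))
z = np.linspace(-2, 2, 101)
ok = all(np.allclose(rho(u, z), 2 * z) for u in (u_ada, u_logit, u_mada))
```

Only the MadaBoost $u$ is bounded above, which is exactly the property that minimizes the gross error sensitivity in the proof above.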
4.4.2 Robustness of the Boosting Algorithm. Next, we study the robustness of U-Boost. Let us define the normalized U-model $Q_U^{\mathrm{norm}}(q_0, \mathcal F)$ with the set of binary decision functions
$$\mathcal F = \{\, yh_1(x), \dots, yh_T(x) \,\},$$
where $h_t(x)$ takes $+1$ or $-1$. When the probability $q_0(y|x)$ changes to $\tilde p_\epsilon$, the U-Boost estimator is altered from $F(x)$ to
$$F(x) + \alpha_U(\tilde p_\epsilon)\, \tilde h(x),$$
where $\tilde h(x)$ is an element of $\{h_1(x), \dots, h_T(x)\}$ that depends on $\tilde p_\epsilon$. The probability $q_\epsilon(y|x) \in Q_U^{\mathrm{norm}}(q_0, \mathcal F)$ specified by the above decision function is written as
$$\log q_\epsilon(y|x) = \log q_0(y|x) + \alpha_U(\tilde p_\epsilon)\, y\tilde h(x) - b(x, \alpha_U(\tilde p_\epsilon)),$$
that is, $q_\epsilon \in Q^{\mathrm{norm}}(q_0, \{y\tilde h(x)\})$. Let us measure the robustness of U-Boost by the gross error sensitivity of the distribution estimated by the algorithm, defined with the KL divergence as
$$\gamma_{\mathrm{boost}}(U, q_0) = \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} \frac{2}{\epsilon^2} \int_{\mathcal X} \sum_{y\in\mathcal Y} q_0(y|x)\, \bigl(\log q_0(y|x) - \log q_\epsilon(y|x)\bigr)\, \mu(x)\, dx. \qquad (4.26)$$
Intuitively, this measures the sensitivity of the KL divergence between the true and estimated distributions under small contamination.

For a fixed $(\tilde x, \tilde y)$, the chosen classifier $\tilde h$ does not depend on the value of $\epsilon$, because
$$\langle \tilde p_\epsilon - q_0,\; yh_t(x) - b_t' \rangle = \epsilon \cdot \langle \delta(\tilde x, \tilde y) - q_0,\; yh_t(x) - b_t' \rangle$$
holds. Therefore, we find that
$$\lim_{\epsilon\to+0} \frac{2}{\epsilon^2} \int_{\mathcal X} \sum_{y\in\mathcal Y} q_0(y|x)\, \bigl(\log q_0(y|x) - \log q_\epsilon(y|x)\bigr)\, \mu(x)\, dx = \lim_{\epsilon\to+0} I(\tilde h) \cdot \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2}$$
by the asymptotic expansion, where $I(\tilde h)$ is the Fisher information of $Q^{\mathrm{norm}}(q_0, y\tilde h(x))$ at $\alpha = 0$. From the property $(yh(x))^2 = 1$, we find that $I(h)$ does not depend on $h$; let us define $I_0$ as the common value of $I(h_t)$ for $t = 1, \dots, T$.
From the above argument, we find that
$$\gamma_{\mathrm{boost}}(U, q_0) = \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} I(\tilde h) \cdot \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2} = I_0 \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2} = I_0\, \gamma(U, q_0).$$
Hence, the U-function of MadaBoost also minimizes $\gamma_{\mathrm{boost}}(U, q_0)$. As a consequence, MadaBoost minimizes the influence of outliers around the true distribution.

4.5 Illustrative Examples. In the following numerical experiments, we study a two-dimensional binary classification problem with "stumps" (Friedman et al., 2000). We generate labeled examples subject to a fixed probability, and a few examples are flipped by the contamination, as shown in Figure 13. The detailed setup is
$$x = (x_1, x_2) \in \mathcal X = [-\pi, \pi] \times [-\pi, \pi], \qquad y \in \mathcal Y = \{+1, -1\},$$
$$\mu(x): \text{uniform on } \mathcal X, \qquad p(y|x) = \frac{1 + \tanh(yF(x))}{2}, \quad \text{where } F(x) = x_2 - 3\sin(x_1),$$
and $a\%$ contaminated data are generated according to the following procedure. First, the examples are sorted in descending order of $|F(x_i)|$, and from the top $10a\%$ of the examples, $a\%$ are randomly chosen and flipped without replacement. This means that contamination is avoided around the classification boundary. The plots are made by averaging 50 different runs. In each run, 300 training data are produced, and the classification error rate is calculated with 4000 test data.

The training results of three boosting methods (AdaBoost, LogitBoost, and MadaBoost) are compared from the viewpoint of robustness. In Figure 14, we show the evolution of the test error with respect to the number of boosting rounds. All the boosting methods show overfitting as the number of boosting rounds increases. We can see that AdaBoost is quite sensitive to the contaminated data. To show the robustness to contamination, we plot the test error differences against the number of boosting rounds. In Figure 15, the differences between 1% contamination and no contamination and between 2% contamination and no contamination are plotted, respectively. In fact, Figures 14b and 14c for the examples with outliers show that LogitBoost and MadaBoost stably provide optimal performance around
Figure 13: Typical examples with contamination.
boosting rounds 30 to 50, while AdaBoost fails to attain the optimal performance observed in the noncontaminated case (Figure 14a). Thus, we conclude that AdaBoost is more sensitive to outliers than MadaBoost with respect to learning curves.
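The data-generating procedure of section 4.5 can be sketched as follows. This is our own reconstruction of the described setup; the seed and sample sizes are arbitrary choices.

```python
import numpy as np

def make_contaminated_data(n=300, a=2.0, seed=0):
    """Section 4.5 setup: x uniform on [-pi, pi]^2,
    p(y = +1 | x) = (1 + tanh(F(x))) / 2 with F(x) = x2 - 3 sin(x1),
    then flip a% of the labels chosen from the top 10a% of examples
    by |F| (so contamination avoids the classification boundary).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-np.pi, np.pi, size=(n, 2))
    F = x[:, 1] - 3 * np.sin(x[:, 0])
    y = np.where(rng.random(n) < (1 + np.tanh(F)) / 2, 1, -1)
    # Contamination: sort by descending |F|, flip a% drawn from the top 10a%.
    order = np.argsort(-np.abs(F))
    top = order[: int(n * 10 * a / 100)]
    flip = rng.choice(top, size=int(n * a / 100), replace=False)
    y[flip] = -y[flip]
    return x, y, flip

x, y, flip = make_contaminated_data()
```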
5 Conclusion

In this article, we formulated boosting algorithms as sequential updates of conditional measures, and we introduced a class of boosting algorithms by considering the relation with the Bregman divergence. Using a statistical framework, we discussed the properties of consistency, efficiency, and robustness. Still, detailed studies of some properties, such as the rate of convergence and stopping criteria of boosting, are needed to avoid the overfitting problem. Here we treated only the classification problem, but the formulation can be extended to the case where y lies in a continuous space, as in regression and density estimation. This remains a subject for future work.
Figure 14: Test error of boosting algorithms. (a) Training data are not contaminated. (b) 1% contamination. (c) 2% contamination.
References Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: SpringerVerlag. Amari, S. (1995). Information geometry of the EM and EM algorithms for neural networks. Neural Networks, 8(9), 1379–1408. Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39(3), 930–945.
Figure 15: Difference of test errors of the original examples and that of the contaminated examples. (a) Difference between 1% contamination and noncontamination. (b) Difference between 2% contamination and noncontamination.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press. Collins, M., Schapire, R. E., & Singer, Y. (2000). Logistic regression, AdaBoost and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (pp. 158–169). San Francisco: Morgan Kaufmann. Domingo, C., & Watanabe, O. (2000). MadaBoost: A modification of AdaBoost. In Proceedings of the Thirteenth Conference on Computational Learning Theory (pp. 180–189). San Francisco: Morgan Kaufmann. Eguchi, S., & Copas, J. B. (2001). Recent developments in discriminant analysis from an information geometric point of view. Journal of the Korean Statistical Society, 30, 247–264. Eguchi, S., & Copas, J. B. (2002). A class of logistic type discriminant functions. Biometrika, 89, 1–22. Eguchi, S., & Kano, Y. (2001). Robusting maximum likelihood estimation by psi-divergence (ISM Research Memorandum 802). Tokyo: Institute of Statistical Mathematics. Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285. Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156). San Francisco: Morgan Kaufmann. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–407. Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics. New York: Wiley. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag. Kearns, M., & Valiant, L. G. (1988). Learning boolean formulae or finite automata is as hard as factoring (Tech. Rep. TR-14-88). Cambridge, MA: Harvard University Aiken Computation Laboratory. Kivinen, J., & Warmuth, M. K. (1999). Boosting as entropy projection. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (pp. 134–144). New York: ACM Press. Lebanon, G., & Lafferty, J. (2001). Boosting and maximum likelihood for exponential models (Tech. Rep. CMU-CS-01-144). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University. McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley. Minami, M., & Eguchi, S. (2002). Robust blind source separation by beta-divergence. Neural Computation, 14, 1859–1886. Murata, N. (1996). An integral representation with ridge functions and approximation bounds of three-layered network. Neural Networks, 9(6), 947–956. Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: Determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5(6), 865–872. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197–227. Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651–1686. Takenouchi, T., & Eguchi, S. (2004). Robustifying AdaBoost by adding the naive error rate. Neural Computation, 16(4), 767–787. Vapnik, V. (1995). The nature of statistical learning theory.
Berlin: Springer-Verlag. Received December 3, 2003; accepted January 8, 2004.
LETTER
Communicated by Joachim Diederich
A New Approach to the Extraction of ANN Rules and to Their Generalization Capacity Through GP
Juan R. Rabuñal [email protected]
Julián Dorado [email protected]
Alejandro Pazos [email protected]
Javier Pereira [email protected]
Daniel Rivero [email protected]
Department of Information and Communications Technologies, University of A Coruña, Facultade de Informática, Campus Elviña s/n, 15192 A Coruña, Spain
Various techniques for the extraction of ANN rules have been used, but most of them have focused on certain types of networks and their training. There are very few methods that deal with ANN rule extraction as systems that are independent of their architecture, training, and internal distribution of weights, connections, and activation functions. This article proposes a methodology for the extraction of ANN rules, regardless of their architecture, based on genetic programming. The strategy builds on this algorithm and aims at achieving the generalization capacity that is characteristic of ANNs by means of symbolic rules that are understandable to human beings.

Neural Computation 16, 1483–1523 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Artificial neural networks (ANNs) are easily implemented and used, and they present a number of features that make them ideal for problem solving in various fields. Nevertheless, many developers prefer not to use them because they perceive them as black boxes: systems that produce certain outputs from certain inputs without explaining why or how. Researchers from a wide range of fields feel that ANNs are not trustworthy. In medicine, for instance, a computational system that is designed to provide medical diagnoses should not only be correct but also be able to justify how and why the diagnoses were reached. Expert systems (ES) are typically successful because they provide a reason for their solutions or answers. The aim of this letter is to find a way of explaining how ANNs work and to translate their information into a symbolic format that maintains their major
characteristic: the fact that they learn from examples. We shall propose a system that is conditioned by certain requirements (Tickle, Andrews, Golea, & Diederich, 1998):

• It must function with any type of ANN architecture. It must be compatible with all kinds of neural networks, including those that are highly interconnected and recurrent.

• It must function with any type of ANN training. Most algorithms focus on a given ANN training process, based on rule extraction, and are not valid for other types of training.

• It must be correct. Many rule extraction mechanisms generate only approaches to the functioning of the ANN, whereas the rules should be as precise as possible.

• It must have a high expression level. The rule language represents the extracted symbolic knowledge.

2 State of the Art

2.1 Genetic Programming. Genetic programming (GP) evolved from the so-called genetic algorithms (GAs). During the first conferences on GAs, the International Conference on Genetic Algorithms (ICGA), two papers appeared that are generally considered the predecessors of GP (Cramer, 1985; Fujiki, 1987); however, back in 1958 and 1959, Friedberg had already initiated his work on the automatic creation of programs in machine language (Friedberg, 1958; Friedberg, Dunham, & North, 1959). But it was John R. Koza who came up with the term that was also the title of his book: Genetic Programming (Koza, 1992). This book formally established the basis of today's work on GP.

GP is based on the evolution of a population in which each individual represents a solution to a problem that needs to be solved. In this particular problem, these solutions will be rules that explain the behavior of ANNs, and, as we will see, they will take the shape of arithmetic or logical rules. The evolutionary process takes place after an initial population of random solutions is created. This initial population is called generation 0.
The evolution takes place by selecting the best individuals (although the worst ones also have a small probability of being selected) and combining them in order to create new solutions. With these new solutions, a new population is created, deleting the old one. It is a process based on selection, mutation, and crossover algorithms. After a few generations, the population is expected to have reached a solution that is able to solve the problem.

In GP, the codification of solutions is shaped like a tree: the user specifies which terminals (leaves) and functions can be used by the algorithm. Complex expressions can be produced either mathematically (i.e., involving
arithmetical operators) or logically (involving boolean or relational operators).

Several directions or branches have been developed within GP. With regard to knowledge discovery (KD), fuzzy rules (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996; Bonarini, 1996), which derive from the combination of fuzzy logic and rule-based systems, are among the most promising evolutions.

2.2 Extraction of ANN Rules. Andrews (Andrews, Diederich, & Tickle, 1995; Andrews et al., 1996) identifies three techniques for the extraction of rules:

1. The decompositional technique, which refers to extraction at the level of each ANN neuron, especially the neurons in the hidden layers and the output layer. Rules are extracted from each neuron and from its relation with the other neurons; as a result, the methods based on this technique depend completely on the architecture of the ANN and have a limited applicability.

2. The pedagogical technique, which treats the ANN like a black box, where the analysis takes place through the neurons of the hidden layers and by means of network inputs, from which the relevant rules are extracted. This means that the system only observes the relations between the inputs and the outputs of the ANN. The main purpose of this technique is to obtain the function that is computed by the ANN.

3. The eclectic technique, which uses the ANN's architecture and the input-output pairs as a complement of a symbolic training algorithm. This technique combines elements from the two previous ones.

Towell and Shavlik (1994) apply the first technique by using the connections between neurons as rules, based on simple transformations. This limits the extraction to networks with a certain multilayer architecture and with few processing elements (Andrews et al., 1995). Another example of this "decompositional" technique is given by Andrews and Geva (1994), who present the CEBP (constrained error backpropagation) algorithm.
This algorithm constitutes a special class of ANNs, built with local functions that behave similarly to radial basis functions. Whereas the basic underlying motive for rule extraction in CEBP is decompositional, in this case it is possible to exploit the characteristic properties of local functions to clearly identify dominant inputs (and, hence, rules).

Thrun (1995) has developed the most important approach of the second technique, the so-called validity interval analysis (VI-A). His algorithm uses linear programming (SIMPLEX) by applying value intervals to the activations of each neuron in each layer. The system extracts "possibly correct" rules by expanding those intervals forward and backward through the ANN. Since it uses linear programming, this algorithm may present an
exponential complexity; it may also need an unacceptable lapse of time to reach a solution when confronted with a large number of process elements.

Other approaches that use the second technique are the RULENEG (Pop, Hayward, & Diederich, 1994) and TREPAN (Craven, 1996) algorithms. The first one is based on a simple sequential change of the truth value of the inputs, and if this affects the output, rule modification occurs. But those rule discovery techniques, which focus only on training data, lose the generalization capacity that is typical of ANNs.

The TREPAN algorithm, proposed by Craven (Craven, 1996; Craven & Shavlik, 1996), introduces the concepts of language representation and extraction strategy. Language representation is the language used by the extraction method. The languages that were used by various extraction methods are the inference rules (if-then), the M-of-N rules, the fuzzy rules, and the decision trees. The extraction strategy is the method used to represent an ANN that is trained in the representation language: it explains how the method explores a space of possible characteristics. When the rules are extracted, they must describe the ANN's behavior. The TREPAN algorithm is similar to conventional decision tree algorithms such as CART and C4.5, which learn directly from a training set. These algorithms build their decision trees by recursively partitioning the input space. Each internal node in such a tree represents a splitting criterion that partitions a part of the input space, and each leaf represents a predicted class. The TREPAN algorithm converts a neural network into a decision tree of the M-of-N type.

Decompositional and pedagogical rule extraction techniques can be combined. DEDEC (Tickle, Orlowski, & Diederich, 1996), for instance, uses the trained ANN to create examples from which the underlying rules can be extracted.
However, an important difference is that it extracts additional information from the trained ANN by subjecting the resultant weight vectors to further analysis (that is, a partial decompositional approach). This information is then used to direct the strategy for the generation of a (minimal) set of examples for the learning phase. It also uses an efficient algorithm for the rule extraction phase. Other rule discovery techniques (Chalup, Hayward, & Diederich, 1998; Visser, Tickle, Hayward, & Andrews, 1996; Tickle et al., 1998) are based on several of the already mentioned approaches.

Several authors have done research on the equivalence between ANNs and fuzzy rules (Jang & Sun, 1992; Buckley, Hayashi, & Czogala, 1993; Benítez, Castro, & Requena, 1997). Most results establish that equivalence through a process of successive approximations. Apart from being purely theoretical solutions, they require a large number of rules in order to approach the ANN's functioning (Benítez et al., 1997). The work of Jang and Sun (1992), on the other hand, provides an equivalence between radial ANNs and fuzzy systems that requires a finite number of rules or neurons. They remain, however, limited to ANNs of the RBF type.

The methods for logical rule extraction developed by Duch (Duch, Adamczak, & Grabczewski, 2001) are based on constrained multilayer perceptron backpropagation networks (the MLP2LN method) and their constructive version, C-MLP2LN. MLP2LN is based on multilayered perceptrons that are first trained on the data and subsequently simplified to obtain a skeleton network with 0 or ±1 weights. It uses the same modification of the error function as the C-MLP2LN method, but the algorithm is not constructive, and more than one hidden layer may be used. The C-MLP2LN is based on a constructive multilayered perceptron algorithm, adding one new neuron per class and enforcing logical interpretation by modification of the error function, which ensures a smooth transition to the 0 and +1 or −1 weights. After this transition, only a few relevant inputs are usually left. Dominant rules are recovered from the analysis of the first input weights, interpreted as 0 = irrelevant feature, +1 = positive evidence, and −1 = negative evidence.

GAs have recently been used to search and discover ANN rules, because of the advantages of evolutionary methods in the analysis of complex systems. Keedwell (Keedwell, Narayanan, & Savic, 2000), as a follower of the pedagogical technique, uses a GA in which chromosomes are multicondition rules based on value intervals or ranges applied to ANN inputs. These values are obtained from training patterns. The work of Wong and Leung (2000) applies GP to the extraction of database knowledge and presents LOGENPRO (logic grammar based genetic programming). This first approach shows the advantages of GP as an extraction technique in knowledge discovery in databases (KDDB). More recently, Engelbrecht, Rouwhorst, and Schoeman (2001) applied the GP technique and decision trees in order to extract database knowledge by designing an algorithm called BGP (building-block approach to genetic programming).

One of the most remarkable aspects in each of these discovery processes is the optimization of the rules obtained from the analysis of the ANN.
It should be noted that the discovered rules may have redundant conditions, many specific rules may be contained in general rules, and many of the discovered logic expressions could be simplified by writing them differently. Therefore, rule optimization consists of the simplification and elaboration of symbolic operations with the rules. Depending on the discovery method and the type of rules that were obtained, various optimization techniques can be applied. As such, they can be classified into two main groups: a posteriori methods and implied optimization methods. The first group usually consists of a syntactic analysis algorithm that is applied to the discovered rules in order to simplify them. For instance, Duch et al. (2001) used Prolog as a programming language to carry out the postprocessing of the obtained rules. The results are optimal linguistic variables that serve to find simplified rules that use those variables. Implied optimization methods are techniques used in rule discovery algorithms, which intrinsically cause the algorithm to produce increasingly improved rules.
3 Development Premises

The most important aspect in the design of an extraction algorithm is its modus operandi. As mentioned before, the extraction techniques can be classified into three main groups: decompositional, pedagogical, and eclectic techniques. The algorithm's design must aim to satisfy the four features of Tickle et al. (1998): independence from the ANN's architecture, independence from the ANN's training algorithm, correctness, and a high level of expression. According to the first two characteristics, the system must be able to deal with the ANN as a black box structure, which means that it must be an algorithm of general application. According to the last two features, the algorithm must be able to work with symbols that represent the knowledge expressed in the shape of rules. The algorithm needs to use symbolic models.

This letter applies GP as the algorithm that builds a syntactic tree that reflects, while imitating the functioning of an ANN, a set of rules. We used symbolic regression applied to input-output patterns, which are sequences of inputs applied to an ANN and of outputs produced by it. The result is a set of rules and formulas that imitate the ANN's behavior. This technique is pedagogical because the ANN is treated like a black box: it is not necessary to know the ANN's internal functioning. However, a rule extraction algorithm that is able to work with black box structures should also be able to implement some kind of mechanism that allows the a priori incorporation of knowledge about that black box, which would considerably reduce the search space of the system's rules. These structures are known as the "gray box." The use of GP makes this possible, because it allows us to control and determine the number and type of terminal and nonterminal elements that take part in the search process.
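To make the symbolic-regression idea concrete, the following is a minimal, assumption-laden sketch. Everything in it is invented for illustration: black_box stands in for the trained ANN, the function and terminal sets are tiny, and evolution uses simple truncation selection with subtree mutation only, unlike the richer operator sets and GP configuration described in this article.

```python
import random
import operator

random.seed(0)

# Hypothetical stand-in for the trained ANN treated as a black box.
def black_box(x):
    return 2.0 * x + 1.0

FUNCS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
TERMS = ['x', 1.0, 2.0, 3.0]

def random_tree(depth=3):
    """Grow a random expression tree encoded as nested tuples."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    op = random.choice(list(FUNCS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, a, b = tree
    return FUNCS[op](evaluate(a, x), evaluate(b, x))

# Fitness: error of a candidate rule on "inputs---ANN outputs" patterns.
PATTERNS = [(x, black_box(x)) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

def error(tree):
    return sum(abs(evaluate(tree, x) - y) for x, y in PATTERNS)

def mutate(tree):
    """Replace a random subtree with a fresh random subtree."""
    if random.random() < 0.2 or not isinstance(tree, tuple):
        return random_tree(2)
    op, a, b = tree
    if random.random() < 0.5:
        return (op, mutate(a), b)
    return (op, a, mutate(b))

def evolve(pop_size=60, generations=40):
    pop = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        survivors = sorted(pop, key=error)[:pop_size // 2]  # truncation selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=error)

best = evolve()
print(best, error(best))
```

The evolved tuple tree is directly readable as an arithmetic rule imitating the black box; the article's system additionally uses logical and relational operators so that the extracted trees read as decision rules.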
If we know, for example, that the ANN carries out classification tasks, we can limit the type of terminal nodes to boolean nodes, thus avoiding floating-point values. The advantage is enormous, because we eliminate a priori all the possible combinations of mathematical operations. We shall therefore apply the eclectic technique rather than the pedagogical one.

3.1 ANN Generalization Capacity. Another fundamental aspect of the design of an extraction mechanism is that it should be expandable in order to capture the most characteristic and successful feature of the ANN: its generalization capacity. We must extract the knowledge that is acquired by the ANN from the training patterns and from test cases that were not presented during the learning process. This means that we need to design an algorithm for the creation of new test cases that will be applied to the ANN, allowing the analysis of its behavior in the face of new situations and of its functional limitations. This aspect is of particular relevance in the applicability of ANNs to certain fields with a high practical risk, such as the monitoring of a nuclear power plant or the diagnosis of certain medical
conditions. It has not been clearly used in the scientific literature about ANNs and rule extraction until now.

4 Description of the System

We will carry out a process of ANN rule extraction based on input-output patterns. The inputs are applied to the ANN, and the resulting output is the basis of the "inputs training set—outputs ANN" patterns. The inductive algorithm (GP) will use these patterns in order to quantify the accuracy achieved by the rules that were obtained, through the pair "outputs (ANN)—obtained outputs (rules)." Once the ANNs have been designed and trained, the same training and test values can be used in order to generate a second data pattern that will search for the rules acquired by the ANN during the training process. Figure 1 shows a general chart of the process.

A very interesting aspect of this process is the fact that we can incorporate the knowledge extracted from the "expert" into the extraction algorithm as a priori or innate knowledge. The insertion of knowledge into the neural network has been known since the KBANN algorithm (Towell & Shavlik, 1994), which works by translating a domain theory, consisting of a set of propositional rules, directly into the neural network. In this article, however, knowledge insertion is different. The knowledge about the problem and the ANN can be reflected if we have some data about the ANN's structure, the
Figure 1: Rule extraction process of the ANN.
Figure 2: Rule extraction process.
types of data it will treat, or the kind of problems it will solve, by means of choosing the operators and the types of data the algorithm is going to work with. The inductive algorithm's search space is therefore more or less reduced, depending on the amount of knowledge we possess about the ANN and the kind of problem it is dealing with. Figure 2 shows the extraction process diagram that was followed in the development of the rule extraction system.

4.1 Extraction of the Generalization Capacity. With regard to the extraction of the generalization capacity, we have designed an algorithm (the algorithm for the creation of new patterns) that allows us to create new test cases. Through these cases, we can extract rules and expressions from those parts of the ANN's internal functioning that have not been explored by the learning patterns. We are not certain whether the behavior of the ANN matches the behavior of the reality outside; we can only hope so. The only purpose of the new test cases is to better mimic the ANN model. Once good accuracy is obtained, the algorithm for creating new patterns is applied, thus generating a series of new input patterns that are applied to the ANN, giving us the relevant outputs. These new "new inputs—outputs ANN"
patterns are added to the training patterns and renew the rule extraction process. It should be noted that in order to continue with the extraction process, the first step is to reassess each individual of the population in order to calculate its new fitness. Once this has been achieved, the discovery process can continue.

In order to avoid an unlimited number of test cases, the number of new test cases is set by establishing an elimination rule for test cases: if there is a new test case where the ANN's outputs coincide with those produced by the best combination of obtained rules, that test case is eliminated (the obtained rules are supposed to have acquired the knowledge represented by that test case). The result is an iterative process: a number of generations or simulations of the rule extraction algorithm are run, new test cases are generated, and those test cases whose result coincides with the ANN's result are eliminated. Once all the relevant test cases are eliminated, new cases are created in order to fill in the gaps left by the deleted ones. Each new case is compared with the existing ones in order to determine if it is already present in the test cases. If so, it is eliminated, a new case is created, and the process starts all over, running the extraction algorithm during N generations and repeating once again the elimination and creation of test cases. This process will end when the user determines that the obtained rules are the appropriate ones. Figure 3 shows a general chart of the process.

The following parameters are involved in the algorithm for the creation of new patterns:

• Percentage of new patterns: The user must indicate the number of new patterns as a percentage. If the number of training cases is 100, for instance, and the user indicates 25% of new patterns, this means that 125 cases are used for the rule extraction (100 training cases and 25 new cases).
These new cases are based on the learning patterns that were used for the ANN training, and they are checked to be different from the existing ones.

• Probability of change: The process of creating new test cases focuses on generating new sequences from the training data. A sequence of inputs is chosen at random from the training file. For each input, either the same value is used or a new one is generated at random, following a certain probability. This technique is similar to the EC mutation process. The main advantage of this procedure is that by modifying only punctual values of the input sequences of the training file, the newly created example cases are less likely to be very different (in the search space) from the training cases, and there will be an increased production of new representative sequences and a decreased generation of cases that are inconsistent with the problem.

A more formal specification of the algorithm for the creation of new patterns is the following, where S is the pattern set used in the knowledge
Figure 3: Generalization rule extraction process.
process, and Pc the probability of change:

    insert the set "inputs_training_set---outputs_ANN" in S
    repeat
        while (not full S) do
            choose a pattern from the training set
            change its values with a probability of Pc
            if (new pattern is not present in S) then
                evaluate the new pattern with the best rule obtained
                if (output of ANN is different from output of the rule) then
                    insert the new pattern in S
                end if
            end if
        end while
        run the discovery algorithm N generations
        for (each pattern in S) do
            if (output of ANN == output of the rule) then
                eliminate this pattern from S
            end if
        end for
    until (rules are good enough)
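The pattern-creation step of this loop can be sketched in a few lines of Python. The sketch below is illustrative only: the "trained ANN," the current best rule, and the training inputs are all hypothetical stand-ins, and only the creation of informative new cases (where ANN and rule disagree) is shown, not the GP generations or the elimination pass.

```python
import random

random.seed(1)

# Hypothetical stand-ins for the trained ANN and the best rule found so far.
def ann(x):                       # black-box classifier
    return int(x[0] + x[1] > 1.0)

def best_rule(x):                 # current symbolic rule (imperfect on purpose)
    return int(x[0] > 0.5)

training_inputs = [(random.random(), random.random()) for _ in range(20)]

def new_patterns(S_size, Pc=0.25, max_tries=10000):
    """Fill S up to S_size with cases on which the ANN and the rule disagree."""
    S = [(x, ann(x)) for x in training_inputs]   # inputs_training_set---outputs_ANN
    seen = {x for x, _ in S}
    tries = 0
    while len(S) < S_size and tries < max_tries:
        tries += 1
        base = random.choice(training_inputs)
        cand = tuple(random.random() if random.random() < Pc else v
                     for v in base)              # change each value with prob. Pc
        if cand in seen:
            continue                             # must differ from existing cases
        if ann(cand) != best_rule(cand):         # keep only informative cases
            S.append((cand, ann(cand)))
            seen.add(cand)
    return S

S = new_patterns(S_size=25)   # 20 training cases + 25% new patterns
```

After N generations of the discovery algorithm, the elimination pass would simply drop from S every pattern on which the rule now matches the ANN.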
4.2 Rule Optimization. The technique used in our GP implementation for rule optimization is called depth penalty (the parsimony parameter). Conceptually, when the adaptation level of an individual of the GP (a tree) is evaluated, its adaptation level decreases to a certain extent if the tree possesses a large number of terminal and nonterminal nodes. An individual with a slightly worse fitness but few nodes may thus be rated better than an individual with a good fitness but plenty of nodes. This is a way of dynamically favoring the existence of simple individuals. Therefore, if we are looking for rules (syntactic trees), we are also intrinsically stimulating the appearance of simple (optimized) rules.

5 Parameter Adjustment

The next step in rule extraction is the numeric calculation of the parameters involved in the algorithm. Experimentation is used to adjust the GP-related parameters. The implemented parameters are shown in Table 1.

Table 1: GP Parameters.

Parameter                          Options
Population creation algorithm      Full / Partial / Intermediate
Selection algorithm                Tournament / Roulette / Stochastic remainder / Stochastic universal / Deterministic sample
Mutation algorithm                 Subtree / Punctual
Elitist strategy                   Yes-No
Use of nondestructive crossover    Yes-No
Crossover rate                     0-100%
Mutation rate                      0-100%
Nonterminal selection probability  0-100%
Population size                    Number of individuals
Parsimony level                    Penalty value
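The depth penalty of section 4.2 can be made concrete with a short sketch. The tuple tree encoding and the additive penalty form are assumptions for illustration; the article specifies only that a parsimony level penalizes node-heavy trees.

```python
def node_count(tree):
    """Count terminal and nonterminal nodes of a tuple-encoded GP tree."""
    if not isinstance(tree, tuple):
        return 1                  # a terminal (variable or constant)
    return 1 + sum(node_count(child) for child in tree[1:])

def penalized_fitness(raw_error, tree, parsimony=0.0001):
    # Larger trees get a worse (higher) adjusted error.
    return raw_error + parsimony * node_count(tree)

small = ('+', 'x', 1.0)                          # 3 nodes
big = ('+', ('*', 'x', 'x'), ('+', 1.0, 0.0))    # 7 nodes
print(node_count(small), node_count(big))        # → 3 7
```

With equal raw error, the smaller tree receives the better penalized fitness, which is exactly the bias toward simple rules described above.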
When specifying terminal and nonterminal operators, we must establish a typology: each node has a type, and the nonterminal nodes demand a particular type from their offspring (Montana, 1995). This tactic guarantees that the generated trees will satisfy the grammar specified by the user. Also, both sets of specified operators must fulfill two requirements, closure and sufficiency; that is, it must be possible to build correct trees with the specified operators and to express the solution to the problem (the expression we are searching for) by means of those operators. The possibility of assigning a type to each node takes advantage of the domain knowledge about the type of the output function and the type of each input. By using node typing, we allow GP to create and combine only expressions that are grammatically correct. This means that GP will not allow, for example, arithmetical or relational expressions with boolean variables or logical operations with continuous variables. The structure of the expressions developed by GP will follow the syntactic rules provided by the types of the nodes and therefore will contain some knowledge of the domain about the types of inputs and outputs and the operations allowed.

An empirical analysis must be carried out in order to adjust the ideal parameters, trying combinations so as to adjust the various values progressively until the best results are achieved. The following parameter adjustment is related to the terminal and nonterminal nodes that will take part in the algorithm's execution. Our previous knowledge about the ANN, together with the types of problems for which it was designed, will be of consequence here. The parameters shown in Table 2 must be selected.

Table 2: Parameters of the Terminal and Nonterminal Nodes in the GP Algorithm.

Parameter              Options
Logical operators      AND, OR, NOT
Relational operators   >, <, =, <>, >=, ...
Arithmetic functions
Decision functions
Constants
Input variables
Nature of the outputs

The table shows a wide variety of possible configurations. Logical and relational operators are included in order to make boolean expressions for either decision rules or classification tasks. Arithmetic functions are needed for evolving complex mathematical expressions. If-then-else operators are included in order to develop decision expressions. Constants are also needed in all of these expressions. More information about the ANN means an earlier establishment of the values of the parameters. Otherwise, experimentation is necessary in order to determine the best values for each ANN.

6 Methodology

A methodology can be formalized by using the aforementioned steps and by establishing differences between each individual stage. We need to start the process with the trained ANN and the training and test files. If we know the type of the input-output values and the relation between them, we can limit the search space of the rule extraction process by introducing it as "prior knowledge." The stages of the methodology are the following:

1. Design of the learning patterns. Starting from the training and test files, we build the ANN's inputs and outputs and obtain input-output patterns.

2. GP parameter adjustment. Adjustment of the values of the parameters related to evolutionary simulation and to terminal and nonterminal nodes.

3. Execution and comparison of the accuracy level of the discovered rules. This stage consists of running the algorithm with the obtained rules, comparing their operability by means of other extraction methods, or examining certain learning cases and their correspondence to the obtained rules.

If the obtained rules do not satisfy the expected results or do not fit with the learning patterns, we will have to modify either the learning patterns or the algorithm's parameters until the expected results are achieved. The stages are graphically shown in Figure 4.

7 Results

ANNs have proven very useful at tasks where, starting from an input sequence, it must be decided whether these values correspond to a given classification. The system is tested with a series of examples of classification problems.
The first example will serve to test the system's functioning in multiple classification tasks and has been observed and carefully analyzed in fields such as ANNs and rule discovery since Fisher first documented it
Figure 4: Methodology of rule extraction process by GP.
in 1936 (Fisher, 1936). The other examples test the system with medical diagnosis problems and problems like poisonous mushrooms and the Monk's problems.

7.1 The Iris Flower. This problem consists of identifying a plant's species: Iris setosa, Iris versicolor, or Iris virginica. This case has 150 examples with four continuous variables that represent four features concerning the flower's shape. The four input values represent measurements in millimeters of the following characteristics:

(X1) Petal width
(X2) Petal length
(X3) Sepal width
(X4) Sepal length

The learning patterns have four inputs and three outputs. The three outputs are boolean outputs that represent the Iris species. By using three boolean outputs instead of a multiple output, we can provide additional
Figure 5: Original distribution of all three classes (real values).
information about the reliability of the system's outputs. Due to the outputs' intrinsic features, only one output should have the true value that represents the type of flower that has been classified, while the others have a false value. Therefore, if two or more outputs have true values, or if all of them are false, we may conclude that the value classified by the system is an error and the system cannot classify that case.

Figure 5 shows the space distribution for variables X1 and X2. According to authors such as Duch et al. (2001), these variables achieve a good classification for all three classes. A success rate of 98% is achieved by using only these two variables, but the others are still needed for 100% classification. Therefore, these two variables are an important reference point in the comparison of the graphically obtained results.

The values that correspond to the four input variables have been normalized in the [0-1] interval and are treated by the ANN. The rule extraction algorithm has been applied to these normalized values in order to compare the results to other knowledge extraction methods. The following results were achieved through normalized values; this means that the rules will be applied only to normalized input values.

7.1.1 GP Rules Extraction. Various parameter configurations were tried to solve this problem. We tried several variations of these parameters, and those that were most subject to change were the population size, the crossover rate, the mutation rate, and the selection algorithm. The best results were obtained with the parameters shown in Table 3.

Table 3: Parameters of the Algorithm for the Iris Flower Problem in GP Rules Extraction.

Parameter             Value
Constants             20 random values in the range [0-1]
Variables             X1, X2, X3, X4
Relational operators  <, >
Logical operators     AND, OR, NOT
Selection algorithm   Tournament
Crossover rate        95%
Mutation rate         4%
Population size       500 individuals
Parsimony level       0.0001

The following rule is obtained from the original (normalized) patterns that correspond to the classification of Iris setosa, and produces a success rate of 100%:

    (X1 < 7.852)

This rule corresponds to the following confusion matrix:

    P = ( 50   0 )
        (  0 100 )

In this matrix, the markers of the columns correspond to the real classes, and the file markers correspond to the assigned classes. Fifty cases were correctly classified as Iris setosa, and 100 cases were correctly classified as not Iris setosa.

The following rules are obtained from the original (normalized) patterns that correspond to the classification of Iris versicolor and produce a success rate of 100%:

    (((29.801 > X3) OR (36.335 < X2 < 50.811)) AND
     (((15.26 < X1 < 18.042) OR ((8.4 < X1 < 13.165) OR (36.335 < X2 < 49.797))) AND
      ((X3 > X1) OR (16.932 > X1))))

These rules correspond to the following confusion matrix:

    P = ( 50   0 )
        (  0 100 )

The following rules are obtained from the original (normalized) patterns that correspond to the classification of Iris virginica and produce a success
Table 4: Summary of Rule Extraction Results for the Iris Data Set.

Method       Accuracy   Reference
Our method   99.33%     —
ReFuNN       95.7       Kasabov (1996)
C-MLP2LN     98.0       Duch et al. (2001)
SSV          98.0       Duch et al. (2001)
ANN          98.67      Martínez and Goddard (2001)
Grobian      100.0      Browne, Düntsch, and Gediga (1998)
GA+NN        100.0      Jagielska, Matthews, and Whitfort (1996)
NEFCLASS     96.7       Nauck, Nauck, and Kruse (1996)
FuNe-I       96.0       Halgamuge and Glesner (1994)
rate of 99.33%:

(((X1 > X2) OR (X2 > 0.7182)) AND ((X2 > X4) OR (((50.991 < X2 < 52.785) OR (X4 > 71.321)) OR (X1 > X3))))

These rules correspond to the following confusion matrix:

    P = | 50   1 |
        |  0  99 |

The error of the latter expression produces, as a system output, the simultaneous classification of Iris versicolor and Iris virginica (a "false-true-true" output), which is an invalid output and can be eliminated. The system thus supports a technique that detects when it makes mistakes: when no classification is produced, or when more than one is produced. In this case, it is counted as an error (because the case cannot be classified), and the final success rate is 99.33%. Table 4 draws an accuracy comparison with other extraction techniques.

In order to carry out a detailed analysis of the behavior of the obtained rules, we have built an analysis file that contains all the possible input values for the four variables, at regular intervals. The intervals are 7 mm for X1, 2.5 mm for X2, 4.4 mm for X3, and 7.9 mm for X4; the analysis file contains 14,641 examples, so we can analyze the whole range of classifications carried out by the rules (see Figure 6).

If we group the three classes into the same chart (see Figure 7) and compare them to those from the training file, we can see that the rule extraction system gathers the input values that belong to each classification and isolates them from the values that belong to the other classifications. The intersection areas are those where the system produces incorrect output values.

7.1.2 ANN Rules Extraction. Martínez and Goddard (2001) have proved that a maximum success of 98.67% (two errors) is achieved with six neurons
Figure 6: Distribution of the three classes. (A) Iris setosa. (B) Iris versicolor. (C) Iris virginica. (Axes as in Figure 5: X2 in mm, 0-70, horizontal; X1 in mm, 0-25, vertical.)
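The analysis file described above is simply the Cartesian product of evenly spaced values, one axis per input variable; the interval sizes quoted correspond to 11 values per variable, giving 11^4 = 14,641 rows. A minimal sketch (the function name and the normalized sampling are our assumptions):

```python
from itertools import product

def analysis_file(steps_per_var=11, n_vars=4):
    # evenly spaced normalized values in [0, 1], one axis per input variable
    values = [i / (steps_per_var - 1) for i in range(steps_per_var)]
    return list(product(values, repeat=n_vars))

grid = analysis_file()
assert len(grid) == 11 ** 4 == 14641  # matches the row count quoted in the text
```

The same construction, with different step counts, underlies the analysis files used for the other data sets below.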
Figure 7: Distribution of the three classes and of those that proceed from the training. (Legend: Original Distribution — Setosa, Versicolor, Virginica; Incorrect Output — Setosa, Versicolor, Virginica.)
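The error-detection convention described earlier — a classification is accepted only when exactly one of the three boolean outputs is true — can be sketched as follows (the function name and encoding are ours):

```python
def classify(outputs):
    # outputs: three booleans for (setosa, versicolor, virginica)
    if sum(outputs) == 1:
        return outputs.index(True)   # index of the single true output
    return None                      # none or several true: counted as an error

assert classify([True, False, False]) == 0     # valid: exactly one class asserted
assert classify([False, True, True]) is None   # "false-true-true" outputs are rejected
```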
in the hidden layer. The system put forward by Rabuñal (1999) decreases the number of hidden neurons to five and, by using hyperbolic tangent activation functions and a threshold of 0.5 in the output neurons, also reaches a success rate of 98.67%. In the Iris setosa cases, no error is obtained, whereas in the Iris versicolor and Iris virginica cases two errors are made; they are not detected, however, because the ANN produces a valid classification (only one true output) that is nevertheless erroneous. Figure 8 shows the obtained architecture.

By applying the rule discovery system with the previously mentioned parameters, we obtain the following rules:

• The rule obtained from the inputs and outputs of the ANN that corresponds to the classification of Iris setosa produces a fidelity of 100% correct answers:

  (X1 < 7.79)

• The rule obtained from the inputs and outputs of the ANN that corresponds to the classification of Iris versicolor produces a fidelity of 100% correct answers:

  ((7.23 < X1 < 13.29) OR (((X3 > X2) OR ((X4 > 60.379) AND (X2 > X1))) AND (36.6804 < X2 < 50.149)))

• The rule obtained from the inputs and outputs of the ANN that corresponds to the classification of Iris virginica produces a fidelity of 100%
Figure 8: Obtained ANN: four inputs (X1-X4), five hidden neurons, and three output neurons (Setosa, Versicolor, Virginica).
correct answers:

  (((X1 > X3) AND (X1 > X2)) OR (((13.742 < X1) AND ((24.1868 > X3) OR (50.225 < X2))) AND (46.8303 < X2)))

The analysis of the results shows the distributions carried out by the ANN and by the obtained rules through the use of the analysis file (see Figure 9). Since we obtained a 100% success rate on the three distributions from the ANN, the resulting expressions show a 100% fidelity to the ANN behavior. The figures of these distributions are very similar to those of Figure 6: the first two distributions are almost identical to those of Figure 6, and the third distribution is similar in the cases of the training set. The results obtained with the analysis file are the fidelity values of the rules, because the outputs of the analysis file are the outputs produced by the ANN with the inputs of the analysis file.

7.1.3 ANN Generalization. Figures 10A through 10C show the distribution of the three species that result from the application of the analysis file to the trained ANN. If we compare the distributions produced by the ANN in Figures 10A through 10C to those obtained by the rule extraction system with the initial data (see Figure 6) and to the one obtained from the rules from the ANN (see Figure 9), there may seem to be little similarity. This is because the rules provided by the extraction system have very geometrical shapes (since they use rules of the if-then-else type on ranges). As a consequence, the obtained rules provide a perfect fit in the areas that
Figure 9: Distribution of the three classes produced by the rules. (A) Iris setosa. (B) Iris versicolor. (C) Iris virginica.
correspond to training examples but a very poor fit with regard to the ANN's global behavior.

If we combine the rule extraction mechanism with the algorithm for new pattern creation, the values with the best behavior are between 20% and 35% of new patterns and between 20% and 25% of change probability. With regard to the new-test-case percentage, it should be noted that if this percentage is too high (over 60%), the extraction algorithm does not improve the obtained rules: when the new values are generated, the problem is diverted to an extent that makes the algorithm look for contradictory rules, passing continuously from one kind of rule to another. The underlying idea of the whole process is to take the training files as the core, since they concentrate an accumulation of representative values for each classified value. For example, when we used a value of 75% of new cases for this parameter, the learning patterns were divided into 80% of cases representing Iris versicolor, 12% Iris setosa, and 8% Iris virginica. If the rules always classify Iris versicolor with a success level of 80%, improving on this local minimum is a complicated task. Something similar occurs with the change probability parameter: if the probability is high, the new values of the parameters depend too much on randomness. The classifications that represent the bigger areas of the search space are then strongly favored, and the rules obtained from the extraction process focus on those classifications.

The dynamic creation method (i.e., using the algorithm for creating new patterns) obtained a dynamic success of 96.27% for Iris setosa, 89.17% for Iris versicolor, and 93.13% for Iris virginica. The success is dynamic because it depends on the new test cases that are generated dynamically, and it is measured at the moment when the process is stopped.
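A rough sketch of the pattern-creation step as we read it (the paper does not pin down the exact mutation scheme, so the details below are assumptions): take a training input, change each component with the given probability, and label the result by querying the ANN itself.

```python
import random

def create_patterns(training_inputs, n_new, change_prob, ann):
    # n_new: how many new patterns to create (the "% of new patterns" parameter)
    # change_prob: per-component mutation probability (the "change probability")
    created = []
    for _ in range(n_new):
        candidate = list(random.choice(training_inputs))
        for i in range(len(candidate)):
            if random.random() < change_prob:
                candidate[i] = random.random()   # fresh normalized value in [0, 1]
        created.append((candidate, ann(candidate)))  # the ANN provides the label
    return created
```

Per the text, values of roughly 20-35% new patterns and 20-25% change probability behave best for the iris problem.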
These values are the fidelity values obtained in the dynamic training. Obviously, these fitness values do not correspond to the real ones, but they offer a representative idea of how the process develops with the training set and the extended test cases. The analysis file is applied to the obtained rules in order to obtain the real fitness values. With the rules obtained from only the training patterns, a real fidelity of 63.34% is achieved for Iris setosa, while 56.97% corresponds to Iris versicolor and 57.23% to Iris virginica. The analysis file has been built using as outputs the outputs of the ANN, so the accuracy obtained with the rules on this file is the fidelity of these rules.

The normalized rules obtained for the classification of Iris setosa with the algorithm for the creation of new patterns produce a real success (fidelity) of 95.45% on the analysis file:

  (((-0.1941) > (X2 - X3)) AND ((X3 - X2) > (0.6381 * X1)))
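Fidelity, as used throughout this section, is simply the fraction of analysis-file inputs on which the rule reproduces the ANN's output. A sketch (the function name and the stand-in rule/network are ours):

```python
def fidelity(rule, ann, inputs):
    # fraction of inputs on which the extracted rule agrees with the ANN
    agree = sum(rule(x) == ann(x) for x in inputs)
    return agree / len(inputs)

# toy check with a stand-in rule and "ANN": they disagree only on 0.45
inputs = [0.0, 0.45, 0.6, 1.0]
assert fidelity(lambda x: x > 0.5, lambda x: x > 0.4, inputs) == 0.75
```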
The normalized rules obtained for the classification of Iris versicolor produce a fidelity of 88.81% on the analysis file:

  (((X4 % 0.8032) > X3) AND (((X1 + (X2 % 0.7875)) > X3) AND (X2 < (((X4 % 0.7875) - (0.4966 * X1)) % 0.8896))))
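The % operator in these rules denotes protected division, a standard GP device that keeps evolved expressions from failing on a zero divisor. The paper does not state its exact protected value, so the fallback of 1.0 below (Koza's convention) is an assumption:

```python
def pdiv(a, b):
    # protected division "%": ordinary division, except a (near-)zero divisor
    # returns a fixed safe value instead of raising ZeroDivisionError
    return a / b if abs(b) > 1e-9 else 1.0

assert pdiv(1.0, 2.0) == 0.5
assert pdiv(1.0, 0.0) == 1.0   # assumed protected value
```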
The normalized rules obtained for the classification of Iris virginica produce a fidelity of 77.88% on the analysis file:

  ((X4 < ((X1 - (-(0.4925 * (X2 * 0.5280)))) - (X4 - X2))) AND ((X3 < X2) OR (X1 > (X3 - X2))))
These three rules are normalized, and their variables take only values between 0 and 1. The distribution of the rules obtained (not normalized) can be seen in Figures 10D through 10F. The distributions obtained from the ANN with GP and dynamic creation of patterns (see Figures 10D to 10F) are very similar to the actual figures of the ANN (see Figures 10A to 10C). These figures are very similar because of the high success rate obtained in the extraction of knowledge from the ANN. This leads us to believe that the final behavior of the rules is very similar to that of the ANN and that the algorithm has performed well.

Table 5 draws a comparative chart of the fidelities achieved through the exclusive use of the learning patterns and of those achieved through the use of the algorithm for the creation of new patterns. It shows the accuracy values produced by the rules obtained with both methods during the analysis of the file patterns (14,641 examples). The fidelity obtained with the algorithm for the creation of new patterns is better than the fidelity obtained without this dynamic creation. This is because with this algorithm, new patterns not present in the training set are used for training the rules; without dynamic creation, no training is done on inputs outside the training set. Therefore, without dynamic creation, the rules perform poorly on patterns different from those present in the training set (but present in the analysis file).

7.2 Wisconsin Breast Cancer Data. The Wisconsin cancer data set is a small medical database from UCI (Merz & Murphy, 2002). It contains 699 entries, with 458 benign cases (65.5%) and 241 malignant cases (34.5%). Each instance has 9 attributes with integer values in the range 1-10, normalized to the 0-1 range (we may assume that they are continuous values):

(X1) Clump thickness            1-10
(X2) Uniformity of cell size    1-10
(X3) Uniformity of cell shape   1-10
(X4) Marginal adhesion          1-10
Figure 10: Distribution of the three classes produced by the ANN: (A) Iris setosa, (B) Iris versicolor, (C) Iris virginica. Distribution of the three classes produced by the rules obtained with the algorithm for the creation of new patterns: (D) Iris setosa, (E) Iris versicolor, (F) Iris virginica.

Table 5: Fidelities Achieved Through Exclusive Use of the Learning Patterns and Use of the Algorithm to Create New Patterns.

                             Fidelity
Method                       Iris setosa   Iris versicolor   Iris virginica
Without dynamic creation     63.34%        56.97%            57.23%
With dynamic creation        95.45%        88.81%            77.88%
Table 6: Parameters of the Algorithm for the Wisconsin Breast Cancer Data Problem.

Parameter             Value
Constants             50 random values in the range [0-1]
Variables             X1, X2, X3, X4, X5, X6, X7, X8, X9
Relational operators  <, >, <>, =
Logical operators     AND, OR, NOT
Decision function     If-then-else over boolean values
Selection algorithm   Tournament
Crossover rate        92%
Mutation rate         8%
Population size       1000 individuals
Parsimony level       0.001
(X5) Single epithelial cell size   1-10
(X6) Bare nuclei                   1-10
(X7) Bland chromatin               1-10
(X8) Normal nucleoli               1-10
(X9) Mitoses                       1-10
The GP parameters that produce the best results are shown in Table 6. The following rule, obtained from the original (normalized) patterns, produces a fitness of 99.56%:

(((IF (IF (IF (X1 > X7) THEN (X2 > 0.4788) ELSE (0.0000 < X7 < 0.3719)) THEN (X3 > 0.3780) ELSE ((0.2920 < X3 < 0.37192) AND (0.0000 < X6 < 0.6564))) THEN (NOT ((0.0669 < X4 < 0.3719) AND (NOT (0.2920 < X6 < 0.3719)))) ELSE (X1 > 0.8431)) OR (IF ((X3 > 0.4788) AND (X7 > 0.3780)) THEN (NOT (0.0000 < X4 < 0.3719)) ELSE ((X3 <> X7) AND (NOT (0.0669 < X1 < 0.6564))))) OR ((IF (IF (IF (0.5996 < X1 < 0.7400) THEN (X1 > X7)
ELSE (NOT (0.0000 < X6 < 0.6564))) THEN (IF (IF (X6 > 0.8431) THEN (X3 <> X4) ELSE ((0.0000 < X3 < 0.3719) AND (0.0000 < X6 < 0.6564))) THEN (X7 > 0.3780) ELSE (0.0000 < X1)) ELSE FALSE) THEN (IF (X7 > 0.3781) THEN (X7 > 0.3781) ELSE (X1 > X3)) ELSE (X2 > 0.4788)))

This rule corresponds to the following confusion matrix, calculated on the test file and including missing values:

    P = | 239   4 |
        |   2 454 |

Table 7 draws an accuracy comparison with other extraction techniques.

We use the system proposed by Rabuñal (1999); although several ANNs have been trained, we have not been able to achieve 100% correct classifications. With seven neurons in one hidden layer and a linear activation function, we obtain only 98.28% correct classifications. With the same architecture but hyperbolic tangent functions, the success rate reaches 98.68% correct classifications.

In order to carry out a detailed analysis of the behavior of the obtained rules, we have built an analysis file with all the possible input values for the nine variables, taking regular intervals of 0.25 between 0 and 1 (normalized inputs). With these intervals, the analysis file contains 1,953,125 examples, and the whole possible range of classifications carried out by the rules can be analyzed. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. If we apply the analysis file to the rule set, we obtain a fidelity of 64.56% correct classifications.

If we combine the rule extraction mechanism with the algorithm for new pattern creation on the ANN, the best behavior is obtained with between 15% and 35% of new patterns and between 35% and 40% of change probability.
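The row count of this analysis file follows directly from the sampling: five values (step 0.25 in [0, 1]) for each of the nine attributes.

```python
# number of rows in the analysis file: 5 evenly spaced values per attribute,
# raised to the number of attributes
values_per_attr = len([i * 0.25 for i in range(5)])   # 0, 0.25, 0.5, 0.75, 1
assert values_per_attr ** 9 == 1_953_125              # matches the text
```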
Table 7: Summary of Rule Extraction Results for the Wisconsin Breast Cancer Data Set.

Method          Accuracy   Reference
Our method      99.56%     —
C-MLP2LN        99.0       Duch et al. (2001)
IncNet          97.1       Jankowski and Kadirkamanathan (1997)
k-NN            97.0       Duch et al. (2001)
Fisher LDA      96.8       Ster and Dobnikar (1996)
MLP with BP     96.7       Ster and Dobnikar (1996)
LVQ             96.6       Ster and Dobnikar (1996)
Bayes           96.6       Ster and Dobnikar (1996)
Naive Bayes     96.4       Ster and Dobnikar (1996)
DB-CART         96.2       Shang and Breiman (1996)
LDA             96.0       Ster and Dobnikar (1996)
LFC, ASI, ASR   95.6       Ster and Dobnikar (1996)
CART            93.5       Shang and Breiman (1996)
The best rule obtained (normalized) through the rule extraction mechanism combined with the algorithm for new pattern creation is the following:

(((X2 > 0.3009) AND (X1 > X6)) OR ((((X1 > 0.5098) OR (X8 > X5)) OR (X6 > 0.5098)) AND ((X3 > 0.3009) OR (X6 > X5))) OR (X3 > 0.5098))

This rule produces a success of 98.83% on the ANN outputs with the training inputs and 71.91% (fidelity) on the analysis file. We obtained another normalized rule that provides less accuracy but is much simpler:

X6 > ((0.8928 - X3) - X2)

It produces an accuracy of 98.44% on the ANN outputs with the training inputs and 70.47% on the analysis file. The interpretation of this rule is the following:

"Bare nuclei" > (9.035 - "Uniformity of cell shape" - "Uniformity of cell size")
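The simpler normalized rule transcribes directly into code; this sketch evaluates it on normalized inputs in [0, 1] (the function name is ours):

```python
def simple_rule(x2, x3, x6):
    # normalized inputs in [0, 1]: X2 = uniformity of cell size,
    # X3 = uniformity of cell shape, X6 = bare nuclei
    return x6 > ((0.8928 - x3) - x2)

assert simple_rule(0.9, 0.9, 0.1) is True    # 0.1 > -0.9072
assert simple_rule(0.0, 0.0, 0.5) is False   # 0.5 > 0.8928 does not hold
```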
Table 8 draws a comparative chart of the fidelities achieved by using only the learning patterns and of those achieved by using the algorithm for creating new patterns. It shows the accuracy values produced by the rules obtained with both methods during the evaluation of the analysis-file patterns (1,953,125 examples).

7.3 Hepatitis. This is another small medical database from UCI (Merz & Murphy, 2002). It contains 155 samples that belong to two different classes:
Table 8: Comparative Fidelities Achieved Through Exclusive Use of the Learning Patterns and Use of the Algorithm to Create New Patterns.

Method                     Fidelity
Without dynamic creation   64.56%
With dynamic creation      71.91%
32 "die" cases and 123 "live" cases. There are 19 attributes—13 binary and 6 attributes with 6 to 9 discrete values:

(X1)  AGE:              10, 20, 30, 40, 50, 60, 70, 80
(X2)  SEX:              male, female
(X3)  STEROID:          no, yes
(X4)  ANTIVIRALS:       no, yes
(X5)  FATIGUE:          no, yes
(X6)  MALAISE:          no, yes
(X7)  ANOREXIA:         no, yes
(X8)  LIVER BIG:        no, yes
(X9)  LIVER FIRM:       no, yes
(X10) SPLEEN PALPABLE:  no, yes
(X11) SPIDERS:          no, yes
(X12) ASCITES:          no, yes
(X13) VARICES:          no, yes
(X14) BILIRUBIN:        0.39, 0.80, 1.20, 2.00, 3.00, 4.00
(X15) ALK PHOSPHATE:    33, 80, 120, 160, 200, 250
(X16) SGOT:             13, 100, 200, 300, 400, 500
(X17) ALBUMIN:          2.1, 3.0, 3.8, 4.5, 5.0, 6.0
(X18) PROTIME:          10, 20, 30, 40, 50, 60, 70, 80, 90
(X19) HISTOLOGY:        no, yes
The parameters in Table 9 produce the best results. The rule obtained from the original (normalized) patterns produces an accuracy of 98.06% correct answers:

IF (IF (X10 OR X7) THEN (IF (X3 OR X4) THEN (IF X7 THEN (IF (IF X19
Table 9: Parameters of the Algorithm for the Hepatitis Data Set.

Parameter               Value
Constants               50 random values in the range [0-1]
Variables               X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19
Relational operators    <, >, <>, =
Arithmetical operators  +, -, *, % (protected division)
Logical operators       AND, OR, NOT
Decision function       If-then-else over boolean values; if-then-else over real values
Selection algorithm     Tournament
Crossover rate          90%
Mutation rate           10%
Population size         800 individuals
Parsimony level         0.001
THEN X10 ELSE X11) THEN X12 ELSE X2) ELSE X9) ELSE X9) ELSE X3) THEN (IF (IF X19 THEN X3 ELSE X6) THEN X12 ELSE (0.0816 < X14 < 0.4534)) ELSE (IF (X9 OR X7) THEN (X5 OR (0.5350 < X18 < 0.8474)) ELSE (IF (IF X10 THEN X12 ELSE X4)
Table 10: Summary of Rule Extraction Results for the Hepatitis Data Set.

Method                    Accuracy   Reference
Our method                98.06%     —
C-MLP2LN                  96.1       Duch et al. (2001)
k-NN, k = 18, Manhattan   90.2       Duch et al. (2001)
FSM + rotations           89.7       Duch et al. (2001)
LDA                       86.4       Ster and Dobnikar (1996)
Naive Bayes               86.3       Ster and Dobnikar (1996)
IncNet + rotations        86.0       Jankowski and Kadirkamanathan (1997)
QDA                       85.8       Ster and Dobnikar (1996)
1-NN                      85.3       Ster and Dobnikar (1996)
ASR                       85.0       Ster and Dobnikar (1996)
FDA                       84.5       Ster and Dobnikar (1996)
LVQ                       83.2       Ster and Dobnikar (1996)
CART                      82.7       Ster and Dobnikar (1996)
MLP with BP               82.1       Ster and Dobnikar (1996)
ASI                       82.0       Ster and Dobnikar (1996)
LFC                       81.9       Ster and Dobnikar (1996)
THEN (X1 < 0.6620) ELSE X13))

This rule corresponds to the following confusion matrix, calculated on patterns without any missing values:

    P = | 64   1 |
        |  3  12 |

Table 10 draws an accuracy comparison with other discovery techniques.

We use the system proposed by Rabuñal (1999) to train an ANN. With five neurons in one hidden layer and hyperbolic tangent activation functions, we achieve a 98.06% fitness.

In order to carry out a detailed analysis of the behavior of the obtained rules, we built an analysis file with all the possible input values for the 13 boolean variables and with regular intervals of 0.25 between 0 and 1 (normalized inputs) for the discrete values. With these intervals, the analysis file has 5,971,968 examples, and we can analyze the possible range of classifications carried out by the rules. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. If we apply the analysis file to the rule set, we obtain a fidelity of 59.95% correct classifications.

If we combine the rule extraction mechanism with the algorithm for new pattern creation on the ANN, the best behavior is obtained with between 25% and 45% of new patterns and between 30% and 50%
of change probability. The best rule, obtained with the rule extraction mechanism combined with the algorithm for creating new patterns, is the following:

(IF (IF (X16
Method                     Fidelity
Without dynamic creation   59.95%
With dynamic creation      73.34%
Figure 11: The obtained rules in tree shape.
The description of the 22 characteristics is as follows:

(X1)  Cap shape:                 bell, conical, convex, flat, knobbed, sunken
(X2)  Cap surface:               fibrous, grooves, scaly, smooth
(X3)  Cap color:                 brown, buff, cinnamon, grey, green, pink, purple, red, white, yellow
(X4)  Bruises:                   true, false
(X5)  Odor:                      almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
(X6)  Gill attachment:           attached, descending, free, notched
(X7)  Gill spacing:              close, crowded, distant
(X8)  Gill size:                 broad, narrow
(X9)  Gill color:                black, brown, buff, chocolate, grey, green, orange, pink, purple, red, white, yellow
(X10) Stalk shape:               enlarging, tapering
(X11) Stalk root:                bulbous, club, cup, equal, rhizomorphs, rooted, missing
(X12) Stalk surface above ring:  fibrous, scaly, silky, smooth
(X13) Stalk surface below ring:  fibrous, scaly, silky, smooth
(X14) Stalk color above ring:    brown, buff, cinnamon, grey, orange, pink, red, white, yellow
(X15) Stalk color below ring:    brown, buff, cinnamon, grey, orange, pink, red, white, yellow
(X16) Veil type:                 partial, universal
Table 12: Parameters of the Algorithm for the Poisonous Mushrooms Data Problem.

Parameter             Value
Constants             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Variables             X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22
Arithmetic operators  +, -, *, % (protected division)
Relational operators  <, >=, =
Logical operators     AND, OR, NOT
Decision function     If-then-else
Selection algorithm   Tournament
Crossover rate        95%
Mutation rate         4%
Population size       1000 individuals
Parsimony level       0.0001
(X17) Veil color:         brown, orange, white, yellow
(X18) Ring number:        none, one, two
(X19) Ring type:          cobwebby, evanescent, flaring, large, none, pendant, sheathing, zone
(X20) Spore print color:  black, brown, buff, chocolate, green, orange, purple, white, yellow
(X21) Population:         abundant, clustered, numerous, scattered, several, solitary
(X22) Habitat:            grasses, leaves, meadows, paths, urban, waste, woods
The parameters in Table 12 produce the best results. The normalized rule obtained from the original data set makes a correct classification of all the samples. However, because of the number of samples needed for the evaluation of the rules, 14 hours were necessary to obtain a rule with a success rate of 100%, and 3 more hours were needed for its optimization. The rule is as follows:

((0.3193 < X5) AND ((X5 < 0.7641) OR ((X8 > X4) OR ((0.2529 < X20) AND (X20 < X14)))))

This rule corresponds to the following confusion matrix:

    P = | 2156    0 |
        |    0 3488 |

The comparison of the results with other methods can be seen in Table 13.
Table 13: Summary of Rule Extraction Results for the Poisonous Mushrooms Data Set.

Method       Fitness Level   Reference
Our method   100%            —
C-MLP2LN     100.0           Duch et al. (2001)
SSV          100.0           Duch et al. (2001)
RULENEG      91.0            Duch et al. (2001)
REAL         98.0            Duch et al. (2001)
DEDEC        99.8            Duch et al. (2001)
TREX         100.0           Duch et al. (2001)
C4.5         99.8            Duch et al. (2001)
RULEX        98.5            Duch et al. (2001)
We used the system proposed by Rabuñal (1999), and several ANNs have been trained. The best results were obtained with three neurons in one hidden layer and a hyperbolic tangent transfer function; the success rate was 96.60%. We applied the rule extraction algorithm to this ANN. Without creating new patterns, we obtained a rule with a success rate of 91.10% on the ANN outputs:

X20 >= 0.2978

When we combined the rule extraction algorithm with the algorithm for new pattern creation, we obtained a dynamic fitness of 92.02% on the ANN outputs. The best rule was the following:

((X20 >= 0.3813) OR (((-0.9846) + (((-0.8851) % (-X15))
+ (X5 - 0.5614))) >= (-(((X9 + X10) + (-X11)) % (-0.3598)))))

The analysis file was built with all the possible input values for the 4 boolean variables and taking regular intervals of 0.5 between 0 and 1 (normalized inputs) for the discrete values. With these intervals, the analysis file contains 6,198,727,824 examples. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. We can evaluate both expressions (with and without the algorithm for new pattern creation) on the analysis file. In this case, the results are as shown in Table 14.

7.5 Appendicitis. This is another small medical database. It was obtained from Shalom Weiss (Weiss & Kulikowski, 1991). (Our special thanks to him for his support in the development of this experiment.)
Table 14: Comparative Fidelities Achieved Through Exclusive Use of Learning Patterns and the Use of the Algorithm for Creating New Patterns.

Method                     Fidelity
Without dynamic creation   77.44%
With dynamic creation      82.51%
It contains 106 samples that belong to two different classes: 88 cases with acute appendicitis and 18 cases with other problems. There are eight attributes, all of them with continuous values:

(X1) WBC1
(X2) MNEP
(X3) MNEA
(X4) MBAP
(X5) MBAA
(X6) HNEP
(X7) HNEA
(X8) CRP
The GP parameters that produce the best results are shown in Table 15. The rule obtained from the original (normalized) patterns produces a success rate of 98.89%:

IF (IF ((X8 + X4)
THEN ((X8 + ((X8 + X7) 0.0897)) X8) ELSE (X3 = X7)

Table 15: Parameters of the Algorithm for the Appendicitis Data Set.

Parameter               Value
Constants               10 random values in the range [0-1]
Variables               X1, X2, X3, X4, X5, X6, X7, X8
Relational operators    <, >, =
Arithmetical operators  +, -, *
Logical operators       AND, OR, NOT
Decision function       If-then-else over boolean values
Selection algorithm     Tournament
Crossover rate          90%
Mutation rate           5%
Population size         500 individuals
Parsimony level         0.001
Table 16: Summary of Rule Extraction Results for the Appendicitis Data Set.

Method                Accuracy   Reference
Our method            98.89%     —
SSV, 2 crisp rules    94.3       Duch et al. (2001)
C-MLP2LN, 2 neurons   94.3       Duch et al. (2001)
C-MLP2LN, 1 neuron    91.5       Duch et al. (2001)
PVM                   91.5       Weiss and Kulikowski (1990)
MLP + backprop        90.2       Weiss and Kulikowski (1990)
CART                  90.6       Weiss and Kulikowski (1990)
FSM, 12 fuzzy rules   92.5       Duch et al. (2001)
Bayes                 88.7       Duch et al. (2001)
This rule corresponds to the following confusion matrix, which shows the results obtained for the cases without any missing values:

    P = | 14   0 |
        |  5  79 |

Table 16 draws an accuracy comparison with other discovery techniques.

We use the system proposed by Rabuñal (1999) to train an ANN. With five neurons in one hidden layer and hyperbolic tangent activation functions, the fitness reaches a success rate of 94.89%. We applied the rule extraction algorithm to this ANN. Without creating new patterns, we obtained a rule with a fitness of 95.91% on the ANN outputs:

IF (X2 < X1) THEN (X1 = X7)
ELSE (((-X8) - (X1 + (-0.24832912))) > (X4 * X6))

After we combined the rule extraction algorithm with the algorithm for new pattern creation, we obtained a dynamic fitness of 92.77% on the ANN outputs. The best rule is the following:

((X1 = X3) OR (((X6 > X7) AND (X1 > X8)) AND (0.5225 >= ((X6 * X2) + X1))))
The analysis file has been built taking regular intervals of 0.125 between 0 and 1 (normalized inputs). With these intervals, the analysis file contains 134,217,728 examples. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. We can evaluate both expressions (with and without the algorithm for new pattern creation) on the analysis file. In this case, the results are as shown in Table 17.
Table 17: Comparative Fidelities Achieved Through Exclusive Use of the Learning Patterns and Use of the Algorithm for Creating New Patterns.

Method                     Fidelity
Without dynamic creation   78.93%
With dynamic creation      85.97%
8 Discussion

The results presented here show that the rule discovery algorithm based on GP is comparable to other existing methods, while offering the great advantage of being applicable to any ANN architecture with any type of activation function. Depending on the task for which the ANN was designed, we only need to carry out the appropriate experimentation and build the subsequent learning patterns by selecting the operator types that will be used by the GP. The four requirements for an ideal rule discovery system have been met:

• The ANN is treated as a black box: only the inputs and outputs produced by the ANN are taken into account, and there is no dependence on the architecture or the training algorithm (requirements 1 and 2).

• The rules discovered in the ANN are as close to the ANN's functioning as the fitness value produced by the algorithm (GP) indicates, and as expressive as the semantic level used by the expression trees that are codified in the GP (requirements 3 and 4).

We can therefore conclude that algorithms based on GP fit the ideal needs of rule extraction systems. Their results are very positive when compared to other methods that are more specific to the ANN architecture or its training. Moreover, the results obtained from the extraction of the generalization capacity provide a rather accurate emulation of the ANN's behavior with respect to the possible value combinations that may occur in the inputs. The high success rates of the rule extraction from the ANNs indicate a reliable behavior of the ANNs. We may also state that the knowledge of the ANN has been obtained in an explicit way that is comprehensible from a human's point of view, which allows us to express the ANN's generalization capacity by means of a symbolic rule. Another advantage of the rule extraction system is that it drastically reduces the complexity of the obtained rules in all the analyzed cases.
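Requirements 1 and 2 amount to the loop below: only input-output pairs queried from the ANN are used, and any search over symbolic rules can be plugged in. For illustration only, GP is replaced here by a toy exhaustive search over threshold rules; all names are ours:

```python
def extract_rule(ann, inputs, candidate_rules):
    # score each candidate symbolic rule by its fidelity to the ANN's answers;
    # the ANN is used strictly as a black box
    def fid(rule):
        return sum(rule(x) == ann(x) for x in inputs) / len(inputs)
    return max(candidate_rules, key=fid)

ann = lambda x: x < 0.3                            # stand-in black-box network
inputs = [i / 10 for i in range(11)]               # tiny "analysis file"
candidates = [lambda x, t=t: x < t for t in (0.1, 0.3, 0.8)]
best = extract_rule(ann, inputs, candidates)
assert all(best(x) == ann(x) for x in inputs)      # the t = 0.3 rule has fidelity 1.0
```

In the paper, the candidate set is not enumerated but evolved by GP, and the fidelity score is the GP fitness function.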
As the previous results show, the number of operators and constants that intervene in the rules is considerably smaller than that obtained by the rules extracted from the training patterns, and they always maintain the obtained tness levels.
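The black-box pattern generation and fidelity-style fitness evaluation described above can be sketched as follows. This is our own minimal illustration, not the authors' implementation: the stand-in `ann` function and the candidate rules are hypothetical, and a real GP system would evolve a population of expression trees rather than hand-written lambdas.

```python
import random

# Hypothetical trained ANN, treated as a black box: we only query
# inputs and outputs. A simple Boolean function stands in for the network.
def ann(x1, x2, x3):
    return (x1 and x2) or x3

def make_patterns(n, seed=0):
    """Build learning patterns by querying the black box on random inputs."""
    rng = random.Random(seed)
    pats = []
    for _ in range(n):
        x = tuple(rng.random() < 0.5 for _ in range(3))
        pats.append((x, ann(*x)))
    return pats

def fitness(rule, patterns):
    """Fidelity of a candidate symbolic rule: the fraction of the
    learning patterns on which it reproduces the ANN's output."""
    hits = sum(rule(*x) == y for x, y in patterns)
    return hits / len(patterns)

patterns = make_patterns(200)
# Two candidate rules such as a GP population might contain:
good_rule = lambda a, b, c: (a and b) or c
bad_rule = lambda a, b, c: a and c
print(fitness(good_rule, patterns))  # 1.0
print(fitness(bad_rule, patterns))
```

A GP run would use this fitness to select, cross over, and mutate rule trees; dynamic creation of new patterns would add queries in regions where the current best rule and the ANN disagree.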
1520
J. Rabunal, ˜ J. Dorado, A. Pazos, J. Pereira, and D. Rivero
9 Future Works

The algorithm presented here works with data sequences generated by the ANN. It could therefore also be applied to ANNs with recurrent architectures (RANNs) and to dynamic problems, where the RANN outputs depend on the present inputs and on the inputs in N previous cycles. It is then necessary to redefine the types of rules and operators so that these data structures can be represented, including terminal nodes that store the values of the previous cycles. This is an attractive development that can be applied to dynamic models such as time-series prediction, where recurrent ANNs have been applied to many problems of a similar nature.

Another future development will be the analysis of the parameters involved in the algorithm's correct functioning, depending on the type of problem solved by the ANN. We shall also consider the ANN not as a black box but as a gray box in which the ANN's activation function is known; by incorporating it as a mathematical operator for GP, we shall analyze which rules and expressions can be discovered with this operator. Finally, in order to accelerate the rule extraction process, we can use a network of computers that searches in a distributed and concurrent manner, exchanging rules (subtrees) internally.

Acknowledgments

This work has been supported by project INBIOMED (Storage, Integration and Analysis of Clinical, Genetic and Epidemiological Data and Images Oriented to Research on Pathologies Platform) G03/160, financed by the General Subbureau of Sanitary Research and the Institute of Health Carlos III.
A New Approach to ANN Rules Extraction
Received April 9, 2003; accepted January 8, 2004.
LETTER
Communicated by Steven Nowlan
A Classification Paradigm for Distributed Vertically Partitioned Data Jayanta Basak [email protected]
Ravi Kothari [email protected] IBM India Research Laboratory, Indian Institute of Technology, New Delhi 110016, India
In general, pattern classification algorithms assume that all the features are available during the construction of a classifier and its subsequent use. In many practical situations, data are recorded in different servers that are geographically apart, and each server observes features of local interest. The underlying infrastructure and other logistics (such as access control) in many cases do not permit continual synchronization. Each server thus has a partial view of the data in the sense that feature subsets (not necessarily disjoint) are available at each server. In this article, we present a classification algorithm for this distributed vertically partitioned data. We assume that local classifiers can be constructed based on the local partial views of the data available at each server. These local classifiers can be any one of the many standard classifiers (e.g., neural networks, decision tree, k nearest neighbor). Often these local classifiers are constructed to support decision making at each location, and our focus is not on these individual local classifiers. Rather, our focus is on constructing a classifier that can use these local classifiers to achieve an error rate that is as close as possible to that of a classifier having access to the entire feature set. We empirically demonstrate the efficacy of the proposed algorithm and also provide theoretical results quantifying the loss that results as compared to the situation where the entire feature set is available to any single classifier.

1 Introduction

Existing approaches to classifier design assume that all the features are available for constructing the classifier and during its subsequent use (Duda & Hart, 1973; Jain, Duin, & Mao, 2000). In some situations, all the features may not be observable at a single location. Even if they are observable from one location, all the features may not be relevant in a particular context, and an observer may choose selected features relevant within its context.
Neural Computation 16, 1525–1544 (2004) © 2004 Massachusetts Institute of Technology

For example, in the domain of e-commerce, Web-based merchant-customer
interaction may be recorded in one server. Such interaction may comprise past transaction data, clickstream (the sequence of Web pages visited by an individual), and other related information (Cooley, 2000). However, a different server may record phone-based merchant-customer interaction. Yet another server may record brick-and-mortar-based interactions. In some cases, the underlying infrastructure does not permit continual synchronization among the different servers. In other cases, a merchant may not have felt the need to utilize all the data in the past but may now feel the need to do so. A solution in which classifiers, local to a server and built using the partial view available at the server, are utilized to arrive at an overall classification may be far more attractive than an infrastructure upgrade for co-locating data.

There are other examples also. In the domain of computer vision, multiple cameras may observe an object from different viewpoints. To support certain types of queries, there might be a need to utilize information gathered from the multiple viewpoints. Co-locating the data might be an acceptable solution in some situations. However, from a different perspective, one can think of integrating the output of (local) classifiers designed on the basis of (local) partial views.

In summary, there exist several reasons for considering the problem of decision making in this vertically partitioned environment (Caragea, Silvescu, & Honavar, 2001). Depending on the domain of application, the following, either individually or in combination, motivate the problem setting:

• The data sources may not be available at a single location, and it may be expensive (or difficult due to existing infrastructure constraints) to co-locate the partial views.
• Even if it is possible to co-locate the partial views, it may be difficult to solve the correspondence problem, that is, knowing which (partial) pattern in one data source corresponds to which (partial) pattern in some other data source.

• Even if it is possible to solve the correspondence problem, it may be difficult to build a single classifier in a large-dimensional feature space (Ho & Basu, 2002). Vertical partitioning may allow scaling in complex situations.

• Even if all of the above are resolved, issues related to privacy and security may preclude one data source from fully releasing and sharing its data. In such situations, co-locating the partial views may not be possible.

While one would expect a lower error rate when all the features are simultaneously available, the proposed classification paradigm is more feasible and can be attractive if a comparable error rate is achievable. One intuitive approach is to construct individual classifiers from the partial views of the data and then use a master classifier to combine the decisions of the individual slave classifiers. Some expert systems are in fact constructed using such an approach (Behrouz & Koono, 1996). However, this requires the master classifier to have access to some data specified in terms of the entire feature set.

At first glance, it appears that the significant activity in the area of the so-called mixture-of-experts model (Jacobs, Jordan, & Nowlan, 1991; Breiman, 1996; Jordan & Jacobs, 1994; Nowlan & Hinton, 1990, 1991; Nowlan & Sejnowski, 1995) is usable when data are vertically partitioned. In the typical mixture-of-experts model, a number of classifiers (or modules) simultaneously receive the input x, and a gating module produces the overall output based on selection or negotiation among the outputs of the individual modules. The fundamental motivation of such networks is divide and conquer: each expert in the mixture-of-experts framework solves a simpler problem, and the combination of the outputs of the individual experts provides a solution to the more complex problem. Though typically each expert in such a framework sees the entire input x, it is conceivable that each expert could observe certain features and the entire framework be usable when data are vertically partitioned. However, each expert in a mixture-of-experts framework partitions the input space and establishes local regression surfaces in each partition. When used with vertically partitioned data, such regression surfaces would be defined over regions of a subspace, and there is no guarantee that the approximation will be anything close to the desired one (unless the function to be approximated happens to be separable). From another perspective, a mixture-of-experts network typically reduces the bias (over a single classifier) and experiences an increase in the variance. "Soft splits" (Jordan & Jacobs, 1994) alleviate this to some extent by allowing smooth transitions between the regions of influence of two experts.
However, when the experts see different feature subsets, it becomes difficult to produce smooth transitions and reduce the variance. Closely related to the mixture-of-experts model (within this context) is the so-called bagging technique (Jacobs et al., 1991; Breiman, 1996), in which the weights of the individual classifiers are adjusted to minimize the generalization error. In the bagging technique, however, there is a "supervisor" that indicates the overall error; such a supervisor may not be present in many situations. Inducing a classifier with vertically partitioned data may also be viewed from the perspective of missing data: a classifier induced from the features in a data partition may view the unobserved features as features whose value is always missing. Ahmad and Tresp (1993), for example, compute the posterior probabilities by marginalizing out the missing features in a manner similar to the one proposed in this article. However, they use a single classifier, which differs from the multiple classifiers induced here using individual feature subsets. The exploitation of correlations between the multiple classifiers, not addressed in previous work (Ahmad & Tresp, 1993), is an advantage of the proposed paradigm.
In this article, we assume that each individual classifier is constructed based on the partial view of the data that are available. Correspondence of the patterns in the different data sources is not necessary: a classifier is constructed based on the available data. Of course, in order for the decisions made by the classifiers to be consistent, the data sets available to the classifiers should be sampled from the same (fixed though unknown) distribution. However, we assume that a test pattern is observable across the classifiers. Our approach to arriving at the combined classification is based on utilizing the posterior probabilities computed by the individual classifiers.

The rest of the letter is organized as follows. In section 2, we provide an outline of the problem formulation in a Bayesian framework. In section 3, we quantify the error introduced into the classification procedure by the combination of the different classifiers. In section 4, we provide case studies for different types of classifiers and the simulation results obtained for different data sets vertically partitioned across the classifiers. Finally, we present our conclusions in section 5.

2 Problem Formulation in a Bayesian Framework

We consider the situation in which there are q independent observers of a phenomenon. The ith observer records events pertaining to the phenomenon in terms of a set of features $F_i$. The entire feature set is given as $F = F_1 \cup F_2 \cup \cdots \cup F_q$. The partial view recorded by each observer may be interpreted within the traditional nondistributed approach by visualizing a data set, in which each row comprises a pattern, that is vertically (columnwise) partitioned into q (possibly overlapping) partitions. Let the (partial) view of a pattern x as viewed by the ith observer be denoted by $x_{F_i}$. More specifically, let $x_{F_i}$ denote the vector representation of a pattern x comprising the features of x present in $F_i$.
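The vertical partitioning just described can be made concrete with a toy pattern; the feature names and values below are invented for illustration.

```python
# Toy illustration of vertical partitioning (feature names invented):
# each observer records a possibly overlapping subset of the columns.
pattern = {"f1": 0.4, "f2": 1.2, "f3": -0.7, "f4": 3.1}

F1 = {"f1", "f2", "f3"}   # features recorded by observer 1
F2 = {"f3", "f4"}         # features recorded by observer 2 (overlap on f3)

def partial_view(x, feats):
    """x_{F_i}: restriction of pattern x to the feature subset F_i."""
    return {f: v for f, v in x.items() if f in feats}

x_F1 = partial_view(pattern, F1)
x_F2 = partial_view(pattern, F2)
print(sorted(x_F1))               # ['f1', 'f2', 'f3']
print((F1 | F2) == set(pattern))  # True: the union of the views spans F
```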
Associated with each observer i is a classifier $C_i$ constructed on the basis of $x_{F_i}$. In this distributed setting, we approach the problem of finding the class label of a test pattern x. We assume that the output of each classifier is the a posteriori probability computed on the basis of the partial view available to the classifier.

2.1 Problem Formulation. Here and in the rest of the article, for notational simplicity, we use $P(\cdot)$ to represent the empirically estimated probabilities and $\hat{P}(\cdot)$ to represent the posterior probability as approximated by the proposed approach. Let $P(\omega_j \mid x_{F_i})$ denote the posterior probability for class $\omega_j$ as determined by classifier i based on a partial view of x (i.e., $x_{F_i}$), that is,
$$
P(\omega_j \mid x_{F_i}) = \frac{P(\omega_j)\, P(x_{F_i} \mid \omega_j)}{P(x_{F_i})},
\tag{2.1}
$$
where the prior probabilities are assumed to be the same across the different partitions. As a consequence, there is no index i associated with $P(\omega_j)$. If the feature sets observed by the different classifiers are nonoverlapping, then the combined decision can be obtained under the assumption that the feature subsets (observed by the classifiers) are independent of each other. If these subsets are overlapping, the independence assumption may not hold. In fact, even in the nonoverlapping case the independence assumption may not hold true, but in the absence of dependency information, the best that can be obtained is based on the independence assumption over the feature subsets.

The problem is to make a decision based on the output of the individual classifiers. We model the problem as one of estimating the posterior probability based on the posterior probabilities estimated by the individual classifiers. The following theorem describes one approximation technique by which the overall posterior probability can be estimated:

Theorem 1. If we marginalize those features that are not visible to more than one classifier, then the overall posterior probability is approximated as
$$
\hat{P}(\omega_j \mid x) =
\frac{\left[\prod_{k} P(\omega_j \mid x_{F_k})\right]
      \left[\prod_{k,l,m} P(\omega_j \mid x_{F_k \cap F_l \cap F_m})\right] \cdots}
     {\left[\prod_{k,l} P(\omega_j \mid x_{F_k \cap F_l})\right]
      \left[\prod_{k,l,m,n} P(\omega_j \mid x_{F_k \cap F_l \cap F_m \cap F_n})\right] \cdots},
\tag{2.2}
$$
where $P(\omega_j \mid x_{F_k \cap F_l \cap F_m})$ denotes the posterior probability for class $\omega_j$ based on the feature subset $F_k \cap F_l \cap F_m$, and $x_{F_k \cap F_l \cap F_m}$ is the corresponding view of x. $\hat{P}$ is the approximated posterior probability.

Proof. Let us first consider that there are only two classifiers, such that the feature set can be expressed as $F = F_1 \cup F_2$. Therefore,
$$
P(x) = P(x_{F_1}, x_{F_2}) = P(\bar{x}_{F_1}, \bar{x}_{F_2}, x_{F_1 \cap F_2}),
\tag{2.3}
$$
where $\bar{x}_{F_1}$ and $\bar{x}_{F_2}$ are the partial views of x corresponding to the feature subsets $F_1 - F_2$ and $F_2 - F_1$, respectively. Since the feature subsets $F_1 - F_2$ and $F_2 - F_1$ are not observable across the classifiers, the probability distribution can be marginalized over these subsets (according to the assumption), and therefore
$$
\hat{P}(x)
= P(\bar{x}_{F_1} \mid x_{F_1 \cap F_2})\, P(\bar{x}_{F_2} \mid x_{F_1 \cap F_2})\, P(x_{F_1 \cap F_2})
= \frac{P(\bar{x}_{F_1}, x_{F_1 \cap F_2})}{P(x_{F_1 \cap F_2})}\,
  \frac{P(\bar{x}_{F_2}, x_{F_1 \cap F_2})}{P(x_{F_1 \cap F_2})}\,
  P(x_{F_1 \cap F_2})
= \frac{P(x_{F_1})\, P(x_{F_2})}{P(x_{F_1 \cap F_2})}.
\tag{2.4}
$$
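The factorization in equation 2.4 can be checked numerically. The sketch below (our construction, with invented probability tables) uses three binary variables: a seen only through $F_1$, b only through $F_2$, and a shared variable s, with a and b conditionally independent given s, which is exactly the marginalization assumption of the proof.

```python
import itertools

# P(s), P(a|s), P(b|s): invented numbers; a and b are conditionally
# independent given the shared variable s.
p_s = {0: 0.6, 1: 0.4}
p_a_given_s = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_b_given_s = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

def joint(a, s, b):
    """P(a, s, b) under the conditional-independence assumption."""
    return p_s[s] * p_a_given_s[s][a] * p_b_given_s[s][b]

def marg_F1(a, s):  # P(x_{F1}) = P(a, s), marginalizing out b
    return sum(joint(a, s, b) for b in (0, 1))

def marg_F2(s, b):  # P(x_{F2}) = P(s, b), marginalizing out a
    return sum(joint(a, s, b) for a in (0, 1))

# Equation 2.4: P(x_{F1}) P(x_{F2}) / P(x_{F1 ∩ F2}) recovers P(x) exactly.
for a, s, b in itertools.product((0, 1), repeat=3):
    approx = marg_F1(a, s) * marg_F2(s, b) / p_s[s]
    assert abs(approx - joint(a, s, b)) < 1e-12
print("equation 2.4 holds exactly under the assumption")
```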
By similar arguments and approximation, the class-conditional probability can be approximated as
$$
\hat{P}(x \mid \omega_j) = \frac{P(x_{F_1} \mid \omega_j)\, P(x_{F_2} \mid \omega_j)}{P(x_{F_1 \cap F_2} \mid \omega_j)}.
\tag{2.5}
$$
Therefore, from equations 2.4 and 2.5, the posterior probability can be approximated as
$$
\hat{P}(\omega_j \mid x) = \frac{P(\omega_j \mid x_{F_1})\, P(\omega_j \mid x_{F_2})}{P(\omega_j \mid x_{F_1 \cap F_2})}.
\tag{2.6}
$$
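For two classifiers, equation 2.6 combines the two partial-view posteriors by dividing out the posterior on the shared features. A minimal numeric sketch, with invented posterior values and an added renormalization so the two class scores sum to one:

```python
# Equation 2.6 for two classes: P̂(ω|x) ∝ P(ω|x_F1) P(ω|x_F2) / P(ω|x_{F1∩F2}).
def combine_two_views(post_F1, post_F2, post_shared):
    """Each argument maps a class label to the posterior from that view."""
    raw = {w: post_F1[w] * post_F2[w] / post_shared[w] for w in post_F1}
    z = sum(raw.values())            # renormalize over the class labels
    return {w: v / z for w, v in raw.items()}

post = combine_two_views(
    post_F1={"w1": 0.8, "w2": 0.2},      # view of classifier 1 (invented)
    post_F2={"w1": 0.7, "w2": 0.3},      # view of classifier 2 (invented)
    post_shared={"w1": 0.6, "w2": 0.4},  # posterior on F1 ∩ F2 (invented)
)
print(post["w1"] > post["w2"])  # True: both views favor w1
```

Note that the division by the shared-view posterior prevents the overlapping features from being counted twice, which a plain product of the two view posteriors would do.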
Let us now prove the generalized approximation by the method of induction. We have to show that the approximation is valid for N feature subsets if it is valid for $N - 1$ subsets. Let the feature subset collectively available to the first $N - 1$ classifiers be represented as $F_C = \bigcup_{i=1}^{N-1} F_i$, so that $F = F_C \cup F_N$. Let $x_{F_C}$ represent the partial view of x corresponding to the feature subset $F_C$. Since the approximation is valid for the $N - 1$ subsets, we can express
$$
\hat{P}(\omega_j \mid x_{F_C}) =
\frac{\left[\prod_{k=1}^{N-1} P(\omega_j \mid x_{F_k})\right]
      \left[\prod_{k,l,m=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_l \cap F_m})\right] \cdots}
     {\left[\prod_{k,l=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_l})\right]
      \left[\prod_{k,l,m,n=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_l \cap F_m \cap F_n})\right] \cdots}.
\tag{2.7}
$$
From equation 2.6, the posterior for the overall decision can be approximated as
$$
\hat{P}(\omega_j \mid x) = \frac{P(\omega_j \mid x_{F_N})\, \hat{P}(\omega_j \mid x_{F_C})}{\hat{P}(\omega_j \mid x_{F_C \cap F_N})}.
\tag{2.8}
$$
Again,
$$
F_C \cap F_N = \bigcup_{i=1}^{N-1} (F_i \cap F_N),
\tag{2.9}
$$
and also
$$
(F_i \cap F_N) \cap (F_j \cap F_N) = F_i \cap F_j \cap F_N.
\tag{2.10}
$$
Therefore, with the similar approximation for the $N - 1$ partial views, the combined posterior from the partial views of x observed by the pairwise intersections of the feature sets with $F_N$ can be expressed as
$$
\hat{P}(\omega_j \mid x_{F_C \cap F_N}) =
\frac{\left[\prod_{k=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_N})\right]
      \left[\prod_{k,l,m=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_l \cap F_m \cap F_N})\right] \cdots}
     {\left[\prod_{k,l=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_l \cap F_N})\right]
      \left[\prod_{k,l,m,n=1}^{N-1} P(\omega_j \mid x_{F_k \cap F_l \cap F_m \cap F_n \cap F_N})\right] \cdots}.
\tag{2.11}
$$
From equations 2.8 and 2.11, the overall posterior can be approximated as
$$
\hat{P}(\omega_j \mid x) =
\frac{\left[\prod_{k=1}^{N} P(\omega_j \mid x_{F_k})\right]
      \left[\prod_{k,l,m=1}^{N} P(\omega_j \mid x_{F_k \cap F_l \cap F_m})\right] \cdots}
     {\left[\prod_{k,l=1}^{N} P(\omega_j \mid x_{F_k \cap F_l})\right]
      \left[\prod_{k,l,m,n=1}^{N} P(\omega_j \mid x_{F_k \cap F_l \cap F_m \cap F_n})\right] \cdots}.
\tag{2.12}
$$
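The inclusion-exclusion structure of equation 2.12 (odd-order intersections in the numerator, even-order in the denominator) can be sketched directly. This is our own illustration: `posterior` is a caller-supplied lookup of subset-view posteriors, the feature names and numbers are invented, and empty intersections are simply skipped here (the text approximates them by the prior).

```python
from functools import reduce
from itertools import combinations

def approx_posterior(label, feature_sets, posterior):
    """Equation 2.12: product of subset-view posteriors, with posteriors
    over odd-order intersections in the numerator and even-order ones in
    the denominator."""
    num, den = 1.0, 1.0
    n = len(feature_sets)
    for r in range(1, n + 1):
        for idx in combinations(range(n), r):
            inter = reduce(lambda a, b: a & b, (feature_sets[i] for i in idx))
            if not inter:
                continue  # empty intersection: the text falls back to the prior
            p = posterior(frozenset(inter), label)
            if r % 2 == 1:
                num *= p   # odd-order intersection
            else:
                den *= p   # even-order intersection
    return num / den

F1, F2 = {"a", "s"}, {"s", "b"}
table = {  # invented subset-view posteriors for a two-class problem
    (frozenset(F1), "w1"): 0.8, (frozenset(F1), "w2"): 0.2,
    (frozenset(F2), "w1"): 0.7, (frozenset(F2), "w2"): 0.3,
    (frozenset({"s"}), "w1"): 0.6, (frozenset({"s"}), "w2"): 0.4,
}
lookup = lambda feats, w: table[(feats, w)]
print(approx_posterior("w1", [F1, F2], lookup))  # 0.8*0.7/0.6 ≈ 0.933
print(approx_posterior("w2", [F1, F2], lookup))  # 0.2*0.3/0.4 ≈ 0.15
```

For two subsets this reduces exactly to equation 2.6.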
Thus, the approximation of the posterior is valid for two partitions of the feature set, and therefore for three (since validity for $N - 1$ partitions implies validity for N partitions); the argument extends inductively to any arbitrary number of partitions of the feature set. In this approximation, if an intersection of feature subsets across the classifiers is empty, then the posterior for that subset is approximated by the a priori probability.

2.2 Algorithm. In equation 2.12, observe that the odd-order intersections appear in the numerator and the even-order intersections appear in the denominator. Let there be two partitions $F_1$ and $F_2$ ($F = F_1 \cup F_2$ and $F_1 \cap F_2 \neq \emptyset$). Let a pattern x belong to class $\omega_1$, such that $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, where $P(\cdot)$ is the true posterior. Let the classifiers be such that a correct decision is made with each individual feature partition and also with the intersection of the feature partitions, that is, $P(\omega_1 \mid x_{F_1}) > P(\omega_2 \mid x_{F_1})$, $P(\omega_1 \mid x_{F_2}) > P(\omega_2 \mid x_{F_2})$, and $P(\omega_1 \mid x_{F_1 \cap F_2}) > P(\omega_2 \mid x_{F_1 \cap F_2})$. Even so, equation 2.12 does not guarantee that
$$
\hat{P}(\omega_1 \mid x) > \hat{P}(\omega_2 \mid x),
\tag{2.13}
$$
although each subclassifier makes the decision correctly. A sufficient condition for inequality 2.13 to hold is (from equation 2.12)
$$
P(\omega_1 \mid x_{F_1}),\; P(\omega_1 \mid x_{F_2}) \;\geq\; P(\omega_1 \mid x_{F_1 \cap F_2}),
\tag{2.14}
$$
which for a two-class problem is equivalent to $P(\omega_2 \mid x_{F_1}),\; P(\omega_2 \mid x_{F_2}) \leq P(\omega_2 \mid x_{F_1 \cap F_2})$. Condition 2.14 states that a classifier makes the decision more accurately when more features are available to it; in other words, the performance of a classifier is not disturbed by the presence of redundant features. Note that certain classifiers may not have this property. For example, the performance of k-NN classifiers may deteriorate in the presence of noisy features. A classifier can employ internal logic to check the redundancy of the features (using some feature selection mechanism) and thereby behave as a consistent classifier.

We call a classifier i consistent for a pattern x if there exists some class label $\omega$ such that $P(\omega \mid x_{F_i}) > P(\omega' \mid x_{F_i})$ for all $\omega' \neq \omega$, and $P(\omega \mid x_{F_i}) > P(\omega \mid x_{F_i \cap F_j})$ for all $F_j$ with $F_i \cup F_j \supset F_i$ and $F_i \cap F_j \neq \emptyset$. It is to be noted here that the Bayesian framework for deriving the approximate posterior (see equation 2.12) is valid only for the set of consistent classifiers. If all classifiers for the vertically partitioned data sets are consistent, then the overall classification score can be computed from equation 2.12. In our implementation, we have not imposed the restriction of consistency on each classifier; we compute the overall approximated posterior for a pattern based on only the consistent classifiers. The overall algorithm is therefore as follows. For a given pattern x:

• Find the set of consistent classifiers for each class label (note that one classifier can be consistent for at most one class label for a given pattern). Let $\Omega_i$ be the set of classifiers consistent for the class label $\omega_i$.

• For all non-null $\Omega_i$, compute the approximate posterior $\hat{P}(\omega_i \mid x)$ based on only the classifiers in $\Omega_i$.

• Assign the class label for which $\hat{P}(\omega_i \mid x)$ is maximum.
• If there exist no consistent classifiers for any class label, then we approximate the posterior by the product of the posteriors of the individual classifiers.

3 Information Loss Due to the Approximation

In this section, we provide an alternate view of the approximation of the posterior due to partitioning of the feature set. According to the Bayes decision rule for a two-class problem, we have
$$
P(\omega_1 \mid x) \geq P(\omega_2 \mid x) \;\Rightarrow\; x \in \text{class } \omega_1,
\quad \text{otherwise } x \in \text{class } \omega_2.
\tag{3.1}
$$
Let us consider
$$
g(x) = \frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)}
\quad \text{and} \quad
\hat{g}(x) = \frac{\hat{P}(\omega_1 \mid x)}{\hat{P}(\omega_2 \mid x)},
\tag{3.2}
$$
where $\hat{P}(\cdot)$ is the approximated posterior as given by equation 2.12. There is no error due to the approximation if
$$
g(x) = \hat{g}(x).
\tag{3.3}
$$
In other words, an error in the approximation occurs if $g(x)$ is not equal to $\hat{g}(x)$ for an observation x. Thus, the error due to approximation in the classification can be formally measured as
$$
E = \int \log\!\left(\frac{g(x)}{\hat{g}(x)}\right) p(x)\, dx,
\tag{3.4}
$$
that is,
$$
E = \int \left[ \log\!\left(\frac{P(\omega_1 \mid x)}{\hat{P}(\omega_1 \mid x)}\right)
- \log\!\left(\frac{P(\omega_2 \mid x)}{\hat{P}(\omega_2 \mid x)}\right) \right] p(x)\, dx.
\tag{3.5}
$$
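The error measure in equation 3.5 can be evaluated exactly on a small discrete example. The sketch below (our construction, with invented probability tables) uses two disjoint, class-conditionally independent binary features; in that case the approximation of equation 2.12 reduces to a product of the view posteriors divided by the prior, it recovers the true posterior, and E evaluates to zero.

```python
from math import log

prior = {"w1": 0.5, "w2": 0.5}
px1 = {"w1": {0: 0.9, 1: 0.1}, "w2": {0: 0.3, 1: 0.7}}  # p(x1|ω), invented
px2 = {"w1": {0: 0.6, 1: 0.4}, "w2": {0: 0.5, 1: 0.5}}  # p(x2|ω), invented

def cond(x1, x2, w):   # p(x|ω) = p(x1|ω) p(x2|ω): independent split
    return px1[w][x1] * px2[w][x2]

def p(x1, x2):         # evidence p(x)
    return sum(prior[w] * cond(x1, x2, w) for w in prior)

def post(x1, x2, w):   # true posterior P(ω|x)
    return prior[w] * cond(x1, x2, w) / p(x1, x2)

def post_view1(x1, w): # P(ω|x_{F1}) from the first feature alone
    return prior[w] * px1[w][x1] / sum(prior[v] * px1[v][x1] for v in prior)

def post_view2(x2, w):
    return prior[w] * px2[w][x2] / sum(prior[v] * px2[v][x2] for v in prior)

def approx_post(x1, x2, w):  # equation 2.12, disjoint case: product / prior
    raw = {v: post_view1(x1, v) * post_view2(x2, v) / prior[v] for v in prior}
    return raw[w] / sum(raw.values())

E = sum(
    p(x1, x2) * (log(post(x1, x2, "w1") / approx_post(x1, x2, "w1"))
                 - log(post(x1, x2, "w2") / approx_post(x1, x2, "w2")))
    for x1 in (0, 1) for x2 in (0, 1)
)
print(abs(E) < 1e-12)  # True: an independent split gives zero error
```

Making the two features dependent within a class would break the equality and yield a nonzero E.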
From equation 2.12, the approximation error, equation 3.5, can be simplified to
$$
E = \int \left[
\log\!\left(\frac{p(x \mid \omega_1)\, \prod p_{ij}(x_{ij} \mid \omega_1)\, \prod p_{ijkl}(x_{ijkl} \mid \omega_1) \cdots}
{\prod p_i(x_i \mid \omega_1)\, \prod p_{ijk}(x_{ijk} \mid \omega_1) \cdots}\right)
- \log\!\left(\frac{p(x \mid \omega_2)\, \prod p_{ij}(x_{ij} \mid \omega_2)\, \prod p_{ijkl}(x_{ijkl} \mid \omega_2) \cdots}
{\prod p_i(x_i \mid \omega_2)\, \prod p_{ijk}(x_{ijk} \mid \omega_2) \cdots}\right)
\right] p(x)\, dx,
\tag{3.6}
$$
where $x_i$, $x_{ij}$, $x_{ijk}$, and $x_{ijkl}$ are simplified representations of the partial views $x_{F_i}$, $x_{F_i \cap F_j}$, $x_{F_i \cap F_j \cap F_k}$, and $x_{F_i \cap F_j \cap F_k \cap F_l}$, respectively. The densities $p_i(x_i \mid \omega)$, $p_{ij}(x_{ij} \mid \omega)$, $p_{ijk}(x_{ijk} \mid \omega)$, and $p_{ijkl}(x_{ijkl} \mid \omega)$ are the marginal density functions of the respective class in the feature subsets $F_i$, $F_i \cap F_j$, $F_i \cap F_j \cap F_k$, and $F_i \cap F_j \cap F_k \cap F_l$, respectively, and $p(x)$ is the overall density function of the patterns. Considering that $p(x) = P(\omega_1)\, p(x \mid \omega_1) + P(\omega_2)\, p(x \mid \omega_2)$, with $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ being the class-conditional densities for $\omega_1$ and $\omega_2$, respectively, and that the class-conditional density of class $\omega_1$ is independent of that of class $\omega_2$, we can rewrite equation 3.6 as
$$
E = \int \left[
P(\omega_1) \log\!\left(\frac{p(x \mid \omega_1)\, \prod p_{ij}(x_{ij} \mid \omega_1)\, \prod p_{ijkl}(x_{ijkl} \mid \omega_1) \cdots}
{\prod p_i(x_i \mid \omega_1)\, \prod p_{ijk}(x_{ijk} \mid \omega_1) \cdots}\right) p(x \mid \omega_1)
- P(\omega_2) \log\!\left(\frac{p(x \mid \omega_2)\, \prod p_{ij}(x_{ij} \mid \omega_2)\, \prod p_{ijkl}(x_{ijkl} \mid \omega_2) \cdots}
{\prod p_i(x_i \mid \omega_2)\, \prod p_{ijk}(x_{ijk} \mid \omega_2) \cdots}\right) p(x \mid \omega_2)
\right] dx.
\tag{3.7}
$$
Let
$$
a(x \mid \omega_1) = \frac{p(x \mid \omega_1)\, \prod p_{ij}(x_{ij} \mid \omega_1)\, \prod p_{ijkl}(x_{ijkl} \mid \omega_1) \cdots}
{\prod p_i(x_i \mid \omega_1)\, \prod p_{ijk}(x_{ijk} \mid \omega_1) \cdots}.
\tag{3.8}
$$
The expression for $a(x \mid \omega_1)$ can be rewritten as
$$
a^2(x \mid \omega_1) =
p(x \mid \omega_1)\,
\frac{p(x \mid \omega_1)\, \prod p_{ij}(x_{ij} \mid \omega_1) \cdots}
{\prod p_i(x_i \mid \omega_1)\, \prod p_{ijk}(x_{ijk} \mid \omega_1) \cdots}\,
\frac{\prod p_{ij}(x_{ij} \mid \omega_1)\, \prod p_{ijkl}(x_{ijkl} \mid \omega_1) \cdots}
{\prod p_i(x_i \mid \omega_1)\, \prod p_{ijk}(x_{ijk} \mid \omega_1) \cdots}\,
p_{ijk \cdots N}(x_{ijk \cdots N} \mid \omega_1)^{(-1)^N}.
\tag{3.9}
$$
Therefore,
$$
\log a(x \mid \omega_1) = \frac{1}{2}\Bigg[
\log p(x \mid \omega_1) + (-1)^N \log p(x_{ijk \cdots N} \mid \omega_1)
+ \log\!\left(\frac{p(x \mid \omega_1)}{\prod p_i(x_i \mid \omega_1)}\right)
+ \log\!\left(\frac{\prod p_{ij}(x_{ij} \mid \omega_1)}{\prod p_{ijk}(x_{ijk} \mid \omega_1)}\right) + \cdots
- \log\!\left(\frac{\prod p_i(x_i \mid \omega_1)}{\prod p_{ij}(x_{ij} \mid \omega_1)}\right)
- \log\!\left(\frac{\prod p_{ijk}(x_{ijk} \mid \omega_1)}{\prod p_{ijkl}(x_{ijkl} \mid \omega_1)}\right) - \cdots
\Bigg].
\tag{3.10}
$$
Thus, the error in the approximation (from equations 3.7 and 3.10) is bounded as
$$
E \leq \frac{1}{2} P(\omega_1)\Big[ I(\omega_1) + \sum_n (-1)^n D_n(\omega_1) + (-1)^N I_N(\omega_1) \Big]
+ \frac{1}{2} P(\omega_2)\Big[ I(\omega_2) + \sum_n (-1)^n D_n(\omega_2) + (-1)^N I_N(\omega_2) \Big],
\tag{3.11}
$$
where
$$
D_n(\omega_i) = \int \log\!\left(\frac{\prod p(x_{ij \cdots (n-1)} \mid \omega_i)}{\prod p(x_{ij \cdots n} \mid \omega_i)}\right) p(x \mid \omega_i)\, dx
\tag{3.12}
$$
and
$$
D_1(\omega_i) = \int \log\!\left(\frac{p(x \mid \omega_i)}{\prod_j p_j(x_j \mid \omega_i)}\right) p(x \mid \omega_i)\, dx
\tag{3.13}
$$
reflect the Kullback-Leibler (KL) divergence of the marginalized density in the subspace of the feature subset F₁ ∩ F₂ ∩ ⋯ ∩ F_{n−1} and in the subspace of the feature subset F₁ ∩ F₂ ∩ ⋯ ∩ F_n corresponding to class ω_i, I(ω_i) is the information contained in the class ω_i, and I_N(ω_i) is the information content
A Classification Paradigm for Distributed Vertically Partitioned Data
of class ω_i in the subspace of the feature subset F₁ ∩ F₂ ∩ F₃ ∩ ⋯ ∩ F_N. In the analysis, it is assumed that ∪_i F_i = F, that is, the union of all feature subsets available to the classifiers spans the entire feature set. As a special case, when the feature set is split into N disjoint subsets, the error is

E ≤ |P(ω₁) D(ω₁)| + |P(ω₂) D(ω₂)|,   (3.14)
where D(ω_i) = D₁(ω_i) is the KL divergence between the original class-conditional density and the marginalized density in the space of feature subsets. From equation 3.14, it is evident that the approximation error goes to zero if the class-conditional divergence is zero. In other words, if the feature split is such that the feature subsets are independent, then the approximation error in the classification is zero. Also, equation 3.11 provides a guideline for an optimal feature split such that E is minimized (the algorithm for finding the optimal feature split is outside the scope of this discussion).

4 Experimental Results

In the proposed approach, the combined classification is based on utilizing the posterior probabilities computed by the individual classifiers. We first show how the posterior probabilities can be obtained from the individual classifiers when the classifiers use some standard classification techniques.

4.1 Obtaining the Posterior Probabilities.

4.1.1 k-NN Classifier. In the k-nearest-neighbor (k-NN) classifier, we can approximate the posterior for a class ω_i of a classifier as

P_est(ω_i|x) = k_i / k,   (4.1)

where k_i is the number of samples belonging to class ω_i among the nearest k samples. The estimated posterior becomes better as the value of k increases. As a special case, P_est can become zero for some particular classifier (e.g., none of the k nearest neighbors belongs to class ω_i), in which case either the numerator or the denominator in equation 2.2 becomes zero. In the second case, it is difficult to obtain the posterior probability. In order to handle such situations, we regularize the posterior as

P_est(ω_i|x) = (ε P(ω_i) + k_i) / (ε + k),   (4.2)
where ε is a small regularizing constant and P(ω_i) is the a priori probability of class ω_i obtained from the training set available to the classifier. Thus, the
posterior estimated in equation 4.2 approaches that in equation 4.1 when k is large.

4.1.2 Decision Tree Classifiers. The posterior for a class in a decision tree can be estimated as

P_est(ω_j|x) = 1 − impurity,   (4.3)
where impurity is the impurity at the leaf node: the number of samples not belonging to the assigned class ω_j of that leaf node divided by the total number of samples at that leaf node. For a large (not properly pruned) decision tree, we can obtain zero impurity at every leaf node; in that case, we can regularize the estimated posterior in the same way as for the k-NN classifier. However, since a properly pruned decision tree usually provides better generalization, the posterior is better estimated from a pruned decision tree.

4.1.3 Multilayered Perceptrons. The output of a multilayered perceptron with appropriately chosen error and activation functions approaches the posterior probability, provided that the network is sufficiently complex and training reaches the global minimum (Haykin, 1999). Although these conditions are not always satisfied in practice, the adaptation of the basis functions in a multilayered perceptron leads to better utilization of the available data than is possible with many other forms of estimation (Barron, 1993). In a multilayered perceptron having n output nodes representing n different class labels, the posterior of a sample x for a class ω_i is given as

P(ω_i|x) = v_i,   (4.4)

where v_i ∈ [0, 1] is the output of the ith node.

4.2 Results. The method has been simulated in a MATLAB environment. The data sets that we considered are shown in Table 1. All the data sets are readily available from Merz and Murphy (1996). For each data set, we partitioned the feature set vertically (there can be overlap between the partitions) and assigned one partition to each classifier. Each classifier performs the classification based on a k-NN algorithm with k = 10. It is not necessary that each local classifier utilize the same classification paradigm, and the posteriors can be obtained by any other consistent method. We obtained the posterior probability using equation 4.2 with ε = 1. We partition the feature set in the following way. For every feature, we randomly decide the number of classifiers to which it should belong. The
Table 1: Data Sets Used to Obtain the Experimental Results.
Data Set                    |F|   Number of Classes   Training Data Size   Test Data Size
Cancer                       30           2                  456                 113
Pima                          8           2                  576                 192
Diagnostic Breast Cancer     30           2                  398                 171
Digit Recognition            64          10                 3823                1797
Optical Digit               144          10                 5000                5000
number of classifiers to which the feature should belong is decided as

N_f ∈ [1, αN],   (4.5)
where α is a factor deciding the amount of overlap between the classifiers. We then randomly allocate the feature to N_f different classifiers. We experimented with overlap factors 0.4 and 0.6. Thus, in each run of the experiment, a partition is randomly generated with a given number of classifiers and a given amount of overlap. In our experiment, we obtained the classification accuracy with the two overlap factors and with the feature set partitioned among different numbers of classifiers. In each case, we obtained the classification accuracy over independent runs (as described above) and recorded the average test accuracy along with the maximum and minimum accuracies. The classification accuracy obtained from a classifier having the complete view and that obtained from classifiers having partial views are illustrated in Figures 1 through 5. In these figures, we have plotted the average test accuracy and have also indicated the maximum and minimum test accuracy (vertical lines) obtained from five independent runs. Several interesting observations can be made from the figures. First, in general there is a degradation in performance with an increasing number of partitions. However, comparing Figures 1 through 4 with Figure 5, we find that the degradation is much less for the Digit data set, which has a much larger number of features than the other data sets. This can be explained to some extent by the fact that features in higher-dimensional spaces tend to be more correlated; a partition is therefore likely to contain a proportionally larger share of the classification information. It is also interesting to observe that in certain cases, the classification accuracy with a partitioned feature set is better than that with the entire feature set. This is due to the fact that we used a k-NN algorithm for each classifier, and the performance of the k-NN algorithm is, in general, perturbed by the presence of noisy features. In the partitioned case, if a classifier receives fewer redundant features than informative ones, the classification accuracy can be even better than when all features are taken into account.
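The experimental pipeline described above can be sketched in a few lines of Python (the paper reports MATLAB simulations; this is our illustrative translation). The regularized k-NN posterior is equation 4.2; the product (sum-of-logs) fusion of per-partition posteriors is an assumption standing in for the paper's combination rule (its equation 2.2, not reproduced in this excerpt), and all function names are ours.

```python
import numpy as np

def knn_posterior(X_train, y_train, x, k=10, eps=1.0, n_classes=2):
    """Regularized k-NN posterior of equation 4.2:
    P_est(w_i | x) = (eps * P(w_i) + k_i) / (eps + k)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    neighbors = y_train[np.argsort(dist)[:k]]          # labels of k nearest samples
    prior = np.bincount(y_train, minlength=n_classes) / len(y_train)
    counts = np.bincount(neighbors, minlength=n_classes)
    return (eps * prior + counts) / (eps + k)

def fused_prediction(partitions, X_train, y_train, x, k=10, n_classes=2):
    """Fuse per-partition posteriors with a product (sum-of-logs) rule.
    NOTE: the product rule is an assumption; the paper's actual
    combination is its equation 2.2, not shown in this excerpt."""
    log_post = np.zeros(n_classes)
    for feats in partitions:   # each local classifier sees one feature subset
        p = knn_posterior(X_train[:, feats], y_train, x[feats],
                          k=k, eps=1.0, n_classes=n_classes)
        log_post += np.log(p)
    return int(np.argmax(log_post))
```

With eps > 0, each local posterior is strictly positive and sums to one, so the log-fusion never sees a zero, which is exactly the failure mode the regularization in equation 4.2 is meant to avoid.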
Figure 1: Results obtained with the Cancer data set for α = 0.4 (top) and α = 0.6 (bottom). Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 25. (Figure not reproduced; axes: percentage accuracy vs. no. of partitions.)
Figure 2: Results obtained with the Pima data set for α = 0.4 (top) and α = 0.6 (bottom). Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 25. (Figure not reproduced; axes: percentage accuracy vs. no. of partitions.)
Figure 3: Results obtained with the Diagnostic Breast Cancer data set for α = 0.4 (top) and α = 0.6 (bottom). Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 25. (Figure not reproduced; axes: percentage accuracy vs. no. of partitions.)
Figure 4: Results obtained with the Optical Digit data set for α = 0.6. Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 10. (Figure not reproduced; axes: percentage accuracy vs. no. of partitions.)
5 Discussion and Conclusion

In this article, we provided a model-free approach to combining the decisions made by local classifiers, each of which is constructed from a partial view of the feature set. Our formulation is based on the assumption that each local classifier is able to estimate the posterior probability for a given sample. We obtained a theoretical estimate of the error incurred when the posterior probabilities are estimated from partial views. It is observed that for a smaller number of features, the degradation of performance increases more rapidly with the number of partitions than it does for a larger number of features.

The proposed technique of decision-level fusion can be useful in many Web-based settings, such as knowledge management, grid computing, e-commerce, and agent-based systems, where the number of attributes is very large. The proposed method can have an impact on reducing the complexity of classifier design; on designing a classification system comprising local classifiers, each constructed from a partial view of the feature set; and on fusing the decisions without colocating the data (which may not be possible because of security or privacy issues).

Figure 5: Results obtained with the Digit data set for α = 0.6. The performance is computed over a single trial. Each individual classifier is a k-NN classifier with k = 10. (Figure not reproduced; axes: percentage accuracy vs. no. of partitions.)

In combining the decisions, it is necessary to communicate the decisions or the posterior probabilities between the servers over the network. The communication or transfer of the decisions needs to be performed so that the total communication cost is minimized. Analysis of the communication and network protocol issues is an area for further study. We combined the decisions in terms of the posterior probabilities; a similar algorithm can also be built on other measures, such as uncertainty or fuzzy ambiguity (Klir & Folger, 1991).

Different partitions of a data set provide different classification scores. A combined decision is best obtained when the classifiers are maximally independent, that is, when the approximation error in terms of the information loss (as derived in section 3) is minimum. This also constitutes a scope for future study, where it is required to determine the optimal partition under certain constraints (e.g., the minimum and maximum number of attributes that a processor should handle). The problem, in a sense, is related to the problem of feature subset selection (Devijver & Kittler, 1982).

We also assumed that the individual classifiers were unbiased. Interesting variations arise when this is not the case (either intentionally or unintentionally). Such variations can be dealt with to some extent through the use of robust statistics (Rousseeuw, 1984; Hampel, Rousseeuw, Ronchetti, & Stahel, 1986).
References

Ahmad, S., & Tresp, V. (1993). Some solutions to the missing feature problem in vision. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems. San Mateo, CA: Morgan Kaufmann.
Barron, A. R. (1993). Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
Behrouz, H. F., & Koono, Z. (1996). Ex-w-pert system: A web-based distributed expert system for groupware design. In J. K. Lee, J. Liebowitz, & Y. M. Chae (Eds.), Proc. of the Third World Congress on Expert Systems (pp. 545–552). Seoul, Korea: Scholium International.
Breiman, L. (1996). Bagging predictors. Machine Learning, 26(2), 123–140.
Caragea, D., Silvescu, A., & Honavar, V. (2001). Towards a theoretical framework for analysis and synthesis of agents that learn from distributed dynamic data sources. In S. Wermter, J. Austin, & D. Willshaw (Eds.), Emerging neural architectures based on neuroscience. Berlin: Springer-Verlag.
Cooley, R. W. (2000). Web usage mining: Discovery and application of interesting patterns from web data. Unpublished doctoral dissertation, University of Minnesota.
Devijver, P. A., & Kittler, J. (1982). Pattern recognition: A statistical approach. Upper Saddle River, NJ: Prentice Hall.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.
Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289–300.
Jacobs, R. A., Jordan, M. I., & Nowlan, S. J. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4–37.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.
Klir, G. J., & Folger, T. A. (1991). Fuzzy sets, uncertainty and information. Upper Saddle River, NJ: Prentice Hall.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases (Tech. Rep.). Irvine: Department of Information and Computer Science, University of California at Irvine. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Nowlan, S. J., & Hinton, G. E. (1990). Evaluation of adaptive mixtures of competing experts. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 2 (pp. 774–780). San Mateo, CA: Morgan Kaufmann.
Nowlan, S. J., & Sejnowski, T. J. (1995). Filter selection model for generating visual motion signals. In C. L. Giles, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 5. San Mateo, CA: Morgan Kaufmann.
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871–880.

Received December 13, 2002; accepted January 7, 2004.
LETTER
Communicated by Hugh Wilson
A Coarse-to-Fine Disparity Energy Model with Both Phase-Shift and Position-Shift Receptive Field Mechanisms

Yuzhi Chen
[email protected]

Ning Qian
[email protected]

Center for Neurobiology and Behavior and Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, U.S.A.
Numerous studies suggest that the visual system uses both phase- and position-shift receptive field (RF) mechanisms for the processing of binocular disparity. Although the difference between these two mechanisms has been analyzed before, previous work mainly focused on disparity tuning curves instead of population responses. However, tuning curve and population response can exhibit different characteristics, and it is the latter that determines disparity estimation. Here we demonstrate, in the framework of the disparity energy model, that for relatively small disparities, the population response generated by the phase-shift mechanism is more reliable than that generated by the position-shift mechanism. This is true over a wide range of parameters, including the RF orientation. Since the phase model has its own drawbacks of underestimating large stimulus disparity and covering only a restricted range of disparity at a given scale, we propose a coarse-to-fine algorithm for disparity computation with a hybrid of phase-shift and position-shift components. In this algorithm, disparity at each scale is always estimated by the phase-shift mechanism to take advantage of its higher reliability. Since the phase-based estimation is most accurate at the smallest scale when the disparity is correspondingly small, the algorithm iteratively reduces the input disparity from coarse to fine scales by introducing a constant position-shift component to all cells for a given location in order to offset the stimulus disparity at that location. The model also incorporates orientation pooling and spatial pooling to further enhance reliability. We have tested the algorithm on both synthetic and natural stereo images and found that it often performs better than a simple scale-averaging procedure.
1 Introduction

Position-shift and phase-shift (or phase-difference) models are two distinct mechanisms that have been proposed for describing receptive field (RF) profiles and disparity sensitivity of V1 binocular cells (Bishop & Pettigrew,

Neural Computation 16, 1545–1577 (2004) © 2004 Massachusetts Institute of Technology
Y. Chen and N. Qian
1986; Ohzawa, DeAngelis, & Freeman, 1990; see Qian, 1997, for a review). The position-shift model assumes that the left and right RFs of a simple cell are always identical in shape but can be centered at different spatial locations, while according to the phase-shift model, the two RFs of a cell can have different shapes but are always tuned to the same spatial location. Recent studies indicate that both the phase- and position-shift models contribute to the representation of binocular disparity in the brain (Zhu & Qian, 1996; Ohzawa, DeAngelis, & Freeman, 1997; Anzai, Ohzawa, & Freeman, 1997, 1999a; Livingstone & Tsao, 1999; Prince, Cumming, & Parker, 2000) and that the phase-shift model appears to be more prominent, although position shift may also play a significant role at high spatial frequencies (Ohzawa et al., 1997; Anzai et al., 1997, 1999a). These findings are consistent with earlier psychophysical observations that there is a correlation between the perceived disparity limit and spatial frequency as predicted by the phase model, but the disparity range at high spatial frequencies is larger than what a purely phase-shift model allows (Schor & Wood, 1983; Smallman & MacLeod, 1994). These studies raise at least two questions: What are the relative strengths and weaknesses of the phase- and the position-shift mechanisms in disparity computation, and what is the advantage, if any, of having both mechanisms? We (Zhu & Qian, 1996; Qian & Zhu, 1997) and others (Smallman & MacLeod, 1994; Fleet, Wagner, & Heeger, 1996) have previously analyzed the similarities and the differences between the two RF models. However, that work mainly focused on the disparity tuning properties of a given cell.
For the task of disparity estimation faced by the brain, it is the population responses of many cells to a given (unknown) disparity that are most relevant.1 Tuning curves and population response curves have different shapes and properties in general; they are identical only under some special conditions (Teich & Qian, 2003). Qian and Zhu (1997) did compare the disparity maps computed from the population responses of the phase- and position-shift RF models, but the study did not examine the properties of the population responses in detail and was done under the condition where the difference between the two RF models is minimal (see section 2.1). In addition, previous studies have not explored how to properly combine the phase- and position-shift mechanisms in a stereo algorithm to achieve better results than either mechanism alone. Another limitation of most previous studies on disparity computation is the exclusion of orientation tuning. Indeed, with a few exceptions (Mikaelian
1 As we will detail in section 2, for a set of cells with a range of preferred disparities, a population response curve is obtained by plotting each cell's response to a fixed stimulus disparity against the cell's preferred disparity. The stimulus disparity can be estimated from either the peak or the mean of a population response curve. In this article, we use the peak location. In contrast, one cannot estimate stimulus disparity directly from a disparity tuning curve (Qian, 1997).
& Qian, 1997, 2000; Matthews, Meng, Xu, & Qian, 2003), most previous computational studies of the disparity energy model employed one-dimensional (1D) Gabor filters (Qian, 1994; Zhu & Qian, 1996; Fleet et al., 1996; Qian & Zhu, 1997; Tsai & Victor, 2003) instead of the two-dimensional (2D), oriented filters found in the visual cortex.2 Of course, a 1D algorithm can be readily extended to 2D by replacing 1D Gabor filters with 2D Gabor filters. However, without the additional refinements presented in this article, the performance of the resulting 2D algorithm is much worse than that of the corresponding 1D algorithm. The reason is that at depth boundaries, 2D filters tend to mix up regions of different disparities more than 1D filters do (Chen & Qian, unpublished observations). In this article, we analyze the differences between the phase-shift and position-shift RF models in terms of both disparity tuning curves and population responses. We consider stimuli with and without orientations and examine various offsets between the preferred orientation of the cells and the stimulus orientation. We find that although the two RF models generate disparity tuning curves of similar quality, the phase-shift mechanism provides more reliable population responses than the position-shift mechanism does for relatively small stimulus disparities. Here, reliability is measured by the standard deviation of the peak location of the tuning curve or population response curve when certain stimulus details unrelated to disparity (such as the lateral position of a bar or the dot distribution in a random dot pattern) are varied.
Based on these and other considerations and our previous finding that phase-shift-based disparity computation is accurate only when the stimulus disparity is considerably smaller than the RF size (Qian, 1994; Qian & Zhu, 1997), we propose an iterative algorithm that employs both the phase- and position-shift mechanisms in a specific way and demonstrate that the hybrid mechanism is more reliable and accurate than either mechanism alone. After incorporating pooling across spatial location, orientation, and spatial scale, we present a coarse-to-fine model for disparity computation and demonstrate its effectiveness on both synthetic and natural stereograms. We also compare this coarse-to-fine algorithm with a simple scale-averaging procedure applied to the population responses of the position-shift mechanism.

2 Analyses and Simulations

2.1 Phase-Shift vs. Position-Shift RF Models. The details of the disparity energy model used in this work have been described previously (Ohzawa et al., 1990; Qian, 1994). Briefly, according to quantitative physiological studies, spatial RFs of a binocular simple cell can be well described
2 Some studies mentioned orientation pooling but did not really address the issue due to the 1D filters used.
by 2D Gabor functions (Daugman, 1985; Jones & Palmer, 1987; Ohzawa et al., 1990). The position-shift and phase-shift RF models can be expressed as an overall positional difference between the left-eye and right-eye Gabor functions and as a phase difference between the sinusoidal modulations of the Gabor functions, respectively. We first consider a vertically oriented Gabor function centered at the origin:

g(x, y; φ) = (1 / (2π σ_x σ_y)) exp(−x²/(2σ_x²) − y²/(2σ_y²)) cos(ωx + φ),   (2.1)

where ω is the preferred spatial frequency, σ_x and σ_y determine the x and y RF dimensions, and φ is the phase parameter for the sinusoidal modulation. (Other preferred orientations and RF centers can be obtained by rotating and translating this function, respectively.) The phase-shift model (Ohzawa et al., 1990; DeAngelis, Ohzawa, & Freeman, 1991; Ohzawa, DeAngelis, & Freeman, 1996) posits that the left and right RFs of a simple cell are expressed as

g_l^pha(x, y) = g(x, y; φ),   (2.2)
g_r^pha(x, y) = g(x, y; φ + Δφ).   (2.3)
Thus, the two RFs have the same position (both centered at the origin), but there is a phase difference Δφ between their sinusoidal modulations. In contrast, the position-shift model assumes that the left and right RFs take the form

g_l^pos(x, y) = g(x, y; φ),   (2.4)
g_r^pos(x, y) = g(x + d, y; φ).   (2.5)
These two RFs have identical shape, but there is an overall shift d between their horizontal positions. Using these RF profiles, one can compute simple and complex cell responses based on the disparity energy model (Ohzawa et al., 1990; Qian, 1994). We focus on complex cell responses here because simple cell responses are much less reliable due to their stronger dependence on the Fourier phases of the stimuli (Ohzawa et al., 1990; Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997; Chen, Wang, & Qian, 2001). It can be shown (see the appendix) that the response of a complex cell (constructed from a quadrature pair of simple cells) to a stereo image patch of disparity D can be written as

r_q^pha ≈ 4A² cos²((ωD − Δφ)/2) + 4AB (D/σ_x) cos((ωD − Δφ)/2) cos(α − β + (ωD − Δφ)/2)   (2.6)

r_q^pos ≈ 4A² cos²(ω(D − d)/2) + 4AB ((D − d)/σ_x) cos(ω(D − d)/2) cos(α − β + ω(D − d)/2)   (2.7)
for the phase- and position-shift RF models, respectively. Here A and α are the local Fourier amplitude and phase (evaluated at the RF's preferred frequency) of the stimulus patch filtered by the RF gaussian envelope; B and β are the corresponding amplitude and phase of the stimulus patch filtered by the first-order derivative of the RF gaussian envelope (see the appendix). The phases α and β are more dependent on the detailed luminance distribution of the stimulus than the amplitudes A and B. The two terms in each of equations 2.6 and 2.7 are, respectively, the zeroth- and first-order terms in D/σ_x or (D − d)/σ_x.

Before exploring the implications of the above expressions, we need to define disparity tuning curve and population response curve explicitly. For a given cell with fixed RF parameters, if we vary the stimulus disparity D and plot the response of the cell as a function of D, we obtain a disparity tuning curve. The preferred disparity of the cell is the stimulus disparity that generates the peak response in the tuning curve. For a fixed stimulus disparity D, we can consider a set of cells that prefer different disparities (e.g., have different Δφ or d parameters; see below) but are otherwise identical. If we plot the responses of these cells to the same D against their preferred disparities, we get a population response curve. A complication is that a cell's preferred disparity depends not only on its intrinsic RF parameters but also on the stimulus to some degree (Poggio, Gonzalez, & Krause, 1988; Zhu & Qian, 1996; Chen et al., 2001). The abscissa of the population response curve, however, should not depend on any stimulus parameters, as these parameters are assumed to be unknown during disparity computation. In other words, the abscissa should be only a function of the intrinsic RF parameters that uniquely label each cell in the population.
We will therefore use an intrinsic parameter (or a combination of intrinsic parameters) as the abscissa that approximates the preferred disparity of each cell. To do so, note that in equations 2.6 and 2.7, the first term (zeroth order) is usually larger than the second term (first order). If we keep only the first term, as we did in some of our previous work (Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997), the preferred disparity of a cell is a function of its intrinsic RF parameters only, given by

D_pref^pha ≈ Δφ/ω,   (2.8)
D_pref^pos ≈ d   (2.9)
for the phase- and position-shift models, respectively. We will therefore use Δφ/ω or d to label different cells along the abscissa of population response
curves. Also note that if we keep only the first terms, the stimulus disparity D can be estimated from the preferred disparity of the most active cell (denoted by *) of the population response curve according to

D_est^pha ≈ Δφ*/ω,   (2.10)
D_est^pos ≈ d*   (2.11)
for the phase- and position-shift models, respectively (Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997). We will use the same equations for disparity estimation in this article as we did previously. However, the more accurate analyses done here (the second terms in equations 2.6 and 2.7) will allow us to determine the conditions under which equations 2.8 to 2.11 are good approximations, and will lead to a better hybrid model with both phase- and position-shift mechanisms.3

We can now consider the properties of disparity tuning curves and population response curves generated with the phase- and position-shift RF models. We are particularly interested in whether and how much these curves depend on stimulus details other than disparity (such as the lateral position of a bar or the dot distribution in a random dot pattern). Obviously, such dependence is undesirable, as it makes disparity estimation unreliable.

We first consider disparity tuning curves. According to the definitions in the appendix, α and β in equations 2.6 and 2.7 strongly depend on the detailed luminance profile of the stimulus. If the second terms in the equations can be neglected, then the first term is scaled only by A. Consequently, the shape of the tuning curves will be largely independent of the stimulus details, and the peak locations will accurately follow equations 2.8 and 2.9 for the phase- and position-shift models, respectively. If the second terms are not very small, however, the tuning curves of a cell will change with stimulus details, and the peak locations may deviate significantly from equations 2.8 and 2.9. Since the stimulus disparity D has to vary over a wide range to generate a tuning curve, D/σ_x or (D − d)/σ_x cannot always be small, and thus the second terms cannot always be neglected.
Therefore, complex cell tuning curves based on either the position-shift or the phase-shift RF model must be somewhat unreliable, albeit much more reliable than simple-cell tuning curves (Qian, 1994; Zhu & Qian, 1996; Chen et al., 2001). The situation is different for population responses, however. Here the stimulus disparity D is fixed while the cell parameter Δφ or d varies over a wide range. For the phase-shift model, if D is fixed at a value much smaller than σ_x, then the second term is negligible over the entire range of Δφ because Δφ

³ Alternatively, one could solve for D from equation 2.6 or 2.7, or use those equations as templates to estimate D. The resulting method, however, will involve complex procedures that are unlikely to be implemented by the visual cells.
A Coarse-to-Fine Disparity Energy Model
1551
appears only inside the bounded cosine functions. In this case, equation 2.10 is well satisfied at the peak location of the population response curve and can be used to estimate D accurately regardless of the other details of the stimulus. Therefore, population responses generated with the phase-shift RF model should be highly reliable when the stimulus disparity is small compared with the RF size. In contrast, population responses generated with the position-shift model can never be very reliable. The reason is that the cells' position-shift parameter d is present both inside and outside the cosine functions in equation 2.7. As d varies over a wide range, the second term cannot always be small for any fixed value of D. Intuitively, when d is significantly different from D, the image patches covered by the left and right RFs will be very different, and this difference will introduce large variability into the population response curve. We therefore conclude that among the four cases of disparity tuning curves and population response curves generated with the phase- and position-shift RF mechanisms, only the population response curves from the phase-shift mechanism have highly reliable peak locations for small stimulus disparities. In all other cases, the results in general will vary with stimulus details unrelated to disparity. We have performed extensive computer simulations to confirm this conclusion. Two examples are shown in Figures 1 and 2 for bar and random dot stimuli, respectively. We did not include spatial pooling (Zhu & Qian, 1996; Qian & Zhu, 1997) in these simulations in order to see the difference between the two RF models more clearly. The difference will be reduced (but not vanish) when spatial or orientation pooling is introduced.
Figure 1 shows simulated disparity tuning curves (of a given complex cell in response to a range of stimulus disparities) and population response curves (of an array of model complex cells to a given stimulus disparity) for both phase- and position-shift RF mechanisms. The orientation of both the RFs and the stimuli (bars) was vertical in these simulations. To measure reliability in each case, we simulated 1000 disparity tuning curves or 1000 population response curves by randomly varying the lateral bar position while keeping all the other parameters constant (see the figure caption for details). For each bar position, a given stimulus disparity was introduced by shifting the left and right eyes' images in opposite directions by half the disparity value. For clarity, only 30 tuning curves and 30 population curves are shown in panels a and b, respectively, but the peak location histograms in panels c and d are compiled from all 1000 simulations. The numbers inside each histogram panel are the mean (m) and standard deviation (s) of the distribution. The vertical lines in the tuning curve panels indicate the cell's preferred disparity as determined by its parameters according to equation 2.8 (for phase shift) or equation 2.9 (for position shift). The vertical lines in the population response panels indicate the stimulus disparity. For tuning curves, small or large disparity refers to the cell's preferred disparity (1 or 5 pixels). For population response curves, small or large disparity refers to the stimulus disparity (also 1 or 5 pixels).
1552
Y. Chen and N. Qian
It should be clear from the figure that the disparity tuning curves (panels a and c) are not very reliable in all cases: the peak location distributions all show significant spread, indicating some dependence of the preferred disparity on the lateral bar position. For the population responses, the same unreliability holds for the position-shift model (panels b and d in Figures 1B and 1D). In contrast, the population response curves of the phase-shift model to small stimulus disparity (panels b and d in Figure 1A) are both reliable (the peaks from 1000 simulations are well aligned) and accurate (the peak location agrees with the actual stimulus disparity). These results are consistent with our analysis above. The population response of the phase-shift model to large stimulus disparity (panels b and d in Figure 1C) is also reliable. As we will show in Figure 2, this is not generally true, but happens to be so for the bar stimuli because the α and β parameters in equation 2.6 approximately cancel each other (see the appendix) and the two terms of the equation can be combined. Note, however, that in this case the peak location is not accurate, as it underestimates the stimulus disparity. This underestimation is due to a zero-disparity bias of the phase-shift model demonstrated previously (Qian & Zhu, 1997), and it grows with stimulus disparity size. If equation 2.6 is expanded up to the second-order term, the zero-disparity bias of the phase-shift model becomes apparent (results not shown). Also note that the curves in Figure 1 usually have side peaks (Qian, 1994; Zhu & Qian, 1996) outside the range plotted here.
[Figure 1 panels appear here: disparity tuning curves (panels a, c) and population response curves (panels b, d) for the phase-shift (A, C) and position-shift (B, D) RF models at small (A, B) and large (C, D) disparities, with peak-location histograms reporting the mean m and standard deviation s; see the Figure 1 caption for details.]
We also repeated the above simulations with random dot stereograms, and the results are shown in Figure 2 with the same format as Figure 1. The disparity tuning curves and population response curves were simulated with 1000 sets of random dot stereograms that all contain the same disparity values but different dot patterns. These curves are more variable than those in Figure 1, as reflected by the larger standard deviations of all the histograms, due to the stochastic nature of random dots. Nevertheless, Figure 2 clearly shows that the population response of the phase-shift model to small disparity (panels b and d in Figure 2A) is much more reliable and accurate than all the other cases, consistent with the results for bars in Figure 1 and with the prediction of our analysis. Note that in the large disparity case, both phase-shift and position-shift mechanisms show similarly unreliable population responses. This explains why Qian and Zhu (1997) found that the two RF mechanisms generate similar results; that study was done under the large disparity condition (D/σ_x ≥ 0.5).

Figure 1: Facing page. Disparity tuning and population response curves of model complex cells with the phase- and position-shift RFs in response to bar stereograms. Four cases are considered: (A) small disparity with the phase-shift RFs; (B) small disparity with the position-shift RFs; (C) large disparity with the phase-shift RFs; and (D) large disparity with the position-shift RFs. For the tuning curves of a given cell to a range of stimulus disparities, small or large disparity refers to the cell's preferred disparity of 1 or 5 pixels (indicated by the vertical lines). For population response curves from an array of cells to a given stimulus disparity, small or large disparity refers to a stimulus disparity of 1 or 5 pixels (also indicated by the vertical lines). We use Δφ/ω and d as approximate measures of the cells' preferred disparities in the population response plots (see equations 2.8 and 2.9). To test the reliability of tuning curves and population responses, 1000 curves were obtained for each case by randomly varying the bar's lateral position in the cells' RF with subpixel interpolation. For the tuning curves, the bar's disparity varied from −8 to 8 pixels in steps of 1 pixel. For the population response curves, a group of model complex cells whose preferred disparities varied from −8 to 8 pixels in steps of 1 pixel was used. For clarity, only 30 tuning or population response curves are shown in panels a or b, and they are normalized by the strongest response. The peak location distribution histograms in panels c or d were compiled from all 1000 curves. Each peak location was computed by a parabolic fit of three points around the peak. The numbers in the histograms represent the mean peak location m and the standard deviation s, respectively. Parameters: In each stereogram, the bar's lateral position was confined within ±8 pixels from the center of the RF. The width and height of the bar were 8 and 97 pixels, respectively. Both RFs and bars were oriented vertically. The RF parameters of model complex cells were ω/2π = 1/16 cycle per pixel, σ_x = 8 pixels, σ_y = 16 pixels. The RFs were computed in a 2D region of 49 × 97 pixels. The bin size in each histogram was 0.5 pixel.
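The three-point parabolic peak fit mentioned in the caption can be sketched as follows (a generic implementation under the assumption of a uniformly sampled curve; the function name is ours):

```python
import numpy as np

def peak_by_parabolic_fit(xs, ys):
    """Estimate a curve's peak location by fitting a parabola through the
    sampled maximum and its two neighbors (three-point quadratic fit)."""
    i = int(np.argmax(ys))
    if i == 0 or i == len(ys) - 1:
        return xs[i]  # peak at the boundary: no neighbors to fit
    y0, y1, y2 = ys[i - 1], ys[i], ys[i + 1]
    # Vertex offset of the interpolating parabola, in units of the sample step.
    denom = y0 - 2 * y1 + y2
    offset = 0.5 * (y0 - y2) / denom if denom != 0 else 0.0
    step = xs[1] - xs[0]  # assumes uniform sampling
    return xs[i] + offset * step

# Example: unit-spaced samples of a parabola peaking at x = 1.2 recover
# the true peak location, even though no sample falls exactly on it.
xs = np.arange(-4.0, 5.0)
ys = -(xs - 1.2)**2
x_peak = peak_by_parabolic_fit(xs, ys)
```

Because the fitted model matches the local shape of a smooth peak, this gives subpixel peak locations from the 1-pixel-spaced population response curves.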
[Figure 2 panels appear here, in the same format as Figure 1: tuning curves, population response curves, and peak-location histograms (mean m, standard deviation s) for the phase- and position-shift models at small and large disparities, here with random dot stereograms.]
Figure 2: Disparity tuning and population response curves of model complex cells with the phase- and position-shift RFs in response to random dot stereograms. The size of the random dot stereograms is 49 × 97 pixels. The dot size and density are 2 × 2 pixels and 50%, respectively. All RF parameters and the presentation format are identical to those in Figure 1.
2.2 An Iterative Hybrid Algorithm for Disparity Estimation. We have shown above that for the phase-shift model and small stimulus disparity (relative to the cells' RF size), the peak location of the population response curve provides both reliable and accurate estimation of the stimulus disparity, while the position-shift model is always less reliable in comparison. However, the phase-shift model has its own limitations. First, for cells with preferred horizontal spatial frequency ω, the range of disparity they can detect is confined between −π/ω and π/ω (Qian, 1994) due to the periodicity of the population response as a function of Δφ, which in turn is due to the periodicity of the Gabor RFs as a function of φ. Any disparity beyond this range cannot be correctly detected. Second, for stimulus disparities within the range, the reliability of the population responses gets worse as the disparity approaches the limits of the range (compare Figure 2Cd with Figure 2Ad). The accuracy also decreases with increasing disparity magnitude because the underestimation caused by the zero disparity bias increases.
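A quick check of this range limit, assuming the RF frequency used in the simulations (ω/2π = 1/16 cycle per pixel; the function name is ours):

```python
import numpy as np

def phase_model_disparity_range(cycles_per_pixel):
    """Detectable disparity range (-pi/omega, pi/omega) of the phase-shift
    model for cells preferring the given horizontal spatial frequency."""
    omega = 2 * np.pi * cycles_per_pixel
    return -np.pi / omega, np.pi / omega

# With omega/2pi = 1/16 cycle per pixel, the range is (-8, +8) pixels;
# disparities beyond this alias onto the wrong value.
lo, hi = phase_model_disparity_range(1 / 16)
```

The range is half the preferred spatial period of the cell on either side of zero, which is why smaller scales (higher frequencies) cover correspondingly smaller disparity ranges.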
Since there is ample evidence indicating that both the phase- and position-shift mechanisms are involved in disparity processing (Schor & Wood, 1983; Smallman & MacLeod, 1994; Zhu & Qian, 1996; Anzai et al., 1997, 1999a; Livingstone & Tsao, 1999; Prince et al., 2000), it is natural to consider whether a hybrid of the two RF models could provide a better solution. Similar to the derivations of equations 2.6 and 2.7, the response of a complex cell with both position shift d and phase shift Δφ between the two eyes can be written approximately as

\[ r_q^{hyb} \approx 4A^2 \cos^2\left(\frac{\omega(D-d) - \Delta\phi}{2}\right) + 4AB\,\frac{D-d}{\sigma_x}\,\cos\left(\frac{\omega(D-d) - \Delta\phi}{2}\right) \cos\left(\alpha - \beta + \frac{\omega(D-d) - \Delta\phi}{2}\right). \tag{2.12} \]

If we only consider the first term in equation 2.12, the preferred disparity of the cell is given by

\[ D_{pref}^{hyb} \approx \frac{\Delta\phi}{\omega} + d. \tag{2.13} \]
Since (D − d) appears outside the cosine functions in the second term of equation 2.12 just as in equation 2.7, whenever the position-shift parameter d is varied over a range for a population response curve, the computed disparity will not be reliable. Therefore, one should always rely on the phase difference Δφ for disparity computation, and d should be kept a constant close to D for all cells used in a given population response curve. The best scenario occurs if d happens to be equal to D, since the second term of equation 2.12 will then vanish and the residual disparity (D − d) for the phase mechanism to estimate will be zero; this is when the phase model is most accurate. These considerations lead us to the following iterative algorithm with a hybrid of both phase- and position-shift RFs. For each image location, we start (iteration 0) with a population of cells all with d = 0 and with Δφ covering the full range from −π to π, and apply the phase mechanism to get an estimate D_0 of the stimulus disparity D as we did before (Qian, 1994). Next, we use a set of cells all with the fixed d = D_0 and the full range of Δφ, and apply the phase mechanism again to get a new estimate D_1 (iteration 1). Since the original stimulus disparity D has been offset by the constant position shift d = D_0 of all the cells, D_1 is a measure of the residual disparity D − d = D − D_0. Thus, the estimated stimulus disparity D_est from the first iteration is D_0 + D_1. This process can be repeated such that at the nth iteration, cells will all have the same position shift d = D_0 + D_1 + ⋯ + D_{n−1}, and the newly computed D_n from the phase mechanism can be added to d to form
[Figure 3 appears here. (A) Flowchart of the iterative hybrid algorithm: initialize the position shift d = 0; estimate the disparity D_est from population responses of complex cells with the fixed d but the full range of Δφ; reset the position shift d = D_est and repeat. (B) Peak-location histograms (mean m, standard deviation s) for iterations 0 through 4.]
Figure 3: Iterative hybrid algorithm. (A) Flowchart of the algorithm. For a given stimulus, first initialize the position shift d to zero, and compute the population responses of a set of complex cells all with the same d but the full range of the phase shift Δφ from −π to π. The peak location of the population responses is extracted as the estimated disparity D_est according to equation 2.13. Then reset the position shift d = D_est, and repeat the above procedure until a stable disparity value is obtained. (B) The performance histograms for the first five iterations of the algorithm, compiled from the same 1000 random dot stereograms used in Figure 2Cd. The format of the histograms is the same as that of Figure 2Cd except that here, the maximum value of each distribution is shown above the peak in each panel. All parameters but d are the same as those used in Figure 2Cd. To simplify the simulations, we generated in advance 17 populations of cells with fixed d's from −8 to 8 pixels in steps of 1 pixel. At each iteration, we selected the population whose fixed d was closest to the estimated disparity D_est in the previous iteration.
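The loop in the Figure 3A flowchart can be sketched as follows. The phase-mechanism estimator is abstracted as a caller-supplied function; the toy stand-in below is our own assumption (not the energy-model computation) and merely underestimates large residuals, mimicking the zero-disparity bias:

```python
def iterative_hybrid_estimate(estimate_residual_phase, n_iters=5):
    """Iterative hybrid algorithm: at each step, all cells share a fixed
    position shift d, the phase mechanism estimates the residual disparity
    D - d, and d is then reset to the running estimate D_est."""
    d = 0.0                                    # iteration 0: pure phase model
    for _ in range(n_iters):
        residual = estimate_residual_phase(d)  # phase estimate of D - d
        d = d + residual                       # reset position shift d = D_est
    return d

# Toy phase mechanism: accurate for small residuals but compressive for
# large ones (a stand-in for the zero-disparity bias), so the iteration
# converges onto the true disparity.
D_true = 5.0
toy_phase = lambda d: 0.7 * (D_true - d)
D_est = iterative_hybrid_estimate(toy_phase, n_iters=6)
```

Each pass shrinks the residual the phase mechanism must handle, which is exactly why the histograms in Figure 3B tighten around the true disparity over a few iterations.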
the current estimation of disparity D_est = D_0 + D_1 + ⋯ + D_n. This algorithm is shown schematically in Figure 3A. Note that at each iteration, the (residual) disparity is always computed with the phase mechanism. The position-shift parameter d is simply introduced to offset the stimulus disparity D so that the phase mechanism will operate on a progressively smaller (residual) disparity and will therefore become more and more reliable and accurate. A demonstration of this algorithm is presented in Figure 3B. We applied the method to the same 1000 random dot stereograms (all with D = 5 pixels) as in Figure 2C. Figure 3B shows the distribution histograms of the estimated disparity up to the fourth iteration; little change occurs afterward. Since d equals 0 in the 0th iteration (far left panel), its histogram is identical to that in Figure 2Cd, with a broad distribution of estimated disparity around
the true disparity (vertical line). The situation quickly improved with a few iterations. The fraction of estimations falling within a bin of 0.5 pixel width around the true disparity increased from 19% to 85% in four iterations. This result suggests that the iterative hybrid algorithm converges quickly and that the estimation is much more accurate and reliable than a pure phase- or position-shift mechanism alone. It should be noted in Figure 3B that at each iteration, the estimated disparity always has some very small probability of being far away from the true stimulus disparity (note the negative disparity tail barely visible in the histograms). A closer examination reveals that if an estimation is far from the true disparity at the start (e.g., a wrong sign), subsequent iterations usually cannot correct the error. The reason is that if the estimated disparity is very wrong, the fixed d introduced in the next step may make the residual disparity larger than the original disparity, so the phase mechanism will become less reliable and find it harder to recover from the error. We will show that one way to reduce the occurrence of such runaway behavior is to pool across space, orientation, and scale.

2.3 Pooling. Pooling information from different sources can often improve performance. We (Zhu & Qian, 1996; Qian & Zhu, 1997) have previously demonstrated that spatial pooling of quadrature pair responses within a small neighborhood can significantly improve the quality of computed disparity maps. Here we focus on orientation pooling and spatial frequency (i.e., scale) pooling.

2.3.1 Orientation Pooling. Before considering orientation pooling, we first examine how the disparity population responses of model complex cells depend on their preferred orientations.
To generate RFs with orientation θ (measured from the positive horizontal axis), one can rotate the corresponding vertically oriented RF (equation 2.1) by θ − 90° with respect to the RF center (a positive angle means counterclockwise rotation). For the phase-shift RF model, the left and right RF centers are the same. Therefore, the response of a complex cell with preferred orientation θ to any stimulus is equal to the response of the corresponding vertically oriented complex cell to the stimulus rotated by an angle of 90° − θ. If the original stimulus has a horizontal disparity D, the rotated stimulus then has components D sin θ and D cos θ orthogonal and parallel to the RF orientation, respectively. Since the RF profile along the preferred orientation changes much more slowly than that along the orthogonal axis, the parallel component D cos θ can be ignored, and the complex cell response with the phase-shift mechanism and preferred orientation θ can be obtained approximately by replacing D in equation 2.6 by D sin θ:

\[ r_q^{pha}(\theta) \approx 4A'^2 \cos^2\left(\frac{\omega D \sin\theta - \Delta\phi}{2}\right) + 4A'B'\,\frac{D \sin\theta}{\sigma_\perp}\,\cos\left(\frac{\omega D \sin\theta - \Delta\phi}{2}\right) \cos\left(\alpha' - \beta' + \frac{\omega D \sin\theta - \Delta\phi}{2}\right), \tag{2.14} \]

where A', B', α', and β' are similar to A, B, α, and β in equations 2.6 and 2.7 except that they refer to the stimulus rotated by an angle of 90° − θ. Here, σ_⊥ represents the gaussian width of the RF in the direction perpendicular to the cell's preferred axis; it equals σ_x for vertically oriented RFs. For the position-shift mechanism, the left and right RF centers are not the same in general. Therefore, the rotational equivalence mentioned above has to be applied to the left and right RF responses separately before binocular combination. The final result is that the complex cell response with the position-shift mechanism and preferred orientation θ can be obtained approximately by replacing D − d in equation 2.7 by (D − d) sin θ:

\[ r_q^{pos}(\theta) \approx 4A'^2 \cos^2\left(\frac{\omega (D-d) \sin\theta}{2}\right) + 4A'B'\,\frac{(D-d) \sin\theta}{\sigma_\perp}\,\cos\left(\frac{\omega (D-d) \sin\theta}{2}\right) \cos\left(\alpha' - \beta' + \frac{\omega (D-d) \sin\theta}{2}\right). \tag{2.15} \]

This result depends on the assumption that the parallel component (D − d) cos θ can be ignored. Since d has to vary over a large range for a population response curve, equation 2.15 is not as good an approximation as equation 2.14. For vertical RF orientation, θ equals 90 degrees, and equations 2.14 and 2.15 reduce to equations 2.6 and 2.7, respectively. Similar to the discussion following equations 2.6 and 2.7, equations 2.14 and 2.15 also indicate that only the population response of the phase-shift mechanism to small stimulus disparity D (compared with the RF size) can be reliable and accurate. A new feature is that the sin θ factor will make the tuning curves and population response curves broader as the RF orientation θ deviates further from vertical. In the extreme case of horizontal orientation (θ = 0), the curves will be flat with infinite width. For phase-shift RFs with orientation θ, the preferred disparity expression should be generalized from equation 2.8 to

\[ D_{pref}^{pha} \approx \frac{\Delta\phi}{\omega \sin\theta} \equiv \frac{\Delta\phi}{\omega_x}, \tag{2.16} \]
and the detectable disparity range becomes (−π/(ω sin θ), π/(ω sin θ)), or (−π/ω_x, π/ω_x), where ω_x = ω sin θ (Qian, 1994; Mikaelian & Qian, 2000; Matthews, Meng, Xu, & Qian, 2003). This range is smallest for the vertically oriented RFs and increases when the RF orientation θ deviates from vertical. Here ω is the preferred spatial frequency along the axis perpendicular to the RF orientation, and ω_x is the preferred spatial frequency along the horizontal axis; they equal each other when the RF is vertically oriented. For the position-shift RFs, equation 2.9 remains the same since d is assumed to be a horizontal shift regardless of the RF orientation. (If d is assumed to be the shift orthogonal to the RF orientation, then equation 2.9 will become D_pref^pos ≈ d / sin θ.) The simulated disparity population responses and the peak-location histograms of complex cells are shown in Figure 4 for a bar with a small horizontal disparity of 1 pixel (marked by the vertical line in each panel) and an orientation of 67.5 degrees. Eight RF orientations evenly distributed in the entire 180 degree range were considered. Since complex cells with horizontal orientation are not sensitive to horizontal disparity and generate flat curves, we present only simulations from the seven nonhorizontal preferred orientations. The results for the phase- and position-shift RF models are shown in Figures 4A and 4B, respectively. For all seven preferred orientations, the population responses with the phase-shift mechanism are more reliable than those with the position-shift mechanism. For the phase-shift mechanism (see Figure 4A), the peak location of the population response depends on the difference between the RF and bar orientations. Only when the RF orientation matches the bar orientation (67.5 degrees) does the peak location agree with the actual stimulus disparity. Otherwise, the peak location underestimates the stimulus disparity magnitude. In particular, when the preferred orientation is 157.5 degrees, orthogonal to the bar orientation, the peak is located at zero disparity.
As predicted by the analysis, the width of the curves in Figure 4 varies with the RF orientation.⁴ The maximum response in each panel (max) is shown at the top of the panel. Not surprisingly, the largest response occurs when the RF orientation matches the bar orientation. We have also done similar simulations with a random dot stereogram. The results (not shown) are similar, except that since the stimulus is nonoriented, the maximum responses across all RF orientations do not differ much, and there is no orientation-dependent underestimation of the small disparity. We now consider pooling across cells with different orientations. The above results suggest that one should first average together the population response curves from all orientations and then use the peak location of the
⁴ However, for the position-shift RF model (see Figure 4B), the sharpest response curves occur when the cells' orientation matches the bar orientation (67.5 degrees) instead of when it is vertical (90 degrees). This is not predicted by the approximate result of equation 2.15.
[Figure 4 panels appear here: disparity population responses for seven RF orientations from 22.5° to 157.5° (the matched bar orientation, 67.5°, marked by an asterisk), with the maximum response and the peak-location histogram (mean m, standard deviation s) for each orientation, for (A) the phase-shift and (B) the position-shift mechanisms.]
Figure 4: Population responses to bar stereograms from model complex cells with different preferred orientations and with (A) the phase-shift and (B) the position-shift RF mechanisms. The bar stereograms have a fixed disparity of 1 pixel and a fixed orientation of 67.5 degrees. The preferred orientations of the cells are indicated above the first row of the curves, and the case where the preferred orientation matches the bar orientation is marked by an asterisk. All simulation parameters (except orientation) and the presentation format are the same as those for the population responses in Figures 1A and 1B. The number over the curves in each panel represents the maximum response to the 1000 stimuli in arbitrary units. Although the disparity range covered by a family of cells increases when the preferred orientation is closer to horizontal (see text), for ease of presentation and comparison, we confined the disparity range of all cell families to that of the vertically oriented cells.
averaged curve to estimate disparity.⁵ For oriented stimuli such as bars, the response from cells with the matched orientation is the most accurate, and it will dominate the average because it is also the strongest. For nonoriented stimuli such as random dots, cells with different orientations respond similarly, and the averaging helps reduce noise. An alternative method is to average the estimated disparities from all orientations weighted by the
⁵ This procedure is analogous to what we did previously with spatial pooling (Zhu & Qian, 1996; Qian & Zhu, 1997).
[Figure 5 panels appear here: distribution histograms of the estimated disparity after orientation pooling, with mean m and standard deviation s, for (A) phase shift with bars, (B) position shift with bars, (C) phase shift with random dots, and (D) position shift with random dots.]
Figure 5: Distribution histograms of the estimated disparity after orientation pooling. Both phase- and position-shift mechanisms were applied to bar and random dot stimuli. The bar stimuli are the same as in Figure 4, and the random dot stimuli are identical to those for panel Ad or Bd of Figure 2. The seven preferred orientations in Figure 4 were used in the pooling.
corresponding peak responses. We found that this method is usually not as good and will report simulation results only with the first method. Figure 5 shows the distribution histograms of the estimated disparity with orientation pooling applied to bar and random dot stimuli. The bar stimuli are the same as in Figure 4. As expected, the distribution for the phase-shift mechanism in Figure 5A is as good as the case where the RF orientation matches the bar orientation (67.5 degrees) in Figure 4A. The random dot stimuli are identical to those for Figure 2Ad, and the distribution in Figure 5C is much more reliable due to the pooling. In contrast, orientation pooling is much less effective for the position-shift RF model (Figures 5B and 5D). Note that we used a relatively small stimulus disparity D; if D is increased to about half of the maximum value allowed by the phase-shift model, the difference between the two RF models will disappear (results not shown).
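The first pooling method, averaging the population response curves across orientations before taking the peak, can be sketched as follows (the curves here are synthetic stand-ins of our own, not energy-model outputs):

```python
import numpy as np

def pool_orientations(curves, preferred_disparities):
    """Orientation pooling: average the population response curves from all
    orientations, then read the disparity at the averaged curve's peak.
    `curves` has shape (n_orientations, n_cells)."""
    mean_curve = np.mean(curves, axis=0)
    return preferred_disparities[np.argmax(mean_curve)]

# Toy curves: the matched orientation gives a strong, accurate curve, while
# off orientations give weak, biased ones. Because plain averaging weights
# each curve by its response amplitude, the matched orientation dominates.
pref = np.linspace(-8, 8, 33)                 # preferred disparities (pixels)
matched = 10.0 * np.exp(-(pref - 1.0)**2)     # strong, peaks at D = 1
off = 1.0 * np.exp(-(pref - 0.5)**2 / 4.0)    # weak, biased peak
curves = np.stack([matched, off, off, off])
D_est = pool_orientations(curves, pref)
```

Averaging raw curves (rather than averaging per-orientation disparity estimates) is what lets the strongest, best-matched orientation dominate for oriented stimuli while still suppressing noise for nonoriented ones.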
2.3.2 Scale Pooling and a Coarse-to-Fine Algorithm. In addition to orientation, disparity-selective cells are also tuned to spatial frequency. Cells in each frequency band (or scale) can be used to compute disparity (Marr & Poggio, 1979; Sanger, 1988; Qian, 1994), and an important question is how to combine information from different scales. Obviously, scale pooling can be done according to the same methods as orientation pooling. The first method is to average all the population response curves from different scales and then estimate the disparity.⁶ Alternatively, one could estimate one disparity from each scale and then average the estimates with proper weighting factors (Sanger, 1988; Qian & Zhu, 1997; Mikaelian & Qian, 2000). However, although these approaches work reasonably well for orientation pooling, they may not be adequate for scale pooling because the large and small scales have different problems that are unlikely to cancel each other through averaging. Cells at large scales have large RFs, and they tend to mix together different disparities in the image area covered by the RFs. This will make transitions at disparity boundaries less sharp than our perception (Qian & Zhu, 1997) and render disparities in small image regions difficult to detect. At small scales, the detectable disparity range of the cells is correspondingly smaller (Sanger, 1988; Qian, 1994), and large disparities in a stereogram may lead to completely wrong estimations. If one simply averages across the scales, the resulting disparity map will not be accurate unless the majority of the scales included perform reasonably well. With too many large scales included, the computed disparity map will lose sharp details, and with too many small scales included, disparity estimations at some locations may be totally wrong. If one knew the true stimulus disparities beforehand, one could pick a few appropriate scales and average across them only.
However, the purpose of a stereo algorithm is to compute disparity maps from arbitrary stereograms with unknown disparities. A method known to alleviate these problems is coarse-to-fine tracking across scales. This method has been applied to stereo vision previously (Marr & Poggio, 1979), but to our knowledge, its role in the disparity energy model with the phase- and position-shift RF mechanisms has not been explored. It is most natural to introduce the coarse-to-fine technique into our iterative algorithm presented in section 2.2. The only modification is to start at a large scale and then reduce the scale of the RFs through the iterations. By starting at a large scale, the algorithm can cover a large range of disparities. With each iteration, the disparity will be reduced by a constant position shift, and the residual disparity can thus be estimated by the phase mechanism at a smaller scale that sharpens the disparity boundaries. This procedure can be continued until a fine disparity map is obtained at the smallest scale.
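The coarse-to-fine scheme can be sketched as follows, with the per-scale phase estimator abstracted away. The toy estimator is our own assumption: it is exact within the scale's detectable range ±π/ω, which equals σ under the bandwidth constraint ωσ_⊥ = π used in the simulations:

```python
import numpy as np

def coarse_to_fine(estimate_residual_at_scale, sigmas):
    """Coarse-to-fine disparity estimation: visit scales from largest to
    smallest, carrying the accumulated position shift d forward so that each
    scale's phase mechanism only resolves the remaining residual disparity."""
    d = 0.0
    for sigma in sorted(sigmas, reverse=True):   # largest RF first
        d += estimate_residual_at_scale(d, sigma)
    return d

# Toy residual estimator: measures the residual exactly, but saturates at
# the scale's detectable range +/- sigma (since pi/omega = sigma when
# omega * sigma = pi).
D_true = 5.0
def toy_residual(d, sigma):
    return float(np.clip(D_true - d, -sigma, sigma))

sigmas = [8.0 / np.sqrt(2)**k for k in range(5)]  # geometric series, ratio sqrt(2)
D_est = coarse_to_fine(toy_residual, sigmas)
```

The largest scale captures the bulk of a big disparity; by the time the small scales are reached, the residual is within their narrow range, so they can sharpen boundaries without aliasing.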
⁶ This approach has been applied to disparity tuning curves (Fleet et al., 1996) and can be extended to population responses for disparity computation.
A Coarse-to-Fine Disparity Energy Model
1563
Figure 6: The coarse-to-fine algorithm with both phase- and position-shift RFs. The steps are:
1. Start at the largest scale and initialize the position shift d = 0 at each location.
2. Compute disparity population responses of complex cells with the fixed d but the full range of Δφ and θ.
3. Apply spatial pooling and orientation pooling, and estimate the disparity D_est from the pooled population responses.
4. Move to the next smaller scale, reset d = D_est, and return to step 2.
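The loop in Figure 6 can be sketched in a few lines. The snippet below is a hypothetical skeleton (not the authors' implementation): estimate_residual stands in for the full energy-model population decoding, and the phase mechanism's detectable range is approximated as |D| < π/ω.

```python
import numpy as np

# Hypothetical skeleton of the coarse-to-fine loop in Figure 6 (not the
# authors' code). estimate_residual() stands in for the full population
# decoding; here it simply returns the residual disparity clipped to the
# range the phase-shift mechanism can cover at the given frequency.

def estimate_residual(residual, omega):
    limit = np.pi / omega            # phase shifts span one period: |D| < pi/omega
    return float(np.clip(residual, -limit, limit))

def coarse_to_fine(stimulus_disparity, sigmas):
    d = 0.0                          # position shift, updated each iteration
    for sigma in sigmas:             # largest scale first
        omega = np.pi / sigma        # fixed bandwidth: omega * sigma = pi
        d = d + estimate_residual(stimulus_disparity - d, omega)
    return d                         # final disparity estimate

sigmas = 8.0 / np.sqrt(2.0) ** np.arange(5)   # geometric series, ratio sqrt(2)
```

Starting at the largest scale lets the algorithm reach a 5-pixel disparity that the finest scale alone would clip to its own smaller range.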
The full algorithm, with the spatial pooling and orientation pooling procedures incorporated, is shown schematically in Figure 6. At each iteration, the spatial pooling step combines quadrature pair responses in a local area to improve the reliability of the population responses (Zhu & Qian, 1996; Qian & Zhu, 1997), and the orientation pooling further improves the quality of the estimated disparity as described in section 2.3.1. Based on the above considerations, the largest scale should be chosen according to the largest disparity the system should be able to extract, and the smallest scale should correspond to the finest details of disparity the system should be able to recover. For the simulations reported below, we employed a set of scales whose σ's follow a geometric series with a ratio of √2. To keep the cells' frequency bandwidth constant across scales, we fixed the product ωσ⊥ = π for all scales, where σ⊥ is again the gaussian width in the direction orthogonal to the RF orientation. At each scale, there are several sets of cell populations, each with a constant position shift d and the full range of Δφ. Only the population whose d is closest to the disparity estimated at the previous scale is actually used. We simply let the range of d's be the same across all scales and equal to the disparity range of the phase-shift mechanism at the largest scale. For spatial pooling,
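As a quick check (a hypothetical snippet, not from the paper's code), the fixed product ωσ⊥ = π corresponds to a constant half-amplitude frequency bandwidth of about 1.14 octaves, the value quoted in the Figure 7 caption:

```python
import math

# Half-amplitude frequency bandwidth in octaves of a Gabor filter with
# fixed omega * sigma (standard formula for a gaussian amplitude spectrum).
c = math.sqrt(2 * math.log(2))          # half-amplitude half-width factor
ws = math.pi                            # omega * sigma, fixed across scales
bandwidth = math.log2((ws + c) / (ws - c))   # ~1.14 octaves
```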
1564
Y. Chen and N. Qian
we used a 2D gaussian function to combine the responses of quadrature pairs around each location (Zhu & Qian, 1996; Qian & Zhu, 1997). Finally, to reduce computational time, we used five RF orientations (30, 60, 90, 120, and 150 degrees from horizontal) to perform orientation pooling instead of the seven orientations in Figure 5. Figure 7C is an example of applying our coarse-to-fine algorithm to a random dot stereogram. To test the model's performance for both large and small stimulus disparities, we picked a far disparity of 5 pixels for the central region and a near disparity of −1 pixel for the surround. The panels in Figure 7C show the estimated disparity maps at each of the five iterations or scales (with one iteration per scale). At the largest scale, the transition at the disparity boundaries is poor, and the disparity magnitudes of both the center and the surround are underestimated due to the zero-disparity bias of the phase mechanism (Qian & Zhu, 1997).7 However, since this disparity map is generally in the right direction, the subsequent smaller scales were able to refine it. At the smallest scale, the map has sharp transition boundaries and accurate disparity values for both the center and surround regions. The final map is much better than those computed from any individual scale with either the phase- or position-shift mechanism. We also compared the coarse-to-fine algorithm with the simple method of averaging population responses across scales mentioned above. Since at small scales the phase-shift RF model can cover only very small ranges of stimulus disparity, we used the position-shift model with the scale-averaging method. The results of applying the method to the same random dot stereogram are shown in Figure 7D. The spatial pooling and orientation pooling were applied as in Figure 7C.
In Figure 7D, the panels from left to right show the disparity maps computed by gradually including more scales in the averaging process, with the left-most panel showing the result from the largest scale alone and the right-most panel the result of averaging all five scales. As expected, in the left two panels, where the scales are large, the disparity map is fuzzy at the transition boundaries. With more smaller scales included, the estimated disparity at some locations is totally wrong (the black and white spots in the right panels of Figure 7D). In the simulations of different scales, we fixed the stimuli and generated RFs of different sizes. Alternatively, we could fix the RFs and scale the stimuli appropriately. With this approach, both the coarse-to-fine and the scale-averaging methods generate results similar to that in Figure 7C. This suggests that the scale-averaging method is more sensitive to implementation details. Also note that for the simulations in Figure 7, at each scale, only eight complex cells were used for the coarse-to-fine simulation but 33 cells were used for the scale-averaging simulation. The coarse-to-fine method requires fewer cells because at each finer scale, the cells used are more focused around the stimulus disparity. No such adjustment is present in the scale-averaging method, and if Figure 7D had used only eight cells, the results (not shown) would be much worse. The scale-averaging method is also more sensitive to the frequency content of a stereogram; it performs worse for narrowband stimuli (results not shown). In summary, the scale-averaging method is not as robust as the coarse-to-fine method.

7 As we noted earlier, without scale pooling, the single-scale disparity maps computed with 2D filters in this article are much worse than those computed with 1D filters reported previously (Qian, 1994; Qian & Zhu, 1997) because 2D filters tend to mix up disparities in a larger region.

Figure 7: The coarse-to-fine and scale-averaging algorithms applied to a random dot stereogram. (A) A 200×200 random dot stereogram with a dot density of 50% and a dot size of 1 pixel. The central 100×100 area has a disparity of 5 pixels, while the surround has a disparity of −1 pixel. (B) True disparity map. The white and black colors represent near and far disparities, respectively. (C) Estimated disparity maps at five scales and iterations obtained with the coarse-to-fine algorithm. The spatial pooling function was a 2D gaussian with both standard deviations equal to σ⊥ at each scale. Orientation pooling covered five orientations from 30 to 150 degrees in steps of 30 degrees. The other parameters for the RFs in the left-most panel were the same as those in Figure 3. The scales of the other panels were successively reduced from left to right by a factor of √2, as indicated by the σ⊥ value under each panel. Note that for all five scales, ωσ⊥ = π; thus, the frequency bandwidths of all scales were fixed at 1.14 octaves. For each scale, the position shift d always varied from −8 pixels to 8 pixels in steps of 0.5 pixel, while the phase shift Δφ covered a period from −π to π in steps of π/4. (D) Estimated disparity maps with the scale-averaging procedure and the position-shift mechanism. From left to right, a smaller scale with the indicated σ⊥ was added to the average in each panel. Thus, the left-most panel shows the result from the largest scale alone, while the right-most panel is the average result of all five scales. The spatial pooling and orientation pooling were also applied as in C. The RF parameters were the same as those in C, except that Δφ was kept at 0.

2.4 Application of the Coarse-to-Fine Algorithm to More Complex Stereograms. We also applied the coarse-to-fine algorithm to more complex synthetic stereograms and to real-world stereograms. Figure 8 shows the results for a disparity ramp and a Gabor disparity profile. The stereograms were created by starting with a reference random dot image and then shifting each dot horizontally in opposite directions for the left and right images by half of the disparity value prescribed to the dot. Gray-level interpolation was used to represent subpixel disparities. Figures 8C and 8F demonstrate that the coarse-to-fine algorithm works well on these stereograms. Except for the slightly blurred disparity boundaries, the estimated disparity maps closely match the true disparity maps for both stereograms, with most errors within ±0.25 pixel (true for 89% and 93% of the total pixels for the ramp and Gabor stereograms, respectively). For comparison, we also show in Figure 8 the results from the scale-averaging method used for Figure 7D; again, the estimated disparities at some locations are completely wrong. Figure 9 shows three real-world stereograms and the estimated disparity maps with our coarse-to-fine algorithm and the scale-averaging algorithm. Since the true disparity maps are unknown, we can assess the performance only qualitatively.
The results computed with our coarse-to-fine algorithm seem quite reasonable for the Pentagon and Tree stereograms, but less accurate for the Shrub stereogram. In general, the method works well on image areas with relatively high-contrast textures (e.g., the grass ground of the Tree stereogram) but fails in low-contrast regions (the foreground pavement of the Shrub stereogram). This problem is not surprising, as low-contrast areas generate only weak complex-cell responses that are more prone to noise. Another problem is exemplified by the small black spot in the Pentagon disparity map: if a large scale reports a very wrong disparity, the smaller scales usually cannot correct it. Solving these problems may require more global but selective interactions of disparity information at different locations and a bidirectional information flow among the scales (see section 3). With the scale-averaging method, the problem of spots with wrong disparities is more pronounced, similar to Figures 7 and 8. Moreover, the estimated disparity maps appear more blurred than those obtained with the coarse-to-fine algorithm (e.g., the signpost in Shrub).
Figure 8: More complex synthetic stereograms. (A) A ramp stereogram with a size of 200×200 pixels. In the central 160×160 area, the disparity varies linearly from −5 pixels to 5 pixels, while the surround has zero disparity. The gray level of a pixel is randomly chosen between 0 and 1. (B) True disparity map for the ramp stereogram. The white and black colors represent near and far disparities, respectively. (C) Estimated disparity map with the coarse-to-fine algorithm (left panel) and the scale-averaging algorithm (right panel). (D) Gabor stereogram with the same size as the ramp stereogram. The disparity map is created according to a Gabor function: $D(x, y) = D_{\max} \exp\!\left(-\frac{x^2 + y^2}{2\sigma_D^2}\right) \cos(\omega_D \sin(\theta_D)\,x + \omega_D \cos(\theta_D)\,y + \phi_D)$. The parameters of the Gabor function are: D_max = 5 pixels, ω_D/2π = 1/80 cyc/pixel, σ_D = 40 pixels, φ_D = 1.39, and disparity orientation θ_D = 30°. (E) True disparity map for the Gabor stereogram. (F) Estimated disparity map with the coarse-to-fine algorithm (left panel) and the scale-averaging algorithm (right panel). All the RF parameters applied to both ramp and Gabor stereograms are the same as those in Figure 7.
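For reference, the Gabor disparity profile of panel D can be generated directly from the caption's parameter values; the snippet below is a hypothetical sketch, not the authors' code:

```python
import numpy as np

# Gabor disparity profile from the Figure 8D caption (hypothetical sketch).
Dmax, sigma_D, phi_D = 5.0, 40.0, 1.39
omega_D = 2 * np.pi / 80.0               # omega_D / 2*pi = 1/80 cyc/pixel
theta_D = np.deg2rad(30.0)               # disparity orientation

x, y = np.meshgrid(np.arange(-100, 100), np.arange(-100, 100))
D = Dmax * np.exp(-(x**2 + y**2) / (2 * sigma_D**2)) * np.cos(
    omega_D * np.sin(theta_D) * x + omega_D * np.cos(theta_D) * y + phi_D)
```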
3 Discussion

We have demonstrated through analyses and simulations that, in the framework of the disparity energy model, the phase-shift RF mechanism is better suited for disparity computation than the position-shift mechanism. Although the two RF mechanisms generate similar tuning curve properties, the phase-shift mechanism provides more reliable population response curves than the position-shift mechanism does. The phase-shift model, however, has its own limitations, such as the restricted range of detectable disparity (Qian, 1994) and a zero-disparity bias (Qian & Zhu, 1997). To overcome
Figure 9: Real-world stereograms and the estimated disparity maps with the coarse-to-fine and the scale-averaging algorithms. (A) Pentagon stereogram. The size of the stereogram is 256×256 pixels. (B) Shrub stereogram. The size is 256×240 pixels. (C) Tree stereogram. The size is 256×233 pixels. The stereograms have all been scaled to the same size in the figure for ease of presentation. All the RF parameters are the same as those in Figure 8. To attenuate the interference of the DC component in the stimuli, we subtracted the mean luminance from each stereogram before applying the filters. All three stereograms were obtained from the Carnegie Mellon University Image Database.
these problems, we have proposed an iterative procedure in which disparity is always estimated with the phase-shift mechanism, but the magnitude of disparity is gradually reduced by introducing a constant position-shift parameter to all cells at each iteration. We then considered integrating information across different RF orientations and spatial scales and found that although a simple averaging procedure (used previously for spatial pooling; Zhu & Qian, 1996; Qian & Zhu, 1997) seems sensible for orientation pooling, it may not be a good approach for scale pooling. Instead, it is better to treat scale pooling as a coarse-to-fine process. Such a coarse-to-fine process can easily be combined with the iterative procedure, with a smaller scale used at each new iteration. The final algorithm is an iterative coarse-to-fine procedure with spatial and orientation pooling incorporated and with both position-shift and phase-shift RF components.
We have applied the algorithm to a variety of stereograms. Our simulations demonstrate that the algorithm can recover both sharp disparity boundaries and gradual disparity changes in the stimuli. However, a couple of problems with the algorithm are also revealed by real-world stereograms. Unlike synthetic patterns, real-world images tend to have regions of very low contrast. Model cells whose RFs cover only a low-contrast area will not respond well, and the results will tend to be very unreliable. The solution to this problem might involve introducing long-range spatial interactions to allow disparity information to propagate from high-contrast regions to low-contrast regions. The challenge is that the interactions should also be selective in order not to smear out real disparity boundaries. It is not clear how to introduce such interactions in a physiologically plausible manner without resorting to a list of ad hoc rules. Another problem with the current algorithm is that if a large scale gives a completely wrong estimate of the stimulus disparity, it is impossible for the subsequent smaller scales to correct it. This does not happen often, thanks to the inclusion of spatial and orientation pooling, which greatly improves the reliability of estimation at any scale, but when it does happen, the estimated disparity can be far from the true value. This problem may be solved by a bidirectional information flow across the scales instead of the current unidirectional flow from large to small scales. Indeed, there is psychophysical evidence suggesting that scale interaction is bidirectional (Wilson, Blake, & Halpern, 1991; Smallman & MacLeod, 1994). The scale-averaging approach may be viewed as bidirectional. However, it does not appear to be adequate because the wrong-disparity problem becomes worse with that method. In general, we found that the coarse-to-fine algorithm is superior to the scale-averaging method, particularly with complex or natural stereograms.
But scale averaging is much easier to implement and may be useful when the precision of the disparity map is not critical. It is interesting to compare the effects of varying RF orientation and varying RF scale on disparity computation. When the RF orientation is farther from vertical, the cells' detectable disparity range increases. This is similar to an increase of the RF scale. However, as the orientation gets closer to horizontal, there are fewer cycles of RF modulation along the horizontal dimension, the tuning curves and population response curves become broader, and the disparity computation becomes less meaningful. This problem does not occur when one uses vertically oriented cells at larger scales to cover a wider disparity range, as long as the spatial-frequency bandwidth (and thus the number of RF modulation cycles) is kept constant across scales. At a fixed scale, cells with different orientations cover the same stimulus area, whereas cells with different scales cover very different stimulus areas. These differences suggest that orientation pooling and scale pooling should be treated differently, as we did in this article. In particular, since cells with different orientations at a given scale all cover the same image patch, their responses can be averaged together to represent the total
disparity energy of that patch. In contrast, cells with different scales cover different image sizes, and it appears more sensible to use a coarse-to-fine algorithm that can gradually recover disparity details. Also note that the averaging procedure of orientation pooling appears to be more effective for the phase-shift model than for the position-shift model, at least for small disparities (see Figure 5). We finally discuss the physiological relevance of our model. Our algorithm is based on the disparity energy model, which has been found to provide a good approximation of real complex-cell responses, although some discrepancies have also been noted (Freeman & Ohzawa, 1990; Ohzawa et al., 1990, 1997; DeAngelis et al., 1991; Anzai, Ohzawa, & Freeman, 1999b, 1999c; Chen et al., 2001; Livingstone & Tsao, 1999; Cumming, 2002). In addition, experimental evidence indicates that both the phase- and position-shift RF mechanisms are employed by binocular cells for disparity representation (Hubel & Wiesel, 1962; Bishop & Pettigrew, 1986; Poggio, Motter, Squatrito, & Trotter, 1985; Ohzawa et al., 1990, 1997; DeAngelis et al., 1991; Anzai et al., 1999a; Prince et al., 2000). We (Zhu & Qian, 1996; Qian & Zhu, 1997) have pointed out previously that spatial pooling of quadrature pair responses to construct complex-cell responses is consistent with the fact that the RFs of complex cells are, on average, larger than those of simple cells (Hubel & Wiesel, 1962; Schiller, Finlay, & Volman, 1976). It is also reasonable to incorporate orientation pooling, since physiological studies have demonstrated that obliquely oriented cells are tuned to disparity (Poggio & Fischer, 1977; Ohzawa et al., 1996, 1997) and thus must contribute to disparity estimation.
This is further supported by the psychophysical finding that binocular correspondence appears to be solved by oriented filters along the contours in a stimulus (Farell, 1998) and does not follow the epipolar constraint (Stevenson & Schor, 1997). On the other hand, there is no evidence for (or against) our specific proposal of using the phase- and position-shift mechanisms in an iterative, coarse-to-fine process. In particular, we do not know if real binocular cells rely on the phase-shift mechanism to estimate disparity and use the position-shift mechanism to reduce the disparity magnitude that the phase mechanism has to process. Our coarse-to-fine procedure is similar to that proposed by Marr and Poggio (1979), but with one important difference. Marr and Poggio assumed that large disparities are reduced by vergence eye movements before being processed by smaller scales. Since a vergence eye movement shifts stimulus disparity globally, a particular vergence state will not be able to reduce stimulus disparities at all spatial locations, and a different vergence state would have to be assumed for each image location. Our model requires only a single vergence state that brings stimulus disparity within a reasonable range; further disparity reduction during the estimation process is carried out by the position-shift mechanism locally at each point. This is consistent with the psychophysical observation that the scale interaction does not depend
on eye movement (Rohaly & Wilson, 1993).8 In addition, the model is consistent with the finding that, with vergence minimized, the fusional range decreases with spatial frequency (Schor, Wood, & Ogawa, 1984), because higher-frequency stimuli benefit less from the guidance of the larger scales. A recent report indicates that individual V1 cells may undergo a coarse-to-fine process over time independent of eye movement (Menz & Freeman, 2003). Therefore, it may not be necessary to implement our coarse-to-fine algorithm with several different populations of cells as we did in our simulations; instead, a single cell population progressively reducing its scale over time might be sufficient.

Appendix: Derivation of Equations 2.6 and 2.7

Based on the disparity energy model, the response of a simple cell to a stereo image pair I(x, y) and I(x + D, y) can be written as (Ohzawa et al., 1990; DeAngelis, Ohzawa, & Freeman, 1993; Qian, 1994; Anzai et al., 1999b; Chen et al., 2001):

$$r_s = \left[ \iint_{-\infty}^{\infty} \{ g_l(x, y)\,I(x, y) + g_r(x, y)\,I(x + D, y) \} \, dx \, dy \right]^2 = \left[ \iint_{-\infty}^{\infty} \{ g_l(x, y)\,I(x, y) + g_r(x - D, y)\,I(x, y) \} \, dx \, dy \right]^2, \quad (A.1)$$
where $D$ is the stimulus disparity, and $g_l(x, y)$ and $g_r(x, y)$ are the left and right RFs of the simple cell. Note that the full squaring used here is equivalent to a push-pull pair of half-squaring simple cells. We define the linear filtering of the left and right images as

$$r_{sl} = \iint_{-\infty}^{\infty} g_l(x, y)\,I(x, y)\,dx\,dy \quad (A.2)$$

$$r_{sr} = \iint_{-\infty}^{\infty} g_r(x - D, y)\,I(x, y)\,dx\,dy, \quad (A.3)$$
8 The model may also be consistent with the finding that the stereo threshold elevation with base disparity is not eliminated by the addition of a low-frequency component (Rohaly & Wilson, 1993), because the low-frequency component itself has an elevated threshold with base disparity and thus becomes unreliable at large base disparity.
so that $r_s = (r_{sl} + r_{sr})^2$. The left RF of a simple cell has the same form for the phase- and position-shift models, and we rewrite equations 2.2 and 2.4 as

$$g_l(x, y) = \cos(\phi)\,g_{\cos}(x, y) - \sin(\phi)\,g_{\sin}(x, y), \quad (A.4)$$

where

$$g_{\cos}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right)\cos(\omega x)$$

$$g_{\sin}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right)\sin(\omega x).$$

Then $r_{sl}$ for both the phase- and position-shift RF models becomes

$$r_{sl} = \cos(\phi)\iint_{-\infty}^{\infty} g_{\cos}(x, y)\,I(x, y)\,dx\,dy - \sin(\phi)\iint_{-\infty}^{\infty} g_{\sin}(x, y)\,I(x, y)\,dx\,dy = A\cos(\alpha + \phi), \quad (A.5)$$

where

$$A = \sqrt{A_1^2 + A_2^2}, \qquad \alpha = \arctan(A_2/A_1) \quad (A.6)$$

$$A_1 = \iint_{-\infty}^{\infty} g_{\cos}(x, y)\,I(x, y)\,dx\,dy$$

$$A_2 = \iint_{-\infty}^{\infty} g_{\sin}(x, y)\,I(x, y)\,dx\,dy.$$
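Equation A.5 is an instance of the harmonic-addition identity; a small numerical check (hypothetical, not from the paper) confirms it for arbitrary filter outputs A1 and A2:

```python
import numpy as np

# Check of equation A.5: cos(phi)*A1 - sin(phi)*A2 equals A*cos(alpha + phi)
# with A = sqrt(A1^2 + A2^2) and alpha = arctan2(A2, A1), for any phi.
A1, A2 = 0.7, -1.3                    # stand-ins for the integrals in A.6
A, alpha = np.hypot(A1, A2), np.arctan2(A2, A1)
phis = np.linspace(0.0, 2.0 * np.pi, 9)
lhs = np.cos(phis) * A1 - np.sin(phis) * A2
rhs = A * np.cos(alpha + phis)
max_err = float(np.max(np.abs(lhs - rhs)))   # should be at machine precision
```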
Since the right RF of a simple cell has different forms for the two RF models, we consider them separately. In equation A.3, $g_r(x - D, y)$ can be written as

$$g_r^{pha}(x - D, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\!\left(-\frac{(x - D)^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right)\cos(\omega(x - D) + \phi + \Delta\phi) \quad (A.7)$$

$$g_r^{pos}(x - D, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\!\left(-\frac{(x - (D - d))^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right)\cos(\omega(x - (D - d)) + \phi) \quad (A.8)$$
for the phase- and position-shift mechanisms, respectively. In the previous modeling analyses (Qian, 1994; Zhu & Qian, 1996; Fleet et al., 1996; Qian
& Zhu, 1997; Chen et al., 2001), a simplifying assumption is that the horizontal envelope shifts ($D$ in the gaussian term of equation A.7 and $D - d$ in the gaussian term of equation A.8) are small enough to be ignored. For the phase-shift mechanism, the assumption can be satisfied as long as the stimulus disparity $D$ is small enough. However, for the position-shift mechanism, the above assumption requires the stimulus disparity $D$ to be close to the position shift $d$. This is obviously a more stringent requirement. To compare the two mechanisms, the envelope shifts should be considered. Here, we take the first-order Taylor expansion in $D/\sigma_x$ and $(D - d)/\sigma_x$ for the two mechanisms, respectively:

$$\exp\!\left(-\frac{(x - D)^2}{2\sigma_x^2}\right) \approx \left(1 + \frac{xD}{\sigma_x^2}\right)\exp\!\left(-\frac{x^2}{2\sigma_x^2}\right) \quad (A.9)$$

$$\exp\!\left(-\frac{(x - (D - d))^2}{2\sigma_x^2}\right) \approx \left(1 + \frac{x(D - d)}{\sigma_x^2}\right)\exp\!\left(-\frac{x^2}{2\sigma_x^2}\right) \quad (A.10)$$
Under these approximations, the right RFs in equations A.7 and A.8 can be written as

$$g_r^{pha}(x - D, y) \approx \frac{1}{2\pi\sigma_x\sigma_y}\exp\!\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right)\cos(\omega(x - D) + \phi + \Delta\phi)\left(1 + \frac{xD}{\sigma_x^2}\right) \quad (A.11)$$

$$g_r^{pos}(x - D, y) \approx \frac{1}{2\pi\sigma_x\sigma_y}\exp\!\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right)\cos(\omega(x - (D - d)) + \phi)\left(1 + \frac{x(D - d)}{\sigma_x^2}\right) \quad (A.12)$$

Then, similar to the derivation of equation A.5, we have

$$r_{sr}^{pha} \approx A\cos(\alpha + \phi + \Delta\phi - \omega D) + \frac{D}{\sigma_x}\,B\cos(\beta + \phi + \Delta\phi - \omega D) \quad (A.13)$$

$$r_{sr}^{pos} \approx A\cos(\alpha + \phi - \omega(D - d)) + \frac{D - d}{\sigma_x}\,B\cos(\beta + \phi - \omega(D - d)) \quad (A.14)$$

for the phase- and position-shift models, respectively, with

$$B = \sqrt{B_1^2 + B_2^2}, \qquad \beta = \arctan(B_2/B_1) \quad (A.15)$$
$$B_1 = \iint_{-\infty}^{\infty} \frac{x}{\sigma_x}\,g_{\cos}(x, y)\,I(x, y)\,dx\,dy$$

$$B_2 = \iint_{-\infty}^{\infty} \frac{x}{\sigma_x}\,g_{\sin}(x, y)\,I(x, y)\,dx\,dy.$$

The final expressions for the simple-cell response are:

$$r_s^{pha} \approx \left[2A\cos\!\left(\alpha + \phi + \frac{\Delta\phi - \omega D}{2}\right)\cos\!\left(\frac{\Delta\phi - \omega D}{2}\right) + \frac{D}{\sigma_x}\,B\cos(\beta + \phi + \Delta\phi - \omega D)\right]^2 \quad (A.16)$$

$$r_s^{pos} \approx \left[2A\cos\!\left(\alpha + \phi - \frac{\omega(D - d)}{2}\right)\cos\!\left(\frac{\omega(D - d)}{2}\right) + \frac{D - d}{\sigma_x}\,B\cos(\beta + \phi - \omega(D - d))\right]^2 \quad (A.17)$$
Based on the well-known quadrature pair method for energy models (Adelson & Bergen, 1985; Watson & Ahumada, 1985; Pollen, 1981; Ohzawa et al., 1990; Emerson, Bergen, & Adelson, 1992; Qian, 1994), the complex cell receives inputs from two simple cells, both with identical $\Delta\phi$ but with their $\phi$ differing by $\pi/2$. The resulting complex-cell responses, to first-order approximation, are:

$$r_q^{pha} \approx 4A^2\cos^2\!\left(\frac{\omega D - \Delta\phi}{2}\right) + 4AB\,\frac{D}{\sigma_x}\cos\!\left(\frac{\omega D - \Delta\phi}{2}\right)\cos\!\left(\alpha - \beta + \frac{\omega D - \Delta\phi}{2}\right) \quad (A.18)$$

$$r_q^{pos} \approx 4A^2\cos^2\!\left(\frac{\omega(D - d)}{2}\right) + 4AB\,\frac{D - d}{\sigma_x}\cos\!\left(\frac{\omega(D - d)}{2}\right)\cos\!\left(\alpha - \beta + \frac{\omega(D - d)}{2}\right) \quad (A.19)$$
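As a quick numerical illustration (a hypothetical sketch, not the authors' code), the zeroth-order term of the phase-shift response in equation A.18 peaks where ωD = Δφ, that is, at the cell's preferred disparity D = Δφ/ω:

```python
import numpy as np

# Leading term of equation A.18: 4*A^2*cos^2((omega*D - dphi)/2), with the
# first-order (B) term ignored, evaluated over candidate stimulus disparities.
omega, dphi, A = np.pi / 4.0, np.pi / 2.0, 1.3
D = np.linspace(-4.0, 4.0, 801)
r0 = 4.0 * A**2 * np.cos((omega * D - dphi) / 2.0) ** 2
D_peak = float(D[np.argmax(r0)])         # should equal dphi/omega = 2.0
```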
These are equations 2.6 and 2.7. For a relatively thin bar at $x_o$, $I(x, y) \approx \delta(x - x_o, y)$, and equations A.6 and A.15 show $\alpha \approx \beta$.

Acknowledgments

This work was supported by NIH grant MH54125. We thank the two anonymous reviewers for their helpful comments.
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proc. Nat. Acad. Sci. USA, 94, 5438–5443.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999a). Neural mechanisms for encoding binocular disparity: Receptive field position vs. phase. J. Neurophysiol., 82, 874–890.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999b). Neural mechanisms for processing binocular information: I. Simple cells. J. Neurophysiol., 82, 891–908.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999c). Neural mechanisms for processing binocular information: II. Complex cells. J. Neurophysiol., 82, 909–924.
Bishop, P. O., & Pettigrew, J. D. (1986). Neural mechanisms of binocular vision. Vision Res., 26, 1587–1600.
Chen, Y., Wang, Y., & Qian, N. (2001). Modeling V1 disparity tuning to time-varying stimuli. J. Neurophysiol., 86(1), 143–155.
Cumming, B. G. (2002). An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418, 633–636.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2, 1160–1169.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352, 156–159.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. II. Linearity of temporal and spatial summation. J. Neurophysiol., 69, 1118–1135.
Emerson, R. C., Bergen, J. R., & Adelson, E. H. (1992). Directionally selective complex cells and the computation of motion energy in cat visual cortex. Vision Res., 32, 203–218.
Farell, B. (1998).
Two-dimensional matches from one-dimensional stimulus components in human stereopsis. Nature, 395, 689–692.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1858.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vision Res., 30, 1661–1676.
Hubel, D. H., & Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in the cat striate cortex. J. Neurophysiol., 58, 1187–1211.
Livingstone, M. S., & Tsao, D. T. (1999). Receptive fields of disparity-selective neurons in macaque striate cortex. Nature Neurosci., 2(9), 825–832.
Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B, 204, 301–328.
Matthews, N., Meng, X., Xu, P., & Qian, N. (2003). A physiological theory of depth perception from vertical disparity. Vision Res., 43, 85–99.
Menz, M. D., & Freeman, R. D. (2003). Stereoscopic depth processing in the visual cortex: A coarse-to-fine mechanism. Nature Neurosci., 6(1), 59–65.
Mikaelian, S., & Qian, N. (1997). Disparity attraction and repulsion in a two-dimensional stereo model. Soc. Neurosci. Abs., 23, 569.
Mikaelian, S., & Qian, N. (2000). A physiologically-based explanation of disparity attraction and repulsion. Vision Res., 40, 2999–3016.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat's visual cortex. J. Neurophysiol., 75, 1779–1805.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77, 2879–2909.
Poggio, G. F., & Fischer, B. (1977). Binocular interaction and depth sensitivity in striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol., 40, 1392–1405.
Poggio, G. F., Gonzalez, F., & Krause, F. (1988). Stereoscopic mechanisms in monkey visual cortex: Binocular correlation and disparity selectivity. J. Neurosci., 8, 4531–4550.
Poggio, G. F., Motter, B. C., Squatrito, S., & Trotter, Y. (1985). Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vision Res., 25, 397–406.
Pollen, D. A. (1981). Phase relationship between adjacent simple cells in the visual cortex. Science, 212, 1409–1411.
Prince, S. J. D., Cumming, B. G., & Parker, A. J. (2000). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87, 209–221.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6, 390–404.
Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18, 359–368.
Qian, N., & Zhu, Y. (1997).
Physiological computation of binocular disparity. Vision Res., 37, 1811–1827.
Rohaly, A. M., & Wilson, H. R. (1993). Nature of coarse-to-fine constraints on binocular fusion. J. Opt. Soc. Am. A, 10, 2433–2441.
Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biol. Cybern., 59, 405–418.
Schiller, P. H., Finlay, B. L., & Volman, S. F. (1976). Quantitative studies of single-cell properties in monkey striate cortex: I. Spatiotemporal organization of receptive fields. J. Neurophysiol., 39, 1288–1319.
Schor, C. M., & Wood, I. (1983). Disparity range for local stereopsis as a function of luminance spatial frequency. Vision Res., 23, 1649–1654.
Schor, C. M., Wood, I., & Ogawa, J. (1984). Binocular sensory fusion is limited by spatial resolution. Vision Res., 24(7), 661–665.
Smallman, H. S., & MacLeod, D. I. (1994). Size-disparity correlation in stereopsis at contrast threshold. J. Opt. Soc. Am. A, 11, 2169–2183.
Stevenson, S. B., & Schor, C. M. (1997). Human stereo matching is not restricted to epipolar lines. Vision Res., 37, 2717–2773.
Teich, A. F., & Qian, N. (2003). Learning and adaptation in a recurrent model of V1 orientation selectivity. J. Neurophysiol., 89, 2086–2100.
Tsai, J. J., & Victor, J. D. (2003). Reading a population code: A multi-scale neural model for representing binocular disparity. Vision Res., 43, 445–466.
Watson, A. B., & Ahumada, A. J. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2, 322–342.
Wilson, H. R., Blake, R., & Halpern, D. L. (1991). Coarse spatial scales constrain the range of binocular fusion on fine scales. J. Opt. Soc. Am. A, 8, 229–236.
Zhu, Y., & Qian, N. (1996). Binocular receptive fields, disparity tuning, and characteristic disparity. Neural Comput., 8, 1611–1641.

Received July 30, 2003; accepted January 30, 2004.
LETTER
Communicated by Ning Qian
A Preference for Phase-Based Disparity in a Neuromorphic Implementation of the Binocular Energy Model

Eric K. C. Tsang
[email protected]
Bertram E. Shi
[email protected]
Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
The relative depth of objects causes small shifts in the left and right retinal positions of these objects, called binocular disparity. This letter describes an electronic implementation of a single binocularly tuned complex cell based on the binocular energy model, which has been proposed to model disparity-tuned complex cells in the mammalian primary visual cortex. Our system consists of two silicon retinas representing the left and right eyes, two silicon chips containing retinotopic arrays of spiking neurons with monocular Gabor-type spatial receptive fields, and logic circuits that combine the spike outputs to compute a disparity-selective complex cell response. The tuned disparity can be adjusted electronically by introducing either position or phase shifts between the monocular receptive field profiles. Mismatch between the monocular receptive field profiles caused by transistor mismatch can degrade the relative responses of neurons tuned to different disparities. In our system, the relative responses between neurons tuned by phase encoding are better matched than neurons tuned by position encoding. Our numerical sensitivity analysis indicates that the relative responses of phase-encoded neurons are least sensitive to the receptive field parameters that vary the most in our system. We conjecture that this robustness may be one reason for the existence of phase-encoded disparity-tuned neurons in biological neural systems.

1 Introduction

The accurate perception of the relative depth of objects enables both biological organisms and artificial autonomous systems to interact successfully with their environment. Binocular disparity, the positional shift between the image locations of an object in two eyes or cameras caused by the difference in their vantage points, is one important cue that can be used to infer depth.
Neural Computation 16, 1579–1600 (2004) © 2004 Massachusetts Institute of Technology

Neurons in the mammalian visual cortex combine signals from the left and right eyes to generate responses selective for a particular disparity
(Barlow, Blakemore, & Pettigrew, 1967; Poggio, Motter, Squatrito, & Trotter, 1985). Ohzawa, DeAngelis, and Freeman (1990) proposed the binocular disparity energy model to explain the responses of binocular complex cells and found that the predictions of this model match measured data from the cat. Cumming and Parker (1997) showed that this model also matches data from the macaque. Although there are discrepancies between the model and the biological data (Ohzawa, DeAngelis, & Freeman, 1997; Chen, Wang, & Qian, 2001; Cumming, 2002), it remains a good first-order approximation. In the model, a neuron can achieve a particular disparity tuning by a position or a phase shift between its monocular receptive field (RF) profiles of the left and right eyes. Based on an analysis of a population of binocular cells, Anzai, Ohzawa, and Freeman (1999a) suggest that the cat primarily encodes disparity via a phase shift, although position shifts may play a larger role at higher spatial frequencies. A preference for phase encoding is also consistent with human psychophysical data (Schor, Wood, & Ogawa, 1984; DeValois & DeValois, 1988). More recent data from the alert macaque suggest that a mixture of position and phase encodings is common (Prince, Cumming, & Parker, 2002; Tsao, Conway, & Livingstone, 2003). Since computational studies indicate that populations of neurons tuned by either encoding can be used to recover disparity (Qian & Zhu, 1997), the natural question that arises is, What are the relative advantages of one encoding scheme over the other? This letter suggests one possible advantage of neurons tuned by phase shifts: robustness in the presence of neuronal variability, which we discovered in the course of developing a neuromorphic implementation of disparity-tuned neurons.
Although physiological measurements reveal a large diversity of responses from populations of neurons, most computational studies of the binocular energy model assume retinotopic arrays of neurons that are identical except for position or phase shifts. In our neuromorphic implementation, variability is unavoidable due to manufacturing tolerances. Thus, we face a problem similar to that of biological systems: computing with inexact computing elements. Solutions to this problem will lead to more robust engineering systems, as well as possible insights into biological computational strategies.

Section 2 reviews the binocular energy model and the encoding of disparity by position and phase shifts. Section 3 describes our implementation, which is constructed using a pair of silicon retinas that feed into a pair of silicon chips containing retinotopic arrays of neurons with monocular Gabor-type RF profiles. Section 4 presents measured results from the system. We find that the relative responses of neurons tuned by phase encoding are in better agreement with theoretical predictions than neurons tuned by position encoding. In our system, random transistor mismatch introduces mismatch between the nominally identical monocular neurons used to construct the binocular responses. We investigated the sensitivities of the position and phase models numerically to variations in different parameters of the monocular RF profiles. For most parameters, phase encoding
is more robust than position encoding. We also characterized the variations in the monocular RF profiles within one chip. Serendipitously, we find that the phase model is least sensitive to the parameters that vary the most in our system. Section 5 summarizes our results and future directions.

2 The Binocular Energy Model

Ohzawa et al. (1990) proposed the binocular energy model to explain the response of binocular complex cells measured in the cat. The model is similar to the motion energy model previously proposed by Adelson and Bergen (1985) for human perception of visual motion. Anzai, Ohzawa, and Freeman further refined the model (1999a, 1999b, 1999c). In the model, the response of a binocular complex cell is the linear combination of the outputs of four binocular simple cells, as shown in Figure 1. Anzai et al. (1999b) model the response of a binocular simple cell using a linear binocular filter followed by a half-squaring nonlinearity. The linear binocular filter sums the responses of two monocular filters modeled by Gabor functions. Let I(x) denote image intensity, where x ∈ R². The output of a monocular Gabor filter m(c, φ, I) is given by

  m(c, φ, I) = ∫ g(x; c, φ) I(x) dx

  g(x; c, φ) = κ exp(−(1/2)(x − c)ᵀ C⁻¹ (x − c)) cos(Ωᵀ(x − c) + φ),
Figure 1: The binocular energy model of a complex cell (adapted from Anzai et al., 1999a). The response of a binocular complex cell is the linear combination of four binocular simple cell outputs. Each binocular simple cell is constructed from a linear binocular filter followed by a half-squaring nonlinearity. The monocular filters combined by the binocular filter are modeled as Gabor functions.
where Ω ∈ R² and C ∈ R²ˣ² control the spatial frequency tuning and bandwidth of the filter, and κ controls the gain. These parameters are assumed to be the same in all of the simple cells that make up a complex cell. However, the center position c ∈ R² and the phase φ ∈ R vary between the two eyes and among the four simple cells. The linear binocular filter output is

  b(c_R, c_L, φ_R, φ_L) = m(c_R, φ_R, I_R) + m(c_L, φ_L, I_L),

where the subscripts R and L differentiate the right and left eye center positions, phases, and image intensities. The simple cell output is given by

  r_s = (|b(c_R, c_L, φ_R, φ_L)|⁺)²,

where |b|⁺ = max{b, 0} is the positive half-wave rectifying nonlinearity. While the response of simple cells depends heavily on the stimulus contrast and phase in addition to the disparity and position of the stimulus, the response of complex cells is largely independent of the phase and contrast. The binocular energy model posits that complex cells achieve phase invariance by linearly combining the outputs of four simple cells whose binocular filters differ in phase by π/2. If r_c(c_R, c_L, φ_R, φ_L) is the complex cell output, then

  r_c(c_R, c_L, φ_R, φ_L) = Σ_{k=0}^{3} r_s(c_R, c_L, φ_R + kπ/2, φ_L + kπ/2).

Since filters that differ in phase by π are equal except for a change in sign,

  r_s(c_R, c_L, φ_R + π, φ_L + π) = (|b(c_R, c_L, φ_R, φ_L)|⁻)²
  r_s(c_R, c_L, φ_R + 3π/2, φ_L + 3π/2) = (|b(c_R, c_L, φ_R + π/2, φ_L + π/2)|⁻)²,

where |b|⁻ = max{−b, 0} is the negative half-wave rectifying nonlinearity. Thus, we can construct the four required simple cell outputs from the positive and negative half-wave rectified outputs of two binocular filters.

Complex cells constructed according to the binocular energy model are selective for disparities in the direction orthogonal to their preferred orientation. For example, cells tuned to vertical orientations are selective for horizontal disparities. Their disparity tuning depends on the relative center positions and the relative phases of the two monocular filters that make up the binocular linear filter. Define the disparity, d, of a binocular stimulus to be the shift of the right eye stimulus with respect to the left eye stimulus, I_R(x) ≈ I_L(x − d). A binocular complex cell whose monocular filters are shifted by Δc = c_R − c_L and Δφ = φ_R − φ_L will respond maximally when the input has a preferred disparity

  D_pref ≈ Δc − Δφ/Ω,

where Ω is the spatial frequency tuning in the horizontal direction. However, this is only an approximation, as the preferred disparity depends also on the frequency content of the input stimulus (Qian, 1994; Zhu & Qian, 1996). Disparity is encoded by a position shift if Δc ≠ 0 and Δφ = 0. Disparity is encoded by a phase shift if Δc = 0 and Δφ ≠ 0. The cell uses a hybrid encoding if both Δc ≠ 0 and Δφ ≠ 0. Phase encoding and position encoding are equivalent for the zero-disparity-tuned cell. Assuming that the monocular RFs have the same centers, the complex cell output of Figure 1 is tuned to zero disparity.

3 Experimental Setup

3.1 System Overview. Figure 2 shows a block diagram of our binocular cell system, which uses a combination of analog and digital processing. The visual front end consists of two silicon retina chips (Zaghloul & Boahen, 2004a, 2004b) mounted behind 4 mm lenses that focus light onto them. Each retina contains a 60 × 96 pixel array of phototransistors that sample the incident light and convert it into a current. Processing circuits on the chip generate spike outputs that mimic the spiking responses of sustained-ON, sustained-OFF, transient-ON, and transient-OFF retinal ganglion cells. Our system uses only the sustained outputs, which respond continuously to a static image. ON and OFF cells respond to positive and negative contrasts. The ON and OFF cells are arranged in arrays of 30 × 48 neurons. Each chip has an asynchronous digital output bus that communicates the spikes in the array off chip using the address event representation (AER) protocol (Boahen, 2000). The two Gabor-type chips perform the required monocular filtering operations (Choi, Shi, & Boahen, 2004). Each chip can process a 32 × 64 pixel input image, which is encoded as the difference in spike rate of ON and OFF input channels, which are supplied using the AER protocol. Our system uses only the upper left 30 × 48 pixels due to the lower resolution of the silicon retinas.
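Before describing the chips, the energy-model construction of Section 2 can be checked with a minimal one-dimensional numerical sketch. This is our own illustration with made-up parameters (Gaussian-envelope Gabors as in Section 2, not the chip's filters): a phase-encoded cell with Δc = 0 and Δφ = −π/2 is probed with bar stimuli at varying disparities, and its response should peak near D_pref ≈ −Δφ/Ω.

```python
import numpy as np

def gabor(x, c, phi, omega=0.55, sigma=4.0):
    """1D monocular Gabor RF profile g(x; c, phi) with gaussian envelope."""
    return np.exp(-0.5 * ((x - c) / sigma) ** 2) * np.cos(omega * (x - c) + phi)

def complex_cell(img_L, img_R, x, phiL, phiR, c=0.0):
    """Binocular energy model: sum of four half-squared simple cells whose
    binocular filters differ in phase by pi/2 (k = 0..3)."""
    r = 0.0
    for k in range(4):
        b = (gabor(x, c, phiL + k * np.pi / 2) @ img_L
             + gabor(x, c, phiR + k * np.pi / 2) @ img_R)  # linear binocular filter
        r += max(b, 0.0) ** 2                              # half-squaring
    return r

x = np.arange(-40.0, 41.0)
omega, dphi = 0.55, -np.pi / 2       # phase shift tuned to positive disparity
d_pref = -dphi / omega               # D_pref = Δc − Δφ/Ω with Δc = 0, ≈ 2.86

bar = lambda center: (np.abs(x - center) <= 2).astype(float)
disparities = np.arange(-10, 11)     # I_R(x) = I_L(x − d): shift the right bar
resp = [complex_cell(bar(0), bar(d), x, 0.0, dphi) for d in disparities]
peak = int(disparities[np.argmax(resp)])
print(peak)                          # lands within a position or so of d_pref
```

Because the four half-squared simple cells come in sign-flipped pairs, the same sketch also exhibits the phase/contrast invariance claimed above: flipping the contrast polarity of both images leaves the complex response unchanged.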
Each chip filters its input image with both EVEN and ODD symmetric Gabor-type filters, corresponding to phases 0 and −π/2. Four silicon neuron circuits at each pixel encode the positive (ON) and negative (OFF) components of the EVEN and ODD filter outputs as spike trains, which are communicated off chip using AER. We refer to these signals using the abbreviations e+, e−, o+, and o−. The Gabor-type filters differ from Gabor filters in that the modulating function is not gaussian but still decays with distance from the origin. For a one-dimensional image, the RF profile can be approximated by

  g(x; c, φ) = κ (ΔΩ/2) exp(−ΔΩ|x|) cos(Ω(x − c) + φ),   (3.1)

where x ∈ R indexes pixel position. The parameters Ω ∈ R and ΔΩ ∈ R control the spatial frequency tuning and bandwidth, κ controls the gain, c ∈ R indicates the center position, and φ ∈ R represents the phase of the profile. Since the distance between the points where the Fourier transform of the RF profile drops to approximately half its peak value is 2ΔΩ, we refer to ΔΩ as the half-bandwidth. The difference in the filter shape should not affect the resulting binocular complex cell responses significantly; Qian and Zhu (1997) have shown that the binocular complex cell responses in the energy model are insensitive to the exact shape of the modulating envelope. Because we expect primarily horizontal disparities, we tune the Gabor-type chips for vertical orientations.

Figure 2: Block diagram of the neuromorphic implementation. Connections via AER are denoted with oppositely oriented arrows. The outputs of the address filters are ON-OFF encoded voltage spike trains. The outputs of the binocular combination block are spike rate estimates obtained by binning. (A) The system configured for position encoding. (B) The system configured for phase encoding of positive disparity. Negative disparity can be encoded by swapping the routing between the ON and OFF channels from one retina in the binocular combination block.

The two address filters and the binocular combination block combine the outputs of the two Gabor-type chips to implement two phase-quadrature binocular filters. Each address filter passes only those spikes from the four neurons at a desired retinal location and demultiplexes them as voltage pulses on four separate digital lines. The binocular combination block merges the spike trains from the two eyes into four lines that represent the outputs of the two phase-quadrature binocular filters differentially. By changing the retinal location selected from the left and right eyes in the AER address filters, we can change the position encoding of disparity. By altering the routing in the binocular combination block, we can change the phase encoding. We implement the address filters and the binocular combination block using Xilinx complex programmable logic devices (CPLDs). The spike trains on the four output lines of the binocular combination block are binned over a 40 ms time window to estimate the spike rate and then differenced to recover the two binocular filter outputs. The binocular filter outputs are then positive and negative half-squared to compute four simple cell outputs. The simple cell outputs are summed to compute the complex cell output. Spike binning is performed on the same Xilinx CPLD used to implement the binocular combination block.
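The rate-domain arithmetic just described (bin over a 40 ms window, difference each +/− line pair, half-square both signs, and sum) can be sketched in software as follows. The spike trains and function names here are hypothetical illustrations, not the CPLD or microcontroller code:

```python
import numpy as np

def spike_rate(spike_times, t0=0.0, window=0.040):
    """Estimate a line's spike rate by counting spikes in a 40 ms bin."""
    t = np.asarray(spike_times)
    return np.sum((t >= t0) & (t < t0 + window)) / window

def complex_response(b1_plus, b1_minus, b2_plus, b2_minus):
    """Difference the +/− line rates to recover the two binocular filter
    outputs, half-square each sign (the four simple cells), and sum."""
    total = 0.0
    for plus, minus in ((b1_plus, b1_minus), (b2_plus, b2_minus)):
        b = spike_rate(plus) - spike_rate(minus)
        total += max(b, 0.0) ** 2 + max(-b, 0.0) ** 2
    return total

# Hypothetical spike times (seconds) on the four output lines.
b1p = [0.001, 0.005, 0.012, 0.020, 0.033]   # 5 spikes in 40 ms -> 125 Hz
b1m = [0.004, 0.022]                        # 2 spikes -> 50 Hz
b2p = [0.010]                               # 1 spike  -> 25 Hz
b2m = [0.003, 0.015, 0.027]                 # 3 spikes -> 75 Hz
print(complex_response(b1p, b1m, b2p, b2m)) # (125-50)^2 + (25-75)^2 = 8125.0
```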
The final differencing, squaring, and summing operations are done by an 8051 microcontroller (MCU) running at 24 MHz. The final binocular cell responses are sent to a PC via the MCU serial port for analysis. The remaining sections describe the design of the AER address filter and the binocular combination block in more detail and can be skipped without loss of continuity. The silicon retina and Gabor-type transceiver chip have been described elsewhere. The final steps computed by the microcontroller are not discussed further, as they are quite straightforward.

3.2 Address Filter. The AER output of each Gabor-type chip encodes the spike activity of all the neurons on that chip. The address filter extracts only those spikes corresponding to the four neurons at a selected retinal location. The AER protocol communicates spikes from one chip, the transmitter, to another, the receiver. The transmitter signals the occurrence of a spike in the array by placing an address on the bus that uniquely identifies the neuron that spiked. Each neuron is assigned a unique X (column) and Y (row) address pair. The time that the address appears on the bus encodes the spike time directly.

Figure 3: Block diagram of the address filter showing the handshaking signals used by the AER protocol.

As illustrated at the top of Figure 3, spike addresses are represented using a word-serial format consisting of bursts of addresses, each of which encodes the locations of all simultaneously spiking neurons within one row. The first address within each burst corresponds to a chip address, identifying the chip that the spikes originate from. The next address identifies the row. Subsequent addresses identify the columns containing spiking neurons. Simultaneous spikes from different rows are handled by an arbiter that arranges them into sequential bursts. Three handshaking signals (ReqY, ~ReqX, and Ack) ensure that addresses are communicated correctly. The transmitter takes ReqY high and low to signal the start and the end of a burst. It takes ~ReqX low to signal the row and the subsequent column addresses. The receiver makes a transition on Ack to acknowledge every transition on either of the two request lines. For the Gabor-type chip, each retinal position has four associated neurons, which are arranged in 2 × 2 blocks. ON and OFF neurons are interleaved in alternating columns. EVEN and ODD neurons are located in alternating rows. Thus, the least significant bit (LSB) of the X address encodes ON-OFF, and the LSB of the Y address encodes EVEN-ODD. As shown in Figure 3, the address filter extracts only those spikes corresponding to the four neurons at a desired retinal position and demultiplexes these spikes as voltage pulses on four separate wires. As addresses appear on the AER bus, the filter generates the Ack signal that is sent back to the Gabor-type chip. Two latches latch the row and column addresses of each spike, which are compared with the row and column addresses of the desired retinal location. In our addressing scheme, the retinal location is encoded on bits 1 to 6 of the address. Once the decoder detects a spike from the desired retinal location, it generates a voltage pulse, which is demultiplexed onto one of four output lines depending on the LSBs of the latched row and column addresses.

To avoid losing events, we implement the address filter on a Xilinx XC9500 series CPLD to minimize the time that the address filter requires to process each spike. We chose this series because of its speed and flexibility. The block delay in each macrocell is only 7 ns. The series supports in-system programming, enabling rapid debugging during system design. Because the AER protocol is asynchronous, we needed to pay particular attention to the timing in the signal path to ensure that addresses are latched correctly and to avoid glitches that could be interpreted as output spikes. First, we connected the ~ReqX signal from the Gabor-type chips that signals the validity of both row and column addresses to an ungated global clock pin (GCLK) to generate the signal that latches the row and column addresses. Because the GCLK pin is routed to every logic block directly, this ensures that the flip-flops latch the address bits simultaneously.
Second, rather than using automatic place and route, we manually equalized the path delays in the address comparison circuits and in the 1-to-4 decoder. This was essential since the routing delays in the switching matrix can be on the same order as the block delays.

3.3 Binocular Combination Block. The binocular combination block combines the eight monocular spike trains (four from each eye) to compute two phase-quadrature binocular filter outputs. For a zero-disparity-tuned cell, whose required routing is shown in Figure 2A, we set the AER address filters so that they extract spikes from monocular neurons corresponding to the same retinal location in the left and right eyes (Δc = 0). To compute the output of the first binocular filter, B1, the binocular combination block sums the outputs of the left and right eye EVEN filters by merging the spikes from the e+ channels onto a positive output line, B1+, and the spikes from the e− channels onto a negative output line, B1−. Although the difference between the spike rates on B1+ and B1− encodes the B1 filter output, the B1+ and B1− spike rates do not represent the ON and OFF components of B1, since they may be positive simultaneously. To compute the output of the second
filter, B2, the binocular combination block merges the spikes from the ODD channels similarly. We can electronically reconfigure the system to construct the binocular filter outputs required for neurons tuned to nonzero disparities. For position encoding, we change the relative addresses selected by the AER address filters to set Δc ≠ 0, but leave the binocular combination block unchanged. For phase encoding, we leave the AER address filters unchanged and alter the routing in the binocular combination block. Because the RF profiles of the Gabor-type chips have two phase values, altering the routing can yield four distinct binocular filters with monocular filter phase shifts of Δφ = −π/2, 0, π/2, and π. Neurons with phase shifts of −π/2, whose required routing is shown in Figure 2B, are tuned to positive disparities. Neurons with phase shifts of π/2, which can be obtained by swapping the routing of the ON and OFF signals from one eye in Figure 2B, are tuned to negative disparities. Neurons with phase shifts of π, which can be obtained by swapping the routing of the ON and OFF signals from one eye in Figure 2A, are tuned inhibitory complex cells whose outputs are minimized by zero-disparity stimuli. We implement the binocular combination block using a Xilinx XC9500 series CPLD. Inputs control the monocular phase shift of the resulting binocular filter by modifying the routing. For simplicity, we merge spikes using inclusive OR gates without arbitration. Although simultaneous spikes on the left and right channels could be merged into a single spike, the probability that this will happen is negligible, since the width of the voltage pulse representing each spike (~32 ns) is much smaller than the interspike intervals, which are on the order of milliseconds.
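A behavioral sketch (not the CPLD logic itself) of the address-filter demultiplexing of Section 3.2 and the EVEN-channel merging of the combination block might look as follows. The bit layout (retinal location on bits 1–6, ON/OFF on the X LSB, EVEN/ODD on the Y LSB) follows the text, but which LSB polarity means ON versus OFF, and routing e+ spikes to the B1+ line, are our assumptions:

```python
import math

def filter_spikes(y_addr, x_addrs, want_row, want_col):
    """AER address filter: keep spikes at one retinal location and demux
    them onto four lines (e+, e-, o+, o-) using the address LSBs."""
    if (y_addr >> 1) & 0x3F != want_row:          # bits 1..6: retinal row
        return []
    parity = 'e' if (y_addr & 1) == 0 else 'o'    # Y LSB: EVEN/ODD (assumed)
    lines = []
    for x in x_addrs:
        if (x >> 1) & 0x3F == want_col:           # bits 1..6: retinal column
            lines.append(parity + ('+' if (x & 1) == 0 else '-'))  # X LSB: ON/OFF
    return lines

def combine_even(left, right, dphi):
    """Binocular combination of the EVEN channels: merge (inclusive OR) the
    two eyes' spike trains onto the B1+/B1- lines. dphi = pi swaps one
    eye's ON/OFF routing, giving the tuned inhibitory cell of the text."""
    if dphi == math.pi:
        right = {'e+': right['e-'], 'e-': right['e+']}
    b1_plus = sorted(left['e+'] + right['e+'])
    b1_minus = sorted(left['e-'] + right['e-'])
    return b1_plus, b1_minus

# One burst from Y address 30 = retinal row 15; X addresses 48 and 49 are
# the ON and OFF neurons of retinal column 24, and 50 belongs elsewhere.
print(filter_spikes(30, [48, 49, 50], 15, 24))    # ['e+', 'e-']

L = {'e+': [0.001, 0.009], 'e-': [0.004]}          # spike times (s) per line
R = {'e+': [0.002], 'e-': [0.006, 0.011]}
print(combine_even(L, R, 0.0))                     # zero-disparity routing
```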
4 Results

This section describes the measured responses from the binocular complex cell implementation described in the previous section, the results of a sensitivity analysis of the responses to mismatch in the parameters of the monocular RFs, and a characterization of the monocular RF mismatch in our system.

4.1 Binocular Receptive Field Maps—Technique. Figure 4 shows our experimental setup for measuring the binocular RFs of the complex cells. Both Gabor-type chips were tuned for vertical orientations with similar RF parameters, as listed in Table 1. The RF parameters were estimated by fitting the monocular Gabor-type chip responses to equation 3.1. We presented each retina with a 2.4 cm wide vertical dark bar on a bright background using two LCD monitors placed 55 cm from the retinas. This arrangement meant that the bar spanned four adjacent retinal positions, corresponding to about one-third of the period of the sinusoidal modulation in the RF profile and two-thirds of the width of the modulating envelope, which is defined as 2(ΔΩ)⁻¹. Using two monitors gives us precise control over the input disparity. We stimulated 31 × 31 pairs of left and right stimulus positions. The visual angle between two adjacent stimulus positions was 0.935 degree. We characterized the responses of five binocular complex cells: one cell tuned to zero disparity and two pairs of cells tuned to positive and negative disparities by phase or position encoding. We obtained the zero-disparity-tuned cell and the phase-encoded cells by combining the spikes from pixels with (row, column) address of (15, 24) in the left and right chips. For the position-encoded cells, we combined the spikes from pixel (15, 24) in the right chip with the spikes from pixels (15, 20) and (15, 28) in the left chip.

Figure 4: The experimental setup for measuring the binocular RF maps. The monitors were moved closer to the retinas to enable more detail of the hardware system to be shown. The insert board converts the AER address format used by the silicon retina to that used by the Gabor-type chips by inserting a chip address into each burst.

Table 1: Tunings for Two Gabor-Type Chips on the Experimental Setup.

              Left     Right
  κ           2152     2265
  ΔΩ          0.3059   0.2494
  Ω           0.5226   0.5748

4.2 Binocular Receptive Field Maps—Results. Figure 5 shows the binocular RF maps measured as described in the previous section. The map
for the zero-disparity-tuned neuron (Figure 5E) shows the expected peak in the center of the map. Maps for neurons tuned to positive (Figures 5C and 5F) or negative (Figures 5A and 5D) disparities show this peak shifted above or below the main diagonal.

Figure 5: Binocular RF maps shown as images where the gray level of each pixel encodes the response of the complex cell to a pair of spatial bar stimuli applied to the left and right Gabor-type chips. Darker gray levels indicate higher firing rates. (A, D) The RF maps of neurons tuned to negative disparities by phase and position encoding. (B) The regions of the maps corresponding to different disparities. (C, F) The RF maps of neurons tuned to positive disparities by phase and position encoding. (E) The map of the neuron tuned to zero disparity, which is the same for both position and phase encoding.

For position encoding (Figures 5D and
5F), the peaks are qualitatively similar to the peak of the zero-disparity neuron, only shifted. The responses of the neurons tuned by phase encoding (Figures 5A and 5C) are qualitatively different from the response of the zero-disparity-tuned neuron, exhibiting suppression in their response to the opposite disparity.

4.3 Disparity Tuning Curves. Figure 6 shows the disparity tuning curves for complex cells tuned to negative, zero, and positive disparities by phase and position encoding, which we computed by integrating the responses in Figure 5 along lines of constant disparity. We also computed
Figure 6: Disparity tuning curves for the disparity-tuned cells in the phase and position models. (A, B) The measured tuning curves for the phase and position models. Tuning curves for different neurons are denoted by lines of different widths. Thick line: zero-disparity-tuned neuron. Medium line: negative-disparity-tuned neuron. Thin line: positive-disparity-tuned neuron. (C, D) The ideal tuning curves assuming perfectly matched monocular RF profiles, using the least-squares fitted RF parameters of the measured monocular responses from one eye. Disparities are expressed in units of discrete stimulus position (0.935 degrees/position).
the ideal disparity tuning curves by assuming that the monocular responses to inputs at a position x had the form

  m(c, φ, x) = κ (ΔΩ/2) exp(−ΔΩ|x|) cos(Ω(x − c) + φ),

where the parameters κ, Ω, and ΔΩ were obtained by a least-squares fit of the expression to the measured monocular responses from one eye, and φ was either 0 or π/2. Although we do observe the expected shifts in the locations of the peak responses for different disparity tunings, the relative sizes of the peaks for the phase-encoded neurons appear to match the theoretical predictions of the disparity energy model better than those of the position-encoded neurons. The model predicts that the disparity tuning curves for the position-encoded neurons should be identical in size and shape, just shifted to the left or the right. However, the measured responses show three peaks with varying heights. These deviations between expected and measured responses are due primarily to mismatch between the monocular RF profiles of the Gabor-type chips. Because subsequent processing is implemented in digital hardware, there is no variability introduced by these stages, except due to quantization, which is negligible here.

The disparity energy model generally assumes that the left and right monocular RFs have the same gain and spatial frequency tunings. However, random transistor mismatch within the Gabor-type chips causes the gain and spatial frequency tuning to vary from neuron to neuron. The presence of this mismatch and its observed effect on the relative responses of neurons tuned to different disparities raises a question: Which of the two encoding schemes is more robust in the presence of these variations? As described below, we addressed this question in two parts. First, how does the variation in the parameters controlling the monocular RF profiles affect the relative outputs of complex cells? Second, how much do the RF parameters vary from neuron to neuron on the chip?

4.4 Sensitivity Analysis. We model the mismatch by assuming the responses of the monocular filters have the same parameterized form, but that the parameters vary from pixel to pixel.
For simplicity, we assume one-dimensional input images consisting of a single spatial impulse and model the response of the four neurons at the origin of retinal coordinates to an impulse at pixel x by

  h_{e+}(x) = κ_{e+} |h_e(x) − ξ_e|⁺     h_{o+}(x) = κ_{o+} |h_o(x) − ξ_o|⁺
  h_{e−}(x) = κ_{e−} |h_e(x) − ξ_e|⁻     h_{o−}(x) = κ_{o−} |h_o(x) − ξ_o|⁻
where

  h_e(x) = (ΔΩ/2) exp(−ΔΩ|x|) cos(Ωx + φ_e)   and   h_o(x) = (ΔΩ/2) exp(−ΔΩ|x|) sin(Ωx + φ_o).

We examined the effect of mismatch in five RF parameters: the spatial frequency Ω, the half-bandwidth ΔΩ, the offsets (ξ_e, ξ_o), the phase offsets (φ_e, φ_o), and the gains (κ_{e+}, κ_{e−}, κ_{o+}, κ_{o−}). Mismatch in the analog transistors implementing the Gabor-type filters introduces mismatch in the spatial frequency, half-bandwidth, phases, and offsets. Mismatch in the transistors of the input integrators and output spiking neurons introduces gain mismatch. We show in the next section that this mathematical model captures approximately 90% of the RF variability on the Gabor-type chips.

To examine the robustness of the relative responses, we need to establish metrics on the quality of the relative responses. We concentrate on the relative responses of populations of neurons tuned to different disparities but similar frontoparallel positions, since we can compare these to estimate disparity. Perhaps the simplest method classifies a stimulus into three categories depending on which neuron has the largest response, as shown in Figure 7. In this case, we are interested in how monocular RF mismatch affects the locations of the decision boundaries L and R. Although this is only one possible method for estimating disparity, other methods should display a similar dependence on the parameter mismatch. We examined how much these boundaries shifted in response to monocular RF parameter mismatch by perturbing the RF parameters in the left and right neurons by additive gaussian offsets with zero mean and varying percentage deviations. We used the mean of the left and right RF parameters listed in Table 1 as the nominal parameters. Deviations in the offsets, which are nominally zero, were expressed as a percentage of the peak value of the EVEN filter impulse response. Similarly, the deviations in the phase offsets
C
Disparity Energy
x
D
L
R
Disparity (pixel)
Figure 7: Decision boundaries obtained by comparing relative responses of neurons tuned to negative, zero, and positive disparities. L and R denote boundaries between negative/zero and zero/positive disparity stimuli. The parameters D D R ¡ L and C D .L C R/=2 describe the distance between and the centroid of the decision boundaries.
1594
E. Tsang and B. Shi
were expressed as a percentage of the maximum phase shift before phase wraparound, ¼ . We found that in general, variations in L and R were coupled, but that the variations in distance D D R ¡ L between the boundaries and their centroid C D .L C R/=2 were approximately independent. Deviations in the decision boundary parameters D and C are expressed as a percentage of the nominal value of D. Thus, a 100% shift in C indicates that the centroid of the left and right boundaries has shifted so much that the original and the perturbed regions corresponding to xation share no overlap if D remains constant. The variations in the decision boundary parameters increase linearly for small RF parameter deviations. We dene the sensitivity of each decision boundary parameter in response to variations in each RF parameter as the slope of this line, which we estimate by linear least-squares tting. If the sensitivity is less than one, we say that the decision boundary parameter is robust to variations in the RF parameter. Figure 8 compares the sensitivities of the position and phase models. Except for one combination (the sensitivity of D to variations in Ä), the phase model is more robust than the position model. In addition, the sensitivity of neurons tuned by phase encoding is always less than one, indicating that it is robust to the variation in the monocular RF proles. The distance D is more sensitive to variations than the centroid C. 4.5 Intrachip Variability of RF Parameters. The previous section showed that in general, the phase model is more robust than the position model. However, since there are RF parameters for which phase encoding is more sensitive than position encoding, it is important to establish how much each RF parameter varies within one chip. Although transistor parameters will vary from chip to chip, the effect of this interchip variation can be minimized by adjusting the external bias voltages appropriately. 
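The sensitivity estimate described above (the slope of a least-squares line relating boundary shift to RF parameter deviation) can be sketched as follows; the data here are synthetic, for illustration only:

```python
import numpy as np

# Synthetic example: percentage deviation of an RF parameter vs. the resulting
# percentage shift in a decision-boundary parameter (D or C).
rf_deviation = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
boundary_shift = 0.6 * rf_deviation + np.array([0.0, 0.1, -0.1, 0.05, -0.05, 0.0])

# Sensitivity = slope of the linear least-squares fit through the data.
slope, _intercept = np.polyfit(rf_deviation, boundary_shift, 1)
is_robust = abs(slope) < 1.0  # robust if the boundary shifts less than the parameter
```

A slope below one means the decision boundary moves proportionally less than the underlying RF parameter, which is the robustness criterion used in the text.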
Figure 8: The sensitivity of position and phase encoding. (A) Sensitivity of the distance D between the decision boundaries to mismatch in different RF parameters. (B) Sensitivity of the centroid C.

However, intrachip variability makes it impossible to match the responses of all pairs of neurons on two different chips and is therefore the limiting factor. We characterized this variability by measuring the RF parameters obtained from least-squares fits to the RF profiles measured at seven neurons located in row 16 and columns 26, 28, 30, 32, 34, 36, and 38 of one chip. We chose a restricted range of columns and rows to avoid edge effects due to the chip boundaries. The least-squares fits of the RF profiles of these neurons are illustrated in Figure 9. We measured the quality of the fits by the percentage accuracy

$$A\% = \left(1 - \sqrt{\frac{\sum_k \left[(h_e[k] - \hat{h}_e[k])^2 + (h_o[k] - \hat{h}_o[k])^2\right]}{\sum_k \left[\hat{h}_e[k]^2 + \hat{h}_o[k]^2\right]}}\,\right) \times 100\%,$$

where

$$h_e[k] = \sum_j \left(h_e^+[j,k] - h_e^-[j,k]\right) \quad \text{and} \quad h_o[k] = \sum_j \left(h_o^+[j,k] - h_o^-[j,k]\right)$$
are the measured one-dimensional EVEN and ODD RF profiles of the neurons, their fitted RF profiles are ĥ_e[k] and ĥ_o[k], and the indices j and k correspond to the row and column position. The fits shown in Figure 9 closely matched the measured profiles. The percentage accuracy for about half of the neurons was over 90% (the average value was 88.37%). It turns out that the phase variability is quite small: fixing the phase offsets φ_e and φ_o to zero degraded the quality of the fits by only 1.2%.

The percentage standard deviations are plotted in Figure 10. The filter gains vary the most, followed by the half-bandwidth. The spatial frequency, phase, and offset vary the least. This figure also shows that there is a complementary relationship between the sensitivity of the distance between the decision boundaries, D, for phase-encoded neurons and the RF parameter variability. Although the gains vary the most, the decision boundaries are not very sensitive to their variations. On the other hand, the decision boundaries are relatively more sensitive to variations in the spatial frequency and offset, but these do not vary much on the chip.

Figure 9: The least-squares fit of measured Gabor-type RF profiles. (A–G) The fits for the EVEN and ODD profiles corresponding to the seven neurons located in row 16 and columns 26, 28, 30, 32, 34, 36, and 38 of a chip. The discrete data points, +, represent the measured responses. The solid lines represent the fits.

Figure 10: Comparison between the sensitivity of the distance D between decision boundaries using phase-encoded neurons to variations in different RF parameters and the measured variation of the parameters in the system.

5 Conclusion

We implemented a neuromorphic multichip model of disparity-tuned complex cells in the mammalian primary visual cortex based on the disparity energy model. We evaluated two possible mechanisms for disparity tuning, phase and position offsets between the monocular RF profiles, and demonstrated that our system prefers phase encoding, because the relative responses of neurons tuned to different disparities are less sensitive to mismatch in the monocular RF profiles. We found that the phase model is least sensitive to the RF parameters that vary the most in our system. Biological systems would also benefit from this type of robustness, as they must also extract information from the responses of a population of neurons.

Intuitively, the better robustness of our phase-encoded neurons arises because binocular disparity-selective neurons tuned to different disparities are constructed from the same sets of monocular outputs, but in different combinations. Thus, different binocular neurons are subject to the same variations. On the other hand, our position-encoded neurons are constructed from different sets of monocular outputs. The biological analog of the large variability in the gain of RF profiles we measured may be the large variation in the firing rate of real neurons. Thus, our result implies that phase encoding is one mechanism that enables real cells to deal with firing rate variability. Our system architecture is similar to a recent cortical
model of stereopsis, which proposes that monocular neural outputs from layer 4 of V1 are combined binocularly in layer 3B (Grossberg & Howe, 2003). Interestingly, the population response of phase-encoded neurons is also more robust in another sense: the peak of the population response is an accurate estimate of the disparity regardless of the detailed luminance profile of the stimulus, while this is not true for the population responses of position-encoded neurons (Chen & Qian, 2004).

This implementation is an initial step toward the development of a multichip neuromorphic system capable of extracting depth information about the visual environment using silicon neurons with physiologically based functionality. Currently, the system can compute the output of only a single cell tuned to a particular disparity and retinal position. However, the estimation of disparities in a visual scene requires a population of neurons tuned to different disparities and retinal positions. Although we could accomplish this by exploiting the electronic reconfigurability of the system using time multiplexing, this approach is inefficient and time-consuming. Our current work seeks to implement this population in a fully parallel manner. By combining the AER outputs from the left and right Gabor-type chips onto a chip containing a two-dimensional array of neurons with the appropriate address remapping, we can implement an array of neurons tuned to the same disparity but varying retinal locations. Additional chips could represent neurons tuned to other disparities. Computing the neuronal outputs in parallel will enable us to investigate the roles of additional processing steps such as pooling (Fleet, Wagner, & Heeger, 1996; Qian & Zhu, 1997) and normalization (Albrecht & Geisler, 1991; Heeger, 1992).

Acknowledgments

This work was supported in part by the Hong Kong Research Grants Council under grants HKUST6218/01E and HKUST6200/03E.
It was inspired by an initial project at the 2002 Telluride Neuromorphic Workshop performed with Y. Miyawaki. We thank K. A. Boahen for supplying the silicon retina and AER receiver chips used in this work and for helpful discussions, and T. Choi for his assistance in building the system.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.

Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast response functions of simple cells in the visual cortex. Visual Neuroscience, 7, 531–546.

Anzai, A., Ohzawa, I., & Freeman, R. D. (1999a). Neural mechanisms for encoding binocular disparity: Position vs. phase. J. Neurophysiol., 82, 874–890.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999b). Neural mechanisms for processing binocular information I. Simple cells. J. Neurophysiol., 82, 891–908.

Anzai, A., Ohzawa, I., & Freeman, R. D. (1999c). Neural mechanisms for processing binocular information II. Complex cells. J. Neurophysiol., 82, 909–924.

Barlow, H. B., Blakemore, C., & Pettigrew, J. D. (1967). The neural mechanism of binocular depth discrimination. J. Physiol. Lond., 193, 327–342.

Boahen, K. A. (2000). Point-to-point connectivity between neuromorphic chips using address events. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47, 416–434.

Chen, Y., Wang, Y., & Qian, N. (2001). Modeling V1 disparity tuning to time-varying stimuli. J. Neurophysiol., 86(1), 143–155.

Chen, Y., & Qian, N. (2004). A coarse-to-fine disparity energy model with both phase-shift and position-shift receptive field mechanisms. Neural Comput., 16, 1545–1577.

Choi, T. Y. W., Shi, B. E., & Boahen, K. (2004). An ON-OFF orientation selective address event representation image transceiver chip. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 51(2), 342–353.

Cumming, B. G. (2002). An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418, 633–636.

Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.

DeValois, R. L., & DeValois, K. K. (1988). Spatial vision. New York: Oxford University Press.

Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1857.

Grossberg, S., & Howe, P. D. L. (2003). A laminar cortical model of stereopsis and three-dimensional surface perception. Vision Res., 43, 801–829.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.

Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77, 2879–2909.

Poggio, G. F., Motter, B. C., Squatrito, S., & Trotter, Y. (1985). Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vision Res., 25, 397–406.

Prince, S. J., Cumming, B. G., & Parker, A. J. (2002). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87(1), 209–221.

Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6, 390–404.
Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Res., 37, 1811–1827.

Schor, C., Wood, I., & Ogawa, J. (1984). Binocular sensory fusion is limited by spatial resolution. Vision Res., 24, 661–665.

Tsao, D. Y., Conway, B. R., & Livingstone, M. S. (2003). Receptive fields of disparity-tuned simple cells in macaque V1. Neuron, 38(1), 103–114.

Zaghloul, K. A., & Boahen, K. (2004a). Optic nerve signals in a neuromorphic chip I: Outer and inner retina models. IEEE Transactions on Biomedical Engineering, 51(4), 657–666.

Zaghloul, K. A., & Boahen, K. (2004b). Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Transactions on Biomedical Engineering, 51(4), 667–675.

Zhu, Y., & Qian, N. (1996). Binocular receptive field models, disparity tuning, and characteristic disparity. Neural Comput., 8, 1611–1641.

Received August 18, 2003; accepted February 2, 2004.
LETTER
Communicated by Misha Tsodyks
Learning Classification in the Olfactory System of Insects

Ramón Huerta
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A., and GNB, Escuela Técnica Superior de Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain

Thomas Nowotny
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.

Marta García-Sanchez
[email protected]
GNB, Escuela Técnica Superior de Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain, and Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.

H. D. I. Abarbanel
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A., and Department of Physics and Marine Physical Laboratory, Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.

M. I. Rabinovich
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.
Neural Computation 16, 1601–1640 (2004) © 2004 Massachusetts Institute of Technology

We propose a theoretical framework for odor classification in the olfactory system of insects. The classification task is accomplished in two steps. The first is a transformation from the antennal lobe to the intrinsic Kenyon cells in the mushroom body. This transformation into a higher-dimensional space is an injective function and can be implemented without any type of learning at the synaptic connections. In the second step, the encoded odors in the intrinsic Kenyon cells are linearly classified in the mushroom body lobes. The neurons that perform this linear classification are equivalent to hyperplanes whose connections are tuned by local Hebbian learning and by competition due to mutual inhibition. We
calculate the range of values of activity and size of the network required to achieve efficient classification within this scheme in insect olfaction. We are able to demonstrate that biologically plausible control mechanisms can accomplish efficient classification of odors.

1 Introduction

Odor classification by nervous systems involves several quite different computational tasks: (1) similar odor receptions need to be classified as the same odor, (2) distinct odors have to be discriminated from each other, and (3) some quite different odors might carry the same meaning and therefore must be associated with each other. From studies of insect olfaction, we have developed a framework that may provide insights into how the olfactory system of insects accomplishes these different tasks with the available natural technology.

Insects have three known processing stages of odor information before classification: the antenna, the antennal lobe (AL), and the mushroom body (MB) (see Figure 1 for a description). To put our work into the right context, we will describe the known facts about the system. Each olfactory receptor cell in the antenna expresses one type of receptor, and all olfactory receptor cells expressing the same receptor type connect to the same glomerulus in the AL (Gao, Yuan, & Chess, 2000; Vosshall, Wong, & Axel, 2000; Scott et al., 2001). Thus, a chemosensory map of receptor activity in the antenna is formed in the AL: the genetically encoded architecture induces a stimulus-dependent spatial code in the glomeruli (Rodrigues, 1988; Distler, Bausenwein, & Boeckh, 1998; Joerges, Kuettner, Galizia, & Menzel, 1997; Galizia, Joerges, Kuettner, Faber, & Menzel, 1997; Galizia, Nagler, Holldobler, & Menzel, 1998). Moreover, the spatial code is conserved across individuals of the same species (Galizia, Sachse, Rappert, & Menzel, 1999; Wang, Wong, Flores, Vosshal, & Axel, 2003), as would be expected given the genetic origin of the code.
There are many fewer neurons in the first relay station (the AL) than there are receptor types, by a ratio of 1:10. The reasons for such a strong convergence are not well understood and remain an interesting theoretical question for future study. It is known that increasing odor concentrations recruit increasing numbers of glomeruli (Ng et al., 2002; Wang et al., 2003). A simple transduction of the glomerular activity would increase the number of active neurons in the MB with increasing odor concentration as well. Calcium recordings in the MB of Drosophila show, however, that the activity in the MB indicated by calcium concentrations is independent of the odor concentration for the odors ethyl acetate and benzaldehyde (Wang et al., 2001). Moreover, recent recordings from the AL in the locust indicate that the activity in the projections of the excitatory neurons of the AL into the MB is nearly constant in time (Stopfer, Jayaraman, & Laurent, 2003). Therefore, a gain control mechanism maintaining a nearly constant average neuronal activity in the AL
Figure 1: Description of the structural organization of the first few processing layers of the olfactory system of insects, including a list of what is known in terms of coding, odor concentration dependence, and learning of odor conditioning.
must exist. It seems that the AL performs some preprocessing of the data to feed an adequate representation of it into the area of the insect brain that is responsible for learning odor conditioning, the MB.

Although the MBs are not critical for normal behavior of insects, they are crucial for odor conditioning (Heisenberg, Borst, Wagner, & Byers, 1985). Genetic manipulation has proved to be a powerful tool to identify the MB as the locus for learning (de Belle & Heisenberg, 1994; Zars, Fischer, Schulz, & Heisenberg, 2000; Pascual & Preat, 2001; Dubnau, Grady, Kitamoto, & Tully, 2001; McGuire, Le, & Davis, 2001; Connolly et al., 1996; Menzel, 2001; Heisenberg, 2003; Dubnau, Chiang, & Tully, 2003). Dubnau and colleagues (2001) propose that "the Hebbian processes underlying olfactory associative learning reside in mushroom body dendrites or upstream of the mushroom body and that the resulting alterations in synaptic strength modulate mushroom body output during memory retrieval." This is an important idea that supports our hypothesis that the important synaptic changes to achieve
classification of odors have to occur in the connections from the intrinsic KCs to the MB lobes, not earlier or later.

We propose a theory of odor classification that relies on a large coding display in the MB, random connectivity among neurons, Hebbian learning, and mutual inhibition. We show that these elements of an odor identification network are sufficient to accomplish the classification tasks described above (see Figure 1). Our main hypothesis, as stated in Figure 1, is that the classification decision occurs in the MB lobes that receive inputs from the intrinsic KCs. This hypothesis is suggested by the concept of support vector machines (SVMs) (Cortes & Vapnik, 1995) and Cover's work (Cover, 1965), which were developed in the context of the classification problem based on linear threshold devices when the input classes are not linearly separable. The strategy consists of casting the input into a high-dimensional space by means of a nonlinear transformation and then using linear threshold devices for classification in this higher-dimensional representation. The nonlinear transformation is designed to separate the previously inseparable input classes in the new high-dimensional space. In this article, we consider the AL as the input space, the intrinsic KCs form the high-dimensional space, and the neurons in the MB lobes are the linear threshold devices that can separate the classes. The experimental support for having such an SVM is based on the known large divergence of connections from the AL to the MB and the observed role of the MB as the focus of learning odor conditioning.

Let us elaborate on the necessary processing layers of the classification circuit and our hypotheses about these different stages of information processing. We consider the AL as the input layer to the classification layers in the MB. Figure 2 shows the proposed classification scheme.
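The expansion-then-linear-separation strategy can be illustrated with the classic XOR example; the connectivity and readout weights below are hand-picked for brevity (in the model, the expansion is random and high-dimensional):

```python
import numpy as np

# XOR labels are not linearly separable on the raw 2-bit inputs, but become
# separable after expansion by binary threshold units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([0, 1, 1, 0])  # XOR

C = np.array([[1, 0], [0, 1], [1, 1]])  # connections to three expansion units
theta = np.array([0, 0, 1])             # unit 2 fires only when both inputs are on
Y = (X @ C.T - theta > 0).astype(int)   # expanded binary representations

w, t = np.array([1.0, 1.0, -2.0]), 0.5  # one hyperplane now separates the classes
pred = (Y @ w - t > 0).astype(int)
```

In the expanded coordinates, a single linear threshold unit reproduces the XOR labeling, which is impossible in the original two-dimensional input space.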
The first processing stage is the nonlinear transformation of the information contained in the AL activity into the large screen of intrinsic KCs. Our hypothesis for this stage is that the mapping of activity patterns resulting from the transformation from the AL to the MB should be statistically injective (Garcia-Sanchez & Huerta, 2003). Injectivity means that if two different states of the AL, arising from distinct odors, are projected into the screen of intrinsic KCs, the resulting states in the intrinsic KC layer should be distinct as well. The next hypothesis is that once the odor representations in the AL have been widely separated in the intrinsic KC layer, they can be classified at the next processing stage using simple linear threshold devices (see Figure 2); we interpret the neurons of the MB lobes as such linear threshold devices. At this stage, we then employ biologically feasible learning mechanisms to allow memory formation and self-organization in this part of the network.

In addition to showing that the developed framework indeed allows successful classification, we also determine the range of possible values for the sparseness of activity in the intrinsic KC layer. The results are consistent with experimental recordings of intrinsic KCs in the MB of the locust (Perez-Orive et al., 2002) and thus give a theoretical explanation for the observed
Figure 2: Main elements required to account for efficient classification in the insect olfactory system. The first stage is an injective function from the AL to the intrinsic KC layer using nonspecific connectivity matrices (left). By nonspecific, we mean that the connectivity matrix does not need to be learned or be genetically encoded in detail. For classification purposes, a large number of neurons in the intrinsic KC layer and a sparse code are needed, as we will show. The second phase (linear classification) is the decision phase, where linear classification takes place. It is characterized by converging connections, Hebbian learning, and mutual inhibition between MB lobe neurons.
sparseness of the code. Finally, we demonstrate that the large size of the display or screen in the intrinsic KC layer is critical for efficient classification, and we quantify the interdependence of storage capacity and KC layer size.

Previous theoretical studies have revealed the possibility of using the temporal dynamics of the first relay station of olfactory processing for odor recognition (Hendin, Horn, & Hopfield, 1994; Hendin, Horn, & Tsodyks, 1998; White, Dickinson, Walt, & Kauer, 1998; Li & Hertz, 2000). White et al. (1998) propose delay lines to improve discrimination between odors. Hendin et al. (1994, 1998) show how the olfactory bulb (equivalent to the AL) can, in principle, implement an associative memory and segmentation of odors. Li and Hertz (2000) emphasize the importance of the feedback from the cortex (equivalent to the MB, the location of odor recognition) to the olfactory bulb (equivalent to the AL) for odor segmentation.

In this work, we do not consider time or odor segmentation. Time is important to enhance the separation of similar odors (Friedrich & Laurent, 2001; Linster & Cleland, 2001; Laurent, 2002; Laurent et al., 2001). The problem of segmentation addresses the question of recognizing odor A or odor B when
encountering a mixture of both. Both the role of time and the problem of odor segmentation require detailed knowledge of the AL representation of odors and mixtures of odors. To date, there is no theoretical work that addresses these issues, especially the code for mixtures of odors in the AL, in a satisfactory way. We therefore restrict our analysis to the spatial aspects of the representation of pure odors in an appropriate sense. The model that we present in this article is, to our knowledge, the first to address quantitatively the levels of neural activity, the convergent and divergent connectivities, and the function of the MB in odor classification. The role of time is left for further investigation; we need to understand the limitations of spatial code processing before exploring the advantages of using time in the neural code.

The organization of the letter is as follows. We start with a description of the elements of the system. Then we present two analyses: an analytical description of a single decision neuron and a systematic numerical analysis demonstrating that the AL and MB structure together with Hebbian learning and mutual inhibition account for efficient classification.

2 Description of the Processing Stages

As mentioned above and shown in Figure 2, there are two essential stages in our model of odor classification: a nonlinear transformation followed by a linear classification. We will see that the similarities of this core concept of SVMs to the observed biological system are striking, even though the specific implementation is different because of the actual composition, the available "wetware," of the insect olfactory network.

2.1 Nonlinear Transformation from the AL to the MB. Recently, strong evidence has been collected in the locust (Perez-Orive et al., 2002) that the spatiotemporal activity in the AL is read by the MB as pictures or snapshots.
The two observations supporting this hypothesis are the strong feedforward inhibition from the lateral horn interneurons onto the Kenyon cells and the short integration time of the intrinsic KCs. The periodic strong inhibition resets the activity in the KC layer every 50 ms, whereas the short integration time of the KCs makes them coincidence detectors that cannot process input across more than one inhibition cycle. Because we are not addressing the temporal aspects of the system in this work, the input for our classification system is a single snapshot. The hypothesis for the nonlinear transformation from the AL to the MB is then that every such snapshot or code word in the AL has a unique corresponding code word in the MB: the nonlinear transformation needs to be an injective function, at least in a statistical sense. In previous work (Garcia-Sanchez & Huerta, 2003), we proposed a method to select the parameter values that allow one to construct such an injective function from the AL to the intrinsic KC layer with very high probability. The appropriate parameters are used
throughout this article for designing the nonlinear transformation from the AL to the MB.

It is known that the activity among intrinsic KCs is very sparse (Perez-Orive et al., 2002). Most of the intrinsic KCs fire just one or two spikes during one odor presentation of a few seconds. Given this low endogenous activity of the intrinsic KC neurons, we chose simple McCulloch-Pitts "neurons" (McCulloch & Pitts, 1943) to represent them. The state of this model neuron is described by a binary number (0 = no spike and 1 = spike). In particular, the McCulloch-Pitts neuron is described by

$$y_j = \Theta\left(\sum_{i=1}^{N_{AL}} c_{ji} x_i - \theta_{KC}\right), \quad j = 1, 2, \ldots, N_{KC}. \tag{2.1}$$
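A minimal sketch of equation 2.1; the binary vectors and connectivity matrix here are toy values for illustration:

```python
import numpy as np

def kc_layer(x, C, theta_kc):
    # y_j = Theta(sum_i c_ji x_i - theta_KC), with Theta(u) = 1 for u > 0, else 0
    return (C @ x - theta_kc > 0).astype(int)

x = np.array([1, 0, 1])          # AL snapshot (N_AL = 3)
C = np.array([[1, 0, 1],         # binary connectivity, one row per KC (N_KC = 2)
              [0, 1, 0]])
y = kc_layer(x, C, theta_kc=1)   # weighted sums are [2, 0]
```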
Here, x is the state vector of the AL neurons. It has dimension N_AL, where N_AL is the number of AL neurons. The components of the vector x = [x_1, x_2, ..., x_{N_AL}] are 0s and 1s. y is the state vector for the intrinsic KC layer; it is N_KC-dimensional. The c_{ji} are the components of the connectivity matrix, which is N_AL × N_KC in size; its elements are also 0s and 1s. θ_KC is an integer number that gives the firing threshold of an intrinsic KC. The Heaviside function Θ(·) is unity when its argument is positive and zero when its argument is negative.

To determine the statistical degree of injectivity of the connectivity between the AL and the intrinsic KCs, we first calculate the probability of having identical outputs given different inputs for a given connectivity matrix: P(y = y′ | x ≠ x′, C), where C is one of the possible connectivity matrices (see Garcia-Sanchez & Huerta, 2003, for details) and the notation x ≠ x′ stands for {(x, x′) : x ≠ x′}. We want this probability, which we call the probability of confusion, to be as small as possible, on average, over all inputs and over all connectivity matrices. We write this average as P(confusion) = ⟨⟨P(y = y′ | x ≠ x′, C)⟩_{x≠x′}⟩_C, where ⟨·⟩_{x≠x′} is the average over all nonidentical input pairs (x, x′), and ⟨·⟩_C is the average over all connectivity matrices C. This gives us a measure of clarity, the opposite of confusion, as

$$I = 1 - P(\text{confusion}). \tag{2.2}$$
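The clarity measure I can be estimated by Monte Carlo sampling over random inputs and connectivity matrices. A rough sketch, with layer sizes and probabilities chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_al, n_kc = 20, 200              # toy layer sizes
p_al, p_c, theta_kc = 0.3, 0.1, 1

def kc_response(x, C):
    # Eq. 2.1 with Theta(u) = 1 for u > 0
    return (C @ x - theta_kc > 0).astype(int)

pairs, confusions = 0, 0
for _ in range(300):
    C = (rng.random((n_kc, n_al)) < p_c).astype(int)   # Bernoulli connectivity
    x1 = (rng.random(n_al) < p_al).astype(int)         # two random AL snapshots
    x2 = (rng.random(n_al) < p_al).astype(int)
    if not np.array_equal(x1, x2):                     # only count distinct inputs
        pairs += 1
        if np.array_equal(kc_response(x1, C), kc_response(x2, C)):
            confusions += 1

clarity = 1.0 - confusions / pairs   # Monte Carlo estimate of I = 1 - P(confusion)
```

Increasing n_kc in this sketch drives the estimated clarity toward 1, mirroring the role of the large KC display discussed in the text.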
The closer I is to 1, the better our statistically injective transformation from the states x of the AL to the states y of the intrinsic KCs. There are two parameters of the model that can be adjusted using the measure of clarity. One is the probability p_C of having a connection between a given neuron in the AL and a given intrinsic KC. The second is the firing threshold θ_KC of the intrinsic KCs. Fixed parameters of the model are the probability p_AL of having an active neuron in the AL layer, the number N_AL of input neurons, and the number N_KC of intrinsic KCs. p_C and θ_KC can be
estimated using the following inequality:

$$I \le 1 - \left\{p_{KC}^2 + (1 - p_{KC})^2 + 2\sigma^2\right\}^{N_{KC}}, \tag{2.3}$$
where p_KC is the firing probability of a single neuron in the intrinsic KC layer. It can be calculated for inputs and connection matrices generated by a Bernoulli process with probabilities p_AL and p_C as

$$p_{KC} = \sum_{i=\theta_{KC}}^{N_{AL}} \binom{N_{AL}}{i} (p_{AL}\, p_C)^i (1 - p_{AL}\, p_C)^{N_{AL} - i}. \tag{2.4}$$
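Equations 2.3 and 2.4 are straightforward to evaluate numerically. In the sketch below, σ² is set to zero as a simplification (the independent approximation); N_AL = 900 and roughly 15 connections per KC follow the locust figures mentioned in the text, while θ_KC and N_KC are arbitrary illustrative choices:

```python
from math import comb

def p_kc(n_al, p_al, p_c, theta_kc):
    # Eq. 2.4: firing probability of one intrinsic KC under Bernoulli
    # inputs (prob. p_al) and connections (prob. p_c)
    p = p_al * p_c
    return sum(comb(n_al, i) * p**i * (1 - p)**(n_al - i)
               for i in range(theta_kc, n_al + 1))

def clarity_bound(p_firing, n_kc, sigma2=0.0):
    # Eq. 2.3: upper bound on the clarity measure I
    return 1.0 - (p_firing**2 + (1.0 - p_firing)**2 + 2.0 * sigma2) ** n_kc

p = p_kc(n_al=900, p_al=0.1, p_c=15.0 / 900.0, theta_kc=3)  # ~15 connections per KC
bound = clarity_bound(p, n_kc=50000)
```

Raising the threshold θ_KC lowers p_KC, which is how sparseness of the KC code enters the clarity bound.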
This probability has variance σ² (appearing above) when we average over all possible inputs and connectivity matrices. The formula for the probability of confusion can be intuitively understood if we assume that the activity of every intrinsic KC is statistically independent of the activity of the others. If so, the probability of confusion in one output neuron is the sum of the probability of having a one for both inputs plus the probability of having a zero for both: p_KC² + (1 − p_KC)². Thus, the probability of confusion in all N_KC output neurons is (p_KC² + (1 − p_KC)²)^{N_KC} in the approximation of independent inputs. This bound on I should be close to unity for any set of parameter values we choose. The inequality for the measure of clarity becomes an equality for sparse connectivity matrices. This, fortunately, is the case in the locust, where every intrinsic KC receives an average of only 10 to 20 connections from the AL, where nearly 900 are possible.

2.2 Linear Classification in the MB Lobes. As shown in Figure 2, we hypothesize that the classification decision takes place at the MB lobes. The neurons in the MB lobes are again modeled by McCulloch-Pitts neurons, which are simple linear threshold devices:

$$z_l = \Theta\left(\sum_{j=1}^{N_{KC}} w_{lj}\, y_j - \theta_{LB}\right), \quad l = 1, 2, \ldots, N_{LB}. \tag{2.5}$$
Here, the index LB denotes the MB lobes. The vector z is the state of the MB lobes; it has dimension N_LB. θ_LB is the threshold for the decision neurons in the MB lobes. The N_KC × N_LB connectivity matrix w_lj has entries 0 or 1. We introduce a plasticity rule for the entries w_lj below. Every column vector w_l (w_lj for fixed l) of a connectivity matrix defines a hyperplane in the intrinsic KC layer coordinates y (see Figure 3 for an illustration of this). This is the plane that is normal to w_l. This hyperplane discriminates between the points that are "above it" and the points that are "below it." There is a different hyperplane for each MB lobe neuron, and the combinatorial placement of the planes determines the classification space.
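As a concrete illustration of equation 2.5, the decision stage can be sketched in a few lines of Python. The code and the toy sizes below are our own, far smaller than the real circuit.

```python
def mb_lobe_response(w, y, theta_lb: int):
    """Equation 2.5: each MB lobe neuron l is a McCulloch-Pitts unit firing
    iff sum_j w[l][j] * y[j] reaches the threshold theta_lb (we resolve the
    Heaviside convention as >=).  w and y are 0/1 valued."""
    return [1 if sum(w_lj * y_j for w_lj, y_j in zip(w_l, y)) >= theta_lb
            else 0
            for w_l in w]

# Toy example:
y = [1, 0, 1, 1, 0, 1]
w = [[1, 0, 1, 1, 0, 1],   # overlaps y in 4 active positions
     [0, 1, 0, 0, 1, 0]]   # overlaps y in 0 active positions
print(mb_lobe_response(w, y, theta_lb=3))  # → [1, 0]
```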
Figure 3: Hypothesized functional diagram of the olfactory system of insects. The data of the AL are encoded in snapshots. Every snapshot is a binary code of spikes (1) or no spikes (0). The projection from the AL to the intrinsic KC layer separates the encoded odors to be classified. The connectivity matrix from the intrinsic KC layer to the MB lobes defines a set of hyperplanes w that are able to discriminate between different sets of encoded odors. Note that without the projection into a higher-dimensional space, it would not be possible to linearly discriminate the dark points from the light ones.
2.2.1 Local Synaptic Changes: Hebbian Learning. Another key ingredient in addressing the classification problem is a learning mechanism. Hebbian learning is the classical choice for local synaptic changes (Hebb, 1949). The resulting local synaptic modifications are made at the linear classification stage (see Figure 2) and efficiently solve the classification problem. Hebbian plasticity is used to adjust the connectivity matrix w_lj from the intrinsic KC layer to the MB lobes. No other areas need to be involved in learning for our model of classification. The plasticity rule is applied by first choosing a connectivity matrix with some randomly chosen initial entries. Then inputs are presented to the system. The entries of the connectivity matrix at the time of the nth input are denoted by w_lj(n). The values after the next input, w_lj(n + 1), are given by the rule

    w_lj(n + 1) = H( z_l, y_j, w_lj(n) ),

where

    H(1, 1, w_lj(n)) = { 1         with probability p_+,
                         w_lj(n)   with probability 1 − p_+,    (2.6)
    H(1, 0, w_lj(n)) = { 0         with probability p_−,
                         w_lj(n)   with probability 1 − p_−,    (2.7)

    H(0, 1, w_lj(n)) = w_lj(n),
    H(0, 0, w_lj(n)) = w_lj(n).
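The stochastic rule of equations 2.6 and 2.7 can be sketched as follows. The code is ours; the values of p_+ and p_− are illustrative free timescales, and the convergence shown at the end is the behavior described in the text (for one fixed input with z = 1, the weight vector approaches the input pattern itself).

```python
import random

def hebbian_update(w, y, z, p_plus=0.2, p_minus=0.5, rng=random):
    """One application of equations 2.6-2.7 to the incoming weight vector w
    of a single MB lobe neuron with output z."""
    if z == 0:
        return list(w)                       # H(0, ., .): no change
    w_new = []
    for w_j, y_j in zip(w, y):
        if y_j == 1 and w_j == 0 and rng.random() < p_plus:
            w_j = 1                          # activated with probability p_+
        elif y_j == 0 and w_j == 1 and rng.random() < p_minus:
            w_j = 0                          # removed with probability p_-
        w_new.append(w_j)
    return w_new

# Applied much longer than max(1/p_+, 1/p_-) to one fixed input with z = 1,
# the weight vector converges to the input pattern itself:
random.seed(0)
y = [1, 0, 1, 1, 0]
w = [0, 1, 0, 0, 0]
for _ in range(100):
    w = hebbian_update(w, y, z=1)
print(w)  # → [1, 0, 1, 1, 0], equal to y
```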
A synaptic connection is activated (it becomes 1 when it was 0) with probability p_+ if the input activity is accompanied by an output activity. The connection is removed (it becomes 0 if it was 1) with probability p_− if an output occurs in the absence of input activity. In the remaining cases, no input and no output, and input but no output, the synapse remains unaltered. For example, let us apply the plasticity rule to a given input y and one output z = 1. This is the case that we primarily analyze in the next section. It can be easily calculated that the average number of iterations it takes for the connection w_j to become 1 if y_j = 1 and w_j = 0 is 1/p_+. If w_j was already 1, then the number of iterations is 0. On the other hand, the average number of iterations it takes for the connection w_j to become 0 if y_j = 0 and w_j = 1 is 1/p_−. In other words, the inverses of the probabilities p_+ and p_− are just the timescales of synaptic changes. Therefore, if we apply this rule sufficiently long for one given input activity at the intrinsic KC layer and an active response z = 1 in an output neuron, the set of active connections will eventually be the set of active inputs itself.

2.2.2 Mutual Inhibition: n_W-Winner-Take-All. We hypothesize that mutual inhibition exists and, in joint action with Hebbian learning, is able to organize a nonoverlapping response of the decision neurons in the MB lobes. This is often considered a neural mechanism for self-organization in the brain. The combination of Hebbian learning and mutual inhibition has already been proposed as a biologically feasible mechanism to account for learning in neural networks (O'Reilly, 2001). Mutual inhibition is implemented artificially in the finite automata model with McCulloch-Pitts neurons. We allow only a subset of decision neurons that receive the highest synaptic input to fire. The size of this subset is fixed to allow n_W winners in a winner-take-all configuration for every odor presentation. Let us define the vector u_μ = Σ_{j=1}^{N_KC} w_μj y_j − θ_LB. Then equation 2.5 is rewritten as

    z_l = Θ( Σ_{j=1}^{N_KC} w_lj y_j − θ_LB − {u}_{n_W+1} ),    (2.8)
where {u}_{n_W+1} denotes the (n_W + 1)th largest component of the vector u. Since we force the system to have exactly n_W active extrinsic KCs, θ_LB does not play any role because it is canceled in equation 2.8, and we can simply
Table 1: Summary of the Parameters and Variables of All the Processing Layers.

                  AL           AL→KC    KC                                        KC→lobe     MB lobe
    Variables     x            c_ij     y                                         w_ij        z
    Parameters    N_AL, p_AL   p_C      N_KC, θ_KC, p_KC(N_AL, p_AL, p_C, θ_KC)   p_+, p_−    N_LB, n_W
write

    z_l = Θ( Σ_{j=1}^{N_KC} w_lj y_j − { Σ_{j=1}^{N_KC} w_μj y_j }_{n_W+1} ).    (2.9)
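A minimal sketch of the n_W-winner-take-all readout of equation 2.9 (our code; ties between equal synaptic drives are broken by neuron index, a detail the model leaves open):

```python
def winner_take_all(w, y, n_w: int):
    """Equation 2.9: mutual inhibition lets exactly the n_w MB lobe neurons
    with the largest synaptic drive u_l = sum_j w[l][j] * y[j] fire."""
    u = [sum(w_lj * y_j for w_lj, y_j in zip(w_l, y)) for w_l in w]
    winners = sorted(range(len(u)), key=lambda l: -u[l])[:n_w]
    return [1 if l in winners else 0 for l in range(len(u))]

y = [1, 1, 0, 1]
w = [[1, 1, 0, 1],   # u = 3
     [1, 0, 0, 1],   # u = 2
     [0, 0, 1, 0],   # u = 0
     [1, 1, 1, 1]]   # u = 3
print(winner_take_all(w, y, n_w=2))  # → [1, 0, 0, 1]
```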
The parameters and variables of the model are summarized in Table 1. The AL parameters are the number of neurons N_AL and the probability of firing p_AL. The connectivity probability from the AL to the MB is p_C. The parameters of the intrinsic KC layer are the number of neurons N_KC and the threshold for activation θ_KC. The firing probability p_KC is a function of N_AL, p_AL, p_C, and θ_KC. The dynamics of the connections from the intrinsic KC layer to the MB lobes is governed by the probabilities (timescales) for increasing and decreasing connection strength, p_+ and p_−. Finally, the response of the decision layer depends on the number of allowed active neurons in the MB lobes, n_W. For clarity, we also include a summary of the equations of the model in Table 2.

Table 2: Summary of the Equations.

    Intrinsic KC layer     y_j = Θ( Σ_{i=1}^{N_AL} c_ji x_i − θ_KC )
    Mushroom body lobes    z_l = Θ( Σ_{j=1}^{N_KC} w_lj y_j − { Σ_{j=1}^{N_KC} w_μj y_j }_{n_W+1} )
    Plasticity rule        w_lj(n + 1) = { 1         with p_+ if y_j = 1 and z_l = 1
                                           0         with p_− if y_j = 0 and z_l = 1
                                           w_lj(n)   in all other cases

Note: The index n represents the iteration number for presentations of a snapshot of AL activity to the intrinsic KCs.

3 Results

In the following sections, we divide our analysis into two parts. First, we analyze the parameter values that allow classification with one single output neuron. For this analysis, we use basic probability theory in
a similar framework as Caticha, Palo Tejada, Lancet, and Domany (2002). In principle, Shannon information can be employed as well, as in Nadal (1991) and Frolov and Murav'ev (1993) for associative memories. For our purposes, basic probability theory yields straightforward results that are easy to interpret. For simplicity, we will use a simple convention to separate the classification problem into two subproblems we call discrimination and association:

• The discrimination task is the ability of a single output neuron to distinguish one input from the rest, assuming uncorrelated input patterns in the antennal lobe. The probability space of AL activity will be unchanged throughout our analysis.

• The association problem is defined as the capability of a single output neuron to fire for d given uncorrelated inputs while not responding to the rest. The discrimination problem can be seen as a special case of the association problem with d = 1.

In the second part of the results, we present simulations of a complete system with many output neurons that organizes via the plasticity rule into winner-take-all configurations. This analysis is different from the previous one in that the system self-organizes without supervision.

3.1 Discrimination for a Single Output Neuron. We consider a single output neuron in the MB lobes, usually referred to as an extrinsic Kenyon cell (eKC). We will show how the ability of this neuron to discriminate one odor from the rest depends on the total number of neurons in the intrinsic KC layer and the level of KC activity. The obtained discrimination ability can be enhanced by redundancy, for example, by using several decision neurons. The single-neuron calculations presented here are in this sense a lower bound on the ability of the system to perform discrimination. The application of error-correcting codes (Hamming, 1950) based on redundancy will be investigated in future work.
The goal of the investigation presented in this section is to obtain exact quantitative statements rather than general trends, to be able to compare our results to experimental findings. Let us proceed with the analysis as illustrated in Figure 4. As a first step, we need to calculate the probability distribution of the number of active intrinsic KC neurons (iKCs). The expectation value for the number of active iKCs is given by E[n_KC] = N_KC p_KC, where N_KC is the number of iKCs and p_KC is given by equation 2.4 (see section A.1 in the appendix). The probability distribution for the number of active iKCs is, however, not binomial despite the random connectivity matrix between the AL and the MB. It can be calculated using the same procedure as in Nowotny and Huerta (2003). The details are explained in the appendix. The probability distribution P(n_KC = k) cannot be simplified into a more compact form. However, it can be calculated for given p_C
Figure 4: Characteristics of the first processing stage, the nonlinear expansion. The input probability space in the AL is fixed as a Bernoulli process with p_AL = 0.15 in 50-millisecond intervals, as observed experimentally. The probability distribution, which is a binomial distribution, is shown in the lower left graph. The intrinsic KCs receive direct input from the PNs according to equation 2.1. The main parameter here is the probability of connection from the AL to the MB, p_C. By changing this parameter, we can regulate the expected activity in the intrinsic KC layer. The resulting probability distribution for the number of active iKCs is not a binomial distribution (dotted line in the lower right graph) but a much wider distribution (solid line in the lower right graph) with an expected value given by equation 2.4. Given this distribution regulated by p_C, we study the ability to discriminate uncorrelated inputs in the AL by using a single extrinsic KC. This investigation gives us a lower limit on the performance of this system that can be quantitatively compared to experimental data.
and p_AL and stored in the computer for further calculations. As shown in Figure 4, the distribution P(n_KC = k) is three to four times wider than the binomial distribution obtained if assuming independence of firing events in the iKCs. The distributions in the figure were obtained for the typical parameter values of the locust. To quantitatively determine the ability of the system to discriminate, we then generate a set of random inputs x^0, x^1, …, x^N in the AL by independent identical Bernoulli processes with probability p_AL. In section 2.1, we give the conditions such that the map from the AL activity to the MB activity is an injective function. If these conditions are fulfilled, the set of random inputs in the AL corresponds (with probability close to one) uniquely to a set y^0, y^1, …, y^N of activity vectors in the MB. The activities n_KC(y^j) of the activity vectors obey the probability distribution given in equation A.4. The first pattern y^0 is then trained in one single classification output neuron: one neuron is forced to spike in response to this pattern, while for all other input patterns, the system acts autonomously. The Hebbian plasticity rule given in equations 2.6 and 2.7 generates a learned connectivity vector w = y^0 if the rule is applied on the order of max{1/p_−, 1/p_+} times. Given this learned vector w, we determine the structural parameters such that z(y^1) = z(y^2) = ⋯ = z(y^N) = 0 with probability close to one. The probability of discrimination P(z(y^1) = 0, z(y^2) = 0, …, z(y^N) = 0) is calculated in the appendix in section A.2. For successful discrimination, it needs to be as close to one as possible. To obtain the probability of discrimination, we basically estimate the degree of overlap of each of the uncorrelated inputs with the learned one. If the learned vector has l(w) ones, the probability of having i overlapping 1s between w and any given y^j is
    p(i overlapping ones) = C(l(w), i) · C(N_KC − l(w), l(y^j) − i) / C(N_KC, l(y^j)).    (3.1)
The overlapping activity vectors are read by an eKC that has a threshold of activation, θ_LB. In order for the eKC not to spike in response to the wrong activity vector y^j, the probability for overlaps with i ≥ θ_LB has to be close to 0. It can quickly be seen that for sparse activity, the probability of overlaps decreases. The advantage of using sparse code for associative memory was pointed out by Marr (1969) and others (Willshaw, Buneman, & Longuet-Higgins, 1969) 30 years ago using nonoverlapping patterns of activity. In this work, we rigorously calculate the probability of discrimination in order to determine the minimum size of the intrinsic KC layer, the maximal and minimal activity in the intrinsic KC layer, the capacity of the system, and the ability to discriminate similar odors. This will allow us to compare our results to the real system on a quantitative level.
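Equation 3.1 and the threshold argument above can be checked numerically. The sketch below is ours; it computes the probability that an eKC fires for a wrong pattern, i.e., that the overlap reaches θ_LB. The activity levels are illustrative, with N_KC = 50,000 and θ_LB = 7 as used for the locust later in the text.

```python
from math import comb

def p_overlap(i, l_w, l_y, n_kc):
    """Equation 3.1: probability of exactly i overlapping 1s between the
    learned vector w (l_w ones) and an uncorrelated pattern y (l_y ones)."""
    return comb(l_w, i) * comb(n_kc - l_w, l_y - i) / comb(n_kc, l_y)

def p_false_positive(l_w, l_y, n_kc, theta_lb):
    """Probability that the eKC fires for a wrong pattern, i.e. that the
    overlap reaches the activation threshold theta_lb."""
    return sum(p_overlap(i, l_w, l_y, n_kc)
               for i in range(theta_lb, min(l_w, l_y) + 1))

# Sparse activity (100 of 50,000 iKCs) keeps false positives negligible;
# dense activity (1,000 of 50,000) does not.
print(p_false_positive(100, 100, 50_000, theta_lb=7))
print(p_false_positive(1_000, 1_000, 50_000, theta_lb=7))
```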
3.1.1 Dependence on the Intrinsic KC Layer Size. It is interesting to investigate the size of the intrinsic KC layer because of the contradiction that nature seems to have chosen in insect olfaction. On the one hand, the number of 50,000 intrinsic KCs in the locust is very large. On the other hand, the locust seems to use only a few of these cells. That raises the question: Is there an optimal size of the intrinsic KC layer that allows the system to work most efficiently? There are two main cases: keep p_KC constant, leading to an increasing number of active neurons n_KC = p_KC · N_KC with increasing size N_KC, or keep the number of active neurons n_KC constant by adjusting p_KC appropriately. Using constant p_KC is problematic because the learned vector will have too many active connections due to the Hebbian plasticity rule when N_KC is large. Since the threshold θ_LB of the decision neuron is not size dependent, this causes the neuron to fire for any input, and the network's discrimination ability drops drastically for large sizes. We therefore concentrate on the more interesting second case, ensuring a constant average number of active neurons by using equation 2.4 to modify p_C appropriately. Figure 5A shows the probability of discrimination for a total of N = 100 inputs, for various values of the (fixed) average number n_KC of active intrinsic KCs, as a function of the intrinsic KC layer size N_KC. For better discrimination, one obviously needs to increase the number of neurons in the intrinsic KC layer. Note that although the scale shown is logarithmic, one can see a relatively small area in which the system makes a transition from failure to function. This points to the possibility that there is a critical number of intrinsic KCs that can be quantitatively compared to experimental data. The most important result is that the minimum iKC layer size for successful discrimination has a scaling property as a function of the average number of active iKCs.
The results are shown in Figure 5B for three different values of N. As one can see in the log-log plot, the exponent does not depend on N. When we fit these values, N_KC ∝ n_KC^2.12. We can theoretically estimate that N_KC ∝ n_KC² (this will be published elsewhere). However, we do not know at this point where the discrepancy between the exponents stems from, because we took severe limiting cases, like p_AL p_C → 0 and N_AL ≫ θ_KC, which has an effect on the scaling exponent. The implication of this result is that for discrimination purposes, if the internal representation in the intrinsic KC layer is large, the size of this layer needs to be even larger. The effects of noise can be strong, because noise would require increasing the number of active neurons, as explained in Garcia-Sanchez and Huerta (2003). We can conjecture that the intrinsic KC neurons need to have very low endogenous noise.
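The transition from failure to function can also be seen in a small Monte Carlo experiment. The sketch below is ours and simplifies the model by sampling KC activity vectors directly as random subsets of fixed size n_KC (the fixed-activity case) rather than propagating AL patterns through a random connectivity matrix; all sizes are illustrative.

```python
import random

def discrimination_prob(n_kc, n_active, theta_lb, n_inputs, trials, rng):
    """Monte Carlo estimate of the discrimination probability as a function
    of the layer size n_kc with the number of active iKCs held fixed.  KC
    activity vectors are sampled directly as random n_active-subsets."""
    ok = 0
    for _ in range(trials):
        learned = set(rng.sample(range(n_kc), n_active))   # w = y0
        ok += all(
            len(learned & set(rng.sample(range(n_kc), n_active))) < theta_lb
            for _ in range(n_inputs))
    return ok / trials

# Discrimination improves sharply with layer size at fixed activity:
rng = random.Random(0)
for n_kc in (100, 400, 1600, 6400):
    print(n_kc, discrimination_prob(n_kc, n_active=20, theta_lb=5,
                                    n_inputs=50, trials=200, rng=rng))
```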
3.1.2 What Is the Maximum Number of Active iKCs That Allows the System to Discriminate? It has been experimentally observed that the representation
Figure 5: (A) Probability of discrimination for N = 100 inputs for several numbers of active neurons n_KC in the intrinsic KC layer, for a threshold value of activation in the MB lobe layer of θ_LB = 7. (B) Scaling law properties of the boundary of discrimination. Solid line: N = 1000 inputs; middle dashed line: N = 100; dashed line: N = 10. The exponent of the power law is 2.12, N_KC ∝ n_KC^2.12.
of information in the intrinsic KC layer of the locust is sparse (Perez-Orive et al., 2002). The percentage of iKC neurons that do not respond to the presentation of any given odor is 90% (the statistic was gathered with fewer than 100 neurons). This means that only ≈ 10% are active during 1 second of odor response. Therefore, the probability that an iKC will be active in a window of 50 milliseconds is p_KC ~ 0.005. This means that on average, there are 250 active neurons in every 50 ms snapshot. If we compare this value to the theoretical sparseness values presented in Figure 6A, we obtain that the probability of discrimination is 97% for N = 10 uncorrelated inputs, 84% for N = 100, and 56% for N = 1000. As explained in section 3.1, these values can be improved by using redundant error-correcting code techniques. It turns out that there is a mechanism that increases the boundary of successful discrimination and thus induces a better correspondence to the experimental observations. Let us assume in the following, for convenience of the analysis, that all connections from the iKCs to the extrinsic KC are initially very unlikely. Recall that our learning paradigm so far assumed that the learning time is much larger than max{1/p_−, 1/p_+}, such that w = y(x^0). For finite time, however, there is a nonzero probability of w ≠ y(x^0). The probability of having an active connection at time t during learning is p_t = 1 − (1 − p_+)^t, where t denotes the number of presentations of input x^0. This probability is valid only for the connections whose presynaptic neuron is active. Therefore, we can write

    p(l(w)) = Σ_{i=l(w)}^{N_KC} C(i, l(w)) p_t^{l(w)} (1 − p_t)^{i−l(w)} P(n_KC = i),    (3.2)
Figure 6: (A) Probability of discrimination for N = 1000 (solid line), N = 100 (middle dashed line), and N = 10 (dashed line) inputs as a function of the average number of active neurons n_KC in the intrinsic KC layer, for a threshold value of activation in the MB lobe layer of θ_LB = 7 and N_KC = 50,000 as in the locust. The experimental values for the locust are located near the boundary of the discrimination regime. (B) Effects of having less time to reinforce the positive odor. The solid line represents p_t = 1, the dotted line is for p_t = 0.9, and the dashed line for p_t = 0.8. A limited time for learning increases the boundaries of the discrimination regime.
where P(n_KC = i) is calculated in the appendix. The probability p(l(w)) of having l(w) connections from the iKCs to an eKC is used in equations A.17 and A.22 in the appendix. If we recalculate the probability of discrimination, we obtain the results shown in Figure 6B. If the transition probability p_+ is decreased, the sparse region increases. So in this case, a limited time to learn the vector to be discriminated can increase the boundary of the parameter region for successful discrimination.

3.1.3 Discrimination Capacity. Given a set of structural parameters such as the threshold for activation, θ_LB, the size N_KC of the intrinsic KC layer, and the level of activity p_KC, we try to determine the maximal number of inputs that can be discriminated; we determine the maximum value N_max of N such that P(z(y^1) = 0, …, z(y^N) = 0) ≥ 1 − ε, where ε is a fixed small error tolerance. In the appendix (section A.2.4), we calculate the upper bound for the capacity,

    N_max ≤ log(1 − ε) / log P(z = 0) ≈ ε / P(z = 1).    (3.3)
The result shows that capacity is proportional to the inverse of the probability of misclassication. It is trivial to say that if we increase the probability of discrimination, we can increase the capacity to discriminate. However, it
is not trivial to conclude that the capacity to discriminate is proportional to the inverse of the probability of misclassification.

3.1.4 Resolution of Discrimination. In the previous sections, we dealt with an uncorrelated set of inputs in the AL. This is a good starting point for achieving good discrimination abilities. The next question is what the resolution of the insect olfactory system is when the inputs are highly correlated or very similar. Let us consider an input x that is learned as the vector x^0; that is, the representation of x in the intrinsic KC layer, y(x), corresponds to w. Now let us consider another vector x′, which is at distance l(x′ − x) = μ. This means that the vectors x′ and x differ in μ ones. Let us define f as the event that the input x produces a spike in a neuron of the intrinsic KC layer and f′ as the event that the input x′ produces a spike in the same neuron of the intrinsic KC layer. We want to determine the probability that the input x′ produces a spike given that the (similar) input x already produced a spike and that the distance between x′ and x is μ, l(x′ − x) = μ. As always, we also need to condition on a given number of active AL neurons. We denote the desired probability as q ≡ P(f′ | n_AL = k, f, l = μ). In the appendix (section A.2.5), we explain how to calculate q. After obtaining q, we can calculate whether there are sufficient 1s in the intersection of y(x′) and w = y(x) such that the eKC under consideration is above threshold for input x′. Given that there are n_KC = r active intrinsic KCs in y(x), the probability of having i coincident 1s in the new vector y(x′) is

    P(i coincident 1s | n_AL = k, n_KC = r) = C(r, i) q^i (1 − q)^{r−i}.    (3.4)
Therefore,

    P(i coincident 1s) = Σ_{k=0}^{N_AL} Σ_{r=i}^{N_KC} P(i coincident 1s | n_AL = k, n_KC = r) P(n_KC = r | n_AL = k) P(n_AL = k),    (3.5)
where P(n_KC = r | n_AL = k) and P(n_AL = k) are given in the appendix (section A.1). We now have the tools to calculate the probability of z(x′) = 1 given that z(x) = 1 and l(x′ − x) = μ:

    P(z(x′) = 1 | z(x) = 1, l(x′ − x) = μ) = 1 − Σ_{i=0}^{θ_LB−1} P(i coincident 1s).    (3.6)
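A simplified numerical version of equations 3.4 and 3.6 (ours): we condition on a fixed number r of active iKCs and treat q as a free parameter instead of computing it from the appendix, which is enough to see how the response to a similar input decays.

```python
from math import comb

def p_respond_to_similar(q: float, r: int, theta_lb: int) -> float:
    """Simplified equations 3.4 and 3.6: with r active iKCs for the learned
    input x, and probability q that each one stays active for the similar
    input x', the chance that x' still drives the eKC above threshold."""
    p_below = sum(comb(r, i) * q**i * (1 - q)**(r - i)
                  for i in range(theta_lb))
    return 1 - p_below

# The response decays as the inputs become less similar (q decreasing);
# r = 20 and theta_LB = 7 are illustrative values.
for q in (0.9, 0.5, 0.1):
    print(q, p_respond_to_similar(q, r=20, theta_lb=7))
```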
Figure 7 shows the ability of the system to discriminate similar inputs as a function of the separation of the input states and for different average levels of activity in the intrinsic KC layer. The system is able to separate or
Figure 7: Ability to discriminate similar inputs as a function of the distance between the inputs and different levels of average activity in the intrinsic KC layer. The probability of separating the first learned vector from a similar vector is high when the distance is larger than 2, 3, or 4, depending on the level of sparseness of the activity in the KC layer.
discriminate when the distance between the inputs is larger than 4, which is a rather small distance considering the dimension of the input space, that is, the number of AL neurons. We therefore conclude that it has a great ability to discriminate similar inputs.

3.2 Association for One Single Output. In the general problem of association, one has a set of uncorrelated inputs in the intrinsic KC layer, y^1, y^2, …, y^d. All these uncorrelated inputs need to produce a 1 in the classification neuron, z(y) = 1, using Hebbian learning. The rule in equations 2.6 and 2.7 has two limiting cases depending on the choice of p_+ and p_−. The first limiting case is given for p_+ → 0 and p_− = 1. Note that in this case, every time that there is a 0 in the input y_j, the connectivity w_j immediately drops to 0. Only the 1s occurring in all d inputs will produce a w_j = 1, in an average time 1/p_+. Therefore, the connectivity vector w is eventually equal to the intersection of all the inputs, that is, w = y^1 ∩ y^2 ∩ ⋯ ∩ y^d. The second limiting case is given in the extreme p_+ = 1 and p_− → 0. Following a similar argument as before, it can easily be seen that w = y^1 ∪ y^2 ∪ ⋯ ∪ y^d in an average time 1/p_−. Playing with p_+ and p_−, one can move w between these two limiting cases. Let us study the disadvantages of the first case (p_+ → 0, p_− = 1). To allow any activation of the output neuron, the number of 1s in w needs to be greater than or equal to θ_LB. One therefore needs to find parameters such
that the probability of having fewer than θ_LB 1s in w is very small, that is, P(l(w) < θ_LB) = ε for some small given ε. Here, l(w) = Σ_j w_j is the number of 1s in w. The proper method to calculate this probability is explained in the appendix in section A.3. However, it is not possible to carry out due to its computational cost. Nevertheless, we can make a simple approximation that can contribute to clarifying the association problem. Let us approximate the probability distribution p_iKC(n_KC) by a binomial distribution with probability p_KC, such that N_KC p_KC is the expected number of active neurons in the intrinsic KC layer. Then P(l(w) < θ_LB) can easily be calculated as

    P(l(w) < θ_LB) = Σ_{i=0}^{θ_LB−1} C(N_KC, i) (p_KC^d)^i (1 − p_KC^d)^{N_KC−i}.    (3.7)
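Equation 3.7 can be inverted numerically to reproduce the trend of Figure 8. The grid search below is our own helper, not part of the paper; N_KC = 50,000, θ_LB = 7, and ε = 10^{−6} are the values used in Figure 8.

```python
from math import comb

def p_w_too_sparse(p_kc: float, d: int, n_kc: int, theta_lb: int) -> float:
    """Equation 3.7: probability that the intersection w of d independent
    patterns has fewer than theta_lb ones (binomial approximation)."""
    p = p_kc ** d
    return sum(comb(n_kc, i) * p**i * (1 - p)**(n_kc - i)
               for i in range(theta_lb))

def required_activity(d: int, n_kc: int, theta_lb: int, eps: float) -> float:
    """Smallest p_KC on a coarse grid such that P(l(w) < theta_lb) <= eps."""
    for k in range(1, 1000):
        p_kc = k / 1000
        if p_w_too_sparse(p_kc, d, n_kc, theta_lb) <= eps:
            return p_kc
    return 1.0

# Associating even a handful of inputs by intersection forces increasingly
# dense KC activity (compare Figure 8):
for d in (1, 2, 5, 10):
    print(d, required_activity(d, 50_000, 7, 1e-6))
```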
Using this probability, one can calculate the necessary level of activity p_KC in the intrinsic KC layer as a function of d for a fixed tolerance ε. Figure 8 shows that in order to be able to associate a few uncorrelated inputs, one needs to strongly increase the level of activity in the network up to unrealistic levels. And this is not all. One also needs to make sure that one can still discriminate the learned inputs from another sequence of uncorrelated inputs y^{d+1}, y^{d+2}, …, y^{N+d+1} with the activity p_KC obtained from Figure 8. The way to measure this ability is to calculate P(z(y^{d+1}) = 0, z(y^{d+2}) = 0, …, z(y^{N+d+1}) = 0). How to calculate this probability follows from equation A.22. The result of this calculation is that for N = 1 and d = 2, P(z(y^{d+1}) = 0, z(y^{d+2}) = 0, …, z(y^{N+d+1}) = 0) = 10^{−7}, whereas for larger
Figure 8: Necessary activity level p_KC for an intrinsic KC layer of 50,000 neurons to be able to associate d independent inputs in the intersection scenario. The threshold value of activation in the MB lobe layer was θ_LB = 7, and we used a tolerance level of ε = 10^{−6}. Details are explained in the text.
values of d, the probability is practically 0. This means that in principle, one can associate d uncorrelated inputs. However, later one can no longer distinguish them from the remainder of the N inputs. Therefore, the plasticity parameters that led to the intersection of the d inputs (p_+ → 0, p_− = 1) are not suitable for association. Let us proceed to the second scenario, in which one has the union of d inputs (p_+ = 1, p_− → 0). In this training process, active states of the preceding stimulus are diffused to avoid the restrictive condition of the intersecting sets. One obtains a connectivity vector that is the union of all the training input vectors, w = y^1 ∪ y^2 ∪ ⋯ ∪ y^d. The probability distribution p(l(y^1 ∪ y^2 ∪ ⋯ ∪ y^d)) can in principle be calculated using the inclusion-exclusion principle to obtain l(∪_{1≤i≤d} y^i), namely,

    l(∪_{1≤i≤d} y^i) = Σ_{1≤i_1≤d} l(y^{i_1}) − Σ_{1≤i_1<i_2≤d} l(y^{i_1} ∩ y^{i_2}) + Σ_{1≤i_1<i_2<i_3≤d} l(y^{i_1} ∩ y^{i_2} ∩ y^{i_3}) − ⋯ + (−1)^{d+1} l(∩_{i=1}^{d} y^i).
Figure 9: (A) Probability of discriminating d associated inputs from N = 10 uncorrelated inputs. The solid line represents the values using the approximation explained in the text. The circles show the semianalytical values, for which p(l(y^1 ∪ y^2 ∪ ⋯ ∪ y^d)) was estimated from simulations. (B) Maximum number of associated inputs, d_max, for an intrinsic KC layer of 50,000 neurons as a function of the average number of active neurons n_KC, for a threshold value of activation in the MB lobe layer of θ_LB = 7. The lines from left to right are for N = 10², 10⁴, 10⁶. We used a tolerance level of ε = 10^{−6}.
the approximation starts to separate for probability values far from 1. Since we are interested in determining d_max such that P(z(y^{d+1}) = 0, z(y^{d+2}) = 0, …, z(y^{N+d+1}) = 0) ≥ 1 − ε, that is, we work in a region where the probabilities are close to 1, we can reasonably use the approximation. Figure 9B shows the maximum number of associated inputs d_max as a function of the average number of active neurons, n_KC = p_KC N_KC, in a KC layer of N_KC = 50,000 neurons. Sparse levels of activity achieve the higher values for the number of inputs that can be associated. However, the overall ability of the system to associate inputs is not very good. This implies a prediction on how many uncorrelated odors an insect can associate. From this analysis, we can see that there are surely no more than 10.

3.3 Self-Organization for Classification with Many Output Neurons. In the preceding sections, we were able to quantify the dependence of the classification performance of a single output neuron on the KC layer size and the level of activity in this layer. In this section, we investigate the performance of the full system with several output neurons in the decision layer. It remains to be seen whether the dependencies observed for the single output neuron are preserved for the system that self-organizes by mutual inhibition between eKCs and Hebbian learning as described in section 2.2.1. First, let us be specific about the input classes used in the AL layer. We create a basis for a set of input classes by a Bernoulli process such that the probability of having a 1 in an AL neuron is p_AL; that is, we create N_c independently chosen basis vectors η^1, η^2, …, η^{N_c}. For each basis vector η^μ, we then create a set or class of N_r inputs that are highly correlated to the basis vector η^μ. The vectors in each of the classes μ are denoted as x^j(η^μ), where μ = 1, …, N_c and j = 1, …, N_r. The inputs x^j(η^μ) belonging to class μ are generated by relocating each 1 in η^μ with probability p_r. For p_r = 0, all the inputs of the class are therefore exactly the same, x^j(η^μ) = η^μ, and they become increasingly different from each other and from the basis η^μ with increasing p_r. This mechanism to generate classes allows us to automatically create a large number of correlated and uncorrelated inputs with a controlled degree of correlation. The correlated inputs are those generated from the same basis vector η^μ, and the uncorrelated ones are those produced from different such vectors. We shall see that the correlated inputs will be classified by the same output neurons and the uncorrelated ones by different ones. Many parameters need to be adjusted at this point. Let us first specify which parameters are fixed due to computational constraints. The number of input neurons, N_AL, is set to 100. This number is close to the number of projection neurons in the AL of Drosophila. Note that the locust has 830, which makes it very difficult to gather sufficient statistics in this case due to long computation times. The intrinsic KC layer size, N_KC, is mainly set to 2500, again as in Drosophila. This parameter is occasionally varied to test the classification performance as a function of the size of this layer. The activity level in the AL of the locust is experimentally at 15% in a 50-millisecond snapshot, p_AL = 0.15. We make the assumption that this is similar in Drosophila.
The rest of the parameters, p_C and θ_KC, are chosen to keep the mapping from the AL activity onto the KC activity an injective function, as explained in section 2.1, and to regulate the level of average activity, n_KC, in the intrinsic KC layer. The number of extrinsic KCs is not known, so we assume that N_LB = 100. Note that the qualitative results do not depend on this number. The output response of the decision layer has n_W active neurons for every input x_j(ξ^μ). This is achieved via an appropriate level of mutual inhibition between eKCs.

The correlated inputs that belong to the same class should be classified by the same output response, while inputs from different classes are to be separated, resulting in different output responses. To measure the degree of such separation or nonseparation, we define two quantities that quantify the distance within and among sets of outputs. Let us first denote the average output in response to a class μ of inputs by

\langle z^\mu \rangle = \langle z(x_j(\xi^\mu)) \rangle_{x_j(\xi^\mu)},   (3.9)

where j takes the values 1, …, N_r and labels the inputs within the class. Then the distance within a given class is defined as

D_intra = \frac{1}{N_c} \sum_{\mu=1}^{N_c} \frac{1}{N_r} \sum_{j=1}^{N_r} |z(x_j(\xi^\mu)) - \langle z^\mu \rangle|,   (3.10)
R. Huerta et al.
[Figure 10: D_intra and D_inter (y-axis, 0 to 8) versus iteration number (x-axis, 0 to 1500).]

Figure 10: An example of the evolution of learning through several presentations of odors for p_r = 0.9, n_W = 5, p_+ = 0.2, p_− = 0.5, and the average number of active neurons in the intrinsic KC layer set to n_KC = 35. The solid lines represent the inter- (upper) and intradistance (lower) for an initial probability of connection from the intrinsic KCs to the MB lobe of 0.1. The dashed lines represent an initial probability of connection of 0.05. This initial probability of connection is an important parameter for minimizing learning time.
and the interclass distance is expressed as

D_inter = \frac{1}{(N_c^2 - N_c)/2} \sum_{\mu=1}^{N_c} \sum_{\nu > \mu} |\langle z^\mu \rangle - \langle z^\nu \rangle|.   (3.11)
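Definitions 3.9 to 3.11 translate directly into code. A small Python sketch (the L1 norm is assumed for |·|, and the input format, a list of classes each holding output vectors, is our choice):

```python
from itertools import combinations

def mean_vec(vs):
    # componentwise average output <z^mu> over a class of output vectors
    n = len(vs)
    return [sum(col) / n for col in zip(*vs)]

def l1(a, b):
    # L1 distance between two output vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def intra_inter(outputs_by_class):
    """Return (D_intra, D_inter) for a list of classes of output vectors z."""
    means = [mean_vec(c) for c in outputs_by_class]
    d_intra = sum(sum(l1(z, m) for z in c) / len(c)
                  for c, m in zip(outputs_by_class, means)) / len(outputs_by_class)
    pairs = list(combinations(means, 2))
    d_inter = sum(l1(a, b) for a, b in pairs) / len(pairs)
    return d_intra, d_inter
```

For example, two classes with identical members but disjoint active neurons give D_intra = 0 and D_inter equal to twice the number of active neurons, the bound 2 n_W noted below.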
Note that we calculate the distance between classes using the average of the outputs in response to a class. We chose this definition over the distance between all pairs of outputs of different classes to reduce computational costs. For a given n_W, the values of the distances are bounded: D_intra, D_inter ≤ 2 n_W.

As an illustration of how the system self-organizes by means of mutual inhibition and Hebbian learning, Figure 10 shows the time evolution of the system through several odor presentations. The main parameter under study in this figure is the initial probability of connections from the intrinsic KC layer to the MB lobe. Lower initial connection probabilities lead to a high number of iterations before a steady state is achieved. In all our simulations, we chose the initial probability that leads to short learning times.

3.3.1 Level of Inhibition: Number of n_W Active MB Lobe Neurons. In contrast to the AL and the intrinsic KC layer, not much is known about the
connectivity and levels of activity within the MB lobes. Thus, the number of active neurons in the MB lobes, the threshold level, the probability of modification of synaptic connectivity, and the initial number of connections in the system (before learning starts) are free parameters. We need to elaborate on these parameters before running a parametric search.

First, theoretically, the number of classes that can be stored in the MB lobes is \binom{N_{LB}}{n_W}, which increases quickly with the number of active extrinsic KCs, n_W. If we knew the number of classes that the insect stores, we would be able to extract an estimate for n_W from this knowledge. It seems that the larger n_W, the better the system becomes. It is, however, unclear whether the system is able to learn a high n_W/N_LB ratio. n_W also determines the optimal value of the interdistance, D_inter. Let us now consider two vectors that belong to different classes, and let us assume for a first analysis that z and z' are randomly chosen vectors with probability n_W/N_LB of having a 1 at any position. The probability that z and z' share a given active neuron is (n_W/N_LB)^2. Then the probability of having j common active neurons is

P(j) = \binom{N_{LB}}{j} ((n_W/N_{LB})^2)^j (1 - (n_W/N_{LB})^2)^{N_{LB}-j}.

So the average number of common neurons is n_W^2/N_LB. The expectation value for the optimal interdistance is then D_inter ≈ 2 n_W − 2 n_W^2/N_LB. Therefore, although the capacity grows with n_W, the overlaps among responses increase proportionally to n_W^2.

These theoretical arguments are rather academic, as the activity vectors are clearly not independent Bernoulli processes in the real system. In the simulations of the real system, we randomly choose 2000 inputs (with repetition) from the set of N = 400 generated inputs and present them to the system. Usually the system reaches a stationary state after 250 presentations. To produce reliable statistics, we perform a sufficient number of independent trials; 500 repetitions appeared to be the appropriate compromise between maximizing precision and minimizing computer time.

Figure 11 shows one general example of the efficiency of classification as a function of the number of active neurons n_W in the MB lobes. The solid line represents the difference between the inter- and intradistance normalized to the maximum possible value, which is 2 n_W. As we can see, the optimal value to discriminate between classes while grouping the inputs belonging to the same class is n_W = 1. This result is consistent with associative networks, where sparseness helps achieve minimum overlaps in the output (Frolov & Murav'ev, 1993; Nadal & Toulouse, 1990).

Shall we then choose n_W = 1 as a default value? There are additional criteria that we have not considered here: noise, the ability of the downstream reader to see one single neuron, and the implementation of the system with real conductance-based models. In all the calculations shown here, we did not consider noise. Real neurons are, however, always noisy, and some redundancy in the system has to be expected. Second, when considering who reads the output of the MB lobes, we need to ask whether it
[Figure 11: normalized distances (y-axis, 0 to 1) versus n_W (x-axis, 0 to 50).]

Figure 11: Estimates of the representation quality in the MB lobes as a function of the number of active neurons n_W in the lobes. The dotted line is D_intra normalized to the maximum value 2 n_W, the dashed line is D_inter, and the thick solid line is the difference between them normalized to the maximum. The dash-dotted line represents D_inter − D_intra normalized to the expected maximal difference, which is 2 n_W − 2 n_W^2/N_LB. The values used to obtain this curve are N_c = 40, N_r = 10, p_r = 0.9, p_− = 0.5, p_+ = 0.2, and p_C = 0.21. The different parameter sets used in the next sections produce the same qualitative results. In particular, increasing the number of classes does not change the slope of the decreasing functions.
is possible to read the response of a single neuron. This is another reason for some redundancy in the representation. The last issue concerns the implementation of an n_W representation with more realistic neuron models. It is difficult to adjust the level of mutual inhibition, then represented by realistically modeled inhibitory synapses with a given maximal conductance, such that any input drives exactly one neuron to respond. Therefore, to avoid being unrealistic, we will consider n_W = 5 as a moderate choice for the level of activity in the MB lobes.

3.3.2 Intrinsic KC Layer Size. We have shown the dependence of the discrimination performance on the intrinsic KC layer size, N_KC. To compare in the same terms as in previous sections, the activity of the intrinsic KC layer is again regulated to have a constant average number of active neurons n_KC. Here, we chose the same parameter values as in the previous section: N_c = 40, N_r = 10, p_r = 0.1, and p_− = 0.5. The probability of connections from the AL to the MB is set from equation 2.4 such that the average number of active neurons is n_KC = 35. Figure 12 shows, from left to right, the observed inter- and intradistances after learning for different intrinsic KC layer sizes. The discrimination ability
[Figure 12: five panels, D_intra and D_inter (y-axis, 0 to 10) versus p_+ (x-axis, 0 to 0.8), for N_KC = 10^4, 5×10^3, 2.5×10^3, 10^3, and 5×10^2.]

Figure 12: Classification ability among classes versus several sizes of the intrinsic KC layer. From left to right, the size runs from 10,000 neurons to 500. The solid line is the interdistance between clusters, and the dotted line is the average intradistance within the clusters. There is a clear improvement with increasing size of the intrinsic KC layer.
clearly improves with increasing N_KC. It is also clear that the positive plasticity timescale p_+ is critical for smaller KC layer sizes. There is an optimal value of p_+ where the discrimination ability among classes is best. However, we can also observe that the intradistance within a class gets worse with lower values of p_+. It seems reasonable that there is a balance between the ability of the system to separate classes and the ability to group inputs that belong to the same class. Therefore, to obtain an understanding of the overall performance between these two antagonistic goals, we plot D_inter − D_intra versus p_+ in Figure 13. One of the main messages of this plot is that for larger sizes of the intrinsic KC layer, the learning rate is not so important. This means that one of the additional advantages of a larger system size is more robust learning. For sizes larger than 5000, the system reaches a saturation level regardless of the learning rate. This size is smaller than the one in Figure 5 due to the different numbers of inputs used.

3.3.3 Level of Activity in the Intrinsic KC Layer. We have demonstrated that a sparse code in the intrinsic KC layer is optimal for one output neuron, as can be seen in Figure 6. We now analyze the optimal level of activity in the KC layer with a complete decision layer, including mutual inhibition between eKCs. To test the ability of the system to classify properly, we increase the difficulty of the problem by increasing the number of classes to 200. We then perform the same type of simulations as in the previous section, but now for different activity levels of the intrinsic KC layer. To regulate this activity level, we adjust p_C appropriately. To increase the speed of the simulations, we consider only a single MB of Drosophila, which has N_KC = 2500. Figure 14 shows the results of the simulations. They are consistent with the conclusions drawn in the previous section. High activity levels in the intrinsic KC layer do not perform very well. It appears that
[Figure 13: D_inter − D_intra (y-axis, 0 to 8) versus p_+ (x-axis, 0.2 to 1), with curves for N_KC = 500, 1000, 2500, 5000, and 10,000.]

Figure 13: Difference between the inter- and intradistances observed versus the inverse learning rate p_+, extracted from Figure 12.
[Figure 14: five panels, D_intra and D_inter (y-axis, 0 to 6) versus p_+ (x-axis, 0 to 0.8), for n_KC = 48, 113, 230, 465, and 950.]

Figure 14: Classification ability among classes versus several activity levels of the intrinsic KC layer, n_KC, for a system with N_KC = 2500. The solid line represents the interdistance between clusters, and the dotted line is the intradistance within a cluster. It seems that n_KC = 113 classifies better than the rest, and large activity in the intrinsic KC layer produces very poor performance.
the best performance is achieved at activity levels near 100. Note that the interdistance is not very high because there are 200 different classes with 10 inputs per class. This means that there are 2000 inputs that each trigger 5 output neurons in these simulations; therefore, many output neurons have to share classes. To quantify the performance better, we plot the maximum over all tested p_+ of the difference between the inter- and intradistance as a function of the KC activity level, that is, max_{p_+}{D_inter − D_intra} as a function of n_KC.
[Figure 15: max_{p_+}{D_inter − D_intra} (y-axis, 0 to 4) versus n_KC (x-axis, logarithmic, 10 to 1000).]

Figure 15: max_{p_+}{D_inter − D_intra} versus increasing values of the activity n_KC for N_KC = 2500. There is an optimal activity level in the sparse regime close to n_KC = 100.
The result is shown in Figure 15. The best performance is obtained for n_KC roughly around 100. Note the logarithmic scale on the n_KC axis.

4 Conclusions

Our results clearly show that the classification task can be implemented in the first few relay stations of the olfactory system of insects. We have shown that there is no need for complicated, nonrealistic global learning rules or highly specific connections. The fan-out and fan-in network structure, nonspecific or random connections, simple Hebbian learning, and mutual inhibition can implement discrimination between classes and grouping of similar inputs into the same class. Within this classification scheme, two main parameters heavily affect the performance of the system: the intrinsic KC layer size and the level of activity in the MB. In simple terms, the larger the number of neurons, the better the insect will perform in the classification task. For the level of activity in the MB, there is an optimal value at a very sparse level. A straightforward prediction arises: the honeybee, with 170,000 intrinsic KC neurons, will outperform Drosophila in any classification task. However, if we were somehow able to increase the level of activity in the honeybee MB by a genetic mutation or chemical intervention that increases the excitability of the neurons, our theory predicts that the performance will quickly decay.

We have also shown that the structural organization of the AL and the MB provides the basic means to obtain a high level of discrimination for very similar odors. This indicates that the main
function of the intrinsic KC layer is to separate all the odor inputs as much as possible so that they can be associated in the next layer, the MB lobe. The analysis of the association of unrelated inputs into a single output response showed that only a limited number of inputs can be associated (see Figure 9). This is a specific prediction of the analysis in this article: the number of uncorrelated odors that an insect can associate is limited to a few. An experiment to confirm this prediction would need to use a pool of uncorrelated odors. The detection of uncorrelated odors is in itself not a simple matter, but we take it for granted for the moment. The insect would then need to be forced to associate d odors, and we would test whether the insect is able to discriminate these odors from N new, uncorrelated ones. A value of d = 10 associated odors is the maximum that the insect might be able to achieve according to our analysis.

In summary, the correct level of sparseness and an appropriate system size are crucial for association and discrimination in this model, which replicates the structural organization of the olfactory system of insects. The importance of sparseness for the capacity of associative memories is an old theoretical idea, first proposed by Marr (1969) in a model of the cerebellar cortex and by Willshaw and coworkers (1969) in an associative memory model. Since then, a lot of progress has been made concerning the role of sparse representations in increasing the capacity of artificial associative memories (Palm, 1980; Tsodyks & Feigel'man, 1988; Amari, 1989; Buhmann, 1989; Perez Vicente & Amit, 1989). Given that a sparse code has been found experimentally in the intrinsic KC neurons of the MB of the locust (Perez-Orive et al., 2002), it is feasible to locate the classification task in the MB lobes. This is also consistent with experimental observations about the location of learning in odor conditioning.
5 Discussion

We have explored sufficient conditions to achieve classification in the olfactory system of insects. An issue that we have disregarded in this work is the temporal aspect of the neuronal code. Is there a need for a temporal code? From the results obtained, it is clear that increasing the activity in the MB degrades the performance of the system. So what happens if the presented odor concentrations are increased? With increasing odor concentration, the number of active sensory receptors increases, and increased activity at the glomerular level is recorded (Ng et al., 2002; Wang et al., 2003). On the other hand, recordings in the MB of Drosophila show that in normal specimens the levels of activity do not seem to change for different odor concentrations (Wang et al., 2001). Recent investigations have shown that the average activity of the excitatory neurons in the locust AL (Stopfer et al., 2003) remains nearly constant across odor concentrations as well. Between the glomeruli and the MB, there obviously is a gain control mechanism that regulates the activity being fed into the MB. The network of excitatory and
inhibitory neurons in the AL is a good candidate to perform this function. In this work, we did not consider this function or how to integrate the temporal snapshots of activity in the AL. We explored just the boundaries and limitations of a neural network using biologically feasible properties while disregarding time. Summarizing, the olfactory system certainly needs a gain control mechanism in order to maintain the activity of the MB within its working regime.

Another suggested function of the AL that we did not address here is the role of a spatiotemporal code in separating similar odors. Hosler, Buxton, and Smith (2000) and Stopfer, Bhagavan, Smith, and Laurent (1997) showed that temporal activity plays a role in behavior. Although different odors can be discriminated without temporal activity, similar odors require spatiotemporal activity to be separated (Hosler et al., 2000; Stopfer et al., 1997). It has been shown both experimentally and theoretically (Friedrich & Laurent, 2001) that time improves discrimination. In this article, we show that the transformation from the AL to the MB using a random connectivity matrix can produce a very good level of resolution for similar odors (see Figure 7) without using temporal mechanisms.

We like to think that the connections from the AL to the MB are random not only because of the mathematical simplicity but also because it works well. We think that random connections from the AL to the MB allow the system to be ready for any type of input, such that it does not matter whether the insect knows the odor. A similar way of thinking can be found in O'Reilly and McClelland (1994), where random transformations from the entorhinal cortex to the dentate gyrus in the hippocampus are found to be better at achieving pattern separation. In principle, we do not think that there is a need for learning at this stage.
One possible functional role of synaptic plasticity has been shown in Finelli et al. (2004), where synaptic learning improves the sparse representation in the intrinsic KC layer.

Appendix

A.1 P(n_KC = r). The probability that a given intrinsic neuron fires is given by

p_{KC} = \sum_{i=\theta_{KC}}^{N_{AL}} \binom{N_{AL}}{i} (p_{AL} p_C)^i (1 - p_{AL} p_C)^{N_{AL}-i}.   (A.1)
However, in order to estimate the probability of discrimination, we need to calculate the probability distribution P(n_KC = r) of having n_KC = r active intrinsic KCs. The conditional probability that an intrinsic KC fires given that n_AL = k PNs are active is

p(fire | n_AL = k) = \sum_{i=\theta_{KC}}^{k} \binom{k}{i} p_C^i (1 - p_C)^{k-i}.   (A.2)
Then the conditional probability of having n_KC = r active intrinsic KCs given n_AL = k active PNs is

P(n_KC = r | n_AL = k) = \binom{N_{KC}}{r} p(fire | n_AL = k)^r (1 - p(fire | n_AL = k))^{N_{KC}-r},   (A.3)
because of the independently chosen connections. Finally, the probability distribution for the number of active intrinsic KCs is

P(n_KC = r) = \sum_{k=0}^{N_{AL}} P(n_KC = r | n_AL = k) \binom{N_{AL}}{k} p_{AL}^k (1 - p_{AL})^{N_{AL}-k}.   (A.4)
Unfortunately, this probability distribution cannot be expressed in simpler terms and needs to be evaluated numerically. The expectation value of this distribution is by definition

E[n_KC] = \sum_{r=0}^{N_{KC}} r P(n_KC = r)   (A.5)
        = \sum_{k=0}^{N_{AL}} \binom{N_{AL}}{k} p_{AL}^k (1 - p_{AL})^{N_{AL}-k} \sum_{r=0}^{N_{KC}} r \binom{N_{KC}}{r} p(fire | n_AL = k)^r (1 - p(fire | n_AL = k))^{N_{KC}-r}.   (A.6)

The second sum is the expectation value of a binomial distribution with parameters N_KC and p(fire | n_AL = k) and thus is equal to N_KC p(fire | n_AL = k), such that

E[n_KC] = N_{KC} \sum_{k=0}^{N_{AL}} \binom{N_{AL}}{k} p_{AL}^k (1 - p_{AL})^{N_{AL}-k} \sum_{i=\theta_{KC}}^{k} \binom{k}{i} p_C^i (1 - p_C)^{k-i}   (A.7)
        = N_{KC} \sum_{i=\theta_{KC}}^{N_{AL}} \binom{N_{AL}}{i} (p_{AL} p_C)^i (1 - p_{AL} p_C)^{N_{AL}-i}.   (A.8)
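Equations A.1 to A.8 can be checked numerically. A Python sketch with small illustrative parameter values (not the biological ones, chosen only for speed; function names are ours):

```python
from math import comb

def p_fire_given_k(k, p_c, theta_kc):
    # equation A.2: probability that a KC fires given k active PNs
    return sum(comb(k, i) * p_c**i * (1 - p_c)**(k - i)
               for i in range(theta_kc, k + 1))

def p_nkc(n_al, n_kc, p_al, p_c, theta_kc):
    # equation A.4: distribution of the number of active intrinsic KCs
    dist = [0.0] * (n_kc + 1)
    for k in range(n_al + 1):
        pk = comb(n_al, k) * p_al**k * (1 - p_al)**(n_al - k)
        pf = p_fire_given_k(k, p_c, theta_kc)
        for r in range(n_kc + 1):
            dist[r] += pk * comb(n_kc, r) * pf**r * (1 - pf)**(n_kc - r)
    return dist

N_AL, N_KC, P_AL, P_C, THETA = 20, 30, 0.15, 0.2, 2
dist = p_nkc(N_AL, N_KC, P_AL, P_C, THETA)
# equation A.1: single-neuron firing probability
p_kc = sum(comb(N_AL, i) * (P_AL * P_C)**i * (1 - P_AL * P_C)**(N_AL - i)
           for i in range(THETA, N_AL + 1))
mean = sum(r * p for r, p in enumerate(dist))
# the distribution normalizes, and its mean equals N_KC * p_KC (equation A.8)
```

The check that the mean of equation A.4 equals N_KC p_KC is exactly the statement proved in equations A.5 to A.8.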
Note that this is the same expectation value as one obtains assuming independence of the firing events of intrinsic KCs. The standard deviation of the probability distribution, equation A.4, is, however, quite a bit larger than the standard deviation of the binomial distribution obtained under this simplifying assumption.

A.2 P(z(y^1) = 0, z(y^2) = 0, …, z(y^N) = 0).
A.2.1 Preliminary Analysis. All the activity vectors in the intrinsic KC layer are obtained from independently, randomly generated activity vectors in the AL. Assuming that a randomly chosen activity vector y^0(x^0) has been trained sufficiently long, the resulting connectivity vector will be w = y^0, as explained in section 3.1. If another input vector y(x) is chosen randomly, the probability of a proper classification is

P_proper = P(z = 0 | y ≠ y^0).   (A.9)

It is problematic to calculate this conditional probability directly, so let us try to simplify it:

P(z = 0) = P_proper P(y ≠ y^0) + P(z = 0 | y = y^0) P(y = y^0)   (A.10)
         = P_proper (1 - P(y = y^0)).   (A.11)
The probability P(y = y^0) of having obtained the same activity vector twice is identical to P(x = x^0) if we assume that the mapping from the AL to the MB is an injective function. In section 2.1, we give the conditions on the parameters that ensure an injective function; appropriate parameters are used throughout this article. We conclude

P(y = y^0) = P(x = x^0) = \sum_x P(x) P(x)
           = \sum_{l=0}^{N_{AL}} \binom{N_{AL}}{l} p_{AL}^{2l} (1 - p_{AL})^{2(N_{AL}-l)}   (A.12)
           = (p_{AL}^2 + (1 - p_{AL})^2)^{N_{AL}}.   (A.13)
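Equation A.13 is cheap to evaluate. A quick numerical check in Python (the function name is ours):

```python
def p_collision(n_al, p_al):
    # equation A.13: probability that two independent Bernoulli(p_AL)
    # activity patterns of length N_AL coincide exactly
    return (p_al**2 + (1 - p_al)**2) ** n_al

locust = p_collision(830, 0.15)      # vanishingly small
drosophila = p_collision(140, 0.15)  # also negligible
```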
For example, the locust has N_AL = 830 neurons and p_AL = 0.15; this probability is approximately 10^{-106}. Another example is Drosophila with N_AL = 140; here, the probability of collision is about 10^{-17}. We can therefore safely neglect P(y = y^0) in equation A.11 and call P(z = 0) the probability of proper classification.

A.2.2 Probability to Discriminate Another Randomly Chosen Input. The probability of having i coincident 1s in a randomly chosen activity vector y with l(y) 1s, that is, l(y) = \sum_i y_i, and a given connectivity vector w is

P(i | w, l(y)) = \frac{\binom{l(w)}{i} \binom{N_{KC} - l(w)}{l(y) - i}}{\binom{N_{KC}}{l(y)}}.   (A.14)
Then the probability of proper classification given w and l(y) is

P(z = 0 | w, l(y)) = \sum_{i=0}^{\theta_{LB}-1} P(i | w, l(y)).   (A.15)
In the next section, we will use

P(z = 0 | w) = \sum_{l(y)=0}^{N_{KC}} P(z = 0 | w, l(y)) P(n_KC = l(y)).   (A.16)
Finally, the total probability of proper classification for two random inputs is

P(z = 0) = \sum_{l(w)=0}^{N_{KC}} P(z = 0 | w) P(n_KC = l(w)).   (A.17)
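Equations A.4 and A.14 to A.17 combine into a direct numerical evaluation of P(z = 0). A self-contained Python sketch (small illustrative parameters and our own function names):

```python
from math import comb

def p_nkc(n_al, n_kc, p_al, p_c, theta_kc):
    # P(n_KC = r), equation A.4
    dist = [0.0] * (n_kc + 1)
    for k in range(n_al + 1):
        pk = comb(n_al, k) * p_al**k * (1 - p_al)**(n_al - k)
        pf = sum(comb(k, i) * p_c**i * (1 - p_c)**(k - i)
                 for i in range(theta_kc, k + 1))
        for r in range(n_kc + 1):
            dist[r] += pk * comb(n_kc, r) * pf**r * (1 - pf)**(n_kc - r)
    return dist

def hyper(i, lw, ly, n):
    # hypergeometric P(i | w, l(y)), equation A.14
    if i > min(lw, ly):
        return 0.0
    return comb(lw, i) * comb(n - lw, ly - i) / comb(n, ly)

def p_proper(n_al, n_kc, p_al, p_c, theta_kc, theta_lb):
    # total probability of proper classification, equations A.15 to A.17
    dist = p_nkc(n_al, n_kc, p_al, p_c, theta_kc)
    total = 0.0
    for lw, pw in enumerate(dist):
        pz0_w = sum(py * sum(hyper(i, lw, ly, n_kc) for i in range(theta_lb))
                    for ly, py in enumerate(dist))
        total += pw * pz0_w
    return total

p = p_proper(n_al=20, n_kc=30, p_al=0.15, p_c=0.2, theta_kc=2, theta_lb=4)
```

With the biological parameter values, the same loops apply; only the ranges grow, which is exactly the computational cost discussed in section A.3.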
A.2.3 Probability of Discrimination Failure for a Set of N Inputs. We now generalize the result obtained in the previous section to a set of N inputs. The conditional probability for a given connectivity w and a single input generalizes to the joint conditional probability to produce z = 0 with each of a set of N inputs. First,

P(z(y^1) = 0, z(y^2) = 0, …, z(y^N) = 0 | w) = \prod_{j=1}^{N} P(z(y^j) = 0 | w),   (A.18)

because all y^j are statistically independent (as they are functions of the independently chosen x^j). As all x^j are generated by identical Bernoulli processes,

P(z(y^j) = 0 | w) = P(z(y^1) = 0 | w)  for all j,   (A.19)

such that we obtain

P(z(y^1) = 0, z(y^2) = 0, …, z(y^N) = 0) = \sum_{l(w)=0}^{N_{KC}} P(z(y^1) = 0, z(y^2) = 0, …, z(y^N) = 0 | w) P(n_KC = l(w))   (A.20)
   = \sum_{l(w)=0}^{N_{KC}} (P(z = 0 | w))^N P(n_KC = l(w)),   (A.21)

where we used P(z(y^1) = 0 | w) = P(z = 0 | w).   (A.22)
A.2.4 Discrimination Capacity. To estimate the capacity, we need to find the maximum value of N such that 1 − P(z(y^1) = 0, …, z(y^N) = 0) is below a given small tolerance level ε. Thus, we take equation A.22 and apply the Jensen inequality to the convex function f(x) = x^N, leading to

P(z(y^1) = 0, …, z(y^N) = 0) = \sum_{l(w)=0}^{N_{KC}} (P(z = 0 | w))^N P(n_KC = l(w))   (A.23)
   ≥ \left( \sum_{l(w)=0}^{N_{KC}} P(z = 0 | w) P(n_KC = l(w)) \right)^N   (A.24)
   = (P(z = 0))^N.   (A.25)

Note that P(z = 0) has been calculated in section A.2. So N is bounded by

N ≤ \frac{\log P(z(y^1) = 0, …, z(y^N) = 0)}{\log P(z = 0)}.   (A.26)
Using P(z(y^1) = 0, …, z(y^N) = 0) = 1 − ε and taking into account that when P(z(y^1) = 0, …, z(y^N) = 0) is close to 1, P(z = 0) is automatically also very close to 1, we can write to very good approximation

N ≤ \frac{\log P(z(y^1) = 0, …, z(y^N) = 0)}{\log P(z = 0)} ≈ \frac{\varepsilon}{P(z = 1)}.   (A.27)
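The quality of this approximation can be checked with a toy number. A Python sketch (the value of P(z = 0) here is made up purely for illustration):

```python
from math import log

def capacity_bound(p_z0, eps):
    """Exact bound N <= log(1 - eps) / log(P(z = 0)) and the approximation
    eps / P(z = 1), valid when both probabilities are close to 1."""
    exact = log(1 - eps) / log(p_z0)
    approx = eps / (1 - p_z0)
    return exact, approx

exact, approx = capacity_bound(p_z0=0.9999, eps=1e-2)
# in this regime the two bounds agree to within a fraction of a percent
```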
A.2.5 P(f' | n_AL = k, f, l = μ). The probability q can be expressed as

q = P(f' | n_AL = k, f, l = μ)   (A.28)
  = \sum_{\eta=0}^{N_{AL}} P(f' | \sum_j c_j x_j = \eta, n_AL = k, f, l = μ) P(\sum_j c_j x_j = \eta | n_AL = k, f, l = μ)   (A.29)
  = \sum_{\eta=\max\{\theta_{KC}, \mu\}}^{N_{AL}} P(f' | \sum_j c_j x_j = \eta, n_AL = k, l = μ) P(\sum_j c_j x_j = \eta | n_AL = k, f).   (A.30)
The last factor can be rewritten as

P(\sum_j c_j x_j = \eta | n_AL = k, f) = \frac{P(\sum_j c_j x_j = \eta \cap f \cap n_AL = k)}{P(f | n_AL = k) P(n_AL = k)} = \frac{P(\sum_j c_j x_j = \eta | n_AL = k)}{P(f | n_AL = k)}.   (A.31)
All factors are now known, namely,

P(f' | \sum_j c_j x_j = \eta, n_AL = k, l = μ) = \begin{cases} \sum_{i=\max\{\mu-(\eta-\theta_{KC}),\,0\}}^{\mu} \binom{\mu}{i} p_+^i (1 - p_+)^{\mu-i} & \text{for } k \geq \eta,\ \eta \geq \theta_{KC}, \\ 0 & \text{otherwise}, \end{cases}   (A.32)

P(\sum_j c_j x_j = \eta | n_AL = k) = \binom{k}{\eta} p_C^{\eta} (1 - p_C)^{k-\eta},   (A.33)

and P(f | n_AL = k) is given by equation A.2.

A.3 P(l(y^1 ∩ y^2 ∩ ⋯ ∩ y^d) < θ_LB). The probability of having i_1 intersecting 1s given two vectors y^1 and y^2, whose lengths are l_1 and l_2, is

P(i_1 | l_1, l_2) = \frac{\binom{l_1}{i_1} \binom{N_{KC} - l_1}{l_2 - i_1}}{\binom{N_{KC}}{l_2}}.   (A.34)
The probability of having i_2 intersecting 1s of the vector y^3 given i_1 coincidences of the first two vectors is

P(i_2 | l_1, l_2, l_3, i_1) = \frac{\binom{i_1}{i_2} \binom{N_{KC} - i_1}{l_3 - i_2}}{\binom{N_{KC}}{l_3}}.   (A.35)
Given that we know P(i_1 | l_1, l_2), we can now write

P(i_2 | l_1, l_2, l_3) = \sum_{i_1=0}^{\min\{l_1, l_2, l_3\}} \frac{\binom{i_1}{i_2} \binom{N_{KC} - i_1}{l_3 - i_2}}{\binom{N_{KC}}{l_3}} \frac{\binom{l_1}{i_1} \binom{N_{KC} - l_1}{l_2 - i_1}}{\binom{N_{KC}}{l_2}}.   (A.36)
We can continue this procedure to calculate the probability of having i_{d−1} 1s in the vector (y^1 ∩ y^2 ∩ ⋯ ∩ y^d), which is

P(i_{d-1} | l_1, l_2, …, l_d) = \frac{1}{\prod_{j=2}^{d} \binom{N_{KC}}{l_j}} \sum_{\substack{i_{d-2}, i_{d-3}, …, i_1 = 0 \\ i_{d-2} \leq i_{d-3} \leq \cdots \leq i_1}}^{l_{\min}} \binom{l_1}{i_1} \binom{N_{KC} - l_1}{l_2 - i_1} \prod_{j=1}^{d-2} \binom{i_j}{i_{j+1}} \binom{N_{KC} - i_j}{l_{j+2} - i_{j+1}},   (A.37)

where l_min = \min\{l_1, l_2, …, l_d\}.
The not-firing probability of the extrinsic KC is P(l(y^1 ∩ y^2 ∩ ⋯ ∩ y^d) < θ_LB). Therefore, the conditional not-firing probability given (l_1, l_2, …, l_d) is

P(no fire | l_1, l_2, …, l_d) = \sum_{i_{d-1}=0}^{\theta_{LB}-1} P(i_{d-1} | l_1, l_2, …, l_d).   (A.38)
Finally,

P(l(y^1 ∩ y^2 ∩ ⋯ ∩ y^d) < θ_LB) = \sum_{l_1, l_2, …, l_d = 0}^{N_{KC}} P(no fire | l_1, l_2, …, l_d) \prod_{j=1}^{d} P(n_KC = l_j),   (A.39)
where P(n_KC = l_j) is given in section A.1. The main problem with this calculation is the computer time required to evaluate it. The minimum number of calls to the hypergeometric function, implemented using the multiprecision library (Torbjorn, 2001), is (N_max (d − 1) θ_LB)^d. For the parameter values corresponding to the locust, we have θ_LB = 7 and N_max = 2000, where N_max is defined by the condition

\sum_{i=0}^{N_{max}} P(n_KC = i) = 1 - \varepsilon,   (A.40)

with ε = 10^{-6}. The number of associated inputs that we can analyze directly is therefore, with d ≤ 4, rather small. An approximation is required to draw conclusions for larger d.

Acknowledgments

We are very grateful to Gilles Laurent for conversations about this work. We also thank Ildiko Aradi and Gabriel Mindlin for constructive criticisms. This work was partially supported by the National Science Foundation under grants NSF/EIA-0130708 and NSF PHY0097134, and by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Engineering and Geosciences, under grants DE-FG03-90ER14138 and DE-FG03-96ER14592. This work was also partially supported by M. Ciencia y Tecnología BFI20000157 (M.G.).

References

Amari, S. (1989). Characteristics of sparsely encoded associative memory. Neural Networks, 2, 451–457.
Buhmann, J. (1989). Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A, 40(7), 4145–4148.
Caticha, C., Palo Tejada, J. E., Lancet, D., & Domany, E. (2002). Computational capacity of an odorant discriminator: The linear separability of curves. Neural Comput., 14, 2201–2220.
Connolly, J. B., Roberts, I. J., Armstrong, J. D., Kaiser, K., Forte, M., Tully, T., & O'Kane, C. J. (1996). Associative learning disrupted by impaired Gs signaling in Drosophila mushroom bodies. Science, 274(5295), 2104–2107.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Cover, T. (1965). Geometric and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput., 14, 326.
de Belle, J. S., & Heisenberg, M. (1994). Associative odor learning in Drosophila abolished by chemical ablation of mushroom bodies. Science, 263, 692–695.
Distler, P. G., Bausenwein, B., & Boeckh, J. (1998). Localization of odor-induced neuronal activity in the antennal lobes of the blowfly Calliphora vicina: A [3H] 2-deoxyglucose labeling study. Brain Res., 805, 263–266.
Dubnau, J., Chiang, A. S., & Tully, T. (2003). Neural substrates of memory: From synapse to system. J. Neurobiol., 54(1), 238–253.
Dubnau, J., Grady, L., Kitamoto, T., & Tully, T. (2001). Disruption of neurotransmission in Drosophila mushroom body blocks retrieval but not acquisition of memory. Nature, 411(6836), 476–480.
Finelli, L. A., Haney, S., Bazhenov, M., Stopfer, M., Sejnowski, T. J., & Laurent, G. (2004). Effects of a synaptic learning rule on the sparseness of odor representations in a model of the locust olfactory system. Unpublished manuscript, Salk Institute, San Diego, CA.
Friedrich, R., & Laurent, G. (2001). Dynamical optimization of odor representations in the olfactory bulb by slow temporal patterning of mitral cell activity. Science, 291, 889–894.
Frolov, A. A., & Murav'ev, I. P. (1993). Informational characteristics of neural networks capable of associative learning based on Hebbian plasticity. Network, 4, 495–536.
Galizia, C. G., Joerges, J., Kuettner, A., Faber, T., & Menzel, R. (1997). A semi-in-vivo preparation for optical recording of the insect brain. J. Neurosci. Meth., 76, 61–69.
Galizia, C. G., Nagler, K., Holldobler, B., & Menzel, R. (1998). Odor coding is bilaterally symmetrical in the antennal lobes of honeybees (Apis mellifera). Eur. J. Neurosci., 10, 2964–2974.
Galizia, C. G., Sachse, S., Rappert, A., & Menzel, R. (1999). The glomerular code for odor representation is species specific in the honeybee Apis mellifera. Nature Neurosci., 2, 473–478.
Gao, Q., Yuan, B., & Chess, A. (2000). Convergent projections of Drosophila olfactory neurons to specific glomeruli in the antennal lobe. Nat. Neurosci., 3(8), 780–785.
Garcia-Sanchez, M., & Huerta, R. (2003). Design parameters of the fan-out phase of sensory systems. J. Comput. Neurosci., 15, 5–17.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 26(2), 147–160.
Hebb, D. (1949). The organization of behavior. New York: Wiley.
Heisenberg, M. (2003). Mushroom body memoir: From maps to models. Nature Rev. Neurosci., 4, 266–275.
Heisenberg, M., Borst, A., Wagner, S., & Byers, D. (1985). Drosophila mushroom body mutants are deficient in olfactory learning. J. Neurogenet., 2, 1–30.
Hendin, O., Horn, D., & Hopfield, J. J. (1994). Decomposition of a mixture of signals in a model of the olfactory bulb. P. Natl. Acad. Sci. USA, 91(13), 5942–5946.
Hendin, O., Horn, D., & Tsodyks, M. V. (1998). Associative memory and segmentation in an oscillatory neural model of the olfactory bulb. J. Comput. Neurosci., 5(2), 157–169.
Hosler, J. S., Buxton, K. L., & Smith, B. H. (2000). Impairment of olfactory discrimination by blockade of GABA and nitric oxide activity in the honey bee antennal lobes. Behav. Neurosci., 114(3), 514–525.
Joerges, J., Kuettner, A., Galizia, C. G., & Menzel, R. (1997). Representations of odors and odor mixtures visualized in the honeybee brain. Nature, 387, 285–288.
Laurent, G. (2002). Olfactory network dynamics and the coding of multidimensional signals. Nature Rev. Neurosci., 3, 884–895.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. I. (2001). Odor encoding as an active, dynamical process: Experiments, computation, and theory. Annual Rev. Neurosci., 24, 263–297.
Li, Z., & Hertz, J. (2000). Odour recognition and segmentation by a model olfactory bulb and cortex. Network Comput. Neural Syst., 11, 83–102.
Linster, C., & Cleland, T. A. (2001). How spike synchronization among olfactory neurons can contribute to sensory discrimination. J. Comput. Neurosci., 10(2), 187–193.
Marr, D. (1969). A theory of cerebellar cortex. J. Physiol., 202, 437–470.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
McGuire, S. E., Le, P. T., & Davis, R. L. (2001). The role of Drosophila mushroom body signaling in olfactory memory. Science, 293(5533), 1330–1333.
Menzel, R. (2001). Searching for the memory trace in a mini-brain, the honeybee. Learn. Mem., 8(2), 53–62.
Nadal, J. P. (1991). Associative memory: On the (puzzling) sparse coding limit. J. Phys. A, 24, 1093–1101.
Nadal, J. P., & Toulouse, G. (1990). Information storage in sparsely coded memory nets. Network, 1, 61–74.
Ng, M., Roorda, R. D., Lima, S. Q., Zemelman, B. V., Morcillo, P., & Miesenbock, G. (2002). Transmission of olfactory information between three populations of neurons in the antennal lobe of the fly. Neuron, 36(3), 463–474.
Nowotny, T., & Huerta, R. (2003). Explaining synchrony in feed-forward networks: Are McCulloch-Pitts neurons good enough? Biol. Cyber., 89(4), 237–241.
1640
R. Huerta et al.
O’Reilly, R. C. (2001). Generalization in interactive networks: The benets of inhibitory competition and Hebbian learning. Neural Comput., 13,1199–1241 O’Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall, avoiding a trade-off. Hippocampus, 4(6), 661-682. Palm, G. (1980). On associative memory. Biol. Cyber., 36, 59–71. Pascual, A., & Preat, T. (2001). Localization of long-term memory within the Drosophila mushroom body. Science, 294(5544), 1115–1117. Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and the sparsening of odor representations in the mushroom body. Science, 297, 359–365 Perez Vicente, C. J., & Amit, D. J. (1989). Optimised network for sparsely coded patterns. J. Phys. A, 22(5), 559–569. Rodrigues, V. (1988). Spatial coding of olfactory information in the antennal lobe of Drosophila melanogaster. Brain Res., 453, 299–307. Scott, K., Brady, R. Jr., Cravchik, A., Morozov, P., Rzhetsky, A., Zuker, C., & Axel, R. (2001). A chemosensory gene family encoding candidate gustatory and olfactory receptors in Drosophila. Cell, 104(5), 661–673. Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390(6655), 70-74. Stopfer, M., Jayaraman, V., & Laurent, G. (2003). Intensity versus identity coding in an olfactory system. Neuron, 39(6), 991–1004. Torbjorn, G. (2001).GNU MP: The GNU multiple precisionarithmeticlibrary,version. 4.0.1. Available on-line at: http://www.swox.com/gmp/. Tsodyks, M. V., & Feigel’man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105. Vosshall, L. B., Wong, A. M., & Axel, R. (2000). An olfactory sensory map in the y brain. Cell, 102(2), 147–159. Wang, Y., Wright, J. D., Guo, H.-F., Zuoping, X., Svoboda, K., Malinow, R., Smith, D. P., & Zhong, Y. (2001). 
Genetic manipulation of the odor-evoked distributed neural activity in the Drosophilamushroom body. Neuron, 29, 267– 276. Wang, J. W., Wong, A. M., Flores, J., Vosshall, L. B., & Axel, R. (2003). Two-photon calcium imaging reveals an odor-evoked map of activity in the y brain. Cell, 112, 271–282. Willshaw, D., Buneman, O. P., & Longuet-Higgins, H. C. (1969).Nonholographic associative memory. Nature, 222, 960. White, J., Dickinson, T. A., Walt, D. R., & Kauer, J. S. (1998).An olfactory neuronal network for vapor recognition in an articial nose. Biol. Cyber., 78, 245-251. Zars, T., Fischer, M., Schulz, R., & Heisenberg, M. (2000). Localization of a shortterm memory in Drosophila. Science, 288, 672–675. Received June 9, 2003; accepted January 30, 2004.
LETTER
Communicated by Erkki Oja
Adaptive Blind Separation with an Unknown Number of Sources

Ji-Min Ye
Key Lab for Radar Signal Processing and School of Science, Xidian University, Xi'an 710071, China

Xiao-Long Zhu
xlzhu [email protected]

Xian-Da Zhang
[email protected]
Department of Automation, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China
The blind source separation (BSS) problem with an unknown number of sources is an important practical issue that is usually skipped by assuming that the source number n is known and equal to the number m of sensors. This letter studies the general BSS problem satisfying m ≥ n. First, it is shown that the mutual information of the outputs of the separation network is a cost function for BSS, provided that the mixing matrix is of full column rank and the m × m separating matrix is nonsingular. The mutual information reaches its local minima at the separation points, where the m outputs consist of n desired source signals and m − n redundant signals. Second, it is proved that the natural gradient algorithm proposed primarily for complete BSS (m = n) can be generalized to deal with the overdetermined BSS problem (m > n), but it would diverge inevitably due to lack of a stationary point. To overcome this shortcoming, we present a modified algorithm, which can perform BSS steadily and provide the desired source signals at specified channels if some matrix is designed properly. Finally, the validity of the proposed algorithm is confirmed by computer simulations on artificially synthesized data.
1 Introduction

The blind source separation (BSS) problem consists of recovering statistically independent but otherwise unobserved source signals from their linear mixtures without knowing the mixing coefficients. This kind of blind technique has significant potential applications in various fields, such as wireless telecommunication systems, sonar and radar systems, image enhancement, speech processing, and biomedical signal processing.

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1641–1660 (2004)
1642
J. Ye, X. Zhu, and X. Zhang
Since the pioneering work of Jutten and Hérault (1991), a variety of algorithms have been proposed for BSS (see, e.g., Amari & Cichocki, 1998; Sanchez A., 2002; Douglas, 2003; Hyvarinen, Karhunen, & Oja, 2001; Cichocki & Amari, 2002). Generally, the existing algorithms can be divided into four major categories: neural network–based algorithms (Karhunen, Oja, Wang, Vigario, & Joutsensalo, 1997; Amari & Cichocki, 1998; Cichocki, Karhunen, Kasprzak, & Vigario, 1999; Pajunen & Karhunen, 1998; Zhu & Zhang, 2002), density model–based algorithms (Comon, 1994; Amari, Cichocki, & Yang, 1996; Yang & Amari, 1997; Lee, Girolami, & Sejnowski, 1999; Choi, Cichocki, & Amari, 2000; Cao, Murata, Amari, Cichocki, & Takeda, 2003), algebraic algorithms (Cardoso, 1993; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Hyvarinen & Oja, 1997; Lindgren & Broman, 1998; Pham & Cardoso, 2001; Feng, Zhang, & Bao, 2003), and information-theoretic algorithms (Bell & Sejnowski, 1995; Igual, Vergara, Camacho, & Miralles, 2003).

Most of the algorithms for BSS in the literature assume that the number of sources is known a priori; typically, it is taken to be equal to the number of sensors. In practice, however, such an assumption does not often hold. The BSS problem when there are fewer sensors than sources is referred to as underdetermined or overcomplete BSS (Amari, 1999; Lewicki & Sejnowski, 2000). Since in this case at most a part of the source signals can be successfully extracted if additional prior information about the mixing matrix or source signals is not at hand (Cao & Liu, 1996; Li & Wang, 2002), we focus on the overdetermined and determined BSS problem, where the number m of available mixtures is greater than or equal to the true number n of sources: m ≥ n. To deal with the overdetermined BSS problem with an unknown number of sources, several neural network architectures, together with the associated learning algorithms, have been presented (Cichocki et al., 1999).
When a prewhitening (sphering) layer is used to estimate the source number and reduce the dimension of the data vectors, separation results may be poor for ill-conditioned mixing matrices or weak source signals. Similar disadvantages exist if possible data compression takes place in the separation layer instead of the prewhitening layer. Alternatively, some neural algorithms proposed primarily for determined BSS are generalized to learn an m × m separating matrix, and it was observed via simulation experiments (Cichocki et al., 1999) that among the m outputs of the BSS network at convergence, n components are mutually independent or as independent as possible, and they are the desired rescaled or rearranged source signals; the remaining m − n components are copies of some independent component and thus redundant. Since a signal and its copy are coherent (in the ideal case) or highly correlated (in the practical case), one or more copies of a signal can be easily removed using decorrelation approaches. That is, a postprocessing layer is appended to the separation network for elimination of the redundant signals and determination of the number of active source signals.
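The decorrelation postprocessing described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and the correlation threshold are assumptions:

```python
import numpy as np

def prune_redundant(Y, corr_thresh=0.95):
    """Illustrative postprocessing layer: drop output channels that are
    (near-)copies of an earlier retained channel, judged by the magnitude
    of the pairwise correlation coefficient.
    Y is (m, T): m network outputs, T samples.
    Returns the retained channels and the estimated source number."""
    m = Y.shape[0]
    C = np.corrcoef(Y)                  # m x m correlation matrix
    keep = []
    for i in range(m):
        if all(abs(C[i, j]) < corr_thresh for j in keep):
            keep.append(i)              # not a copy of any retained channel
    return Y[keep], len(keep)

# Toy check: two independent signals plus one rescaled copy of the first.
rng = np.random.default_rng(0)
s1 = rng.uniform(-1, 1, 1000)
s2 = np.sin(np.linspace(0, 20, 1000))
Y = np.vstack([s1, s2, 0.5 * s1])       # third channel is redundant
_, n_est = prune_redundant(Y)
```

In the ideal noise-free case a copy has correlation magnitude exactly 1, so the threshold is uncritical; in practice it trades off missed sources against retained redundancy.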
Although extensive experiments show the above phenomenon on redundant signals, a complete theoretical proof has not been found in the literature. One of the goals of this article is to fill this gap. Additionally, we provide new insight into the mutual information of the output components when used as a cost function for the overdetermined BSS problem. Another goal is to show that the natural gradient algorithm for BSS has no stationary point if there are more mixtures than sources, resulting in inevitable divergence. To perform BSS steadily, a modified algorithm is proposed here, which includes the orthogonal natural gradient algorithm (Amari, Chen, & Cichocki, 2000) as a special case and can provide the desired source signals at the first n channels if some matrix is designed properly.

Section 2 presents the problem statement. Section 3 discusses the mutual information contrast function. In section 4, we analyze the behavior of the natural gradient BSS algorithm in the overdetermined case and then propose a modified natural gradient algorithm. The effectiveness of the new approach is justified by computer simulations on artificially synthesized data in section 5. Finally, conclusions are set out in section 6.

2 Problem Statement

Assume that there exist n zero mean source signals s_1(t), …, s_n(t) that are scalar valued and mutually statistically independent at each time instant t. For simplicity, we restrict ourselves to the noise-free case. Denote by the superscript T the transpose of a vector or matrix; then the available sensor vector x_t = [x_1(t), …, x_m(t)]^T is given by

    x_t = A s_t,                                              (2.1)

where A ∈ R^{m×n} is a constant but unknown mixing matrix, and s_t = [s_1(t), …, s_n(t)]^T represents the vector of the unobserved source signals. According to the relation between the source number n and the sensor number m, the BSS problem can be divided into three cases: determined/complete BSS (m = n), overdetermined/undercomplete BSS (m > n), and underdetermined/overcomplete BSS (m < n). For the difficult case where there are fewer mixtures than sources (Cao & Liu, 1996; Amari, 1999; Lewicki & Sejnowski, 2000; Li & Wang, 2002), separation may be achievable only in special instances; that is, not all the source signals can be successfully extracted. Therefore, we focus here on the overdetermined and determined BSS problem (m ≥ n) and make the further assumptions that the mixing matrix A is of full column rank and at most one of the source signals is allowed to be gaussian distributed (Tong, Liu, Soon, & Huang, 1991; Cao & Liu, 1996; Amari, Chen, & Cichocki, 1997; Amari & Cichocki, 1998). The task of the BSS problem is to reconstruct all the source signals from the observations subject to unknown probability distributions. For this purpose, standard neural and adaptive BSS approaches adjust the so-called separating matrix W such that the output vector

    y_t = W x_t                                               (2.2)

can provide the separated source signals. If the number n of sources is equal to that of sensors or is known a priori, then the BSS problem is easily handled (see, e.g., Amari & Cichocki, 1998; Zhang, Cichocki, & Amari, 1999). One can confine the separating matrix W in equation 2.2 to be an n × m matrix with rank n, and update it according to some on-line learning rule until the output y_t is a permuted or rescaled version of s_t, that is,

    y_t = P D s_t,                                            (2.3)

where P is a permutation matrix and D is a nonsingular diagonal matrix. The product of P and D is a generalized permutation matrix, which describes the order indeterminacy and scale ambiguity inherent in the BSS problem (Tong et al., 1991; Hyvarinen et al., 2001; Lu & Rajapakse, 2003).

When the source number n is unknown, the BSS problem becomes much more difficult. Since this is the case in many practical applications, we discuss such a general setting. If the source signals are extracted one by one (Delfosse & Loubaton, 1995; Hyvarinen & Oja, 1997; Thawonmas, Cichocki, & Amari, 1998; Feng et al., 2003), then it is unnecessary to estimate the number of sources. However, sequential BSS approaches have some shortcomings too. Some of them require prewhitening processing of the observations, and the quality of the separated source signals would be increasingly degraded due to accumulated errors from the deflation process. More important, signal extraction at subsequent extraction units cannot proceed unless the current deflation process has converged to the equilibrium point, which is prohibitive in some real-time situations such as wireless communication systems.

As for parallel algorithms that recover the source signals jointly, there are several schemes to tackle the unknown number of sources, including model selection criteria (Sato, 2001; Sugiyama & Ogawa, 2001) and the cross-validation technique (Cao et al., 2003). Cichocki et al. (1999) outline three other approaches. One applies a prewhitening layer to estimate the source number and reduce the dimension of the data vectors. The second method performs the possible data compression in the separation layer that follows the prewhitening layer. Both methods, however, suffer from poor separation results for ill-conditioned mixing matrices or weak source signals. In the third approach, some algorithms for determined BSS are employed directly to learn an m × m separating matrix W.
If it is realized that n components of the m-dimensional output vector y_t are the original source signals or their rescaled versions, and the remaining m − n components consist of copies of some source signal (that is, all the source signals appear at least once in the network outputs), then the m − n redundant signals may be easily removed using decorrelation techniques. The idea of using a postprocessing layer to drop the redundant signals and estimate the source number has the advantage that the structure of the separation network can remain unchanged when the number of sources varies over time, which is of key importance for some engineering applications. Clearly, the third approach is valid on the condition that the m outputs are composed of n independent components and m − n redundant components. Although such a phenomenon has been reported by several researchers (Cichocki et al., 1999), a rigorous theoretical proof remains open.

3 Contrast Function

A contrast function is a function whose maximization leads to source separation under some restrictions (Comon, 1994; Moreau & Macchi, 1996; Moreau & Thirion-Moreau, 1999). There exist in the literature many contrast functions (contrasts) for BSS, including maximum likelihood contrasts (Cardoso, 1997; Pham & Garrat, 1997), mutual information contrasts (Comon, 1994; Amari et al., 1996; Yang & Amari, 1997), nonlinear principal component analysis contrasts (Oja, 1997; Karhunen, Pajunen, & Oja, 1998), and high-order statistics contrasts (Moreau & Macchi, 1996; Moreau & Thirion-Moreau, 1999; Cardoso, 1999). As a mainstream technique for the BSS problem, independent component analysis (ICA) (Comon, 1994) usually employs the mutual information of the outputs of the separation system as the cost function (minus contrast function).
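The role of mutual information as a contrast can be illustrated numerically with a crude histogram estimator. The sketch below is not part of the method developed in this letter (which works analytically with the decomposition that follows); the estimator, bin count, and mixing matrix are assumptions chosen for illustration. For independent sources the estimate is near zero, while their linear mixtures are dependent and score higher:

```python
import numpy as np

def mutual_information_hist(y1, y2, bins=16):
    """Histogram estimate (in nats) of I(y1; y2), i.e. the KLD between
    the joint density and the product of the marginals."""
    pxy, _, _ = np.histogram2d(y1, y2, bins=bins)
    pxy = pxy / pxy.sum()                       # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)         # marginal of y1
    py = pxy.sum(axis=0, keepdims=True)         # marginal of y2
    nz = pxy > 0                                # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, (2, 20000))              # independent sources
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s      # linear mixtures

mi_sources = mutual_information_hist(s[0], s[1])    # close to 0
mi_mixtures = mutual_information_hist(x[0], x[1])   # clearly positive
```

A histogram estimator of this kind has a small positive bias that shrinks with sample size, which is why the derivations below work with the exact densities instead.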
The mutual information describes the dependence among the components of y_t, and it can be measured by the Kullback-Leibler divergence (KLD) (Cover & Thomas, 1991; Amari et al., 1996; Yang & Amari, 1997) between the joint probability density function (PDF) p_Y(y_t) of y_t and its factorized version p̃_Y(y_t) = ∏_{i=1}^{m} p_{Y_i}(y_i(t)), where p_{Y_i}(y_i(t)) is the marginal PDF of y_i(t). For convenience, we omit the time instant t, and thus we have

    I(y; W) = KLD[p_Y(y) || p̃_Y(y)] = ∫_{−∞}^{∞} p_Y(y) ln (p_Y(y) / p̃_Y(y)) dy,    (3.1)

in which ln(·) denotes the natural logarithm operator. Using the property of the KLD (Cover & Thomas, 1991), it is easy to show that I(y; W) ≥ 0, with equality if and only if the y_i (i = 1, …, m) are mutually independent. When m = n, it is proved that I(y; W) is a cost function of the BSS problem (Comon, 1994), meaning

    I(y; W) = 0   iff   WA = G,                               (3.2)
where G is an n × n generalized permutation matrix. Since the m components of y are linear combinations of the original sources, there are at most n independent components in y when m > n. Therefore, I(y; W) is not equal to zero unless the outputs are composed of independent random signals and deterministic zero signals. When adaptive learning rules are applied to search for maxima of the contrast function, no output component of y can stay at zero; hence, we do not consider in this article the special case where some output component is a zero signal. Namely, I(y; W) is always greater than zero for m > n.

Provided that the m × m separating matrix W is invertible and the m × n mixing matrix A is of full column rank, the global transfer matrix C = WA has rank n, which implies that there must exist an n × m submatrix W_1 of W such that W_1 A is nonsingular. Without loss of generality, we suppose that W_1 is composed of the first n rows of W, and denote by W_2 the submatrix constituted by the remaining m − n rows (m > n). Let ŷ = W_1 x = [y_1, …, y_n]^T with PDF p_Ŷ(y_1, …, y_n), and ȳ = W_2 x = [y_{n+1}, …, y_m]^T with PDF p_Ȳ(y_{n+1}, …, y_m). One has

    p_Y(y_1, …, y_m) / [p_{Y_1}(y_1) ⋯ p_{Y_m}(y_m)]
        = [p_Ŷ(y_1, …, y_n) / (p_{Y_1}(y_1) ⋯ p_{Y_n}(y_n))]
          · [p_Ȳ(y_{n+1}, …, y_m) / (p_{Y_{n+1}}(y_{n+1}) ⋯ p_{Y_m}(y_m))]
          · [p_Y(y_1, …, y_m) / (p_Ŷ(y_1, …, y_n) p_Ȳ(y_{n+1}, …, y_m))].    (3.3)

Substitution of equation 3.3 into equation 3.1 results in

    I(y; W) = I(y_1, …, y_n) + I(y_{n+1}, …, y_m) + I({y_1, …, y_n}, {y_{n+1}, …, y_m}),    (3.4)

which is true as well for m = n + 1 if the convention I(y_{n+1}) = 0 is made.¹ By definition, W_1 A is an n × n invertible matrix; therefore, s = (W_1 A)^{−1} ŷ, and thus ȳ = W_2 A · s = W_2 A (W_1 A)^{−1} · ŷ, which implies that the conditional entropy

    H(y_{n+1}, …, y_m | y_1, …, y_n) = 0.                     (3.5)

In addition, I(ȳ, ŷ) = H(ȳ) − H(ȳ | ŷ); it holds that

    I({y_1, …, y_n}, {y_{n+1}, …, y_m}) = H(y_{n+1}, …, y_m).    (3.6)

¹ Note that I(y_{n+1}) ≠ I(y_{n+1}, y_{n+1}), since the right-hand-side term is actually the entropy of y_{n+1}, defined by H(y_{n+1}) = I(y_{n+1}, y_{n+1}) = −∫_{−∞}^{∞} p_{Y_{n+1}}(y_{n+1}) ln p_{Y_{n+1}}(y_{n+1}) dy_{n+1}, which is not equal to zero unless y_{n+1} is a deterministic variable (Cover & Thomas, 1991).
Then, using the property (Cover & Thomas, 1991; Comon, 1994; Yang & Amari, 1997)

    I(y_{n+1}, …, y_m) = Σ_{k=1}^{m−n} H(y_{n+k}) − H(y_{n+1}, …, y_m),    (3.7)

and combining equations 3.4, 3.6, and 3.7, we have

    I(y; W) = I(y_1, …, y_n) + Σ_{k=1}^{m−n} H(y_{n+k}).      (3.8)

The first term on the right-hand side (RHS) of equation 3.8 attains its global minimum of zero if and only if y_1, …, y_n are mutually independent. On the other hand,

    y_{n+k} = Σ_{j=1}^{n} [W_2 A]_{n+k,j} s_j,   k = 1, …, m − n,    (3.9)

in which [B]_{pq} denotes the (p, q)th entry of matrix B. Since a random variable that is the sum of several independent components has entropy not less than that of any individual component (Cover & Thomas, 1991), that is,

    H(y_{n+k}) ≥ max_{j ∈ {1, …, n}} H([W_2 A]_{n+k,j} s_j),    (3.10)

we can infer that the second term on the RHS of equation 3.8 reaches its local minima when each component y_{n+k} is a copy of some source signal.

Let ξ_t be an m-dimensional vector, the first n components of which are the n source signals, and each of the remaining m − n components of which consists of some source signal. Clearly, there are n! · n^{m−n} possible versions of ξ_t in all, and they form the set Ω. The mutual information is invariant to nonzero rescaling and reordering of its arguments, so the following lemma is straightforward.

Lemma 1. The mutual information I(y_t; W) of the m output components of the network 2.2 with a nonsingular separating matrix W is a cost function for BSS with an unknown number of sources, and it attains a local minimum if and only if y_t = Gξ_t, where G is an m × m generalized permutation matrix and ξ_t ∈ Ω.

Theorem 1. If the m × n mixing matrix A is of full column rank (m > n), then the mutual information I(y_t; W) reaches local minima under the sufficient and necessary condition that the m × m separating matrix W = G[(A⁺)^T, N(A^T) + W̄^T]^T, in which G is an m × m generalized permutation matrix, A⁺ is the pseudoinverse of the mixing matrix A, N(A^T) denotes the m × (m − n) matrix constructed from the basis vectors of the null space of A^T, and W̄ denotes an (m − n) × m matrix whose rows are made up of some row(s) of A⁺.
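The structure of the separating matrices characterized by theorem 1 can be checked numerically. The NumPy sketch below is an illustration under assumed dimensions (m = 4, n = 2), with G taken as the identity and W̄ built by repeating the first row of A⁺; it builds W from the pseudoinverse and a null-space basis and verifies that the transfer matrix WA stacks the identity over rows that copy the first source:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 2
A = rng.uniform(-1, 1, (m, n))          # full column rank with probability 1

A_pinv = np.linalg.pinv(A)              # A+ is n x m, and A+ A = I_n
U, _, _ = np.linalg.svd(A)              # left singular vectors of A
N = U[:, n:]                            # m x (m - n) basis of null(A^T)

# W-bar: (m - n) x m matrix whose rows are row(s) of A+ (here row 0 twice).
W_bar = np.vstack([A_pinv[0], A_pinv[0]])

# Separating matrix of theorem 1 with G = I:
W = np.vstack([A_pinv, N.T + W_bar])

# Global transfer matrix: first n rows = I_n, last rows = W_bar A (copies).
C = W @ A
```

Since (N(A^T))^T A = O, the null-space part contributes nothing to WA; only W̄ determines which sources reappear in the redundant channels.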
Proof. Since the m × n mixing matrix A is of full column rank, A⁺A = (A^T A)^{−1} A^T A = I is the n × n identity matrix, and (N(A^T))^T A = (A^T N(A^T))^T = O is an (m − n) × n null matrix. Therefore, the separating matrix W = G[(A⁺)^T, N(A^T) + W̄^T]^T implies that the output vector y_t = WAs_t = G[s_t^T, (W̄As_t)^T]^T = Gξ_t, and vice versa. Clearly, the theorem holds by lemma 1.

If the mixing matrix A is of full column rank and an m × m nonsingular separating matrix W is updated such that the mutual information is minimized, then at convergence, the output y_t contains n independent components that are the separated source signals and m − n redundant components. Such empirical findings were first reported by Cichocki et al. (1999). Here we have presented the theoretical justification behind them from the viewpoint of optimization.

4 A Modified Natural Gradient Algorithm

Based on the relation between the mutual information and the differential entropy (Cover & Thomas, 1991; Amari et al., 1996; Yang & Amari, 1997), the cost function 3.1 can be rewritten as

    I(y; W) = Σ_{k=1}^{m} H(y_k) − H(y_1, …, y_m).            (4.1)

If the m × m separating matrix is confined to be invertible, then (Bell & Sejnowski, 1995; Yang & Amari, 1997)

    H(y_1, …, y_m) = H(x_1, …, x_m) + ln |det(W)|,            (4.2)

where |det(W)| is the absolute value of the determinant. Substituting equation 4.2 into 4.1 and applying the natural gradient learning rule (Amari, 1998)² or, equivalently, the relative gradient learning rule (Cardoso & Laheld, 1996),

    dW/dt = −(∂I(y; W)/∂W) W^T W,                             (4.3)

to search for local minima of equation 4.1 or 3.8, one obtains the discrete-time natural gradient algorithm for BSS (Amari et al., 1996; Yang & Amari, 1997; Amari, 1998; Zhang et al., 1999), as

    W_{t+1} = W_t + η_t [I − φ(y_t) y_t^T] W_t,               (4.4)

² The natural gradient on the Stiefel manifold (Amari, 1999), where the separating matrix is confined to be an orthonormal matrix instead of a nonsingular matrix, is given by dW/dt = −[∂I(y; W)/∂W − W (∂I(y; W)/∂W)^T W].
in which η_t > 0 is a learning rate, φ(y_t) = [φ_1(y_1), …, φ_m(y_m)]^T is a nonlinearly transformed vector of y_t, and

    φ_i(y_i) = −(1 / p_{Y_i}(y_i)) · (dp_{Y_i}(y_i) / dy_i),   i = 1, …, m,    (4.5)

is referred to as the score function in BSS.

From the derivation of learning rule 4.4, it can be seen that the natural gradient algorithm for the complete BSS problem (m = n) can be used to deal directly with the overdetermined BSS problem (m > n). Moreover, algorithm 4.4 is clearly independent of the source number n, so it works as well when the number of sources is unknown. At convergence, one local minimum of the cost function is achieved, and the outputs of the BSS network provide the desired source signals. However, the natural gradient algorithm 4.4 cannot be stable in such a state, since the stationary condition

    E{I − φ(y_t) y_t^T} = O,                                  (4.6)

in which E{·} is the expectation operator, does not hold when y_t = Gξ_t. To understand the behavior of algorithm 4.4 at the separation points, we consider the simplest case, where n < m ≤ 2n, and the first n components of y_t are the original source signals in order, while the remaining m − n components correspond respectively to the first m − n sources, that is,

    y_i(t) = s_i(t),      i = 1, …, n;
    y_i(t) = s_{i−n}(t),  i = n + 1, …, m.                    (4.7)

The outputs in equation 4.7 lead to³

    E{I − φ(y_t) y_t^T} = −[I_{n+1}^1, …, I_m^{m−n}, 0^{m−n+1}, …, 0^n, I_1^{n+1}, …, I_{m−n}^m] ≜ −Θ,    (4.8)

where I_p is the pth column vector of the m × m identity matrix I, 0 is an m × 1 zero vector, and the superscript q denotes the column position of the corresponding vector within the bracketed matrix. Therefore,

    E{dW/dt} = −η_t [w_{n+1}^T, …, w_m^T, 0^T, …, 0^T, w_1^T, …, w_{m−n}^T]^T,    (4.9)

in which w_p represents the pth row vector of the separating matrix W. It can be inferred from equation 4.9 that when algorithm 4.4 converges, the two rows of W that make outputs of the separation network be copies of

³ At convergence, the scaling of the separated source signals satisfies φ_i(y_i(t)) y_i(t) = 1.
the same source signal would have an impact on each other. By theorem 1, these two rows may differ by some transposed basis vector of the null space of A^T. Hence, as long as the whole separating matrix updates within the equivalence class (Amari et al., 2000) defined by

    C_W = { W̃ | W̃ = G[(A⁺)^T, N(A^T) + W̄^T]^T },            (4.10)

where G, A⁺, N(A^T), and W̄ are the same as those in theorem 1, the BSS system can still provide the separated source signals (before divergence), although algorithm 4.4 does not have a stationary point. We stress that due to overflow of some rows of W as a result of the accumulated inter-impact, the natural gradient algorithm 4.4 would diverge inevitably if the learning rate η_t is not sufficiently small, which is verified by our simulation experiments. Additionally, the above analysis is based on the simplest output model 4.7, but similar conclusions can be drawn when m > 2n or when several copies of one or more source signals appear at different output channels.

To perform BSS steadily when there are more mixtures than sources, we modify the natural gradient algorithm as

    W_{t+1} = W_t + η_t [R − φ(y_t) y_t^T] W_t,               (4.11)

where R is an m × m matrix. If R = I is the identity matrix, equation 4.11 is obviously the original natural gradient algorithm 4.4. However, it cannot work stably in the overdetermined case m > n. Recently, a nonholonomic orthogonal learning algorithm for BSS was presented (Amari et al., 2000), where

    R = diag{φ_1(y_1(t)) y_1(t), …, φ_m(y_m(t)) y_m(t)}       (4.12)

is a diagonal matrix. The orthogonal natural gradient algorithm can make the redundant components decay to zero, and it is particularly efficient when the magnitudes of the source signals are intermittent or changing rapidly over time (e.g., speech sounds, electroencephalogram signals). One drawback of this algorithm is that the redundant signals, although very weak in scale, may still be mixtures of several source signals. Furthermore, the magnitudes of some separated source signals are possibly very small as well; hence, it is sometimes difficult to discriminate them from the redundant signals. Besides the above two selections, we propose another choice of R, given by

    R = E{φ(y_t) y_t^T} |_{y_t = Gξ_t},                       (4.13)
which is clearly no longer a diagonal matrix if the number of sensors is greater than the number of sources.⁴ Unlike the diagonal matrix 4.12, a properly designed matrix 4.13 can control the positions of the redundant signals if the source number n is known a priori or has been estimated in advance.

Theorem 2. If the m × n mixing matrix A is of full column rank, S = [s_{t+1}, …, s_{t+K}] is an n × K full row rank matrix of the independent source signals (K ≥ n), and x_t = A s_t is the noise-free observation vector, then A^T spans the same null space as the K × m data matrix X^T = [x_{t+1}, …, x_{t+K}]^T, and the source number n is equal to the rank of X.

Proof. For any m × 1 vector b that lies in the null space of A^T, denoted by null(A^T), we have A^T b = 0. Hence, we use this result and X = AS to get

    X^T b = S^T A^T b = 0,                                    (4.14)

which implies b ∈ null(X^T). On the other hand, for all b ∈ null(X^T), it is easily obtained from equation 4.14 and the full row rank assumption⁵ of S that A^T b = 0, which means b ∈ null(A^T). Since null(A^T) = null(X^T), it is straightforward that the source number equals the rank of X.

By theorem 2, the source number can be determined directly from K samples of the noise-free observation vector. In practice, such a method performs satisfactorily for somewhat large K, for example, K = m ~ 2m, if the sources are all stationary random signals.

Next, we consider the design of R in equation 4.13. If the source number is unknown, we may start R with the identity matrix and then take it as an exponentially windowed average of equation 4.13, that is,

    R_{t+1} = λ · R_t + (1 − λ) · φ(y_t) y_t^T,               (4.15)

where 0 < λ < 1 is a forgetting factor. If the source number is known a priori or has been estimated by theorem 2 or by other schemes (Sato, 2001; Sugiyama & Ogawa, 2001; Cao et al., 2003), we may make use of the magnitude ambiguity and order indeterminacy inherent in the BSS problem (Tong et al., 1991; Hyvarinen et al., 2001; Lu & Rajapakse, 2003) to design the proper matrix R such that the original source

⁴ When m = n, equation 4.13 simplifies to equation 4.12, and in this sense, the proposed algorithm can be regarded as a generalized orthogonal natural gradient algorithm for BSS.

⁵ By statistical theory (Cramer, 1946), when K is sufficiently large, S = [s_{t+1}, …, s_{t+K}] has the same rank with probability 1 as the correlation matrix R_ss = E{s_t s_t^T}. In the BSS problem, the source signals are assumed to be mutually independent, so S is of full row rank with probability 1 for sufficiently large K.
signals are separated at specified channels and the redundant signals are copies of the outputs of certain channels. For example, if the source number n satisfies n < m ≤ 2n, and it is desired that all the source signals are recovered at the first n channels, while the other m − n redundant components are copies of the outputs of the first m − n channels, then we design

    R = I − Θ,                                                (4.16)

in which I is the identity matrix and Θ is defined as in equation 4.8. Alternatively, we can take

    R = I + [1_{m−n}^1, 0^2, …, 0^n, 1_{m−n}^{n+1}, …, 1_{m−n}^m],    (4.17)

where 1_{m−n} is a column vector with the first n elements all equal to zero and the last m − n elements all equal to 1 (the superscript again denoting the column position), so that among the m outputs of algorithm 4.11, the first n components are the separated source signals, whereas the remaining m − n components are all copies of the first component.

5 Simulation Results

In order to verify the effectiveness of the proposed algorithm 4.11, we consider separation of the following source signals (Amari et al., 1996; Yang & Amari, 1997):

S1: Sign signal: s_1(t) = sign(cos(2π · 155t))
S2: High-frequency sinusoid signal: s_2(t) = sin(2π · 800t)
S3: Low-frequency sinusoid signal: s_3(t) = sin(2π · 90t)
S4: Amplitude-modulated signal: s_4(t) = sin(2π · 9t) sin(2π · 300t)
S5: Phase-modulated signal: s_5(t) = sin(2π · 300t + 6 cos(2π · 60t))
S6: Noise signal: n(t) is uniformly distributed in [−1, +1], and s_6(t) = n(t)

Figure 1 shows the source signals in one run. In simulations, nine sensors are used, and the mixing matrix A is fixed in each run, but its elements are randomly generated subject to the uniform distribution in [−1, +1] such that A is of full column rank. The mixed signals are sampled at a rate of 10 kHz, that is, the sampling period T_s = 0.0001 s.

Figure 1: Waveforms of the six original source signals in one run.

The proposed algorithm 4.11 is determined by three ingredients. One is the matrix R. Another is the learning rate parameter; for good separation accuracy, we take η_t = 30 × T_s in simulations. The third ingredient is the nonlinear score functions. Since the PDFs of the source signals are unknown in the BSS problem, the score function is not available, and it is usually replaced by another nonlinear function, the activation function. To perform BSS steadily,
the activation function should satisfy the local stability condition (Amari et al., 1997, 2000; Cardoso, 2000; Hyvarinen et al., 2001). Since the above six sources are all subgaussian signals, one may select φᵢ(yᵢ(t)) = yᵢ³(t). Here, we apply the adaptive function (Yang & Amari, 1997):

φᵢ(yᵢ(t)) = −( (1/2)κ₃ⁱ + (9/4)κ₃ⁱκ₄ⁱ ) yᵢ²(t) + ( −(1/6)κ₄ⁱ + (3/2)(κ₃ⁱ)² + (3/4)(κ₄ⁱ)² ) yᵢ³(t),    (5.1)
where κ₃ⁱ = E{yᵢ³(t)} and κ₄ⁱ = E{yᵢ⁴(t)} − 3 represent, respectively, the skewness and the kurtosis of yᵢ(t). They are updated as follows:

dκ₃ⁱ/dt = −μ_t ( κ₃ⁱ − yᵢ³(t) ),    (5.2)
dκ₄ⁱ/dt = −μ_t ( κ₄ⁱ − yᵢ⁴(t) + 3 ),    (5.3)
in which μ_t is the step-size parameter, taken as μ_t = 50 × Ts in the simulations.
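The adaptive activation function and the cumulant updates above can be written out as a short discrete-time sketch (per-channel scalar updates; the function names are ours, not the paper's):

```python
def adaptive_activation(y, k3, k4):
    """Adaptive activation function of equation 5.1 (Yang & Amari, 1997).
    y: one output sample; k3, k4: running skewness and kurtosis estimates."""
    c2 = -(0.5 * k3 + 2.25 * k3 * k4)                 # coefficient of y^2
    c3 = -k4 / 6.0 + 1.5 * k3 ** 2 + 0.75 * k4 ** 2   # coefficient of y^3
    return c2 * y ** 2 + c3 * y ** 3

def update_cumulants(y, k3, k4, mu):
    """One discrete-time step of equations 5.2 and 5.3."""
    k3 = k3 - mu * (k3 - y ** 3)        # skewness tracks E{y^3}
    k4 = k4 - mu * (k4 - y ** 4 + 3.0)  # kurtosis tracks E{y^4} - 3
    return k3, k4
```

With zero cumulants the activation vanishes, and each update moves the estimates a fraction mu toward the instantaneous moments, matching the continuous-time rules above.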
1654
J. Ye, X. Zhu, and X. Zhang
Two experiments are executed. In the first, the source number is unknown, and we run the natural gradient algorithm 4.4 (η_t = 30 × Ts), the orthogonal natural gradient algorithm 4.11 with 4.12 (η_t = 180 × Ts), and the proposed algorithm 4.11 with 4.13 at the same time. In the latter two algorithms, both R's start with the identity matrix and then take the exponentially windowed averages of algorithms 4.12 and 4.13, respectively, with the same forgetting factor λ = 0.99. To evaluate the performance of the BSS algorithms, we use the cross-talking error as the performance index (Amari et al., 1996; Yang & Amari, 1997),

PI = (1/m) Σ_{p=1}^{m} ( Σ_{q=1}^{n} |c_pq| / max_k |c_pk| − 1 ) + (1/n) Σ_{q=1}^{n} ( Σ_{p=1}^{m} |c_pq| / max_k |c_kq| − 1 ),    (5.4)

where C = WA = {c_pq} is the combined mixing-separating matrix. Figure 2 plots the curves of the performance indexes averaged over 500 independent runs (both the mixing matrix A and the noise source n(t) are randomly generated in each run).

Figure 2: Average performance indexes over 500 independent runs of the natural gradient algorithm 4.4, the orthogonal natural gradient algorithm 4.11 with 4.12, and the proposed algorithm 4.11 with 4.13.

Clearly, both the orthogonal natural gradient algorithm and the proposed algorithm 4.11 can perform BSS steadily,⁶ while algorithm 4.4 would diverge for the reason presented in the previous section.

Figure 3: Average performance indexes over 500 independent runs of the proposed algorithm 4.11 with 4.16 and 4.11 with 4.17.

In the second experiment, we apply the method described by Theorem 2 with K = m to estimate the unknown source number, and we simulate algorithm 4.11 with 4.16 and with 4.17 at the same time. Since the desired source signals appear at the first n channels, we use the cross-talking error 5.4 with C = W₁A, where W₁ is the submatrix composed of the first n rows of W, to measure the performance. Another 500 independent runs are executed. Figure 3 shows the average results, and Figures 4 through 6 plot, respectively, the performance indexes and the waveforms of the recovered source signals in one run. From these figures, it can be seen that besides the n source signals separated at the first n channels, the proposed algorithm 4.11 can make the rest of the m − n redundant signals be copies of outputs of the first m − n

⁶ The orthogonal natural gradient algorithm makes the redundant components, as well as some of the separated source signals, very small in scale; hence, its steady-state cross-talking error is somewhat less than those of the original natural gradient algorithm (before divergence) and the proposed one, but this does not necessarily mean a higher recovery quality of the separated source signals.
Figure 4: Performance indexes of the proposed algorithm 4.11 with 4.16 and 4.11 with 4.17 in one run.
channels by using algorithm 4.16, while making them all copies of the first component by using algorithm 4.17, which agrees with what we expect.

6 Conclusions

This article studies the general BSS problem where the number m of sensors is not fewer than the number n of sources. First, it is shown that the mutual information of the output components is a cost function for BSS and that it reaches a local minimum if and only if the m outputs consist of n independent components and m − n redundant components. Therefore, this article presents a rigorous theoretical justification for the empirical findings previously reported by Cichocki et al. (1999). Second, it is illustrated that when the natural gradient algorithm for complete BSS (m = n) is used to deal with the overdetermined case (m > n), it inevitably diverges. As a solution to this problem, we propose a modified learning rule, which includes the standard and orthogonal natural gradient algorithms (Amari et al., 2000) as special cases. Moreover, the proposed rule can provide the desired source signals at specified channels and control what and where the redundant components are if the source number is known or has been estimated in advance.
Figure 5: Waveforms of the recovered source signals by the proposed algorithm 4.11 with 4.16 in one run.
Figure 6: Waveforms of the recovered source signals by the proposed algorithm 4.11 with 4.17 in one run.
Acknowledgments

We are grateful to the anonymous reviewers and Terrence J. Sejnowski for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China under contract 60375004.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Amari, S., Chen, T. P., & Cichocki, A. (1997). Stability analysis of learning algorithms for blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing: Neural network approaches. Proceedings of the IEEE, 86(10), 2026–2048.
Amari, S. (1999). Natural gradient for over- and under-complete bases in ICA. Neural Computation, 11(8), 1875–1883.
Amari, S., Chen, T. P., & Cichocki, A. (2000). Nonholonomic orthogonal learning algorithms for blind source separation. Neural Computation, 12(6), 1463–1484.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Belouchrani, A., Abed-Meraim, K., Cardoso, J. F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. on Signal Processing, 45(2), 434–444.
Cao, J. T., Murata, N., Amari, S., Cichocki, A., & Takeda, T. (2003). A robust approach to independent component analysis of signals with high-level noise measurements. IEEE Trans. on Neural Networks, 14(3), 631–645.
Cao, X. R., & Liu, R. W. (1996). A general approach to blind source separation. IEEE Trans. on Signal Processing, 44(3), 562–571.
Cardoso, J. F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F, Radar and Signal Processing, 140(6), 362–370.
Cardoso, J. F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4(4), 112–114.
Cardoso, J. F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J. F. (2000). On the stability of source separation algorithms. Journal of VLSI Signal Processing, 26(1), 7–14.
Cardoso, J. F., & Laheld, H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3029.
Choi, S., Cichocki, A., & Amari, S. (2000). Flexible independent component analysis. Journal of VLSI Signal Processing, 26(1), 25–39.
Cichocki, A., & Amari, S. (2002). Adaptive signal processing: Learning algorithms and applications. New York: Wiley.
Cichocki, A., Karhunen, J., Kasprzak, W., & Vigario, R. (1999). Neural networks for blind separation with unknown number of sources. Neurocomputing, 24(1), 55–93.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cramer, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45(1), 59–83.
Douglas, S. C. (2003). Blind source separation and independent component analysis: A crossroads of tools and ideas. In S. Amari, A. Cichocki, S. Makino, & N. Murata (Eds.), International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), 4 (pp. 1–10). Nara, Japan.
Feng, D. Z., Zhang, X. D., & Bao, Z. (2003). An efficient multistage decomposition approach for independent components. Signal Processing, 83(1), 181–197.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Igual, J., Vergara, L., Camacho, A., & Miralles, R. (2003). Independent component analysis with prior information about the mixing matrix. Neurocomputing, 50, 419–438.
Jutten, C., & Hérault, J. (1991). Blind separation of sources: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Karhunen, J., Oja, E., Wang, L., Vigario, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks, 8(3), 486–504.
Karhunen, J., Pajunen, P., & Oja, E. (1998).
The nonlinear PCA criterion in blind source separation: Relations with other approaches. Neurocomputing, 22(1), 5–20.
Lee, T. W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 409–433.
Lewicki, M., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Li, Y. Q., & Wang, J. (2002). Sequential blind extraction of instantaneously mixed sources. IEEE Trans. on Signal Processing, 50(5), 997–1006.
Lindgren, U. A., & Broman, H. (1998). Source separation using a criterion based on second-order statistics. IEEE Trans. on Signal Processing, 46(7), 1837–1850.
Lu, W., & Rajapakse, J. C. (2003). Eliminating indeterminacy in ICA. Neurocomputing, 50, 271–290.
Moreau, E., & Macchi, O. (1996). High-order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Moreau, E., & Thirion-Moreau, N. (1999). Nonsymmetrical contrasts for source separation. IEEE Trans. on Signal Processing, 47(8), 2241–2252.
Oja, E. (1997). The nonlinear PCA learning rule and signal separation: Mathematical analysis. Neurocomputing, 17(1), 25–46.
Pajunen, P., & Karhunen, J. (1998). Least-squares methods for blind source separation based on nonlinear PCA. International Journal of Neural Systems, 8, 601–612.
Pham, D. T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7), 1712–1725.
Pham, D. T., & Cardoso, J. F. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Trans. on Signal Processing, 49(9), 1837–1848.
Sanchez A., V. D. (2002). Frontiers of research in BSS/ICA. Neurocomputing, 49(1), 7–23.
Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7), 1649–1681.
Sugiyama, M., & Ogawa, H. (2001). Subspace information criterion for model selection. Neural Computation, 13(8), 1863–1889.
Thawonmas, R., Cichocki, A., & Amari, S. (1998). A cascade neural network for blind extraction without spurious equilibria. IEICE Trans. on Fundamentals of Electronics, Communications, and Computer Science, E81-A(9), 1833–1846.
Tong, L., Liu, R., Soon, V. C., & Huang, Y. F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 38(5), 499–509.
Yang, H. H., & Amari, S. (1997). Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(5), 1457–1482.
Zhang, L. Q., Cichocki, A., & Amari, S. (1999). Natural gradient algorithm for blind separation of over-determined mixtures with additive noise.
IEEE Signal Processing Letters, 6(11), 293–295.
Zhu, X. L., & Zhang, X. D. (2002). Adaptive RLS algorithm for blind source separation using a natural gradient. IEEE Signal Processing Letters, 9(12), 432–435.

Received July 16, 2003; accepted January 30, 2004.
LETTER
Communicated by Maneesh Sahani
Unsupervised Spike Detection and Sorting with Wavelets and Superparamagnetic Clustering R. Quian Quiroga [email protected]
Z. Nadasdy [email protected] Division of Biology, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Y. Ben-Shaul [email protected] ICNC, Hebrew University, Jerusalem, Israel
This study introduces a new method for detecting and sorting spikes from multiunit recordings. The method combines the wavelet transform, which localizes distinctive spike features, with superparamagnetic clustering, which allows automatic classification of the data without assumptions such as low variance or gaussian distributions. Moreover, an improved method for setting amplitude thresholds for spike detection is proposed. We describe several criteria for implementation that render the algorithm unsupervised and fast. The algorithm is compared to other conventional methods using several simulated data sets whose characteristics closely resemble those of in vivo recordings. For these data sets, we found that the proposed algorithm outperformed conventional methods.
1 Introduction

Many questions in neuroscience depend on the analysis of neuronal spiking activity recorded under various behavioral conditions. For this reason, data acquired simultaneously from multiple neurons are invaluable for elucidating principles of neural information processing. Recent advances in commercially available acquisition systems allow recordings of up to hundreds of channels simultaneously, and the reliability of these data critically depends on accurately identifying the activity of individual neurons. However, the development of efficient and reliable computational methods for classifying multiunit data, that is, of spike sorting algorithms, lags behind the capabilities afforded by current hardware. In practice, supervised spike sorting of a large number of channels is highly time-consuming and nearly impossible to perform during the course of an experiment.

Neural Computation 16, 1661–1687 (2004) © 2004 Massachusetts Institute of Technology
1662
R. Quiroga, Z. Nadasdy, and Y. Ben-Shaul
The basic algorithmic steps of spike classification are as follows: (1) spike detection, (2) extraction of distinctive features from the spike shapes, and (3) clustering of the spikes by these features. Spike sorting methods are typically based on clustering predefined spike shape features such as peak-to-peak amplitude, width, or principal components (Abeles & Goldstein, 1977; Lewicki, 1998). Nevertheless, it is impossible to know beforehand which of these features is optimal for discriminating between spike classes in a given data set. In the specific case where the spike features are projections onto the first few principal components, the planes onto which the spikes are projected maximize the variance of the data but do not necessarily provide an optimal separation between the clusters. A second critical issue is that even when optimal features from a given data set are used for classification, the distribution of the data imposes additional constraints on the clustering procedure. In particular, violation of normality in a given feature's distribution compromises most unsupervised clustering algorithms, and therefore manual clustering of the data is usually preferred. However, besides being a very time-consuming task, manual clustering introduces errors due to both the limited dimensionality of the cluster-cutting space and human biases (Harris, Henze, Csicsvari, Hirase, & Buzsáki, 2000). An alternative approach is to define spike classes by a set of manually selected thresholds (window discriminators) or with spike templates. Although this is computationally very efficient and can be implemented on-line, it is reliable only when the signal-to-noise ratio is very high, and it is limited to the number of channels a human operator is able to supervise.

In this article, we introduce a new method that improves spike separation in the feature space and implements a novel unsupervised clustering algorithm.
Combining these two features results in an unsupervised spike sorting system. The cornerstones of our method are the wavelet transform, which is a time-frequency decomposition of the signal with optimal resolution in both the time and the frequency domains, and superparamagnetic clustering (SPC; Blatt, Wiseman, & Domany, 1996), a relatively new clustering procedure developed in the context of statistical mechanics. The complete algorithm encompasses three principal stages: (1) spike detection, (2) selection of spike features, and (3) clustering of the selected spike features. In the first step, spikes are detected with an automatic amplitude threshold applied to the high-pass filtered data. In the second step, a small set of wavelet coefficients from each spike is chosen as input for the clustering algorithm. Finally, the SPC algorithm classifies the spikes according to the selected set of wavelet coefficients. We stress that the entire process of detection, feature extraction, and clustering is performed without supervision and relatively quickly.

In this study, we compare the performance of the algorithm with other methods using simulated data that closely resemble real recordings. The rationale for using simulated data is to obtain an objective measure of performance, since the simulation sets the identity of the spikes.
Unsupervised Spike Detection and Sorting
1663
2 Theoretical Background

2.1 Wavelet Transform. The wavelet transform (WT) is a time-frequency representation of the signal that has two main advantages over conventional methods: it provides an optimal resolution in both the time and the frequency domains, and it eliminates the requirement of signal stationarity. It is defined as the convolution between the signal x(t) and the wavelet functions ψ_{a,b}(t),

W_ψ X(a, b) = ⟨x(t) | ψ_{a,b}(t)⟩,    (2.1)

where ψ_{a,b}(t) are dilated (contracted) and shifted versions of a unique wavelet function ψ(t),

ψ_{a,b}(t) = |a|^{−1/2} ψ( (t − b) / a ),    (2.2)
where a and b are the scale and translation parameters, respectively. Equation 2.1 can be inverted, thus providing the reconstruction of x(t).

The WT maps a signal represented by the single independent variable t onto a function of two independent variables a, b. This procedure is redundant and inefficient for algorithmic implementations; therefore, the WT is usually defined at discrete scales a and discrete times b by choosing the set of parameters {a_j = 2^{−j}; b_{j,k} = 2^{−j} k}, with integers j and k. Contracted versions of the wavelet function match the high-frequency components, while dilated versions match the low-frequency components. Then, by correlating the original signal with wavelet functions of different sizes, we can obtain details of the signal at several scales. These correlations with the different wavelet functions can be arranged in a hierarchical scheme called multiresolution decomposition (Mallat, 1989). The multiresolution decomposition algorithm separates the signal into details at different scales and a coarser representation of the signal named the "approximation" (for details, see Mallat, 1989; Chui, 1992; Samar, Swartz, & Raghveer, 1995; Quian Quiroga, Sakowicz, Basar, & Schürmann, 2001; Quian Quiroga & Garcia, 2003). In this study, we implemented a four-level decomposition using Haar wavelets, which are rescaled square functions. Haar wavelets were chosen due to their compact support and orthogonality, which allows the discriminative features of the spikes to be expressed with a few wavelet coefficients and without a priori assumptions about the spike shapes.

2.2 Superparamagnetic Clustering. The following is a brief description of the key ideas of superparamagnetic clustering (SPC), which is based on simulated interactions between each data point and its K-nearest neighbors (for details, see Blatt et al., 1996; Blatt, Wiseman, & Domany, 1997). The method is implemented as a Monte Carlo iteration of a Potts model. The
Potts model is a generalization of the Ising model where, instead of having spins with values ±1/2, there are q different states per particle (Binder & Heermann, 1988).

The first step is to represent the m selected features of each spike i by a point x_i in an m-dimensional phase space. The interaction strength between points is then defined as

J_ij = (1/K) exp( −‖x_i − x_j‖² / (2a²) )  if x_i is a nearest neighbor of x_j, and J_ij = 0 otherwise,    (2.3)

where a is the average nearest-neighbor distance and K is the number of nearest neighbors. Note that the strength of interaction J_ij between nearest-neighbor spikes falls off exponentially with the increasing squared Euclidean distance d_ij = ‖x_i − x_j‖², which corresponds to the similarity of the selected features (i.e., similar spikes will have a strong interaction).

In the second step, an initial random state s from 1 to q is assigned to each point x_i. Then N Monte Carlo iterations are run for different temperatures T using the Wolff algorithm (Wolff, 1989; Binder & Heermann, 1988). Blatt et al. (1997) used a Swendsen-Wang algorithm instead, but its implementation and performance are both very similar. The advantage of both algorithms over simpler approaches such as the Metropolis algorithm is their enhanced performance in the superparamagnetic regime (see Binder & Heermann, 1988; Blatt et al., 1997, for details). The main idea of the Wolff algorithm is that, given an initial configuration of states s, a point x_i is randomly selected and its state s changed to a new state s_new, randomly chosen between 1 and q. The probability that the nearest neighbors of x_i will also change their state to s_new is given by

p_ij = 1 − exp( −(J_ij / T) δ_{s_i, s_j} ),    (2.4)

where T is the temperature (see below). Note that only those nearest neighbors of x_i that were in the same previous state s are candidates to change their values to s_new. Neighbors that change their values create a "frontier" and cannot change their value again during the same iteration. Points that do not change their value in a first attempt can do so if revisited during the same iteration. Then, for each point of the frontier, we apply equation 2.4 again to calculate the probability that their respective neighbors change state to s_new. The frontier is updated, and the update is repeated until the frontier does not change any more. At that stage, we start the procedure again from another point and repeat the process several times in order to obtain representative statistics. Points that are relatively close together (i.e., belonging to a given cluster) will change their state together. This observation can be quantified by measuring the point-point correlation ⟨δ_{s_i,s_j}⟩ and defining x_i, x_j to be members of the same cluster if ⟨δ_{s_i,s_j}⟩ ≥ θ, for a given threshold θ.
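Equations 2.3 and 2.4 can be sketched in a few lines of NumPy (a simplified illustration, not the authors' code; symmetrizing J so that two points interact when either is a neighbor of the other is our choice of convention):

```python
import numpy as np

def interaction_strengths(X, K):
    """Nearest-neighbor interactions of equation 2.3.
    X: (n_points, m) array of selected features; K: number of neighbors."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    nn = np.argsort(d2, axis=1)[:, 1:K + 1]    # K nearest neighbors (self excluded)
    a = np.sqrt(d2[np.arange(n)[:, None], nn]).mean()  # average NN distance
    J = np.zeros((n, n))
    for i in range(n):
        J[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2.0 * a * a)) / K
    return np.maximum(J, J.T)  # symmetrize: interact if either point is a neighbor

def flip_probability(J_ij, s_i, s_j, T):
    """Equation 2.4: probability that a neighbor in the same state joins
    the newly drawn state during one Wolff update."""
    return 1.0 - np.exp(-J_ij / T) if s_i == s_j else 0.0
```

A full SPC run would repeat the Wolff frontier-growing step N times per temperature and then threshold the point-point correlations, as described above.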
As in Blatt et al. (1996), we used q = 20 states, K = 11 nearest neighbors, N = 500 iterations, and θ = 0.5. It has indeed been shown that the clustering results depend mainly on the temperature and are robust to small changes in the other parameters (Blatt et al., 1996).

Let us now discuss the role of the temperature T. Note from equation 2.4 that high temperatures correspond to a low probability of changing the state of neighboring points together, whereas low temperatures correspond to a higher probability, regardless of how weak the interaction J_ij is. This has a physical analogy with a spin glass, in which, at a relatively high temperature, all the spins switch randomly, regardless of their interactions (paramagnetic phase). At a low temperature, the entire spin glass changes its state together (ferromagnetic phase). However, at a certain medium range of temperatures, the system reaches a "superparamagnetic" phase in which only those spins that are grouped together change their state simultaneously. Regarding our clustering problem: at low temperatures, all points change their state together and are therefore considered a single cluster; at high temperatures, many points change their state independently of one another, thus partitioning the data into several clusters with only a few points each; and at temperatures corresponding to the superparamagnetic phase, only those points that are grouped together change their state simultaneously.

Figure 1A shows a two-dimensional (2D) example in which 2400 2D points were distributed in three different clusters. Note that the clusters partially overlap, they have a large variance, and, moreover, their centers fall outside the clusters. In particular, the distance between arbitrarily chosen points of the same cluster can be much larger than the distance between points in different clusters. These features render the use of conventional clustering algorithms unreliable.
The different markers represent the outcome after clustering with SPC. Clearly, most of the points were correctly classified. In fact, only 102 of the 2400 (4%) data points were not classified, because they were near the boundaries of the clusters. Figure 1B shows the number of elements assigned to each cluster as a function of the temperature. At low temperatures, we have a single cluster containing all 2400 data points. At a temperature between 0.04 and 0.05, this cluster breaks down into three subclusters, corresponding to the superparamagnetic transition. The classification shown in the upper plot was performed at T = 0.05. At about T = 0.08, we observe the transition to the paramagnetic phase, where the clusters break down into several groups with a few members each. Note that the algorithm is based on K-nearest-neighbor interactions and therefore does not assume that clusters are nonoverlapping, that they have low variance, or that they have a gaussian distribution.

Figure 1: Example showing the performance of superparamagnetic clustering. (A) The two-dimensional data points used as inputs. The different markers represent the outcome of the clustering algorithm. (B) Cluster size vs. temperature. At temperature 0.05, the transition to the superparamagnetic phase occurs, and the three clusters are separated.

3 Description of the Method

Figure 2 summarizes the three principal stages of the algorithm: (1) spikes are detected automatically via amplitude thresholding; (2) the wavelet transform is calculated for each spike, and the optimal coefficients for separating the spike classes are automatically selected; and (3) the selected wavelet coefficients serve as the input to the SPC algorithm, and clustering is performed after automatic selection of the temperature corresponding to the superparamagnetic phase. (A Matlab implementation of the algorithm can be obtained online from www.vis.caltech.edu/~rodri.)
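The first two stages can be outlined with a toy threshold detector and a hand-rolled Haar decomposition (a simplified stand-in for the paper's four-level multiresolution decomposition; the function names and the fixed event window are ours):

```python
import numpy as np

def detect_spikes(x, thr, width=8):
    """Toy stage 1: indices where |x| crosses thr, one event per window."""
    idx, i = [], 0
    while i < len(x) - width:
        if abs(x[i]) > thr:
            idx.append(i)
            i += width            # skip the remainder of this spike
        else:
            i += 1
    return idx

def haar_coeffs(w):
    """Full Haar decomposition of a length-2^k window (unnormalized)."""
    out, w = [], np.asarray(w, dtype=float)
    while len(w) > 1:
        out.extend(w[::2] - w[1::2])   # detail coefficients at this scale
        w = (w[::2] + w[1::2]) / 2.0   # approximation passed to the next scale
    out.append(w[0])                   # final approximation coefficient
    return np.array(out)
```

In the actual method, each detected spike is a 64-sample window, so a four-level decomposition yields 64 coefficients per spike, which stage 3 then filters and clusters.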
3.1 Spike Detection. Spike detection was performed by amplitude thresholding after bandpass filtering the signal (300–6000 Hz, four-pole
Figure 2: Overview of the automatic clustering procedure. (A) Spikes are detected by setting an amplitude threshold. (B) A set of wavelet coefficients representing the relevant features of the spikes is selected. (C) The SPC algorithm is used to cluster the spikes automatically.
Butterworth filter). The threshold Thr was automatically set to

Thr = 4σₙ,    σₙ = median( |x| / 0.6745 ),    (3.1)
where x is the bandpass-filtered signal and σₙ is an estimate of the standard deviation of the background noise (Donoho & Johnstone, 1994). Note that taking the standard deviation of the signal (including the spikes) could lead to very high threshold values, especially in cases with high firing rates and large spike amplitudes. In contrast, by using the estimate based on the median, the interference of the spikes is diminished (under the reasonable assumption that spikes amount to a small fraction of all samples). To demonstrate this, we generated a segment of 10 s of background noise with unit standard deviation and, in successive simulations, added a distinct spike class at different firing rates. Figure 3 shows that for noise alone (i.e., zero firing rate), both estimates are equal, but as the firing rate increases, the standard deviation of the signal (the conventional estimate) gives an increasingly erroneous estimate of the noise level, whereas the improved estimate from equation 3.1 remains close to the real value.

For each detected spike, 64 samples (i.e., ~2.5 ms) were saved for further analysis. All spikes were aligned to their maximum at data point 20. In order to avoid spike misalignments due to low sampling, spike maxima were determined from interpolated waveforms of 256 samples, using cubic splines.

3.2 Selection of Wavelet Coefficients. After spikes are detected, their wavelet transform is calculated, thus obtaining 64 wavelet coefficients for each spike. We implemented a four-level multiresolution decomposition using Haar wavelets. As explained in section 2.1, each wavelet coefficient characterizes the spike shapes at different scales and times. The goal is to select a few coefficients that best separate the different spike classes. Clearly, such coefficients should have a multimodal distribution (unless there is only one spike class).
To perform this selection automatically, the Lilliefors modification of a Kolmogorov-Smirnov (KS) test for normality was used (Press, Teukolsky, Vetterling, & Flannery, 1992). Note that we do not rely on any particular distribution of the data; rather, we are interested in deviation from normality as a sign of a multimodal distribution. Given a data set x, the test compares the cumulative distribution function of the data, F(x), with that of a gaussian distribution with the same mean and variance, G(x). Deviation from normality is then quantified by

max( |F(x) − G(x)| ).    (3.2)
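This criterion can be sketched as follows (our helper names; the empirical CDF is compared against a gaussian CDF with matched mean and variance, as in equation 3.2, rather than against the exact Lilliefors tables):

```python
import math
import numpy as np

def ks_deviation(x):
    """Equation 3.2: max |F(x) - G(x)| between the empirical CDF of x and
    a gaussian CDF with the same mean and variance."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    F = np.arange(1, n + 1) / n                      # empirical CDF
    z = (x - x.mean()) / x.std()                     # standardize
    G = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    return float(np.abs(F - G).max())

def select_coefficients(W, n_keep=10):
    """Rank wavelet coefficients by deviation from normality and keep
    the n_keep most multimodal ones.
    W: (n_spikes, n_coeffs) matrix of wavelet coefficients per spike."""
    scores = [ks_deviation(W[:, j]) for j in range(W.shape[1])]
    return np.argsort(scores)[::-1][:n_keep]
```

A strongly bimodal coefficient scores much higher than a gaussian one, which is exactly what makes it useful for separating spike classes.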
In our implementation, the first 10 coefficients with the largest deviation from normality were used. The selected set of wavelet coefficients provides
Figure 3: Estimation of the noise level used for determining the amplitude threshold (noise standard deviation vs. firing rate in Hz, comparing the conventional estimation, the improved estimation, and the real value). Note how the conventional estimation based on the standard deviation of the signal increases with the firing rate, whereas the improved estimation from equation 3.1 remains close to the real value. See the text for details.
a compressed representation of the spike features that serves as the input to the clustering algorithm. Overlapping spikes (i.e., spikes from different neurons appearing quasisimultaneously) introduce outliers in the distribution of the wavelet coefcients that cause deviations from normality in unimodal (as well as multimodal) distributions, thus compromising the use of the KS test as an estimation of multimodality. In order to minimize this effect, for each coefcient we only considered values within §3 standard deviations. 3.3 SPC and Localization of the Superparamagnetic Phase. Once the selected set of wavelet coefcients is chosen, we run the SPC algorithm for a wide range of temperatures spanning the ferromagnetic, superparamag-
1670
R. Quiroga, Z. Nadasdy, and Y. Ben-Shaul
netic, and paramagnetic phases. In order to localize the superparamagnetic phase automatically, a criterion based on the cluster sizes is used. The idea is that in both the paramagnetic and ferromagnetic phases, temperature increases can only lead to the creation of clusters with few members each. Indeed, in the paramagnetic phase (i.e., high temperature), the clusters break down into several small ones, and in the ferromagnetic phase, there are almost no changes when the temperature is increased. In contrast, in the superparamagnetic phase, increasing the temperature creates new clusters with a large number of members. In our implementation, we varied the temperature from 0 to 0.2 in increments of 0.01 and looked for the highest temperature at which a cluster containing more than 60 points appeared (not being present at lower temperatures). Since our simulations were 60 sec long, this means that we considered clusters corresponding to neurons with a mean firing rate of at least 1 Hz. This 1 Hz threshold gave optimal results for all our simulations, but it should be decreased if one considers neurons with lower firing rates. Alternatively, one can consider a fraction of the total number of spikes. If no cluster with a minimum of 60 points was found, we kept the minimum temperature value. Using this criterion, we can automatically select the optimal temperature for cluster assignments, and the whole clustering procedure therefore becomes unsupervised.

4 Data Simulation

Simulated signals were constructed using a database of 594 different average spike shapes compiled from recordings in the neocortex and basal ganglia. To generate background noise, spikes randomly selected from the database were superimposed at random times and amplitudes. This was done for half of the sample times. The rationale was to mimic the background noise of actual recordings, which is generated by the activity of distant neurons.
Next, we superimposed a train of three distinct spike shapes (also preselected from the same database) on the noise signal at random times. The amplitude of the three spike classes was normalized to a peak value of 1. The noise level was determined from its standard deviation, which was set to 0.05, 0.1, 0.15, and 0.2 relative to the amplitude of the spike classes. In one case, since clustering was relatively easy, we also considered noise levels of 0.25, 0.30, 0.35, and 0.4. Spike times and identities were saved for subsequent evaluation of the clustering algorithm. The data were first simulated at a sampling rate of 96,000 Hz, and by using interpolated waveforms of the original spike shapes, we simulated spike times that fall continuously between samples (to machine precision). Finally, the data were downsampled to 24,000 Hz. This procedure was introduced in order to imitate actual recording conditions, in which samples do not necessarily fall on the same features within a spike (i.e., the peak of the signal does not necessarily coincide with a discrete sample).
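The sub-sample placement can be sketched as follows. This is a simplified illustration only: a made-up template is upsampled by a factor of 4 (the 96 kHz grid), shifted by the fractional part of the spike time, and sampled back at the original rate; the paper used cubic-spline interpolation of the real spike shapes rather than the linear interpolation used here:

```python
import numpy as np

def place_spike(signal, template, t):
    """Add one spike at a continuous time t (in samples at 24 kHz) by
    working at 4x the rate, mimicking the 96 kHz -> 24 kHz procedure
    (linear interpolation for brevity)."""
    up, n = 4, template.size
    fine = np.interp(np.arange(n * up) / up, np.arange(n), template)  # 96 kHz
    shift = int(round((t - np.floor(t)) * up))   # sub-sample offset at 96 kHz
    chunk = fine[shift::up][:n - 1]              # back down to 24 kHz
    start = int(np.floor(t))
    signal[start:start + chunk.size] += chunk
    return signal

sig = np.zeros(1000)
tmpl = np.exp(-0.5 * ((np.arange(64) - 20) / 4.0) ** 2)  # made-up spike shape
place_spike(sig, tmpl, 100.37)   # the peak falls between two samples
```

Because the peak lands between samples, the recorded maximum is slightly below the template's true peak, which is exactly the sampling effect the downsampling procedure is meant to reproduce.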
In all simulations, the three distinct spikes had a Poisson distribution of interspike intervals with a mean firing rate of 20 Hz. A 2 ms refractory period between spikes of the same class was introduced. Note that the background noise reproduces the spike shape variability of biological signals (Fee, Mitra, & Kleinfeld, 1996; Pouzat, Mazor, & Laurent, 2002). Moreover, constructing the noise from spikes ensures that it shares a similar power spectrum with the spikes themselves (1/f spectrum). The realistic simulation conditions applied here render the entire procedure of spike sorting more challenging than, for example, assuming a white noise distribution of background activity. Further complications of real recordings (e.g., overlapping spikes, bursting activity, moving electrodes) will be addressed in the next section. Figure 4 shows one of the simulated data sets, with a noise level of 0.1. Figure 4A shows the three spike shapes that were added to the background noise, which is shown in Figure 4B. Figure 4C shows a section of the data in Figure 4B at finer temporal resolution. Note the variance in shape and amplitude between spikes of the same class (identified with a marker of the same gray level) due to the additive background noise. Figure 5 shows another example, with noise level 0.15, in which classification is considerably more difficult than in the first data set. Here, the three spike classes share the same peak amplitudes and very similar widths and shapes. The differences
Figure 4: Simulated data set used for spike sorting. (A) The three template spike shapes. (B) The previous spikes embedded in the background noise. (C) The same data with a magnified timescale. Note the variability of spikes from the same class due to the background noise.
Figure 5: Another simulated data set. (A) The three template spike shapes. (B) The previous spikes embedded in the background noise. (C) The same data with a magnified timescale. Here the spike shapes are more difficult to differentiate. Note in the lower plot that the variability in the spike shapes makes their classification difficult.
between them are relatively small and temporally localized. With the background noise added, it becomes very difficult to identify the three spike classes by eye (see Figure 5C). As with the previous data set, the variability of spikes of the same class is apparent. All the data sets used in this article are available online at www.vis.caltech.edu/~rodri.

5 Results

The method was tested on four generic examples of 60 sec length, each simulated at four different noise levels, as described in the previous section. Since the first example was relatively easy to cluster, in this case we also generated four extra time series with higher noise levels.

5.1 Spike Detection. Figures 4 and 5 show two of the simulated data sets. The horizontal lines drawn in Figures 4B and C and 5B and C are the thresholds for spike detection using equation 3.1. Table 1 summarizes the performance of the detection procedure for all data sets and noise levels. Detection performance for overlapping spikes (i.e., spike pairs within 64 data points) is reported separately (values in brackets). Overlapping
Table 1: Number of Misses and False Positives for the Different Data Sets.

Example (Noise Level)   Number of Spikes   Misses       False Positives
Example 1 (0.05)        3514 (785)         17 (193)     711
          (0.10)        3522 (769)         2 (177)      57
          (0.15)        3477 (784)         145 (215)    14
          (0.20)        3474 (796)         714 (275)    10
Example 2 (0.05)        3410 (791)         0 (174)      0
          (0.10)        3520 (826)         0 (191)      2
          (0.15)        3411 (763)         10 (173)     1
          (0.20)        3526 (811)         376 (256)    5
Example 3 (0.05)        3383 (767)         1 (210)      63
          (0.10)        3448 (810)         0 (191)      10
          (0.15)        3472 (812)         8 (203)      6
          (0.20)        3414 (790)         184 (219)    2
Example 4 (0.05)        3364 (829)         0 (182)      1
          (0.10)        3462 (720)         0 (152)      5
          (0.15)        3440 (809)         3 (186)      4
          (0.20)        3493 (777)         262 (228)    2

Notes: Noise level is given as its standard deviation relative to the peak amplitude of the spikes. All spike classes had a peak value of 1. Values in brackets are for overlapping spikes.
spikes hamper the detection performance because they are detected as single events when they appear too close in time. In comparison with the other examples, a relatively large number of spikes were not detected in data set 1 for the highest noise levels (0.15 and 0.2). This is due to the spike class with opposite polarity (class 2 in Figure 4). In fact, setting up an additional negative threshold reduced the number of misses from 145 to 5 for noise level 0.15 and from 714 to 178 for noise level 0.2. For the overlapping spikes, the reduction is from 360 to 52 and from 989 to 134, respectively. In all other cases, the number of undetected spikes was relatively low. With the exception of the first two noise levels in example 1 and the first noise level in example 3, the number of false positives was very small (less than 1%). Lowering the threshold value in equation 3.1 (e.g., to 3.5σ_n) would indeed reduce the number of misses but also increase the number of false positives. The optimal trade-off between the number of misses and false positives depends on the experimenter's preference, but we remark that the automatic threshold of equation 3.1 gives an appropriate value across the different noise levels. In the case of example 1 (noise levels 0.05 and 0.1) and example 3 (noise level 0.05), the large number of false positives is due exclusively to double detections. Since the noise level is very low in these cases, the threshold is also low, and consequently, the second positive peak of the class 3 spike shown in Figure 4 is detected. One solution would be to take a higher threshold value (e.g., 4.5σ_n), but this would not be optimal for high
noise levels. Although double detections decrease the performance of the detection algorithm, they do not present a problem for the spike sorting procedure as a whole. In practice, the false positives show up as an additional cluster that can be disregarded later. For further testing of the clustering algorithm, the complete data set of simulated spikes (both the detected and the undetected ones) will be used.

5.2 Feature Extraction. Figure 6 shows the wavelet coefficients for the spikes of the data sets shown in Figures 4 and 5. For clarity, wavelet coefficients of overlapping spikes are not plotted. Coefficients corresponding to individual spikes are superimposed, each representing how closely the spike waveform matches the wavelet function at a particular scale and time. Coefficients are organized in detail levels (D1–D4) and a last approximation (A4), which correspond to the different frequency bands into which the spike shapes are decomposed. Especially in Figure 6A, we observe that some of the coefficients cluster around different values for the different spike classes, thus being well suited for classification. Most of these coefficients are chosen by the KS test, as shown with black markers. For comparison, the 10 coefficients with maximum variance are also marked. It is clear from this figure that the coefficients showing the best discrimination are not necessarily the ones with the largest variance. In particular, the maximum variance criterion misses several coefficients from the high-frequency scales (D1–D2) that allow a good separation between the different spike shapes. Figure 7 shows the distribution of the 10 best wavelet coefficients from Figure 6B (in this case, including coefficients corresponding to overlapping spikes) selected using the KS criterion versus the maximum variance criterion.
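The difference between the two selection criteria can be demonstrated on synthetic coefficients (the two distributions below are made up for illustration; the statistic is the max |F − G| deviation of equation 3.2):

```python
import numpy as np
from math import erf, sqrt

def normality_deviation(x):
    """max |F(x) - G(x)| between the empirical CDF and a gaussian CDF
    with the same mean and variance (the statistic of equation 3.2)."""
    x = np.sort(np.asarray(x, dtype=float))
    g = 0.5 * (1 + np.array([erf((v - x.mean()) / (x.std() * sqrt(2)))
                             for v in x]))
    return np.abs(np.arange(1, x.size + 1) / x.size - g).max()

rng = np.random.default_rng(2)
# coefficient A: bimodal (two spike classes) but with small variance
coef_a = np.concatenate([rng.normal(-0.5, 0.1, 500), rng.normal(0.5, 0.1, 500)])
# coefficient B: broad unimodal spread, large variance, useless for sorting
coef_b = rng.uniform(-3.0, 3.0, 1000)

# the variance criterion ranks B first; the KS criterion ranks A first
print(coef_b.var() > coef_a.var(),
      normality_deviation(coef_a) > normality_deviation(coef_b))
```

This is the same effect visible in Figures 6 and 7: large variance by itself says nothing about whether a coefficient separates the spike classes.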
Three of the ten wavelet coefficients selected using the KS criterion show a deviation from normality that is not associated with a multimodal distribution: coefficient 42, showing a skewed distribution, and coefficients 19 and 10, which, in addition to skewed distributions, have significant kurtosis, mainly due to the outliers introduced by the overlapping spikes. In the remaining cases, the KS criterion selected coefficients with a multimodal distribution. In contrast, with the exception of coefficient 20, the variance criterion selects coefficients with a uniform distribution that hampers classification. For the same data, Figure 8 shows the best three-dimensional (3D) projections of the wavelet coefficients selected with the KS criterion (Figure 8A) and the variance criterion (Figure 8B), and the projections onto the first three principal components (Figure 8C). In all cases, the clustering was done automatically with SPC and is represented with different gray levels. We observe that using the KS criterion, it is possible to clearly identify the three clusters. In contrast, when choosing the coefficients with the largest variance, it is possible to identify only two of the three clusters, and when using the first three principal components, only a single cluster is detected (the number of classification errors is shown in Table 2, example 2, noise 0.15). Note also that the cluster shapes can be quite elongated, thus challenging any cluster-
Figure 6: Wavelet transform of the spikes from Figure 4 and Figure 5 (panels A and B, respectively). Each curve represents the wavelet coefficients for a given spike, with gray levels denoting the spike class after clustering with SPC. (A) Several wavelet coefficients are sensitive to localized features. (B) Separation is much more difficult due to the similarity of the spike shapes. The markers show coefficients selected based on the variance criterion and coefficients selected based on deviation from normality. D1–D4 are the detail levels, and A4 corresponds to the last approximation level.
[Figure 7 shows histograms of the selected wavelet coefficients, one panel per coefficient, arranged as KS criterion versus variance criterion.]

Figure 7: Distribution of the wavelet coefficients corresponding to Figure 6B. (A) The coefficients selected with the Kolmogorov-Smirnov criterion. (B) The coefficients selected with a maximum variance criterion. The coefficient number is at the left of each panel. Note that the first criterion is more appropriate, as it selects coefficients with multimodal distributions.
Figure 8: Best projections of the wavelet coefficients selected with (A) the KS criterion and (B) the variance criterion. (C) The projection onto the first three principal components. Note that only with the wavelet coefficients selected using the KS criterion is it possible to separate the three clusters. Cluster assignment (shown with different gray levels) was done after use of SPC.
Table 2: Number of Classification Errors for All Examples and Noise Levels Obtained Using SPC, K-Means, and Different Spike Features.

Example           Number     SPC                                               K-means
(Noise Level)     of Spikes  Wavelets  PCA       Spike Shape  Feature Set      Wavelets  PCA
Example 1 (0.05)  2729       1         1         0            863              0         0
          (0.10)  2753       5         17        0            833              0         0
          (0.15)  2693       5         19        0            2015 (2)         0         0
          (0.20)  2678       12        130       24           614              17        17
          (0.25)  2586       64        911       266          1265 (2)         69        68
          (0.30)  2629       276       1913      838          1699 (1)         177       220
          (0.35)  2702       483       1926 (2)  1424 (2)     1958 (1)         308       515
          (0.40)  2645       741       1738 (1)  1738 (1)     1977 (1)         930       733
Example 2 (0.05)  2619       3         4         2            502              0         0
          (0.10)  2694       10        704       59           1893 (1)         2         53
          (0.15)  2648       45        1732 (1)  1054 (2)     2199 (1)         31        336
          (0.20)  2715       306       1791 (1)  2253 (1)     2199 (1)         154       740
Example 3 (0.05)  2616       0         7         3            619              0         1
          (0.10)  2638       41        1781      794          1930 (1)         850       184
          (0.15)  2660       81        1748 (1)  2131 (1)     2150 (1)         859       848
          (0.20)  2624       651       1711 (1)  2449 (1)     2185 (1)         874       1170
Example 4 (0.05)  2535       1         1310      24           1809 (1)         686       212
          (0.10)  2742       8         946 (2)   970 (2)      1987 (1)         271       579
          (0.15)  2631       443       1716 (2)  1709 (1)     2259 (1)         546       746
          (0.20)  2716       1462 (2)  1732 (1)  1732 (1)     1867 (1)         872       1004
Average           2662       232       1092      873          1641             332       371

Notes: In parentheses are the number of correct clusters detected when different from 3. The numbers corresponding to the example shown in Figure 8 are those for example 2 at noise level 0.15 (underlined in the original table).
ing procedure based on Euclidean distances to the cluster centers, such as K-means.

5.3 Clustering of the Spike Features. Figure 9 shows the performance of the algorithm for the first data set (shown in Figure 4). In Figure 9A, we plot the cluster sizes as a function of the temperature. At a temperature T = 0.02, the transition to the superparamagnetic phase occurs. As the temperature is increased further, a transition to the paramagnetic regime takes place at T = 0.12. The temperature T = 0.02 (vertical dotted line) is selected for clustering based on the criterion described in section 3.3. Figure 9B shows the classification after clustering, and Figure 9C the original spike shapes (without the noise). In this case, the spike shapes are easy to differentiate due to the large negative phase of the class 1 spikes and the initial negative peak of the class 3 spikes. Figure 10 shows the other three data sets at a noise level of 0.1. In all these cases, the classification errors were very low (see Table 2). Note also that many overlapping spikes were correctly classified (those in color), especially when the latency between the spike peaks was larger than about 0.5 ms. Pairs of spikes appearing with a smaller time separation are not clustered
Figure 9: (A) Cluster size vs. temperature. Based on the stability criterion of the clusters, a temperature of 0.02 was automatically chosen for separating the three spike classes. (B) All spikes with gray levels according to the outcome of the clustering algorithm. Note the presence of overlapping spikes. (C) Original spike shapes.
by the algorithm (shown in gray) but can, in principle, be identified in a second stage by using the clustered spikes as templates. One should then look for the combination of templates (allowing delays between them) that best reproduces the nonclustered spike shapes. This procedure is outside the scope of this study and will not be further addressed.
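Such a second stage might look like the following brute-force sketch. This is a hypothetical helper, not part of the presented method: it scans ordered template pairs and relative delays for the least-squares best superposition:

```python
import numpy as np

def explain_overlap(waveform, templates, max_lag=24):
    """Search ordered template pairs and relative delays for the
    superposition that best reproduces an unclustered waveform.
    Returns (class_i, class_j, lag) minimizing the squared residual."""
    n = waveform.size
    best, best_err = (None, None, 0), np.inf
    for i, ti in enumerate(templates):
        for j, tj in enumerate(templates):
            for lag in range(max_lag + 1):
                model = ti.copy()
                model[lag:] += tj[:n - lag]     # template j delayed by lag
                err = np.sum((waveform - model) ** 2)
                if err < best_err:
                    best_err, best = err, (i, j, lag)
    return best

# synthetic overlap: made-up class-0 template plus class-1 delayed 10 samples
t0 = np.exp(-0.5 * ((np.arange(64) - 20) / 3.0) ** 2)
t1 = -0.8 * np.exp(-0.5 * ((np.arange(64) - 20) / 6.0) ** 2)
mixed = t0.copy()
mixed[10:] += t1[:54]
print(explain_overlap(mixed, [t0, t1]))   # recovers the pair and the delay
```

In practice one would also fit per-spike amplitudes and restrict the search to the clusters' templates, but the exhaustive pair-and-lag scan captures the idea outlined in the text.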
Figure 10: Outcome of the clustering algorithm for the three remaining examples. The inset plots show the original spike shapes. Most of the classification errors (gray traces) are due to overlapping spikes with short temporal separation.
5.4 Comparison with Other Spike Features. Errors of spike identification derive cumulatively from two sources: incorrect feature extraction and incorrect clustering. First, we compared the discrimination power of wavelets at different noise levels with that of other feature extraction methods, using the same clustering algorithm, SPC. Specifically, we compared the outcome of classification with wavelets, principal component analysis (using the first three principal components), the whole spike shape, and a fixed set of spike features. The spike features were the mean square value of the signal (energy), the amplitude of the main peak, and the amplitudes of the peaks preceding and following it. The width of the spike (given by the position of the three peaks) was also tested but found to worsen the spike classification. Table 2 summarizes the results (examples 1 and 2 were shown in Figures 4 and 5, respectively). Performance was quantified in terms of the number of classification errors and the number of clusters detected. Values in brackets denote the number of clusters detected, when different from 3. Errors due to overlapping spikes are not included in the table (but overlapping spikes were also inputs to the algorithm). Since example 1 was easier to cluster, in this case we also analyzed four higher noise levels. In general, the best performance was achieved using the selected wavelet coefficients as spike features. In fact, with wavelets, all three clusters are correctly detected, with the exception of example 4 at noise level 0.2. With the other features, only one cluster is detected as the noise level increases. Considering all wavelet coefficients as inputs to SPC (i.e., without the selection based on the KS test) gave nearly the same results as those obtained using the entire spike shape (not shown). This is not surprising, since the wavelet transform is linear and therefore implies only rotations or rescalings of the original high-dimensional space.
The clear benefit comes from considering only those coefficients that allow a good separation between the clusters, selected using the KS test. Note that PCA features gave slightly worse results than using the entire spike shapes (and clearly worse than using wavelets). Therefore, PCA gives a reasonable solution when using clustering algorithms that cannot handle high-dimensional spaces (although this is not a problem for SPC). The fixed features, such as peak amplitudes and mean square values, were much less efficient. This was expected, since the spike classes were generated with the same peak value and in some cases with very similar shapes that could not be discriminated by these features. We remark that the number of detected clusters decreases monotonically with noise level, but the number of classification errors does not necessarily increase monotonically, because even when three clusters are correctly recognized, a large number of spikes may remain unassigned to any of them (see, e.g., example 4 with PCA for noise levels 0.05 and 0.1).

5.5 Comparison with Other Clustering Algorithms. A number of different clustering algorithms can be applied to spike sorting, and the choice of an optimal one is important in order to exploit the discrimination power
of the feature space. The most used algorithms are supervised and usually assume gaussian cluster shapes and specific properties of the noise distribution. In order to illustrate the difference from these methods, we compare the results obtained using SPC with those obtained using K-means. The partitioning of data by K-means keeps objects of the same cluster as close as possible (using Euclidean distance in our case) and as far as possible from objects in the other clusters. The standard K-means algorithm leaves no unclassified items, but the total number of clusters must be predefined (it is therefore a supervised method). These constraints simplify the clustering problem and give an advantage to K-means in comparison to SPC, since in the former case we know that each object should be assigned to one of the three clusters. The rightmost two columns of Table 2 show the clustering performance using K-means with wavelets and with PCA. Despite being unsupervised, SPC applied to the wavelet features gives the best performance. For spike shapes that are relatively easy to differentiate (Table 2, examples 1 and 2), the outcomes with wavelets are similar using K-means or SPC. However, the advantage of SPC with wavelets becomes apparent when the spike shapes are more similar (Table 2, examples 3 and 4). We remark that with SPC, points may remain unclassified, whereas with K-means, all points are assigned to one of the three clusters (thus having at least a 33% chance of being correctly classified). This led K-means to outperform SPC for example 4 at noise level 0.2, where only two out of three clusters were identified with SPC. In general, the number of classification errors using PCA with K-means is higher than using wavelets with K-means. The few exceptions where PCA outperformed wavelets with K-means (example 3 at noise level 0.1; example 4 at noise level 0.05) can be attributed to the more elongated cluster shapes obtained with wavelets, which K-means fails to separate.
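The failure mode on elongated clusters is easy to reproduce with a hand-rolled K-means (the data below are synthetic, not the spike features of Table 2): two long, parallel clusters separated only along one dimension get split lengthwise, so each K-means cluster mixes both true classes.

```python
import numpy as np

def kmeans(points, centroids, iters=50):
    """Plain K-means with Euclidean distance (the variant compared
    against SPC in the text)."""
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels

# two elongated, parallel clusters: long in x, separated only in y
x = np.linspace(0.0, 10.0, 50)
pts = np.vstack([np.column_stack([x, np.zeros(50)]),    # true class 0
                 np.column_stack([x, np.ones(50)])])    # true class 1
labels = kmeans(pts, centroids=np.array([[2.5, 0.5], [7.5, 0.5]]))

# K-means splits the data lengthwise (by x): corresponding points of
# the two true classes land in the same K-means cluster
print(np.array_equal(labels[:50], labels[50:]))
```

Splitting along the long axis reduces the within-cluster squared distance more than separating the two true classes would, which is why a nearest-neighbor-based method such as SPC has an edge on such geometries.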
5.6 Simulations with Nongaussian Spike Distributions. In this section, we consider conditions of real recordings that may compromise the performance of the proposed clustering algorithm. In particular, we simulate the effect of an electrode moving with respect to one of the neurons, bursting activity, and a correlation between the spikes and the local field potential. Clearly, these situations are difficult to handle with algorithms that assume a particular distribution of the noise. In fact, all of these cases add a nongaussian component to the spike shape variance. For simulating a moving electrode, we used the example shown in Figure 5 (example 2, noise 0.15 in Table 2), but in this case, we progressively decreased the amplitude of the first spike class linearly with time, from a value of 1 at the beginning of the recording to 0.3 at the end. Using SPC with wavelets, it was still possible to detect all three clusters, and from a total of 2692 spikes, the number of classification errors was only 48. Second, we simulated a bursting cell, based also on the example shown in Figure 5. The first class consisted of three consecutive spikes with ampli-
tudes 1.0, 0.7, and 0.5, separated by 3 ms on average (SD = 1 ms, range 1–5 ms). From a total of 2360 spikes, again the three clusters were correctly detected, and we had 25 classification errors using wavelets with SPC. Finally, we considered a correlation between the spike amplitudes and the background activity, similar to the condition in which spikes co-occur with local field events. We used the same spike shapes and noise level shown in Figure 5, but the spike amplitudes varied from 0.5 to 1 depending on the amplitude of the background activity at the time of the spike (0.5 when the background activity reached its minimum and 1.0 when it reached its maximum). In this case, again, the three clusters were correctly detected, and we had 439 classification errors from a total of 2706 spikes.

6 Discussion

We presented a method for the detection and sorting of neuronal multiunit activity. The procedure is fully unsupervised and fast, making it particularly interesting for the classification of spikes from a large number of simultaneously recorded channels. To obtain a quantitative measure of its performance, the method was tested on simulated data sets with different noise levels and similar spike shapes. The noise was generated by the superposition of a large number of small-amplitude spikes, resembling the characteristics of real recordings. This makes the spectral characteristics of the noise and the spikes similar, thus increasing the difficulty of detection and clustering. The proposed method had an overall better performance than conventional approaches, such as using PCA for extracting spike features or K-means for clustering. Spike detection was achieved by applying an amplitude threshold to the high-pass filtered data. The threshold value was calculated automatically using the median of the absolute value of the signal.
The advantage of this estimation, rather than using the variance of the overall signal, is that it diminishes the dependence of the threshold on the firing rate and the peak-to-peak amplitude of the spikes, thus giving an improved estimate of the background noise level. Indeed, high firing rates and high spike amplitudes lead to an overestimation of the appropriate threshold value. In terms of the number of misses and the number of false positives, the proposed automatic detection procedure performed well for the different examples and noise levels. The advantage of using the wavelet transform as a feature extractor is that very localized shape differences between the units can be discerned. The information about the shape of the spikes is distributed over several wavelet coefficients, whereas with PCA, most of the information about the spike shapes is captured only by the first three principal components (Letelier & Weber, 2000; Hulata, Segev, Shapira, Benveniste, & Ben-Jacob, 2000; Hulata, Segev, & Ben-Jacob, 2002), which are not necessarily optimal for cluster identification (see Figure 7). Moreover, wavelet coefficients are localized in time. In agreement with these considerations, a better performance of
wavelet coefficients in comparison with PCA was shown for several examples generated with different noise levels. For comparison, we also used the whole spike shape as input to the clustering algorithm. As shown in Table 2, the dimensionality reduction achieved with the KS test clearly improves the clustering performance. Since the wavelet transform is linear, using all the wavelet coefficients yields nearly the same results as taking the entire spike shape (as it is just a rescaling of the space). Since the need for a low-dimensional space is a limiting factor for many clustering algorithms, the dimensionality reduction achieved by combining wavelets with the KS test may be of broad interest. The use of wavelets for spike sorting has been proposed recently by Letelier and Weber (2000) and Hulata et al. (2000, 2002). Our approach differs from theirs in several aspects, most notably in the implementation of our algorithm as a single unsupervised process. One key feature of our algorithm is the choice of wavelet coefficients using a Kolmogorov-Smirnov test of normality, thus selecting features that give an optimal separation between the different clusters. Letelier and Weber (2000) suggested visually selecting those wavelet coefficients with the largest mean, the largest variance, and, most fundamentally, a multimodal distribution. However, neither a large average nor a large variance entitles a given coefficient to be the best separator. In contrast, we considered only the multimodality of the distributions. In this respect, we showed that coefficients in the low-frequency bands, with a large variance and uniform distribution, are not appropriate for the separation of distinct clusters. On the contrary, they introduce dimensions with nonsegregated distributions that in practice may compromise the performance of the clustering algorithm.
A caveat of the KS test as an estimator of multimodality is that it can also select unimodal nongaussian distributions (those that are skewed or have large kurtosis). In fact, this was the case for three of the coefficients shown in Figure 7. Despite this limitation, the selection of wavelet coefficients with the KS test gave optimal results that indeed outperformed the other feature selection methods. The main caveat of PCA is that the eigenvectors accounting for the largest variance of the data are selected, but these directions do not necessarily provide the best separation of the spike classes. In other words, it may well be that the information for separating the clusters is represented in principal components with low eigenvalues, which are usually disregarded. In this respect, our method is more reminiscent of independent component analysis (ICA), where directions with a minimum of mutual information are chosen. Moreover, it has been shown that minimum mutual information is related to maximum deviation from normality (Hyvarinen & Oja, 2000). Hulata et al. (2000, 2002) proposed a criterion based on the mutual information of all pairs of cluster combinations for selecting the best wavelet packet coefficients. However, such an approach cannot be implemented in an unsupervised way. In fact, Hulata and coworkers used previous knowledge of the spike classes to select the best wavelet packets. A second caveat
is the difficulty of estimating mutual information (Quian Quiroga, Kraskov, Kreuz, & Grassberger, 2002) in comparison with the KS test of normality. The second stage of the clustering algorithm is based on the use of superparamagnetic clustering. This method is based on K-nearest-neighbor interactions and therefore does not require low-variance or nonoverlapping clusters, or a priori assumptions of a certain distribution of the data (e.g., gaussian). Superparamagnetic clustering has already been applied with excellent results to several clustering problems (Blatt et al., 1996, 1997; Domany, 1999). In our study, we demonstrated for the first time its application to spike sorting. Moreover, it is possible to automatically locate the superparamagnetic regime, thus making the entire sorting procedure unsupervised. Besides the obvious advantage of unsupervised clustering, we compared the results obtained with SPC to those obtained using K-means (with Euclidean distance). Although this comparison should not be generalized to all existing clustering methods, it exemplifies the advantages of SPC over methods that rely on the presence of gaussian distributions, clusters with centers inside the cluster (see Figure 1 for counterexamples), nonoverlapping clusters with low variance, and so on. The performance of K-means could in principle be improved by using another distance metric. However, this would generally imply assumptions about the noise distribution and its interference with the spike variability. Such assumptions may improve the clustering performance in some situations but may be violated in other conditions of real recordings. Note that this comparison was in principle unfair to SPC, since K-means is a supervised algorithm in which the total number of clusters is given as an extra input. Of course, the total number of clusters is usually not known in real recording situations.
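The automatic localization of the superparamagnetic regime described in section 3.3 can be sketched as follows. This is a simplified reading of the criterion (the cluster-size lists below are made up for illustration): scan the temperature steps and keep the highest one at which a new cluster of more than 60 points emerges, falling back to the lowest temperature otherwise:

```python
def select_temperature(cluster_sizes, min_size=60):
    """Return the index of the highest temperature step at which a new
    cluster with more than min_size members appears; fall back to the
    lowest temperature if none does.

    cluster_sizes: one list of cluster sizes per temperature step,
    ordered from low to high temperature."""
    best = 0
    prev = sum(s > min_size for s in cluster_sizes[0])
    for i in range(1, len(cluster_sizes)):
        big = sum(s > min_size for s in cluster_sizes[i])
        if big > prev:
            best = i  # a new sufficiently large cluster emerged here
        prev = big
    return best

# made-up run: one giant cluster splits into three large ones at the
# third temperature step (the superparamagnetic phase), then shatters
sizes = [[3000], [3000], [1000, 980, 950], [1000, 980, 950],
         [400, 350, 300, 55, 40], [120, 90, 80, 50, 45]]
step = select_temperature(sizes)  # with 0.01 increments, step 2 means T = 0.02
```

With 60-second recordings, the 60-point minimum corresponds to the 1 Hz minimum firing rate mentioned in section 3.3; both numbers would scale with the recording length.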
In general, besides the advantage of being unsupervised, SPC showed better results than the ones obtained with K-means. Overall, the presented results show an optimal performance of the clustering algorithm in situations that resemble real recording conditions. However, we should stress that this clustering method should not be taken as a black box giving the optimal spike sorting. When possible, it is always desirable to conrm the validity of the results based on the shape and variance of the spike shapes, the interspike interval distribution, the presence of a refractory period, and so forth. Finally, we anticipate the generalization of the method to tetrode recordings. Indeed, adding spike features from adjacent channels should improve spike classication and reduce ambiguity.
Acknowledgments We are very thankful to Richard Andersen and Christof Koch for support and advice. We also acknowledge very useful discussions with Noam Shental, Moshe Abeles, Ofer Mazor, Bijan Pesaran, and Gabriel Kreiman. We are in debt to Eytan Domany for providing us the SPC code and to Alon Nevet
1686
R. Quiroga, Z. Nadasdy, and Y. Ben-Shaul
who provided the original spike data for the simulation. This work was supported by the Sloan-Swartz foundation and DARPA. References Abeles, M., & Goldstein, M. (1977). Multispike train analysis. Proc. IEEE, 65, 762–773. Binder, K., & Heermann, D. W. (1988). Monte Carlo simulations in statisticalphysics: An introduction. Berlin: Springer-Verlag. Blatt, M., Wiseman, S., & Domany, E. (1996). Super-paramagnetic clustering of data. Phys. Rev. Lett., 76, 3251–3254. Blatt, M., Wiseman, S., & Domany, E. (1997). Data clustering using a model granular magnet. Neural Computation, 9, 1805–1842. Chui, C. (1992). An introduction to wavelets. San Diego, CA: Academic Press. Domany, E. (1999). Super-paramagnetic clustering of data: The denitive solution of an ill-posed problem. Physica A, 263, 158–169. Donoho, D., & Johnstone, I. M. (1994).Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455. Fee, M. S., Mitra, P. P., & Kleinfeld, D. (1996). Variability of extracellular spike waveforms of cortical neurons. J. Neurophysiol., 76, 3823–3833. Harris, K. D., Henze, D. A., Csicsvari, J., Hirase, H., & Buzs´aki, G. (2000). Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. J. Neurophysiol., 84, 401–414. Hulata, E., Segev, R., & Ben-Jacob, E. (2002). A method for spike sorting and detection based on wavelet packets and Shannonís mutual information. J. Neurosci. Methods, 117, 1–12. Hulata, E., Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2000). Detections and sorting of neural spikes using wavelet packets. Phys. Rev. Lett., 85, 4637–4640. Hyvarinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411–430. Letelier, J. C., & Weber, P. P. (2000). Spike sorting based on discrete wavelet transform coefcients. J. Neurosci. Methods, 101, 93–106. Lewicki, M. (1998). 
A review of methods for spike sorting: The detection and classication of neural action potentials. Network: Comput. Neural Syst., 9, R53–R78. Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Analysis and Machine Intell., 2, 674– 693. Pouzat, C., Mazor, O., & Laurent, G. (2002). Using noise signature to optimize spike-sorting and to assess neuronal classication quality. J. Neurosci. Methods, 122, 43–57. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992).Numerical recipes in C. Cambridge: Cambridge University Press. Quian Quiroga, R., & Garcia, H. (2003). Single-trial event-related potentials with wavelet denoising. Clin. Neurophysiol., 114, 376–390.
Unsupervised Spike Detection and Sorting
1687
Quian Quiroga, R., Kraskov, A., Kreuz, T., & Grassberger, P. (2002). Performance of different synchronization measures in real data: A case study on electroencephalographic signals. Phys. Rev. E, 65, 041903. Quian Quiroga, R., Sakowicz, O., Basar, E., & Schurmann, ¨ M. (2001). Wavelet Transform in the analysis of the frequency composition of evoked potentials. Brain Research Protocols, 8, 16–24. Samar, V. J., Swartz, K. P., & Raghveer, M. R. (1995). Multiresolution analysis of event-related potentials by wavelet decomposition. Brain and Cognition, 27, 398–438. Wolf, U. (1989). Comparison between cluster Monte Carlo algorithms in the Ising spin model. Phys. Lett. B, 228, 379–382. Received December 3, 2002; accepted January 30, 2004.
LETTER
Communicated by John Platt
Decomposition Methods for Linear Support Vector Machines Wei-Chun Kao [email protected]
Kai-Min Chung [email protected]
Chia-Liang Sun [email protected]
Chih-Jen Lin [email protected] Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
In this letter, we show that decomposition methods with alpha seeding are extremely useful for solving a sequence of linear support vector machines (SVMs) with more data than attributes. This strategy is motivated by Keerthi and Lin (2003), who proved that for an SVM with data not linearly separable, after C is large enough, the dual solutions have the same free and bounded components. We explain why a direct use of decomposition methods for linear SVMs is sometimes very slow and then analyze why alpha seeding is much more effective for linear than nonlinear SVMs. We also conduct comparisons with other methods that are efcient for linear SVMs and demonstrate the effectiveness of alpha seeding techniques in model selection.
1 Introduction Solving linear and nonlinear support vector machines (SVM) has been considered two different tasks. For linear SVM without too many attributes in data instances, people have been able to train millions of data (e.g. Mangasarian & Musicant, 2000), but for other types of problems, in particular, nonlinear SVMs, the requirement of huge memory as well as computational time has prohibited us from solving very large problems. Currently, the decomposition method, a specially designed optimization procedure, is one of the main tools for nonlinear SVMs. In this letter, we show the drawbacks of existing decomposition methods, in particular, sequential minimal optimization (SMO)–type algorithms, for linear SVMs. To remedy these drawbacks, using theorem 3 of Keerthi and Lin (2003), we develop effective strategies so that decomposition methods become efcient for solving linear SVMs. c 2004 Massachusetts Institute of Technology Neural Computation 16, 1689–1704 (2004) °
1690
W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin
First, we briey describe linear and nonlinear SVMs. Given training vectors xi 2 Rn ; i D 1; : : : ; l, in two classes, and a vector y 2 Rl such that yi 2 f1; ¡1g, the standard SVM formulation (Cortes & Vapnik, 1995) is as follows: min w;b;»
l X 1 T w wCC »i 2 iD1
subject to yi .wT Á .xi / C b/ ¸ 1 ¡ »i ;
(1.1)
»i ¸ 0; i D 1; : : : ; l:
If Á .x/ D x, usually we say equation 1.1 is the form of a linear SVM. On the other hand, if Á maps x to a higher-dimensional space, equation 1.1 is a nonlinear SVM. For a nonlinear SVM, the number of variables depends on the size of w and can be very large (even innite), so people solve the following dual form: 1 T ® Q® ¡ eT ® 2 subject to yT ® D 0; min ®
(1.2)
0 · ®i · C; i D 1; : : : ; l;
where Q is an l £ l positive semidenite matrix with Qij D yi yj Á.xi /T Á.xj /, e is the vector of all ones, and K.xi ; xj / D Á.xi /T Á.xj / is the kernel function. Equation 1.2 is solvable because its number of variables is the size of the training set, independent of the dimensionality of Á.x/. The primal and dual relation shows wD
l X
(1.3)
®i yi Á.xi /;
iD1
so Á sgn.wT Á.x/ C b/ D sgn
l X iD1
! ®i yi K.xi ; x/ C b
is the decision function. Unfortunately, for a large training set, Q becomes such a huge, dense matrix that traditional optimization methods cannot be directly applied. Currently, some specially designed approaches, such as decomposition methods (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998) and nding the nearest points of two convex hulls (Keerthi, Shevade, Bhattacharyya, & Murthy, 2000), are major ways of solving equation 1.2.
Decomposition Methods for Linear Support Vector Machines
1691
On the other hand, for linear SVMs, if n ¿ l, w is not a huge vector variable, so equation 1.1 can be solved by many regular optimization methods. As at the optimal solution »i D max.0; 1 ¡yi .wT xi Cb//, in a sense we mainly have to nd out w and b. Therefore, if the number of attributes n is small, there are not many main variables w and b in equation 1.1, no matter how large the training set is. Currently, people have been able to train a linear SVM with millions of data (e.g. Mangasarian & Musicant, 2000); but for a nonlinear SVM with many fewer data, we need more computational time as well as more computer memory. Therefore, it is natural to ask whether in SVM software, linear and nonlinear SVMs should be treated differently and solved by two methods. It is also interesting to see how capable nonlinear SVM methods (e.g., decomposition methods) are for linear SVMs. By linear SVMs, we mean those with n < l. If n ¸ l, the dual form, equation 1.2, has fewer variables than w of the primal, a situation similar to nonlinear SVMs. As the rank of Q is less than or (usually) equal to min.n; l/, the linear SVMs we are interested in here are those with low-ranked Q. Recently, in many situations, linear and nonlinear SVMs have been considered together. Some approaches (Lee & Mangasarian, 2001; Fine & Scheinberg, 2001) approximate nonlinear SVMs by different problems, which are in the form of linear SVMs (Lin & Lin, in press; Lin, 2002) with n ¿ l. In addition, for nonlinear SVM model selection with gaussian kernel, Keerthi and Lin (2003) proposed an efcient method, which has to conduct linear SVMs model selection rst (i.e., linear SVMs with different C). Therefore, it is important to discuss optimization methods for linear and nonlinear SVMs at the same time. In this letter, we focus on decomposition methods. In section 2, we show that existing decomposition methods are inefcient for training linear SVMs. 
Section 3 demonstrates theoretically and experimentally that the alpha seeding technique is particularly useful for linear SVMs. Some implementation issues are discussed in section 4. The decomposition method with alpha seeding is compared with existing linear SVM methods in section 5. In section 6, we apply the new implementation to solve a sequence of linear SVMs required for the model selection method in Keerthi and Lin (2003). The discussion and concluding remarks are in section 7. 2 Drawbacks of Decomposition Methods for Linear SVMs with n ¿ l The decomposition method is an iterative procedure. In each iteration, the index set of variables is separated into two sets B and N, where B is the working set. Then in that iteration, variables corresponding to N are xed while a subproblem on variables corresponding to B is minimized. If q is the size of the working set B, in each iteration, only q columns of the Hessian matrix Q are required. They can be calculated and stored in computer memory when needed. Thus, unlike regular optimization methods, which usually require
1692
W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin
access of the whole Q, here, the memory problem is solved. Clearly, decomposition methods are specially designed for nonlinear SVMs. Throughout this article, we use DSVM to refer to the solver of SVM that adopts the decomposition method, such as LIBSVM (Chang & Lin, 2001b) and SVMlight (Joachims, 1998). When the size of its working set is two, we say it is of the SMO type (Platt, 1998). Unlike popular optimization methods such as Newton or quasi-Newton, which enjoy fast convergence, decomposition methods converge slowly, as in each iteration only very few variables are updated. We will show that the situation is even worse when solving linear SVMs. It has been demonstrated (Hsu & Lin, 2002b) by experiments that if C is large and the Hessian matrix Q is not well conditioned, decomposition methods converge very slowly. For linear SVMs, if n ¿ l, then Q is a lowrank and hence ill-conditioned matrix. In Figure 1, we demonstrate a simple example by using the problem heart from the statlog database (Michie, Spiegelhalter, & Taylor, 1994). Each attribute is scaled to [¡1; 1]. We use LIBSVM (Chang & Lin, 2001b) to solve linear and nonlinear (RBF kernel, 2 2 e¡kxi ¡xj k =.2¾ / with 1=.2¾ 2 / D 1=n) SVMs with C D 2¡8 ; 2¡7:5 ; : : : ; 28 and present the number of iterations. Though two different problems are solved (in particular, their Qij ’s are in different ranges), Figure 1 clearly indicates the huge number of iterations for solving the linear SVMs. Note that for linear SVMs, the slope is greater than that for nonlinear SVMs and is very close to one, especially when C is large. This means that a doubled C leads to a doubled number of iterations.
he art_sca le 1e+06
Iterations
100000
10000
1000
100
-8
-6
-4
-2
0
log (C)
2
4
6
8
Figure 1: Number of decomposition iterations for solving SVMs with linear (the thick line) and RBF (the thin line) kernel.
Decomposition Methods for Linear Support Vector Machines
1693
The following theorems that hold only for linear SVMs help us to realize the difculty the decomposition methods suffer from. Theorem 2 can further explain why the number of iterations is nearly doubled when C is doubled. Theorem 1 (Keerthi & Lin, 2003). properties:
The dual linear SVM has the following
² There is C¤ such that for all C ¸ C¤ , there are optimal solutions at the same face. ² In addition, for all C ¸ C¤ , the primal solution w is the same. By the “face” of ®, we mean three types of value of each ®i : (1) lowerbounded, that is, ®i D 0, (2) upper-bounded, that is, ®i D C, and (3) free, that is, 0 < ®i < C. More precisely, the face of ® can be represented by a length l vector whose components are in flower-bounded; upper-bounded; freeg. This theorem indicates that after C ¸ C¤ , exponentially increased numbers of iterations are wasted in order to obtain the same primal solution w. Even if we could detect C¤ and stop training SVMs, for C not far below C¤ , the number of iterations may already be huge. Therefore, it is important to have an efcient linear SVM solver that could handle both large and small C. Next, we try to explain the nearly doubled iterations by the difculty of locating faces of the dual solution ®. Theorem 2. Assume that any two parallel hyperplanes in the feature space do not contain more than n C 1 points of fxi g on them. We have:1 1. For any optimal solution of equation 1.2, it has no more than n C 1 free components. 2. There is C¤ such that after C ¸ C¤ , all optimal solutions of equation 1.2 share at least the same l ¡ n ¡ 1 bounded ® variables. The proof is available in Kao, Chung, Sun, and Lin (2002). This result indicates that when n ¿ l, most components of optimal solutions are at bounds. Furthermore, dual solutions at C and 2C share at least the same l ¡ 2.n ¡ 1/ upper- and lower-bounded components. If upper-bounded ®i at C remains upper-bounded at 2C, a direct use of decomposition methods means that ®i is updated from 0 to C and from 0 to 2C, respectively. Thus, we anticipate that the efforts are roughly doubled. 
We conrm this explanation 1 Note that a pair of parallel hyperplanes is decided by n C 1 numbers (the n number decides one hyperplane in the feature space Rn , and another one decides the other hyperplane parallel to it). So the assumption of theorem 2 would be violated if m linear equations in n C 1 variables, where m > n C 1, have solutions. The occurrence of this scenario is of measure zero. This explains that the assumption of theorem 2 is generic.
1694
W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin heart_scale(linear, C=128)
70
70
60
60
50 40 30
50 40 30
20
20
10
10
0
0
500
1000
1500
Iterationx100
2000
heart_scale(linear, C=256)
80
Error face rate
Error face rate
80
2500
3000
0
0
1000
2000
3000
Iterationx100
4000
5000
6000
Figure 2: The error face rate (i.e., the difference between the current face and the one at the nal solution) for solving linear SVM with C D 128 and C D 256.
by comparing the error face rate (i.e., the difference between the current face and the one at the nal solution) with C D 27 and C D 28 . As shown in Figure 2, two curves are quite similar except that the scale of x-axis differs by twice. This indicates that ® travels similar faces for C D 27 and C D 28 , and the number of iterations spent on each face with C D 28 is roughly doubled. 3 Alpha Seeding for Linear SVMs Theorem 2 implies that for linear SVMs, dual solutions may share many upper- and lower-bounded variables. Therefore, we conjecture that if ® 1 is an optimal solution at C D C1 , then ® 1 C2 =C1 can be a very good initial point for solving equation 1.2 with C D C2 . The reason is that ® 1 C2 =C1 is at the same face as ® 1 , and it is likely to be at a similar face of one optimal solution of C D C2 . This technique, called alpha seeding, was originally proposed for SVM model selection (DeCoste & Wagstaff, 2000) where several problems (see equation 1.2) with different C have to be solved. Earlier work that focuses on nonlinear SVMs mainly uses alpha seeding as a heuristic. For linear SVMs, the speed could be signicantly boosted due to the above analysis. The following theorem further supports the use of alpha seeding: Theorem 3. There are two vectors A, B, and a number C¤ such that for any C ¸ C¤ , AC C B is an optimal solution of equation 1.2. The proof is in Kao et al. (2002). If Ai > 1, Ai C C B > C after C is large enough, and this violates the bounded constraints in equation 1.2. Similarly, Ai cannot be less than zero, so 0 · Ai · 1. Therefore, we can consider the following three situations of vectors A and B: 1. 0 < Ai · 1
Decomposition Methods for Linear Support Vector Machines
1695
Table 1: Comparison of Iterations (Linear Kernel), With and Without Alpha Seeding. Without ®-Seeding
®-Seeding Number of Iterations
Problem heart australian diabetes german web adult ijcnn
Number of Number of Iterations Iterations wT w Total Iterations (C D 27:5 ) (C D 28 )
C¤
27,231 23:5 5.712 79,162 22:5 2.071 33,264 26:5 16.69 277,932 210 3.783 24,044,242 Unstable Unstable 3,212,093 Unstable Unstable 590,645 26 108.6
2,449,067 20,353,966 1,217,926 42,673,649 ¸ 108 ¸ 108 41,440,735
507,122 3,981,265 274,155 6,778,373 74,717,242 56214289 8,860,930
737,734 5,469,092 279,062 14,641,135 ¸ 108 84,111,627 13,927,522
2. Ai D 0; Bi D 0 3. Ai D 0; Bi > 0
For the second case, ®i1 CC2 D Ai C2 CBi D 0, and for the rst case, Ai C À Bi 1
after C is large enough. Therefore, ®i1 CC2 D Ai C2 C Bi CC2 ¼ Ai C2 C Bi . For both 1 1 cases, alpha seeding is very useful. On the other hand, using theorem 2, there are few .· n C 1/ components satisfying the third case. Next, we conduct some comparisons between DSVM with and without alpha seeding. Here, we consider two-class problems only. Some statistics of the data sets used are in Tables 1 and 2. The four small problems are from the statlog collection (Michie et al., 1994). The problem adult is compiled by Platt (1998) from the UCI “adult” data set (Blake & Merz, 1998). Problem web is also from Platt. Problem ijcnn is from the rst problem of IJCNN challenge 2001 (Prokhorov, 2001). Note that we use the winner ’s transformation of the raw data (Chang & Lin, 2001a). We train linear SVMs with C 2 f2¡8 ; 2¡7:5 ; : : : ; 28 g. That is, [2¡8 ; 28 ] is discretized to 33 points with equal ratio. Table 1 presents the total number Table 2: Comparison of Iterations (RBF Kernel), With and Without Alpha Seeding. Problem heart australian diabetes german web adult ijcnn
l
n
®-Seeding
Without ®-Seeding
270 690 768 1000 49,749 32,561 49,990
13 14 8 24 300 123 22
43,663 230,983 101,378 191,509 633,788 2,380,265 891,563
56,792 323,288 190,047 260,774 883,319 4,110,663 1,968,396
1696
W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin
of iterations of training 33 linear SVMs using the alpha seeding approach. We also individually solve them by LIBSVM and list the number of iterations (total, C D 27:5 , and C D 28 ). The alpha seeding implementation will be described in detail in section 4. We also list the approximate C¤ for which linear SVMs with C ¸ C¤ have the same decision function. In addition, the constant wT w after C ¸ C¤ is also given. For some problems (e.g., web and adult), wT w has not reached a constant until C is very large, so we indicate them as “unstable” in Table 1. To demonstrate that alpha seeding is much more effective for linear than nonlinear SVMs, Table 2 presents the number of iterations using the radial 2 2 basis function (RBF) kernel K.xi ; xj / D e¡kxi ¡xj k =.2¾ / with 1=2¾ 2 D 1=n. It is clear that the number of iterations saved by using alpha seeding is marginal. In addition, comparing to the “Total Iterations” column in Table 1, we conrm again the slow convergence for linear SVMs without alpha seeding. The alpha seeding approach performs so well that its total number of iterations is much fewer than solving one single linear SVM with the original decomposition implementation. Therefore, if we intend to solve one linear SVM with a particular C, it may be more efcient to solve one with small initial C0 and then use the proposed alpha seeding method by gradually increasing C. Furthermore, since we have solved linear SVMs with different C, model selection by cross validation is already done. From the discussion in section 2, solving several linear SVMs without alpha seeding is time-consuming and model selection is not easy. Note that in Table 1, web is the most difcult problem and requires the largest number of iterations. Theorem 2 helps to explain this: since web’s large number of attributes might lead to more free variables during iterations or at the nal solution, alpha seeding is less effective. 
4 Implementation Though the concept is so simple, an efcient and elegant implementation requires considerable thought. Most DSVM implementations maintain the gradient vector of the dual objective function during iterations. The gradient is used for selecting the working set or checking the stopping condition. In the nonlinear SVM, calculation of the gradient Q® ¡ e requires O.l2 n/ operations (O.n/ for each kernel evaluation), which are expensive. Therefore, many forms of DSVM software use ® D 0 as the initial solution, which makes the initial gradient ¡e immediately available. However, in DSVM with alpha seeding, the initial solution is obtained from the last problem, so the initial gradient is not a constant vector. Fortunately, for linear SVMs, the situation is not as bad as that in the nonlinear SVM. In this case, the kernel matrix is of the form Q D XT X, where X D [y1 x1 ; : : : ; yl xl ] is an n by l matrix. We can calculate
Decomposition Methods for Linear Support Vector Machines
1697
the gradient by Q® ¡ e D XT .X®/ ¡ e, which requires only O.ln/ operations. The rst decomposition software that used this for linear SVMs is SVMlight (Joachims, 1998). Similarly, if ® is changed by 1® between two consecutive iterations, then the change of gradient is Q.1®/ D XT .X1®/. Since there are only q nonzero elements in 1®, the gradient can be updated with O.nq/ C O.ln/ D O.ln/ operations, where q is the size of working set. Note that because l À q, increasing q from 2 to some other small constant will not affect the time of the updating gradient. In contrast, for nonlinear SVMs, if Q is not in the cache, the cost for updating the gradient is by computing Q.1®/, which requires q columns of Q and takes O.lnq/-time. Thus, while the implementation of nonlinear SVMs may choose SMO-type implementation (i.e., q D 2) for a lower cost per iteration, we should use a larger q for linear SVMs because the gradient update is independent of q and the number of total iterations may be reduced. In the nonlinear SVM, constructing the kernel matrix Q is expensive, so a cache for storing recently used elements of Q is necessary. However, in linear SVM, either the kernel matrix Q or the cache is no longer needed. 5 Comparison with Other Approaches It is interesting to compare the proposed alpha seeding approach with efcient methods for linear SVMs. In this section, we consider active SVM (ASVM) (Mangasarian & Musicant, 2000) and Lagrangian SVM (LSVM) (Mangasarian & Musicant, 2001). DSVM, ASVM, and LSVM solve slightly different formulations, so a fair comparison is difcult. However, our goal here is only to demonstrate that with alpha seeding, decomposition methods can be several times faster and competitive with other linear SVM methods. In the following, we briey describe the three implementations. DSVM is the standard SVM, which uses the dual formulation, equation 1.2. 
ASVM and LSVM both consider a square error term in the objective function: l X 1 T .w w C b2 / C C »i2 : 2 iD1
(5.1)
Then, the dual problem of equation 5.1 is ³ ´ 1 T I T min ® Q C yy C ® ¡ eT ® ® 2 2C subject to 0 · ®i ; i D 1; : : : ; l;
(5.2)
min w;b;»
where I is the identity matrix. The solution of equation 5.2 has far more free components than that of equation 1.2 as upper-bounded variables of
1698
W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin
Table 3: Comparison of Different Approaches for Linear SVMs. Decomposition Methods (LIBSVM)
Methods for Linear SVMs
With ® Seeding Without ® Seeding Problem australian heart diabetes german ijcnn adult web
Acc. Time: C · 28 .C · 23 / 85.51 85.56 79.69 73.65 92.65 85.02 98.67
4.1 (3.6) 1.4 (0.9) 2.4 (2.3) 10.2 (6.2) 981.9 (746.0) 1065.9 (724.8) 18,035.3 (1738.2)
Time: C · 23 7.0 1.0 1.6 18.2 3,708.6 12,026.8 7,035.6
ASVM
LSVM
Time
Time
88.70 8.1 85.56 5.8 81.64 6.2 73.95 16.4 92.51 725.2 84.90 3,130.7 98.63 10,315.9
2.3 1.8 2.0 9.0 17,496.4 13,445.4 43,060.1
Acc.
Notes: Acc.: test accuracy using the parameter obtained from cross validation. Time (in seconds): total training time of ve-fold cross validation by trying C D 2¡10 ; 2¡9:5 ; : : : ; 28 (or 23 if specied).
this equation are likely to be free now. With different formulations, their stopping conditions are not exactly the same. We use conditions from similar derivations; details are discussed in Kao et al. (2002). In this experiment, we consider LIBSVM for DSVM (with and without alpha seeding). For ASVM, we use the authors’ C++ implementation available on-line at http://www.cs.wisc.edu/dmi/asvm. The authors of LSVM provide only MATLAB programs, so we implement it by modifying LIBSVM. The experiments were done on an Intel Xeon 2.8 GHz machine with 1024 MB RAM using the gcc compiler. Using the same benchmark problems as in section 3, we perform comparisons in Table 3 as follows: for each problem, we randomly select twothirds of the data for training and leave the remaining for testing. For algorithms except DSVM without alpha seeding, ve-fold cross validation with C D 2¡10 ; 2¡9:5 ; : : : ; 28 on the training set is conducted. For DSVM without alpha seeding, as the training time is huge, only C up to 23 is tried. Then using the C that gives the best cross-validation rate, we train a model and predict the test data. Both testing accuracy and the total computational time are reported. In Table 3, alpha seeding with C up to 28 is competitive with solving C up to only 23 without alpha seeding. For these problems, considering C · 23 is enough, and if alpha seeding stops at 23 as well, it is several times faster than without alpha seeding. Since alpha seeding is not applied to ASVM and LSVM, we admit that their computational time can be further improved. Results here also serve as the rst comparison between ASVM and LSVM. Clearly ASVM is faster. Moreover, due to the huge computational time, we set the maximal iterations of LSVM at 1000. For problems adult and web, after C is large, the iteration limit is reached before stopping conditions are satised. 
In addition to comparing DSVM with ASVM and LSVM, we compare the performance of SMO type (q D 2) and that with a larger working set
Decomposition Methods for Linear Support Vector Machines
1699
Table 4: Comparison of Different Subproblem Size in Decomposition Methods for Linear SVMs. Decomposition Methods with Alpha Seeding q D 2 (SVMlight ) Problem australian heart diabetes german ijcnn adult web
q D 30 (SVMlight )
Total Iterations
Time
Total Iterations
Time
50,145 25,163 30,265 182,051 345,630 1,666,607 NAa
0.71 0.21 0.4 2.9 185.85 1455.2 NAa
6533 1317 5378 6006 79847 414,798 2,673,578
1.19 0.33 0.31 3.68 115.03 516.71 1885.1
Notes: Time in seconds; q: size of the working set. The time here is shorter than that in Table 3 because we do not perform cross validation. a SVMlight faced numerical difculties.
(q D 30) for DSVM with alpha seeding in Table 4 by modifying the software SVM light , which allows adjustable q. All default settings of SVMlight are used. In this experiment, we solve linear SVMs with C D 2¡8 ; 2¡7:5 ; : : : ; 28 and report their computational time and total number of iterations. Note that when q is two, SVMlight and LIBSVM use the same algorithm and differ only in some implementation details. The results in Table 4 show that the implementation with a larger working set takes less time than that with a smaller one. This is consistent with our earlier statement that for linear SVMs, SMO-type decomposition methods are less favorable. Regarding the computational time reported in this section, we must caution that quite a few implementation details may affect it. For example, each iteration of ASVM and LSVM involves several matrix vector multiplications. Hence, it is possible to use nely tuned dense linear algebra subroutines. For the LSVM implementation here, by using ATLAS (Whaley, Petitet, & Dongarra, 2000), for large problems, the time is reduced by twothirds. Thus, it is possible to further reduce the time of ASVM in Table 3, though we nd it too complicated to modify the authors’ program. Using such tools also means X is considered as a dense matrix. In contrast, X is currently treated as a sparse matrix in both LIBSVM and SVMlight, where each iteration requires two matrix vector multiplications X.XT .® kC1 ¡ ® k //. This sparse format creates some overhead when data are dense. 6 Experiments on Model Selection If the RBF kernel K.xi ; xj / D e¡kxi ¡xj k
2
=.2¾ 2 /
1700
W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin
is used, Keerthi and Lin (2003) propose the following model selection procedure for nding good C and ¾ 2 : Algorithm 1.
Two-line model selection
Q 1. Search for the best C of linear SVMs and call it C.
2. Fix C̃ from step 1 and search for the best (C, σ²) satisfying log σ² = log C − log C̃ using the RBF kernel.

That is, we first solve a sequence of linear SVMs and then a sequence of nonlinear SVMs with the RBF kernel. The advantage of algorithm 1 over an exhaustive search of the parameter space is that only parameters on two lines are considered. If decomposition methods are directly used for both linear and nonlinear SVMs here, solving the linear SVMs becomes a bottleneck due to the huge number of iterations. Our goal is to show that by applying the alpha seeding technique to linear SVMs, the computational time spent on the linear part becomes similar to that on the nonlinear SVMs. Earlier, in Keerthi and Lin (2003), due to the difficulty of solving linear SVMs, algorithm 1 was tested on only small two-class problems. Here, we would like to evaluate this algorithm on large multiclass data sets.

We consider the problems dna, satimage, letter, and shuttle, which were originally from the statlog collection (Michie et al., 1994) and were used in Hsu and Lin (2002a). Except for dna, whose attributes take the two values 0 and 1, each attribute of all training data is scaled to [−1, 1]. Then test data are adjusted using the same linear transformation. Since LIBSVM contains a well-developed cross-validation procedure, we use it as the DSVM solver in this experiment. We search for C̃ by five-fold cross validation on linear SVMs using uniformly spaced log₂ C̃ values in [−10, 10] (with grid spacing 1). As LIBSVM considers γ = 1/(2σ²) as the kernel parameter, the second step is to search for good (C, γ) satisfying

−1 − log₂ γ = log₂ C − log₂ C̃.   (6.1)
We discretize [−10, 4] as the values of log₂ γ and calculate log₂ C from equation 6.1. To avoid log₂ C falling in an abnormal region, we consider only points with −2 ≤ log₂ C ≤ 12, so the second step may solve fewer SVMs than the first step. The same computational environment as in section 3 is used. Since this model selection method is based on the analysis of binary SVMs, a multiclass problem has to be decomposed into several binary SVMs. We employ the one-against-one approach: if there are k classes of data, all k(k − 1)/2 two-class combinations are considered. For any two classes of data, model selection is conducted to find the best (C, σ²). With
Decomposition Methods for Linear Support Vector Machines
1701
Table 5: Comparison of Different Model Selection Methods.

            Complete Grid Search                    Algorithm 1
            1 (C, σ²)  k(k−1)/2 (C, σ²)
Problem     Accuracy   Time      Accuracy  Time    Time (linear)  Time (nonlinear)  Accuracy
dna         95.62      4945      95.11     202     123            79                94.86 (94.77)
satimage    91.9       7860      92.2      1014    743            271               91.55 (90.55)
letter      97.9       56,753    97.72     5365    3423           1942              96.54 (95.9)
shuttle     99.92      104,904   99.94     4196    2802           1394              99.81 (99.7)

Notes: Accuracies of algorithm 1 enclosed in parentheses are the accuracies if we search log₂ C̃ ∈ [−10, 3] in step 1 of algorithm 1. Time is in seconds.
the k(k − 1)/2 best (C, σ²) and corresponding decision functions, a voting strategy is used for the final prediction. In Table 5, we compare this approach with two versions of a complete grid search. First, for any two classes of data, five-fold cross validation is conducted on 225 points, a discretization of the space (log₂ C, log₂ γ) ∈ [−2, 12] × [−10, 4]. The second version is the cross-validation procedure adopted by LIBSVM for multiclass data, where a list of (C, σ²) is selected first, and then for each (C, σ²), the one-against-one method is used for estimating the cross-validation accuracy of the multiclass data. Therefore, for the final optimal model, all k(k − 1)/2 decision functions share the same C and σ². Since the same number of nonlinear SVMs is trained, the time for the two complete grid searches is exactly the same, but the performance (test accuracy) may differ. There has been no such comparison so far, so we present a preliminary investigation here.

Table 5 presents the experimental results. For each problem, we compare the test accuracy of the two complete grid searches and of algorithm 1. The two grid searches are denoted "1 (C, σ²)" and "k(k − 1)/2 (C, σ²)," respectively, depending on how many (C, σ²) are used by the decision functions. The performances of the three approaches are very similar. However, the total model selection time of algorithm 1 is much shorter. In addition, we list in parentheses the accuracy of algorithm 1 if we search log₂ C̃ only in [−10, 3] in step 1. We find that the accuracy is consistently lower when we search only this smaller region for C̃. In fact, when searching log₂ C̃ in [−10, 3], many of the selected log₂ C̃ values equal 3 in this experiment. This means that [−10, 3] is too small to cover good parameter regions. We make algorithm 1 practical by using alpha seeding. Otherwise, the time for solving linear SVMs increases greatly, and the proposed model selection does not possess any advantage.
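The one-against-one voting step can be sketched as follows (a minimal illustration with hypothetical decision functions; this is not the actual LIBSVM code):

```python
from itertools import combinations

def ovo_predict(decision_funcs, x, k):
    """Vote among the k(k-1)/2 pairwise decision functions.

    decision_funcs are ordered as combinations(range(k), 2); a positive
    value votes for the first class of the pair, a negative one for the
    second.  The class with the most votes wins.
    """
    votes = [0] * k
    for (a, b), f in zip(combinations(range(k), 2), decision_funcs):
        votes[a if f(x) > 0 else b] += 1
    return max(range(k), key=votes.__getitem__)
```

For example, with k = 3 and pairwise functions that favor class 2 in both of its pairs, the prediction is class 2.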
We then investigate the stability of the new model selection approach. Due to timing restrictions, we consider two smaller problems, banana and adult small, tested in Keerthi and Lin (2003). adult small, a subset of adult used in section 3, is a binary problem with 1605 examples. Table 6 shows
Table 6: Mean and Standard Deviation of Two Model Selection Methods.

              Complete Grid Search                             Algorithm 1
              log₂ C         log₂ γ         Accuracy          log₂ C̃        Accuracy
Problem       Mean    SD     Mean    SD     Mean     SD       Mean    SD     Mean     SD
banana        7       4.45   -0.4    1.51   87.91    0.47     -1.9    2.18   76.36    12.21
adult small   5.4     2.37   -7.6    1.71   83.82    0.27      0.3    4.08   83.20    1.28
dna           5.4     3.34   -5      0      95.56    0.19     -       -      94.85    0.20
satimage      2.5     0.71    0.1    0.57   91.74    0.24     -       -      91.19    0.28

Note: Each method was applied 10 times.
the means and standard deviations of the parameters and accuracy obtained by applying the k(k − 1)/2 (C, σ²) grid search and algorithm 1 10 times each. For algorithm 1, we list only C̃'s variance, because the variances of the parameters C and σ², which are computed from equation 6.1, are less meaningful. Note that when the same method is applied 10 times, the different parameters and accuracies are due to the randomness of cross validation. From Table 6, we can see that although the performance (testing accuracy and consumed time) of model selection algorithm 1 is good, it might be less stable. That is, its variance of accuracy is significantly larger than that of the complete grid search method, and the variances of both parameters are large. We think that in the complete grid search method, the cross-validation estimation bounds the overall error, so the variances of the obtained parameters do not affect the testing performance. In the two-line search method (algorithm 1), however, two stages of cross validation are used, so the variance in the first stage may affect the best performance attainable in the second stage.

7 Discussion and Conclusion

It is arguable that we may have used too strict a stopping condition in DSVM when C is large. One possibility is to use a stopping tolerance that is proportional to C. This would reduce the number of iterations, so that directly solving linear SVMs with large C may become possible. However, in the appendix of Chung et al. (2002), we show that even in these settings, DSVM with alpha seeding is still several times faster than the original DSVM, especially for large data sets. Moreover, a stopping tolerance that is too large will cause DSVM to stop with wrong solutions. In conclusion, we hope that based on this work, SVM software using decomposition methods can be made suitable for all types of problems, both n ≪ l and n ≫ l.
Acknowledgments

This work was supported in part by the National Science Council of Taiwan under grant NSC 90-2213-E-002-111. We thank Thorsten Joachims for help on modifying SVMlight for the experiments in section 5.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases (Tech. Rep.). Irvine, CA: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Chang, C.-C., & Lin, C.-J. (2001a). IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. New York: IEEE.
Chang, C.-C., & Lin, C.-J. (2001b). LIBSVM: A library for support vector machines. Available on-line at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cortes, C., & Vapnik, V. (1995). Support-vector network. Machine Learning, 20, 273–297.
DeCoste, D., & Wagstaff, K. (2000). Alpha seeding for support vector machines. In Proceedings of the International Conference on Knowledge Discovery and Data Mining. New York: ACM Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Hsu, C.-W., & Lin, C.-J. (2002a). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
Hsu, C.-W., & Lin, C.-J. (2002b). A simple decomposition method for support vector machines. Machine Learning, 46, 291–314.
Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Kao, W.-C., Chung, K.-M., Sun, T., & Lin, C.-J. (2002). Decomposition methods for linear support vector machines (Tech. Rep.). Taipei: Department of Computer Science and Information Engineering, National Taiwan University. Available on-line at: http://www.csie.ntu.edu.tw/~cjlin/papers/linear.pdf.
Keerthi, S. S., & Lin, C.-J. (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7), 1667–1689.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2000). A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1), 124–136.
Lee, Y.-J., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining. Philadelphia: SIAM.
Lin, K.-M. (2002). Reduction techniques for training support vector machines. Unpublished master's thesis, National Taiwan University.
Lin, K.-M., & Lin, C.-J. (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14(6), 1449–1559.
Mangasarian, O. L., & Musicant, D. R. (2000). Active set support vector machine classification. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 577–583). Cambridge, MA: MIT Press.
Mangasarian, O. L., & Musicant, D. R. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Englewood Cliffs, NJ: Prentice Hall. Data available on-line at: ftp.ncc.up.pt/pub/statlog/.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97 (pp. 130–136). New York: IEEE.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Prokhorov, D. (2001). IJCNN 2001 neural network competition. Slide presentation at IJCNN'01, Ford Research Laboratory. Available on-line at: http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.
Whaley, R. C., Petitet, A., & Dongarra, J. J. (2000). Automatically tuned linear algebra software and the ATLAS project (Tech. Rep.). Chattanooga: Department of Computer Sciences, University of Tennessee.

Received March 7, 2003; accepted January 7, 2004.
LETTER
Communicated by Manfred Opper
An Asymptotic Statistical Theory of Polynomial Kernel Methods Kazushi Ikeda [email protected] Graduate School of Informatics, Kyoto University, Sakyo, Kyoto 606-8501, Japan
The generalization properties of learning classifiers with a polynomial kernel function are examined. In kernel methods, input vectors are mapped into a high-dimensional feature space where the mapped vectors are linearly separated. It is well known that a linear dichotomy has an average generalization error, or learning curve, proportional to the dimension of the input space and inversely proportional to the number of given examples in the asymptotic limit. However, this does not hold for kernel methods, since the feature vectors lie on a submanifold of the feature space, called the input surface. In this letter, we discuss how the asymptotic average generalization error depends on the relationship between the input surface and the true separating hyperplane in the feature space, where the essential dimension of the true separating polynomial, named the class, is important. We show its upper bounds in several cases and confirm these using computer simulations.
1 Introduction

Kernel classifiers, such as the support vector machine (SVM) and the kernel perceptron learning algorithm, have received much attention in recent years as successful new pattern classification techniques (Vapnik, 1995, 1998; Schölkopf, Burges, & Smola, 1998; Cristianini & Shawe-Taylor, 2000; Smola, Bartlett, Schölkopf, & Schuurmans, 2000). A kernel classifier maps an input vector x ∈ X to the corresponding feature vector f(x) in a high-dimensional feature space F and discriminates it linearly, that is, outputs the sign of the inner product of the feature vector and a parameter vector w (Aizerman, Braverman, & Rozonoer, 1964). An SVM shows good generalization performance in experiments, and its generalization ability has been analyzed theoretically based on the probably approximately correct (PAC) learning approach (Vapnik, 1995, 1998). In other words, the sufficient number of examples is given as a function of δ and ε such that the probability that the generalization error is larger than ε becomes less than δ, regardless of the distribution of the input vectors (Valiant, 1984). This framework considers the worst case in a sense, and the VC dimension plays an essential role in representing the complexity of the learning machines (Vapnik & Chervonenkis, 1971).

Neural Computation 16, 1705–1719 (2004) © 2004 Massachusetts Institute of Technology

Another approach to generalization ability is to directly estimate the generalization error or, more specifically, to evaluate the average generalization error, defined as the generalization error averaged over the given examples and a new input. The average generalization error as a function of the number of given examples is often called a learning curve. One tool for deriving the learning curve is statistical mechanics (Baum & Haussler, 1989; Györgyi & Tishby, 1990; Opper & Haussler, 1991), which has been applied to SVM learning under the assumption that the input is binary (Dietrich, Opper, & Sompolinsky, 1999). The results given by statistical mechanics explain the phenomenon well. However, the method has yet to be justified mathematically, and it is applicable only to the case where the dimension of the input space and the number of examples tend to infinity together. Another approach is asymptotic statistics (Amari, Fujita, & Shinomoto, 1992; Amari, 1993; Amari & Murata, 1993; Murata, Yoshizawa, & Amari, 1994), which assumes that the number of examples is sufficiently large. When the learning machine is not differentiable, as in the case of a linear dichotomy, a stochastic geometrical method can also be employed (Ikeda & Amari, 1996). The analyses above have shown that the generalization error is, in general, asymptotically proportional to the number of parameters and inversely proportional to the number of given examples, although its exact coefficient is still unknown.

In this letter, we discuss the learning curve of kernel classifiers. Since they are essentially linear dichotomies in a high-dimensional feature space F, it could be expected that their generalization errors are proportional to the number of parameters, which coincides with the dimension of the space F.
However, analyses in the PAC learning framework have shown that the generalization ability does not depend on the apparently high dimension of the feature space, even if the feature space is of infinite dimension (Vapnik, 1995, 1998). Kernel methods differ from linear dichotomies in that examples in the feature space are located only on a low-dimensional submanifold into which the input vectors are mapped. Hence, the learning curve of kernel methods does not match that of linear dichotomies in the framework of asymptotic statistics. The purpose of this letter is to derive the asymptotic learning curve of kernel methods and to clarify the generalization properties of kernel learning methods when a polynomial kernel is employed as the kernel function. When the input space is one-dimensional, it was shown that the average generalization error depends only on the complexity of the true separating polynomial and does not increase even if the degree of the polynomial kernel becomes larger (Ikeda, forthcoming). However, this result is restrictive and not applicable to general cases, since it depends significantly on the fact that any polynomial of one variable is factorizable into polynomials of degree one or two.
To extend the above result to an input space of general dimension, we introduce a new index, named the class, that characterizes the relationship between the submanifold the input vectors make and the true separating function in the feature space. The class describes the number of essential parameters, and the average generalization error is proportional to it. We give its upper bounds in some cases.

The rest of the letter is organized as follows. Section 2 reviews the results on the learning curve of linear dichotomies. The mathematical formulation of polynomial kernel methods is given in section 3. A new index for the generalization ability is introduced in section 4, and its bounds in several cases are given in section 5. Section 6 is devoted to computer simulations, and conclusions and discussions are given in section 7.

2 Learning Curve of Linear Dichotomies

Since a kernel classifier is a linear dichotomy in feature space, we first review the learning curve of linear dichotomies in this section.

2.1 Formulation. Consider a deterministic dichotomy machine that outputs y ∈ {±1} for an input vector f = (f_1, f_2, ..., f_c)^T in a bounded subset F of a c-dimensional Euclidean space R^c by calculating y = sgn[w^T f], where w is a c-dimensional vector called the parameter vector, or simply the parameter, which has the prior distribution q_pr(w). We assume that f has a positive density function p(f) > 0 in F such that the difference between the outputs with w and w + Δw is proportional to ‖Δw‖ when ‖Δw‖ is small, that is,

∫ |sgn[(w + Δw)^T f] − sgn[w^T f]| df = a(w)‖Δw‖ + o(‖Δw‖),

where a(w) > 0, as assumed in Amari et al. (1992). An example consists of an input f and its corresponding output y = sgn[w_o^T f], where w_o is the true parameter vector of a fixed supervisor chosen according to q_pr(w).
Given N examples, denoted by Ξ_N = {(f_n, y_n), n = 1, ..., N}, a learning machine is trained so that it can reproduce y_n for any f_n in the example set Ξ_N. Let ŵ be such a parameter of the learning machine. The learning machine with parameter ŵ predicts the output y_{N+1} corresponding to a test input vector f_{N+1} independently chosen from the same distribution as the examples. The generalization error L(Ξ_N, f_{N+1}, y_{N+1}) of the trained machine is defined as the probability that the predicted output ŷ for a test input f_{N+1} differs from the true output y_{N+1} over the posterior distribution q_po(w) of the true parameter w_o. Since the pair (f_{N+1}, y_{N+1}) can be regarded as an example, we denote the generalization error by L(Ξ_{N+1}) for short. The average generalization error
is defined as the expectation of the generalization error and is denoted by

L_N = E_{Ξ_{N+1}}[L(Ξ_{N+1})].   (2.1)
L_N is called a learning curve.

2.2 Learning Algorithms and Admissible Region. The average generalization error depends on how the parameter is estimated. We choose it randomly according to the posterior distribution q_po(w), from a Bayesian standpoint. This method is called the Gibbs algorithm (Levin, Tishby, & Solla, 1990). In the case of a deterministic dichotomy, q_po(w) ∝ q_pr(w) when w ∈ A_N, and q_po(w) = 0 otherwise, where A_N is called the admissible region (Ikeda & Amari, 1996) or the version space (Mitchell, 1982), defined as

A_N = {w | w^T f_n y_n > 0, n = 1, ..., N}   (2.2)

(see Figure 1). Hence, the posterior distribution is expressed as q_po(w) = q_pr(w)/Z_N, where Z_N = ∫_{A_N} q_pr(w) dw is a measure of the admissible region A_N, called the partition function in statistical mechanics.

2.3 A Bound on the Average Generalization Error. We give here an asymptotic upper bound on the average generalization error. By definition, the partition function is rewritten as

Z_N = ∫_{A_N} q_pr(w) dw   (2.3)

    = ∫ q_pr(w) ∏_{n=1}^{N} I[w^T f_n y_n > 0] dw,   (2.4)
Figure 1: Examples and admissible region in the parameter space.
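For intuition, the partition function of equation 2.4 can be estimated by simple Monte Carlo sampling in a toy setting (a sketch under assumed choices: c = 2, a uniform prior on the unit circle, and Gaussian inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# True parameter and N labeled examples (f_n, y_n) with y_n = sgn(w_o . f_n).
w_o = np.array([1.0, 0.5]) / np.hypot(1.0, 0.5)
F = rng.normal(size=(20, 2))                    # inputs f_n
y = np.sign(F @ w_o)                            # outputs from the supervisor

# Sample w from the prior (uniform on the unit circle) and evaluate the
# indicator product of equation 2.4; Z_N is the admissible prior mass.
W = rng.normal(size=(20000, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)
admissible = ((W @ F.T) * y > 0).all(axis=1)
Z_N = admissible.mean()                         # Monte Carlo estimate of Z_N
```

The true parameter w_o always lies in the admissible region A_N, and Z_N shrinks as N grows, which is the mechanism behind the bound derived below from equations 2.5 to 2.7.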
where I[·] denotes the indicator function. The generalization error is therefore described using the partition function as

L(Ξ_{N+1}) = ∫ q_po(w) I[w^T f_{N+1} y_{N+1} < 0] dw   (2.5)

           = 1 − (1/Z_N) ∫ q_pr(w) ∏_{n=1}^{N+1} I[w^T f_n y_n > 0] dw   (2.6)

           = 1 − Z_{N+1}/Z_N.   (2.7)
From the convexity of the logarithm function, −log(Z_{N+1}/Z_N) is an upper bound on equation 2.7, and hence

L_N ≤ E_{Ξ_N}[log Z_N] − E_{Ξ_{N+1}}[log Z_{N+1}] = c/N + o(1/N)   (2.8)

asymptotically holds true. (See Amari et al., 1992, and Amari, 1993, for details.) In the following, we use equation 2.8 to derive upper bounds for kernel methods.

2.4 Restricted Input Vectors. Suppose that the input vectors are located in a c′-dimensional subspace of the input space. We can denote the input vector by f = (f′^T, 0^T)^T without loss of generality, where f′ is a c′-dimensional input vector and 0 is a (c − c′)-dimensional null vector. Since the last (c − c′) components of the parameter vector w affect neither the output y_{N+1} for f_{N+1} nor the examples (f_n, y_n), n = 1, ..., N, we can regard this problem as a c′-dimensional one. Hence, an upper bound on the average generalization error in this case is given by c′/N. (See Amari et al., 1992, for a similar discussion.)

3 Polynomial Kernel Methods

Consider a deterministic dichotomy that outputs y ∈ {±1} for an input vector x = (x_1, x_2, ..., x_m)^T in an m-dimensional space X = R^m by calculating sgn[w^T f(x)], where f(x) is an M-dimensional vector called a feature vector, with M = C(m+p, m), the binomial coefficient. Each feature vector consists of M monomials of degrees up to p, that is,

f(x) = { a_{d_1,...,d_m} x_1^{d_1} x_2^{d_2} ··· x_m^{d_m} | Σ_{i=1}^{m} d_i ≤ p },   (3.1)

where a_{d_1,...,d_m} is a constant determined by K(x, x′) = (x^T x′ + 1)^p. In the case of m = 2 and p = 2, for example,

f(x) = (x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2, 1).   (3.2)
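The correspondence between the explicit feature map of equation 3.2 and the kernel can be checked numerically (a small sketch; the two test points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map of equation 3.2 for m = 2, p = 2."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([x1**2, r2 * x1 * x2, x2**2, r2 * x1, r2 * x2, 1.0])

x, xp = np.array([0.5, -1.0]), np.array([2.0, 0.25])
lhs = phi(x) @ phi(xp)                  # <f(x), f(x')>
rhs = (x @ xp + 1.0) ** 2               # K(x, x') = (x.x' + 1)^2
# lhs and rhs agree, confirming that f(x) realizes the polynomial kernel.
```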
(See Schölkopf & Smola, 2002, for details on the relationship between a kernel function and the corresponding feature space.) Note that any polynomial of a degree up to p can be expressed in the form w^T f(x). An example consists of an input x drawn from a certain probability density p(x) > 0 and its corresponding output y = sgn[w_o^T f(x)], where w_o is the true parameter vector of a supervisor. A learning machine given N examples predicts the output y_{N+1} for a test input vector x_{N+1} with the Gibbs algorithm; that is, the estimate ŵ is chosen according to the posterior distribution q_po(w) = p(w | Ξ_N), where Ξ_N = {(x_n, y_n), n = 1, ..., N}. The average generalization error L_N is defined as the probability that the predicted output ŷ is false, averaged over all training sets and test inputs Ξ_{N+1} and all estimates ŵ.

We treat only the Gibbs algorithm in this letter, but the results shown below can easily be extended to other algorithms based on the admissible region, such as Bayesian estimation (Neal, 1996) and SVM learning, since the results mainly depend on how the input space is embedded into the feature space and how the true hyperplane intersects the input space there. Note that the estimator of an SVM is the center of the hypersphere inscribed in the admissible region in the feature space, where the radius of the hypersphere corresponds to its margin. (See Herbrich, 2002, for details on the geometrical view of SVMs.)

4 Embedding and Generalization Error

As seen in section 2.4, a learning curve depends on the number of essential parameters, that is, the dimension of the space the input vectors span. A kernel classifier also has localized examples and test inputs when it is regarded as a linear dichotomy in the feature space, since the feature vectors form a submanifold determined by the feature map f(x). Hence, we consider how the localization of the feature vectors affects the learning curve.

4.1 Input Surface and Separating Curve.
We define the input surface as the m-dimensional submanifold of the feature space into which input vectors are mapped. Since a separating hyperplane in the feature space is described as

w^T f(x) = 0,   (4.1)

the input surface and the true separating hyperplane have an (m − 1)-dimensional intersection in the feature space. We call this the separating curve (see Figure 2). The task of a learning machine is to estimate the separating curve using positive examples from one side of the curve and negative ones from the other side. As a simple example, we discuss the one-dimensional case, that is, m = 1. The intersection of the input surface (called the input curve here) and the
Figure 2: Input surface and the separating hyperplane.
separating hyperplane then consists of isolated points. So the learning machine is required to find the zero points of the true separating polynomial between the positive and negative examples in the neighborhood of the true zero points, rather than to estimate the true polynomial itself, since uncertainty exists between the positive and negative examples on the input curve. When the number N of examples is sufficiently large, the distances of the closest positive and negative examples from a true zero point go to null. Hence, the problem is divided into p_o one-dimensional linear dichotomies when the true separating polynomial w_o^T f(x) = 0 has p_o zero points. Therefore, an upper bound on the average generalization error based on equation 2.8 is asymptotically p_o/N and does not depend on the degree p of the kernel, which corresponds to the dimension of the feature space (Ikeda, forthcoming).

Consider the above case in the feature space. The input space (the left line in Figure 3) is mapped to a curve in the feature space (the right curve) and intersects the separating hyperplane at several points (the black circles). Since the given examples, as well as the test input vector, exist only on the input curve, the generalization error is determined by how much an intersecting point of the estimated hyperplane and the input curve differs from the corresponding true one. In the framework of asymptotic statistics, assuming N ≫ 1, the direction of the difference can be approximated by the tangent vector of the input curve at each true zero point (the thick arrows). Hence, only the subspace spanned by the tangent vectors at all of the zero points affects the generalization properties. Since there are p_o such tangent vectors, df(x)/dx for x s.t. w_o^T f(x) = 0, the vectors have different relative orientations and span p_o dimensions. This means that the components of the parameter vector orthogonal to this space do not affect the output and
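This tangent-vector argument is easy to verify numerically in a hypothetical one-dimensional example with m = 1 and p = 3, where the feature map f(x) = (1, √3 x, √3 x², x³) realizes K(x, x′) = (xx′ + 1)³: the tangent vectors df(x)/dx at the p_o = 2 zero points of a degree-2 true polynomial span only p_o dimensions of the M = 4-dimensional feature space.

```python
import numpy as np

# Feature map of (x x' + 1)^3:  f_k(x) = sqrt(C(3, k)) x^k, k = 0,...,3.
coef = np.sqrt([1.0, 3.0, 3.0, 1.0])

def df_dx(x):
    """Tangent vector of the input curve, df(x)/dx."""
    return coef * np.array([0.0, 1.0, 2.0 * x, 3.0 * x**2])

# Zero points of an assumed true separating polynomial of degree p_o = 2,
# for example, one vanishing at x = 1 and x = -2.
T = np.stack([df_dx(r) for r in (1.0, -2.0)])
rank = np.linalg.matrix_rank(T)    # p_o = 2, not M - 1 = 3
```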
Figure 3: Tangent vectors of the input curve.
Figure 4: Separating curve and the tangent plane.
that the number of effective parameters is only p_o. Therefore, the problem is equivalent to a linear dichotomy of p_o dimensions.

4.2 Class and Generalization Error. Let us extend this idea to higher-dimensional cases, m ≥ 2. Asymptotically, we consider only the examples and the test input vector in the subspace spanned by the tangent planes at the points x on the separating curve, that is, with w_o^T f(x) = 0 (see Figure 4). We call the dimension of this subspace the class of the separating polynomial (Ikeda, 2003). Since any vector in the tangent plane at x is described as

∂f(x)/∂x · Δx,   (4.2)

where Δx is a certain m-dimensional vector, the class corresponds to the dimension of the subspace spanned by the vectors of the form 4.2 for x s.t. w_o^T f(x) = 0 and an arbitrary vector Δx.

We consider the meaning of the class from another viewpoint. Denote the separating polynomial by ψ(x; w) = w^T f(x) so as to emphasize that the separating curve is a function of x with parameter w. Suppose that a zero point x exists on the separating curve with parameter w. When the
parameter w changes to w + Δw, so does the zero point x to x + Δx, that is,

ψ(x; w_o) = 0,   (4.3)

ψ(x + Δx; w_o + Δw) = 0.   (4.4)

Assuming that Δx and Δw are small in the asymptotic limit, we derive from equations 4.3 and 4.4

∂ψ/∂w · Δw + ∂ψ/∂x · Δx = 0,   (4.5)
whose first term represents a polynomial of a degree up to p with coefficients Δw, since ∂ψ/∂w = f(x). This means that the class is the dimension of the subspace spanned by the small variations Δw in the coefficient vector from the true vector w_o that change the zero point of ψ(x; w_o) by Δx. In other words, a variation in the coefficient vector that is orthogonal to this subspace does not change the zero point (the intersection of the input surface and the separating hyperplane), and hence it does not affect the generalization error. Therefore, the number of parameters that can influence the generalization error is equal to the class. So when the class is c, y = sgn[w^T f(x)] can be regarded as a dichotomy with c parameters, and its upper bound based on equation 2.8 is asymptotically c/N.

5 Bounds on Class

From the above discussion, the class is the number of essential parameters of the learning machine. In this section, we discuss upper bounds on the class. A reasonable and rather constructive idea for deriving upper bounds is to explicitly give vectors orthogonal to any vector in the tangent planes at any x on the separating curve. If we find c′ such vectors, an upper bound on the class is given by M − c′, where M is the dimension of the feature space. As an example, we derive a trivial upper bound on the class. Since one component (the Mth, without loss of generality) of f(x) is a constant term, ∂f(x)/∂x · Δx is necessarily orthogonal to the constant vector (0, ..., 0, 1)^T, regardless of x and Δx. This leads to an upper bound of M − 1 on the class.

5.1 Class of Irreducible Polynomials. We prove here that the class of an irreducible polynomial is equal to the possible maximum, M − 1. Suppose that w_o^T f(x) is an irreducible polynomial of degree p. Then the m polynomials

w_o^T ∂f(x)/∂x_i,   i = 1, ..., m,   (5.1)
are mutually prime; that is, there exists an m-dimensional polynomial vector of x, denoted by Δx(x), s.t. w_o^T ∂f(x)/∂x · Δx(x) = 1, where x satisfies w_o^T f(x) = 0 and the components of Δx(x) are mutually prime. In this case,

w^T ∂f(x)/∂x · Δx(x)   (5.2)
does not constantly vanish as a polynomial for any w except (0, ..., 0, 1)^T. If such a w existed, each w^T ∂f(x)/∂x_i would have to be null for any x s.t. w_o^T f(x) = 0, due to the primeness of the components of Δx(x). However, this never happens, since w^T ∂f(x)/∂x_i is a polynomial of a degree up to (p − 1), whereas w_o^T f(x) is an irreducible polynomial of degree p. Hence, the class c takes its maximum, M − 1.

Note that this value is almost the same as the dimension M of the feature space; that is, the localization of the input vectors in the feature space does not affect the generalization error at all, and the high dimension enlarges the generalization error. However, this seems inevitable, since the true separating polynomial has such a high complexity. This result is consistent with that in Dietrich et al. (1999), where the true parameter is randomly chosen. The class of a reducible polynomial may be smaller but is still unknown (Ikeda, 2003).

5.2 Upper Bounds on Redundant Polynomials. The case mentioned above treats a rather difficult problem, since a polynomial of degree p is necessary to classify inputs correctly. Hence, the pessimistic result that the average generalization error is proportional to (M − 1)/N is natural in a sense. In this section, we consider a more interesting problem: we employ a polynomial kernel of degree p while the true separating polynomial, denoted by ψ_o(x; w_o), is of degree p_o less than p. How does the average generalization error depend on p and p_o? In this case, the learning machine can correctly discriminate any inputs, since the separating polynomial induced by the kernel of degree p can express any polynomial of a degree up to p, including the true polynomial of degree p_o. When m = 1, the average generalization error p_o/N based on equation 2.8 depends not on p but only on p_o, as shown in section 4.1.
We extend the result to an input space of general dimension m ≥ 2 and give two upper bounds in a constructive manner as before, that is, by giving orthogonal vectors explicitly.

The first bound is given by considering the space spanned by the tangent vectors ∂f(x)/∂x_i. Since ∂f(x)/∂x_i is a polynomial of a degree up to (p − 1), it spans a C(p+m−1, m)-dimensional space if x is arbitrary. However, x has the constraint ψ_o(x; w_o) = 0 in the definition of the class. Any polynomial written as

w^T ∂f(x)/∂x_i = φ(x) ψ_o(x; w_o),   (5.3)

where φ(x) is an arbitrary polynomial of a degree up to (p − p_o − 1), takes null for x s.t. ψ_o(x; w_o) = 0. Such w span a space of dimension C(p−p_o+m−1, m) due
An Asymptotic Statistical Theory of Polynomial Kernel Methods
1715
to the degree of Á.x/. Hence, the space spanned by @f .x/=@xi has a dimension of .pCm¡1/ Cm ¡ .p¡po Cm¡1/ Cm , at most. Considering each i D 1; 2; : : : ; m in the same way, an upper bound on the class is given by m ¤ ..pCm¡1/ Cm ¡ .p¡po Cm¡1/ Cm /:
(5.4)
The second bound is given by considering vectors orthogonal to w^T ∂f(x)/∂x_i more directly. Suppose p ≥ 2p_o. For any polynomial written as

w^T f(x) = φ(x) ψ_o(x; w_o)^2,    (5.5)

where φ(x) is an arbitrary polynomial of degree up to (p − 2p_o), w^T ∂f(x)/∂x_i = 0 holds for any i and any x such that ψ_o(x; w_o) = 0. Hence,

w^T ∂f(x)/∂x · Δx = 0    (5.6)

holds for any Δx. This means that no tangent vector has a component in the direction of a w satisfying equation 5.5; such w span a space of dimension _{p−2p_o+m}C_m, due to the degree of φ(x). Hence, an upper bound on the class is given by

_{p+m}C_m − _{p−2p_o+m}C_m.    (5.7)
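Reading the notation _{n}C_{m} as the binomial coefficient C(n, m), both bounds are easy to evaluate numerically. The following sketch (function names ours) tabulates them for a true separating polynomial of degree p_o = 2, the setting used in the simulations of section 6:

```python
from math import comb

def bound_5_4(p, po, m):
    # equation 5.4: m * (C(p+m-1, m) - C(p-po+m-1, m))
    return m * (comb(p + m - 1, m) - comb(p - po + m - 1, m))

def bound_5_7(p, po, m):
    # equation 5.7: C(p+m, m) - C(p-2*po+m, m), assuming p >= 2*po
    return comb(p + m, m) - comb(p - 2 * po + m, m)

# po = 2: the two bounds coincide for m = 2, while for m = 3
# equation 5.7 is the smaller (tighter) of the two.
for p in range(4, 15, 2):
    print(p, bound_5_4(p, 2, 2), bound_5_7(p, 2, 2),
          bound_5_4(p, 2, 3), bound_5_7(p, 2, 3))
```

For m = 2 and p_o = 2 the two expressions evaluate to the same value for every kernel degree p, in line with the remark in section 6 that the theoretical lines coincide in that case.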
Compare the two bounds, 5.4 and 5.7. It is easily shown that both are polynomials in p of degree (m − 1), with leading coefficients m^2 p_o and 2mp_o, respectively. Hence, equation 5.7 is tighter when p is large, although this comparison is not general, since both bounds have terms of lower degree and equation 5.7 requires the additional assumption p ≥ 2p_o.

6 Computer Simulations

To confirm the validity of the theoretical analyses above, some computer simulations have been carried out. In the experiments below, the kernel-adatron algorithm (Anlauf & Biehl, 1989; Friess, Cristianini, & Campbell, 1998; Nishimori, 2001) is employed as an approximation of the Gibbs algorithm:

1. Initialize α_n, n = 1, 2, …, N, randomly.
2. Calculate the margin of each example (x_n, y_n) by
   z_n = y_n Σ_{n′=1}^{N} α_{n′} y_{n′} K(x_n, x_{n′}).    (6.1)
3. If there exists a negative z_n, update α_n ← α_n − 2z_n and go to step 2.
4. Stop if all z_n are positive.

Note that the distribution of the estimate produced by the kernel-adatron algorithm is unknown and depends on the input vectors, so it is not clear how well it approximates the Gibbs algorithm. However, since the bound 2.8 does not depend on the prior, we can intuitively regard the prior distribution of the true parameter as proportional to the density of the kernel-adatron estimate, so that the two distributions agree. In each trial, the initial values of α_n, n = 1, 2, …, N, are independently drawn from the normal distribution N(0, 1). The input vectors x_n, n = 1, 2, …, N, also independently obey the normal distribution N(0, I_m), where I_m is the m × m identity matrix; note that this is equivalent to the case where x_n is uniformly distributed on the unit hypersphere S^{m−1}. The average of the generalization error is taken over 100 trials.

Figure 5 shows the result in the case of m = 2, where the true separating polynomial is x_1^2/4 + x_2^2 − 1. The x's, o's, and *'s show the experimental values of the average generalization error times the number of examples for N = 1000, N = 3000, and N = 10,000, respectively. The solid line represents the theoretical value given by equations 5.4 and 5.7, which coincide in this case. Since it is known that the generalization error of the Gibbs algorithm is almost two-thirds of the upper bound given in equation 2.8 (Györgyi & Tishby, 1990; Opper & Haussler, 1991; Ikeda & Amari, 1996), two-thirds of equation 5.4 is also drawn as the dashed line for convenience.
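The four steps above can be sketched directly. The toy problem and kernel below are ours, not the exact simulation setup of this section: we use a homogeneous degree-2 kernel on unit-norm inputs so that K(x_n, x_n) = 1, and we sweep over all currently violated examples rather than fixing one at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable problem on the unit circle: label = sign(x1 * x2),
# which is realizable by a homogeneous quadratic kernel (setup ours).
N = 60
theta = rng.uniform(0.1, np.pi / 2 - 0.1, N) + rng.integers(0, 4, N) * np.pi / 2
X = np.column_stack([np.cos(theta), np.sin(theta)])
y = np.sign(X[:, 0] * X[:, 1])

K = (X @ X.T) ** 2                  # polynomial kernel of degree p = 2; K_nn = 1

alpha = rng.standard_normal(N)      # step 1: random initialization
for sweep in range(200):
    z = y * (K @ (alpha * y))       # step 2: margins z_n
    neg = np.flatnonzero(z < 0)
    if neg.size == 0:               # step 4: stop when all margins are positive
        break
    for n in neg:                   # step 3: alpha_n <- alpha_n - 2 z_n
        z_n = y[n] * (K[n] @ (alpha * y))
        if z_n < 0:
            alpha[n] -= 2 * z_n

z = y * (K @ (alpha * y))
print(sweep, (z > 0).mean())        # sweeps used, fraction of positive margins
```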
[Figure 5 here: plot of the coefficient (vertical axis, 0–20) versus the degree of the polynomial kernel (horizontal axis, 0–14).]
Figure 5: Average generalization error versus degree of polynomial kernels (m = 2).
[Figure 6 here: plot of the coefficient (vertical axis, 0–80) versus the degree of the polynomial kernel (horizontal axis, 0–14).]
Figure 6: Average generalization error versus degree of polynomial kernels (m = 3).
This figure shows that the experimental values are bounded from above by the theoretical line regardless of the number of examples, confirming the validity of the asymptotic theory. The results also imply that the average generalization error is smaller than the asymptotic theoretical value when the number of examples is too small for the asymptotic theory to apply; this is an interesting and important phenomenon to be analyzed in the future. The same tendency can be seen in the case of m = 3, shown in Figure 6, where the true separating polynomial is x_1^2/4 + x_2^2 + x_3^2/8 − 1. The x's, o's, and *'s show the experimental values of the average generalization error times the number of examples for N = 1000, N = 3000, and N = 10,000, respectively. The dashed and solid lines represent the theoretical values given by equations 5.4 and 5.7 and the same values multiplied by two-thirds, respectively.

7 Conclusions

The generalization properties of kernel classifiers with a polynomial kernel have been examined. First, it was shown that the generalization error of kernel methods depends not on the dimensions of the input or feature spaces themselves but on how the input space is embedded into the feature space; more specifically, on the class, defined as the dimension of the subspace spanned by the tangent plane at the zero points of the true separating polynomial.
Second, the relationship between the class and the true separating polynomial was discussed. When the true separating polynomial is irreducible (i.e., smaller models cannot separate the inputs correctly), the average generalization error is comparable to that of a dichotomy whose input space has the same dimension as the feature space. When the true separating polynomial is reducible, the average generalization error may be smaller than in the irreducible case, but the class, and bounds on it, remain open problems except for easy cases such as m = 1. When a smaller model can realize a correct separation, the generalization error is small compared to the dimension of the feature space. The validity of these analyses was confirmed by computer simulations.

Let us note finally that the analyses above do not assume the fact, important in kernel methods, that the weight vector is a sum of the given input vectors mapped into the feature space. Hence, the results are also applicable to three-layered polynomial neural networks with polynomial neurons of degree up to p in the hidden layer and a sign neuron in the output layer.
Acknowledgments

This study was supported in part by a Grant-in-Aid for Scientific Research (14084210, 15700130) and by the COE Program (Informatics Research Center for Development of Knowledge Society Infrastructure) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References

Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821–837.
Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6, 161–166.
Amari, S., Fujita, N., & Shinomoto, S. (1992). Four types of learning curves. Neural Computation, 4, 605–618.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–153.
Anlauf, J. K., & Biehl, M. (1989). The AdaTron: An adaptive perceptron algorithm. Europhysics Letters, 10, 687–692.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Dietrich, R., Opper, M., & Sompolinsky, H. (1999). Statistical mechanics of support vector networks. Physical Review Letters, 82(14), 2975–2978.
Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel-adatron algorithm: A fast and simple learning procedure for support vector machines. In Proc. Int'l Conf. Machine Learning. San Francisco: Morgan Kaufmann.
Györgyi, G., & Tishby, N. (1990). Statistical theory of learning a rule. In K. Thuemann & R. Koeberle (Eds.), Neural networks and spin glasses (pp. 31–36). Singapore: World Scientific.
Herbrich, R. (2002). Learning kernel classifiers: Theory and algorithms. Cambridge, MA: MIT Press.
Ikeda, K. (2003). Generalization error analysis for polynomial kernel methods: Algebraic geometrical approach. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial neural networks and neural information processing: ICANN/ICONIP, LNCS 2714 (pp. 201–208). New York: Springer-Verlag.
Ikeda, K. (forthcoming). Geometry and learning curves of kernel methods with polynomial kernels. Systems and Computers in Japan.
Ikeda, K., & Amari, S. (1996). Geometry of admissible parameter region in neural learning. IEICE Trans. Fundamentals, E79-A, 409–414.
Levin, E., Tishby, N., & Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78, 1568–1574.
Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence, 18(2), 203–226.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: Determining the number of parameters for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Nishimori, H. (2001). Statistical physics of spin glasses and information processing: An introduction. Oxford: Oxford University Press.
Opper, M., & Haussler, D. (1991). Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Proc. Ann. Workshop Comp. Learning Theory, 4 (pp. 75–87). San Francisco: Morgan Kaufmann.
Schölkopf, B., Burges, C., & Smola, A. J. (1998). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Smola, A. J., Bartlett, P. L., Schölkopf, B., & Schuurmans, D. (2000). Advances in large margin classifiers. Cambridge, MA: MIT Press.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.

Received April 7, 2003; accepted February 4, 2004.
LETTER
Communicated by Eric Baum
A Neural Root Finder of Polynomials Based on Root Moments De-Shuang Huang [email protected] Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China, and AIMtech Center, Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
Horace H.S. Ip [email protected] Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
Zheru Chi [email protected] Center for Multimedia Signal Processing, Hong Kong Polytechnic University, Hong Kong
This letter proposes a novel neural root finder (NRF) based on the root moment method (RMM) to find the arbitrary roots (including complex ones) of arbitrary polynomials. The NRF was designed on the basis of feedforward neural networks (FNNs) and trained with a constrained learning algorithm (CLA). Specifically, we incorporate a priori information about the root moments of polynomials into the conventional backpropagation algorithm (BPA) to construct a new CLA. The resulting NRF is shown to be able to rapidly estimate the distributions of the roots of polynomials. We study and compare the advantages of the RMM-based NRF over the previous root coefficient method-based NRF, the traditional Muller and Laguerre methods, and the Mathematica Roots function, as well as the behaviors, accuracies, and training speeds of two specific structures of this FNN root finder: the log Σ and Σ−Π FNNs. We also analyze, theoretically and experimentally, the effects of the three controlling parameters {δP_0, θ_p, η} of the CLA on the two NRFs. Finally, we present computer simulation results to support our claims.

Neural Computation 16, 1721–1762 (2004). © 2004 Massachusetts Institute of Technology

1 Introduction

There exist many root-finding problems in practical applications such as filter design, image fitting, speech processing, and encoding and decoding in communication (Aliphas, Narayan, & Peterson, 1983; Hoteit, 2000; Schmidt & Rabiner, 1977; Steiglitz & Dickinson, 1982; Thomas, Arani, & Honary, 1997). Although these problems could be solved using many traditional root-finding methods, high accuracy and fast processing speed are very difficult to achieve simultaneously because of the trade-offs between speed and accuracy in the design of existing root finders (Hoteit, 2000; Thomas et al., 1997; Lang & Frenzel, 1994). Specifically, many traditional root-finding methods need to obtain the initial root distributions before iterating. Moreover, it is well known that the traditional methods can find the roots only one by one, that is, by deflation: the next root is obtained from the deflated polynomial after the previous root has been found (William, Saul, William, & Brian, 1992). This means that traditional root-finding algorithms are inherently sequential, so increasing the number of processors will not increase the speed of finding the solutions. In addition, root finding by deflation cannot guarantee the accuracy of the roots estimated from the reduced polynomials, which is greatly affected by the deflation process. Recently, we showed that feedforward neural networks (FNNs) can be formulated to find the roots of polynomials (Huang, 2000; Huang, Ip, Chi, & Wong, 2003). The advantage of the neural approach is that it provides more flexible structures and suitable learning algorithms for root-finding problems. More importantly, the neural root finder (NRF) presented in Huang (2000) and Huang et al. (2003) exploits the parallel structure of neural computing to obtain all the roots simultaneously, resulting in a very efficient solution to root-finding problems. Obviously, if the neural root-finding approach can be cast into optimized software or functions, and in particular if a new neural parallel-processing hardware system with many interconnected processing nodes, like the brain, is developed in the future, the designed NRFs will without question surpass any traditional non-neural root finder in speed and accuracy.
Hence, such a novel approach to the root-finding problem is a vitally important research topic in the field of neural and intelligent computation. Briefly, the idea of using FNNs to find the roots of polynomials is to factorize the polynomial into many subfactors on the outputs of the hidden layer of the network. The connection weights (i.e., the roots) from the input layer to the hidden layer are then trained with a suitable learning algorithm until the defined error between the actual output and the desired output (the polynomial) converges to a given accuracy. The converged connection weights obtained in this way are the roots of the underlying polynomial. For instance, consider an arbitrary n-order polynomial f(x),

f(x) = a_0 x^n + a_1 x^{n−1} + … + a_{n−1} x + a_n,    (1.1)

where n ≥ 2 and a_0 ≠ 0. Without loss of generality, the leading coefficient a_0 of x^n is usually set to 1. Supposing that equation 1.1 has n approximate roots (real or complex), it can be factorized as

f(x) = x^n + a_1 x^{n−1} + … + a_{n−1} x + a_n ≈ ∏_{i=1}^{n} (x − w_i),    (1.2)
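Equation 1.2 is exactly what the converged weights w_i realize. A quick numerical check of this root-coefficient correspondence on a toy cubic (values ours), using NumPy's polynomial helpers:

```python
import numpy as np

# Hypothetical cubic with known roots; np.poly rebuilds the monic
# coefficients [1, a1, a2, a3] of equation 1.1 from the w_i of equation 1.2.
true_roots = np.array([2.0, -1.0, 0.5])
coeffs = np.poly(true_roots)          # f(x) = x^3 + a1 x^2 + a2 x + a3

# Evaluating the product form (1.2) matches the power form (1.1).
x = 1.7
prod_form = np.prod(x - true_roots)
power_form = np.polyval(coeffs, x)
print(coeffs, prod_form, power_form)
```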
where w_i is the ith root of f(x). To design a neural network (i.e., a feedforward neural network) model for finding the roots of polynomials, we expand the polynomial into an FNN and use the FNN to express or approximate the polynomial, so that the hidden-layer weights of the FNN represent the coefficients of the individual linear monomial factors such as 1 or x. As a result, a two-layered FNN model for finding the roots of polynomials can be constructed, as shown in Figure 1. This model, which is somewhat similar to the Σ−Π neural network structure (Hormis, Antoniou, & Mentzelopoulou, 1995), is in essence a one-layer linear network extended by a sum-product (Σ−Π) unit. The network has two input nodes corresponding to the terms 1 and x, n hidden nodes forming the differences between the input x and the connection weights w_i, and one output product node that multiplies the outputs of the n hidden nodes. Only the weights between the input node clamped at value 1 and the hidden nodes need to be trained; the weights between the input x and the hidden nodes, and those between the hidden nodes and the output node, are fixed at 1. In mathematical terms,
[Figure 1 here: a two-layer network with input nodes 1 and x, hidden nodes holding the trainable weights w_1, …, w_n, and a single product output node.]
Figure 1: Feedforward neural network architecture for finding the roots of polynomials.
the output of the ith hidden node can be represented as

y_i = x − w_i · 1 = x − w_i,    (1.3)

where the w_i (i = 1, 2, …, n) are the network weights (the roots of the polynomial). The output node, which multiplies the outputs of the hidden layer, computes

ŷ(x) = ∏_{i=1}^{n} y_i = ∏_{i=1}^{n} (x − w_i).    (1.4)

The outer-supervised signal defined at the output of this network model is the polynomial f(x). This NRF structure is referred to here as the Σ−Π model. In fact, the product in equation 1.4 can also be carried out by means of the logarithmic operator, which gives

ȳ(x) = ln |ŷ(x)| = Σ_{i=1}^{n} ln |x − w_i|.    (1.5)
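The two network outputs in equations 1.4 and 1.5 agree up to the logarithm of the magnitude; a minimal check (toy weights and probe point ours):

```python
import numpy as np

def sigma_pi(x, w):
    """Equation 1.4: product of hidden-node differences (x - w_i)."""
    return np.prod(x - w)

def log_sigma(x, w):
    """Equation 1.5: sum of log-magnitudes of the same differences."""
    return np.sum(np.log(np.abs(x - w)))

# Toy weights (hypothetical roots) and a complex probe point.
w = np.array([1.0, -0.5, 0.25 + 0.5j])
x = 0.8 + 0.3j
print(np.log(np.abs(sigma_pi(x, w))), log_sigma(x, w))
```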
As a result, another NRF model, structurally similar to the factorization network model proposed by Perantonis, Ampazis, Varoufakis, and Antoniou (1998), can easily be derived. In this model, the hidden nodes, which in the Σ−Π model compute the linear differences between the input x and the connection weights w, become nonlinear nodes with a logarithmic activation function, and the output node performs a linear summation instead of a multiplication. This NRF structure is therefore referred to as the log Σ model (Huang, 2000). For such NRFs, the training algorithm is generally the traditional backpropagation algorithm (BPA) of gradient-descent type, which has a very slow training speed and often leads to unsatisfactory solutions unless a suitably chosen set of initial connection weights is given in advance (Huang, 2000). Although initial connection weights can be derived by separating the different roots of the polynomial by means of polynomial theory, it still takes a long time to obtain the upper and lower bounds of the roots, especially for higher-order polynomials (Huang, 2000; Huang & Chi, 2001). To alleviate this difficulty, we adopt the idea of a constrained learning algorithm (CLA), first proposed by Karras and Perantonis (1995), and incorporate into the BPA the a priori information about the relationship between the roots (the connection weights) and the coefficients of a polynomial. This facilitates the learning process and leads to better solutions. As a result, we developed a CLA for finding the roots of a polynomial (Huang & Chi, 2001), which did speed up convergence and
improve the accuracy of the estimated roots relative to previous results (Huang, 2000). A key point is that this CLA is not sensitive to the initial weights (roots). It was also observed, however, that a large share of the computation in this CLA is spent evaluating the constrained relation between the roots and the coefficients of the polynomial, together with the corresponding derivatives, which is of computational complexity of order O(2^n). This problem becomes more serious for very high-order polynomials. Therefore, a method that requires fewer computations of the constrained relation between roots and coefficients will significantly reduce training time. In light of the fundamental idea of CLAs, a constrained learning method can instead be built on the constrained relation between the root moments (Stathaki & Constantinides, 1995; Stathaki, 1998; Stathaki & Fotinopoulos, 2001) and the coefficients of the polynomial. Indeed, we have proved that the corresponding computational complexity of this root moment method (RMM), of order O(nm^3), is much lower than that of the root coefficient method (RCM) (Huang, Chi, & Siu, 2000). As a result, we consider that the roots of high-order polynomials can be found by this approach. In this article, we present the constrained relation between the root moments and the coefficients of a polynomial, derive the corresponding CLA, and compare the computational complexities of the RMM-based NRF (RMM-NRF) and the RCM-based NRF (RCM-NRF). It has been found that the neural methods have a significant advantage over nonneural methods such as the Muller and Laguerre methods and the Mathematica Roots function, and that the RMM-NRF is significantly faster than the RCM-NRF. We then focus on how to apply this new RMM-NRF to find the arbitrary roots of arbitrary polynomials, and we discuss and compare the performance of two specific structures of the NRFs: the Σ−Π and log Σ models.
It was found that the overall performance of the log Σ model is better than that of the Σ−Π model. Moreover, it was also found that the NRFs rarely encounter local minima on the error surface if all the a priori information from the polynomial has been appropriately encoded into the CLA. In addition, to make the CLA more convenient for solving practical problems, we investigate the effects of its three controlling parameters on the performance of the two NRFs. The simulation results show that the training speed improves as the values of the three controlling parameters increase, while the estimated root accuracies and variances remain almost unchanged.

Section 2 of this letter presents the constrained relation between the root moments and the coefficients of polynomials and derives the corresponding constrained learning algorithm, based on root moments, for finding the roots of polynomials. In section 3, the computational complexities of the two NRFs, based on the RMM and the RCM, as well as of the traditional methods, are discussed, and the performance of the neural root-finding method is compared with that of the nonneural methods. Experimental results are reported and discussed in section 4. Section 5 presents some concluding remarks.

2 Complex Constrained Learning Algorithm Based on Root Moments of Polynomials

2.1 The Root Moments of a Polynomial. We have previously discussed the fundamental constrained relationship, referred to as the root coefficient relation, between the (real or complex) roots and the coefficients of polynomials, and formulated this relation into the conventional BPA to obtain the corresponding CLA for finding the roots of polynomials (Huang & Chi, 2001; Huang et al., 2003). There exists another important and well-known constrained relationship, referred to as the root moment relation, between the roots and the root moments of a polynomial, first formulated by Sir Isaac Newton; the resulting relationship is known as the Newton identity (Stathaki & Constantinides, 1995; Stathaki, 1998; Stathaki & Fotinopoulos, 2001). The root moment of a polynomial is defined as follows:

Definition 1. For an n-order polynomial described by equation 1.1, assume that the corresponding n roots are w_1, w_2, …, w_n. Then the m (m ∈ Z) order root moment of the polynomial is defined as

S_m = w_1^m + w_2^m + … + w_n^m = Σ_{i=1}^{n} w_i^m.    (2.1)

Obviously, the S_m are possibly complex numbers that depend on the w_i. Furthermore, S_0 = n and dS_m/dw_i = m w_i^{m−1}. According to this definition of the root moment, a recursive relationship between the m-order root moment and the coefficients of the polynomial can be obtained as follows. First, for the case of m ≥ 0, we have (Stathaki, 1998):

S_1 + a_1 = 0
S_2 + a_1 S_1 + 2a_2 = 0
⋮
S_m + a_1 S_{m−1} + … + m a_m = 0,  (m ≤ n)
S_m + a_1 S_{m−1} + … + a_n S_{m−n} = 0,  (m > n).    (2.2)

Second, for the case of m < 0, we have:

a_n S_{−1} + a_{n−1} = 0
⋮
a_n S_m + a_{n−1} S_{m+1} + … + |m| a_{n+m} = 0,  (m ≥ −n, a_0 = 1)
a_n S_m + a_{n−1} S_{m+1} + … + S_{m+n} = 0,  (m < −n).    (2.3)

In fact, it can be proven that the following two formulas hold:

S_m + a_1 S_{m−1} + … + a_n S_{m−n} = 0,  (m > n),    (2a.2)

and

a_n S_m + a_{n−1} S_{m+1} + … + S_{m+n} = 0,  (m < −n).    (2a.3)

Consequently, for |m| > n (outside the polynomial coefficients window), we obtain a unified recursive relation:

S_m + a_1 S_{m−1} + … + a_n S_{m−n} = 0,  (|m| > n).    (2.4)
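The recursions above are easy to verify numerically: computing S_m directly from the roots (equation 2.1) and recursively from the coefficients (equations 2.2 and 2.4) gives the same values. A sketch (function names ours, monic polynomial with a_0 = 1):

```python
import numpy as np

def root_moments(roots, m_max):
    """Direct power sums S_m = sum_i w_i^m, m = 1..m_max (equation 2.1)."""
    return [np.sum(roots ** m) for m in range(1, m_max + 1)]

def moments_from_coeffs(a, m_max):
    """S_m via the Newton identities 2.2/2.4; a = [a_1, ..., a_n]."""
    n = len(a)
    S = []
    for m in range(1, m_max + 1):
        # S_m + a_1 S_{m-1} + ... + a_{m-1} S_1 + m a_m = 0   (m <= n)
        # S_m + a_1 S_{m-1} + ... + a_n S_{m-n} = 0           (m > n)
        s = -sum(a[k] * S[m - 2 - k] for k in range(min(m - 1, n)))
        if m <= n:
            s -= m * a[m - 1]
        S.append(s)
    return S

roots = np.array([2.0, -1.0, 0.5])
a = np.poly(roots)[1:]               # [a_1, a_2, a_3] of the monic cubic
print(root_moments(roots, 6))
print(moments_from_coeffs(a, 6))
```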
The recursive relationships in equations 2.2 and 2.3 are known as the Newton identities. From these Newton identities, we obtain the following theorem:

Theorem 1. Suppose that an n-order polynomial described by equation 1.1 is known. Then a set of root moments {S_m, m = 1, 2, …, n} can be uniquely determined recursively by equation 2.2. Conversely, given n root moments {S_m, m = 1, 2, …, n}, an n-order polynomial described by equation 1.1 can be uniquely determined recursively by equation 2.2.

For the case of m < 0, a similar theorem can be stated:

Theorem 2. Suppose that an n-order polynomial described by equation 1.1 is known. Then a set of root moments {S_m, m = −1, −2, …, −n} can be uniquely determined recursively by equation 2.3. Conversely, given n root moments {S_m, m = −1, −2, …, −n}, an n-order polynomial described by equation 1.1 can be uniquely determined recursively by equation 2.3.

In fact, we can also derive the root moment from the Cauchy residue theorem:

S_m = (1/2πj) ∮_Γ Σ_{i=1}^{n} z^m/(z − r_i) dz,    (2.5)

where Γ is a closed contour, defined as z = ρ(θ)e^{jθ}, that contains the roots of the required factor of f(z). From equation 1.1, we have

f′(z) = Σ_{i=1}^{n} f(z)/(z − r_i).    (2.6)

Hence, equation 2.5 can be rewritten as

S_m = (1/2πj) ∮_Γ (f′(z)/f(z)) z^m dz.    (2.7)
From the above definition of the root moment, we have the following corollary:

Corollary 1. The root moments of the product f(z) = f_1(z) f_2(z) and the ratio f(z) = f_1(z)/f_2(z) are, respectively,

S_m^{f(z)} = S_m^{f_1(z)} + S_m^{f_2(z)},    (2.8)
S_m^{f(z)} = S_m^{f_1(z)} − S_m^{f_2(z)}.    (2.9)
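Since the roots of a product f_1(z)f_2(z) are the union of the roots of the two factors, equation 2.8 can be checked in a few lines (toy factors ours):

```python
import numpy as np

def S(roots, m):
    """m-th root moment of the polynomial with the given roots."""
    return np.sum(np.asarray(roots, dtype=complex) ** m)

# Moments of the product are the sums of the factors' moments (equation 2.8).
r1 = [2.0, -1.0]
r2 = [0.5, 0.25]
for m in range(1, 5):
    print(m, S(r1 + r2, m), S(r1, m) + S(r2, m))
```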
Moreover, we can derive another corollary about the asymptotic behavior of the root moments as the order m becomes sufficiently large or small:

Corollary 2. Assume that r_max and r_min are, respectively, the maximum- and minimum-modulus roots of f(z). Then

lim_{m→∞} S_m = r_max^m,    (2.10)
lim_{m→−∞} S_m = r_min^m.    (2.11)

That is, when m is sufficiently large, the root r_max with the maximum modulus dominates the sum S_m, while the root r_min with the minimum modulus dominates S_m when m is sufficiently small (large and negative). In addition, it is easy to derive a further corollary about the reciprocals of the roots:

Corollary 3. If a new polynomial is defined as g(z) = z^n f(z^{−1}), then its roots are the reciprocals of those of the original polynomial f(z).

These corollaries and conclusions form the basis for using the root moments of a polynomial to find its roots. In the following, we derive the corresponding complex CLA based on the root moment constraints implicit in polynomials, and use it to train the NRFs for finding the arbitrary roots of arbitrary polynomials.

2.2 Complex Constrained Learning Algorithm for Finding the Arbitrary Roots of an Arbitrary Polynomial. From Huang (2000), Huang et al. (2000), and Huang et al. (2003), it can be seen that the root-finding method based on the CLA is clearly faster than the simple BP learning rule, even when the latter uses carefully selected initial synaptic weights. Therefore, in the following, we deduce this CLA and extend it to a more general complex version.
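For reference, the simple (unconstrained) BP rule for the Σ−Π model amounts to gradient descent on the squared error between f(x_p) and ∏_i(x_p − w_i). The sketch below (setup and names ours) uses the standard complex least-squares gradient and a warm start near the true roots; it illustrates the baseline that the CLA accelerates, not the CLA itself:

```python
import numpy as np

rng = np.random.default_rng(1)

true_roots = np.array([1.0, -0.5, 0.25])
P = 32
x = np.exp(2j * np.pi * np.arange(P) / P)             # training points on |x| = 1
o = np.array([np.prod(xp - true_roots) for xp in x])  # targets f(x_p)

w = true_roots.astype(complex) + 0.05 * rng.standard_normal(3)  # warm start

def outputs(w):
    return np.array([np.prod(xp - w) for xp in x])

def error(w):
    return np.mean(np.abs(o - outputs(w)) ** 2) / 2

e0 = error(w)
eta = 0.1
for _ in range(300):
    e = o - outputs(w)
    for i in range(len(w)):
        # complex (Wirtinger-style) gradient step for the Sigma-Pi model;
        # the product over j != i is the factor appearing in equation 2.13
        prod_excl = np.array([np.prod(np.delete(xp - w, i)) for xp in x])
        w[i] -= eta * np.mean(e * np.conj(prod_excl))
print(e0, error(w))
```

With a warm start the plain rule converges; from arbitrary initial weights it is slow and unreliable, which is the motivation for the constrained algorithm derived below.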
Suppose that there are P training patterns selected from the complex region |x| < 1. An error cost function (ECF) is defined at the output of the FNN root finder:

E(w) = (1/2P) Σ_{p=1}^{P} |e_p(w)|^2 = (1/2P) Σ_{p=1}^{P} |o_p − y_p|^2,    (2.12)
where w is the set of all connection weights in the network model; o_p = f(x_p) or ln f(x_p) denotes the target (outer-supervised) signal to be rooted; y_p = ∏_{i=1}^{n}(x_p − w_i) or Σ_{i=1}^{n} ln|x_p − w_i| represents the actual output of the network; and p = 1, 2, …, P indexes the training patterns. These variables and functions are in general complex, so the following derivations must observe the rules for complex variables. An arbitrary complex variable is written z = x + iy, where x and y are the real and imaginary parts of z and i = √−1. For convenience in deducing the learning algorithm, we define the derivative of a real-valued function with respect to a complex variable:

Definition 2. Assume that a real-valued function U(w) is a function of the complex variable w, with real and imaginary parts w_1 and w_2. Then the derivative of U(w) with respect to w is defined as ∂U(w)/∂w = ∂U(w)/∂w_1 + i ∂U(w)/∂w_2.

Consequently, the BPA based on gradient descent can easily be deduced from E(w). Taking the partial derivative of E(w) with respect to w_i, we obtain

J_i = ∂E(w)/∂w_i = (1/P) Σ_{p=1}^{P} e_p(w) · ∏_{j≠i} (x_p − w_j)  or  (1/P) Σ_{p=1}^{P} e_p(w) · (x_p − w_i)/|x_p − w_i|^2,    (2.13)

for the Σ−Π and log Σ models, respectively.
As a result, the BPA based on gradient descent is described by

dw_i = −η J_i,    (2.14)

where dw_i = w_i(k) − w_i(k−1) denotes the difference between the current and previous weights, w_i(k) and w_i(k−1). The constrained conditions based on the root moments defined in equations 2.2 to 2.4 can be written uniformly as

Φ = 0,    (2.15)

where Φ = [Φ_1, Φ_2, …, Φ_m]^T (m ≤ n; T denotes the transpose of a vector or matrix) is a vector composed of the constraint conditions of any one of equations 2.2 to 2.4. Since the ECF possibly contains many long, narrow troughs (Hush, Horne, & Salas, 1992), a constraint on the updated connection weights is imposed in order to avoid missing the global minimum: the sum of the squared absolute values of the individual weight changes takes a predetermined positive value (δP)^2 (δP > 0 is a constant),

Σ_{i=1}^{n} |dw_i|^2 = (δP)^2.    (2.16)
This means that at each epoch, the search for an optimal new point in weight space is restricted to a small hypersphere of radius δP centered at the current weight vector. If δP is small enough, the changes in E(w) and in Φ induced by the weight changes can be approximated by the first differentials, dE(w) and dΦ. In order to derive the CLA based on the constraint conditions of equations 2.2 (or 2.3) and 2.16, assume that dΦ equals a predetermined vector quantity δQ designed to bring Φ closer to its target (zero). The objective of the learning process is to ensure that the maximum possible change in |dE(w)| is achieved at each epoch. Usually, the maximization of |dE(w)| can be carried out analytically by introducing suitable Lagrange multipliers: a vector V = [v_1, v_2, …, v_m]^T of multipliers for the constraints in equation 2.15, and another multiplier μ for equation 2.16. Introducing the function ε, dε can be expanded as follows:
$$d\varepsilon = \sum_{i=1}^{n} J_i\, dw_i + \left(\delta Q^H - \sum_{i=1}^{n} dw_i\, F_i^H\right) V + \mu \left[(\delta P)^2 - \sum_{i=1}^{n} |dw_i|^2\right], \qquad (2.17)$$
where $F_i = [F_i^{(1)}, F_i^{(2)}, \ldots, F_i^{(m)}]^T$, $F_i^{(j)} = \partial \Phi_j / \partial w_i$ $(i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, m)$, and $m\ (\le n)$ denotes the number of constraint conditions in equation 2.15. Further, to maximize $|d\varepsilon|$ (in fact, minimize $d\varepsilon$) at each epoch, we demand that
$$d^2\varepsilon = \sum_{i=1}^{n} \left(J_i - F_i^H V - 2\mu\, dw_i\right) d^2 w_i = 0, \qquad (2.18)$$

$$d^3\varepsilon = -2\mu \sum_{i=1}^{n} (d^2 w_i)^2 < 0. \qquad (2.19)$$
A Neural Root Finder of Polynomials Based on Root Moments
As a result, the coefficients of $d^2 w_i$ in equation 2.18 should vanish, that is,

$$dw_i = \frac{J_i - F_i^H V}{2\mu}, \qquad (2.20)$$
where the values of the Lagrange multipliers $\mu$ and $V$ can be readily evaluated from equations 2.16 and 2.20, while the condition $(\delta Q^H - \sum_{i=1}^{n} dw_i F_i^H)V = 0$ is embodied in equation 2.15, with the following results:

$$\mu = -\frac{1}{2}\left[\frac{I_{JJ} - I_{JF}^H I_{FF}^{-1} I_{JF}}{(\delta P)^2 - \delta Q^H I_{FF}^{-1}\, \delta Q}\right]^{1/2}, \qquad (2.21)$$

$$V = -2\mu I_{FF}^{-1}\, \delta Q + I_{FF}^{-1} I_{JF}, \qquad (2.22)$$

where $I_{JJ} = \sum_{i=1}^{n} |J_i|^2$ is a scalar and $I_{JF}$ is a vector whose components are defined by

$$I_{JF}^{(j)} = \sum_{i=1}^{n} J_i F_i^{(j)}, \qquad j = 1, 2, \ldots, m. \qquad (2.23)$$
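These update formulas can be exercised numerically. Below is a minimal sketch (the function name `cla_step` and the random toy problem are our own; complex conjugation is dropped by restricting to real-valued quantities) of one constrained update combining equations 2.20 to 2.22 with the choice $\delta Q = -k\Phi$ described later in this section, where $I_{FF}$ is the $m \times m$ matrix of inner products $\sum_i F_i^{(j)} F_i^{(k)}$:

```python
import numpy as np

def cla_step(J, F, Phi, dP, eta):
    """One constrained-learning update (sketch of Eqs. 2.20-2.22),
    restricted to real-valued quantities for clarity.

    J   : (n,) gradient J_i of the ECF w.r.t. the weights
    F   : (n, m) matrix with F[i, j] = dPhi_j / dw_i
    Phi : (m,) constraint values, to be driven toward zero
    dP  : hypersphere radius deltaP for this epoch
    eta : free parameter, 0 < eta < 1
    """
    IFF = F.T @ F                      # I_FF, an m x m matrix
    IJF = F.T @ J                      # I_JF, an m-vector
    IJJ = J @ J                        # I_JJ, a scalar
    IFF_inv = np.linalg.inv(IFF)
    # deltaQ = -k * Phi with k = eta * dP / sqrt(Phi^T IFF^-1 Phi)
    k = eta * dP / np.sqrt(Phi @ IFF_inv @ Phi)
    dQ = -k * Phi
    xi = IJJ - IJF @ IFF_inv @ IJF     # nonnegative by Cauchy-Schwarz
    mu = -0.5 * np.sqrt(xi / (dP**2 - dQ @ IFF_inv @ dQ))   # Eq. 2.21
    V = -2.0 * mu * (IFF_inv @ dQ) + IFF_inv @ IJF          # Eq. 2.22
    return (J - F @ V) / (2.0 * mu)                         # Eq. 2.20

rng = np.random.default_rng(0)
J = rng.normal(size=6)
F = rng.normal(size=(6, 3))
Phi = rng.normal(size=3)
dw = cla_step(J, F, Phi, dP=0.1, eta=0.5)
```

One can check that the resulting step satisfies both imposed constraints exactly: $\|dw\| = \delta P$ and $F^T dw = \delta Q$.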
Specifically, $I_{FF}$ is a matrix whose elements are defined by $I_{FF}^{jk} = \sum_{i=1}^{n} F_i^{(j)} F_i^{(k)}$ $(j, k = 1, 2, \ldots, m)$. Obviously, there are $(m+1)$ parameters, $\delta P$ and $\delta Q_j$ $(j = 1, 2, \ldots, m)$, that need to be set before the learning process begins. Parameter $\delta P$ is often adaptively selected as (Huang, 2001)

$$\delta P(t) = \delta P_0 \left(1 - e^{-\theta_p / t}\right), \qquad \delta P(0) = \delta P_0, \qquad (2.24)$$

where $\delta P_0$ is the initial value of $\delta P$, usually chosen as a larger value; $t > 0$ is the time index of training; and $\theta_p$ is the scale coefficient of the time index $t$, usually set as $\theta_p > 1$. The vector parameters $\delta Q_j$ $(j = 1, 2, \ldots, m)$ are generally selected proportional to $\Phi_j$, that is, $\delta Q_j = -k \Phi_j$ $(j = 1, 2, \ldots, m$ and $k > 0)$, which ensures that the constraints $\Phi$ move toward zero at an exponential rate as training progresses. From equation 2.19, we note that $k$ should satisfy $k \le \delta P\, (\Phi^H I_{FF}^{-1} \Phi)^{-1/2}$. In practice, the simplest choice is $k = \eta\, \delta P / \sqrt{\Phi^H I_{FF}^{-1} \Phi}$, where $0 < \eta < 1$ is another free parameter of the algorithm besides $\delta P$.

3 Computational Complexity Estimates of the RMM-NRF and the RCM-NRF

In this section, we compare the computational complexities of our proposed RM-based root-finding method with the original RC-based root-finding
method, as well as traditional root finders such as Muller's and Laguerre's methods (William et al., 1992; Anthony & Philip, 1978), and show theoretically and experimentally that the RMM-NRF is significantly faster than the RCM-NRF and those traditional root finders. It can be seen from the CLA that at each iterative epoch, we have to compute the values of the constraint conditions $\Phi$ and their derivatives $\partial \Phi / \partial w$, which sharply dominate the CLA computations (multiplication or division operations). In order to estimate the computational complexity of the RMM in multiplication or division operations, we can obtain from the constraint conditions of equation 2.2 the derivatives $\partial \Phi / \partial w = [F^{(1)}, F^{(2)}, \ldots, F^{(n)}]^T$ as follows:

$$F^{(1)} = [1, 1, \ldots, 1]^T,$$
$$\vdots$$
$$F^{(n)} = \left[n\lambda_1^{n-1} + (n-1)a_1 \lambda_1^{n-2} + \cdots + a_{n-1},\ \ n\lambda_2^{n-1} + \cdots + a_{n-1},\ \ldots,\ \ n\lambda_n^{n-1} + (n-1)a_1 \lambda_n^{n-2} + \cdots + a_{n-1}\right]^T. \qquad (3.1)$$
Consequently, we can estimate the number of multiplication operations needed at each epoch for computing the constraint conditions of equation 2.2, $\Phi$, and their derivatives, $\partial \Phi / \partial w$, as stated in the following remark (Huang et al., 2000; Huang, 2004):¹

Remark 1. At each epoch for finding the complex roots of a given arbitrary polynomial $f(x)$ of order $n$ based on the constrained learning neural networks using the $m$ constraint conditions ordered from the first one of equation 2.2, the estimate of the number of multiplication operations needed for computing $\Phi$ and $\partial \Phi / \partial w$, $CE_{RMM}(n, m)$, is

$$CE_{RMM}(n, m) = \frac{2}{3}(m-1)\left(nm^2 + 10nm + 3m - 12n + 6\right). \qquad (3.2)$$
Obviously, $CE_{RMM}(n, m)$ is of the order $O(nm^3)$ multiplication operations. Specifically, when $m = n$, equation 3.2 becomes $CE_S(n, n) = \frac{2}{3}(n-1)(n^3 + 10n^2 - 9n + 6)$. In addition, for the RCM, we can obtain $\partial \Phi / \partial w = [F^{(1)}, F^{(2)}, \ldots, F^{(m)}]^T$ from the constraint conditions on the relation between the roots and coefficients of the polynomial (see Huang, 2000) as follows:

$$F^{(1)} = [1, 1, \ldots, 1]^T,$$
$$F^{(2)} = \Big[\sum_{i \ne 1}^{n} w_i,\ \sum_{i \ne 2}^{n} w_i,\ \ldots,\ \sum_{i \ne n}^{n} w_i\Big]^T,$$
$$\vdots$$
$$F^{(m)} = \Big[\prod_{i \ne 1}^{m} w_i,\ \prod_{i \ne 2}^{m} w_i,\ \ldots,\ \prod_{i \ne n}^{m} w_i\Big]^T. \qquad (3.3)$$

¹ Here, the most general case of complex computations, which require four times the corresponding real computations, is considered.
As a result, the estimated number of multiplication operations at each epoch for a given polynomial $f(x)$ of order $n$ is stated as follows (Huang et al., 2000; Huang, in press):

Remark 2. At each epoch for finding the complex roots of a given arbitrary polynomial $f(x)$ of order $n$ based on the constrained learning neural networks using the $m$ constraint conditions ordered from the first one of equation 3.3, the estimate of the number of multiplication operations needed for computing $\Phi$ and $\partial \Phi / \partial w$, $CE_{RCM}(n, m)$, is

$$CE_{RCM}(n, m) = 4\left(\sum_{j=0}^{m} j^2 C_n^j - \sum_{j=0}^{m} j C_n^j - \sum_{j=0}^{m} C_n^j + n + 1\right). \qquad (3.4)$$

Obviously, $CE_{RCM}(n, m)$ is of the order $O(C_n^m m^2)$ multiplication operations, which is considerably higher than that of the RMM. Specifically, when $m = n$, equation 3.4 becomes $CE_{RCM}(n, n) = 4[(n^2 - n - 4)2^{n-2} + n + 1]$, which is of the order $O(2^n)$ multiplication operations. If we define the ratio $r_c = CE_{RCM}(n, n) / CE_{RMM}(n, n)$, it can be readily deduced that
$$\lim_{n \to \infty} r_c = \lim_{n \to \infty} \frac{CE_{RCM}(n, n)}{CE_{RMM}(n, n)} = \infty. \qquad (3.5)$$

Table 1 compares the computational complexities of the RCM-NRF and the RMM-NRF versus the polynomial order $n$. From the results, it is easily found that the RMM-NRF has lower computational cost as $n$ increases. However, when $n$ is small (e.g., $n \le 5$), the conclusion is the opposite. Therefore, it is for higher-order polynomials that the RMM-NRF exhibits its strong potential in computational complexity. In addition, the computational complexities of traditional root finders such as Muller's and Laguerre's methods (Mourrain, Pan, & Ruatta, 2003; Milovanovic & Petkovic, 1986) are generally of the order $O(3^n)$,
Table 1: Comparisons of Computational Complexities of the Original RCM and the RMM.

n               1     2     3     4     5     6      7      8       9       10      11       12
CE_RCM(n, n)    0     4     32    148   536   1,692  4,896  13,348  34,856  88,108  217,136  524,340
CE_RMM(n, n)    0     24    128   388   896   1,760  3,104  5,068   7,808   11,496  16,320   22,484
while the computational complexity of the fastest methods, like Jenkins-Traub, is of the order $O(n^4)$. This is substantially higher than that of our proposed RMM, especially our derived recursive RMM (see Huang, 2004). To further verify the correctness of these theoretical analyses, we take a seventh-order polynomial, $f(x) = x^7 + (1.2 - 0.5i)x^6 + (-6.5 + 1.4i)x^4 + 2.93ix^2 + (1.7 + 2.4i)x + 2.4 - 0.3i$, as an example to compare the performance of the RCM-NRF and the RMM-NRF. Here, we consider only the case of the $\log \Sigma$ model. Assume that the controlling parameters of the CLA for the two methods are both selected as $\delta P_0 = 2.0$, $\theta_p = 5.0$, and $\eta = 0.6$, and let the termination error be $e_r = 1.0 \times 10^{-8}$. The training sample pairs $(x, f(x))$ are drawn from the complex domain $|x| < 1$, where the total number of training samples $P$ is fixed at 100. The experiments were run on a Pentium III with a CPU clock of 795 MHz and 256 MB of RAM, with the programs coded in Visual Fortran 90. After the NRFs respectively trained by these two methods converged, it was found that the RCM-NRF and the RMM-NRF took 9196 and 2783 iterations, respectively, with corresponding CPU times of 277 seconds and 41 seconds. That is, the training speed of the RMM is almost seven times that of the RCM, which completely supports our theoretical analyses. Figure 2 depicts the learning error curves (LEC) of the two methods, and Figure 3 shows the LEC for the BPA-based NRF in the case of the learning coefficient $\eta = 0.1$. From Figures 2 and 3, it can be seen that the two CLAs are significantly faster than the BPA, and the RMM-based CLA is substantially faster than the RCM-based CLA. In addition, we use the same parameters as above but with the termination accuracy $e_r = 1.0 \times 10^{-25}$ to compare the performance of the RCM-NRF and the RMM-NRF, as well as the two nonneural methods of Muller and Laguerre.
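The complexity estimates in equations 3.2 and 3.4 can be checked numerically against Table 1. A quick sketch (the helper names `ce_rmm` and `ce_rcm` are ours):

```python
from math import comb

def ce_rmm(n, m):
    """Eq. 3.2: multiplications per epoch for the root-moment method.
    The product 2*(m-1)*(...) is always divisible by 3, so // is exact."""
    return 2 * (m - 1) * (n * m**2 + 10 * n * m + 3 * m - 12 * n + 6) // 3

def ce_rcm(n, m):
    """Eq. 3.4: multiplications per epoch for the root-coefficient method."""
    s0 = sum(comb(n, j) for j in range(m + 1))
    s1 = sum(j * comb(n, j) for j in range(m + 1))
    s2 = sum(j * j * comb(n, j) for j in range(m + 1))
    return 4 * (s2 - s1 - s0 + n + 1)

# Setting m = n reproduces the two rows of Table 1; the ratio r_c grows
# rapidly with n, as equation 3.5 asserts.
table = [(n, ce_rcm(n, n), ce_rmm(n, n)) for n in range(1, 13)]
```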
Here, the important point is that Muller's and Laguerre's methods were also coded in Visual Fortran 90, following formulas 9.5.2 to 9.5.3 and 9.5.4 to 9.5.11 in William et al. (1992), respectively. Moreover, to evaluate the different root finders statistically, the experiments were repeated 10 times with different initial weight values drawn from the uniform distribution on $[-1, 1]$ (for the two nonneural methods of Muller and Laguerre, the initial roots are determined by polynomial theory; Huang, 2000, 2004). Consequently, the average iterating numbers, the average CPU times, and the average estimated variances for the four methods
Figure 2: Learning error curves of the original RCM-NRF and the RMM-NRF for a seventh-order polynomial.

Figure 3: Learning error curve for the BPA-based NRF in the case of the learning coefficient $\eta = 0.1$ for a seventh-order polynomial.
Table 2: Statistical Performance Comparisons of the Original RCM-NRF and the RMM-NRF for a Seventh-Order Polynomial f(x).

Method     Average Iterating Numbers   Average CPU Times (seconds)   Average Estimated Variances
RMM        23,242                      972.4                         3.26367E-26
RCM        276,425                     1207.2                        2.32545E-26
Muller     519,374                     2542.2                        1.32443E-26
Laguerre   432,463                     2312.4                        4.23151E-26

Figure 4: Histogram comparisons among the iterating numbers for the four root-finding methods.
are shown in Table 2, and the corresponding histograms of the average iterating numbers and the average CPU times are illustrated in Figures 4 and 5, respectively.

Table 2 shows that the RMM-NRF is significantly faster than the RCM-NRF and that the two NRFs are also faster than the two nonneural methods. Moreover, the two neural approaches achieve higher accuracies in shorter CPU times than the two nonneural methods. This is because Muller's and Laguerre's methods obtain one root at a time, so the accuracy of each successively deflated polynomial is affected, which results in lower root accuracies and longer iterating times (including the time spent carefully selecting initial root values).

Figure 5: Histogram comparisons among the CPU times consumed for the four root-finding methods.
Comments. Why do we revisit the topic of polynomial root finding when so many numerical methods already exist? Besides the inherent parallelism of neural methods, a further motivation is that neural methods can find all roots simultaneously, whereas the traditional numerical methods can find the roots only sequentially, by deflation, so that each new root is known with only finite accuracy, and errors creep into the determination of the coefficients of the successively deflated polynomial (William et al., 1992). Hence, the accuracy of a nonneural numerical method is fundamentally limited and cannot surpass that of the neural root-finding method. In addition, all nonneural methods need good candidate initial root values; otherwise, the designed root finder will not converge (William et al., 1992), which increases computational complexity and consumes additional processing time. Our proposed neural approaches, on the other hand, are guided by the a priori information from the polynomial imposed on the CLA, so that they need not compute any initial root values beyond randomly selecting them from the uniform distribution on $[-1, 1]$.
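This deflation effect is easy to reproduce with standard tools. The sketch below (using numpy's companion-matrix solver as the "all-at-once" reference, not the NRF itself) contrasts simultaneous root finding with sequential deflation on the seventh-order test polynomial used above, where a small error in one accepted root contaminates all subsequent ones:

```python
import numpy as np

# f(x) = x^7 + (1.2-0.5i)x^6 + (-6.5+1.4i)x^4 + 2.93i x^2 + (1.7+2.4i)x + 2.4-0.3i
coeffs = np.array([1, 1.2 - 0.5j, 0, -6.5 + 1.4j, 0, 2.93j, 1.7 + 2.4j, 2.4 - 0.3j])

# Simultaneous: eigenvalues of the companion matrix; every root is obtained
# against the original, undeflated polynomial.
roots_all = np.roots(coeffs)

def deflate(c, r):
    """Synthetic division of polynomial c by (x - r), dropping the remainder."""
    out = [c[0]]
    for a in c[1:-1]:
        out.append(a + r * out[-1])
    return np.array(out)

# Sequential: accept one root with a small error and deflate; the error is
# baked into the deflated coefficients and perturbs every later root.
deflated = deflate(coeffs, roots_all[0] + 1e-6)
roots_rest = np.roots(deflated)
```

Evaluating $|f(r)|$ at the recovered roots shows that the residuals of `roots_rest` are typically far larger than those of `roots_all`.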
Considering these points, we focus on the RMM-NRF in presenting and discussing the experimental results.

4 Experimental Results and Discussions

In order to verify the effectiveness and efficiency of our proposed approaches, we conducted several experiments under two common assumptions. First, for each polynomial $f(x)$ involved, the input training sample pairs $(x, f(x))$ are drawn from the complex domain $|x| < 1$, and the total number $P$ of training sample pairs is fixed at 100. Second, the initial weight (root) values of the FNNRF are randomly selected from the uniform distribution on $[-1, 1]$.

4.1 Polynomial with Known Roots. Consider a sixth-order test polynomial $f_1(x)$ with known complex roots generated from $r_i = 0.9 \exp(j\frac{\pi i}{3})$ $(i = 1, 2, \ldots, 6)$; the distribution of its roots is shown in Figure 6. We use an FNNRF with a 2-6-1 structure to find the roots of this polynomial. The three controlling parameters of the CLA are chosen as $\delta P_0 = 10.0$, $\theta_p = 10.0$, and $\eta = 0.3$, kept unchanged throughout this example, and three termination errors, $e_r = 0.1$, $e_r = 0.01$, and $e_r = 0.001$, are considered in turn. To evaluate the statistical performance of the NRFs, for each case we conducted 30 repeated experiments, choosing different initial connection weights from the uniform distribution on $[-1, 1]$, to
Figure 6: Roots distribution of $f_1(x)$ in the complex plane.
observe the experimental results for the $\Sigma$-$\Pi$ and $\log \Sigma$ models. After the corresponding FNNRFs were trained by the CLA to converge to the given accuracies, the estimated convergent roots for the two models were plotted as shown in Figures 7 through 9, where the true roots are represented by the bigger black dots. Figures 7 to 9 show that the estimated root accuracies become higher and higher, and the scatter of the estimated roots becomes smaller and smaller, as the termination error decreases. Figure 7 shows that for the termination error case of $e_r = 0.1$, the accuracy of the $\Sigma$-$\Pi$ model is obviously higher than that of the $\log \Sigma$ model. The reason is that the logarithmic operator used in the $\log \Sigma$ model primarily plays the role of a transformation, compressing a larger numerical value into a smaller one. Thus, the termination error $e_r = 0.1$ in the $\log \Sigma$ model is to a certain extent equivalent to $e_r = e^{0.1}$ in the $\Sigma$-$\Pi$ model, and it is this lower training accuracy that results in the poorer estimated root accuracy. The same conclusions hold for the other two termination error cases; however, the differences between the two models shrink as the training accuracy increases (i.e., as the termination error decreases; see Figures 8 and 9). At the same time, this phenomenon also indirectly shows that the FNNRF based on root moments converges very fast. In addition, assume that the termination error is fixed at $e_r = 1.0 \times 10^{-9}$.
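For reference, the test polynomial and its exact roots can be generated and cross-checked with a standard eigenvalue-based solver (a sketch; note that with these roots, $f_1(x)$ reduces to $x^6 - 0.9^6$):

```python
import numpy as np

# True roots r_i = 0.9*exp(j*pi*i/3), i = 1..6: a regular hexagon of radius 0.9
true_roots = 0.9 * np.exp(1j * np.pi * np.arange(1, 7) / 3)

# Build f1(x) from its roots; the coefficient vector is that of x^6 - 0.9^6
coeffs = np.poly(true_roots)

# Reference estimates against which the NRF results in Table 3 can be compared
est = np.roots(coeffs)
# match each estimate to its nearest true root
matched = [true_roots[np.argmin(np.abs(true_roots - e))] for e in est]
```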
We repeat the 30 experiments to get the average estimated roots (including the average estimated variances), the average iterating numbers, the average CPU times, and the average relative estimate errors $(d_r)$, as shown in Table 3.² From Table 3, it can be seen that, in the statistical sense, the $\Sigma$-$\Pi$ model achieves higher estimation accuracy than the $\log \Sigma$ model, but the latter has a shorter training time: the average CPU time for the $\Sigma$-$\Pi$ model is 52 seconds longer than that for the $\log \Sigma$ model. The reason is again the compressing transformation of the $\log \Sigma$ model. Figures 10 and 11 show two sets of logarithmic learning root curves of the two models for one of the 30 experiments, where the logarithms of the iterating numbers are taken in order to magnify the starting parts of the curves. Each plot contains 12 curves, representing the real and imaginary parts of the six complex roots found by the NRF. From Figures 10 and 11, it can be seen that the $\Sigma$-$\Pi$ model exhibits drastic fluctuations around the true (estimated) root values, which may result in many long,

² The average relative estimate error is defined as

$$d_r = \frac{1}{K} \cdot \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i^{(k)}}{x_i} \right|,$$

where $K$ is the number of repeated experiments, $x_i$ is the $i$th true (exact) root of the given polynomial, and $\hat{x}_i^{(k)}$ is the $k$th estimated value of the $i$th root (i.e., the weight $w_i$).
Figure 7: Estimated root distributions of $f_1(x)$ in the complex plane for the $\Sigma$-$\Pi$ and $\log \Sigma$ models in the case of $e_r = 0.1$. (a) $\Sigma$-$\Pi$ model. (b) $\log \Sigma$ model.
Figure 8: Estimated root distributions of $f_1(x)$ in the complex plane for the $\Sigma$-$\Pi$ and $\log \Sigma$ models in the case of $e_r = 0.01$. (a) $\Sigma$-$\Pi$ model. (b) $\log \Sigma$ model.
Figure 9: Estimated root distributions of $f_1(x)$ in the complex plane for the $\Sigma$-$\Pi$ and $\log \Sigma$ models in the case of $e_r = 0.001$. (a) $\Sigma$-$\Pi$ model. (b) $\log \Sigma$ model.
Table 3: Statistical Performance Comparison of the $\Sigma$-$\Pi$– and $\log \Sigma$–Based FNNRF Models for $f_1(x)$ ($\delta P_0 = 10.0$, $\theta_p = 10.0$, $\eta = 0.3$).

Indices                                  log Σ Model                      Σ−Π Model
Average estimated roots
  w1     (0.8999990, 1.3729134E−06)      (0.8999987, −4.1471349E−06)
  w2     (−0.8999991, −1.1557200E−07)    (−0.9000001, −8.5237025E−07)
  w3     (0.4499939, 0.7794214)          (0.4499981, 0.7794229)
  w4     (−0.4500013, −0.7794217)        (−0.4499988, −0.7794209)
  w5     (−0.4499971, 0.7794194)         (−0.4499974, 0.7794240)
  w6     (0.4499970, −0.7794210)         (0.4499984, −0.7794222)
Average iterating number                 137,541                          226,690
Average CPU times (seconds)              118.24                           170.33
Average relative estimated error (dr)    1.133025533350818E−05            8.708582759808792E−06
Average estimated variance               2.706395396765315E−05            2.147191933435669E−05
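The $d_r$ computation of footnote 2 is easy to reproduce from the tabulated averages. Applied to the $\log \Sigma$ roots of Table 3 (a single $K = 1$ instance, so it need not equal the tabulated 30-run average exactly, though it is of comparable magnitude):

```python
import numpy as np

# exact roots of f1(x)
true_roots = 0.9 * np.exp(1j * np.pi * np.arange(1, 7) / 3)

# average estimated roots reported in Table 3 for the log-Sigma model
est = np.array([0.8999990 + 1.3729134e-06j, -0.8999991 - 1.1557200e-07j,
                0.4499939 + 0.7794214j, -0.4500013 - 0.7794217j,
                -0.4499971 + 0.7794194j, 0.4499970 - 0.7794210j])

def d_r(true, est):
    """Average relative estimate error (footnote 2, with K = 1);
    each estimate is matched to its nearest true root."""
    total = 0.0
    for e in est:
        i = int(np.argmin(np.abs(true - e)))
        total += abs((true[i] - e) / true[i])
    return total / len(est)

err = d_r(true_roots, est)
```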
Figure 10: The 12 learning weight (root) curves of $f_1(x)$ for the $\Sigma$-$\Pi$ model versus the logarithmic iterating numbers.

Figure 11: The 12 learning weight (root) curves of $f_1(x)$ for the $\log \Sigma$ model versus the logarithmic iterating numbers.
narrow troughs that are flat in one direction and steep in other directions. This phenomenon slows the corresponding training significantly. In contrast, the $\log \Sigma$ model is capable of compressing the input dynamic range, so that it produces a smoother cost-function landscape, avoiding deep valleys and thus facilitating learning. Therefore, for practical problems, the $\log \Sigma$ model is preferred.

4.2 Arbitrary Polynomial with Unknown Roots. To verify the efficiency and effectiveness of this NRF based on the root moments of a polynomial, a ninth-order arbitrary polynomial, $f_2(x) = x^9 + 2.1ix^7 + 1.1x^6 + (1.3 - i)x^3 + (1.4 + 0.6i)x + 5.3i$, is used to test the performance of the NRF. For this problem, an FNNRF with a 2-9-1 structure is constructed to find the roots of the polynomial. The three controlling parameters of the CLA are chosen as $\delta P_0 = 8.0$, $\theta_p = 5.0$, and $\eta = 0.4$, again kept unchanged. For the three termination errors $e_r = 0.1$, $e_r = 0.01$, and $e_r = 0.001$, we conduct 30 repeated experiments each, choosing different initial connection weights from the uniform distribution on $[-1, 1]$, to evaluate the statistical performance of the two NRFs. After the NRF converges, the estimated convergent roots for the two models are depicted in the complex plane in Figures 12 to 14, where only one root lies inside the unit circle. These three figures show phenomena similar to those observed in the previous example.
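The claim that only one root of $f_2(x)$ lies inside the unit circle can be checked directly with a standard solver (a sketch; `np.roots` stands in for the NRF here):

```python
import numpy as np

# f2(x) = x^9 + 2.1i x^7 + 1.1 x^6 + (1.3-i)x^3 + (1.4+0.6i)x + 5.3i,
# as a coefficient vector from degree 9 down to 0 (missing powers are zero)
coeffs = np.array([1, 0, 2.1j, 1.1, 0, 0, 1.3 - 1j, 0, 1.4 + 0.6j, 5.3j])

roots = np.roots(coeffs)
inside = int(np.sum(np.abs(roots) < 1))   # count of roots inside the unit circle
```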
Figure 12: Estimated root distributions of $f_2(x)$ in the complex plane for the $\Sigma$-$\Pi$ and $\log \Sigma$ models in the case of $e_r = 0.1$. (a) $\Sigma$-$\Pi$ model. (b) $\log \Sigma$ model.
Figure 13: Estimated root distributions of $f_2(x)$ in the complex plane for the $\Sigma$-$\Pi$ and $\log \Sigma$ models in the case of $e_r = 0.01$. (a) $\Sigma$-$\Pi$ model. (b) $\log \Sigma$ model.
Figure 14: Estimated root distributions of $f_2(x)$ in the complex plane for the $\Sigma$-$\Pi$ and $\log \Sigma$ models in the case of $e_r = 0.001$. (a) $\Sigma$-$\Pi$ model. (b) $\log \Sigma$ model.
Here we must stress that it was found in the experiments that, for some initial connection weights, the $\Sigma$-$\Pi$ model always oscillates around some local minimum; that is, there are possibly local minima in the error surface of the $\Sigma$-$\Pi$ model. Similarly, let $e_r = 1.0 \times 10^{-9}$. We repeat the 30 experiments to get the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, the average CPU times, and the average estimate accuracies³ $(C_p)$, as shown in Table 4. From Table 4, it can again be observed that, in the statistical sense, the $\log \Sigma$ model trains faster than the $\Sigma$-$\Pi$ model, but the latter achieves higher estimation accuracy. In addition, Figures 15 and 16 illustrate two sets of logarithmic learning weight (root) curves of the two models for one of the 30 repeated experiments with identical initial connection weight values; each subplot contains two curves, representing the real (solid line) and imaginary (dashed line) parts of one complex root. From Figures 15 and 16, it can be seen that the $\Sigma$-$\Pi$ model converges more slowly than the $\log \Sigma$ model, and its fluctuation around the true (estimated) root values is slightly more conspicuous. Moreover, for some initial connection weights, we again found oscillations around certain local minima during the search for solutions. Therefore, it is again suggested that, in applying this NRF to practical problems, the $\log \Sigma$ model should be preferred to the direct $\Sigma$-$\Pi$ model.

4.3 Effects of the Parameters of the CLA on the Performance of the NRFs.
In order for the CLAs to be applied more conveniently to practical root-finding problems, we investigate the effects of the three controlling parameters of the CLA on the performance of the NRFs. An arbitrary fifth-order test polynomial with unknown roots, $f_3(x) = x^5 + 2.2x^4 + (-3.1 - 0.5i)x^3 + (2.4 - 1.7i)x^2 + 4.3ix + 2.35 - 1.23i$, is given, and we build an FNNRF with a 2-5-1 structure to discuss the effects of the three controlling parameters $\{\delta P_0, \theta_p, \eta\}$ used in the CLA on the performance of this NRF. In the following experiments, we always suppose that the termination error is fixed at $e_r = 1 \times 10^{-8}$ for all cases; three different values of each parameter in $\{\delta P_0, \theta_p, \eta\}$ are

³ The average estimate accuracy is defined as

$$C_p = \frac{1}{K} \cdot \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} \left| a_i - \hat{a}_i^{(k)} \right|,$$

where $K$ is the number of repeated experiments, $\{a_i\}$ are the coefficients of the polynomial, and $\hat{a}_i^{(k)}$ is the $i$th reconstructed polynomial coefficient from the $k$th experiment.
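The $C_p$ computation pairs naturally with numpy's root-to-coefficient conversion. Applied to the $\Sigma$-$\Pi$ averages of Table 4 (a single $K = 1$ instance, computed from the rounded tabulated roots, so it is only of the same order as the tabulated value):

```python
import numpy as np

# average estimated roots for the Sigma-Pi model from Table 4
est = np.array([-1.138757 - 0.1249474j, 0.9814527 - 0.3930941j,
                1.029029 + 0.5315061j, -0.6140542 - 0.7163094j,
                0.2785297 + 1.157247j, -0.5205949 + 1.071422j,
                -1.275356 + 0.8272576j, 1.105367 - 1.250804j,
                0.1543843 - 1.102277j])

# true coefficients a_1..a_9 of f2(x) (leading coefficient 1 omitted)
true_a = np.array([0, 2.1j, 1.1, 0, 0, 1.3 - 1j, 0, 1.4 + 0.6j, 5.3j])

a_hat = np.poly(est)[1:]                 # coefficients reconstructed from the roots
C_p = np.mean(np.abs(true_a - a_hat))    # footnote 3 with K = 1
```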
Table 4: Statistical Performance Comparison of the $\Sigma$-$\Pi$– and $\log \Sigma$–Based FNNRF Models for $f_2(x)$ ($\delta P_0 = 8.0$, $\theta_p = 5.0$, $\eta = 0.4$).

Indices                       log Σ Model                        Σ−Π Model
Average estimated roots
  w1     (−1.138758, −0.1249457)          (−1.138757, −0.1249474)
  w2     (0.9814540, −0.3930945)          (0.9814527, −0.3930941)
  w3     (1.029028, 0.5315055)            (1.029029, 0.5315061)
  w4     (−0.6140527, −0.7163082)         (−0.6140542, −0.7163094)
  w5     (0.2785297, 1.157246)            (0.2785297, 1.157247)
  w6     (−0.5205935, 1.071421)           (−0.5205949, 1.071422)
  w7     (−1.275355, 0.8272572)           (−1.275356, 0.8272576)
  w8     (1.105366, −1.250805)            (1.105367, −1.250804)
  w9     (0.1543829, −1.102276)           (0.1543843, −1.102277)
Average reconstructed polynomial coefficients
  ā1     (−1.3808409E−6, 1.0450681E−6)    (6.6061808E−8, 1.0728835E−7)
  ā2     (−3.4644756E−06, 2.100000)       (−2.8174892E−07, 2.100000)
  ā3     (1.100003, −4.3437631E−06)       (1.100000, −1.0875538E−06)
  ā4     (−9.9966510E−6, −5.5188639E−6)   (−4.694555E−9, 1.4712518E−6)
  ā5     (2.8562122E−6, 5.2243431E−6)     (1.5409139E−6, 1.7070481E−6)
  ā6     (1.299983, −1.000008)            (1.300000, −0.9999995)
  ā7     (−1.1211086E−5, 2.4979285E−5)    (2.8605293E−6, 1.9240313E−6)
  ā8     (1.399990, 0.6000041)            (1.399996, 0.6000024)
  ā9     (2.6458512E−5, 5.299979)         (6.8408231E−6, 5.300003)
Average iterating number                  234,239                          267,529
Average CPU times (seconds)               135.16                           278.54
Average estimated accuracy (Cp)           5.855524581083182E−05            8.679319172366640E−06
Average estimated variance                2.164147025607199E−05            3.248978567496603E−06
in turn chosen while keeping the other two parameters unchanged, and for each case we conduct 30 repeated experiments, choosing different initial connection weights from the uniform distribution on $[-1, 1]$, to evaluate the statistical performance of the two NRFs.

4.3.1 Case 1. The parameters $\theta_p = 5.0$ and $\eta = 0.5$ remain unchanged, while $\delta P_0$ is chosen as 15.0, 27.0, and 39.0 in turn. For this case, we design the corresponding CLA with these parameters to train the two FNNRFs, with 30 different initial weight values, until the termination error is reached. Table 5 lists the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, and the average estimate accuracies $(C_p)$. From Table 5, it can be seen that the iterating number (and hence the training time) grows as the parameter $\delta P_0$ increases.
Table 5: Performance Comparisons Based on the Test Polynomial $f_3(x)$ Between the $\Sigma$-$\Pi$ and $\log \Sigma$ Models Under the Conditions $\theta_p = 5.0$, $\eta = 0.5$, and $\delta P_0 = 15.0, 27.0, 39.0$.

Average estimated roots (agreeing across all settings to within about $10^{-4}$):
  w1 ≈ (0.072839, 0.497572)    w2 ≈ (0.784089, 0.923772)    w3 ≈ (−0.668282, −0.914229)
  w4 ≈ (0.975643, −0.593928)   w5 ≈ (−3.364290, 0.086811)
Average reconstructed polynomial coefficients (agreeing across all settings to within about $2 \times 10^{-4}$):
  ā ≈ (2.2, −3.1 − 0.5i, 2.4 − 1.7i, 4.3i, 2.35 − 1.23i)

                                         δP0 = 15.0     δP0 = 27.0     δP0 = 39.0
Σ−Π model    Average iterating number    50,755         72,072         104,382
             Average CPU times (s)       18.21          24.05          25.88
log Σ model  Average iterating number    28,398         47,069         66,752
             Average CPU times (s)       12.76          13.80          22.34

At $\delta P_0 = 27.0$, the average estimated accuracies $(C_p)$ were 3.894946E−05 ($\Sigma$-$\Pi$) and 4.174685E−05 ($\log \Sigma$), with average estimated variances 1.953244E−05 and 2.109835E−05, respectively.
Figure 15: Logarithmic learning weight (root) curves of the $\Sigma$-$\Pi$ model for $f_2(x)$ versus the logarithmic iterating numbers: (a) w1. (b) w2. (c) w3. (d) w4. (e) w5. (f) w6. (g) w7. (h) w8. (i) w9.
Moreover, it can be observed that, in the statistical sense, the $\log \Sigma$ model trains faster than the $\Sigma$-$\Pi$ model, but the latter achieves higher estimation accuracy, owing to its longer training time. In addition, Figure 17 shows the logarithmic learning error curves for the two models for the three different values of $\delta P_0$, for a single trial. Figure 17 shows that convergence becomes slower and slower (i.e., the iterating number grows) as $\delta P_0$ increases. This phenomenon can be explained from the particulars of the CLA. In fact, from equation 2.21,
Figure 16: Logarithmic learning weight (root) curves of the $\log \Sigma$ model for $f_2(x)$ versus the logarithmic iterating numbers: (a) w1. (b) w2. (c) w3. (d) w4. (e) w5. (f) w6. (g) w7. (h) w8. (i) w9.
$\delta Q_j = -\eta \Phi_j\, \delta P / \sqrt{\Phi^H I_{FF}^{-1} \Phi}$, and $\xi = I_{JJ} - I_{JF}^H I_{FF}^{-1} I_{JF} \ge 0$, it follows that

$$\mu = -\frac{1}{2\,\delta P} \cdot \sqrt{\frac{\xi}{1 - \eta^2}}. \qquad (4.1)$$

When $\mu \to 0$, equation 2.22 approaches $V \approx I_{FF}^{-1} I_{JF}$; equation 2.20 can then be rewritten as

$$dw_i \approx \frac{J_i - F_i^H I_{FF}^{-1} I_{JF}}{2\mu}, \qquad (4.2)$$
Figure 17: Logarithmic learning error curves for the (a) $\Sigma$-$\Pi$ and (b) $\log \Sigma$ models under the conditions $\theta_p = 5.0$, $\eta = 0.5$, and $\delta P_0 = 15.0, 27.0, 39.0$, for a single trial.
which is similar to the gradient descent formula of the BPA apart from the second term, $-F_i^H I_{FF}^{-1} I_{JF} / 2\mu$, related to the a priori information of the root-finding problem. When $\mu \to \infty$, equation 2.22 becomes $V \approx -2\mu I_{FF}^{-1}\, \delta Q$; equation 2.20 can then be rewritten as

$$dw_i \approx F_i^H I_{FF}^{-1}\, \delta Q, \qquad (4.3)$$
which is more dependent on the a priori information of the problem at hand. Therefore, as $\delta P_0$ increases, equation 4.1 shows that $\mu$ becomes smaller and smaller, so that the iteration of the connection weights follows the gradient descent search described in equation 4.2. Since $\delta P(t)$ is adaptively chosen by equation 2.24, a larger $\delta P_0$ implies a longer training time before the iteration of the connection weights switches to the a priori information search described in equation 4.3. Consequently, convergence certainly becomes slower for larger $\delta P_0$, which shows that our experimental results are completely consistent with our theoretical analysis. In addition, from Figure 17, it can be seen that the fluctuation of the $\Sigma$-$\Pi$ model is much more drastic than that of the $\log \Sigma$ model, which is also consistent with the previous analyses.

4.3.2 Case 2. The parameters $\delta P_0 = 12.0$ and $\theta_p = 10.0$ are kept unchanged, while $\eta$ is chosen as 0.4, 0.6, and 0.8 in turn. We design the corresponding CLA with these parameters to train the two NRFs, with 30 different initial weight values, until the termination error is reached. Table 6 shows the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, and the average estimate accuracies $(C_p)$. From Table 6, it can be observed that the iterating number increases as the parameter $\eta$ increases. Moreover, it can be seen that, in the statistical sense, the $\Sigma$-$\Pi$ model trains more slowly than the $\log \Sigma$ model, but achieves higher estimation accuracy. Figure 18 shows the logarithmic learning error curves for the two models for the three different values of $\eta$, for a single trial. From this figure, it can be seen that convergence becomes slower and slower as the parameter $\eta$ increases.
This phenomenon can be explained as follows. Since δP(t) is adaptively chosen by equation 2.24, the learning process always proceeds from μ → 0 (the gradient-descent-based phase) to μ → ∞ (the a priori information-based search phase). If η (0 < η < 1) is chosen as a bigger value, from equation 4.1, μ also becomes bigger. As a result, the role of η is dominant in the gradient-descent-based phase, so that a bigger η will result in a slower search process on the error surface. On the other hand, when the learning process switches to the a priori information-based search phase, the role of δP(t) becomes dominant. Obviously, during this phase, the
Table 6: Performance Comparisons Based on the Test Polynomial f3(x) Between the Σ−Π and log Σ Models under the Conditions of δP0 = 12.0 and θp = 10.0 and η = 0.4, 0.6, 0.8. [For each model and each η, the table lists the average estimated roots (with the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, the average CPU times in seconds, and the average estimated accuracies (Cp).]
A Neural Root Finder of Polynomials Based on Root Moments
D.-S. Huang, H. Ip, and Z. Chi
Figure 18: Logarithmic learning error curves for the (a) Σ−Π and (b) log Σ models under the conditions of δP0 = 12.0 and θp = 10.0 and η = 0.4, 0.6, 0.8 for a single trial.
distinct η will not affect the convergence speed. Therefore, from these analyses, it can be deduced that the convergence speed will drop when the parameter η goes up. This shows that our experimental results are completely consistent with our theoretical analyses. From Figure 18, it can be observed that the fluctuation for the Σ−Π model is much more drastic than for the log Σ model.

4.3.3 Case 3. The parameters δP0 = 8.0 and η = 0.3 are kept unchanged, while θp is chosen as 10.0, 20.0, and 30.0, respectively. As in the other two cases, we design the corresponding CLA with these chosen parameters to train the two NRFs with 30 different initial weight values until the termination error is satisfied. Table 7 shows the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, and the average estimated accuracies (Cp). From Table 7, it can also be observed that the iterating number increases as the parameter θp increases. Furthermore, it can be seen that, in the statistical sense, the log Σ model has a faster training speed than the Σ−Π model, but the latter is of higher estimated accuracy than the former. Finally, Figure 19 illustrates the logarithmic learning error curves for the two models in the case of three different θp's for a single trial. From this figure, it can be seen that the convergence speed drops as θp increases. The experimental phenomenon here can be completely explained by means of the analyses of case 1, since δP(t) is a functional of θp and a bigger θp corresponds to a bigger δP(t).

5 Conclusions

This article proposed a novel feedforward neural network root finder (FNNRF), which can be generalized to two kinds of structures, referred to as the Σ−Π and log Σ models, to find the arbitrary roots (including complex ones) of arbitrary polynomials.
For this FNNRF, a constrained learning algorithm (CLA) was constructed and derived by incorporating a priori information, the root moments of polynomials, into the output error cost function of the FNNRF. The experimental results show that this CLA based on the root moments of polynomials has an extremely fast training speed and can compute the solutions (roots) of polynomial functions efficiently. We showed in theory and in experiment that the computational complexity for the RMM-NRF is significantly lower than that for the RCM-NRF, and the computational complexities for the two neural root finders are generally lower than those for the nonneural root finders in Mathematica's Roots function. We also compared the performance of the two NRFs of the RMM and the RCM with the two nonneural methods of Muller and Laguerre. The experimental results showed that both the RMM-NRF and the RCM-NRF have a faster convergence speed and higher accuracy with respect to the traditional Muller and Laguerre methods. Moreover, the neural methods do not need
Table 7: Performance Comparisons Based on the Test Polynomial f3(x) Between the Σ−Π and log Σ Models under the Conditions of δP0 = 8.0 and η = 0.3 and θp = 10.0, 20.0, 30.0. [For each model and each θp, the table lists the average estimated roots (with the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, the average CPU times in seconds, and the average estimated accuracies (Cp).]
Figure 19: Logarithmic learning error curves for the (a) Σ−Π and (b) log Σ models under the conditions of δP0 = 8.0 and η = 0.3 and θp = 10.0, 20.0, 30.0 for a single trial.
to carefully select the initial root values, as the nonneural methods do, beyond randomly drawing them from the uniform distribution on [−1, 1]. The important point to stress here is that the neural root-finding methods will have an advantage over traditional nonneural methods if a neural computer with many interconnected processing nodes, like the brain, or a computer with an inherently parallel algorithmic structure is developed in the near future. In addition, we took two polynomials as examples to discuss and compare the performance of the two models, Σ−Π and log Σ. The experimental results illustrated that under identical termination accuracy, the Σ−Π model has a higher estimated accuracy but a slower training speed, and exhibits more drastic fluctuations in the process of searching for global minima on the error surface, relative to the log Σ model. Specifically, it can sometimes be observed in experiments that local minima exist on the error surface of the Σ−Π model, so that the CLA oscillates around a local minimum or even diverges. In contrast, the log Σ model can avoid these drawbacks due to its logarithmic operator, which compresses the dynamic range of the inputs and gives a smoother error surface at the hidden nodes. Moreover, there are no local minima with the log Σ model, provided the parameters of the CLA are suitably chosen, since all the a priori information has been utilized. Therefore, in real applications, the log Σ model should be preferred to the direct Σ−Π model. Finally, we discussed the effects of the three controlling parameters {δP0, θp, η} of the CLA on the performance of the two NRFs. The experimental results showed that the training speeds could be increased by decreasing these three controlling parameters while maintaining the same estimated accuracies (including the estimated variances). We showed that this performance behavior is consistent with our theoretical analysis.
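The root moments that serve as the CLA's a priori information are the power sums of a polynomial's roots, and they can be obtained from the coefficients alone through Newton's identities, with no root finding involved. The following sketch illustrates only this root-moment computation, not the paper's CLA or network models; the test polynomial and function name are illustrative choices:

```python
# Root moments (power sums of the roots) from coefficients alone, via
# Newton's identities. For a monic polynomial x^n + a1 x^(n-1) + ... + an:
#   s_m = -m*a_m - sum_{j=1}^{m-1} a_j * s_{m-j}   for m <= n
#   s_m = -sum_{j=1}^{n} a_j * s_{m-j}             for m > n
def root_moments(a, M):
    """a = [a1, ..., an] for a monic polynomial; returns [s_1, ..., s_M]."""
    n = len(a)
    s = []
    for m in range(1, M + 1):
        sm = -m * a[m - 1] if m <= n else 0.0
        for j in range(1, min(m, n + 1)):   # j ranges over valid coefficient indices
            sm -= a[j - 1] * s[m - j - 1]
        s.append(sm)
    return s

# x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3), so the roots are 1, 2, 3
print(root_moments([-6.0, 11.0, -6.0], 3))   # [6.0, 14.0, 36.0]
```

The first three moments match the direct power sums 1+2+3 = 6, 1+4+9 = 14, and 1+8+27 = 36.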
Specifically, the drastic fluctuation phenomena with the direct Σ−Π model were again observed in these experiments. Future work will explore how to use the NRFs to find the maximum or minimum modulus root of arbitrary polynomials and apply them to more practical problems in signal processing.

Acknowledgments

This work was supported by the National Natural Science Foundation of China and a grant of the 100-Talents Program of the Chinese Academy of Sciences.

References

Aliphas, A., Narayan, S. S., & Peterson, A. M. (1983). Finding the zeros of linear phase FIR frequency sampling filters. IEEE Trans. Acoust., Speech Signal Processing, 31, 729–734.
Anthony, R., & Philip, R. (1978). A first course in numerical analysis. New York: McGraw-Hill.
Hormis, R., Antonion, G., & Mentzelopoulou, S. (1995). Separation of two-dimensional polynomials via a Σ−Π neural net. In Proceedings of International Conference on Modelling and Simulation (pp. 304–306). Pittsburgh, PA: International Society of Modeling and Simulation.
Hoteit, L. (2000). FFT-based fast polynomial rooting. In Proceedings of ICASSP '2000, 6 (pp. 3315–3318). Istanbul: IEEE.
Huang, D. S. (2001). Finding roots of polynomials based on root moments. In 8th Int. Conf. on Neural Information Processing (ICONIP), 3, 1565–1571. Shanghai, China: Publishing House of Electronic Industry.
Huang, D. S. (2001). Revisit to constrained learning algorithm. In The 8th International Conference on Neural Information Processing (ICONIP) (Vol. I, pp. 459–464). Shanghai, China: Publishing House of Electronic Industry.
Huang, D. S. (2004). A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE Trans. Neural Networks, 15, 477–491.
Huang, D. S., & Chi, Z. (2001). Neural networks with problem decomposition for finding real roots of polynomials. In 2001 Int. Joint Conf. on Neural Networks (IJCNN2001). Washington, DC: IEEE.
Huang, D. S., Chi, Z., & Siu, W. C. (2000). Finding real roots of polynomials based on constrained learning neural networks (Tech. Rep.). Hong Kong: Hong Kong Polytechnic University.
Huang, D. S., Ip, H., Chi, Z., & Wong, H. S. (2003). Dilation method for finding close roots of polynomials based on constrained learning neural networks. Physics Letters A, 309, 443–451.
Hush, D. R., Horne, B., & Salas, J. M. (1992). Error surfaces for multilayer perceptrons. IEEE Trans. Syst., Man Cybern., 22, 1152–1161.
Karras, D. A., & Perantonis, S. J. (1995). An efficient constrained training algorithm for feedforward networks. IEEE Trans. Neural Networks, 6, 1420–1434.
Lang, M., & Frenzel, B.-C. (1994). Polynomial root finding. IEEE Signal Processing Letters, 1, 141–143.
Milovanovic, G. V., & Petkovic, M. S. (1986).
On computational efficiency of the iterative methods for the simultaneous approximation of polynomial zeros. ACM Transactions on Mathematical Software, 12, 295–306.
Mourrain, B., Pan, V. Y., & Ruatta, O. (2003). Accelerated solution of multivariate polynomial systems of equations. SIAM J. Comput., 32, 435–454.
Perantonis, S. J., Ampazis, N., Varoufakis, S., & Antoniou, G. (1998). Constrained learning in neural networks: Application to stable factorization of 2-D polynomials. Neural Processing Letters, 7, 5–14.
Schmidt, C. E., & Rabiner, L. R. (1977). A study of techniques for finding the zeros of linear phase FIR digital filters. IEEE Trans. Acoust., Speech Signal Processing, 25, 96–98.
Stathaki, T. (1998). Root moments: A digital signal-processing perspective. IEE Proc. Vis. Image Signal Processing, 145, 293–302.
Stathaki, T., & Constantinides, A. G. (1995). Root moments: An alternative interpretation of cepstra for signal feature extraction and modeling. In Proceedings of the 29th Asilomar Conference on Signals, Systems and Computers (Vol. 2, pp. 1477–1481).
Stathaki, T., & Fotinopoulos, I. (2001). Equiripple minimum phase FIR filter design from linear phase systems using root moments. IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, 48, 580–587.
Steiglitz, K., & Dickinson, B. (1982). Phase unwrapping by factorization. IEEE Trans. Acoust., Speech Signal Processing, 30, 984–991.
Thomas, B., Arani, F., & Honary, B. (1997). Algorithm for speech model root location. Electronics Letters, 33, 355–356.
William, H. P., Saul, A. T., William, T. V., & Brian, P. F. (1992). Numerical recipes in Fortran (2nd ed.). Cambridge: Cambridge University Press.
Received July 30, 2002; accepted February 2, 2004.
NOTE
Communicated by Kechen Zhang
A Note on the Applied Use of MDL Approximations Daniel J. Navarro [email protected] Department of Psychology, Ohio State University, Columbus, OH 43210, U.S.A.
An applied problem is discussed in which two nested psychological models of retention are compared using minimum description length (MDL). The standard Fisher information approximation to the normalized maximum likelihood is calculated for these two models, with the result that the full model is assigned a smaller complexity, even for moderately large samples. A geometric interpretation for this behavior is considered, along with its practical implications.

1 Introduction

The minimum description length (MDL) principle (Rissanen, 1978; see also Grünwald, 1998) has attracted interest in applied fields because it allows comparisons between nonnested and misspecified models without requiring restrictive assumptions. Of particular interest is the normalized maximum likelihood (NML) criterion,

NML = p(X | θ∗(X)) / ∫ p(Y | θ∗(Y)) dY,

where θ∗(X) denotes the maximum likelihood estimate (MLE) for the data X. While the NML has many desirable properties (Rissanen, 2001), a practical difficulty is that the normalization term is difficult to evaluate. Consequently, an approach based on Fisher information (FIA) is often used in its place (e.g., Rissanen, 1996). This criterion is given by

FIA = −ln p(X | θ∗(X)) + (k/2) ln(N/2π) + ln ∫ √|I(θ)| dθ + o(1),

where k denotes the number of free parameters, N denotes the sample size, and I(θ) denotes the expected Fisher information matrix of sample size one. The second and third terms are often referred to as a complexity penalty. Under regularity conditions (Rissanen, 1996), it can be shown that the FIA asymptotically converges to −ln(NML).¹

¹ The regularity conditions, most important the asymptotic normality of the MLE, are satisfied for models that constitute compact (i.e., closed and bounded) subsets of an exponential family, such as those discussed here.
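To make the penalty term concrete with a textbook case not discussed in this note: for a single Bernoulli parameter θ, the expected Fisher information is I(θ) = 1/(θ(1 − θ)), and the integral of √I(θ) over [0, 1] evaluates to π exactly, so the complexity penalty is (1/2) ln(N/2π) + ln π. A brief numerical sketch (the grid size is an arbitrary choice):

```python
import math

# Integral term of the FIA for a one-parameter Bernoulli model, where
# I(theta) = 1/(theta*(1-theta)) and the exact value of the integral is pi.
def sqrt_fisher(theta):
    return 1.0 / math.sqrt(theta * (1.0 - theta))

# Midpoint rule; the nodes avoid the integrable singularities at 0 and 1.
n = 1_000_000
h = 1.0 / n
vf = sum(sqrt_fisher((i + 0.5) * h) for i in range(n)) * h
print(abs(vf - math.pi) < 0.01)   # True: the integral is pi

def fia_penalty(N, k=1, vf=math.pi):
    """Second and third FIA terms: (k/2) ln(N/2pi) + ln V_f."""
    return 0.5 * k * math.log(N / (2.0 * math.pi)) + math.log(vf)
```

As expected, the penalty grows logarithmically with the sample size N.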
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1763–1768 (2004)
The practical advantage of the FIA lies in its calculation, since it only requires integration over the parameter space, and the Fisher information matrix is sometimes easier to find than the maximum likelihood. However, while the optimality of the NML criterion holds for all N, the FIA is guaranteed only asymptotically, which can be problematic in some applications. This note presents one such case. Since applied researchers are often forced by necessity to use asymptotic measures such as the FIA, it is useful to take note of circumstances under which they are inappropriate and to give consideration to the reasons.

2 The Applied Problem

The applied problem originates in the study of human forgetting (“retention”; see Navarro, Pitt, & Myung, in press). In a typical retention experiment, participants are presented with a list of words and asked to recall them later. Measurements at different times after stimulus presentation produce a “retention curve”: the probability of accurate recall is initially high but tends toward zero over time. A retention model takes the form p(C|t) = f(t, θ), where p(C|t) denotes the probability of correct recall at time t, and f(t, θ) is the hypothesized form of the retention curve, parameterized by θ. Obviously, f(t, θ) must lie in [0, 1] for all t. Additionally, t is normalized to lie between 0 and 1. The classic retention model is the exponential (EX) model, f(t, θ) = a exp(−bt), where θ = (a, b) such that a ∈ [0, 1] denotes the initial retention probability and b gives the decay rate. While definitive bounds on b are not easy to specify, experience with a large database suggests that [0, 100] is reasonable (see Rubin & Wenzel, 1996, or Navarro et al., in press). A second retention model is provided by Wickelgren's (1972) strength-resistance (SR) theory, which hypothesizes that f(t, θ) = a exp(−b t^w), where θ = (a, b, w). Importantly, w is constrained to lie in [0, 1], since w < 0 results in an increasing retention function and w > 1 results in faster-than-exponential decay, neither of which is consistent with the theory. Note that the EX model is a special case of the SR model, and by inspection, the NML denominator term is always smaller for the EX model than for the SR model, regardless of sample size.

3 Fisher Information Matrices

In any retention experiment, continuous measurements are impractical, so retention is measured at some number m of fixed time intervals t_1, . . . , t_m. Thus, the observed data may be treated as N observations from an m-variate binomial distribution, where the ith Bernoulli probability is described by f(t_i, θ). A standard result (e.g., Schervish, 1995; Su, Myung, & Pitt, in press) allows the uvth element of the Fisher information matrix of sample size one
for these models to be written

I_uv(θ) = Σ_{i=1}^{m} [1 / (f(t_i, θ)(1 − f(t_i, θ)))] (∂f(t_i, θ)/∂θ_u)(∂f(t_i, θ)/∂θ_v).

The partial derivatives of f(t_i, θ) are simple. For the EX model, ∂f(t_i, θ)/∂a = exp(−b t_i) and ∂f(t_i, θ)/∂b = −a t_i exp(−b t_i). For the SR model, the partial derivatives are ∂f(t_i, θ)/∂a = exp(−b t_i^w), ∂f(t_i, θ)/∂b = −a t_i^w exp(−b t_i^w), and ∂f(t_i, θ)/∂w = −a b t_i^w ln(t_i) exp(−b t_i^w). Substitution into the Fisher information formula yields

I(a, b) = [ (1/a) Σ_i y(i)     −Σ_i t_i y(i)
            −Σ_i t_i y(i)      a Σ_i t_i^2 y(i) ]

for the EX model, where y(i) = 1 / (exp(b t_i) − a) and all sums run over i = 1, . . . , m. For the SR model,

I(a, b, w) = [ (1/a) Σ_i z(i)               −Σ_i t_i^w z(i)                 −b Σ_i t_i^w ln(t_i) z(i)
               −Σ_i t_i^w z(i)              a Σ_i t_i^{2w} z(i)             a b Σ_i t_i^{2w} ln(t_i) z(i)
               −b Σ_i t_i^w ln(t_i) z(i)    a b Σ_i t_i^{2w} ln(t_i) z(i)   a b^2 Σ_i t_i^{2w} ln^2(t_i) z(i) ],
where z(i) = 1 / (exp(b t_i^w) − a).

4 Small Sample Behavior

The experimental design was assumed to consist of eight evenly spaced t_i values, though the results are consistent across a range of designs. Without a closed form for the integral in the FIA, numerical estimates were obtained using Monte Carlo methods (e.g., Robert & Casella, 1999). Given the low dimensionality of the integrals and the unnecessarily extensive sampling (10^8 samples were taken), even simple Monte Carlo methods provide adequate results. The integral term ∫ √|I(θ)| dθ for the EX model is approximately 8.08, whereas for the SR model, the value is approximately 0.44. Substituting this into the FIA expression indicates that the SR model has a higher estimated complexity only when N ≥ 2096, as shown in Figure 1. This presents a substantial difficulty for the applied problem, since not one of the 77 data sets considered by Navarro et al. (in press) had a sample size this large. Therefore, every data set that could possibly be observed would be better fit by SR, yet the nested EX model would be penalized more severely for excess complexity.

It is worth considering the source of this problem. Following Myung, Balasubramanian, and Pitt (2000) and Balasubramanian (1997), the complexity terms in the FIA can be viewed as approximations to (the logarithm of) the ratio of two Riemannian volumes.

Figure 1: Complexity assigned by the Fisher information approximation (FIA) for the EX and SR models (before taking logarithms).

The first is the volume occupied by the model in the space of probability distributions (assuming the Fisher information metric), given by V_f = ∫ √|I(θ)| dθ. The second volume is that of a little ellipsoid around θ∗, intended to quantify the size of a “region of appreciable fit,” given by V_c = (2π/N)^{k/2} √(|J(θ∗)|/|I(θ∗)|), where J(θ∗) = −(1/N) Σ_t [∂² ln p(C|t)/∂θ_u ∂θ_v]|_{θ=θ∗} is the observed information. In general, the observed information J(θ∗) will differ from the expected information I(θ∗), but for current purposes, it suffices to note that there are data sets for which they are very similar, yielding J(θ∗) ≈ I(θ∗). As these are binomial models, an example of such a data set would be one in which C_i ≈ N f(t_i, θ∗) for all i, since the data are almost identical to their expected values at the MLE parameters. Note that these are the data sets that are well fit by the model and are thus precisely those of most interest to applied research. In such cases, V_c ≈ (2π/N)^{k/2}. When the observed information matrix closely approximates the expected information matrix, it is reasonable to view the complexity penalty for the FIA as approximately equivalent to ln(V_f/V_c). At N = 1, we observe that for the EX model, V_c = 2π < 8.08 ≈ V_f. The little ellipsoid is smaller than the model, as it should be. However, by expanding the EX model into a third dimension, one obtains the SR model, for which V_c = (2π)^{3/2} > 0.44 ≈ V_f. The volume of the three-dimensional ellipsoid is now larger than the volume of the entire model. Taken together, these observations suggest that the extension of the model along the new dimension induced by the addition of the w parameter is so tiny that the “small” ellipsoid now protrudes extensively, like a marble embedded in a sheet of paper.
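The closed-form EX matrix of section 3 is easy to check numerically against the generic binomial Fisher information formula. In this sketch, the eight design points and the parameter values (a = 0.8, b = 1) are arbitrary choices, and the function names are illustrative:

```python
import numpy as np

def f_ex(t, a, b):
    """EX retention model: probability of correct recall at time t."""
    return a * np.exp(-b * t)

def fisher_ex(t, a, b):
    """Closed-form expected Fisher information (sample size one) for EX."""
    y = 1.0 / (np.exp(b * t) - a)              # y(i) = 1/(exp(b t_i) - a)
    return np.array([[np.sum(y) / a, -np.sum(t * y)],
                     [-np.sum(t * y), a * np.sum(t**2 * y)]])

def fisher_numeric(f, t, theta, eps=1e-6):
    """Generic I_uv = sum_i (1/(f(1-f))) df/dtheta_u df/dtheta_v,
    using central-difference partial derivatives of f."""
    theta = np.asarray(theta, dtype=float)
    k = len(theta)
    grads = []
    for u in range(k):
        d = np.zeros(k); d[u] = eps
        grads.append((f(t, *(theta + d)) - f(t, *(theta - d))) / (2 * eps))
    w = 1.0 / (f(t, *theta) * (1.0 - f(t, *theta)))
    return np.array([[np.sum(w * grads[u] * grads[v]) for v in range(k)]
                     for u in range(k)])

t = (np.arange(8) + 1) / 8.0                   # eight evenly spaced design points
I_closed = fisher_ex(t, 0.8, 1.0)
I_num = fisher_numeric(f_ex, t, [0.8, 1.0])
print(np.allclose(I_closed, I_num, rtol=1e-4))  # True
```

The same generic routine applies unchanged to the three-parameter SR model.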
Most of the “region of good fit” no longer lies within the region occupied by the model. In short, until N gets very large, θ∗ does not seem to lie sufficiently within the model manifold to support the gaussian approximations that underlie the FIA. It does not appear that the problem can be entirely solved by incorporating the √(|J(θ∗)|/|I(θ∗)|) term. After all, by judicious choice of t and C, data sets could be chosen for which the observed information is precisely equal to the expected information even at small samples. For instance, if t = (.25, .5, .75, 1) and N = 16, then the data set C = (8, 4, 2, 1) yields MLE parameters a∗ = 1 and b∗ = ln 16 for both models and w∗ = 1 for SR. In both cases, f(t_i, θ∗) = C_i/N for all i: the data take on their expected values at the MLE. Accordingly, V_c = (2π/N)^{k/2} for these data. Even so, V_f ≈ 0.07 for the SR model, while V_f ≈ 3.39 for the EX model, indicating that the little ellipsoid is not well located for the SR model.

5 Discussion

Asymptotic approximations are useful in practice only insofar as they can be relied on to give the right answers. Clearly, if V_c < V_f is not satisfied, the standard FIA expression is not necessarily reliable. In psychology, for instance, it is rare to find studies with N > 2000, yet V_c can remain larger than V_f even at this sample size. This “well-locatedness” requirement is mentioned by Balasubramanian (1997) as a condition of his asymptotic expansions but is not generally thought of as a regularity condition because it holds asymptotically (i.e., V_c tends to zero as N tends to infinity). In finite samples, some care is needed. While observing V_c < V_f does not guarantee that θ∗ is well located, observing that V_c > V_f does imply that it is not. It may be possible to use this observation to formulate approximations that perform better in small samples, a question left to more theoretically inclined researchers.
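For reference, the crossover sample size reported in section 4 follows from equating the two models' penalty terms, (k/2) ln(N/2π) + ln V_f. Using the rounded Monte Carlo values quoted there (8.08 and 0.44), the crossover lands near, rather than exactly at, the N ≥ 2096 figure:

```python
import math

# Rounded Monte Carlo estimates of V_f = integral of sqrt(|I(theta)|) from the text
VF_EX, K_EX = 8.08, 2   # exponential model: 2 parameters
VF_SR, K_SR = 0.44, 3   # strength-resistance model: 3 parameters

def penalty(N, k, vf):
    """FIA complexity penalty: (k/2) ln(N/2pi) + ln V_f."""
    return 0.5 * k * math.log(N / (2.0 * math.pi)) + math.log(vf)

# Setting the penalties equal gives (1/2) ln(N/2pi) = ln(VF_EX/VF_SR), so
N_cross = 2.0 * math.pi * (VF_EX / VF_SR) ** 2
print(round(N_cross))   # 2119 with these rounded V_f values

# SR is penalized more than EX only beyond the crossover:
print(penalty(2000, K_SR, VF_SR) < penalty(2000, K_EX, VF_EX))   # True
print(penalty(2200, K_SR, VF_SR) > penalty(2200, K_EX, VF_EX))   # True
```

The small gap between 2119 and 2096 reflects only the rounding of the two integral estimates.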
In any case, it is clear that the source of the current difficulty does not lie with the MDL principle itself: the NML criterion does not suffer from this problem at any N. It is only that the o(1) term in the asymptotic criterion can be large enough to make the approximation impractical for smaller samples. While the point is somewhat obvious, it does imply the need to take care in the use of the criterion. To that end, it is worth ensuring that V_c < V_f before using the approximation.

Acknowledgments

This research was supported by NIH grant R01-MH57472 and by a grant from the Office of Research at OSU. I thank Peter Grünwald, Michael Lee, In Jae Myung, and Mark Pitt for helpful suggestions and two anonymous reviewers for comments that substantially improved the article.
References

Balasubramanian, V. (1997). Statistical inference, Occam's razor and statistical mechanics on the space of probability distributions. Neural Computation, 9, 349–368.
Grünwald, P. (1998). The minimum description length principle and reasoning under uncertainty. Unpublished doctoral dissertation, CWI, the Netherlands.
Myung, I. J., Balasubramanian, V., & Pitt, M. A. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences USA, 97, 11170–11175.
Navarro, D. J., Pitt, M. A., & Myung, I. J. (in press). Assessing the distinguishability of models and the informativeness of data. Cognitive Psychology.
Rissanen, J. (1978). Modeling by the shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47, 1712–1717.
Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. New York: Springer-Verlag.
Rubin, D. C., & Wenzel, A. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734–760.
Schervish, M. J. (1995). Theory of statistics. New York: Springer-Verlag.
Su, Y., Myung, I. J., & Pitt, M. A. (in press). Minimum description length and cognitive modeling. In P. Grünwald, I. J. Myung, & M. A. Pitt (Eds.), Advances in minimum description length: Theory and applications. Cambridge, MA: MIT Press.
Wickelgren, W. A. (1972). Trace resistance and decay of long-term memory. Journal of Mathematical Psychology, 9, 418–455.

Received September 15, 2003; accepted February 4, 2004.
NOTE
Communicated by Chris Burgess
Optimal Reduced-Set Vectors for Support Vector Machines with a Quadratic Kernel Thorsten Thies [email protected]
Frank Weber [email protected] Cognitec Systems GmbH, D–01139 Dresden, Germany
To reduce computational cost, the discriminant function of a support vector machine (SVM) should be represented using as few vectors as possible. This problem has been tackled in different ways. In this article, we develop an explicit solution in the case of a general quadratic kernel k(x, x′) = (C + D x⊤x′)². For a given number of vectors, this solution provides the best possible approximation and can even recover the discriminant function if the number of used vectors is large enough. The key idea is to express the inhomogeneous kernel as a homogeneous kernel on a space having one dimension more than the original one and to follow the approach of Burges (1996).

1 Introduction

Support vector machines (as described in Boser, Guyon, & Vapnik, 1992; Vapnik, 1998; Schölkopf & Smola, 2002) have been widely used for pattern recognition and regression estimation for several years. Aside from their numerous advantages, they suffer from being quite slow in the test phase compared to other learning techniques. To compensate for this drawback, there have been different approaches (e.g., Burges, 1996; Osuna & Girosi, 1998; Tipping, 2001). Specifically, we are concerned with reducing the expense of computing a discriminant function of the form

f(x) = b + Σ_{i=1}^{m} α_i k(x_i, x),   (x ∈ X),   (1.1)
where k is a positive semidefinite kernel on some input space X = R^n. The support vectors x_i ∈ X and the α_i ∈ R can be computed by the support vector algorithm (in fact, some quadratic program). As the complexity of the computation 1.1 scales linearly with the number m of support vectors (which may be quite large), we are interested in finding a set of ν vectors (with ν ≪ m) to be used instead of the support vectors such that the original discriminant function f is approximated or even recovered.

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1769–1777 (2004)

In the case of a homogeneous quadratic kernel,

k : X × X → R,   k(x, x′) = (x⊤x′)²   on X = R^n,

Burges (1996) provides an explicit way to find such reduced-set vectors. The original f is recovered using ν = n vectors, and for ν < n vectors, a best approximation to the original f is provided (in a sense specified in the following). Besides its optimality properties, the explicit method has the advantage of being very fast compared to the numerical method proposed in Burges (1996) to treat general kernels. For general kernels, an explicit method of evaluating such reduced-set vectors is quite difficult to find and hard to compute, because nonlinearity enters to a large extent. The only case where this could be done effectively so far was the case of the homogeneous quadratic kernel. Burges reduced the problem to a linear one: estimating the largest eigenvalues of an n × n matrix and their corresponding eigenvectors. In this article, we provide an explicit solution in the case of an inhomogeneous quadratic kernel,

k(x, x′) = (C + D x⊤x′)²,   x, x′ ∈ X = R^n,
with $C, D > 0$. The reason for this restriction is that if $C$ and $D$ have different signs, the kernel is not positive semidefinite in general.¹

2 Notation

For any positive semidefinite kernel $k$ on a sample space $X$, there is a feature space $H$: a Hilbert space with dot product $\langle \cdot, \cdot \rangle$ and a feature map $\Phi\colon X \to H$ such that
$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle, \qquad x, x' \in X.$$
We denote the norm in $H$ as well as in $X$ by $\|\cdot\|$. The concrete form of $\Phi$ and $H$ for the kernels occurring here does not matter; we introduce this concept simply to explain what we mean by the best possible approximation of a discriminant function. Suppose we solved the SVM's quadratic program and received the shift value $b \in \mathbb{R}$, the support vectors $x_1, \ldots, x_m \in X$, and the corresponding $\alpha_1, \ldots, \alpha_m \in \mathbb{R}\setminus\{0\}$. Note that the $\alpha_i$ can be negative here; their sign corresponds to the class labels $y_i \in \{-1, 1\}$ used in Burges (1996).

¹ As a simple example, consider $D = 1$, $C = -1$, and $n = 2$. Then the points $\frac{1}{\sqrt{2}}\binom{1}{1}$ and $\binom{-1}{0}$ lead to a Gram matrix that is not positive semidefinite, so $k$ is not positive semidefinite by definition.
Optimal Reduced-Set Vectors for SVM
Thus, we derived the normal vector $w \in H$ to the decision hyperplane in $H$ corresponding to the decision surface in $X$, that is, $f(x) = b + \langle \Phi(x), w \rangle$, where $w = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$. We are now looking for vectors $z_1, \ldots, z_\nu \in X$ and coefficients $\gamma_1, \ldots, \gamma_\nu \in \mathbb{R}$ such that the distance
$$\rho = \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) - \sum_{p=1}^{\nu} \gamma_p \Phi(z_p) \right\| \tag{2.1}$$
between the normal vector $w$ and its approximation is minimized. $\rho^2$ can be written as
$$\rho^2 = \left\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i) - \sum_{p=1}^{\nu} \gamma_p \Phi(z_p),\; \sum_{j=1}^{m} \alpha_j \Phi(x_j) - \sum_{q=1}^{\nu} \gamma_q \Phi(z_q) \right\rangle.$$
Using the bilinearity of $\langle \cdot, \cdot \rangle$, we obtain an expression for $\rho^2$ without using $\Phi$:
$$\rho^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) - 2 \sum_{i=1}^{m} \sum_{p=1}^{\nu} \alpha_i \gamma_p k(x_i, z_p) + \sum_{p,q=1}^{\nu} \gamma_p \gamma_q k(z_p, z_q). \tag{2.2}$$
To separate the different kinds of kernels used in this article, we use the following notation: for $C, D > 0$, let $k_n^C$ and $K_n^{C,D}$ be the general homogeneous (resp. inhomogeneous) quadratic kernels on $\mathbb{R}^n$ given by
$$k_n^C(x, x') = C^2 (x^\top x')^2 \qquad \text{and} \qquad K_n^{C,D}(x, x') = (C + D\, x^\top x')^2.$$
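The kernel-only expression in equation 2.2 is straightforward to evaluate numerically. The following sketch (function names are ours, not the paper's) computes $\rho^2$ for arbitrary reduced-set candidates using NumPy:

```python
import numpy as np

def rho_squared(kernel, alpha, X, gamma, Z):
    """Squared distance (2.2) between w = sum_i alpha_i Phi(x_i) and
    its approximation sum_p gamma_p Phi(z_p), using kernel values only."""
    Kxx = kernel(X, X)   # m x m Gram matrix k(x_i, x_j)
    Kxz = kernel(X, Z)   # m x nu cross matrix k(x_i, z_p)
    Kzz = kernel(Z, Z)   # nu x nu Gram matrix k(z_p, z_q)
    return (alpha @ Kxx @ alpha
            - 2.0 * alpha @ Kxz @ gamma
            + gamma @ Kzz @ gamma)

def quad_kernel(A, B, C=1.0, D=1.0):
    """Inhomogeneous quadratic kernel (C + D x^T x')^2, applied row-wise."""
    return (C + D * A @ B.T) ** 2
```

With $Z$ equal to the support vectors themselves and $\gamma = \alpha$, the distance is zero, which is a convenient sanity check.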
3 Review of the Homogeneous Case

Let us first review the case of a general homogeneous quadratic kernel $k_n^C$. We give a detailed proof of the following theorem, which is essentially contained in section 3.1 of Burges (1996).

Theorem 1. Let $C > 0$, $x_1, \ldots, x_m \in \mathbb{R}^n$ and $\alpha_1, \ldots, \alpha_m \in \mathbb{R}\setminus\{0\}$. Let $\lambda_1, \ldots, \lambda_n \in \mathbb{R}$ be the eigenvalues of $S = \sum_{i=1}^{m} \alpha_i x_i x_i^\top$, ordered by their absolute values: $|\lambda_1| \geq \cdots \geq |\lambda_n|$. Then for given $\nu \in \{1, \ldots, n\}$, the squared distance
$$\rho^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j k_n^C(x_i, x_j) - 2 \sum_{i=1}^{m} \sum_{p=1}^{\nu} \alpha_i \gamma_p k_n^C(x_i, z_p) + \sum_{p,q=1}^{\nu} \gamma_p \gamma_q k_n^C(z_p, z_q)$$
is minimized among all $\gamma_1, \ldots, \gamma_\nu \in \mathbb{R}$ and $z_1, \ldots, z_\nu \in \mathbb{R}^n$ if we choose $z_1, \ldots, z_\nu$ to be orthogonal eigenvectors of $S$ corresponding to $\lambda_1, \ldots, \lambda_\nu$ and set $\gamma_p = \lambda_p / \|z_p\|^2$ for $p = 1, \ldots, \nu$. In this case, $\rho^2 = C^2 \sum_{p=\nu+1}^{n} \lambda_p^2$. Thus, for $\nu = n$ we get $\rho = 0$, and the discriminant function is recovered.
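Theorem 1 translates directly into a short eigendecomposition routine. The sketch below (our naming, stated for $C = 1$ in the checks; the construction itself does not depend on $C$) builds $S$, sorts its spectrum by absolute value, and returns the reduced-set vectors and coefficients:

```python
import numpy as np

def k_hom(A, B, C=1.0):
    """Homogeneous quadratic kernel k_n^C(x, x') = C^2 (x^T x')^2, row-wise."""
    return C**2 * (A @ B.T) ** 2

def reduced_set_homogeneous(alpha, X, nu):
    """Reduced-set construction of theorem 1 (our naming).

    Returns gamma (nu,), Z (nu, n), and the full spectrum lam of S
    sorted by decreasing |lambda|."""
    S = X.T @ (alpha[:, None] * X)        # S = sum_i alpha_i x_i x_i^T
    lam, V = np.linalg.eigh(S)            # S is symmetric (not PSD in general)
    order = np.argsort(-np.abs(lam))      # order by absolute eigenvalue
    lam, V = lam[order], V[:, order]
    Z = V[:, :nu].T                       # unit-norm eigenvectors z_p
    gamma = lam[:nu] / np.sum(Z**2, axis=1)   # gamma_p = lambda_p / ||z_p||^2
    return gamma, Z, lam
```

For $\nu = n$ the reduced expansion reproduces the kernel expansion exactly, and for smaller $\nu$ the residual $\rho^2$ equals the sum of the squared discarded eigenvalues, as the theorem states.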
Proof. We use induction on $\nu$.

Step 1. We prove the claim in the case $\nu = 1$. The only candidates $(\gamma, z) \in \mathbb{R} \times \mathbb{R}^n$ for a minimum of $\rho = \rho(\gamma, z)$ are the critical points of $\rho^2$. These are characterized by the conditions $\partial \rho^2 / \partial z = 0$ (I) and $\partial \rho^2 / \partial \gamma = 0$ (II). Observe that we have
$$\rho^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j k_n^C(x_i, x_j) - 2 \sum_{i=1}^{m} \alpha_i \gamma\, k_n^C(x_i, z) + \gamma^2 k_n^C(z, z)$$
$$= C^2 \left[ \sum_{i,j=1}^{m} \alpha_i \alpha_j (x_i^\top x_j)^2 - 2 \sum_{i=1}^{m} \alpha_i \gamma\, (x_i^\top z)^2 + \gamma^2 (z^\top z)^2 \right].$$
So condition I is equivalent to
$$\sum_{i=1}^{m} \alpha_i (x_i^\top z)\, x_i = \gamma \|z\|^2 z. \tag{3.1}$$
Using the matrix $S$, equation 3.1 reads as
$$S z = \gamma \|z\|^2 z. \tag{3.2}$$
Condition II is equivalent to
$$2 \sum_{i=1}^{m} \alpha_i (x_i^\top z)^2 = 2 \gamma (z^\top z)^2 \iff \gamma \|z\|^4 = z^\top S z. \tag{3.3}$$
Thus, the only candidates for a minimum of $\rho$ are pairs $(\gamma, z)$ where $z$ is an eigenvector of $S$ with corresponding eigenvalue $\lambda$ and $\gamma = \lambda / \|z\|^2$. We now show that the minimum of $\rho$ (for $\nu = 1$) occurs if $\lambda = \lambda_1$, the eigenvalue of largest absolute size. To this end, consider $\rho^2$ as previously computed. The first term of $\rho^2$ equals
$$C^2 \sum_{i,j=1}^{m} \alpha_i \alpha_j (x_i^\top x_j)^2 = C^2 \sum_{i,j=1}^{m} \alpha_i \alpha_j\, x_i^\top x_j\, x_i^\top x_j = C^2 \sum_{i,j=1}^{m} \alpha_i \alpha_j \sum_{k,l=1}^{n} x_{ik} x_{jk} x_{il} x_{jl}$$
$$= C^2 \sum_{k,l=1}^{n} \left( \sum_{i=1}^{m} \alpha_i x_{ik} x_{il} \right) \left( \sum_{j=1}^{m} \alpha_j x_{jk} x_{jl} \right) = C^2\, \mathrm{tr}(S^2).$$
Therefore, we get
$$\rho^2 = C^2 \left[ \mathrm{tr}(S^2) - 2\gamma \sum_{i=1}^{m} \alpha_i\, z^\top x_i x_i^\top z + \gamma^2 (z^\top z)^2 \right]$$
$$= C^2 \left( \mathrm{tr}(S^2) - 2\gamma\, z^\top S z + \gamma^2 \|z\|^4 \right) = C^2 \left( \mathrm{tr}(S^2) - 2\gamma^2 \|z\|^2 z^\top z + \gamma^2 \|z\|^4 \right)$$
$$= C^2 \left( \mathrm{tr}(S^2) - \gamma^2 \|z\|^4 \right) = C^2 \left( \mathrm{tr}(S^2) - \lambda^2 \right),$$
which is minimized if we choose $z$ to be the eigenvector $z_1$ of $S$ corresponding to the eigenvalue $\lambda = \lambda_1$ and $\gamma_1 = \lambda_1 / \|z_1\|^2$. This proves the claim for $\nu = 1$.

Step 2. We prove the claim for $\nu > 1$, supposing it is true for $\nu - 1$. We assume $z_1, \ldots, z_{\nu-1}$ and $\gamma_1, \ldots, \gamma_{\nu-1}$ to be chosen as in the theorem. Let us define
$$x_{m+p} := z_p, \qquad \alpha_{m+p} := -\gamma_p, \qquad p = 1, \ldots, \nu - 1.$$
Then the error $\rho^2$, which we want to minimize with respect to $(\gamma, z) \in \mathbb{R} \times \mathbb{R}^n$, reads as
$$\rho^2 = \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) - \sum_{p=1}^{\nu-1} \gamma_p \Phi(z_p) - \gamma \Phi(z) \right\|^2$$
$$= C^2 \left[ \sum_{i,j=1}^{m+\nu-1} \alpha_i \alpha_j (x_i^\top x_j)^2 - 2 \sum_{i=1}^{m+\nu-1} \alpha_i \gamma\, (x_i^\top z)^2 + \gamma^2 (z^\top z)^2 \right].$$
From the first step, with $S' = \sum_{i=1}^{m+\nu-1} \alpha_i x_i x_i^\top$ instead of $S$, we get that $\rho^2$ is minimized if we set $z$ to be an eigenvector of $S'$ corresponding to the eigenvalue of $S'$ of largest absolute size. Because
$$S' = S - \sum_{p=1}^{\nu-1} \lambda_p \frac{z_p z_p^\top}{z_p^\top z_p}$$
has the same eigenvectors as $S$, but with $z_1, \ldots, z_{\nu-1} \in \ker S'$, we conclude that this is the case if $\lambda = \lambda_\nu$, $z = z_\nu$, and $\gamma = \gamma_\nu = \lambda_\nu / \|z_\nu\|^2$.

Because $\mathrm{tr}(S^2) = \sum_{i=1}^{n} \lambda_i^2$, we get $\rho^2 = 0$ for $\nu = n$.
4 The Inhomogeneous Case

Now assume that we are given the inhomogeneous kernel
$$K_n^{C,D}(x, x') = (C + D\, x^\top x')^2, \qquad x, x' \in \mathbb{R}^n,\; C, D > 0.$$
Then we can write
$$K_n^{C,D}(x, x') = (C + D\, x^\top x')^2 = C^2 \left( 1 + \frac{D}{C}\, x^\top x' \right)^2 = C^2 \left( \begin{pmatrix} \sqrt{D/C}\, x \\ 1 \end{pmatrix}^{\!\top} \begin{pmatrix} \sqrt{D/C}\, x' \\ 1 \end{pmatrix} \right)^{\!2} = k_{n+1}^C \left( \begin{pmatrix} \sqrt{D/C}\, x \\ 1 \end{pmatrix}, \begin{pmatrix} \sqrt{D/C}\, x' \\ 1 \end{pmatrix} \right).$$
If we identify $x \in \mathbb{R}^n$ with $\tilde{x} \in \mathbb{R}^{n+1}$ using the embedding
$$\eta\colon \mathbb{R}^n \to \mathbb{R}^{n+1}, \qquad x \mapsto \begin{pmatrix} \sqrt{D/C}\, x \\ 1 \end{pmatrix} =: \tilde{x},$$
we can express the inhomogeneous kernel $K_n^{C,D}$ as a homogeneous kernel $k_{n+1}^C$ on $\mathbb{R}^{n+1}$:
$$K_n^{C,D}(x, x') = k_{n+1}^C(\tilde{x}, \tilde{x}'). \tag{4.1}$$
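The chain of equalities above can be checked numerically: embedding two points with $\eta$ and evaluating the homogeneous kernel $k_{n+1}^C$ reproduces $K_n^{C,D}$ exactly. A minimal sketch (function names are ours):

```python
import numpy as np

def embed(x, C, D):
    """eta: R^n -> R^{n+1}, x -> (sqrt(D/C) x, 1)."""
    return np.append(np.sqrt(D / C) * x, 1.0)

def K_inhom(x, y, C, D):
    """K_n^{C,D}(x, y) = (C + D x^T y)^2."""
    return (C + D * np.dot(x, y)) ** 2

def k_homog(u, v, C):
    """k_{n+1}^C(u, v) = C^2 (u^T v)^2."""
    return C**2 * np.dot(u, v) ** 2
```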
We will prove the following:

Theorem 2. Let $C, D > 0$, $x_1, \ldots, x_m \in \mathbb{R}^n$ and $\alpha_1, \ldots, \alpha_m \in \mathbb{R}\setminus\{0\}$. Let $\lambda_1, \ldots, \lambda_{n+1} \in \mathbb{R}$ be the eigenvalues of $\tilde{S} = \sum_{i=1}^{m} \alpha_i \tilde{x}_i \tilde{x}_i^\top$, ordered by their absolute values, $|\lambda_1| \geq \cdots \geq |\lambda_{n+1}|$, and assume that there is a corresponding orthogonal eigenbasis $v_1, \ldots, v_{n+1}$ of $\tilde{S}$ whose last components are all equal to one. Then for given $\nu \in \{1, \ldots, n+1\}$, the squared distance (see equations 2.1 and 2.2)
$$\rho^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j K_n^{C,D}(x_i, x_j) - 2 \sum_{i=1}^{m} \sum_{p=1}^{\nu} \alpha_i \gamma_p K_n^{C,D}(x_i, z_p) + \sum_{p,q=1}^{\nu} \gamma_p \gamma_q K_n^{C,D}(z_p, z_q)$$
is minimized among all $\gamma_1, \ldots, \gamma_\nu \in \mathbb{R}$ and $z_1, \ldots, z_\nu \in \mathbb{R}^n$ if we choose the $r$th component of $z_p$ to be
$$z_{p,r} = \sqrt{\frac{C}{D}}\, v_{p,r}, \qquad r = 1, \ldots, n,$$
and set
$$\gamma_p := \frac{\lambda_p}{\|v_p\|^2}, \qquad p = 1, \ldots, \nu.$$
Remarks.

• As a consequence, we get the best possible approximation of the discriminant function $f$ using $\nu$ vectors:
$$f(x) = b + \sum_{i=1}^{m} \alpha_i K_n^{C,D}(x_i, x) \approx b + \sum_{p=1}^{\nu} \gamma_p K_n^{C,D}(z_p, x), \qquad x \in X,$$
where $b \in \mathbb{R}$ is the shift value. We recover $f$ exactly for $\nu = n + 1$.

• The supposition on $\tilde{S}$ is necessary to find an orthogonal basis of eigenvectors of $\tilde{S}$ in the subset $\eta(\mathbb{R}^n) \subset \mathbb{R}^{n+1}$. By rescaling, it is only necessary to have an orthogonal eigenbasis with nonvanishing last components. If we want to find $\nu < n + 1$ reduced-set vectors, the supposition may be weakened further: it suffices that $v_{i,n+1} \neq 0$ for $i = 1, \ldots, \nu$ only.

• From the definition of $\gamma_p, z_p$ for $p = 1, \ldots, \nu$, it is obvious that only those $\gamma_p, z_p$ where the eigenvalue $\lambda_p$ is different from 0 are relevant for the solution. So if the kernel of $\tilde{S}$ has dimension $d$, we get an exact solution ($\rho = 0$) even for $\nu = n + 1 - d$.

• It is possible to determine an optimal solution even if the supposition on $\tilde{S}$ is not valid: let $v_p \in \mathbb{R}^{n+1}$ be an eigenvector of $\tilde{S}$ of norm 1 with corresponding eigenvalue $\lambda_p \neq 0$ such that the last component of $v_p$ is 0. We replace the last component by $\varepsilon > 0$, name the resulting vector $v_p^\varepsilon$, and take
$$z_p^\varepsilon := \frac{1}{\varepsilon} \sqrt{\frac{C}{D}} \begin{pmatrix} v_{p,1} \\ \vdots \\ v_{p,n} \end{pmatrix}$$
to get $\eta(z_p^\varepsilon) = \tilde{z}_p^\varepsilon = \frac{1}{\varepsilon} v_p^\varepsilon$. Correspondingly, we have
$$\gamma_p^\varepsilon := \frac{\lambda_p}{\frac{D}{C}\|z_p^\varepsilon\|^2 + 1} = \frac{\varepsilon^2 \lambda_p}{1 + \varepsilon^2}.$$
Then the corresponding term in the decision function $f$ can be written as
$$\gamma_p^\varepsilon K_n^{C,D}(x, z_p^\varepsilon) = \frac{\varepsilon^2 \lambda_p}{1 + \varepsilon^2}\left( C + D\, x^\top z_p^\varepsilon \right)^2 = \frac{\lambda_p}{1 + \varepsilon^2}\left( \varepsilon C + \sqrt{CD}\; x^\top \begin{pmatrix} v_{p,1} \\ \vdots \\ v_{p,n} \end{pmatrix} \right)^{\!2},$$
which leads to the minimum $\rho$ for $\varepsilon \to 0$.

Proof. From theorem 1, with $\tilde{x}_1, \ldots, \tilde{x}_m \in \mathbb{R}^{n+1}$ instead of $x_1, \ldots, x_m$, we get that for given $\nu \in \{1, \ldots, n+1\}$, the squared distance
$$\sum_{i,j=1}^{m} \alpha_i \alpha_j k_{n+1}^C(\tilde{x}_i, \tilde{x}_j) - 2 \sum_{i=1}^{m} \sum_{p=1}^{\nu} \alpha_i \gamma_p k_{n+1}^C(\tilde{x}_i, v_p) + \sum_{p,q=1}^{\nu} \gamma_p \gamma_q k_{n+1}^C(v_p, v_q) \tag{4.2}$$
equals $C^2 \sum_{p=\nu+1}^{n+1} \lambda_p^2$ and is minimized if we choose $v_1, \ldots, v_\nu \in \mathbb{R}^{n+1}$ to be orthogonal eigenvectors of $\tilde{S}$ corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_\nu$ and set $\gamma_p = \lambda_p / \|v_p\|^2$ for $p = 1, \ldots, \nu$.

From the hypotheses on $\tilde{S}$, we know that we can assume the last components of all $v_1, \ldots, v_\nu \in \mathbb{R}^{n+1}$ to be equal to one. Now define $z_1, \ldots, z_\nu \in \mathbb{R}^n$ by setting their components $z_{p,r} = \sqrt{C/D}\, v_{p,r}$ for $r = 1, \ldots, n$. Then we have $v_p = \tilde{z}_p$ for $p = 1, \ldots, \nu$. Therefore, from equation 4.2, we get
$$C^2 \sum_{p=\nu+1}^{n+1} \lambda_p^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j k_{n+1}^C(\tilde{x}_i, \tilde{x}_j) - 2 \sum_{i=1}^{m} \sum_{p=1}^{\nu} \alpha_i \gamma_p k_{n+1}^C(\tilde{x}_i, \tilde{z}_p) + \sum_{p,q=1}^{\nu} \gamma_p \gamma_q k_{n+1}^C(\tilde{z}_p, \tilde{z}_q).$$
This is the minimum of $\rho^2$ among all $\gamma_p \in \mathbb{R}$ and $\tilde{z}_p \in \eta(\mathbb{R}^n) \subset \mathbb{R}^{n+1}$, $p = 1, \ldots, \nu$. Using equation 4.1, we see that this is equal to
$$\sum_{i,j=1}^{m} \alpha_i \alpha_j K_n^{C,D}(x_i, x_j) - 2 \sum_{i=1}^{m} \sum_{p=1}^{\nu} \alpha_i \gamma_p K_n^{C,D}(x_i, z_p) + \sum_{p,q=1}^{\nu} \gamma_p \gamma_q K_n^{C,D}(z_p, z_q).$$
This is exactly the squared distance $\rho^2$ we want to minimize.
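Theorem 2 and its proof suggest a direct construction, sketched below with our own naming: embed the support vectors, eigendecompose $\tilde{S}$, and rescale each eigenvector so that its last component is one (rescaling each eigenvector individually preserves orthogonality; this assumes the last components are nonvanishing, as the theorem requires). For $\nu = n + 1$ the discriminant function is recovered exactly:

```python
import numpy as np

def K_quad(A, B, C, D):
    """Inhomogeneous quadratic kernel (C + D x^T x')^2, row-wise."""
    return (C + D * A @ B.T) ** 2

def reduced_set_inhomogeneous(alpha, X, nu, C, D):
    """Reduced-set construction of theorem 2 (our naming)."""
    m, n = X.shape
    Xt = np.hstack([np.sqrt(D / C) * X, np.ones((m, 1))])  # embedded x~_i
    S = Xt.T @ (alpha[:, None] * Xt)      # S~ = sum_i alpha_i x~_i x~_i^T
    lam, U = np.linalg.eigh(S)
    order = np.argsort(-np.abs(lam))      # sort by |lambda|, descending
    lam, U = lam[order], U[:, order]
    V = U / U[-1, :]                      # rescale: last component of v_p is 1
    gamma = lam[:nu] / np.sum(V[:, :nu] ** 2, axis=0)  # lambda_p / ||v_p||^2
    Z = np.sqrt(C / D) * V[:n, :nu].T     # z_{p,r} = sqrt(C/D) v_{p,r}
    return gamma, Z
```

For generic data the last components are nonzero, so the rescaling is well defined; the degenerate case would require the $\varepsilon$-perturbation described in the remarks.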
References

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Burges, C. (1996). Simplified support vector decision rules. In L. Saitta (Ed.), 13th International Conference on Machine Learning (pp. 71–77). San Mateo, CA: Morgan Kaufmann.

Osuna, E., & Girosi, F. (1998). Reducing the run-time complexity of support vector machines. In International Conference on Pattern Recognition. Brisbane, Australia: IEEE.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Received September 5, 2003; accepted February 2, 2004.
LETTER
Communicated by A. Yuille
Stochastic Reasoning, Free Energy, and Information Geometry Shiro Ikeda [email protected] Institute of Statistical Mathematics, Tokyo 106-8569, Japan, and Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Toshiyuki Tanaka [email protected] Department of Electronics and Information Engineering, Tokyo Metropolitan University, Tokyo 192-0397, Japan
Shun-ichi Amari [email protected] RIKEN Brain Science Institute, Saitama 351-0198, Japan
Belief propagation (BP) is a universal method of stochastic reasoning. It gives exact inference for stochastic models with tree interactions and works surprisingly well even if the models have loopy interactions. Its performance has been analyzed separately in many fields, such as AI, statistical physics, information theory, and information geometry. This article gives a unified framework for understanding BP and related methods and summarizes the results obtained in many fields. In particular, BP and its variants, including tree reparameterization and the concave-convex procedure, are reformulated in information-geometrical terms, and their relations to the free energy function are elucidated from an information-geometrical viewpoint. We then propose a family of new algorithms. The stabilities of the algorithms are analyzed, and methods to accelerate them are investigated.

1 Introduction

Stochastic reasoning is a technique used in wide areas of AI, statistical physics, information theory, and others to estimate the values of random variables based on partial observation of them (Pearl, 1988). Here, a large number of mutually interacting random variables are represented in the form of a joint probability. However, the interactions often have specific structures such that some variables are independent of others when a set of variables is fixed. In other words, they are conditionally independent, and their interactions take place only through these conditioning variables. When such a structure is represented by a graph, it is called a graphical model (Lauritzen & Spiegelhalter, 1988; Jordan, 1999).

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1779–1810 (2004).

The problem is to infer the values of unobserved variables based on observed ones by reducing the conditional joint probability distribution to the marginal probability distributions. When the random variables are binary, their marginal probabilities are determined by the conditional expectation, and the problem is to calculate them. However, when the number of binary random variables is large, the calculation is computationally intractable from the definition. Apart from sampling methods, one way to overcome this problem is to use belief propagation (BP), proposed in AI (Pearl, 1988). It is known that BP gives exact inference when the underlying causal graphical structure does not include any loop, but it is also applied to loopy graphical models (loopy BP) and gives amazingly good approximate inference. The idea of loopy BP has been successfully applied to the decoding algorithms of turbo codes and low-density parity-check (LDPC) codes as well as to spin-glass models and Boltzmann machines. It should also be noted that some variants have been proposed to improve the convergence property of loopy BP. Tree reparameterization (TRP) (Wainwright, Jaakkola, & Willsky, 2002) is one of them, and the concave-convex computational procedure (CCCP) (Yuille, 2002; Yuille & Rangarajan, 2003) is another algorithm reported to have better convergence properties. The reason that loopy BP works so well is not fully understood, and there are a number of theoretical approaches that attempt to analyze its performance. The statistical physics framework uses the Bethe free energy (Yedidia, Freeman, & Weiss, 2001a) or something similar (Kabashima & Saad, 1999, 2001), and a geometrical theory was initiated by Richardson (2000) to understand turbo decoding.
Information geometry (Amari & Nagaoka, 2000), which has been successfully used in the study of the mean field approximation (Tanaka, 2000, 2001; Amari, Ikeda, & Shimokawa, 2001), gives a framework to elucidate the mathematical structure of BP (Ikeda, Tanaka, & Amari, 2002, in press). A similar framework is also given to describe TRP (Wainwright et al., 2002). The problem is interdisciplinary, as the various concepts and frameworks originate from AI, statistics, statistical physics, information theory, and information geometry. In this letter, we focus on undirected graphs, which are a general representation of graphical models, and give a unified framework to understand BP, CCCP, their variants, and the role of the free energy, based on information geometry. To this end, we propose a new function of the free energy type to which the Bethe free energy (Yedidia et al., 2001a) and that of Kabashima and Saad (2001) are closely related. By constraining the search space in proper ways, we obtain a family of algorithms including BP, CCCP, and a variant of CCCP without double loops. We also give their stability analysis. The error analysis was given in Ikeda et al. (in press).

This letter is organized as follows. In section 2, the problem is stated compactly, followed by preliminaries of information geometry. Section 3 introduces an information-geometrical view of BP, the characteristics of its equilibrium, and related algorithms, TRP and CCCP. We discuss the free energy related to BP in section 4, and new algorithms are proposed with stability analysis in section 5. Section 6 gives some extensions of BP from an information-geometrical viewpoint, and finally section 7 concludes the letter.

2 Problem and Geometrical Framework

2.1 Basic Problem and Strategy. Let $x = (x_1, \ldots, x_n)^\top$ be hidden and $y = (y_1, \ldots, y_m)^\top$ be observed random variables. We start with the case where each $x_i$ is binary, that is, $x_i \in \{-1, +1\}$, for simplicity. An extension to a wider class of distributions will be given in section 6.1. The conditional distribution of $x$ given $y$ is written as $q(x|y)$, and our task is to give a good inference of $x$ from the observations. We hereafter simply write $q(x)$ for $q(x|y)$ and omit $y$.

One natural inference of $x$ is the maximum a posteriori (MAP), that is,
$$\hat{x}_{\mathrm{map}} = \mathop{\mathrm{argmax}}_{x}\, q(x).$$
This minimizes the error probability that $\hat{x}_{\mathrm{map}}$ does not coincide with the true one. However, this calculation is not tractable when $n$ is large because the number of candidates for $x$ increases exponentially with respect to $n$. The maximization of the posterior marginals (MPM) is another inference that minimizes the number of component errors. If each marginal distribution $q(x_i)$, $i = 1, \ldots, n$, is known, the MPM inference decides $\hat{x}_i = +1$ when $q(x_i = +1) \geq q(x_i = -1)$ and $\hat{x}_i = -1$ otherwise. Let $\eta_i$ be the expectation of $x_i$ with respect to $q(x)$, that is,
$$\eta_i = E_q[x_i] = \sum_{x_i} x_i\, q(x_i).$$
The MPM inference gives $\hat{x}_i = \mathrm{sgn}\, \eta_i$, which is directly calculated if we know the marginal distributions $q(x_i)$ or the expectation
$$\eta = E_q[x].$$
This article focuses on the method to obtain a good approximation to $\eta$, which is equivalent to the inference of $\prod_{i=1}^{n} q(x_i)$.

For any $q(x)$, $\ln q(x)$ can be expanded as a polynomial of $x$ up to degree $n$, because every $x_i$ is binary. However, in many problems, mutual interactions of random variables exist only in specific manners. We represent $\ln q(x)$ in the form
$$\ln q(x) = h \cdot x + \sum_{r=1}^{L} c_r(x) - \psi_q,$$
Figure 1: Boltzmann machine. (Nodes $x_1, \ldots, x_8$; edge couplings $w_{ij}$.)
where $h \cdot x = \sum_i h_i x_i$ is the linear term, $c_r(x)$, $r = 1, \ldots, L$, is a simple polynomial representing the $r$th clique among related variables, and $\psi_q$ is the logarithm of the normalizing factor or the partition function, which is called the (Helmholtz) free energy,
$$\psi_q = \ln \sum_{x} \exp\left[ h \cdot x + \sum_{r} c_r(x) \right]. \tag{2.1}$$
In the case of Boltzmann machines (see Figure 1) and conventional spin-glass models, $c_r(x)$ is a quadratic function of $x_i$, that is, $c_r(x) = w_{ij} x_i x_j$, where $r$ is the index of the edge that corresponds to the mutual coupling between $x_i$ and $x_j$. It is more common to define the true distribution $q(x)$ of an undirected graph as a product of clique functions,
$$q(x) = \frac{1}{Z_q} \prod_{i=1}^{n} \phi_i(x_i) \prod_{r \in C} \phi_r(x_r),$$
where $C$ is the set of cliques. In our notation, $\phi_i(x_i)$ and $\phi_r(x_r)$ are expressed as follows:
$$h_i = \frac{1}{2} \ln \frac{\phi_i(x_i = +1)}{\phi_i(x_i = -1)}, \qquad c_r(x) = \ln \phi_r(x_r), \qquad \psi_q = \ln Z_q.$$
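For small $n$, the quantities above can be computed by exhaustive summation over the $2^n$ states. The following toy sketch (the model and its parameters are our own choices) computes $\eta$ and the MPM decision for a three-variable Boltzmann machine:

```python
import itertools
import numpy as np

# Brute-force MPM inference for a small Boltzmann machine
# q(x) ∝ exp(h·x + sum_(i,j) w_ij x_i x_j), with x_i in {-1, +1}.

def exact_expectations(h, edges):
    """Return eta_i = E_q[x_i] by summing over all 2^n states."""
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    energy = states @ h + sum(w * states[:, i] * states[:, j]
                              for (i, j), w in edges.items())
    p = np.exp(energy - energy.max())
    p /= p.sum()
    return p @ states

h = np.array([0.3, -0.2, 0.1])
edges = {(0, 1): 0.5, (1, 2): -0.4}   # a chain x1 - x2 - x3
eta = exact_expectations(h, edges)
x_mpm = np.sign(eta)                   # MPM decision: sign of eta_i
```

With no couplings, $q(x)$ factorizes and $\eta_i = \tanh(h_i)$, a useful sanity check on the brute-force sum.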
When there are only pairwise interactions, φr (xr ) has the form φr (xi , xj ). 2.2 Important Family of Distributions. Let us consider the set of probability distributions p(x; θ , v) = exp[θ · x + v · c(x) − ψ(θ , v)],
(2.2)
parameterized by θ and v, where v = (v1 , . . . , vL )T , c(x) = (c1 (x), . . . , cL (x))T , and v · c(x) = Lr=1 vr cr (x). We name the family of the probability distributions S, which is an exponential family, S = {p(x; θ , v) θ ∈ n , v ∈ L },
(2.3)
where its canonical coordinate system is (θ , v). The joint distribution q(x) is included in S, which is easily proved by setting θ = h and v = 1L = (1, . . . , 1)T , q(x) = p(x; h, 1L ). We define M0 as a submanifold of S specified by v = 0, M0 = {p0 (x; θ ) = exp[h · x + θ · x − ψ0 (θ )] | θ ∈ n }. Every distribution of M0 is an independent distribution, which includes no mutual interaction between xi and xj , (i = j), and the canonical coordinate system of M0 is θ . The product of marginal distributions of q(x), that is, n n i=1 q(xi ), is included in M0 . The ultimate goal is to derive i=1 q(xi ) or the corresponding coordinate θ of M0 . 2.3 Preliminaries of Information Geometry. In this section we give preliminaries of information geometry (Amari & Nagaoka, 2000; Amari, 2001). First, we define e–flat and m–flat submanifolds of S: e–flat submanifold: Submanifold M⊂S is said to be e–flat when, for all t ∈ [0, 1], q(x), p(x) ∈ M, the following r(x; t) belongs to M: ln r(x; t) = (1 − t) ln q(x) + t ln p(x) + c(t), where c(t) is the normalization factor. Obviously, {r(x; t) | t ∈ [0, 1]} is an exponential family connecting two distributions, p(x) and q(x). When an e–flat submanifold is a one–dimensional curve, it is called an e–geodesic. In terms of the e–affine coordinates θ , a submanifold M is e–flat when it is linear in θ . m–flat submanifold: Submanifold M⊂S is said to be m–flat when, for all t ∈ [0, 1], q(x), p(x) ∈ M, the following mixture r(x; t) belongs to M: r(x; t) = (1 − t)q(x) + tp(x).
When an m-flat submanifold is a one-dimensional curve, it is called an m-geodesic. Hence, the above mixture family is the m-geodesic connecting the two distributions. From the definition, any exponential family is an e-flat manifold. Therefore, $S$ and $M_0$ are e-flat.

Next, we define the m-projection (Amari & Nagaoka, 2000).

Definition 1. Let $M$ be an e-flat submanifold in $S$, and let $q(x) \in S$. The point in $M$ that minimizes the Kullback-Leibler (KL) divergence from $q(x)$ to $M$, denoted by
$$\pi_M \circ q(x) = \mathop{\mathrm{argmin}}_{p(x) \in M} D[q(x); p(x)], \tag{2.4}$$
is called the m-projection of $q(x)$ to $M$. Here, $D[\cdot\,; \cdot]$ is the KL divergence defined as
$$D[q(x); p(x)] = \sum_{x} q(x) \ln \frac{q(x)}{p(x)}.$$
The KL divergence satisfies $D[q(x); p(x)] \geq 0$, with $D[q(x); p(x)] = 0$ when and only when $q(x) = p(x)$ holds for every $x$. Although the symmetry $D[q; p] = D[p; q]$ does not hold in general, it can be regarded as an asymmetric squared distance. Finally, the m-projection theorem follows.

Theorem 1. Let $M$ be an e-flat submanifold in $S$, and let $q(x) \in S$. The m-projection of $q(x)$ to $M$ is unique and given by the point in $M$ such that the m-geodesic connecting $q(x)$ and $\pi_M \circ q$ is orthogonal to $M$ at this point, in the sense of the Riemannian metric due to the Fisher information matrix.

Proof. A detailed proof is found in Amari and Nagaoka (2000); the following is a sketch of it. First, we define the inner product and prove the orthogonality. A rigorous definition concerning the tangent space of a manifold is found in Amari and Nagaoka (2000). Let us consider a curve $p(x; \alpha) \in S$ parameterized by a real-valued parameter $\alpha$. Its tangent vector is represented by a random variable $\partial_\alpha \ln p(x; \alpha)$, where $\partial_\alpha = \partial / \partial \alpha$. For two curves $p_1(x; \alpha)$ and $p_2(x; \beta)$ that intersect at $\alpha = \beta = 0$, $p(x) = p_1(x; 0) = p_2(x; 0)$, we define the inner product of the two tangent vectors by
$$E_{p(x)}\big[\partial_\alpha \ln p_1(x; \alpha)\, \partial_\beta \ln p_2(x; \beta)\big]\Big|_{\alpha = \beta = 0}.$$
Note that this definition is consistent with the Riemannian metric defined by the Fisher information matrix.
Let $p^*(x)$ be the m-projection of $q(x)$ to $M$, and let the m-geodesic connecting $q(x)$ and $p^*(x)$ be $r_m(x; \alpha)$, defined as
$$r_m(x; \alpha) = \alpha q(x) + (1 - \alpha) p^*(x), \qquad \alpha \in [0, 1].$$
The derivative of $\ln r_m(x; \alpha)$ along the m-geodesic at $p^*(x)$ is
$$\partial_\alpha \ln r_m(x; \alpha)\Big|_{\alpha=0} = \frac{q(x) - p^*(x)}{r_m(x; \alpha)}\bigg|_{\alpha=0} = \frac{q(x) - p^*(x)}{p^*(x)}.$$
Let an e-geodesic included in $M$ be $r_e(x; \beta)$, defined as
$$\ln r_e(x; \beta) = \beta \ln p'(x) + (1 - \beta) \ln p^*(x) + c(\beta), \qquad p'(x) \in M,\; \beta \in [0, 1].$$
The derivative of $\ln r_e(x; \beta)$ along the e-geodesic at $p^*(x)$ is
$$\partial_\beta \ln r_e(x; \beta)\Big|_{\beta=0} = \ln p'(x) - \ln p^*(x) + c'(0).$$
The inner product becomes
$$E_{p^*(x)}\big[\partial_\alpha \ln r_m(x; \alpha)\, \partial_\beta \ln r_e(x; \beta)\big]\Big|_{\alpha=\beta=0} = \sum_{x} [q(x) - p^*(x)][\ln p'(x) - \ln p^*(x)]. \tag{2.5}$$
The fact that $p^*(x)$ is the m-projection of $q(x)$ to $M$ gives $\partial_\beta D[q; r_e(\beta)]\big|_{\beta=0} = 0$, that is,
$$\partial_\beta D[q(x); r_e(x; \beta)]\Big|_{\beta=0} = \sum_{x} q(x)[\ln p^*(x) - \ln p'(x)] - c'(0) = 0. \tag{2.6}$$
Moreover, since $D[p^*; r_e(\beta)]$ is minimized to 0 at $\beta = 0$, we have
$$\partial_\beta D[p^*(x); r_e(x; \beta)]\Big|_{\beta=0} = \sum_{x} p^*(x)[\ln p^*(x) - \ln p'(x)] - c'(0) = 0. \tag{2.7}$$
Equation 2.5 is proved to be zero by combining equations 2.6 and 2.7. Furthermore, this immediately proves the Pythagorean theorem,
$$D[q(x); p'(x)] = D[q(x); p^*(x)] + D[p^*(x); p'(x)],$$
which holds for every $p'(x) \in M$. Suppose the m-projection is not unique, and let another point be $p^{**}(x) \in M$, which satisfies $D[q; p^{**}] = D[q; p^*]$. Then the following equation holds:
$$D[q(x); p^{**}(x)] = D[q(x); p^*(x)] + D[p^*(x); p^{**}(x)] = D[q(x); p^*(x)].$$
This is true only if $p^*(x) = p^{**}(x)$, which proves the uniqueness of the m-projection.

2.4 MPM Inference. We show that the MPM inference is immediately given if the m-projection from $q(x)$ to $M_0$ is given. From the definition in equation 2.4, the m-projection of $q(x)$ to $M_0$ is characterized by $\theta^*$, which satisfies $p_0(x; \theta^*) = \pi_{M_0} \circ q(x)$. Hereafter, we denote the m-projection to $M_0$ in terms of the parameter $\theta$ as
$$\theta^* = \pi_{M_0} \circ q(x) = \mathop{\mathrm{argmin}}_{\theta} D[q(x); p_0(x; \theta)].$$
By taking the derivative of $D[q(x); p_0(x; \theta)]$ with respect to $\theta$, we have
$$\sum_{x} x\, q(x) - \partial_\theta \psi_0(\theta^*) = 0, \tag{2.8}$$
where $\partial_\theta$ denotes the derivative with respect to $\theta$. From the definition of the exponential family,
$$\partial_\theta \psi_0(\theta) = \partial_\theta \ln \sum_{x} \exp(h \cdot x + \theta \cdot x) = \sum_{x} x\, p_0(x; \theta). \tag{2.9}$$
We define the new parameter $\eta_0(\theta)$ in $M_0$ as
$$\eta_0(\theta) = \sum_{x} x\, p_0(x; \theta) = \partial_\theta \psi_0(\theta). \tag{2.10}$$
This is called the expectation parameter (Amari & Nagaoka, 2000). From equations 2.8, 2.9, and 2.10, the m-projection is equivalent to marginalizing $q(x)$. Since translation between $\theta$ and $\eta_0$ is straightforward for $M_0$, once the m-projection or, equivalently, the product of marginals of $q(x)$ is obtained, the MPM inference follows immediately.

3 BP and Variants: Information-Geometrical View

3.1 BP

3.1.1 Information-Geometrical View of BP. In this section, we give the information-geometrical view of BP. The well-known definition of BP is found elsewhere (Pearl, 1988; Lauritzen & Spiegelhalter, 1988; Weiss, 2000); details are not given here. We note that our derivation is based on BP for undirected graphs. For loopy graphs, it is well known that BP does not necessarily converge, and even if it does, the result is not equal to the true marginals.
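For the independent family $M_0$ over $x_i \in \{-1, +1\}$, the translation between $\theta$ and $\eta_0$ is explicit: $\eta_{0,i} = \tanh(h_i + \theta_i)$. Hence the m-projection of any distribution to $M_0$ amounts to computing its expectations and inverting the $\tanh$. A brute-force sketch (our naming, feasible only for small $n$):

```python
import itertools
import numpy as np

# m-projection to M_0 for binary x_i in {-1, +1}: moment matching.
# Since E_{p_0(x;theta)}[x_i] = tanh(h_i + theta_i), the projection of
# a distribution with expectations eta is theta*_i = artanh(eta_i) - h_i.

def expectations(log_p_unnorm, n):
    """E_p[x] for p(x) ∝ exp(log_p_unnorm(x)) over x in {-1,+1}^n."""
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    logp = np.array([log_p_unnorm(s) for s in states])
    w = np.exp(logp - logp.max())
    return (w[:, None] * states).sum(axis=0) / w.sum()

def m_project_to_M0(log_p_unnorm, h):
    eta = expectations(log_p_unnorm, len(h))
    return np.arctanh(eta) - h     # theta* with matching expectations
```

Projecting a distribution that is already in $M_0$ returns its own coordinate $\theta$, in line with the uniqueness of the m-projection.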
Figure 2: (A) Belief graph. (B) Graph with a single edge. (C) Graph with all edges.
Figure 2 shows three important graphs for BP. The belief graph in Figure 2A corresponds to $p_0(x; \theta)$, and that in Figure 2C corresponds to the true distribution $q(x)$. Figure 2B shows an important distribution that includes only a single edge. This distribution is defined as $p_r(x; \zeta_r)$, where
$$p_r(x; \zeta_r) = \exp[h \cdot x + c_r(x) + \zeta_r \cdot x - \psi_r(\zeta_r)], \qquad r = 1, \ldots, L.$$
This can be generalized without any change to the case when $c_r(x)$ is a polynomial. The set of distributions $p_r(x; \zeta_r)$ parameterized by $\zeta_r$ is an e-flat manifold defined as
$$M_r = \{ p_r(x; \zeta_r) \mid \zeta_r \in \mathbb{R}^n \}, \qquad r = 1, \ldots, L.$$
Its canonical coordinate system is $\zeta_r$. We also define the expectation parameter $\eta_r(\zeta_r)$ of $M_r$ as follows:
$$\eta_r(\zeta_r) = \partial_{\zeta_r} \psi_r(\zeta_r) = \sum_{x} x\, p_r(x; \zeta_r), \qquad r = 1, \ldots, L. \tag{3.1}$$
In $M_r$, only the $r$th edge is taken into account; all the other edges are replaced by a linear term $\zeta_r \cdot x$, and $p_0(x; \theta) \in M_0$ is used to integrate all the information from $p_r(x; \zeta_r)$, $r = 1, \ldots, L$, giving $\theta$, the parameter of $p_0(x; \theta)$, to infer $\prod_i q(x_i)$. In the iterative process of BP, the $\zeta_r$ of $p_r(x; \zeta_r)$, $r = 1, \ldots, L$, are modified by using the information of $\theta$, which is renewed by integrating the local information $\{\zeta_r\}$. Information geometry has elucidated its geometrical meaning for special graphs for error-correcting codes (Ikeda et al., in press; see also Richardson, 2000), and we give the framework for general graphs in the following.

BP is stated as follows. Let $p_r(x; \zeta_r^t)$ be the approximation to $q(x)$ at time $t$, which each $M_r$, $r = 1, \ldots, L$, specifies.

Information-Geometrical View of BP

1. Set $t = 0$, $\xi_r^t = 0$, $\zeta_r^t = 0$, $r = 1, \ldots, L$.
2. Increment $t$ by one, and set $\xi_r^{t+1}$, $r = 1, \ldots, L$, as follows:
$$\xi_r^{t+1} = \pi_{M_0} \circ p_r(x; \zeta_r^t) - \zeta_r^t. \tag{3.2}$$
3. Update $\theta^{t+1}$ and $\zeta_r^{t+1}$ as follows:
$$\zeta_r^{t+1} = \sum_{r' \neq r} \xi_{r'}^{t+1}, \qquad \theta^{t+1} = \sum_{r} \xi_r^{t+1} = \frac{1}{L-1} \sum_{r} \zeta_r^{t+1}.$$
4. Repeat steps 2 and 3 until convergence.

The algorithm is summarized as follows. Calculate iteratively:
$$\theta^{t+1} = \sum_{r} \big[ \pi_{M_0} \circ p_r(x; \zeta_r^t) - \zeta_r^t \big], \qquad \zeta_r^{t+1} = \theta^{t+1} - \big[ \pi_{M_0} \circ p_r(x; \zeta_r^t) - \zeta_r^t \big], \qquad r = 1, \ldots, L.$$
We have introduced two sets of parameters $\{\xi_r\}$ and $\{\zeta_r\}$. Let the converged point of BP be $\{\xi_r^*\}$, $\{\zeta_r^*\}$, and $\theta^*$, where $\theta^* = \sum_r \xi_r^* = \sum_r \zeta_r^* / (L-1)$ and $\theta^* = \xi_r^* + \zeta_r^*$. With these relations, the probability distribution $q(x)$ and its final approximations $p_0(x; \theta^*) \in M_0$ and $p_r(x; \zeta_r^*) \in M_r$ are described as
$$q(x) = \exp[h \cdot x + c_1(x) + \cdots + c_r(x) + \cdots + c_L(x) - \psi_q],$$
$$p_0(x; \theta^*) = \exp[h \cdot x + \xi_1^* \cdot x + \cdots + \xi_r^* \cdot x + \cdots + \xi_L^* \cdot x - \psi_0(\theta^*)],$$
$$p_r(x; \zeta_r^*) = \exp[h \cdot x + \xi_1^* \cdot x + \cdots + c_r(x) + \cdots + \xi_L^* \cdot x - \psi_r(\zeta_r^*)].$$
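The iteration above can be sketched directly for a small model, with the m-projection $\pi_{M_0}$ computed by brute force ($E_{p_0(x;\theta)}[x_i] = \tanh(h_i + \theta_i)$, so the projection returns $\theta_i = \operatorname{artanh}(E_p[x_i]) - h_i$). The example below uses a toy tree-structured Boltzmann machine of our own choosing; on a tree, the converged $p_0(x; \theta^*)$ carries the true marginals:

```python
import itertools
import numpy as np

h = np.array([0.3, -0.2, 0.1])
edges = [((0, 1), 0.5), ((1, 2), -0.4)]   # a tree: x1 - x2 - x3
n, L = len(h), len(edges)
states = np.array(list(itertools.product([-1, 1], repeat=n)))

def project_to_M0(zeta_r, edge):
    """theta = pi_{M0} o p_r(x; zeta_r): match expectations of p_r."""
    (i, j), w = edge
    logp = states @ (h + zeta_r) + w * states[:, i] * states[:, j]
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return np.arctanh(p @ states) - h

zeta = np.zeros((L, n))
for t in range(100):                       # BP iteration (steps 2 and 3)
    xi = np.array([project_to_M0(zeta[r], edges[r]) - zeta[r]
                   for r in range(L)])
    zeta = xi.sum(axis=0) - xi             # zeta_r = sum_{r' != r} xi_{r'}
theta = xi.sum(axis=0)

bp_marginals = np.tanh(h + theta)          # eta_0(theta*)
```

At convergence, the e-condition $\theta^* = \sum_r \zeta_r^* / (L-1)$ holds by construction of the update, and for this tree the expectations $\eta_0(\theta^*)$ agree with the exact marginals of $q(x)$.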
The idea of BP is to approximate $c_r(x)$ by $\xi_r^* \cdot x$ in $M_r$, taking the information from the $M_{r'}$ $(r' \neq r)$ into account. The independent distribution $p_0(x; \theta)$ integrates all the information.

3.1.2 Common BP Formulation and the Information-Geometrical Formulation. BP is generally described as a set of message-updating rules. Here we describe the correspondence between the common and the information-geometrical formulations. In graphs with pairwise interactions, messages and beliefs are updated as
$$m_{ij}^{t+1}(x_j) = \frac{1}{Z} \sum_{x_i} \phi_i(x_i)\, \phi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}^t(x_i),$$
$$b_i(x_i) = \frac{1}{Z}\, \phi_i(x_i) \prod_{k \in N(i)} m_{ki}^{t+1}(x_i),$$
where $Z$ is the normalization factor and $N(i)$ is the set of vertices connected to vertex $i$. The vector $\xi_r$ corresponds to $m_{ij}(x_j)$. More precisely, when $r$ is an edge connecting $i$ and $j$,
$$\xi_{r,j} = \frac{1}{2} \ln \frac{m_{ij}(x_j = +1)}{m_{ij}(x_j = -1)}, \qquad \xi_{r,i} = \frac{1}{2} \ln \frac{m_{ji}(x_i = +1)}{m_{ji}(x_i = -1)}, \qquad \xi_{r,k} = 0 \quad \text{for } k \neq i, j,$$
where $\xi_{r,i}$ denotes the $i$th component of $\xi_r$. Note that the $i$th component of $\xi_r$ is not generally 0 if the $r$th edge includes vertex $i$ and that equation 3.2 updates $m_{ij}(x_j)$ and $m_{ji}(x_i)$ simultaneously. Now it is not difficult to understand the following correspondences:
$$\theta_i = \sum_{r} \xi_{r,i} = \frac{1}{2} \sum_{k \in N(i)} \ln \frac{m_{ki}(x_i = +1)}{m_{ki}(x_i = -1)},$$
$$\zeta_{r,i} = \theta_i - \xi_{r,i} = \frac{1}{2} \sum_{k \in N(i) \setminus j} \ln \frac{m_{ki}(x_i = +1)}{m_{ki}(x_i = -1)},$$
where $\theta_i$ and $\zeta_{r,i}$ are the $i$th components of $\theta$ and $\zeta_r$, respectively, and $r$ corresponds to the edge connecting $i$ and $j$. Note that $\zeta_{r,k} = \theta_k$ holds for $k \neq i, j$.

3.1.3 Equilibrium of BP. The following theorem, proved in Ikeda et al. (in press), characterizes the equilibrium of BP.

Theorem 2. The equilibrium $(\theta^*, \{\zeta_r^*\})$ satisfies

1. m-condition: $\theta^* = \pi_{M_0} \circ p_r(x; \zeta_r^*)$, $r = 1, \ldots, L$;

2. e-condition: $\theta^* = \dfrac{1}{L-1} \displaystyle\sum_{r=1}^{L} \zeta_r^*$.
It is easy to check that the m-condition is satisfied at the equilibrium of the BP algorithm from equation 3.2 and $\theta^* = \zeta_r^* + \xi_r^*$. In order to check the e-condition, we note that $\xi_r^*$ corresponds to a message. If the same set of messages is used to calculate the belief of each vertex, the e-condition is automatically satisfied. Therefore, at each iteration of BP, the e-condition is satisfied. In some algorithms, multiple sets of messages are defined, and a different set is used to calculate each belief. In such cases, the e-condition plays an important role.

In order to have an information-geometrical view, we define two submanifolds $M^*$ and $E^*$ of $S$ (see equation 2.3) as follows:
$$M^* = \Big\{ p(x) \;\Big|\; p(x) \in S,\; \sum_{x} x\, p(x) = \sum_{x} x\, p_0(x; \theta^*) = \eta_0(\theta^*) \Big\},$$
$$E^* = \Big\{ p(x) = C\, p_0(x; \theta^*)^{t_0} \prod_{r=1}^{L} p_r(x; \zeta_r^*)^{t_r} \;\Big|\; \sum_{r=0}^{L} t_r = 1,\; t_r \in \mathbb{R} \Big\}, \tag{3.3}$$
where $C$ is the normalization factor. Note that $M^*$ and $E^*$ are an m-flat and an e-flat submanifold, respectively.
The geometrical implications of these conditions are as follows:

m–condition: The m–flat submanifold M*, which includes p_r(x; ζ*_r), r = 1, …, L, and p_0(x; θ*), is orthogonal to M_r, r = 1, …, L, and M_0; that is, they are the m–projections to each other.

e–condition: The e–flat submanifold E* includes p_0(x; θ*), p_r(x; ζ*_r), r = 1, …, L, and q(x).

The equivalence between the e–condition of theorem 2 and the geometrical one stated above is proved straightforwardly by setting t_0 = −(L − 1) and t_1 = ⋯ = t_L = 1 in equation 3.3. From the m–condition, Σ_x x p_0(x; θ*) = Σ_x x p_r(x; ζ*_r) holds, and from the definitions in equations 2.10 and 3.1, we have

$$\eta_{0}(\theta^{*})=\eta_{r}(\zeta_{r}^{*}),\quad r=1,\dots,L.\tag{3.4}$$

It is not difficult to show that equation 3.4 is the necessary and sufficient condition for the m–condition, and it implies not only that the m–projection of p_r(x; ζ*_r) to M_0 is p_0(x; θ*) but also that the m–projection of p_0(x; θ*) to M_r is p_r(x; ζ*_r), that is,

$$\zeta_{r}^{*}=\pi_{M_{r}}\circ p_{0}(x;\theta^{*}),\quad r=1,\dots,L,$$
where π_{M_r} denotes the m–projection to M_r. When BP converges, the e–condition and the m–condition are satisfied, but this does not necessarily imply q(x) ∈ M*, in other words, p_0(x; θ*) = Π_{i=1}^{n} q(x_i), because there is a discrepancy between M* and E*. This is shown schematically in Figure 3. It is well known that in graphs with tree structures, BP gives the true marginals, that is, q(x) ∈ M* holds. In this case, we have the following relation:

$$q(x)=\frac{\prod_{r=1}^{L}p_{r}(x;\zeta_{r}^{*})}{p_{0}(x;\theta^{*})^{L-1}}.\tag{3.5}$$

This relationship gives the following proposition.

Proposition 1. When q(x) is represented with a tree graph, q(x), p_0(x; θ*), and p_r(x; ζ*_r), r = 1, …, L, are included in M* and E* simultaneously.

This proposition shows that when a graph is a tree, q(x) and p_0(x; θ*) are included in M*, and the fixed point of BP is the correct solution. In the case of a loopy graph, q(x) ∉ M*, and the correct solution is not generally a fixed point of BP. However, we still hope that BP gives a good approximation to the correct marginals. The difference between the correct marginals and the BP solution is regarded as the discrepancy between E* and M*, and if we can
Stochastic Reasoning, Free Energy, and Information Geometry
1791
Figure 3: Structure of equilibrium.
quantitatively evaluate it, the error of the BP solution can be estimated. We have given a preliminary analysis in Ikeda et al. (2002, in press), which showed that the principal term of the error is directly related to the e–curvature (see Amari & Nagaoka, 2000, for the definition) of M*, which mainly reflects the influence of the shortest possible loops in the graph.

3.2 TRP. Some variants of BP have been proposed, and information geometry gives a general framework for understanding them. We begin with TRP (Wainwright et al., 2002). TRP selects a set of trees {T_i}, where each tree T_i consists of a set of edges, and renews the related parameters in the process of inference. Let the set of edges be L, and let T_i ⊂ L, i = 1, …, K, be its subsets, where each graph with the edges T_i does not have any loop. The choice of the sets {T_i} is arbitrary, but every edge must be included in at least one of the trees. In order to give the information-geometrical view, we use the parameters ζ_r, θ_r, r = 1, …, L, and θ. The information-geometrical view of TRP is given as follows:

Information-Geometrical View of TRP

1. Set t = 0, ζ_r^t = θ_r^t = 0, r = 1, …, L, and θ^t = 0.
2. For a tree T_i, construct a tree distribution p_{T_i}^t(x) as follows:

$$p_{T_i}^{t}(x)=C\,p_{0}(x;\theta^{t})\prod_{r\in T_i}\frac{p_{r}(x;\zeta_{r}^{t})}{p_{0}(x;\theta_{r}^{t})}=C\exp\Big[h\cdot x+\sum_{r\in T_i}c_{r}(x)+\Big(\sum_{r\in T_i}(\zeta_{r}^{t}-\theta_{r}^{t})+\theta^{t}\Big)\cdot x\Big].\tag{3.6}$$
By applying BP, calculate the marginal distribution of p_{T_i}^t(x), and let θ^{t+1} = π_{M_0} ∘ p_{T_i}^t(x). Then update θ_r^{t+1} and ζ_r^{t+1} as follows:

For r ∈ T_i: θ_r^{t+1} = θ^{t+1}, ζ_r^{t+1} = π_{M_r} ∘ p_{T_i}^t(x).
For r ∉ T_i: θ_r^{t+1} = θ_r^t, ζ_r^{t+1} = ζ_r^t.

3. Repeat step 2 for the trees T_j ∈ {T_i}.

4. Repeat steps 2 and 3 until θ_r^{t+1} = θ^{t+1} holds for every r and {ζ_r^{t+1}} converges.

Let us show that the e– and the m–conditions are satisfied at the equilibrium of TRP. Since p_{T_i}^t(x) is a tree distribution, BP in step 2 gives the exact marginal distributions. Moreover, from equations 3.5 and 3.6, we have

$$p_{T_i}^{t}(x)=C\,p_{0}(x;\theta^{t})\prod_{r\in T_i}\frac{p_{r}(x;\zeta_{r}^{t})}{p_{0}(x;\theta_{r}^{t})}=\frac{\prod_{r\in T_i}p_{r}(x;\zeta_{r}^{t+1})}{p_{0}(x;\theta^{t+1})^{|T_i|-1}},$$

where |T_i| is the cardinality of T_i. By comparing the second and third terms and using θ_r^{t+1} = θ^{t+1}, r ∈ T_i,

$$\sum_{r\in T_i}(\zeta_{r}^{t}-\theta_{r}^{t})+\theta^{t}=\sum_{r\in T_i}\zeta_{r}^{t+1}-(|T_i|-1)\theta^{t+1}=\sum_{r\in T_i}(\zeta_{r}^{t+1}-\theta_{r}^{t+1})+\theta^{t+1}.$$

Since Σ_{r∉T_i}(ζ_r^t − θ_r^t) does not change through step 2, we have the following relation, which shows that the e–condition holds at the convergent point of TRP:

$$\sum_{r}\zeta_{r}^{*}-(L-1)\theta^{*}=\sum_{r}(\zeta_{r}^{*}-\theta_{r}^{*})+\theta^{*}=\sum_{r}(\zeta_{r}^{t}-\theta_{r}^{t})+\theta^{t}=0.$$
When TRP converges, the operation of step 2 shows that each tree distribution has the same marginal distribution, which shows p∗Ti (x) ∈ M∗ , where p∗Ti (x) is the tree distribution constructed with the converged parameters. Since ζ ∗r = πMr ◦p∗Ti (x), r ∈ Ti holds, pr (x; ζ ∗r ) ∈ M∗ also holds for r = 1, . . . , L, which shows the m–condition is satisfied at the convergent point.
3.3 CCCP. CCCP is an iterative procedure for obtaining the minimum of a function represented as the difference of two convex functions (Yuille & Rangarajan, 2003). The idea of CCCP was applied to solve the inference problem of loopy graphs, where the Bethe free energy, which we will discuss in section 4, is the energy function (Yuille, 2002); strictly speaking, this is CCCP-Bethe, but in the following we refer to it simply as CCCP. The details of the derivation are given in the appendix, and CCCP is defined as follows in the information-geometrical framework.

Information-Geometrical View of CCCP

Inner loop: Given θ^t, calculate {ζ_r^{t+1}} by solving

$$\pi_{M_{0}}\circ p_{r}(x;\zeta_{r}^{t+1})=L\theta^{t}-\sum_{r}\zeta_{r}^{t+1},\quad r=1,\dots,L.\tag{3.7}$$

Outer loop: Given the set {ζ_r^{t+1}} resulting from the inner loop, calculate

$$\theta^{t+1}=L\theta^{t}-\sum_{r}\zeta_{r}^{t+1}.\tag{3.8}$$

From equations 3.7 and 3.8, one obtains

$$\theta^{t+1}=\pi_{M_{0}}\circ p_{r}(x;\zeta_{r}^{t+1}),\quad r=1,\dots,L,$$

which means that CCCP enforces the m–condition at each iteration. The e–condition, on the other hand, is satisfied only at the convergent point, which is easily verified by letting θ^{t+1} = θ^t = θ* in equation 3.8 to yield the e–condition (L − 1)θ* = Σ_r ζ*_r. One can therefore see that the inner and outer loops of CCCP solve the m–condition and the e–condition, respectively.

4 Free Energy Function

4.1 Bethe Free Energy. We have described the information-geometrical view of BP and related algorithms. This characterizes the equilibrium points, but it is not enough to describe the approximation accuracy and the dynamics of the algorithms. An energy function helps to clarify them, and several functions have been proposed for this purpose. The most popular one is the Bethe free energy. The Bethe free energy itself is well known in the statistical mechanics literature, being used in formulating the so-called Bethe approximation (Itzykson & Drouffe, 1989). As far as we know, Kabashima and Saad (2001) were the first to point out that BP can be derived by considering a variational extremization of a free energy. It was Yedidia et al. (2001a) who introduced to the machine-learning community the formulation of BP based on the Bethe
free energy. Following Yedidia et al. (2001a) and using their terminology, the definition of the free energy is given as follows:

$$F_{\beta}=\sum_{r}\sum_{x_{r}}b_{r}(x_{r})\ln\frac{b_{r}(x_{r})}{\exp[h_{i}x_{i}+h_{j}x_{j}+c_{r}(x)]}-\sum_{i}(l_{i}-1)\sum_{x_{i}}b_{i}(x_{i})\ln\frac{b_{i}(x_{i})}{\exp(h_{i}x_{i})}.$$

Here, x_r denotes the pair of vertices included in the edge r, b_i(x_i) and b_r(x_r) are a belief and a pairwise belief, respectively, and l_i is the number of neighbors of vertex i. From the definition, Σ_{x_i} b_i(x_i) = 1 and Σ_{x_r} b_r(x_r) = 1 are satisfied. In the information-geometrical formulation, b_r(x_r) = p_r(x_r; ζ_r), and by setting

$$p_{r}(x_{k};\zeta_{r})=p_{0}(x_{k};\theta),\quad k\notin r\text{th edge},$$

the Bethe free energy becomes

$$F_{\beta}=\sum_{r}[\zeta_{r}\cdot\eta_{r}(\zeta_{r})-\psi_{r}(\zeta_{r})]-(L-1)[\theta\cdot\eta_{0}(\theta)-\psi_{0}(\theta)].\tag{4.1}$$
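Equation 4.1 can be evaluated directly on a small example. The sketch below runs BP on a 3-node binary chain and checks that the Bethe free energy then equals −ψ_q, the tree-graph property that follows from proposition 1. The chain, its parameters, and the iteration count are illustrative assumptions; expectations are computed by enumeration.

```python
import numpy as np
from itertools import product

# Tree (chain) model: x_i in {-1,+1}, q(x) ~ exp(h.x + sum_r w_r x_i x_j).
n, edges = 3, [(0, 1), (1, 2)]
L = len(edges)
h = np.array([0.2, -0.1, 0.4])
w = np.array([0.5, -0.3])
states = np.array(list(product([-1, 1], repeat=n)), dtype=float)

def psi_eta(theta, active):
    """Log partition psi and mean eta of p ~ exp((h+theta).x + sum_{r in active} c_r(x))."""
    logp = states @ (h + theta)
    for r in active:
        i, j = edges[r]
        logp += w[r] * states[:, i] * states[:, j]
    psi = np.log(np.exp(logp).sum())
    return psi, np.exp(logp - psi) @ states

# BP in the message parameterization (section 3.1) until convergence.
xi = np.zeros((L, n))
for _ in range(200):
    zeta = xi.sum(axis=0) - xi
    xi = np.array([np.arctanh(psi_eta(zeta[r], [r])[1]) - h - zeta[r]
                   for r in range(L)])
theta = xi.sum(axis=0)
zeta = xi.sum(axis=0) - xi

# Bethe free energy in the form of equation 4.1.
psi0, eta0 = psi_eta(theta, [])
F_beta = -(L - 1) * (theta @ eta0 - psi0)
for r in range(L):
    psi_r, eta_r = psi_eta(zeta[r], [r])
    F_beta += zeta[r] @ eta_r - psi_r

psi_q = psi_eta(np.zeros(n), list(range(L)))[0]   # log partition of q
print(F_beta, -psi_q)   # equal on a tree
```

On a loopy graph the two quantities would generally differ, which is exactly the discrepancy between M* and E* discussed in section 3.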
In Yedidia et al. (2001a, 2001b), the following reducibility conditions (also called the marginalization conditions) are further imposed:

$$b_{i}(x_{i})=\sum_{x_{j}}b_{ij}(x_{i},x_{j}),\qquad b_{j}(x_{j})=\sum_{x_{i}}b_{ij}(x_{i},x_{j}).\tag{4.2}$$

These conditions are equivalent to the m–condition in equation 3.1, that is, η_r(ζ_r) = η_0(θ), r = 1, …, L, so that every ζ_r is no longer an independent variable but is dependent on θ. With these constraints, the Bethe free energy is simplified as follows:

$$F_{\beta}^{m}(\theta)=(L-1)\psi_{0}(\theta)-\sum_{r}\psi_{r}(\zeta_{r}(\theta))+\Big[\sum_{r}\zeta_{r}(\theta)-(L-1)\theta\Big]\cdot\eta_{0}(\theta).\tag{4.3}$$
At each step of the BP algorithm, equation 4.2 is not satisfied, but the e–condition is. Therefore, imposing equation 4.2 on the original BP immediately gives the equilibrium, and no free parameter is left. Without free parameters, derivatives cannot be taken, which precludes any further analysis in terms of the Bethe free energy. Thus,
it is important in any analysis based on the free energy to specify what the independent variables are, in order to provide a proper argument.

Finally, we mention the relation between the Bethe free energy and the conventional (Helmholtz) free energy ψ_q, the logarithm of the partition function of q(x) defined in equation 2.1. When the e–condition is satisfied, F_β^m(θ) becomes

$$F_{\beta}^{m}(\theta)=(L-1)\psi_{0}(\theta)-\sum_{r}\psi_{r}(\zeta_{r}(\theta))=-\Big\{\psi_{0}(\theta)+\sum_{r}[\psi_{r}(\zeta_{r}(\theta))-\psi_{0}(\theta)]\Big\}.$$

This formula shows that the Bethe free energy can be regarded as an approximation to the conventional free energy by a linear combination of ψ_0 and {ψ_r}. Moreover, if the graph is a tree, the result of proposition 1 shows that the Bethe free energy is equivalent to −ψ_q.

4.2 A New View on Free Energy. Instead of assuming equation 4.2, let us start from the free energy defined in equation 4.3 without any constraint on the parameters; that is, all of θ, ζ_1, …, ζ_L are free parameters:

$$F(\theta,\zeta_{1},\dots,\zeta_{L})=(L-1)\psi_{0}(\theta)-\sum_{r}\psi_{r}(\zeta_{r})+\Big[\sum_{r}\zeta_{r}-(L-1)\theta\Big]\cdot\eta_{0}(\theta).\tag{4.4}$$
The above function is rewritten in terms of the KL divergence as

$$F(\theta,\zeta_{1},\dots,\zeta_{L})=D[p_{0}(x;\theta);q(x)]-\sum_{r}D[p_{0}(x;\theta);p_{r}(x;\zeta_{r})]+C,$$

where C is a constant. The following theorem is easily derived.

Theorem 3. The equilibrium (θ*, {ζ*_r}) of BP is a critical point of F(θ, ζ_1, …, ζ_L).

Proof. By calculating ∂F/∂ζ_r = 0, we easily have

$$\eta_{r}(\zeta_{r})=\eta_{0}(\theta),$$

which is the m–condition. By calculating

$$\frac{\partial F}{\partial\theta}=0,\tag{4.5}$$

we are led to the e–condition (L − 1)θ = Σ_r ζ_r.

The theorem shows that equation 4.4 works as the free energy function without any constraint.

4.3 Relation to Other Free Energies. The function F(θ, ζ_1, …, ζ_L) works as a free energy function, but it is also important to compare it with other "free energies." First, we compare it with the one proposed by Kabashima and Saad (2001). It is a function of (ζ_1, …, ζ_L) and (ξ_1, …, ξ_L), given by

$$F_{KS}(\zeta_{1},\dots,\zeta_{L};\xi_{1},\dots,\xi_{L})=F(\theta,\zeta_{1},\dots,\zeta_{L})+\sum_{r}D[p_{0}(x;\theta);p_{0}(x;\zeta_{r}+\xi_{r})],$$

where θ = Σ_r ξ_r. It is clear from the definition that the choice of ξ_r that makes F_KS minimum is ξ_r = θ − ζ_r for all r, and F_KS then becomes equivalent to F.

Next, we consider the dual form of the free energy F_β in equation 4.1. The dual form is defined by introducing Lagrange multipliers (Yedidia et al., 2001a) and redefining the free energy as a function of them. The multipliers are defined on the reducibility conditions, b_i(x_i) = Σ_{x_j} b_ij(x_i, x_j) and b_j(x_j) = Σ_{x_i} b_ij(x_i, x_j), which are equivalent to η_r(ζ_r) = η_0(θ), the m–condition in the information-geometrical formulation. Let λ_r ∈ ℝ^n, r = 1, …, L, be the Lagrange multipliers; the free energy becomes

$$G(\theta,\{\zeta_{r}\},\{\lambda_{r}\})=F_{\beta}(\theta,\{\zeta_{r}\})-\sum_{r}\lambda_{r}\cdot[\eta_{r}(\zeta_{r})-\eta_{0}(\theta)],\quad\lambda_{r}\in\mathbb{R}^{n}.$$
The original extremal problem is equivalent to the extremal problem of G with respect to θ, {ζ_r}, and {λ_r}. The dual form G_β is derived by redefining G as a function of {λ_r}, where the extremal problems of θ and {ζ_r} are solved. By solving ∂_θ G = 0, we have

$$\theta(\{\lambda_{r}\})=\frac{1}{L-1}\sum_{r}\lambda_{r},$$

while ∂_{ζ_r} G = 0 gives

$$\zeta_{r}(\lambda_{r})=\lambda_{r}.$$

Finally, the dual form G_β becomes

$$G_{\beta}(\{\lambda_{r}\})=(L-1)\psi_{0}(\theta(\{\lambda_{r}\}))-\sum_{r}\psi_{r}(\zeta_{r}(\lambda_{r})).\tag{4.6}$$
Although F in equation 4.4 becomes equivalent to Gβ by assuming the e– condition, F is free from the e– and the m–conditions and is different from Gβ . From the definition of the Lagrange multipliers, Gβ is introduced to analyze the extremal problem of Fβ under the m–condition, where the e– condition is not satisfied. The m–constraint free energy Fβm in equation 4.3 shows that F is equivalent to Fβ under the m–condition. Finally, we summarize as follows: Under the m–condition, F is equivalent to Fβ , and under the e–condition, F is equivalent to the dual form Gβ . 4.4 Property of Fixed Points. Let us study the stability of the fixed point of Fβ or, equivalently, F under the m–condition. Since the m–condition is satisfied, every ζ r is a dependent variable of θ , and we consider the derivative with respect to θ . From the m–condition, we have
$$\eta_{r}(\zeta_{r})=\eta_{0}(\theta),\qquad\frac{\partial\zeta_{r}}{\partial\theta}=I_{r}(\zeta_{r})^{-1}I_{0}(\theta),\quad r=1,\dots,L.\tag{4.7}$$

Here, I_0(θ) and I_r(ζ_r) are the Fisher information matrices of p_0(x; θ) and p_r(x; ζ_r), respectively, which are defined as

$$I_{0}(\theta)=\partial_{\theta}\eta_{0}(\theta)=\partial_{\theta}^{2}\psi_{0}(\theta),\qquad I_{r}(\zeta_{r})=\partial_{\zeta_{r}}\eta_{r}(\zeta_{r})=\partial_{\zeta_{r}}^{2}\psi_{r}(\zeta_{r}),\quad r=1,\dots,L.$$

Equation 4.7 is proved as follows:

$$\eta_{r}(\zeta_{r}+\delta\zeta_{r})\simeq\eta_{r}(\zeta_{r})+I_{r}(\zeta_{r})\delta\zeta_{r}=\eta_{0}(\theta+\delta\theta)\simeq\eta_{0}(\theta)+I_{0}(\theta)\delta\theta\ \Longrightarrow\ \delta\zeta_{r}=I_{r}(\zeta_{r})^{-1}I_{0}(\theta)\delta\theta.\tag{4.8}$$

The condition of the equilibrium is equation 4.5, which yields the e–condition, and the second derivative gives the property around the stationary point, that is,

$$\frac{\partial^{2}F}{\partial\theta^{2}}=I_{0}(\theta)+I_{0}(\theta)\sum_{r}[I_{r}(\zeta_{r})^{-1}-I_{0}(\theta)^{-1}]I_{0}(\theta)+\Delta,\tag{4.9}$$

where Δ is a term related to the derivatives of the Fisher information matrices, which vanishes when the e–condition is satisfied. If equation 4.9 is positive definite at the stationary point, the Bethe free energy is at least locally minimized at the equilibrium. But it is not always positive definite. Therefore, the conventional gradient descent method on F_β or F may fail.
5 Algorithms and Their Convergences

5.1 e–constraint Algorithm. Since the equilibrium of BP is characterized by the e– and the m–conditions, there are two possible types of algorithms for finding the equilibrium. One is to constrain the parameters always to satisfy the e–condition and search for the parameters that satisfy the m–condition (e–constraint algorithm). The other is to constrain the parameters to satisfy the m–condition and search for the parameters that satisfy the e–condition (m–constraint algorithm). In this section, we discuss e–constraint algorithms. BP is an e–constraint algorithm, since the e–condition is satisfied at each step, but its convergence is not necessarily guaranteed. We give an alternative e–constraint algorithm that has a better convergence property. Let us begin by proposing a new cost function,

$$F_{e}(\{\zeta_{r}\})=\sum_{r}\|\eta_{0}(\theta)-\eta_{r}(\zeta_{r})\|^{2},\tag{5.1}$$

under the e–constraint θ = Σ_r ζ_r/(L − 1). If the cost function is minimized to 0, the m–condition is satisfied, and the point is an equilibrium. A naive method to minimize F_e is the gradient descent algorithm. The gradient is

$$\frac{\partial F_{e}}{\partial\zeta_{r}}=-2I_{r}(\zeta_{r})[\eta_{0}(\theta)-\eta_{r}(\zeta_{r})]+\frac{2}{L-1}I_{0}(\theta)\sum_{r'}[\eta_{0}(\theta)-\eta_{r'}(\zeta_{r'})].\tag{5.2}$$

If the derivative is available, ζ_r and θ are updated as

$$\zeta_{r}^{t+1}=\zeta_{r}^{t}-\delta\,\frac{\partial F_{e}}{\partial\zeta_{r}},\qquad\theta^{t+1}=\frac{1}{L-1}\sum_{r}\zeta_{r}^{t+1},$$

where δ is a small positive learning rate. It is not difficult to calculate η_0(θ), η_r(ζ_r), and I_0(θ); the remaining problem is to calculate the first term of equation 5.2. Fortunately, we have the relation

$$I_{r}(\zeta_{r})h=\lim_{\alpha\to0}\frac{\eta_{r}(\zeta_{r}+\alpha h)-\eta_{r}(\zeta_{r})}{\alpha}.$$

If η_0(θ) − η_r(ζ_r) is substituted for h, this becomes the first term of equation 5.2. Now we propose a new algorithm.

A New e–Constraint Algorithm

1. Set t = 0, θ^t = 0, ζ_r^t = 0, r = 1, …, L.
2. Calculate η_0(θ^t), I_0(θ^t), and η_r(ζ_r^t), r = 1, …, L.
3. Let h_r = η_0(θ^t) − η_r(ζ_r^t) and calculate η_r(ζ_r^t + αh_r) for r = 1, …, L, where α > 0 is small. Then calculate

$$g_{r}=\frac{\eta_{r}(\zeta_{r}^{t}+\alpha h_{r})-\eta_{r}(\zeta_{r}^{t})}{\alpha}.$$

4. Update ζ_r^{t+1} and θ^{t+1} as follows:

$$\zeta_{r}^{t+1}=\zeta_{r}^{t}-\delta\Big[-2g_{r}+\frac{2}{L-1}I_{0}(\theta^{t})\sum_{r'}h_{r'}\Big],\qquad\theta^{t+1}=\frac{1}{L-1}\sum_{r}\zeta_{r}^{t+1}.$$

5. If F_e({ζ_r}) = Σ_r ‖η_0(θ) − η_r(ζ_r)‖² > ε holds (ε is a threshold), set t + 1 → t and go to step 2.

This algorithm is an e–constraint algorithm and does not include double loops, which is similar to the BP algorithm, but we have introduced a new parameter α, which can affect the convergence. We have checked with small-scale numerical simulations that if α is sufficiently small, this problem can be avoided, but further theoretical analysis is needed. Another problem is that this algorithm can converge to any fixed point of BP, even one that is not a stable fixed point of BP. For example, when ζ_r and θ are extremely large, eventually every component of η_r and η_0 becomes close to 1, which is a trivial, useless fixed point of this algorithm. In order to avoid this, it is natural to use a Riemannian metric for the norm instead of the squared norm defined in equation 5.1. The local metric modifies the cost function to

$$F_{e}^{R}(\{\zeta_{r}\})=\sum_{r}[\eta_{0}(\theta)-\eta_{r}(\zeta_{r})]^{T}I_{0}(\theta_{0})^{-1}[\eta_{0}(\theta)-\eta_{r}(\zeta_{r})],$$

where θ_0 is the convergent point. Since I_0(θ_0)^{-1} diverges at the trivial fixed points mentioned above, we expect F_e^R({ζ_r}) to be a better cost function. The gradient can be calculated similarly by fixing θ_0, which is unknown; hence, we replace it by θ^t. The calculation of g_r should also be modified to

$$\tilde{g}_{r}=\frac{\eta_{r}(\zeta_{r}^{t}+\alpha I_{0}(\theta^{t})^{-1}h_{r})-\eta_{r}(\zeta_{r}^{t})}{\alpha}$$

from the point of view of the natural gradient method (Amari, 1998). We finally have

$$\zeta_{r}^{t+1}=\zeta_{r}^{t}-2\delta\Big[-\tilde{g}_{r}+\frac{1}{L-1}I_{0}(\theta^{t})^{-1}\sum_{r'}h_{r'}\Big].$$

Since I_0(θ) is a diagonal matrix, computation is simple.
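The squared-norm version of the algorithm above can be sketched concretely. In the code below, the small loopy model, the learning rate δ, the perturbation size α, and the iteration count are all illustrative assumptions; expectations are computed by enumeration, and g_r is obtained by the finite-difference relation in the text.

```python
import numpy as np
from itertools import product

# Toy loopy model: n = 3 binary nodes, q(x) ~ exp(h.x + sum_r w_r x_i x_j).
n, edges = 3, [(0, 1), (1, 2), (0, 2)]
L = len(edges)
h = np.array([0.1, -0.2, 0.3])
w = np.array([0.4, -0.5, 0.3])
states = np.array(list(product([-1, 1], repeat=n)), dtype=float)

def eta(theta, active):
    """Expectation parameter of p ~ exp((h+theta).x + sum_{r in active} c_r(x))."""
    logp = states @ (h + theta)
    for r in active:
        i, j = edges[r]
        logp += w[r] * states[:, i] * states[:, j]
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return p @ states

alpha, delta = 1e-4, 0.1
zeta = np.zeros((L, n))
for _ in range(10000):
    theta = zeta.sum(axis=0) / (L - 1)              # e-constraint
    eta0 = eta(theta, [])
    diagI0 = 1 - np.tanh(h + theta) ** 2            # I_0(theta) is diagonal
    hs = np.array([eta0 - eta(zeta[r], [r]) for r in range(L)])          # h_r
    gs = np.array([(eta(zeta[r] + alpha * hs[r], [r]) - eta(zeta[r], [r])) / alpha
                   for r in range(L)])              # g_r ~ I_r(zeta_r) h_r
    zeta = zeta - delta * (-2 * gs + (2 / (L - 1)) * diagI0 * hs.sum(axis=0))

theta = zeta.sum(axis=0) / (L - 1)
Fe = sum(((eta(theta, []) - eta(zeta[r], [r])) ** 2).sum() for r in range(L))
print(Fe)   # ~ 0: the m-condition holds, so this is a BP equilibrium
```

Because the e–constraint is rebuilt from {ζ_r} at every step, the iterate always stays in the e–constrained subspace, and driving F_e to zero delivers the m–condition, hence a BP fixed point.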
A naive idea is to repeat the following two steps:

Naive m–Constraint Algorithm

1. For r = 1, …, L,

$$\zeta_{r}^{t}=\pi_{M_{r}}\circ p_{0}(x;\theta^{t}).\tag{5.3}$$

2. Update the parameters as

$$\theta^{t+1}=L\theta^{t}-\sum_{r}\zeta_{r}^{t}.$$

Starting from θ^t, the algorithm finds {ζ_r^t} that satisfy the m–condition by equation 5.3, and θ^{t+1} is adjusted to satisfy the e–condition. This is a simple recursive algorithm without double loops. We call it the naive m–constraint algorithm. One may use an advanced iteration method that uses the new ζ_r^{t+1} instead of ζ_r^t. In this case, the algorithm is

$$\zeta_{r}^{t+1}=\pi_{M_{r}}\circ p_{0}(x;\theta^{t+1}),\quad\text{where}\quad\theta^{t+1}=L\theta^{t}-\sum_{r}\zeta_{r}^{t+1}.$$
In this algorithm, starting from θ^t, one should solve a nonlinear equation in θ^{t+1}, because {ζ_r^{t+1}} are functions of θ^{t+1}. This algorithm therefore uses double loops, an inner loop and an outer loop. This is the idea of CCCP, and it is also an m–constraint algorithm.

5.2.1 Stability of the Algorithms. Although the naive m–constraint algorithm and CCCP share the same equilibrium θ* and {ζ*_r}, their local stabilities at the equilibrium are different. It is reported that CCCP has superior properties in this respect. The local stability of BP was analyzed by Richardson (2000) and also by Ikeda et al. (in press) in geometrical terms. The stability condition of BP is given by conditions on the eigenvalues of a matrix defined by the Fisher information matrices. In this article, we give the local stability of the other algorithms. If we eliminate the intermediate variables {ζ_r} in the inner loop, the naive m–constraint algorithm is

$$\theta^{t+1}=L\theta^{t}-\sum_{r}\pi_{M_{r}}\circ p_{0}(x;\theta^{t}),\tag{5.4}$$

and CCCP is represented as

$$\theta^{t+1}=L\theta^{t}-\sum_{r}\pi_{M_{r}}\circ p_{0}(x;\theta^{t+1}).\tag{5.5}$$

In order to derive the variational equation at the equilibrium, we note that for the m–projection

$$\zeta_{r}=\pi_{M_{r}}\circ p_{0}(x;\theta),$$

a small perturbation δθ in θ induces δζ_r = I_r(ζ_r)^{-1} I_0(θ) δθ (see equation 4.8). The variational equations are hence, for equation 5.4,

$$\delta\theta^{t+1}=\Big[LE-\sum_{r}I_{r}(\zeta_{r})^{-1}I_{0}(\theta)\Big]\delta\theta^{t},$$

where E is the identity matrix, and for equation 5.5,

$$\delta\theta^{t+1}=L\Big[E+\sum_{r}I_{r}(\zeta_{r})^{-1}I_{0}(\theta)\Big]^{-1}\delta\theta^{t},$$

respectively. Let K be the matrix defined by

$$K=\frac{1}{L}\sum_{r}\sqrt{I_{0}(\theta)}\,I_{r}(\zeta_{r})^{-1}\sqrt{I_{0}(\theta)},$$

and let δθ̃^t be a new variable defined as

$$\delta\tilde{\theta}^{t}=\sqrt{I_{0}(\theta)}\,\delta\theta^{t}.$$

The variational equations for equations 5.4 and 5.5 are then

$$\delta\tilde{\theta}^{t+1}=L(E-K)\,\delta\tilde{\theta}^{t},\qquad\delta\tilde{\theta}^{t+1}=L(E+LK)^{-1}\delta\tilde{\theta}^{t},$$
respectively. The equilibrium is stable when the absolute values of the eigenvalues of the respective coefficient matrices are smaller than 1. Let λ_1, …, λ_n be the eigenvalues of K. They are all real and positive, since K is a symmetric positive-definite matrix. We note that the λ_i are close to 1 when I_r(ζ_r) ≈ I_0(θ), that is, when M_r is close to M_0. The following theorem shows that CCCP has a good convergence property.

Theorem 4. The equilibrium of the naive m–constraint algorithm in equation 5.4 is stable when

$$1+\frac{1}{L}>\lambda_{i}>1-\frac{1}{L},\quad i=1,\dots,n.$$

The equilibrium of CCCP is stable when the eigenvalues of K satisfy

$$\lambda_{i}>1-\frac{1}{L},\quad i=1,\dots,n.\tag{5.6}$$
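Theorem 4 can be illustrated numerically. In the sketch below, the eigenvalues λ_i of K are assumed values (not derived from actual Fisher information matrices); the linearized modes are L(1 − λ_i) for the naive algorithm and L/(1 + Lλ_i) for CCCP, following the variational equations above.

```python
import numpy as np

# Illustration of theorem 4 with assumed eigenvalues of K.  Here L = 4, so the
# naive algorithm requires 0.75 < lam_i < 1.25, while CCCP only needs
# lam_i > 0.75; the value lam[2] = 1.4 violates the former but not the latter.
L = 4
lam = np.array([0.9, 1.1, 1.4])

naive_modes = L * (1 - lam)        # linearized modes of the naive algorithm
cccp_modes = L / (1 + L * lam)     # linearized modes of CCCP

naive_stable = bool(np.all(np.abs(naive_modes) < 1))
cccp_stable = bool(np.all(np.abs(cccp_modes) < 1))
print(naive_stable, cccp_stable)   # False True
```

The one-sided CCCP condition reflects the absolute stability of the implicit update: however large Lλ_i becomes, the mode L/(1 + Lλ_i) stays below 1.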
Under the m–constraint, the Hessian of F(θ) at an equilibrium point is equal to (cf. equation 4.9)

$$\sqrt{I_{0}(\theta)}\,[LK-(L-1)E]\,\sqrt{I_{0}(\theta)},$$

so that the stability condition (see equation 5.6) for CCCP is equivalent to the condition that the equilibrium is a local minimum of F under the m–constraint, which is equivalent to the m–constraint Bethe free energy F_β^m(θ). The theorem therefore states that CCCP is locally stable around an equilibrium if and only if the equilibrium is a local minimum of F_β^m(θ), whereas the naive m–constraint algorithm is not necessarily stable even if the equilibrium is a local minimum. A similar result is obtained in Heskes (2003). It should be noted that the above local stability result for CCCP does not follow from the global convergence result given by Yuille (2002). Yuille has shown that CCCP decreases the cost function and converges to an extremal point of F_β^m(θ), which means the fixed point is not necessarily a local minimum but can be a saddle point. Our local linear analysis shows that a stable fixed point of CCCP is a local minimum of F_β^m(θ).

5.2.2 Natural Gradient and Discretization. Let us consider a gradient rule for updating θ to find a minimum of F under the m–condition,

$$\dot{\theta}=-\frac{\partial F(\theta)}{\partial\theta}.$$

When we have a metric to measure the distance in the space of θ, it is natural to use the metric for the gradient (natural gradient; see Amari, 1998). For statistical models, the Riemannian metric given by the Fisher information matrix is a natural choice, since it is derived from the KL divergence. The natural gradient version of the update rule is

$$\dot{\theta}=-I_{0}^{-1}(\theta)\frac{\partial F}{\partial\theta}=(L-1)\theta-\sum_{r}\pi_{M_{r}}\circ p_{0}(x;\theta).\tag{5.7}$$

For the implementation, it is necessary to discretize the continuous-time update rule. The "fully explicit" scheme of discretization (Euler's method) reads

$$\theta^{t+1}=\theta^{t}+\Delta t\Big[(L-1)\theta^{t}-\sum_{r}\pi_{M_{r}}\circ p_{0}(x;\theta^{t})\Big].\tag{5.8}$$

When Δt = 1, this is equivalent to the naive m–constraint algorithm (see equation 5.4). However, we do not necessarily have to let Δt = 1; instead, we may use an arbitrary positive value for Δt. We will show later how the convergence rate is affected by the choice of Δt.
The "fully implicit" scheme yields

$$\theta^{t+1}=\theta^{t}+\Delta t\Big[(L-1)\theta^{t+1}-\sum_{r}\pi_{M_{r}}\circ p_{0}(x;\theta^{t+1})\Big],\tag{5.9}$$

which, after rearrangement of terms, becomes

$$[1-\Delta t(L-1)]\,\theta^{t+1}=\theta^{t}-\Delta t\sum_{r}\pi_{M_{r}}\circ p_{0}(x;\theta^{t+1}).$$

When Δt = 1/L, this equation is equivalent to CCCP in equation 5.5. Again, we do not have to be bound to the choice Δt = 1/L. We will also show the relation between Δt and the convergence rate later.

We have just shown that the naive m–constraint algorithm and CCCP can be viewed as first-order discretization methods applied to the continuous-time natural gradient system shown in equation 5.7. The local stability result for CCCP proved in theorem 4 can also be understood as an example of the well-known absolute stability property of the fully implicit scheme applied to linear systems. It should also be noted that other, more sophisticated methods for solving ordinary differential equations, such as Runge-Kutta methods (possibly with adaptive step-size control) and the Bulirsch-Stoer method (Press, Teukolsky, Vetterling, & Flannery, 1992), are applicable for formulating m–constraint algorithms with better properties, for example, better stability. In this article, however, we do not discuss possible extensions along this line any further.
When t = 1/L, this equation is equivalent to CCCP in equation 5.5. Again, we do not have to be bound to the choice t = 1/L. We will also show the relation between t and the convergence rate later. We have just shown that the naive m–constraint algorithm and CCCP can be viewed as first-order methods of discretization applied to the continuoustime natural gradient system shown in equation 5.7. The local stability result for CCCP proved in theorem 4 can also be understood as an example of the well-known absolute stability property of the fully implicit scheme applied to linear systems. It should also be noted that other more sophisticated methods for solving ordinary differential equations, such as Runge-Kutta methods (possibly with adaptive step-size control) and the Bulirsch-Stoer method Press, Teukolsky, Vetterling, and Flannery (1992), are applicable for formulating m–constraint algorithms with better properties, for example, better stability. In this article, however, we do not discuss possible extension along this line any further. 5.2.3 Acceleration of m–Constraint Algorithms. equations 5.8 and 5.9 in this section. The variational equation for equation 5.8 is δ θ˜
t+1
We give the analysis of
t = {E − [LK − (L − 1)E]t}δ θ˜ .
Let λ1 ≤ λ2 ≤ · · · ≤ λn
(5.10)
be the eigenvalues of K. Then the convergence rate is improved by choosing an adequate t. The convergence rate is governed by the largest absolute values of the eigenvalues of E − [LK − (L − 1)E]t, which are given by µi = 1 − [Lλi − (L − 1)]t. From equation 5.10, we have µ1 ≥ µ2 ≥ · · · ≥ µn . The stability condition is |µi | < 1 for all i. At a locally stable equilibrium point, µ1 < 1 always holds, so that the algorithm is stable if µn > −1 holds. The convergence to a locally
1804
S. Ikeda, T. Tanaka, and S. Amari
stable equilibrium point is most accelerated when µ1 + µn = 0, which holds by taking topt =
2 L(λ1 + λn − 1) + 2
.
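The effect of Δt on the explicit scheme can be checked directly. In the sketch below, the eigenvalues λ_i of K are assumed values satisfying the CCCP stability bound; writing a_i = Lλ_i − (L − 1), the balancing condition µ_1 + µ_n = 0 is equivalent to Δt_opt = 2/(a_1 + a_n).

```python
import numpy as np

# Step-size tuning for the explicit scheme (eq. 5.8); lam are assumed
# eigenvalues of K with lam_1 <= ... <= lam_n, all above 1 - 1/L.
L = 4
lam = np.array([0.85, 1.0, 1.3])
a = L * lam - (L - 1)                 # a_i = L*lam_i - (L-1), all positive

def modes(dt):
    return 1 - a * dt                 # mu_i = 1 - a_i * dt

dt_opt = 2 / (a[0] + a[-1])           # balances mu_1 = -mu_n
rho_naive = np.abs(modes(1.0)).max()  # dt = 1: the naive m-constraint algorithm
rho_opt = np.abs(modes(dt_opt)).max()
print(rho_naive, rho_opt)             # about 1.2 (divergent) vs about 0.69
```

Here λ_n = 1.3 exceeds 1 + 1/L, so the naive choice Δt = 1 diverges, while the tuned step size makes the explicit scheme converge; this is the acceleration the section describes.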
The variational equation for equation 5.9 is

$$\delta\tilde{\theta}^{t+1}=\{E+[LK-(L-1)E]\Delta t\}^{-1}\delta\tilde{\theta}^{t},$$

and the convergence rate is governed by the largest of the absolute values of the eigenvalues of {E + [LK − (L − 1)E]Δt}^{-1}, which should be smaller than 1 for convergence. The eigenvalues are

$$\mu_{i}=\frac{1}{1+[L\lambda_{i}-(L-1)]\Delta t}.$$

We again have µ_1 ≥ µ_2 ≥ ⋯ ≥ µ_n. At a locally stable equilibrium point, 0 < µ_n and µ_1 < 1 always hold, so that the algorithm is always stable. In principle, the smaller µ_1 becomes, the faster the algorithm converges, so that taking Δt → +∞ yields the fastest convergence. However, the algorithm in this limit reduces to the direct evaluation of the e–condition under the m–constraint in one update step of the parameters. This is the fastest if it is possible, but it is usually infeasible for loopy graphs.

6 Extension

6.1 Extending the Framework to a Wider Class of Distributions. In this section, two important extensions of BP are given in the information-geometrical framework. First, we extend the model to the case where the marginal distribution of each vertex is an exponential family. A similar extension is given in Wainwright, Jaakkola, and Willsky (2003). Let t_i be the sufficient statistics of the marginal distribution of x_i, that is, q(x_i). The marginal distribution is in the family of distributions defined as

$$p(x_{i};\theta_{i})=\exp[\theta_{i}\cdot t_{i}-\varphi_{i}(\theta_{i})].$$

This family includes many important distributions, for example, the multinomial and the gaussian distributions. Let us define t = (t_1^T, …, t_n^T)^T and θ = (θ_1^T, …, θ_n^T)^T, and let the true distribution be

$$q(x)=\exp[h\cdot t+c(x)-\psi_{q}].$$

We can now redefine equation 2.2 as

$$p(x;\theta,v)=\exp[\theta\cdot t+v\cdot c(x)-\psi(\theta,v)],$$
and S in equation 2.3 as S = {p(x; θ, v) | θ ∈ Θ, v ∈ V}. When the problem is to infer the marginal distributions q(x_i) of q(x), we can redefine the BP algorithm in this new S by redefining M_0 and M_r. This extension based on the new definition is simple, and we do not give further details in this article.

6.2 Generalized Belief Propagation. In this section, we show the information-geometrical framework for generalized belief propagation (GBP; Yedidia et al., 2001b), which is an important extension of BP. A naive explanation of GBP is that the cliques are reformulated by subsets of L, the set of all the edges. This brings a new implementation of the algorithm and a different inference. In the information-geometrical formulation, we define c_s(x) as a new clique function, which summarizes the interactions of the edges in L_s, s = 1, …, L′, that is,

$$c_{s}(x)=\sum_{r\in L_{s}}c_{r}(x),$$

where L_s ⊆ L. The L_s may have overlaps, and they must be chosen to satisfy ∪_s L_s = L. GBP is a general framework that includes many possible cases. We categorize them into three important classes and give an information-geometrical framework for each:

Case 1. In the simplest case, each L_s does not have any loop. This is equivalent to TRP. As we saw in section 3.2, the algorithm is explained in the information-geometrical framework.

Case 2. In this case, each L_s can have loops, but there is no overlap, that is, L_s ∩ L_{s′} = ∅ for s ≠ s′. The extension to this case is also simple. We can apply information geometry by redefining M_r as M_s, where

$$M_{s}=\{p_{s}(x;\zeta_{s})=\exp[h\cdot x+c_{s}(x)+\zeta_{s}\cdot x-\psi_{s}(\zeta_{s})]\mid\zeta_{s}\in\mathbb{R}^{n}\}.$$

Since some loops are treated in a different way, the result might be different from that of BP.

Case 3. Finally, we describe the case where each L_s can have loops and overlaps with the other sets. In this case, we have to extend the framework. Suppose L_s and L_{s′} have an overlap, and both have loops. We explain this case with the example in Figure 4.
[Figure 4: Case 3. A graph with four vertices x_1, x_2, x_3, x_4 and five edges, labeled 1 to 5; the two edge subsets below overlap in edge 3.]
Let us first define the following distributions:

$$q(x)=\exp\Big[h\cdot x+\sum_{i=1}^{5}c_{i}(x)-\psi_{q}\Big]$$
$$p_{0}(x;\theta)=\exp[h\cdot x+\theta\cdot x-\psi_{0}(\theta)]\tag{6.1}$$
$$p_{1}(x;\zeta_{1})=\exp\Big[h\cdot x+\sum_{i=1}^{3}c_{i}(x)+\zeta_{1}\cdot x-\psi_{1}(\zeta_{1})\Big]\tag{6.2}$$
$$p_{2}(x;\zeta_{2})=\exp\Big[h\cdot x+\sum_{i=3}^{5}c_{i}(x)+\zeta_{2}\cdot x-\psi_{2}(\zeta_{2})\Big].\tag{6.3}$$

Even if ζ_1, ζ_2, and θ satisfy the e–condition as θ = ζ_1 + ζ_2, this does not imply that

$$C\,\frac{p_{1}(x;\zeta_{1})\,p_{2}(x;\zeta_{2})}{p_{0}(x;\theta)}$$

is equivalent to q(x), since c_3(x) is counted twice. Therefore, we introduce another model p_3(x; ζ_3), which has the following form:

$$p_{3}(x;\zeta_{3})=\exp[h\cdot x+c_{3}(x)+\zeta_{3}\cdot x-\psi_{3}(\zeta_{3})].\tag{6.4}$$

Now,

$$C\,\frac{p_{1}(x;\zeta_{1})\,p_{2}(x;\zeta_{2})}{p_{3}(x;\zeta_{3})}$$

becomes equal to q(x), where ζ_3 = ζ_1 + ζ_2 is the e–condition. Next, we look at the m–condition. The original form of the m–condition is

$$\sum_{x}x\,p_{0}(x;\theta)=\sum_{x}x\,p_{s}(x;\zeta_{s}),$$
but in this case, this form is not enough. We need a further condition, namely that the pairwise marginal

$$p_{s}(x_{2},x_{3};\zeta_{s})=\sum_{x_{1},x_{4}}p_{s}(x;\zeta_{s})$$

should be the same for s = 1, 2, 3. The models in equations 6.1, 6.2, 6.3, and 6.4 are not sufficient, since we do not have enough parameters to specify a joint distribution of (x_2, x_3), and the model must be extended. In the binary case, we can extend the models by adding one variable as follows:

$$p_{1}(x;\zeta_{1},v_{1})=\exp\Big[h\cdot x+\sum_{i=1}^{3}c_{i}(x)+\zeta_{1}\cdot x+v_{1}x_{2}x_{3}-\psi_{1}(\zeta_{1},v_{1})\Big]$$
$$p_{2}(x;\zeta_{2},v_{2})=\exp\Big[h\cdot x+\sum_{i=3}^{5}c_{i}(x)+\zeta_{2}\cdot x+v_{2}x_{2}x_{3}-\psi_{2}(\zeta_{2},v_{2})\Big]$$
$$p_{3}(x;\zeta_{3},v_{3})=\exp[h\cdot x+c_{3}(x)+\zeta_{3}\cdot x+v_{3}x_{2}x_{3}-\psi_{3}(\zeta_{3},v_{3})],$$

and the m–condition becomes

$$\sum_{x}x\,p_{0}(x;\theta)=\sum_{x}x\,p_{s}(x;\zeta_{s},v_{s}),\quad s=1,2,3,$$
$$\sum_{x}x_{2}x_{3}\,p_{1}(x;\zeta_{1},v_{1})=\sum_{x}x_{2}x_{3}\,p_{2}(x;\zeta_{2},v_{2})=\sum_{x}x_{2}x_{3}\,p_{3}(x;\zeta_{3},v_{3}).$$

We revisit the e–condition, which is now extended as

$$\zeta_{3}=\zeta_{1}+\zeta_{2},\qquad v_{3}=v_{1}+v_{2}.$$

This is a simple example, but we can describe any GBP problem in the information-geometrical framework in a similar way.

7 Conclusion

Stochastic reasoning is an important technique, widely used for graphical models, with many interesting applications. BP is a useful method for it, and in order to analyze its behavior and give it a theoretical foundation, a variety of approaches have been proposed from AI, statistical physics, information theory, and information geometry. We have shown a unified framework for understanding various interdisciplinary concepts and algorithms from the viewpoint of information geometry. Since information geometry captures the essential structure of the manifold of probability distributions, we have succeeded in clarifying the intrinsic geometrical structures of, and the differences among, the various algorithms proposed so far. The BP solution is characterized by the e– and the m–conditions. We have shown that BP and TRP explore the solution in the subspace where the
e–condition is satisfied, while CCCP does so in the subspace where the m–condition is satisfied. This analysis makes it possible to obtain new, efficient variants of these algorithms. We have proposed new e– and m–constraint algorithms. Possible acceleration methods for the m–constraint algorithm and CCCP are given, with local stability and convergence rate analysis. We have clarified the relations among the free-energy-like functions and have proposed a new one. Finally, we have shown possible extensions of BP from an information-geometrical viewpoint. This work is a first step toward an information-geometrical understanding of BP. By using this framework, we expect that further understanding and new improvements of the methods will emerge.

Appendix: Information-Geometrical View of CCCP

In this section, we derive the information-geometrical view of CCCP. The following two theorems play important roles in CCCP.

Theorem 5 (Yuille & Rangarajan, 2003, sect. 2). Let E(x) be an energy function with bounded Hessian ∂²E(x)/∂x∂x. Then we can always decompose it into the sum of a convex function and a concave function.

Theorem 6 (Yuille & Rangarajan, 2003, sect. 2). Consider an energy function E(x) (bounded below) of the form E(x) = E_vex(x) + E_cave(x), where E_vex(x) and E_cave(x) are convex and concave functions of x, respectively. Then the discrete iterative CCCP algorithm x_t → x_{t+1} given by

∇E_vex(x_{t+1}) = −∇E_cave(x_t)

is guaranteed to monotonically decrease the energy E(x) as a function of time and hence to converge to a minimum or saddle point of E(x) (or even a local maximum if it starts at one).

The idea of CCCP was applied to solve the inference problem of loopy graphs, where the Bethe free energy F_β in equation 4.1 is the energy function (Yuille, 2002). The convex and concave functions are defined as follows:

F_β(θ, {ζ_r}) = Σ_r [ζ_r·η_r(ζ_r) − ψ_r(ζ_r)] − (L − 1)[θ·η_0(θ) − ψ_0(θ)]
= F_vex(θ, {ζ_r}) + F_cave(θ, {ζ_r}),

F_vex(θ, {ζ_r}) = Σ_r [ζ_r·η_r(ζ_r) − ψ_r(ζ_r)] + [θ·η_0(θ) − ψ_0(θ)],
F_cave(θ) = −L[θ·η_0(θ) − ψ_0(θ)].

When the m–condition is satisfied, F_vex is a function of θ alone. Next, since η_0 and θ are in one-to-one correspondence, let η_0 be the coordinate system. The
gradient of F_vex and F_cave is given as

∇_{η_0} F_vex(η_0) = θ + Σ_r ζ_r,    −∇_{η_0} F_cave(η_0) = Lθ.
Finally, the CCCP algorithm is written as

∇_{η_0} F_vex(η_0^{t+1}) = −∇_{η_0} F_cave(η_0^t),

that is,

θ^{t+1} + Σ_r ζ_r^{t+1} = Lθ^t.   (A.1)
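To make theorem 6 concrete, here is a minimal numerical sketch (our own toy example, not from the article): take the scalar energy E(x) = x⁴/4 − 2x², decomposed as the convex part E_vex(x) = x⁴/4 plus the concave part E_cave(x) = −2x². The CCCP step ∇E_vex(x_{t+1}) = −∇E_cave(x_t) becomes x_{t+1}³ = 4x_t, and the energy decreases monotonically to the minimum at x = 2.

```python
import numpy as np

# Toy CCCP run on E(x) = x**4/4 - 2*x**2 (our example, not the Bethe free energy):
# Evex(x) = x**4/4 is convex, Ecave(x) = -2*x**2 is concave, and the CCCP step
# grad Evex(x_{t+1}) = -grad Ecave(x_t) reads x_{t+1}**3 = 4*x_t.
def energy(x):
    return x ** 4 / 4 - 2 * x ** 2

x = 0.5                       # arbitrary starting point
energies = [energy(x)]
for _ in range(50):
    x = np.cbrt(4 * x)        # CCCP update
    energies.append(energy(x))

print(abs(x - 2.0) < 1e-9)              # converged to the minimizer x = 2 -> True
print(all(np.diff(energies) <= 1e-12))  # monotone energy decrease -> True
```

The monotone decrease illustrates the guarantee of theorem 6; the inner/outer loop structure of equation A.1 plays the same role on the Bethe free energy.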
Since the m–condition is not satisfied in general, the inner loop solves the condition, while the outer loop updates the parameters as in equation A.1.

Acknowledgments

We thank the anonymous reviewers for valuable feedback. This work was supported by Grants-in-Aid for Scientific Research 14084208 and 14084209, MEXT, Japan, and 14654017, JSPS, Japan.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Trans. Information Theory, 47, 1701–1711.
Amari, S., Ikeda, S., & Shimokawa, H. (2001). Information geometry and mean field approximation: The α-projection approach. In M. Opper & D. Saad (Eds.), Advanced mean field methods—Theory and practice (pp. 241–257). Cambridge, MA: MIT Press.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI: AMS, and New York: Oxford University Press.
Heskes, T. (2003). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.
Ikeda, S., Tanaka, T., & Amari, S. (2002). Information geometrical framework for analyzing belief propagation decoder. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 407–414). Cambridge, MA: MIT Press.
Ikeda, S., Tanaka, T., & Amari, S. (in press). Information geometry of turbo codes and low-density parity-check codes. IEEE Trans. Information Theory.
Itzykson, C., & Drouffe, J.-M. (1989). Statistical field theory. Cambridge: Cambridge University Press.
Jordan, M. I. (1999). Learning in graphical models. Cambridge, MA: MIT Press.
Kabashima, Y., & Saad, D. (1999). Statistical mechanics of error-correcting codes. Europhysics Letters, 45, 97–103.
Kabashima, Y., & Saad, D. (2001). The TAP approach to intensive and extensive connectivity systems. In M. Opper & D. Saad (Eds.), Advanced mean field methods—Theory and practice (pp. 65–84). Cambridge, MA: MIT Press.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Richardson, T. J. (2000). The geometry of turbo-decoding dynamics. IEEE Trans. Information Theory, 46, 9–23.
Tanaka, T. (2000). Information geometry of mean-field approximation. Neural Computation, 12, 1951–1968.
Tanaka, T. (2001). Information geometry of mean-field approximation. In M. Opper & D. Saad (Eds.), Advanced mean field methods—Theory and practice (pp. 259–273). Cambridge, MA: MIT Press.
Wainwright, M., Jaakkola, T., & Willsky, A. (2002). Tree-based reparameterization for approximate inference on loopy graphs. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1001–1008). Cambridge, MA: MIT Press.
Wainwright, M., Jaakkola, T., & Willsky, A. (2003). Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Available on-line at: http://www.research.microsoft.com/conferences/aistats2003/.
Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12, 1–41.
Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001a). Bethe free energy, Kikuchi approximations, and belief propagation algorithms (Tech. Rep. No. TR2001-16). Cambridge, MA: Mitsubishi Electric Research Laboratories.
Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001b). Generalized belief propagation. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.
Yuille, A. L. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.
Yuille, A. L., & Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.

Received August 12, 2003; accepted March 3, 2004.
LETTER
Communicated by Sebastian Seung
Blind Separation of Positive Sources by Globally Convergent Gradient Search Erkki Oja [email protected] Neural Networks Research Centre, Helsinki University of Technology, 02015 HUT, Finland
Mark Plumbley [email protected] Department of Electrical Engineering, Queen Mary, University of London, London E1 4NS, U.K.
The instantaneous noise-free linear mixing model in independent component analysis is largely a solved problem under the usual assumption of independent nongaussian sources and full column rank mixing matrix. However, with some prior information on the sources, like positivity, new analysis and perhaps simplified solution methods may yet become possible. In this letter, we consider the task of independent component analysis when the independent sources are known to be nonnegative and well grounded, which means that they have a nonzero pdf in the region of zero. It can be shown that in this case, the solution method is basically very simple: an orthogonal rotation of the whitened observation vector into nonnegative outputs will give a positive permutation of the original sources. We propose a cost function whose minimum coincides with nonnegativity and derive the gradient algorithm under the whitening constraint, under which the separating matrix is orthogonal. We further prove that in the Stiefel manifold of orthogonal matrices, the cost function is a Lyapunov function for the matrix gradient flow, implying global convergence. Thus, this algorithm is guaranteed to find the nonnegative well-grounded independent sources. The analysis is complemented by a numerical simulation, which illustrates the algorithm.
Neural Computation 16, 1811–1825 (2004)   © 2004 Massachusetts Institute of Technology

1 Introduction

The problem of independent component analysis (ICA) has been studied by many authors in recent years (for a review, see Hyvärinen, Karhunen, & Oja, 2001; Amari & Cichocki, 2002). In the simplest form of ICA, we assume that we have a sequence of observations {x(k)}, which are samples of a random
observation vector x generated according to

x = As,   (1.1)
where s = (s_1, . . . , s_n)^T is a vector of real independent random variables (the sources), all but perhaps one of them nongaussian, and A is a nonsingular n × n real mixing matrix. The task in ICA is to identify A given just the observation sequence, using the assumption of independence of the s_i's, and hence to construct an unmixing matrix B = RA^{-1} giving y = Bx = BAs = Rs, where R is a matrix that permutes and scales the sources. Typically, we assume that the sources have unit variance, with any scaling factor being absorbed into the mixing matrix A, so y will be a permutation of the s with just a sign ambiguity. Common cost functions for ICA are based on maximizing nongaussianities of the elements of y, and they may involve approximations by higher-order cumulants such as kurtosis. The observations x are usually assumed to be zero mean, or transformed to be so, and are commonly prewhitened by some matrix z = Vx so that E{zz^T} = I before an optimization algorithm is applied to find the separating matrix. This basic ICA model can be considered solved, with a multitude of practical algorithms and software implementations.

However, if one makes some further assumptions that restrict or extend the model, then there is still ground for new analysis and solution methods. One such assumption is positivity or nonnegativity of the sources and perhaps of the mixing coefficients. Nonnegativity is a natural condition for many real-world applications, for example, in the analysis of images (Parra, Spence, Sajda, Ziehe, & Müller, 2000; Lee, Lee, Choi, & Lee, 2001), text (Tsuge, Shishibori, Kuroiwa, & Kita, 2001), or air quality (Henry, 2002). The constraint of nonnegative sources, perhaps with an additional constraint of nonnegativity on the mixing matrix A, is often known as positive matrix factorization (Paatero & Tapper, 1994) or nonnegative matrix factorization (Lee & Seung, 1999).
A nonnegativity constraint has been suggested for a number of other neural network models too (Xu, 1993; Fyfe, 1994; Harpur, 1997; Charles & Fyfe, 1998). We refer to the combination of nonnegativity and independence assumptions on the sources as nonnegative independent component analysis. Recently, one of us considered the nonnegativity assumption on the sources (Plumbley, 2002, 2003) and introduced an alternative way of approaching the ICA problem, as follows. We call a source s_i nonnegative if Pr(s_i < 0) = 0, and such a source will be called well grounded if Pr(s_i < δ) > 0 for any δ > 0, that is, s_i has nonzero pdf all the way down to zero. The following key result was proven (Plumbley, 2002):

Theorem 1. Suppose that s is a vector of nonnegative well-grounded independent unit-variance sources s_i, i = 1, . . . , n, and y = Us, where U is a square orthonormal rotation, that is, U^T U = I. Then U is a permutation matrix, that is,
the elements y_j of y are a permutation of the sources s_i, if and only if all y_j are nonnegative.

Actually, the requirement for independence of the sources in theorem 1 is a technicality that simplifies the proof but could be relaxed to second-order independence, or uncorrelatedness. The proof would be lengthy, however, and we wished to rely on the existing theorem 1 in this letter. The result of theorem 1 can be used for a simple solution of the nonnegative ICA problem. Note that y = Us can also be written as y = Wz, with z the prewhitened observation vector and W an unknown orthogonal (rotation) matrix. It therefore suffices to find an orthogonal matrix W for which y = Wz is nonnegative. This brings an additional benefit over other ICA methods that we know of: if successful, we always have a positive permutation of the sources, since both s and y are nonnegative. The sign ambiguity present in usual ICA vanishes here.

Plumbley (2002) further suggested that a suitable cost function for finding the rotation could be constructed as follows. Suppose we have an output truncated at zero, y^+ = (y_1^+, . . . , y_n^+) with y_i^+ = max(0, y_i), and we construct a reestimate of z = W^T y given by ẑ = W^T y^+. Then a suitable cost function would be given by

J(W) = E{‖z − ẑ‖²} = E{‖z − W^T y^+‖²},
(1.2)
because obviously its value will be zero if W is such that all the y_i are positive, or y = y^+. We considered the minimization of this cost function by various numerical algorithms in Plumbley (2003) and Plumbley and Oja (2004). In Plumbley (2003), explicit axis rotations as well as geodesic search over the Stiefel manifold of orthogonal matrices were used. In Plumbley and Oja (2004), the cost function 1.2 was taken as a special case of nonlinear PCA, for which an algorithm was earlier suggested by Oja (1997). However, a rigorous convergence proof for the nonlinear PCA method could not be constructed except in some special cases; the general convergence seems to be a very challenging problem.

The new key result shown in this article is that the cost function 1.2 has very desirable properties. In the Stiefel manifold of rotation matrices, the function has no local minima, and it is a Lyapunov function for its gradient matrix flow. A gradient algorithm, suggested in the following, is therefore monotonically converging and is guaranteed to find the absolute minimum of the cost function. The minimum is zero, giving positive components y_i, which by theorem 1 must be a positive permutation of the original unknown sources s_j. Some preliminary results along these lines were given in Oja and Plumbley (2003).

In the next section, we present the whitening for non-zero-mean observations and further illustrate by a simple example why a rotation into positive outputs y_i will give the sources. In section 3, we consider the cost function
1.2 in more detail. Section 4 gives a gradient algorithm, whose monotonic global convergence is proven. Section 5 relates this orthogonalized algorithm to the nonorthogonal "nonlinear PCA" learning rule previously introduced by one of the authors, illustrating both algorithms in a pictorial example. Finally, section 6 gives some conclusions.

2 Prewhitening and Axis Rotations

In order to reduce the ICA problem to one of finding the correct orthogonal rotation, the first stage in our ICA process is to whiten the observed data x. This gives

z = Vx,
(2.1)
where the n × n real whitening matrix V is chosen so that Σ_z = E{(z − z̄)(z − z̄)^T} = I_n, with z̄ = E{z}. If E is the orthogonal matrix of eigenvectors of the data covariance matrix Σ_x = E{(x − x̄)(x − x̄)^T} and D = diag(d_1, . . . , d_n) is the diagonal matrix of corresponding eigenvalues, so that Σ_x = EDE^T and E^T E = EE^T = I_n, then a suitable whitening matrix is V = Σ_x^{-1/2} = ED^{-1/2}E^T, where

D^{-1/2} = diag(d_1^{-1/2}, . . . , d_n^{-1/2}).
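This whitening step is easy to sketch in code (assuming numpy; the toy data and variable names are ours, not from the letter):

```python
import numpy as np

def whitening_matrix(X):
    """V = Sigma_x^{-1/2} = E D^{-1/2} E^T from the sample covariance of X (n x p)."""
    d, E = np.linalg.eigh(np.cov(X))        # Sigma_x = E diag(d) E^T
    return E @ np.diag(d ** -0.5) @ E.T

rng = np.random.default_rng(1)
S = rng.uniform(0.0, 1.0, size=(2, 4000))    # nonnegative toy sources
X = np.array([[1.0, 0.5], [0.2, 1.5]]) @ S   # mixed observations x = As
V = whitening_matrix(X)
Z = V @ X                                    # note: the mean is NOT removed from x itself
print(np.allclose(np.cov(Z), np.eye(2)))     # covariance of z is the identity -> True
```

Only the covariance computation centers the data internally; the whitened observations z = Vx themselves keep their nonzero mean, as the text requires for nonnegative ICA.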
Figure 1: Original data (a) are whitened (b) to remove second-order correlations.

Σ_x is normally estimated from the sample covariance (Hyvärinen et al., 2001). Note that for nonnegative ICA, we do not remove the mean of the data, since this would lose information about the nonnegativity of the sources (Plumbley, 2002). Suppose that our sources s_j have unit variance, such that Σ_s = I_n, and let U = VA be the s-to-z transform. Then I_n = Σ_z = U Σ_s U^T = UU^T, so U is an orthonormal matrix. It is therefore sufficient to search for a further orthonormal matrix W such that y = Wz = WUs is a permutation of the original sources s.

Figure 1 illustrates the process of whitening for nonnegative data in two dimensions. Whitening has succeeded in making the axes of the original sources orthogonal to each other (see Figure 1b), but there is a remaining orthonormal rotation ambiguity. A typical ICA algorithm might search for a rotation that makes the resulting outputs as nongaussian as possible, for example, by finding an extremum of kurtosis, since any sum of independent random variables will make the result "more gaussian" (Hyvärinen et al., 2001). However, Figure 1 immediately suggests another approach: we should search for a rotation where all the data fit into the positive quadrant. As long as the distribution of the original sources is "tight" down to the axes, then it is intuitively clear that this will be a unique solution, apart from a permutation and scaling of the axes. This explains why theorem 1 works. Note also that after this rotation, the two sources s_1 and s_2 are indeed independent, even
if they are not zero mean, because for their densities, it holds that the joint density (in this case, uniform in a square) is the product of the marginal (uniform) densities.

3 The Cost Function for Nonnegative BSS

In the following, we show that the minimum of the cost function 1.2 in the set of orthogonal (rotation) matrices will give the sources. For brevity of notation, let us denote the truncation nonlinearity by g^+(y_i) = max(0, y_i), which is zero for negative y_i and equal to y_i otherwise. We are now ready to state and prove theorem 2:

Theorem 2. Assume the n-element random vector z is a whitened linear mixture of nonnegative well-grounded independent unit-variance sources s_1, . . . , s_n, and y = Wz, with W constrained to be a square orthogonal matrix. If W is obtained as the minimum of the cost function 1.2, rewritten as

J(W) = E{‖z − W^T g^+(Wz)‖²},

then the elements of y will be a permutation of the original sources s_i.

Proof.
Because W is square orthogonal, we get:
J(W) = E{‖z − W^T g^+(Wz)‖²}   (3.1)
     = E{‖Wz − WW^T g^+(Wz)‖²}   (3.2)
     = E{‖y − g^+(y)‖²}   (3.3)
     = Σ_{i=1}^{n} E{[y_i − g^+(y_i)]²}   (3.4)
     = Σ_{i=1}^{n} E{min(0, y_i)²}   (3.5)
     = Σ_{i=1}^{n} E{y_i² | y_i < 0} P(y_i < 0).   (3.6)
This is always nonnegative and becomes zero if and only if each y_i is nonnegative with probability one. Thus, if W is obtained as the minimum of the cost function, then the elements of y will be nonnegative. On the other hand, because y = Wz with W orthogonal, it also holds that y = Us with U orthogonal. Theorem 1 now implies that because the elements of y are nonnegative, they must be a permutation of the elements of s.

4 A Converging Gradient Algorithm

Theorem 2 leads us naturally to consider the use of a gradient algorithm for minimizing equation 3.1 under the orthogonality constraint. In order to derive the gradient, let us first write equation 3.1 in the simple form, equation 3.5:

J(w_1, . . . , w_n) = Σ_{i=1}^{n} E{min(0, y_i)²},   y_i = w_i^T z,   (4.1)

where the vectors w_i^T are the rows of matrix W and thus satisfy

w_i^T w_j = δ_ij.   (4.2)

The gradient with respect to one of the vectors w_i is straightforward:

∂J/∂w_i = E{[∂ min(0, y_i)²/∂y_i] (∂y_i/∂w_i)} = 2E{min(0, y_i) z}.   (4.3)

A possible way to do the constrained minimization by gradient descent is to divide each descent step into two parts: a step in the direction of the unconstrained gradient, followed by a projection of the new point onto the constraint set 4.2. For the vectors w_i, this gives the update rule

w̃_i = w_i − γ ∂J/∂w_i   (4.4)
    = w_i − 2γ E{min(0, y_i) z},   (4.5)
which is the unconstrained gradient descent step with step size γ, and

W = (W̃W̃^T)^{-1/2} W̃,   (4.6)

which is the projection onto the constraint set of orthonormal w_i^T vectors, because equation 4.6 implies WW^T = I. Obviously, W̃ is the matrix with the vectors w̃_i^T as its rows. This is just one of the possibilities for performing orthonormalization; another alternative would be the Gram-Schmidt algorithm. However, we prefer a symmetrical orthonormalization, as there is no reason to order the w_i vectors in any way. In matrix form, equation 4.5 reads

W̃ = W − 2γ E{fz^T},   (4.7)

where f = f(y) is the column vector with elements min(0, y_i). This is the gradient descent step. Now, to derive the projection in equation 4.6, we have

W̃W̃^T = WW^T + 4γ² E{fz^T}E{zf^T} − 2γ E{Wzf^T + fz^T W^T}
      = I − 2γ E{Wzf^T + fz^T W^T} + O(γ²).

Thus, assuming γ small,

(W̃W̃^T)^{-1/2} = I + γ E{Wzf^T + fz^T W^T} + O(γ²),

and finally

(W̃W̃^T)^{-1/2} W̃ = W − 2γ E{fz^T} + γ E{Wzf^T + fz^T W^T}W + O(γ²)
                 = W − γ E{fz^T} + γ E{yf^T}W + O(γ²).

Omitting the O(γ²) terms, the change from W at the previous step to the new W = (W̃W̃^T)^{-1/2} W̃ at the next step is therefore ΔW = −γ E{fz^T − yf^T W}, which can be further written in the form

ΔW = −γ E{fy^T − yf^T}W.   (4.8)
Substituting min(0, y_i) for the elements of f(y) gives the gradient descent algorithm for the constrained problem. The form of the update rule 4.8 is the same as that derived by Cardoso and Laheld (1996) (although with different notation) for minimizing a general contrast under the orthogonality constraint. However, they project the unconstrained gradient onto the space of skew-symmetric matrices. It seems that from our derivation, higher-order terms with respect to the step size γ, and thus a more accurate projection, could be more easily obtained by continuing the series expansion. The skew-symmetric form of the matrix fy^T − yf^T in equation 4.8 ensures that W tends to stay orthogonal from step to step, although to fully guarantee orthogonality in a discrete-time gradient algorithm, an explicit orthonormalization of the rows of W should be done from time to time.

Instead of analyzing this learning rule directly, let us look at the averaged differential equation corresponding to the discrete-time algorithm 4.8. It becomes

dW/dt = −MW,   (4.9)
where we have denoted the continuous-time deterministic solution also by W, and the elements µ_ij of the matrix M = E{fy^T − yf^T} are

µ_ij = E{min(0, y_i) y_j − y_i min(0, y_j)}.   (4.10)
Note that M is a nonlinear function of the solution W, because y = Wz. Yet we can formally write the solution of equation 4.9 as

W(t) = exp(−∫_0^t M(s) ds) W(0).   (4.11)

The solution W(t) is always an orthogonal matrix if W(0) is orthogonal. This can be shown as follows:

W(t)W(t)^T = exp(−∫_0^t M(s) ds) W(0)W(0)^T exp(−∫_0^t M(s)^T ds)
           = exp[−∫_0^t (M(s) + M(s)^T) ds].
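As a side check (our own sketch, not from the letter): for a skew-symmetric matrix, this matrix exponential is orthogonal, which is easy to verify numerically in the 2 × 2 case using the closed-form exponential.

```python
import numpy as np

def expm_skew2(M):
    """Closed-form matrix exponential for 2x2 skew-symmetric M = [[0, -a], [a, 0]]:
    exp(M) is the rotation by angle a."""
    a = M[1, 0]
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

M = np.array([[0.0, -0.7], [0.7, 0.0]])   # skew-symmetric: M + M.T = 0
W = expm_skew2(-M) @ np.eye(2)            # W(t) = exp(-M t) W(0) at t = 1
print(np.allclose(W @ W.T, np.eye(2)))    # orthogonality is preserved -> True
```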
But matrix M is skew-symmetric; hence, M(s) + M(s)^T = 0 for all s, and W(t)W(t)^T = exp[0] = I. We can now analyze the stationary points of equation 4.9 and their stability in the class of orthogonal matrices. The stationary points (for which dW/dt = 0) are easily solved. They must be the roots of the equation MW = 0, which is equivalent to M = 0 because of the orthogonality of W. We see that if all y_i are positive or all of them are negative, then M = 0. Namely, if y_i and y_j are both positive, then min(0, y_i) and min(0, y_j) in equation 4.10 are both zero. If they are both negative, then min(0, y_i) = y_i and min(0, y_j) = y_j, and the two terms in equation 4.10 cancel out. Thus, in these two cases, W is a stationary point. The case when all y_i are positive corresponds to the minimum value (zero) of the cost function J(W). By theorem 1, y is then a permutation of s, which is the correct solution we are looking for. We would hope that this stationary point would be the only stable one, because then the ordinary differential equation (ODE) will converge to it. The case when all the y_i are negative corresponds to the maximum value of J(W), equal to Σ_{i=1}^{n} E{y_i²} = n. As it is stationary too, we have to consider the case when it is taken as the initial value in the ODE. In all other cases, at least some of the y_i have opposite signs. Then M is not zero and W is not stationary, as seen from equation 4.11. We could look at the local stability of the two stationary points. However, we can do even better and perform a global analysis. It turns out that equation 3.1 is in fact a Lyapunov function for the matrix flow 4.9; it is strictly decreasing whenever W changes according to the ODE 4.9, except at the stationary points. Let us prove this in the following.
Theorem 3. If W follows the ODE 4.9, then dJ(W)/dt < 0, except at the point where all y_i are nonnegative or all are nonpositive.
Proof. Consider the ith term in the sum J(W), given in equation 3.5. Denoting it by e_i, we have e_i = E{min(0, y_i)²}, whose derivative with respect to y_i is 2E{min(0, y_i)}. If w_i^T is the ith row of matrix W, then y_i = w_i^T z. Thus,

de_i/dt = (de_i/dy_i)(dy_i/dt) = 2E{min(0, y_i) (dw_i^T/dt) z}.   (4.12)

From the ODE 4.9, we get

dw_i^T/dt = −Σ_{k=1}^{n} µ_ik w_k^T,
with µ_ik given in equation 4.10. Substituting this in equation 4.12 gives

de_i/dt = −2 Σ_{k=1}^{n} µ_ik E{min(0, y_i) y_k}
        = −2 Σ_{k=1}^{n} E²{min(0, y_i) y_k} + 2 Σ_{k=1}^{n} E{min(0, y_k) y_i} E{min(0, y_i) y_k}.

If we denote α_ik = E{min(0, y_i) y_k}, we have

dJ(W)/dt = Σ_{i=1}^{n} de_i/dt = 2 (−Σ_{i=1}^{n} Σ_{k=1}^{n} α_ik² + Σ_{i=1}^{n} Σ_{k=1}^{n} α_ik α_ki).

By the Cauchy-Schwarz inequality, this is strictly negative unless α_ik = α_ki for all i, k, and thus J(W) is decreasing. We still have to look at the condition that α_ik = α_ki for all i, k and show that this implies nonnegativity or nonpositivity for all the y_i. Now, because y = Us with U orthogonal, each y_i is a projection of the positive source vector s on one of the n orthonormal rows u_i^T of U. If the vectors u_i are aligned with the original coordinate axes, then the projections of s on them are nonnegative. For any rotation that is not aligned with the coordinate axes, one of the vectors u_i (or −u_i) must be in the positive octant due to the orthonormality of the vectors. Without loss of generality, assume that this vector is u_1; then it holds that P(y_1 = u_1^T s ≥ 0) = 1 (or 0). But if P(y_1 ≥ 0) = 1, then min(0, y_1) = 0 and α_1k = E{min(0, y_1) y_k} = 0 for all k. If symmetry holds for the α_ij, then also α_k1 = E{min(0, y_k) y_1} = E{y_1 y_k | y_k ≤
0} P(y_k ≤ 0) = 0. But y_1 is nonnegative, so P(y_k ≤ 0) must be zero too for all k. The same argument carries over to the case when P(y_1 ≥ 0) = 0, which implies that if one y_i is nonnegative, then all y_k must be nonnegative in the case of symmetrical α_ij.

The behavior of the learning rule 4.8 is now well understood. The function J(W) in equation 1.2 is a Lyapunov function for the averaged differential equation 4.9, for all orthogonal matrices W except for the point with all outputs y_i nonpositive. Therefore, recalling that if W(0) is an orthogonal matrix, then it is constrained to remain so, W(t) converges to the minimum of J(W) from almost everywhere in the Stiefel manifold of orthogonal matrices. For a discussion of optimization and learning on the Stiefel manifold, see Edelman, Arias, and Smith (1998) and Fiori (2001). This minimum corresponds to all nonnegative y_i. By theorem 1, these must be a permutation of the original sources s_j, which therefore have been found. The result was proven only for the continuous-time averaged version of the learning rule; the exact connection between this and the discrete-time online algorithm has been clarified in the theory of stochastic approximation (see Oja, 1983). In practice, even if the starting point happened to be the "bad" stationary point in which all y_i are nonpositive, numerical errors will drive the solution away from this point and the cost function J(W) starts to decrease.

5 Experiments

We illustrate the operation of the algorithm using a blind image separation problem. We use the same images and performance measures as in Plumbley and Oja (2004), in which the nonlinear PCA algorithm (Oja, 1997) was used instead to solve the nonnegative ICA problem. As performance measures, we use a mean squared error e_MSE, an orthonormalization error e_Orth, and a permutation error e_Perm, defined as follows:

e_MSE = (1/np) Σ_{k=1}^{p} ‖z_k − W^T g^+(y_k)‖²,   (5.1)
e_Orth = (1/n²) ‖I − (WVA)^T (WVA)‖²_F,   (5.2)
e_Perm = (1/n²) ‖I − abs(WVA)^T abs(WVA)‖²_F,   (5.3)
where abs(M) returns the absolute value of each element of M, so that e_Perm = 0 only for a positive permutation matrix. The errors have been scaled (by 1/(np) or 1/n²) to allow more direct comparison between the results of simulations using different values of n and p.
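These measures are straightforward to compute; a sketch (assuming numpy, with W, V, A as in the text; the helper names are ours):

```python
import numpy as np

def e_mse(Z, W):
    """Equation 5.1: mean reconstruction error of z from the rectified outputs."""
    n, p = Z.shape
    R = Z - W.T @ np.maximum(0.0, W @ Z)     # z_k - W^T g+(y_k), columnwise
    return (R ** 2).sum() / (n * p)

def e_orth(W, V, A):
    """Equation 5.2: distance of WVA from orthonormality."""
    n = W.shape[0]
    P = W @ V @ A
    return np.linalg.norm(np.eye(n) - P.T @ P, "fro") ** 2 / n ** 2

def e_perm(W, V, A):
    """Equation 5.3: distance of WVA from a positive permutation matrix."""
    n = W.shape[0]
    P = np.abs(W @ V @ A)
    return np.linalg.norm(np.eye(n) - P.T @ P, "fro") ** 2 / n ** 2

# A 45-degree rotation is perfectly orthonormal but not a permutation,
# so e_orth vanishes while e_perm does not.
c = np.sqrt(0.5)
R45 = np.array([[c, -c], [c, c]])
I2 = np.eye(2)
print(e_orth(R45, I2, I2) < 1e-12)             # True
print(abs(e_perm(R45, I2, I2) - 0.5) < 1e-12)  # True
```

The rotation example shows why both measures are needed: e_Orth tracks only the whitening/orthogonality error, while e_Perm additionally penalizes any residual mixing.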
Four image patches of size 252 × 252 were selected from a set of images of natural scenes and downsampled by a factor of 4 in both directions to yield 63 × 63 pixel images. Each of the n = 4 images was treated as one source, with its pixel values representing the p = 63 × 63 = 3969 samples. The source image values were shifted to have a minimum of zero to ensure they were well grounded, and the images were scaled to ensure they all had unit variance. After scaling, the source covariance matrix was found to be

E{ss^T} − s̄s̄^T =
[  1.000   0.074  −0.003   0.050 ]
[  0.074   1.000  −0.071   0.160 ]
[ −0.003  −0.071   1.000   0.130 ]   (5.4)
[  0.050   0.160   0.130   1.000 ]

giving an acceptably small covariance between the images. A mixing matrix A was generated randomly and used to construct x = As. For the algorithm, the demixing matrix W was initialized to the identity matrix, ensuring initial orthogonality of W. Instead of algorithm 4.8, the theoretical expectation was replaced by a batch update method: denoting by X the observation matrix having all the data vectors x(1), . . . , x(p) as its columns and defining the matrix of outputs as Y = WX, we update W with the incremental change

ΔW = −(µ/p) {f(Y)Y^T − Yf(Y)^T} W.   (5.5)

A constant update factor of µ = 0.03 was used, and W was renormalized to an orthogonal matrix after each step using equation 4.6. Figure 2 shows the performance of learning over 2 × 10⁴ steps, with Figure 3 showing the original, mixed, and separated images and their histograms. After 2 × 10⁴ iteration steps, in which each step is one update following presentation of the batch of 3969 samples (1610 s / 27 min of CPU time on an 850 MHz Pentium III), the source-to-output matrix WVA was found to be

WVA =
[  1.002   0.015  −0.040   0.026 ]
[ −0.099   0.067  −0.111   1.007 ]
[ −0.016   1.009  −0.089   0.018 ]   (5.6)
[  0.004  −0.056   1.021  −0.058 ]

with e_MSE = 6.07 × 10⁻⁵, e_Orth = 8.88 × 10⁻³, and e_Perm = 1.07 × 10⁻². The mean squared error and orthogonalization error are slightly better than those for the nonnegative PCA algorithm, which were 9.30 × 10⁻⁵ and 9.02 × 10⁻³, respectively, for the same number of iterations. The final permutation error for the same number of iterations is also smaller than the value obtained with the nonnegative PCA algorithm, which was 1.68 × 10⁻².
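A compact sketch of this batch procedure (assuming numpy; the names and toy data are ours, not from the letter). The SVD-based projection U Vᵀ used below equals the symmetric orthonormalization (W̃W̃^T)^{-1/2} W̃ of equation 4.6:

```python
import numpy as np

def nonneg_ica(Z, n_iter=5000, mu=0.03):
    """Batch form of update 5.5 on whitened data Z (n x p), with symmetric
    re-orthonormalization of W after each step."""
    n, p = Z.shape
    W = np.eye(n)                      # identity start, as in the experiments
    for _ in range(n_iter):
        Y = W @ Z
        F = np.minimum(0.0, Y)         # f(y) = min(0, y), elementwise
        W = W - (mu / p) * (F @ Y.T - Y @ F.T) @ W
        U, _, Vt = np.linalg.svd(W)    # polar projection back onto orthogonal matrices
        W = U @ Vt
    return W

# Toy run: two uniform (hence nonnegative and well-grounded) unit-variance sources.
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(2, 5000)) / np.sqrt(1 / 12)
X = np.array([[1.0, 0.6], [-0.4, 0.9]]) @ S
d, E = np.linalg.eigh(np.cov(X))
V = E @ np.diag(d ** -0.5) @ E.T        # whitening matrix (mean kept in the data)
W = nonneg_ica(V @ X)
Y = W @ V @ X
print("min output:", round(float(Y.min()), 3))  # close to zero: outputs nearly nonnegative
```

On this toy problem the returned rotation leaves all outputs nonnegative up to sampling error, so y is a positive permutation of the sources, as theorem 1 predicts.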
Figure 2: Simulation results on image data. Performance in the lower graph is measured as distance from permutation ePerm (upper curve) and distance from orthogonality eOrth (lower curve). See the text for definitions.
The lower bound on e_Orth and e_Perm is determined by the accuracy of the prewhitening stage: recall that prewhitening is estimated from the statistics of the input data, without having any access to the original mixing matrix A. Calculating the equivalent error in VA from orthonormality,

e_White = (1/n²) ‖I − (VA)^T (VA)‖²_F,   (5.7)
we find e_White = 8.88 × 10⁻³ = e_Orth to within machine accuracy (i.e., |e_Orth − e_White| < 10⁻¹⁵), as we might expect, since W is orthonormalized at each iteration, so WW^T = I. Therefore, e_Perm is 20.3% above its lower bound for this algorithm, compared to 80.3% for the nonnegative PCA algorithm.

Figure 3: Images and histograms for the image separation task, showing (a) the original source images and (b) their histograms, (c, d) the mixed images and their histograms, and (e, f) the separated images and their histograms.

6 Discussion

We have considered the problem of nonnegative ICA, that is, independent component analysis where the sources are known to be nonnegative. Elsewhere, one of us introduced algorithms to solve this based on the use of orthogonal rotations, related to Stiefel manifold approaches (Plumbley, 2003). In this article, we considered a gradient-based algorithm operating on prewhitened data, related to the "nonlinear PCA" algorithms investigated by one of the authors (Oja, 1997, 1999). We refer to these algorithms, which use a truncation nonlinearity, as nonnegative PCA algorithms. By theoretical analysis of algorithm 4.8, we showed the key result of the article: asymptotically, as the learning rate becomes very small, the algorithm is guaranteed to find a permutation of the well-grounded nonnegative sources. Such a global convergence result is rather unique among ICA gradient methods. The convergence was experimentally verified for a small learning rate using a set of positive images as sources, with a random mixing matrix.

Acknowledgments

Patrik Hoyer kindly supplied the images used in section 5. Part of this work was undertaken while M.P. was visiting the Neural Networks Research Centre at the Helsinki University of Technology, supported by a Leverhulme Trust Study Abroad Fellowship. This work is also supported by grant GR/R54620 from the UK Engineering and Physical Sciences Research Council, as well as by the project New Information Processing Principles, 44886, of the Academy of Finland.

References

Amari, S.-I., & Cichocki, A. (2002). Adaptive blind signal and image processing. New York: Wiley.
Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.
Charles, D., & Fyfe, C. (1998). Modelling multiple-cause structure using rectification constraints. Network: Computation in Neural Systems, 9, 167–182.
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2), 303–353.
Fiori, S. (2001). A theory for learning by weight flow on Stiefel-Grassman manifold.
Neural Computation, 13, 1625–1647.
Fyfe, C. (1994). Positive weights in interneurons. In G. Orchard (Ed.), Neural computing: Research and applications II. Proceedings of the Third Irish Neural Networks Conference, Belfast, Northern Ireland, 1–2 Sept 1993 (pp. 47–58). Belfast, NI: Irish Neural Networks Association.
Harpur, G. F. (1997). Low entropy coding with unsupervised neural networks. Unpublished doctoral dissertation, Cambridge University.
Henry, R. C. (2002). Multivariate receptor models—current practice and future trends. Chemometrics and Intelligent Laboratory Systems, 60(1–2), 43–48.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.
Lee, J. S., Lee, D. D., Choi, S., & Lee, D. S. (2001). Application of nonnegative matrix factorization to dynamic positron emission tomography. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA2001), San Diego, California (pp. 629–632). San Diego, CA: Institute of Neural Computation, University of California, San Diego.
Oja, E. (1983). Subspace methods of pattern recognition. Baldock, U.K.: Research Studies Press, and New York: Wiley.
Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17(1), 25–46.
Oja, E. (1999). Nonlinear PCA criterion and maximum likelihood in independent component analysis. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 143–148). Aussois, France: ICA 1999 Organizing Committee, Institut National Polytechnique de Grenoble, France.
Oja, E., & Plumbley, M. D. (2003). Blind separation of positive sources using nonnegative ICA. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'03). Kyoto, Japan: ICA 2003 Organizing Committee, NTT Communication Science Laboratories.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics, 5, 111–126.
Parra, L., Spence, C., Sajda, P., Ziehe, A., & Müller, K.-R. (2000). Unmixing hyperspectral data. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 942–948). Cambridge, MA: MIT Press.
Plumbley, M. D. (2002). Conditions for nonnegative independent component analysis. IEEE Signal Processing Letters, 9(6), 177–180.
Plumbley, M. D. (2003). Algorithms for non-negative independent component analysis. IEEE Transactions on Neural Networks, 14(3), 534–543.
Plumbley, M. D., & Oja, E. (2004). A "non-negative PCA" algorithm for independent component analysis. IEEE Transactions on Neural Networks, 15(1), 66–76.
Tsuge, S., Shishibori, M., Kuroiwa, S., & Kita, K. (2001). Dimensionality reduction using non-negative matrix factorization for information retrieval. In IEEE International Conference on Systems, Man, and Cybernetics (Vol. 2, pp. 960–965). Piscataway, NJ: IEEE.
Xu, L. (1993). Least mean square error reconstruction principle for self-organizing neural-nets. Neural Networks, 6(5), 627–648.

Received July 9, 2003; accepted February 5, 2004.
LETTER

Communicated by Aapo Hyvärinen

A New Concept for Separability Problems in Blind Source Separation

Fabian J. Theis
[email protected]
Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
The goal of blind source separation (BSS) lies in recovering the original independent sources of a mixed random vector without knowing the mixing structure. A key ingredient for performing BSS successfully is to know the indeterminacies of the problem—that is, to know how the separating model relates to the original mixing model (separability). For linear BSS, Comon (1994) showed using the Darmois-Skitovitch theorem that the linear mixing matrix can be found except for permutation and scaling. In this work, a much simpler, direct proof for linear separability is given. The idea is based on the fact that a random vector is independent if and only if the Hessian of its logarithmic density (resp. characteristic function) is diagonal everywhere. This property is then exploited to propose a new algorithm for performing BSS. Furthermore, first ideas of how to generalize separability results based on Hessian diagonalization to more complicated nonlinear models are studied in the setting of postnonlinear BSS.

1 Introduction

In independent component analysis (ICA), one tries to find statistically independent data within a given random vector. An application of ICA lies in blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources. The advantage of applying ICA algorithms to BSS problems, in contrast to correlation-based algorithms, is that ICA tries to make the output signals as independent as possible by also including higher-order statistics. Since the introduction of independent component analysis by Hérault and Jutten (1986), various algorithms have been proposed to solve the BSS problem (Comon, 1994; Bell & Sejnowski, 1995; Hyvärinen & Oja, 1997; Theis, Jung, Puntonet, & Lang, 2002). Good textbook-level introductions to ICA are given in Hyvärinen, Karhunen, and Oja (2001) and Cichocki and Amari (2002).
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1827–1850 (2004)

Separability of linear BSS states that under weak conditions on the sources, the mixing matrix is determined uniquely by the mixtures except for permutation and scaling, as shown by Comon (1994) using the Darmois-Skitovitch theorem. We propose a direct proof based on the concept of
separated functions, that is, functions that can be split into a product of one-dimensional functions (see definition 1). If the function is positive, this is equivalent to the fact that its logarithm has a diagonal Hessian everywhere (see lemma 1 and theorem 1). A similar lemma has been shown by Lin (1998) for what he calls block diagonal Hessians. However, he omits discussion of the separatedness of densities with zeros, which plays a minor role for the separation algorithm he is interested in but is important for deriving separability. Using separatedness of the density, respectively the characteristic function (Fourier transform), of the random vector, we can then show separability directly (in two slightly different settings, for which we provide a common framework). Based on this result, we propose an algorithm for linear BSS that diagonalizes the Hessian of the logarithmic density. We recently found that this algorithm has already been proposed (Lin, 1998), but without considering the necessary assumptions for successful algorithm application. Here we give precise conditions for when to apply this algorithm (see theorem 3) and show that points satisfying these conditions can indeed be found if the sources contain at most one gaussian component (see lemma 5). Lin uses a discrete approximation of the derivative operator to approximate the Hessian. We suggest using kernel-based density estimation, which can be directly differentiated. A similar algorithm based on Hessian diagonalization has been proposed by Yeredor (2000) using the characteristic function of a random vector. However, the characteristic function is complex valued, and additional care has to be taken when applying a complex logarithm; basically, this is well defined locally only at nonzeros. In algorithmic terms, the characteristic function can be easily approximated by samples (which is equivalent to our kernel-based density approximation using gaussians before Fourier transformation).
Yeredor suggests joint diagonalization of the Hessian of the logarithmic characteristic function (which is problematic because of the nonuniqueness of the complex logarithm) evaluated at several points in order to avoid the locality of the algorithm. Instead of joint diagonalization, we use a combined energy function based on the previously defined separator, which also takes into account global information but does not have the drawback of being singular at zeros of the density, respectively the characteristic function. Thus, the algorithmic part of this article can be seen as a general framework for the algorithms proposed by Lin (1998) and Yeredor (2000).

Section 2 introduces separated functions, giving local characterizations of the densities of independent random vectors. Section 3 then introduces the linear BSS model and states the well-known separability result. After giving an easy and short proof in two dimensions with positive densities, we provide a characterization of gaussians in terms of a differential equation and give the general proof. The BSS algorithm based on finding separated densities is proposed and studied in section 4. We finish with a generalization of the separability result to the postnonlinear mixture case in section 5.
2 Separated and Linearly Separated Functions

Definition 1. A function f : R^n → C is said to be separated, respectively linearly separated, if there exist one-dimensional functions g_1, ..., g_n : R → C such that f(x) = g_1(x_1) ··· g_n(x_n), respectively f(x) = g_1(x_1) + ··· + g_n(x_n), for all x ∈ R^n.

Note that the functions g_i are uniquely determined by f up to a scalar factor, respectively an additive constant. If f is linearly separated, then exp f is separated. Obviously, the density function of an independent random vector is separated. For brevity, we often use the tensor product and write f ≡ g_1 ⊗ ··· ⊗ g_n for separated f, where for any functions h, k defined on a set U, h ≡ k if h(x) = k(x) for all x ∈ U. Separatedness can also be defined on any open parallelepiped (a_1, b_1) × ··· × (a_n, b_n) ⊂ R^n in the obvious way. We say that f is locally separated at x ∈ R^n if there exists an open parallelepiped U such that x ∈ U and f|U is separated. If f is separated, then f is obviously everywhere locally separated. The converse, however, does not necessarily hold, as shown in Figure 1.
Figure 1: Density of a random vector S with a locally but not globally separated density. Here, pS := cχ[−2,2]×[−2,0]∪[0,2]×[1,3] where χU denotes the function that is 1 on U and 0 everywhere else. Obviously, pS is not separated globally, but is separated if restricted to squares of length < 1. Plotted is a smoothed version of pS .
The function f is said to be positive if f is real and f(x) > 0 for all x ∈ R^n, and nonnegative if f is real and f(x) ≥ 0 for all x ∈ R^n. A positive function f is separated if and only if ln f is linearly separated. Let C^m(U, V) be the ring of all m-times continuously differentiable functions from U ⊂ R^n to V ⊂ C, U open. For a C^m-function f, we write ∂_{i_1} ··· ∂_{i_m} f := ∂^m f / ∂x_{i_1} ··· ∂x_{i_m} for the m-fold partial derivatives. If f ∈ C^2(R^n, C), denote by H_f(x) := (∂_i ∂_j f(x))_{i,j=1}^n the symmetric (n × n) Hessian of f at x ∈ R^n. Linearly separated functions can be classified using their Hessian (if it exists):

Lemma 1. A function f ∈ C^2(R^n, C) is linearly separated if and only if H_f(x) is diagonal for all x ∈ R^n.

A similar lemma for block diagonal Hessians has been shown by Lin (1998).

Proof. If f is linearly separated, its Hessian is obviously diagonal everywhere by definition. Assume the converse. We prove that f is linearly separated by induction over the dimension n. For n = 1, the claim is trivial. Now assume that we have shown the lemma for n − 1. By the induction assumption, f(x_1, ..., x_{n−1}, 0) is linearly separated, so f(x_1, ..., x_{n−1}, 0) = g_1(x_1) + ··· + g_{n−1}(x_{n−1}) for all x_i ∈ R and some functions g_i on R. Note that g_i ∈ C^2(R, C). Define a function h : R → C by h(y) := ∂_n f(x_1, ..., x_{n−1}, y), y ∈ R, for fixed x_1, ..., x_{n−1} ∈ R. Note that h is independent of the choice of the x_i, because ∂_n ∂_i f ≡ ∂_i ∂_n f is zero everywhere, so x_i ↦ ∂_n f(x_1, ..., x_{n−1}, y) is constant for fixed x_j, y ∈ R, j ≠ i. By definition, h ∈ C^1(R, C), so h is integrable on compact intervals. Define k : R → C by k(y) := ∫_0^y h. Then f(x_1, ..., x_n) = g_1(x_1) + ··· + g_{n−1}(x_{n−1}) + k(x_n) + c, where c ∈ C is a constant, because both functions have the same derivative and R^n is connected. If we set g_n := k + c, the claim follows.
This lemma also holds for functions defined on any open parallelepiped (a_1, b_1) × ··· × (a_n, b_n) ⊂ R^n. Hence, an arbitrary real-valued C^2-function f is locally separated at x with f(x) ≠ 0 if and only if the Hessian of ln|f| is locally diagonal. For a positive function f, the Hessian of its logarithm is diagonal everywhere if it is separated, and it is easy to see that for positive f, the converse
also holds globally (see theorem 1(ii)). In this case, we have for i ≠ j,

$$0 \equiv \partial_i \partial_j \ln f \equiv \frac{f\,\partial_i\partial_j f - (\partial_i f)(\partial_j f)}{f^2},$$

so f is separated if and only if f ∂_i∂_j f ≡ (∂_i f)(∂_j f) for i ≠ j, or even just for i < j. This motivates the following definition:

Definition 2. For i ≠ j, the operator

$$R_{ij} : C^2(\mathbb{R}^n, \mathbb{C}) \to C^0(\mathbb{R}^n, \mathbb{C}), \quad f \mapsto R_{ij}[f] := f\,\partial_i\partial_j f - (\partial_i f)(\partial_j f)$$

is called the ij-separator.

Theorem 1. Let f ∈ C^2(R^n, C).

i. If f is separated, then R_ij[f] ≡ 0 for i ≠ j or, equivalently,

$$f\,\partial_i\partial_j f \equiv (\partial_i f)(\partial_j f) \tag{2.1}$$

holds for i ≠ j.

ii. If f is positive and R_ij[f] ≡ 0 holds for all i ≠ j, then f is separated.

If f is assumed to be only nonnegative, then f is locally separated but not necessarily globally separated (if the support of f has more than one component). See Figure 1 for an example of a nonseparated density with R_12[f] ≡ 0.

Proof of Theorem 1. i. If f is separated, then f(x) = g_1(x_1) ··· g_n(x_n), or in short f ≡ g_1 ⊗ ··· ⊗ g_n, so ∂_i f ≡ g_1 ⊗ ··· ⊗ g_{i−1} ⊗ g_i' ⊗ g_{i+1} ⊗ ··· ⊗ g_n and ∂_i∂_j f ≡ g_1 ⊗ ··· ⊗ g_{i−1} ⊗ g_i' ⊗ g_{i+1} ⊗ ··· ⊗ g_{j−1} ⊗ g_j' ⊗ g_{j+1} ⊗ ··· ⊗ g_n for i < j. Hence equation 2.1 holds.

ii. Now assume the converse and let f be positive. Then, according to the remarks after lemma 1, H_{ln f}(x) is diagonal everywhere, so lemma 1 shows that ln f is linearly separated; hence, f is separated.
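Theorem 1 can be checked numerically: a finite-difference estimate of R_ij[f] vanishes (up to discretization error) for a separated f and is clearly nonzero for a non-separated one. This is a sketch only; the step size and the two test functions below are our own choices.

```python
import numpy as np

def separator(f, x, i, j, h=1e-4):
    """Finite-difference estimate of the ij-separator
    R_ij[f](x) = f * d_i d_j f - (d_i f)(d_j f)."""
    ei = np.zeros_like(x); ei[i] = h
    ej = np.zeros_like(x); ej[j] = h
    di = (f(x + ei) - f(x - ei)) / (2 * h)
    dj = (f(x + ej) - f(x - ej)) / (2 * h)
    dij = (f(x + ei + ej) - f(x + ei - ej)
           - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return f(x) * dij - di * dj

separated = lambda x: np.exp(-x[0] ** 2) * np.exp(-x[1] ** 4)  # g1(x1) * g2(x2)
entangled = lambda x: np.exp(-(x[0] + x[1]) ** 2)              # not separated

x = np.array([0.3, -0.7])
print(abs(separator(separated, x, 0, 1)))  # ~ 0 (finite-difference error only)
print(separator(entangled, x, 0, 1))       # analytically -2 f(x)^2, clearly nonzero
```

For the entangled example one can verify by hand that R_12[f] = −2f², matching the numerical value.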
Some trivial properties of the separator R_ij are listed in the next lemma:

Lemma 2. Let f, g ∈ C^2(R^n, C), i ≠ j, and α ∈ C. Then

$$R_{ij}[\alpha f] = \alpha^2 R_{ij}[f]$$

and

$$R_{ij}[f+g] = R_{ij}[f] + R_{ij}[g] + f\,\partial_i\partial_j g + g\,\partial_i\partial_j f - (\partial_i f)(\partial_j g) - (\partial_i g)(\partial_j f).$$

3 Separability of Linear BSS

Consider the noiseless linear instantaneous BSS model with as many sources as sensors:

$$X = AS, \tag{3.1}$$

with an independent n-dimensional random vector S and A ∈ Gl(n). Here, Gl(n) denotes the general linear group of R^n, that is, the group of all invertible (n × n)-matrices. The task of linear BSS is to find A and S given only X. An obvious indeterminacy of this problem is that A can be found only up to scaling and permutation, because for a scaling matrix L and a permutation matrix P, X = ALPP^{−1}L^{−1}S, and P^{−1}L^{−1}S is also independent. Here, an invertible matrix L ∈ Gl(n) is said to be a scaling matrix if it is diagonal. We say two matrices B, C are equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n). Note that PL = L'P for some scaling matrix L' ∈ Gl(n), so the order of the permutation and the scaling matrix does not play a role for equivalence. Furthermore, if B ∈ Gl(n) with B ∼ I, then also B^{−1} ∼ I, and, more generally, if BC ∼ A, then C ∼ B^{−1}A. According to the above, solutions of linear BSS are equivalent. We will show that under mild assumptions on S, there are no further indeterminacies of linear BSS. S is said to have a gaussian component if one of the S_i is a one-dimensional gaussian, that is, p_{S_i}(x) = d exp(−ax² + bx + c) with a, b, c, d ∈ R, a > 0, and S has a deterministic component if one S_i is deterministic, that is, constant.

Theorem 2 (Separability of linear BSS). Let A ∈ Gl(n) and S be an independent random vector. Assume one of the following:

i. S has at most one gaussian or deterministic component, and the covariance of S exists.
ii. S has no gaussian component, and its density p_S exists and is twice continuously differentiable.

Then if AS is again independent, A is equivalent to the identity, so A is the product of a scaling and a permutation matrix.

The important part of this theorem is assumption i, which has been used to show separability by Comon (1994) and extended by Eriksson and Koivunen (2003), based on the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953). Using this theorem, the second part can be easily shown without C^2-densities. Theorem 2 indeed proves separability of the linear BSS model, because if X = AS and W is a demixing matrix such that WX is independent, then WA ∼ I, so W^{−1} ∼ A as desired. We will give a much easier proof, without having to use the Darmois-Skitovitch theorem, in the following sections.

3.1 Two-Dimensional Positive Density Case. For illustrative purposes, we will first prove separability for a two-dimensional random vector S with positive density p_S ∈ C^2(R^2, R). Let A ∈ Gl(2). It is enough to show that if S and AS are independent, then either A ∼ I or S is gaussian. S is assumed to be independent, so its density factorizes: p_S(s) = g_1(s_1)g_2(s_2) for s ∈ R^2. First, note that the density of AS is given by p_{AS}(x) = |det A|^{−1} p_S(A^{−1}x) = c g_1(b_{11}x_1 + b_{12}x_2) g_2(b_{21}x_1 + b_{22}x_2) for x ∈ R^2, with c ≠ 0 fixed. Here, B = (b_{ij}) = A^{−1}. AS is also assumed to be independent, so p_{AS}(x) is separated. p_S was assumed to be positive; then so is p_{AS}. Hence, ln p_{AS}(x) is linearly separated, so

$$\partial_1\partial_2 \ln p_{AS}(x) = b_{11}b_{12}\,h_1''(b_{11}x_1 + b_{12}x_2) + b_{21}b_{22}\,h_2''(b_{21}x_1 + b_{22}x_2) = 0$$

for all x ∈ R^2, where h_i := ln g_i ∈ C^2(R, R). By setting y := Bx, we therefore have

$$b_{11}b_{12}\,h_1''(y_1) + b_{21}b_{22}\,h_2''(y_2) = 0 \tag{3.2}$$

for all y ∈ R^2, because B is invertible. Now, if A (and therefore also B) is equivalent to the identity, then equation 3.2 holds. If not, then A, and hence also B, has at least three nonzero entries. By equation 3.2, the fourth entry has to be nonzero, because the
h_i'' are not zero (otherwise g_i(y_i) = exp(a y_i + b), which is not integrable). Furthermore, b_{11}b_{12} h_1''(y_1) = −b_{21}b_{22} h_2''(y_2) for all y ∈ R^2, so the h_i'' are constant, say h_i'' ≡ c_i, with c_i ≠ 0 as noted above. Therefore, the h_i are polynomials of degree 2, and the g_i = exp h_i are gaussians (c_i < 0 because of the integrability of the g_i).

3.2 Characterization of Gaussians. In this section, we show that among all densities, respectively characteristic functions, the gaussians satisfy a special differential equation.

Lemma 3. Let f ∈ C^2(R, C) and a ∈ C with

$$a f^2 - f f'' + (f')^2 \equiv 0. \tag{3.3}$$

Then either f ≡ 0 or f(x) = exp((a/2)x² + bx + c), x ∈ R, with constants b, c ∈ C.
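The solution family in lemma 3 can be verified directly: for f(x) = exp((a/2)x² + bx + c) one has f' = (ax + b)f and f'' = (a + (ax + b)²)f, so the left-hand side of equation 3.3 cancels term by term. A quick numerical check (the parameter values are arbitrary):

```python
import numpy as np

a, b, c = -1.5, 0.4, 0.2
x = np.linspace(-3.0, 3.0, 13)
f = np.exp(0.5 * a * x ** 2 + b * x + c)
f1 = (a * x + b) * f                       # f'
f2 = (a + (a * x + b) ** 2) * f            # f''
residual = a * f ** 2 - f * f2 + f1 ** 2   # left-hand side of equation 3.3
print(np.max(np.abs(residual)))            # zero up to rounding
```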
Proof. Assume f ≢ 0. Let x_0 ∈ R with f(x_0) ≠ 0. Then there exists a nonempty interval U := (r, s) containing x_0 such that a complex logarithm log is defined on f(U). Set g := log f|U. Substituting exp g for f in equation 3.3 yields

$$a \exp(2g) - \exp(g)\,(g'' + g'^2)\exp(g) + g'^2 \exp(2g) \equiv 0,$$

and therefore g'' ≡ a. Hence, g is a polynomial of degree ≤ 2 with leading coefficient a/2. Furthermore,

$$\lim_{x \to r+} f(x) \neq 0 \quad \text{and} \quad \lim_{x \to s-} f(x) \neq 0,$$

so f has no zeros at all because of continuity. The argument above with U = R shows the claim.

If, furthermore, f is real, nonnegative, and integrable with integral 1 (e.g., if f is the density of a random variable), then f has to be the exponential of a real-valued polynomial of degree precisely 2; otherwise, it would not be integrable. So we have the following corollary:
Corollary 1. Let X be a random variable with twice continuously differentiable density p_X satisfying equation 3.3. Then X is gaussian.

If we do not want to assume that the random variable has a density, we can use its characteristic function (Bauer, 1996) instead to show an equivalent result:

Corollary 2. Let X be a random variable with twice continuously differentiable characteristic function X̂(x) := E_X(exp(ixX)) satisfying equation 3.3. Then X is gaussian or deterministic.

Proof. Using X̂(0) = 1, lemma 3 shows that X̂(x) = exp((a/2)x² + bx). Moreover, since X̂(−x) is the complex conjugate of X̂(x), we get a ∈ R and b = ib' with real b', and |X̂| ≤ 1 shows that a ≤ 0. So if a = 0, then X is deterministic (at b'), and if a ≠ 0, then X has a gaussian distribution with mean b' and variance −a^{−1}.
3.3 Proof of Theorem 2. We will now prove linear separability; for this, we will use separatedness to show that some source components have to be gaussian (using the results from above) if the mixing matrix is not trivial. The main argument is given in the following lemma:

Lemma 4. Let g_i ∈ C^2(R, C) and B ∈ Gl(n) such that f(x) := g_1 ⊗ ··· ⊗ g_n(Bx) is separated. Then for all indices l and i ≠ j with b_{li}b_{lj} ≠ 0, g_l satisfies the differential equation 3.3 with some constant a.

Proof. f is separated, so by theorem 1i,

$$R_{ij}[f] \equiv f\,\partial_i\partial_j f - (\partial_i f)(\partial_j f) \equiv 0 \tag{3.4}$$

holds for i < j. The ingredients of this equation can be calculated for i < j as follows:

$$\partial_i f(x) = \sum_k b_{ki}\,(g_1 \otimes \cdots \otimes g_k' \otimes \cdots \otimes g_n)(Bx)$$

$$(\partial_i f)(\partial_j f)(x) = \sum_{k,l} b_{ki} b_{lj}\,\big((g_1 \otimes \cdots \otimes g_k' \otimes \cdots \otimes g_n)(g_1 \otimes \cdots \otimes g_l' \otimes \cdots \otimes g_n)\big)(Bx)$$

$$\partial_i\partial_j f(x) = \sum_k b_{ki}\Big(b_{kj}\,g_1 \otimes \cdots \otimes g_k'' \otimes \cdots \otimes g_n + \sum_{l \neq k} b_{lj}\,g_1 \otimes \cdots \otimes g_k' \otimes \cdots \otimes g_l' \otimes \cdots \otimes g_n\Big)(Bx).$$
Putting this into equation 3.4 yields

$$0 = \big(f\,\partial_i\partial_j f - (\partial_i f)(\partial_j f)\big)(x) = \sum_k b_{ki} b_{kj}\big((g_1 \otimes \cdots \otimes g_n)(g_1 \otimes \cdots \otimes g_k'' \otimes \cdots \otimes g_n) - (g_1 \otimes \cdots \otimes g_k' \otimes \cdots \otimes g_n)^2\big)(Bx)$$

$$= \sum_k b_{ki} b_{kj}\,\big(g_1^2 \otimes \cdots \otimes g_{k-1}^2 \otimes (g_k g_k'' - g_k'^2) \otimes g_{k+1}^2 \otimes \cdots \otimes g_n^2\big)(Bx)$$

for x ∈ R^n. B is invertible, so the whole function is zero:

$$\sum_k b_{ki} b_{kj}\;g_1^2 \otimes \cdots \otimes g_{k-1}^2 \otimes (g_k g_k'' - g_k'^2) \otimes g_{k+1}^2 \otimes \cdots \otimes g_n^2 \equiv 0. \tag{3.5}$$

Choose x ∈ R^n with g_k(x_k) ≠ 0 for k = 1, ..., n. Evaluating equation 3.5 at (x_1, ..., x_{l−1}, y, x_{l+1}, ..., x_n) for variable y ∈ R and dividing the resulting one-dimensional equation by the constant g_1^2(x_1) ··· g_{l−1}^2(x_{l−1}) g_{l+1}^2(x_{l+1}) ··· g_n^2(x_n) shows

$$b_{li} b_{lj}\,(g_l g_l'' - g_l'^2)(y) = -\sum_{k \neq l} b_{ki} b_{kj}\,\frac{g_k g_k'' - g_k'^2}{g_k^2}(x_k)\; g_l^2(y) \tag{3.6}$$

for y ∈ R. So for indices l and i ≠ j with b_{li}b_{lj} ≠ 0, it follows from equation 3.6 that there exists a ∈ C such that g_l satisfies the differential equation a g_l² − g_l g_l'' + g_l'² ≡ 0, that is, equation 3.3.

Proof of Theorem 2. i. S is assumed to have at most one gaussian or deterministic component and existing covariance. Set X := AS. We first show, using whitening, that A can be assumed to be orthogonal. For this, we can assume S and X to have no deterministic component at all (because arbitrary choice of the matrix coefficients of the deterministic components does not change the covariance). Hence, by assumption, Cov(X) is diagonal and positive definite, so let D_1 be diagonal invertible with Cov(X) = D_1². Similarly, let D_2 be diagonal invertible with Cov(S) = D_2². Set Y := D_1^{−1}X and T := D_2^{−1}S, that is, normalize X and S to covariance I. Then

$$Y = D_1^{-1}X = D_1^{-1}AS = D_1^{-1}AD_2\,T,$$
so T, D_1^{−1}AD_2, and Y satisfy the assumption, and D_1^{−1}AD_2 is orthogonal because

$$I = \mathrm{Cov}(Y) = E(YY^T) = E(D_1^{-1}AD_2\,TT^T D_2 A^T D_1^{-1}) = (D_1^{-1}AD_2)(D_1^{-1}AD_2)^T.$$
So without loss of generality, let A be orthogonal. Now let Ŝ(s) := E_S(exp(i s^T S)) be the characteristic function of S. By assumption, the covariance (and hence the mean) of S exists, so Ŝ ∈ C^2(R^n, C) (Bauer, 1996). Furthermore, since S is assumed to be independent, its characteristic function is separated: Ŝ ≡ g_1 ⊗ ··· ⊗ g_n, where g_i ≡ Ŝ_i. The characteristic function of AS can easily be calculated as

$$\widehat{AS}(x) = E_S(\exp(i x^T AS)) = \hat{S}(A^T x) = g_1 \otimes \cdots \otimes g_n(A^T x)$$

for x ∈ R^n. Let B := (b_{ij}) = A^T. Since AS is also assumed to be independent, f(x) := \widehat{AS}(x) = g_1 ⊗ ··· ⊗ g_n(Bx) is separated. Now assume that A ≁ I. Using orthogonality of B = A^T, there exist indices k ≠ l and i ≠ j with b_{ki}b_{kj} ≠ 0 and b_{li}b_{lj} ≠ 0. Then, according to lemma 4, g_k and g_l satisfy the differential equation 3.3. Together with corollary 2, this shows that both S_k and S_l are gaussian, which is a contradiction to the assumption.

ii. Let S be an n-dimensional independent random vector with density p_S ∈ C^2(R^n, R) and no gaussian component, and let A ∈ Gl(n). S is assumed to be independent, so its density factorizes: p_S ≡ g_1 ⊗ ··· ⊗ g_n. The density of AS is given by p_{AS}(x) = |det A|^{−1} p_S(A^{−1}x) = |det A|^{−1} g_1 ⊗ ··· ⊗ g_n(Bx) for x ∈ R^n, where B := (b_{ij}) = A^{−1}. AS is also assumed to be independent, so f(x) := |det A| p_{AS}(x) = g_1 ⊗ ··· ⊗ g_n(Bx) is separated. Assume A ≁ I. Then also B = A^{−1} ≁ I, so there exist indices l and i ≠ j with b_{li}b_{lj} ≠ 0. Hence, it follows from lemma 4 that g_l satisfies the differential equation 3.3. But g_l is a density, so according to corollary 1, the lth component of S is gaussian, which is a contradiction.
4 BSS by Hessian Diagonalization

In this section, we use the theory already set out to propose an algorithm for linear BSS, which can be easily extended to nonlinear settings as well. For this, we restrict ourselves to using C^2-densities. A similar idea has already been proposed in Lin (1998), but without dealing with possibly degenerate eigenspaces in the Hessian. Equivalently, we could also use characteristic functions instead of densities, which leads to a related algorithm (Yeredor, 2000). If we assume that Cov(S) exists, we can use whitening as seen in the proof of theorem 2i (in this context also called principal component analysis) to reduce the general BSS model, equation 3.1, to

$$X = AS \tag{4.1}$$

with an independent n-dimensional random vector S with existing covariance I and an orthogonal matrix A. Then Cov(X) = I. We assume that S admits a C^2-density p_S. The density of X is then given by p_X(x) = p_S(A^T x) for x ∈ R^n, because of the orthogonality of A. Hence, p_S ≡ p_X ∘ A. Note that the Hessian of the composition of a function f ∈ C^2(R^n, R) with an (n × n)-matrix A can be calculated using the Hessian of f as follows: H_{f∘A}(x) = A^T H_f(Ax) A. Let s ∈ R^n with p_S(s) > 0. Then locally at s, we have

$$H_{\ln p_S}(s) = H_{\ln p_X \circ A}(s) = A^T H_{\ln p_X}(As)\,A. \tag{4.2}$$

p_S is assumed to be separated, so H_{ln p_S}(s) is diagonal, as seen in section 2.

Lemma 5. Let X := AS with an orthogonal matrix A and S an independent random vector with C^2-density and at most one gaussian component. Then there exists an open set U ⊂ R^n such that for all x ∈ U, p_X(x) ≠ 0 and H_{ln p_X}(x) has n different eigenvalues.

Proof. Assume not. Then there exists no x ∈ R^n at all with p_X(x) ≠ 0 and H_{ln p_X}(x) having n different eigenvalues, because otherwise, due to continuity, these conditions would also hold in an open neighborhood of x.
Using equation 4.2, the logarithmic Hessian of p_S then has, at every s ∈ R^n with p_S(s) > 0, at least two equal eigenvalues, say λ(s) ∈ R. Since S is independent, H_{ln p_S}(s) is diagonal, so locally (ln p_{S_i})''(s_i) = (ln p_{S_j})''(s_j) = λ(s) for two indices i ≠ j. Here, we have used continuity of s ↦ H_{ln p_S}(s), showing that the two equal eigenvalues locally lie in the same two dimensions i and j. This proves that λ(s) is locally constant in directions i and j. So locally at points s with p_S(s) > 0, p_{S_i} and p_{S_j} are of the type exp P, with P a polynomial of degree ≤ 2. The same argument as in the proof of lemma 3 then shows that p_{S_i} and p_{S_j} have no zeros at all. Using the connectedness of R proves that p_{S_i} and p_{S_j} are globally of the type exp P, hence gaussian (because ∫_R p_{S_k} = 1), which is a contradiction.

Hence, we can assume that we have found x^{(0)} ∈ R^n with H_{ln p_X}(x^{(0)}) having n different eigenvalues (which is equivalent to saying that every eigenvalue is of multiplicity one), because, due to lemma 5, this is an open condition, and such a point can be found algorithmically. In fact, most densities in practice turn out to have logarithmic Hessians with n different eigenvalues almost everywhere. In theory, however, U in lemma 5 cannot be assumed to be, for example, dense, or R^n \ U to have measure zero: if we choose p_{S_1} to be a normalized gaussian and p_{S_2} to be a normalized gaussian with a very localized small perturbation at zero only, then U cannot be larger than (−ε, ε) × R. By diagonalization of H_{ln p_X}(x^{(0)}) using eigenvalue decomposition (principal axis transformation), we can find the (orthogonal) mixing matrix A. Note that the eigenvalue decomposition is unique except for permutation and sign scaling, because every eigenspace (in which A is unique only up to orthogonal transformation) has dimension one. Arbitrary scaling indeterminacy does not occur because we have forced S and X to have unit variances.
Using uniqueness of the eigenvalue decomposition and theorem 2, we have shown the following theorem:

Theorem 3 (BSS by Hessian calculation). Let X = AS with an independent random vector S and an orthogonal matrix A. Let x ∈ R^n such that locally at x, X admits a C^2-density p_X with p_X(x) ≠ 0. Assume that H_{ln p_X}(x) has n different eigenvalues (see lemma 5). If E^T H_{ln p_X}(x) E = D is an eigenvalue decomposition of the Hessian of the logarithm of p_X at x, that is, E orthogonal and D diagonal, then E ∼ A, so E^T X is independent.
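A minimal numerical sketch of theorem 3: if the Hessian of ln p_X at a suitable point is available, its eigendecomposition recovers the orthogonal mixing matrix up to permutation and sign. Here we construct such a Hessian synthetically as A D A^T, mimicking equation 4.2 with a diagonal source-side Hessian; the matrices are arbitrary test data, not estimated from samples.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.linalg.qr(rng.standard_normal((3, 3)))[0]   # orthogonal "mixing" matrix
D = np.diag([-2.0, -1.0, -0.5])                    # distinct eigenvalues
H = A @ D @ A.T                                    # plays the role of H_{ln p_X}(x)

eigvals, E = np.linalg.eigh(H)                     # E^T H E = diag(eigvals)
# Because all eigenvalues differ, E matches A up to permutation and sign,
# i.e., |E^T A| is a permutation matrix.
M = np.abs(E.T @ A)
print(np.round(M, 6))
print(np.allclose(M @ M.T, np.eye(3)))
```

If two eigenvalues coincided, the corresponding eigenspace would only be determined up to rotation, which is exactly the degeneracy excluded by lemma 5.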
F. Theis
Furthermore, it follows from this theorem that linear BSS is a local problem, as proven already in Theis, Puntonet, and Lang (2003) using the restriction of a random vector.

4.1 Example for Hessian Diagonalization BSS. In order to illustrate the algorithm of local Hessian diagonalization, we give a two-dimensional example. Let S be a random vector with densities

p_{S_1}(s_1) = \frac{1}{2}\,\chi_{[-1,1]}(s_1), \qquad p_{S_2}(s_2) = \frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{s_2^2}{2}\right),
where χ[−1,1] is one on [−1, 1] and zero everywhere else. The orthogonal mixing matrix A is chosen to be
A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}.
The mixture density p_X of X := AS then is (det A = 1)

p_X(x) = \frac{1}{2\sqrt{2\pi}}\,\chi_{[-1,1]}\left(\frac{x_1 - x_2}{\sqrt{2}}\right)\exp\left(-\frac{(x_1 + x_2)^2}{4}\right),
for x ∈ R². p_X is positive and C² in a neighborhood of 0. Then

\partial_1 \ln p_X(x) = \partial_2 \ln p_X(x) = -\frac{1}{2}(x_1 + x_2),
\partial_1^2 \ln p_X(x) = \partial_2^2 \ln p_X(x) = \partial_1 \partial_2 \ln p_X(x) = -\frac{1}{2}

for x with |x| < 1/2, and the Hessian of the logarithmic density is
H_{\ln p_X}(x) = -\frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
independent of x in a neighborhood of 0. Diagonalization of H_{\ln p_X}(0) yields

\begin{pmatrix} -1 & 0 \\ 0 & 0 \end{pmatrix},

and this equals A\, H_{\ln p_X}(0)\, A^\top, as stated in theorem 3.
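This example can also be checked numerically. The sketch below (our own code, not from the paper; the finite-difference step h is an assumption) approximates H_{ln p_X}(0) for the density above and recovers the mixing matrix by eigendecomposition, as theorem 3 prescribes:

```python
import numpy as np

def log_hessian(log_p, x0, h=1e-3):
    """Central finite-difference Hessian of a log-density at x0."""
    n = len(x0)
    I = np.eye(n) * h
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (log_p(x0 + I[i] + I[j]) - log_p(x0 + I[i] - I[j])
                       - log_p(x0 - I[i] + I[j]) + log_p(x0 - I[i] - I[j])) / (4 * h * h)
    return H

# Near 0, ln p_X(x) = const - (x_1 + x_2)^2 / 4 for the example above.
log_p = lambda x: -0.25 * (x[0] + x[1]) ** 2
H = log_hessian(log_p, np.zeros(2))      # approx. -1/2 * [[1, 1], [1, 1]]

# The eigendecomposition recovers A up to permutation and sign:
A = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)
w, E = np.linalg.eigh(H)                 # eigenvalues -1 and 0
# |E^T A| is a permutation matrix, i.e. E ~ A
```

Since the local log-density is quadratic, the finite differences are exact up to rounding, and the columns of E agree with those of A up to permutation and sign.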
A New Concept for Separability Problems in BSS
4.2 Global Hessian Diagonalization Using Kernel-Based Density Approximation. In practice, it is usually not possible to approximate the density locally with sufficiently high accuracy, so a better approximation using the typically global information of X has to be found. In the following, we suggest using kernel-based density estimation to get an energy function with minima at the BSS solutions, together with a global Hessian diagonalization. The idea is to construct a measure for separatedness of the densities (hence independence) based on theorem 1. A possible measure could be the norm of the summed-up separators ∑_{i<j} R_{ij}[f]. In order for this to be computable, we choose a set of points p^{(k)} at which we evaluate the separators and minimize ∑_k ∑_{i<j} R_{ij}[f](p^{(k)})² over those points. Although in the linear noiseless case, calculating the Hessian at only one point would be enough, an energy function of this type brings in global information about the densities while averaging over possible local errors. First, we need to approximate the density function. For this, let X ∈ R^n be an n-dimensional random vector with ν independent and identically distributed samples x^{(1)}, …, x^{(ν)} ∈ R^n. Let φ : R^n → R,
\varphi(x) = \frac{1}{\sigma^n \sqrt{(2\pi)^n}}\,\exp\left(-\frac{\|x\|^2}{2\sigma^2}\right),

be the n-dimensional centered independent gaussian with fixed variance σ² > 0. For ease of notation, denote κ := 1/(2σ²). Define the approximated density p̂_X of X by

\hat{p}_X(x) := \frac{1}{\nu} \sum_{i=1}^{\nu} \varphi(x - x^{(i)}).   (4.3)
If ν → ∞, p̂_X converges to p_X in the space of all integrable functions if σ is chosen appropriately; this can be shown using the central limit theorem. Figure 2 depicts the approximation of a Laplacian using equation 4.3. The partial derivatives of φ can be calculated as

\partial_i \varphi(x) = -2\kappa x_i\, \varphi(x),
\partial_i \partial_j \varphi(x) = 4\kappa^2 x_i x_j\, \varphi(x)   (4.4)

for i ≠ j. φ is separated, so R[φ] ≡ 0. Note that p̂_X ∈ C^∞(R^n, R) is positive. So according to theorem 1, p̂_X is separated if and only if R_{ij}[p̂_X] ≡ 0 for i < j. And since p̂_X is an approximation of p_X, separatedness of p̂_X also induces approximate independence of X.
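The kernel estimator of equation 4.3 is straightforward to implement; a minimal sketch (function and parameter names are ours, not the paper's):

```python
import numpy as np

def kde_gaussian(samples, sigma):
    """Kernel density estimate p_hat_X of equation 4.3: the mean of
    isotropic gaussian bumps of width sigma centered at the samples."""
    nu, n = samples.shape
    norm = 1.0 / (sigma ** n * (2 * np.pi) ** (n / 2))

    def p_hat(x):
        d2 = np.sum((samples - x) ** 2, axis=1)   # ||x - x^(i)||^2
        return norm * np.mean(np.exp(-d2 / (2 * sigma ** 2)))

    return p_hat
```

The resulting estimate is C^∞ and positive, as used in the argument above; for small σ and many samples, it tracks p_X closely.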
Figure 2: Independent Laplacian density p_S(s) = \frac{1}{4}\exp(-|s_1| - |s_2|): theoretical (left) and approximated (right) densities. For the approximation, 1000 samples and gaussian kernel approximation (see equation 4.3) with standard deviation 0.37 were used.
R_{ij}[p̂_X] can be calculated using lemma 2 (here R_{ij}[φ(· − x^{(k)})] ≡ 0) and equation 4.4:

R_{ij}[\hat{p}_X](x) = \frac{1}{\nu^2}\, R_{ij}\left[\sum_{k=1}^{\nu} \varphi(\cdot - x^{(k)})\right](x)
= \frac{1}{\nu^2} \sum_{k \neq l} \left( \varphi(x - x^{(k)})\, \partial_i \partial_j \varphi(x - x^{(l)}) - (\partial_i \varphi)(x - x^{(k)})\, (\partial_j \varphi)(x - x^{(l)}) \right)
= \frac{4\kappa^2}{\nu^2} \sum_{k \neq l} \varphi(x - x^{(k)})\, \varphi(x - x^{(l)}) \left( x_i^{(k)} - x_i^{(l)} \right) \left( x_j - x_j^{(l)} \right)
= \frac{4\kappa^2}{\nu^2} \sum_{k < l} \varphi(x - x^{(k)})\, \varphi(x - x^{(l)}) \left( x_i^{(k)} - x_i^{(l)} \right) \left( x_j^{(k)} - x_j^{(l)} \right).
This function is zero for i < j if and only if p̂_X is separated. For linear BSS, it would be enough to check this at one point in general position (see theorem 3), but for robustness, we require R_{ij}[p̂_X] to be zero (or as close to zero as possible) at all sample points. So the desired independence estimator can be calculated as

E(x^{(1)}, \ldots, x^{(\nu)}) := E := \sum_{m=1}^{\nu} \sum_{i<j} \left( R_{ij}[\hat{p}_X](x^{(m)}) \right)^2;
hence,

E = (\sigma^2 \nu)^{-4} \sum_{m} \sum_{i<j} \left( \sum_{k<l} \varphi(x^{(m)} - x^{(k)})\, \varphi(x^{(m)} - x^{(l)}) \left( x_i^{(k)} - x_i^{(l)} \right) \left( x_j^{(k)} - x_j^{(l)} \right) \right)^{2}.
Minimizing the function

\varepsilon : \mathrm{Gl}(n) \to \mathbb{R}, \quad W \mapsto E(Wx^{(1)}, \ldots, Wx^{(\nu)}),

then yields the desired demixing matrix with W^{-1} ∼ A according to theorem 2. ε can be minimized using the usual techniques, for example, global search, gradient descent, or fixed-point search. Figure 3 shows the energy function of an example mixture of two Laplacians. E is minimal at the points where WX is independent.
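A rough implementation of this energy (our own code; the kernel width σ is an assumption, and the constant (σ²ν)^{-4} is dropped since it does not affect the minimizer):

```python
import numpy as np

def separator_energy(X, sigma):
    """Sum over sample points m and pairs i < j of R_ij[p_hat_X](x^(m))^2,
    up to a constant factor (gaussian kernel of width sigma)."""
    nu, n = X.shape
    kappa = 1.0 / (2.0 * sigma ** 2)
    # kernel values phi(x^(m) - x^(k)), up to the normalizing constant
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    phi = np.exp(-kappa * d2)                          # shape (m, k)
    E = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            di = X[:, i][:, None] - X[:, i][None, :]   # x_i^(k) - x_i^(l)
            dj = X[:, j][:, None] - X[:, j][None, :]
            # R_ij at every sample point: sum over k, l (k = l terms vanish)
            R = np.einsum('mk,ml,kl,kl->m', phi, phi, di, dj)
            E += np.sum(R ** 2)
    return E
```

Scanning W ↦ separator_energy((W @ X.T).T, σ) over rotations reproduces the shape of Figure 3, with minima at the demixing angles.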
Figure 3: Energy function W ↦ E(WX) of a mixture X of two Laplacians using as mixing matrix A a rotation by 45 degrees. One hundred samples were used, and E is plotted in steps of 0.1. The minima of E clearly lie at π/4 and 3π/4, as desired.
Note that E represents a new approximate measure of independence. Therefore, the linear BSS algorithm can now be readily generalized to nonlinear situations by finding an appropriate parameterization of the possibly nonlinear separating model. The proposed algorithm basically performs a global diagonalization of the logarithmic Hessian after prewhitening. Interestingly, this is similar to traditional BSS algorithms based on joint diagonalization, such as JADE (Cardoso & Souloumiac, 1993), which uses cumulant matrices, or AMUSE (Tong, Liu, Soon, & Huang, 1991) and SOBI (Belouchrani, Meraim, Cardoso, & Moulines, 1997), which employ time decorrelation. Instead of using a global energy function as proposed above, we could therefore also jointly diagonalize a given set of Hessians (respectively, separator matrices, as above; see also Yeredor, 2000). Another relation to previously proposed ICA algorithms lies in the kernel approximation technique. Gaussian or generalized gaussian kernels have already been used in the field of independent component analysis to model the source densities (Lee & Lewicki, 2000; Habl, Bauer, Puntonet, Rodriguez-Alvarez, & Lang, 2001), thus giving an estimate of the score function used in Bell-Sejnowski-type semiparametric algorithms (Bell & Sejnowski, 1995) or enabling direct separation using maximum likelihood parameter estimation. Our algorithm also uses density approximation but employs it for the mixture density, which can be problematic in higher dimensions. A different approach not involving density approximation is a direct sample-based Hessian estimation similar to Lin (1998).

5 Separability of Postnonlinear BSS

In this section, we show how to use the idea of Hessian diagonalization to give separability proofs in nonlinear situations, more precisely, in the setting of postnonlinear BSS.
After stating the postnonlinear BSS model and the general (to the knowledge of the author, not yet proven) separability theorem, we will prove postnonlinear separability in the case of random vectors with distributions that are somewhere locally constant and nonzero (e.g., uniform distributions). A possible proof of postnonlinear separability has been suggested by Taleb and Jutten (1999); however, that proof applies only to densities with at least one zero and furthermore contains an error rendering it applicable only to restricted situations.

Definition 3. A function f : R^n → R^n is called diagonal if each component f_i(x) of f(x) depends only on the variable x_i. In this case, we often omit the other variables and write f(x_1, …, x_n) = (f_1(x_1), …, f_n(x_n)); so f ≡ f_1 × ··· × f_n, where × denotes the Cartesian product.
Consider now the postnonlinear BSS model,

X = f(AS),   (5.1)
where again S is an independent random vector, A ∈ Gl(n), and f is a diagonal nonlinearity. We assume the components of f to be injective analytical functions with invertible Jacobian at every point (locally diffeomorphic).

Definition 4. An invertible matrix A ∈ Gl(n) is said to be mixing if A has at least two nonzero entries in each row.

Note that if A is mixing, then A^⊤, A^{-1}, and ALP for a scaling matrix L and a permutation matrix P are also mixing. Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear BSS: A can be reconstructed only up to scaling and permutation. In the linear case, affine linear transformation is ignored. Here, of course, additional indeterminacies come into play because of translation: f_i can be recovered only up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then f(AS) = (f ∘ L)((L^{-1}A)S), so f and A can interchange scaling factors in each component. Another obvious indeterminacy could occur if A is not general enough. If, for example, A = I, then f(S) is already independent, because independence is invariant under diagonal nonlinear transformation; so f cannot be found in this case. If we assume, however, that A is mixing, then we will show that except for scaling interchange between f and A, no more indeterminacies than in the affine linear case exist.

Theorem 4 (separability of postnonlinear BSS). Let A, W ∈ Gl(n) be mixing, h : R^n → R^n be a diagonal bijective function with analytical locally diffeomorphic components, and S be an independent random vector with at most one gaussian component and existing covariance. If W(h(AS)) is independent, then there exist a scaling matrix L ∈ Gl(n) and p ∈ R^n with LA ∼ W^{-1} and h ≡ L + p. If analyticity of the components of h is not assumed, then h ≡ L + p can hold only on {As | p_S(s) ≠ 0}.

If f ∘ A is the mixing model, W ∘ g is the separating model.
Putting the two together, we get the above mixing-separating model. Since A has to be assumed to be mixing, we can assume W to be mixing as well, because the inverse of a mixing matrix is again mixing. Furthermore, the mixing-separating model is assumed to be bijective (hence A and W invertible and h bijective), because otherwise trivial solutions, such as h ≡ c for a constant c ∈ R, would also be solutions.
We will show the theorem in the case of S and X with components having somewhere locally constant nonzero C²-densities. An alternative geometric idea of how to prove theorem 4 for bounded sources in two dimensions is mentioned in Babaie-Zadeh, Jutten, and Nayebi (2002) and extended in Theis and Gruber (forthcoming). Note that in our case, as well as in the above restrictive cases, the assumption that S has at most one gaussian component holds trivially.

Proof of Theorem 4 (with Locally Constant Nonzero C²-Densities). Let h = h_1 × ··· × h_n with bijective C^∞-functions h_i : R → R. We only have to show that the h_i' are constant. Then h is affine linear, say h ≡ L + p, with a diagonal matrix L ∈ Gl(n) and a vector p ∈ R^n. Hence, W(h(AS)) = WLAS + Wp, and then WLAS is independent, so using linear separability (theorem 2), WLA ∼ I, and therefore LA ∼ W^{-1}.

Let X := W(h(AS)). The density of this transformed random vector is easily calculated from that of S:

p_X(Wh(As)) = |\det W|^{-1}\, |h_1'((As)_1)|^{-1} \cdots |h_n'((As)_n)|^{-1}\, |\det A|^{-1}\, p_S(s)

for s ∈ R^n. By assumption, h has an invertible Jacobian at every point, so the h_i' are either positive or negative; without loss of generality, h_i' > 0. Furthermore, p_X is independent, so we can write p_X ≡ g_1 ⊗ ··· ⊗ g_n. For fixed s^0 ∈ R^n with p_S(s^0) > 0, there exists an open neighborhood U ⊂ R^n of s^0 with p_S|_U > 0 and p_S|_U ∈ C²(U, R). If we define f(s) := ln(|det W|^{-1} |det A|^{-1} p_S(s)) for s ∈ U, then

f(s) = \ln\left( h_1'((As)_1) \cdots h_n'((As)_n)\, g_1((Wh(As))_1) \cdots g_n((Wh(As))_n) \right)
     = \sum_{k=1}^{n} \left[ \ln h_k'((As)_k) + \zeta_k((Wh(As))_k) \right]

for s ∈ U, where ζ_k := ln g_k is defined locally near (Wh(As^0))_k. p_S is separated, so

\partial_i \partial_j f \equiv 0   (5.2)
for i < j. Denote A =: (a_{ij}) and W =: (w_{ij}). The first derivative and then the nondiagonal entries of the Hessian of f can be calculated as follows (i < j):

\partial_i f(s) = \sum_{k=1}^{n} \left[ \frac{h_k''}{h_k'}((As)_k)\, a_{ki} + \zeta_k'((Wh(As))_k) \sum_{l=1}^{n} w_{kl}\, a_{li}\, h_l'((As)_l) \right]

\partial_i \partial_j f(s) = \sum_{k=1}^{n} \Bigg[ a_{ki} a_{kj}\, \frac{h_k''' h_k' - (h_k'')^2}{(h_k')^2}((As)_k)
  + \zeta_k''((Wh(As))_k) \left( \sum_{l=1}^{n} w_{kl}\, a_{li}\, h_l'((As)_l) \right) \left( \sum_{l=1}^{n} w_{kl}\, a_{lj}\, h_l'((As)_l) \right)
  + \zeta_k'((Wh(As))_k) \sum_{l=1}^{n} w_{kl}\, a_{li}\, a_{lj}\, h_l''((As)_l) \Bigg].
Substituting y := As and using equation 5.2, we finally get the following differential equation for the h_k:

0 = \sum_{k=1}^{n} \Bigg[ a_{ki} a_{kj}\, \frac{h_k''' h_k' - (h_k'')^2}{(h_k')^2}(y_k)
  + \zeta_k''((Wh(y))_k) \left( \sum_{l=1}^{n} w_{kl}\, a_{li}\, h_l'(y_l) \right) \left( \sum_{l=1}^{n} w_{kl}\, a_{lj}\, h_l'(y_l) \right)
  + \zeta_k'((Wh(y))_k) \sum_{l=1}^{n} w_{kl}\, a_{li}\, a_{lj}\, h_l''(y_l) \Bigg]   (5.3)
for y ∈ V := A(U). We will restrict ourselves to the simple case mentioned above in order to solve this equation. We assume that the h_k are analytic and that there exists x^0 ∈ R^n where the demixed densities g_k are locally constant and nonzero. Consider the above calculation around s^0 = A^{-1}(h^{-1}(W^{-1}x^0)). Choose the open set V such that the g_k are locally constant and nonzero on W(h(V)). Then so are the ζ_k = ln g_k, and therefore

0 = \sum_{k=1}^{n} a_{ki} a_{kj}\, \frac{h_k''' h_k' - (h_k'')^2}{(h_k')^2}(y_k)
for y ∈ V. Hence, there exist open intervals I_k ⊂ R and constants b_k ∈ R with

a_{ki} a_{kj} \left( h_k''' h_k' - (h_k'')^2 \right) \equiv b_k\, (h_k')^2

on I_k (here, b_k = -\sum_{l \neq k} a_{li} a_{lj}\, \frac{h_l''' h_l' - (h_l'')^2}{(h_l')^2}(y_l) for some, and then any, y ∈ V).
By assumption, A is mixing. Hence, for fixed k, there exist i ≠ j with a_{ki} a_{kj} ≠ 0. If we set c_k := b_k / (a_{ki} a_{kj}), then

c_k (h_k')^2 - h_k''' h_k' + (h_k'')^2 \equiv 0   (5.4)
on I_k. h_k was chosen to be analytic, and equation 5.4 holds on the open set I_k, so it holds on all of R. Applying lemma 3 then shows that either h_k' ≡ 0 or

h_k'(x) = \pm \exp\left( \frac{c_k}{2} x^2 + d_k x + e_k \right), \quad x \in \mathbb{R},   (5.5)
with constants d_k, e_k ∈ R. By assumption, h_k is bijective, so h_k' ≢ 0. Applying the same arguments as above to the inverse system S = A^{-1}(h^{-1}(W^{-1}X)) and using the fact that p_S is also somewhere locally constant and nonzero shows that equation 5.5 also holds for (h_k^{-1})' with other constants. But if both the derivative of h_k and that of h_k^{-1} are of this exponential type, then c_k = d_k = 0, and therefore h_k is affine linear for all k = 1, …, n, which completes the proof of postnonlinear separability in this special case.

Note that in the above proof, local positiveness of the densities was assumed in order to use the equivalence of local separability with diagonality of the logarithmic Hessian. These results can be generalized using theorem 1 in a similar fashion as in the linear case with theorem 2; in particular, we have thereby proven postnonlinear separability also for uniformly distributed sources.

6 Conclusion

We have shown how to derive the separability of linear BSS using diagonalization of the Hessian of the logarithmic density, respectively of the characteristic function, which induces separated, that is, independent, sources. The idea of Hessian diagonalization is turned into a new algorithm for performing linear independent component analysis, which is shown to be a local problem. In practice, however, because the densities cannot be approximated locally very well, we also propose a diagonalization algorithm that takes the global structure into account. In order to show the use of this framework of separated functions, we finish with a proof of postnonlinear separability in a special case. In future work, more general separability results for postnonlinear BSS could be obtained by finding more general solutions of the differential equation 5.3.
Algorithmic improvements could be made by using other density approximation methods, such as mixture-of-gaussians models, or by approximating the Hessian itself using the cumulative density and discrete approximations of the differential. Finally, the diagonalization algorithm can easily be extended to nonlinear situations by finding appropriate model parameterizations; instead of minimizing the mutual information, we minimize the absolute value of the off-diagonal terms of the logarithmic Hessian.
The algorithm has been specified using only an energy function; gradient and fixed-point algorithms can be derived in the usual manner. Separability in nonlinear situations has turned out to be a hard problem, ill-posed in the most general case (Hyvärinen & Pajunen, 1999), and not many nontrivial results exist for restricted models (Hyvärinen & Pajunen, 1999; Babaie-Zadeh et al., 2002), all only two-dimensional. We believe that this is due to the fact that the rather nontrivial proof of the Darmois-Skitovitch theorem is not at all easily generalized to more general settings (Kagan, 1986). By introducing separated functions, we are able to give a much easier proof for linear separability and also provide new results in nonlinear settings. We hope that these ideas will be used to show separability in other situations as well.

Acknowledgments

I thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. I also thank Peter Gruber, Wolfgang Hackenbroch, and Michaela Theis for suggestions and remarks on various aspects of the separability proof. The work described here was supported by the DFG in the grant "Nonlinearity and Nonequilibrium in Condensed Matter" and the BMBF in the ModKog project.

References

Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post non-linear mixtures. In Proc. of EUSIPCO '02 (Vol. 2, pp. 11–14). Toulouse, France.
Bauer, H. (1996). Probability theory. Berlin: Walter de Gruyter.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434–444.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non gaussian signals. IEE Proceedings-F, 140(6), 362–370.
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. New York: Wiley.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Darmois, G. (1953). Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21, 2–8.
Eriksson, J., & Koivunen, V. (2003). Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003 (pp. 23–27). Nara, Japan.
Habl, M., Bauer, C., Puntonet, C., Rodriguez-Alvarez, M., & Lang, E. (2001). Analyzing biomedical signals with probabilistic ICA and kernel-based source density estimation. In M. Sebaaly (Ed.), Information science innovations (Proc. ISI'2001) (pp. 219–225). Alberta, Canada: ICSC Academic Press.
Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206–211). New York: American Institute of Physics.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439.
Kagan, A. (1986). New classes of dependent random variables and a generalization of the Darmois-Skitovitch theorem to several forms. Theory Probab. Appl., 33(2), 286–295.
Lee, T., & Lewicki, M. (2000). The generalized gaussian mixture model using ICA. In Proc. of ICA 2000 (pp. 239–244). Helsinki, Finland.
Lin, J. (1998). Factorizing multivariate function classes. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 563–569). Cambridge, MA: MIT Press.
Skitovitch, V. (1953). On a property of the normal distribution. DAN SSSR, 89, 217–219.
Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Transactions on Signal Processing, 47, 2807–2820.
Theis, F., & Gruber, P. (forthcoming). Separability of analytic postnonlinear blind source separation with bounded sources. In Proc. of ESANN 2004. Evere, Belgium: d-side.
Theis, F., Jung, A., Puntonet, C., & Lang, E. (2002). Linear geometric ICA: Fundamentals and algorithms. Neural Computation, 15, 1–21.
Theis, F., Puntonet, C., & Lang, E. (2003). Nonlinear geometric ICA. In Proc. of ICA 2003 (pp. 275–280). Nara, Japan.
Tong, L., Liu, R.-W., Soon, V., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38, 499–509.
Yeredor, A. (2000). Blind source separation via the second characteristic function. Signal Processing, 80(5), 897–902.
Received June 27, 2003; accepted March 8, 2004.
LETTER
Communicated by Jacques Droulez
Modeling Mental Navigation in Scenes with Multiple Objects

Patrick Byrne
[email protected]
Suzanna Becker [email protected] Department of Psychology, McMaster University, Hamilton, Ontario, Canada
Various lines of evidence indicate that animals process spatial information regarding object locations differently from spatial information regarding environmental boundaries or landmarks. Following Wang and Spelke’s (2002) observation that spatial updating of egocentric representations appears to lie at the heart of many navigational tasks in many species, including humans, we postulate a neural circuit that can support this computation in parietal cortex, assuming that egocentric representations of multiple objects can be maintained in prefrontal cortex in spatial working memory (not simulated here). Our method is a generalization of an earlier model by Droulez and Berthoz (1991), with extensions to support observer rotation. We can thereby simulate perspective transformation of working memory representations of object coordinates based on an egomotion signal presumed to be generated via mental navigation. This biologically plausible transformation would allow a subject to recall the locations of previously viewed objects from novel viewpoints reached via imagined, discontinuous, or disoriented displacement. Finally, we discuss how this model can account for a wide range of experimental findings regarding memory for object locations, and we present several predictions made by the model. 1 Introduction Spatial reasoning is of paramount importance in nearly all aspects of human behavior, from planning and navigating a complex route through some environment in order to reach a relevant goal, to simply grasping a nearby object. The set of all entities that comprise our spatial surroundings may be divided into subsets in arbitrarily many ways. For the present purposes, we will partition these entities into environmental boundaries/landmarks and objects. There is substantial empirical evidence as to how the brain represents and processes the former (for a review, see Burgess, Becker, King, & O’Keefe, 2001). 
Neural Computation 16, 1851–1872 (2004)  © 2004 Massachusetts Institute of Technology

For example, O'Keefe and Dostrovsky (1971) found neurons in the hippocampus of the rat that respond to the rat's location in space. O'Keefe and Nadel (1978) argue that this collection of "place cells" forms a
cognitive map and is the rat’s internal allocentric map of the environment. Evidence of view-invariant hippocampal place cells has also been found in nonhuman primates (Ono, Nakamura, Nishijo, & Eifuku, 1993) and in human hippocampus (Ekstrom et al., 2003). For the case of short-term object location representation, the focus of this article, empirical evidence strongly indicates involvement of the posterior parietal cortex. For example, Sabes, Breznen, and Andersen (2002) perform single unit recordings that demonstrate that area LIP of monkey cortex encodes saccade targets in retinotopic coordinates. More generally, Colby and Goldberg (1999) review evidence showing that object locations are represented in a variety of reference frames in parietal cortex, while Andersen, Shenoy, Snyder, Bradley, and Crowell (1999) review evidence suggesting that area 7a incorporates vestibular information to maintain localization of visual stimuli in a world-centered reference frame. Finally, Goodale and Milner (1992) review data suggesting that the dorsal stream from striate to posterior parietal cortex is responsible for the relatively short timescale sensorimotor transformations used while performing visually guided actions directed at objects in the environment. In this article, we first review empirical evidence from neuropsychological and electrophysiological studies as to the nature and locus of object representations in the brain. We then present a biologically plausible computational model that allows for the transformation of object coordinates maintained in spatial working memory (WM). The function of this transformation is to update WM representations of object coordinates in egocentric space while the subject mentally navigates through the environment. This transformation is driven by the products of mental navigation (presumably some mentally generated equivalent to vestibular, proprioceptive, and motor efference information). 
Next, we perform simulations demonstrating the functioning of the model and how error is introduced into object coordinate representations by the model. Finally, we discuss how the model can account for various experimental findings. The representation of object location information in the brain appears, at least under certain circumstances, to be quite different from the representation of environmental boundary information. A series of experiments performed by Wang & Spelke (2000) provides insight into how humans differentially process object location and environmental boundary information. For each experiment, the authors placed a few objects around a small room. Subjects were allowed to explore the room and studied objects’ locations for as long as they pleased. They were then brought to the center of the room, blindfolded, rotated a small amount, and asked to point to the various objects as they were called out in a random order by the experimenters. Following this, subjects were required to sit in a swivel chair fixed at the center of the room and disorient themselves via a 1 minute self-induced rotation. They were again asked to point to the objects in a predetermined random order. In addition to objects, subjects were also required to point to
room corners in some experiments. The main findings can be summarized as follows: After the initial rotation (without disorientation), subjects could point to objects and room corners with relatively high accuracy, implying that they had correctly encoded the information that they were asked to. After disorientation, subjects could no longer accurately point to the location of objects or room corners. However, analyses of the consistency of the pointing errors indicated that the relative configuration of the room corners could be accurately recalled (for nonrectangular as well as rectangular rooms), whereas the relative object configuration could not be. This supports a configural representation of geometric feature locations and nonconfigural or independent representations of object locations. Furthermore, when subjects were reoriented with a bright light (still blindfolded), they could accurately recall the absolute and relative locations of room corners but neither the relative nor absolute locations of objects. When the bright light was left on throughout the disorientation procedure, subjects could recall the relative and absolute locations of both objects and room corners. The results of Wang and Spelke's (2000) experiments indicate that subjects could rapidly form an accurate internal representation of environmental boundaries and object locations. However, it appears that during mental transformations of spatial information in trials where the subject is not continuously oriented, information about a given object location is updated independently of information regarding other objects and of the environment. In this way, each object location is subject to independent transformation error. This is in contrast to environmental geometric cues such as room corners, which appear to be updated as a coherent whole.
Additional evidence that object location information is handled differently from environmental boundary information comes from an experiment by Shelton and McNamara (2001). Subjects were brought into a room with various objects placed at different locations inside and allowed to observe the objects from various predetermined viewpoints. After leaving the room, they were asked to imagine standing at a given object facing a second object and to point to where a third object would be relative to themselves. Subjects performed best when their imagined viewpoint was aligned with the original viewpoint from which they observed the object configuration. This suggests the possibility that subjects store the object configuration in head-centered egocentric coordinates and that they must transform it (introducing error) when they imagine observing it from other viewpoints. We say head centered here because subjects were asked to imagine facing an object, which implies pointing their head toward it; they were not asked to constrain their gaze direction, so neither retinal nor body-centered coordinates are implied. Head-centered coordinates are also implied by the unit recordings of Funahashi, Bruce, and Goldman-Rakic (1989), as we mention in the next section. Another possible interpretation of Shelton and McNamara's (2001) results is that subjects form a view-based snapshot of the entire presentation scene, which can be used for later matching. Wang and Spelke
(2002) review evidence that humans do seem to make use of such snapshots, at least under certain circumstances. However, we do not investigate this issue any further here. In order to build a computational model of mental navigation that explains empirical results such as those of Wang and Spelke (2000) and Shelton and McNamara (2001), we first require a more complete understanding of how the brain represents objects. We next review empirical evidence as to the nature and locus of object representations in the brain.

2 Objects and Spatial Working Memory

A number of experiments suggest a transition from short-term spatial WM to long-term memory representations of objects after several minutes of study time (Smith & Milner, 1989; Crane, Milner, & Leonard, 1995; Bohbot et al., 1998). The evidence provided by these studies suggests that medial temporal lobe structures are essential to location memory over periods of several minutes (≈ 4 minutes) or greater but are less relevant over shorter timescales. To investigate the nature of short-term object location memory, we consider the evidence from functional imaging and unit recording studies. In an fMRI experiment performed by Galati et al. (2000), subjects were required to report the location of a vertical bar flashed before them for 150 ms relative to their midsagittal plane. Several frontal and parietal regions were more active during this location task relative to a control color decision involving the same stimuli. Furthermore, Sala, Rämä, and Courtney (2003) presented subjects with a sequence of flashes and asked them to recall the location or identity of what was shown (picture of a house or a face) three flashes back. During location recall, fMRI scans revealed activation in the superior portion of the intraparietal sulcus (IPS) and in the superior frontal sulcus, as well as other areas. During identity recall, activation was found in the inferior and medial frontal gyrus, as well as other areas.
Modeling Mental Navigation in Scenes with Multiple Objects
P. Byrne and S. Becker

These observations indicate that areas of frontal and parietal cortices are of key importance in generating and maintaining internal representations of spatial locations, at least for short periods of time. The role of frontal cortical areas in this process is reviewed by Levy and Goldman-Rakic (2000), who argue that the principal sulcus (area 46) plays a crucial role in spatial WM and that Walker's areas 12 and 45 (the inferior convexity) play a crucial role in object identity WM. In particular, a unit recording study by Funahashi et al. (1989) showed that neurons in the principal sulcus of the monkey appear to code for egocentric spatial locations in head-centered coordinates. Furthermore, a human study by Oliveri et al. (2001), requiring subjects to remember the position of a flash two steps back in a sequence, found that accuracy was affected only when transcranial magnetic stimulation (TMS) was applied to the dorsolateral prefrontal cortex (DLPFC), although TMS applied to several different brain regions affected reaction times. These results suggest that egocentric representations of spatial locations are maintained in DLPFC. Unfortunately, the role of parietal cortical areas in spatial WM is somewhat less clear. It is known, however, that neurons in areas VIP and LIP of the IPS show receptive fields in head-centered coordinates (see Burgess, Jeffery, & O'Keefe, 1999, for an overview). Also, Chaffe and Goldman-Rakic (1998) showed that when monkeys are required to hold a target location in memory for a short delay before making a saccade to it, neurons in area 8a (near the principal sulcus of prefrontal cortex) and in area 7ip (near the IPS) become active and show temporally varying activation levels over the delay period. Neurons in both regions show spatial selectivity similar to that found in the principal sulcus of monkeys by Funahashi et al. (1989). From the above evidence, it seems plausible that object locations are initially represented in parietal cortex, and if sufficient attention is allocated to them, their locations and identities are maintained in WM in DLPFC. In particular, their locations are represented egocentrically (as Shelton and McNamara's, 2001, work might indicate) in one area of DLPFC, while maintenance of their identities involves a different area of DLPFC, consistent with Goldman-Rakic's hypothesis that the ventral/dorsal (what/where) distinction persists into the DLPFC. In order to mentally manipulate the positions of objects stored in WM, it seems reasonable to assume that a circuit involving areas of the parietal cortex (especially the IPS) would be involved, given this area's involvement in spatial WM and its known ability to represent locations in multiple reference frames. In particular, we hypothesize that object coordinates are maintained in WM egocentrically in DLPFC and manipulated and represented more transiently in the vicinity of the IPS.
3 Model

In this work, we develop a computational account of memory for object locations in the face of certain types of viewer motion, specifically those in which the motion is imagined or those in which the subject does not remain continuously oriented throughout. At a minimum, the model should provide an explanation of Shelton and McNamara's (2001) finding that subjects more accurately recall object positions when asked to do so from a viewpoint in which they previously observed the object configuration, and of Wang and Spelke's (2000) finding regarding the pattern of errors made by disoriented subjects recalling object configurations. One possible way in which subjects might perform either the Shelton and McNamara or the Wang and Spelke task is by mental navigation. In contrast to a simple mental rotation of the object array, mental navigation involves imagined egomotion. This entails making mental viewpoint transformations while simultaneously updating WM representations of egocentric object coordinates. More specifically, subjects could make a WM snapshot of object locations from a given viewpoint, and when asked to recall object locations from a new viewpoint, they could
mentally navigate from the initial viewpoint to the new viewpoint and use the same motion signal driving mental navigation to simultaneously drive a transformation of egocentric object coordinates. In the case of the Wang and Spelke tasks, the new viewpoint would have to be estimated due to the disorientation procedure. Also, assuming a serial updating procedure in which the transformation just described must be repeated for each individual object allows for a possible explanation of the finding that object configurations could not be accurately recalled after disorientation and that recall errors were not systematic, indicating lack of a configurational representation. This will be discussed in more detail in section 5. Our hypothesis requires neural circuits that can represent the environment allocentrically, to allow for mental navigation through the environment. Additional circuits are required to maintain egocentric object coordinates and transform these coordinates, one object at a time, based on the egomotion signal that drives mental navigation. To this end, models of real navigation based on internal allocentric cognitive maps have been developed for rats and humans (see Voicu, 2003, and Burgess, Donnett, & O’Keefe, 1997, for examples). Given the evidence that both real and imagined spatial tasks invoke nearly the same cortical circuitry (see, e.g., Stippich, Ochmann, & Sartor, 2002; Ino et al., 2002; Mellet et al., 2000; Kreiman, Koch, & Fried, 2000), these models of real navigation will be assumed to be applicable to mental navigation. We now require a neural circuit that performs the object location transformation. To begin, we must decide how egocentric object coordinates are to be represented. 
Thelen, Schöner, Scheier, and Smith (2001) and Compte, Brunel, Goldman-Rakic, and Wang (2000) have created models of spatial WM that hold location as a bump of activity in a topographically organized neural circuit, in which each neuron represents a location or direction in egocentric space. Becker and Burgess (2001) model the parietal egocentric map in this way, and similarly the organism's location in allocentric space is represented by a gaussian bump of activity over an array of hippocampal place cells. Such bump attractor networks have been used by others in models of hippocampus as well (Zhang, 1996; Samsonovich & McNaughton, 1997). Here also it will be assumed that the model contains a main layer of neurons, each of which is preassigned a unique location on a Cartesian grid covering head-centered egocentric space. We will represent object location in egocentric space as a gaussian bump of activity in the main neuron layer. By definition, the person's coordinates in this map are the origin, and their orientation will be taken as facing along the positive y-axis. The existence of these main-layer neurons in parietal cortex is supported by the electrophysiological recordings of Chaffe and Goldman-Rakic (1998). In addition to a representation of object locations, our model requires a mechanism for updating that representation; this will be adapted from a model by Droulez and Berthoz (1991), which can easily be implemented to handle translational movements of the observer but requires a nontrivial extension to handle rotational motion. To derive the
Figure 1: Translation and rotation of reference frame. (The figure shows an object at point P with position vector r_P in the original frame, centered at O with axes x and y; the observer translates by r_T to O' and rotates by θ, giving the primed axes x', y' and position vector r'_P.)
model, first consider an observer standing at the origin, O, of some reference frame and facing along the positive y-axis (see Figure 1). The position of an object at a point, P, will be denoted by the vector $\mathbf{r}_P$. If the observer moves by an amount $\mathbf{r}_T$ and rotates by an amount θ, then we will call the new egocentric reference frame the primed frame. The position of the object with respect to this new frame is given by

$$\mathbf{r}'_P = \mathbf{r}_P - \mathbf{r}_T. \qquad (3.1)$$
Reexpressed in terms of Cartesian x, y components, this is

$$x'_P\,\mathbf{i}' + y'_P\,\mathbf{j}' = (x_P - x_T)\,\mathbf{i} + (y_P - y_T)\,\mathbf{j}, \qquad (3.2)$$
where $\mathbf{i}$ and $\mathbf{j}$ are basis vectors oriented along the x- and y-axes, respectively (and $\mathbf{i}'$, $\mathbf{j}'$ are the corresponding basis vectors of the primed frame). Making an appropriate change of basis on the right-hand side yields

$$x'_P = (x_P - x_T)\cos\theta + (y_P - y_T)\sin\theta,$$
$$y'_P = (y_P - y_T)\cos\theta - (x_P - x_T)\sin\theta. \qquad (3.3)$$
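As a quick check of equation 3.3, the change of frame can be written in a few lines of code (a sketch; the function name and the sign convention, with positive θ denoting a leftward, counterclockwise rotation, are our own):

```python
import numpy as np

def to_new_frame(r_p, r_t, theta):
    """Express a point r_p, given in the old egocentric frame, in the
    frame of an observer who has translated by r_t and then rotated
    left by theta (equation 3.3)."""
    dx, dy = r_p[0] - r_t[0], r_p[1] - r_t[1]
    return np.array([dx * np.cos(theta) + dy * np.sin(theta),
                     dy * np.cos(theta) - dx * np.sin(theta)])

# An object 5 units to the observer's right, after a 90-degree left
# turn with no translation, ends up directly behind the observer.
print(to_new_frame([5.0, 0.0], [0.0, 0.0], np.pi / 2))  # ≈ [0, -5]
```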
Next, we assume that the observer's viewpoint shift occurs over a time interval Δt ≪ 1, so that they are standing at O facing in the $\mathbf{j}$ direction at time t and at O' facing in the $\mathbf{j}'$ direction at time t + Δt. Assuming that
egocentric velocities vary smoothly enough that for any time $t' \in [t, t + \Delta t]$ we can write

$$v_x(t') = v_{0x} + f_{v_x}(t'), \quad v_y(t') = v_{0y} + f_{v_y}(t'), \quad \omega(t') = \omega_0 + f_{\omega}(t'), \qquad (3.4)$$

where $v_x(t')$, $v_y(t')$, and $\omega(t')$ are the translational and rotational velocities of the observer as measured in the O frame, and the f functions vary by $O(\Delta t)$ over the interval and satisfy $f(t' = t) = 0$; then we have

$$x'_P = x_P - (v_{0x} - \omega_0 y_P)\,\Delta t + O(\Delta t^2),$$
$$y'_P = y_P - (v_{0y} + \omega_0 x_P)\,\Delta t + O(\Delta t^2). \qquad (3.5)$$
Of course, the primed position variables are simply the egocentric object coordinates at time t + Δt, the unprimed position variables are simply the egocentric object coordinates at time t, and the velocity variables are the egocentric velocities at time t. Therefore, we can rewrite equations 3.5 as

$$x(t + \Delta t) = x(t) - [v_x(t) - \omega(t)\,y(t)]\,\Delta t + O(\Delta t^2),$$
$$y(t + \Delta t) = y(t) - [v_y(t) + \omega(t)\,x(t)]\,\Delta t + O(\Delta t^2), \qquad (3.6)$$

where $x(t)$ and $y(t)$ refer to the egocentric object coordinates at time t and $v_x(t)$, $v_y(t)$, $\omega(t)$ are the egocentrically measured velocities at time t. We are interested in representing the egocentric location of an object by a bump of activity across a population of neurons, moving around within the observer's egocentric reference frame over time. This activity can be expressed as a function $A(x, y, t)$ over egocentric coordinates and time. Given the value of $A(x, y, t)$, the value of $A(x, y, t + \Delta t)$ can easily be found by applying the inverse of equations 3.6 to the arguments of A, as follows:

$$A(x, y, t + \Delta t) = A\big(x + [v_x(t) - \omega(t)y]\Delta t + O(\Delta t^2),\; y + [v_y(t) + \omega(t)x]\Delta t + O(\Delta t^2),\; t\big). \qquad (3.7)$$
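Equation 3.7 says that the updated map is just the old map sampled at inversely shifted coordinates. A minimal sketch (function and variable names are our own; the activity at time t is supplied as a callable so the shifted samples can be evaluated exactly):

```python
import numpy as np

def step_activity(A, coords, vx, vy, omega, dt):
    """One update of an activity map via equation 3.7:
    A(x, y, t+dt) = A(x + (vx - omega*y)*dt, y + (vy + omega*x)*dt, t).
    `A` is a callable giving the activity at time t; `coords` is an
    (N, 2) array of the (x, y) points at which to evaluate the new map."""
    x, y = coords[:, 0], coords[:, 1]
    return A(x + (vx - omega * y) * dt, y + (vy + omega * x) * dt)

# A gaussian bump at (2, 0): with the observer moving at vx = 1, the
# bump drifts in the -x direction of egocentric space.
bump = lambda xs, ys: np.exp(-(xs - 2.0) ** 2 - ys ** 2)
xs = np.linspace(-5.0, 5.0, 1001)
new = step_activity(bump, np.stack([xs, np.zeros_like(xs)], axis=1),
                    vx=1.0, vy=0.0, omega=0.0, dt=0.1)
print(xs[np.argmax(new)])  # peak near x = 1.9
```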
The right-hand side of this equation may be expanded using the second mean value theorem. Such an expansion will be valid as long as the second derivatives of A and the coefficients of Δt remain small. Notice, however, that these coefficients actually depend on location, so for distant objects, more error will be introduced into the approximation. We are thus left with

$$A(x, y, t + \Delta t) = A(x, y, t) + [v_x(t) - \omega(t)y]\,\Delta t\,\frac{\partial A}{\partial x}(x, y, t) + [v_y(t) + \omega(t)x]\,\Delta t\,\frac{\partial A}{\partial y}(x, y, t) + O(\Delta t^2). \qquad (3.8)$$
Recall that space in this model is to be represented as a Cartesian grid of points, with a single main layer neuron allocated to each point. In an attempt to follow the notation of Droulez and Berthoz (1991), we note that each of these positions and, hence, the corresponding main layer neurons can be labeled with a positive integer, starting at 1 for the position (neuron) with the smallest x and y value and increasing with position in the direction of increasing x until one row is complete. The numbering continues on the row with the next lowest y value until the last position is reached (see Figure 2). By doing this, equation 3.8 can be discretized as follows:

$$A(x_i, y_i, t + \Delta t) \approx \sum_j \big[\,a_{ij} + b_{ij}\,v_x(t) + c_{ij}\,v_y(t) + d_{ij}\,\omega(t)\,\big]\,A(x_i + dx_{ij},\, y_i + dy_{ij},\, t), \qquad (3.9)$$
where $x_i$ and $y_i$ are the x and y coordinates of the ith location and $dx_{ij}$ and $dy_{ij}$ are the distances from the ith location to the jth location along the x and y directions. The values of the a, b, c, and d coefficients are determined by the approximation used to calculate the gradients of A. For example, to approximate equation 3.8 using a centered difference rule for the partial derivatives, the coefficients should be selected as

$$a_{ij} = \delta_{ij}, \quad b_{ij} = \frac{\Delta t}{2\,dx_{ij}}\,(\delta_{i+1,j} - \delta_{i-1,j}), \quad c_{ij} = \frac{\Delta t}{2\,dy_{ij}}\,(\delta_{i+N_x,j} - \delta_{i-N_x,j}),$$
$$d_{ij} = \frac{\Delta t}{2}\left[\frac{x_i}{dy_{ij}}\,(\delta_{i+N_x,j} - \delta_{i-N_x,j}) - \frac{y_i}{dx_{ij}}\,(\delta_{i+1,j} - \delta_{i-1,j})\right], \qquad (3.10)$$

where $\delta_{ij}$ is the Kronecker delta function and $N_x$ is the number of neurons spanning the x-direction. Notice that equation 3.9 can be rewritten as
Figure 2: Schematic of the proposed neural network. The main layer neurons (numbered 1–9 in the schematic, starting at the smallest x and y values and increasing along x first) are interconnected with weights "a". They are also each connected to the neurons with the same x and y coordinates in the other layers via unit weights. Neurons in the translational velocity layer multiply the current egocentric velocity with the corresponding main neuron activation from the current time step and feed back with weights "b" (the Vy layer, which feeds back with weights "c", has been left out of this diagram for clarity). Finally, neurons in the angular velocity layer multiply main layer activations by the current head rotation speed and feed back with weights "d".
$$A_M(x_i, y_i, t + \Delta t) = \sum_j \big[\, a_{ij}\,A_M(x_i + dx_{ij}, y_i + dy_{ij}, t) + b_{ij}\,A_{V_x}(x_i + dx_{ij}, y_i + dy_{ij}, t)$$
$$\qquad\qquad +\; c_{ij}\,A_{V_y}(x_i + dx_{ij}, y_i + dy_{ij}, t) + d_{ij}\,A_{\omega}(x_i + dx_{ij}, y_i + dy_{ij}, t)\,\big],$$
$$A_{V_x}(x_i, y_i, t) = v_x(t)\,A_M(x_i, y_i, t), \quad A_{V_y}(x_i, y_i, t) = v_y(t)\,A_M(x_i, y_i, t), \quad A_{\omega}(x_i, y_i, t) = \omega(t)\,A_M(x_i, y_i, t), \qquad (3.11)$$
where we have relabeled A → A_M. These equations can now be viewed as a neural network with four layers, and the parameters a, b, c, and d can be viewed as weights (see Figure 2 for a schematic), each of which couples a spatial location at the input processing layer with another spatial location in the transformation layer. Notice, however, that these weights are modulated by a scalar velocity value (i.e., the network contains neurons that perform multiplicative operations on their inputs). Neurons with this ability have been postulated in models of visual cortex (Mel, 1993, 1994). It is also thought that neurons that show such gain modulation are important in performing coordinate transformations in parietal cortex (Salinas & Abbott, 1996; Andersen, Essick, & Siegel, 1985). Additionally, we are assuming that the signal used to drive the transformation is an egocentric velocity signal. Although the velocity signal used to drive mental navigation in our model is not vestibular in nature but rather mentally generated, this assumption seems reasonable because vestibular information is known to project to Brodmann's area 7, which borders portions of the IPS (Kawano, Sasaki, & Yamashita, 1980, 1984). If neural circuitry in this area is adapted to using such information, then it seems reasonable that a mentally generated equivalent could also readily be made use of. Note that although we represent velocity as three scalar values, along the x, y, and θ directions, the model could easily be extended to accommodate coarse coding of the velocity signal. Finally, we are assuming a "Cartesian" distribution of neurons over egocentric space.
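The discretized network of equations 3.9 to 3.11 can be sketched directly in code (our own minimal implementation on a small grid; the function names, the dense-matrix representation, and the row-major numbering following Figure 2 are assumptions):

```python
import numpy as np

def build_weights(nx, ny, h, dt):
    """Weight matrices of equation 3.10 for an nx-by-ny grid with
    spacing h. Neuron i sits at (xs[i], ys[i]); neurons are numbered
    row by row with x increasing fastest, as in Figure 2."""
    n = nx * ny
    xs = (np.arange(n) % nx).astype(float) * h
    ys = (np.arange(n) // nx).astype(float) * h
    xs -= xs.mean()
    ys -= ys.mean()                      # put the origin at the grid center
    a = np.eye(n)
    b = np.zeros((n, n))
    c = np.zeros((n, n))
    d = np.zeros((n, n))
    for i in range(n):
        col, row = i % nx, i // nx
        if 0 < col < nx - 1:             # centered difference along x
            b[i, i + 1] = dt / (2 * h)
            b[i, i - 1] = -dt / (2 * h)
        if 0 < row < ny - 1:             # centered difference along y
            c[i, i + nx] = dt / (2 * h)
            c[i, i - nx] = -dt / (2 * h)
        d[i] = xs[i] * c[i] - ys[i] * b[i]   # rotational term of eq. 3.10
    return xs, ys, a, b, c, d

def step(A, vx, vy, omega, weights):
    """One main-layer update (equations 3.9 and 3.11): the effective
    weight matrix is the a weights plus velocity-gated b, c, d weights."""
    xs, ys, a, b, c, d = weights
    return (a + vx * b + vy * c + omega * d) @ A

# With a bump of width 2 injected at the grid center, ten steps at
# vx = 1 move its centroid by about -vx * dt * 10 = -0.1 units.
w = build_weights(21, 21, 1.0, 0.01)
A = np.exp(-(w[0] ** 2 + w[1] ** 2) / (2 * 2.0 ** 2))
for _ in range(10):
    A = step(A, 1.0, 0.0, 0.0, w)
print((w[0] * A).sum() / A.sum())  # centroid x, near -0.1
```

Repeated calls to `step` advect the bump by approximately −v Δt per step, which is the behavior exercised in the simulations of section 4.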
It seems unlikely that cortex is organized in such a way; however, this organization is convenient for the present purposes, and we do not expect the behavior of our model to be qualitatively changed by the specific choice of neural distribution so long as each region of egocentric space is represented by a sufficiently high neural density. Quantitatively, for a different neural-spatial map, we would expect more transformation error to be introduced in areas where space is more sparsely represented. Satisfactory performance of our model can be attained using the weights defined by equation 3.10. However, further improvements may be made by applying a gradient descent algorithm to refine the connection strengths. To be specific, at each iteration of the algorithm, a gaussian bump of activity is injected into the main layer at a random location, random velocities are applied to the network, the activity is updated for one time step, and the resulting main layer activation is compared to an exact calculation of what it should be. This comparison is used to calculate the value of an error function given by

$$E(a_{ij}, b_{ij}, c_{ij}, d_{ij}) = \sum_i \big[\hat{A}_M(x_i, y_i) - A_M(x_i, y_i)\big]^2, \qquad (3.12)$$
where $\hat{A}_M$ is the target activation, determined as described below. The gradient of this error function with respect to the weights can be calculated analytically, and the weight vector can thus be updated to reduce the value of the error. To determine the target function, $\hat{A}_M$, we note that it too must be a gaussian with the same width as the input gaussian, but that it must be centered at a slightly shifted location. This location can be determined by dividing equations 3.6 by Δt and taking the limit as Δt → 0. The result is a set of two exact differential equations for the egocentric coordinates of the peak of the gaussian bump in time. The constant-velocity solutions of these equations are

$$x(t) = \left(x(0) + \frac{v_y}{\omega}\right)\cos(\omega t) + \left(y(0) - \frac{v_x}{\omega}\right)\sin(\omega t) - \frac{v_y}{\omega},$$
$$y(t) = \left(y(0) - \frac{v_x}{\omega}\right)\cos(\omega t) - \left(x(0) + \frac{v_y}{\omega}\right)\sin(\omega t) + \frac{v_x}{\omega}. \qquad (3.13)$$
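A sketch of this solution (names are our own; equation 3.13 assumes ω ≠ 0, with straight-line motion recovered in the ω → 0 limit), checked against small-step integration of equations 3.6:

```python
import math

def bump_center_analytic(x0, y0, vx, vy, omega, t):
    """Constant-velocity solution (equation 3.13) for the egocentric
    coordinates of the activity-bump peak. Assumes omega != 0."""
    c, s = math.cos(omega * t), math.sin(omega * t)
    return ((x0 + vy / omega) * c + (y0 - vx / omega) * s - vy / omega,
            (y0 - vx / omega) * c - (x0 + vy / omega) * s + vx / omega)

def bump_center_euler(x0, y0, vx, vy, omega, t, dt=1e-4):
    """The same trajectory by forward-Euler integration of equations 3.6."""
    x, y = x0, y0
    for _ in range(int(round(t / dt))):
        x, y = x - (vx - omega * y) * dt, y - (vy + omega * x) * dt
    return x, y

# A 90-degree left turn (omega = +1 rad/s for pi/2 s) carries an object
# 5 units to the observer's right around to directly behind.
print(bump_center_analytic(5.0, 0.0, 0.0, 0.0, 1.0, math.pi / 2))  # close to (0, -5)
```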
As one final point regarding the training method, it should be noted that the gradient descent method used here can be made equivalent to a Hebbian-like mechanism by the addition of an extra layer of "input" neurons that compare main layer activity to an external target signal (see Droulez and Berthoz, 1991).

4 Simulations

In the simulations reported here, a network with 961 main layer neurons arranged in a 31 × 31 lattice, in which each neuron represents a spatial area of one square unit, was trained using the method described in the previous section. Each of the three transformation layers also contained 31 × 31 neurons. The width of each gaussian activation bump injected into the main layer for training was chosen randomly from the interval [2, 4] units (widths below 2 units result in large error due to the discrete nature of the lattice; large widths are not useful in representing localized objects), with the peak height always set to unity. Angular velocities were chosen randomly from the interval [−1, 1] radians per second, while translational velocities were chosen from the interval [−3, 3] units per second. These ranges were chosen to be reasonable for a human subject in the case of 1 unit = 1 meter, but the exact values, which have little effect on training results, were chosen arbitrarily. A time step of 10 ms was used for training and for all simulations. With initial weights set to the values defined in equations 3.10, the simple gradient descent algorithm used here was found to converge after approximately 500,000 random infinitesimal transformations when the learning rate was set to 5 × 10−4. Although error, as defined by equation 3.12, was reduced by a factor of about 4 during this procedure, all weights remained within 4% of their initial values after training was complete.
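One iteration of this training procedure can be sketched as follows (our own condensed implementation; the target bump center is advanced with the first-order form of equations 3.6, and the gradient of equation 3.12 is taken analytically with respect to each weight matrix):

```python
import numpy as np

def train_step(weights, xs, ys, bump, vel, dt=0.01, lr=5e-4):
    """One gradient-descent iteration on the error of equation 3.12.
    weights : (a, b, c, d) dense matrices, updated in place
    xs, ys  : neuron coordinates; bump = (cx, cy, sigma); vel = (vx, vy, omega)
    Returns the squared error before the update."""
    a, b, c, d = weights
    cx, cy, sigma = bump
    vx, vy, omega = vel
    A = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # network prediction for one time step (equation 3.9)
    A_pred = (a + vx * b + vy * c + omega * d) @ A
    # exact target: the same bump with its center advanced by equations 3.6
    tx = cx - (vx - omega * cy) * dt
    ty = cy - (vy + omega * cx) * dt
    A_tgt = np.exp(-((xs - tx) ** 2 + (ys - ty) ** 2) / (2 * sigma ** 2))
    # analytic gradient of eq. 3.12: dE/dW_ij = -2 (tgt - pred)_i A_j,
    # with the b, c, d gradients gated by the corresponding velocity
    G = -2.0 * np.outer(A_tgt - A_pred, A)
    a -= lr * G
    b -= lr * vx * G
    c -= lr * vy * G
    d -= lr * omega * G
    return float(np.sum((A_tgt - A_pred) ** 2))

# Two consecutive steps on the same sample: since E is quadratic in the
# weights, a small step is guaranteed to shrink the error on that sample.
n = 15
idx = np.arange(n * n)
gx = (idx % n).astype(float) - 7.0
gy = (idx // n).astype(float) - 7.0
w = [np.eye(n * n), np.zeros((n * n, n * n)),
     np.zeros((n * n, n * n)), np.zeros((n * n, n * n))]
e1 = train_step(w, gx, gy, bump=(1.0, -2.0, 2.5), vel=(1.5, -0.5, 0.4))
e2 = train_step(w, gx, gy, bump=(1.0, -2.0, 2.5), vel=(1.5, -0.5, 0.4))
print(e2 < e1)  # → True
```

In training proper, a fresh random bump and random velocities would be drawn at each iteration, refining the centered-difference weights of equation 3.10.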
Figure 3: Trajectory of the person along the x-axis while maintaining the egocentric coordinates of a stationary object. Each of the four levels shows the position of the person relative to the object at a different stage of the navigation. The dashed line is the path the person travels.
As a demonstration of the functioning of the model, a simulation was carried out in which the egocentric location of a stationary object was maintained while the subject performed a sequence of imagined motions in space (see Figure 3). To begin, an object was placed 5 units away from the subject, directly to his or her right. The subject turned 90 degrees to the left, moved forward by 5 units, and then turned 90 degrees to the right. The predicted trajectory of the stationary object through egocentric space, as computed by the model, can be seen in Figure 4, where main layer activations, sampled every 20 time steps, have been superimposed. Initial and final activations are shown in Figures 5 and 6. From Figure 6, we see that the bump of activity representing the object's egocentric coordinates is somewhat attenuated and deformed by the transformation procedure. In fact, error is introduced in the simulation by two distinct mechanisms. First, the truncated leading-order terms in equations 3.6 contain factors of $v_x^2$, $v_y^2$, and $\omega^2$, so simulation error should increase quadratically in velocity for large velocities (as long as velocity × Δt < 1, after which point error will increase more rapidly with increasing velocity). Second, at low velocities, truncating the $O(\Delta t^2)$ terms will have little effect in terms of introducing error compared with the effect of approximating
Figure 4: Main layer activation representing the egocentric location of a stationary object, from the perspective of the moving observer depicted in Figure 3. Main layer neurons are sampled every 20 time steps and superimposed. The numbered arrows correspond to the numbered observer motions in Figure 3.
Figure 5: Initial main layer activation before navigation begins.
Figure 6: Final main layer activation when navigation is complete.
the first derivatives of activation using information from a discrete lattice of neurons. In this case, error should increase linearly with the number of time steps required for a given transformation or, equivalently, as the inverse of velocity. A number of simulations were performed in order to investigate these effects. The results of two of these simulations will now be discussed. For the first, an object was placed 5 units to the egocentric right of an observer who then turned 90 degrees to the left. For the second, an object was placed 5 units to the egocentric left of an observer who then moved to the egocentric right a distance equal to the arc length traveled by the object in the first simulation. In both cases, the object was represented by a gaussian activation bump with a width of 2 units. Squared error is calculated for the final activation using equation 3.12 and plotted against velocity in Figures 7 and 8. In both cases, the optimum velocity is around 2 units per second. The optimum velocity actually remains in the [1.5, 3.5] units per second interval for a wide variety of paths. Finally, the solid curves in Figures 7 and 8, which are of the form

$$f(x) = \frac{a_1}{x} + a_2 (x - a_3)^2 + a_4, \qquad (4.1)$$

were fit to the data in order to demonstrate that transformation error depends on velocity in the way described above. So far we have discussed the functioning of our model in isolation. In the brain, we expect that object location is maintained in DLPFC, as we have discussed in previous sections. One possible way in which this information
Figure 7: Transformation error produced by the network for an observer rotation of 90 degrees (squared error plotted against angular velocity in rad/s). Triangles are simulation data, and the solid curve is the fit to equation 4.1.
is maintained is as a collection of activity bumps over continuous attractor neural networks residing in the DLPFC, although we do not speculate on the details here. If the recurrent main layer weights (the "a" weights) were reduced in strength, then an activity bump in our model would require external driving to be maintained. This driving could come directly from the DLPFC activity. Via feedback connections, the movement of the activity bump in our circuit could cause the DLPFC bump representation of object location to move along with it. This scenario does not necessarily imply that the representation of the original egocentric object coordinates is lost. Pickering, Gathercole, Hall, and Lloyd (2001) report experiments and review evidence that they argue implies that visuo-spatial working memory is dissociated into static and dynamic components. In our case, the updating of object locations might occur in something analogous to the dynamic component, while a static snapshot might be maintained elsewhere. Simulations of such DLPFC-parietal interactions are currently being implemented in our lab.
Figure 8: Transformation error produced by the network for an observer translation of ≈ 7.8 units to their egocentric right (squared error plotted against velocity in units/s). Triangles are simulation data, and the solid curve is the fit to equation 4.1.
5 Discussion

In this article, we have postulated a biologically plausible computation, implemented as a neural circuit, that explains how the contents of spatial WM can be updated to keep track of the egocentric coordinates of remembered locations during mental navigation over timescales shorter than 4 minutes. Error can be introduced into the representation of object location coordinates during transformation by a number of different mechanisms. We have demonstrated that the discrete nature of the representation of space and the exclusion of higher-order velocity information in a rate-coded neural circuit of this kind lead to a nonzero error regardless of the velocity of the subject through space. This error appears to affect peak height predominantly, but the effect on final peak location is relatively small in the simulations performed here. There is, however, an optimum velocity range for performing transformations. Other potential sources of error are internal and external noise, which we have not studied here. A final source of error in the kind of object location transformation postulated here is the potential error introduced by the mental navigation procedure itself. This process must depend on an internal allocentric map of the subject's environment. Evidence indicates that this map contains inaccuracies and imprecisions (Hirtle & Jonides, 1985), perhaps more so in novel environments (Hartley, Burgess, Lever, Cacucci, & O'Keefe, 2000). Presumably, tasks requiring precise mental navigation would also suffer from these errors. Shelton and McNamara's (2001) finding that subjects more accurately remember object configurations when tested from viewpoints in which they have actually observed the object configuration is readily explained by our model. Subjects would simply store an egocentric view of the studied presentation in WM and then mentally navigate to the test viewpoint while updating object locations using the proposed parietal circuit. Note that this circuit can manipulate and transiently represent information from spatial WM; however, we do not claim that the original representation of object location, maintained in DLPFC, is necessarily altered by this procedure. Therefore, the original WM snapshot information is still available for future transformation and retrieval. This implies that subjects are still more accurate when tested at a presentation viewpoint because no transformation is required and no error is introduced. Our model does not, however, address the question of why information from a particular presentation viewpoint is maintained in preference to others. Our model also provides an explanation for the Wang and Spelke (2000) data. If we assume that subjects have an accurate WM snapshot of object locations before disorientation, then after disorientation, they could mentally navigate from the original snapshot viewpoint to their new estimated viewpoint while simultaneously using the egomotion signal that drives mental navigation to transform egocentric object location coordinates.
From the discussion above, we expect that the mental navigation and parietal update procedure would introduce error into an egocentric object location transformation. Since we have assumed serial updating, that is, that the mental navigation procedure must be performed separately for each object, the independent error inflicted on each object coordinate transformation could generate an overall object configuration error. This would also explain the object configuration error obtained from the experiment in which subjects were reoriented after disorientation by a bright light visible through their translucent blindfold. In this case, a mental navigation must still be performed from the viewpoint of the subjects while facing the light directly to the new viewpoint. Wang and Spelke's (2000) results indicate that blindfolded subjects who are continuously oriented are able to maintain object configurations accurately even after rotation. Performance in these conditions, which include slow rotation or rotation in the presence of an orienting light, suggests minimal transformation of internal object location representations, or transformation that induces a systematic error only. One possible explanation for how subjects could recall object locations under continuously oriented conditions is that they could generate a displacement-rotation vector via path integration. This vector could simply be added to the DLPFC-maintained object location representations at each recall. In this way, all object locations would be subject to the same systematic error. Another possibility is that the subject who remains oriented makes use of some allocentric system in which object locations are static. In this case, the subject must simply orient himself or herself with respect to the configuration. Given the availability of path integration in these conditions, subjects would be able to accurately localize themselves within the environment, and this procedure should result in only a small error, which would be the same for all objects. In addition to being consistent with a range of experimental findings regarding memory for object locations, our model makes two easily testable predictions. First, the serial updating we require for multiple objects suggests that a subject's reaction time to determine whether an object configuration, seen from a novel viewpoint, is the same as one seen from a different viewpoint should monotonically increase with the number of objects. Second, our hypothesis regarding how object configuration error is introduced into the Wang and Spelke (2000) task suggests that as subjects become more familiar with a given environment, and hence develop a more precise cognitive map, their performance on these types of tasks should improve. These predictions are currently being investigated empirically. Should our first prediction turn out to be false, it may be that our model is too simple. A more complicated model, in which object locations are updated in parallel and egocentric space is represented nonuniformly with respect to the main layer neuron density, might also be able to account for the object configuration error observed by Wang and Spelke.
Finally, another interesting question to investigate would be the nature of how objects become represented after being seen as stable for long periods of time, that is, after several minutes. For example, the work of Jeffery (1998) suggests that objects that remain stable over time begin to affect place cell firing in rats. Hence, objects might become part of the animal’s cognitive map after some time. One way to investigate this possibility would be to perform an experiment similar to that of Wang and Spelke (2000) in which the presentation phase was continued over a long period of time before testing began. It should be noted, however, that Cressant, Muller, and Poucet (1997) found that objects near the center of an environment failed to affect place cell firing in the same way. Therefore, such results would likely be found only for certain object placements.
Acknowledgments

This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada to S.B. We thank Neil Burgess for helpful discussions.
References

Andersen, R., Essick, G., & Siegel, R. (1985). The encoding of spatial location by posterior parietal neurons. Science, 230, 456–458.
Andersen, R., Shenoy, K., Snyder, L., Bradley, D., & Crowell, J. (1999). The contributions of vestibular signals to the representations of space in the posterior parietal cortex. Ann. N.Y. Acad. Sci., 871, 282–292.
Becker, S., & Burgess, N. (2001). A model of spatial recall, mental imagery and neglect. In T. Leen, T. Ditterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 96–102). Cambridge, MA: MIT Press.
Bohbot, V., Kalina, M., Stepankova, K., Spackova, N., Petrides, M., & Nadel, L. (1998). Spatial memory deficits in patients with lesions to the right hippocampus and to the right parahippocampal cortex. Neuropsychologia, 36(11), 1217–1238.
Burgess, N., Becker, S., King, J., & O'Keefe, J. (2001). Memory for events and their spatial context: Models and experiments. Philos. Trans. R. Soc. Lond. B Biol. Sci., 356(1413), 1493–1503.
Burgess, N., Donnett, J., & O'Keefe, J. (1997). The representation of space and the hippocampus in rats, robots and humans. Z. Naturforsch [C], 53(7–8), 504–509.
Burgess, N., Jeffery, K., & O'Keefe, J. (Eds.). (1999). The hippocampal and parietal foundations of spatial cognition. Oxford: Oxford University Press.
Chaffe, M., & Goldman-Rakic, P. (1998). Matching patterns of activity in primate prefrontal area 8a and parietal area 7ip neurons during a spatial working memory task. Journal of Neurophysiology, 79(6), 2919–2940.
Colby, C., & Goldberg, M. (1999). Space and attention in parietal cortex. Annual Rev. Neurosci., 22, 319–349.
Compte, A., Brunel, N., Goldman-Rakic, P., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923.
Crane, J., Milner, B., & Leonard, G. (1995). Spatial-array learning by patients with focal temporal-lobe excisions. Society for Neuroscience Abstracts, 21, 1446.
Cressant, A., Muller, R., & Poucet, B. (1997). Failure of centrally placed objects to control the firing fields of hippocampal place cells. J. Neurosci., 17(7), 2531–2542.
Droulez, J., & Berthoz, A. (1991). A neural network model of sensoritopic maps with predictive short-term memory properties. Proc. Natl. Acad. Sci. USA, 88, 9635–9657.
Ekstrom, A., Kahana, M., Caplan, J., Fields, T., Isham, E., Newman, E., & Fried, I. (2003). Cellular networks underlying human spatial navigation. Nature, 425, 184–187.
Funahashi, S., Bruce, C., & Goldman-Rakic, P. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. Journal of Neurophysiology, 61(2), 331–348.
Galati, G., Lobel, E., Vallar, G., Berthoz, A., Pizzamiglio, L., & LeBihan, D. (2000). The neural basis of egocentric and allocentric coding of space in humans: A functional magnetic resonance study. Exp. Brain Res., 133, 156–164.
Modeling Mental Navigation in Scenes with Multiple Objects
1871
Goodale, M., & Milner, A. (1992). Separate visual pathways for perception and action. Trends Neurosci., 15(1), 20–25. Hartley, T., Burgess, N., Lever, C., Cacucci, F., & O’Keefe, J. (2000). Modelling place fields in terms of the cortical inputs to the hippocampus. Hippocampus, 10(4), 369–379. Hirtle, S., & Jonides, J. (1985). Evidence of hierarchies in cognitive maps. Memory and Cognition, 13(3), 208–217. Ino, T., Inoue, Y., Kage, M., Hirose, S., Kimura, T., & Fukuyama, H. (2002). Mental navigation in humans is processed in the anterior bank of the parieto-occipital sulcus. Neuroscience Letters, 322, 182–186. Jeffery, K. (1998). Learning of landmark stability and instability by hippocampal place cells. Neuropharmacology, 37, 677–687. Kawano, K., Sasaki, M., & Yamashita, M. (1980). Vestibular input to visual tracking neurons in the posterior parietal association cortex of the monkey. Neurosci. Lett., 17, 55–60. Kawano, K., Sasaki, M., & Yamashita, M. (1984). Response properties of neurons in posterior parietal cortex of monkey during visual-vestibular stimulation. J. Neurophysiol., 51, 340–351. Kreiman, G., Koch, C., & Fried, I. (2000). Imagery neurons in the human brain. Nature, 408, 357–361. Levy, R., & Goldman-Rakic, P. (2000). Segregation of working memory functions within the dorsolateral prefrontal cortex. Exp. Brain Res., 133, 23–32. Mel, B. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70(3), 1086–1101. Mel, B. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085. Mellet, E., Bricogne, S., Tzourio-Mazoyer, N., Gha¨em, O., Petit, L., Zago, L., Etard, O., Berthoz, A., Mazoyer, B., & Denis, M. (2000). Neural correlates of topographic mental exploration: The impact of route versus survey perspective learning. NeuroImage, 12, 588–600. O’Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34(1), 171–175. 
O’Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford: Oxford University Press. Oliveri, M., Turriziani, P., Carlesimo, G., Koch, G., Tomaiuolo, F., Panella, M., & Caltagirone, C. (2001). Parieto-frontal interactions in visual-object and visualspatial working memory: Evidence from transcranial magnetic stimulation. Cerebral Cortex, 11(7), 606–618. Ono, T., Nakamura, K., Nishijo, H., & Eifuku, S. (1993). Monkey hippocampal neurons related to spatial and nonspatial functions. Journal of Neurophysiology, 70(4), 1516–1529. Pickering, S., Gathercole, S., Hall, M., & Lloyd, S. (2001). Development of memory for pattern and path: Further evidence for the fractionation of visuospatial memory. Quarterly Journal of Experimental Psychology, 54(2), 397–420. Sabes, P., Breznen, B., & Andersen, R. (2002). Parietal representation of objectbased saccades. J. Neurophysiol., 88(4), 1815–1829.
1872
P. Byrne and S. Becker
Sala, J., R¨am¨a, P., & Courtney, S. (2003). Functional topography of a distributed neural system for spatial and nonspatial information maintenance in working memory. Neuropsychologica, 41, 341–356. Salinas, E., & Abbott, L. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. USA, 93, 11956–11961. Samsonovich, A., & McNaughton, B. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. J. Neurosci., 17(15), 5900–5920. Shelton, A., & McNamara, T. (2001). Systems of spatial reference in human memory. Cognitive Psychology, 43(4), 274–310. Smith, M., & Milner, B. (1989). Right hippocampal impairment in the recall of spatial location: Encoding deficit or rapid forgetting? Neuropsychologica, 27(1), 71–81. Stippich, C., Ochmann, H., & Sartor, K. (2002). Somatotopic mapping of the human primary sensorimotor cortex during motor imagery and motor execution by functional magnetic resonance imaging. Neuroscience Letters, 331, 50–54. Thelen, E., Schoner, ¨ G., Scheier, C., & Smith, L. (2001). The dynamics of embodiment: A dynamic field theory of infant perseverative reaching errors. Behavioral and Brain Sciences, 24, 1–86. Voicu, H. (2003). Hierarchical cognitive maps. Neural Networks, 16, 560–576. Wang, R., & Spelke, E. (2000). Updating egocentric representations in human navigation. Cognition, 77, 215–250. Wang, R., & Spelke, E. (2002). Human spatial representation: Insights from animals. Trends Cog. Neurosci., 6(9), 376–382. Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16(6), 2112–2126. Received September 16, 2003; accepted February 6, 2004.
LETTER
Communicated by Reza Shadmehr
Failure of Motor Learning for Large Initial Errors Terence D. Sanger [email protected] Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305-5235, U.S.A.
For certain complex motor tasks, humans may experience the frustration of a lack of improvement despite repeated practice. We investigate a computational basis for failure of motor learning when there is no prior information about the system to be controlled and when it is not practical to perform a thorough random exploration of the set of possible commands. In this case, if the desired movement has never yet been performed, then it may not be possible to learn the correct motor commands since there will be no appropriate training examples. We derive the mathematical basis for this phenomenon when the controller can be modeled as a linear combination of nonlinear basis functions trained using a gradient descent learning rule on the observed commands and their results. We show that there are two failure modes for which continued training examples will never lead to improvement in performance. We suggest that this may provide a model for the lack of improvement in human skills that can occur despite repeated practice of a complex task.

1 Introduction

Although humans are extraordinarily adept at learning new and complex skills, it is not unusual for a person learning a new skill to find that he or she is making no improvement despite repeated practice. For example, in the absence of effective coaching, a beginner learning to play golf may attempt many hundreds of swings without significantly improving performance. The same experience is common for many sports, musical skills, and other skilled tasks. A more pathological form of this phenomenon may occur for children and adults with movement disorders, for whom a lifetime of practice at movement does not lead to improvement in the quality of their motor skills.
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1873–1886 (2004)

Since there are many examples of adaptive algorithms that are able to improve skill through repeated practice, these phenomena lead to the question of whether there is a particular limitation that can explain the lack of success with repetitive practice for complex skills. Successful control of an unknown environment requires the ability to determine the motor commands that lead to particular desired results. If a motor plan is specified in terms of a sequence of desired states, then at
some point, it becomes necessary to perform the inverse dynamics transform that will estimate the appropriate control signals needed to achieve the desired states. Learning an inverse dynamics transform leads to an internal model of the behavior of the environment (Scheidt, Reinkensmeyer, Conditt, Rymer, & Mussa-Ivaldi, 2000; Dingwell, Mah, & Mussa-Ivaldi, 2002; Flanders, Hondzinski, Soechting, & Jackson, 2003). There are many examples of successful algorithms for learning inverse dynamics models from observations of the behavior of a plant (Kawato, Furukawa, & Suzuki, 1987; Atkeson, 1989; Kawato & Gomi, 1992; Jordan & Rumelhart, 1992). Experiments in humans and animals (Gandolfo, Li, Benda, Schioppa, & Bizzi, 2000) have shown strong support for the formation of internal models (Dingwell et al., 2002) and the ability to adapt, with practice, to unknown or unusual dynamic environments (Krakauer, Pine, Ghilardi, & Ghez, 2000; Thoroughman & Shadmehr, 1999; Wang, Dordevic, & Shadmehr, 2001). Further experiments have indicated that generalization to previously unseen conditions is possible (Goodbody & Wolpert, 1998; Flanders et al., 2003; Criscimagna-Hemminger, Donchin, Gazzaniga, & Shadmehr, 2003), and that there are limits on generalization that are consistent with models based on local basis-function approximations (Shadmehr & Moussavi, 2000; Thoroughman & Shadmehr, 2000; Donchin & Shadmehr, 2003). However, essentially all computational algorithms and biological models for solving the inverse dynamics problem assume either that an approximate inverse model is available prior to learning or that a sufficient number of examples of motor commands and their resulting effects are available for training. If an approximate inverse model is available, then it may be possible to extrapolate from observed examples to determine the necessary changes in the motor command for desired performance.
Even if an approximate model is not available, some algorithms can choose motor commands randomly in order to explore the space of reachable states. This has been proposed as an explanation of the apparently random movements of infants. But there are at least three problems with random exploration. First, for any real robotic or natural control system, random exploration can lead to injury. Second, the dimensionality of the control or state space may be high enough that complete exploration is not feasible in a reasonable amount of time. Third, there are clearly some states that are more useful than others, and it would be advantageous to explore only near regions of desired behavior. For example, there is no point in learning to play golf by randomly swinging the club about one’s head. Here, we investigate limitations on motor learning when no approximation to the plant is available and when random exploration is dangerous or otherwise not feasible. We model learning by choosing a particular desired effect s∗ that the system will transform into an estimated motor command m. This motor command will lead to an actual effect s that will in general be different from s∗ . The system then uses the trio (s∗ , m, s) to improve its future performance (Sanger, 1994). The goal of the system is to map s∗ into
the “correct” command m∗ that achieves s = s∗. If an approximation to the plant or its inverse were available, then the error s − s∗ could be converted to an approximate motor error m − m∗ so that modification of the motor command will improve performance. Since in this article we assume that no approximation to the plant is available, we instead must use the observed command-result pair (m, s) to learn the value of the inverse at s. We show that if the initial performance is such that s is sufficiently far from the desired s∗, then under certain circumstances, no amount of practice will ever be sufficient to learn the correct value of m∗.

2 Method

Let S be the set of all observable states of the environment, and let M be the set of all possible motor commands that can be applied. Let the environment to be controlled be represented by a memoryless plant P that maps motor commands m ∈ M onto (sensory) effects or states s ∈ S so that P : m → s. The animal or robot transforms desired effects s∗ into motor commands m through an approximate inverse dynamics mapping F for which F : s∗ → m. Thus, we can write the complete system as

$$ s^* \xrightarrow{F} m \xrightarrow{P} s. \tag{2.1} $$
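The pipeline of equation 2.1 can be sketched in a few lines of code. The scalar plant and inverse model below are made-up stand-ins (nothing here is from the article), chosen so the numbers are easy to check by hand:

```python
# Sketch of the system in equation 2.1: s* -> F -> m -> P -> s.
# The plant P doubles its input, so the correct inverse F would
# halve it; the F below is deliberately wrong.

def P(m):
    """Unknown plant: motor command -> sensory effect."""
    return 2.0 * m

def F(s_star):
    """Approximate inverse dynamics model."""
    return 0.4 * s_star

s_star = 1.0          # desired effect
m = F(s_star)         # command the inverse model produces
s = P(m)              # effect that actually occurs
error = s - s_star    # learning uses the trio (s*, m, s) to reduce this
```

Here s = 0.8 rather than the desired 1.0; the question the article addresses is when repeated updates of F from such trios can, and cannot, drive this error to zero.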
The goal of learning is to modify the approximate inverse dynamics F to produce the “correct” command m∗ for which

$$ s^* \xrightarrow{F} m^* \xrightarrow{P} s^*, \tag{2.2} $$
so that the exact desired effect s∗ occurs. If this is achieved, then PF is the identity on the desired set of states S∗ ⊂ S, and F will be a partial right inverse of P. In order to achieve this, the mapping F can be modified based on observed values (s∗, m, s). In particular, since s = P(m), modification of F such that m = F(s) will lead to a partial inverse for which s = PF(s). Under certain conditions, it can be shown that iteration of this procedure will lead to convergence on the desired states so that s∗ = PF(s∗) for all s∗ ∈ S∗ (Sanger, 1994). However, this is not guaranteed under all conditions. In particular, learning can occur only for those states s that are actually observed. This set is given by {PF(s∗) : s∗ ∈ S∗}, which we can write as the set Ŝ = PF(S∗). Successful learning may satisfy s = PF(s) for all s ∈ PF(S∗). But this is not the desired behavior. The desired behavior is that s∗ = PF(s∗) for all s∗ ∈ S∗. Thus, the system may successfully learn the inverse dynamics but on an inappropriate or undesired set of states. Figure 1 illustrates the relationship between the various subsets Ŝ, S∗, and M. If this form of learning converges to a steady state, then for all s∗ ∈ S∗ we have PF(s∗) ∈ Ŝ, and for all s ∈ Ŝ we have PF(s) = s ∈ Ŝ, so
Figure 1: Illustration of the relationship between variables in the model. s∗ is the desired state of the plant, and S∗ is the set of all desired states. m = F(s∗ ) is the motor command produced by the animal or robot’s internal inverse dynamics model. s = P(m) is the resulting state that occurs when the command m is applied to the unknown plant P. If the inverse dynamics model F trains successfully, then it will set m = F(s) in order to map the observed state s into the motor command m that actually generated it.
the mapping PF is a projection onto the set Ŝ. Points in S∗ that are not in Ŝ will never be mapped to the correct motor commands. The mapping F can be approximated to arbitrary accuracy as a weighted sum of a large but finite set of basis functions, so without loss of generality, we assume that F takes the form

$$ F(s^*) = \sum_{i=1}^{n} w_i \phi_i(s^*), \tag{2.3} $$
where φi() is a set of fixed scalar basis functions, such as radial basis functions, Fourier components, or other reasonable basis for function space. (For a similar development, refer to Thoroughman & Shadmehr, 2000; Donchin & Shadmehr, 2003.) wi is a set of vectors that specify the “weights” for combining the basis functions, and this computational structure has a straightforward neural network implementation (Sanger, 2003). We can rewrite equation 2.3 in vector notation as

$$ F(s^*) = W\phi(s^*), \tag{2.4} $$
where W is a matrix of weights, and φ() is a column vector representing the basis function outputs. Modification of F due to learning is accomplished
Failure of Motor Learning for Large Initial Errors
1877
by changing the weight matrix W. Although we have chosen a specific functional form for F, the following discussion will apply to any adaptive controller that can be approximated in this way. We can thus perform the analysis for this specific functional form of the controller F. Given a desired state s∗, we have s = P(m) = P(Wφ(s∗)), and learning will modify the weights W based on the observed values s and m. If we assume that F is modified using an iterative learning rule, then after a single training pair (s, m), we have

$$ F \to F + \Delta F = (W + \Delta W)\phi, \tag{2.5} $$
and the weights W are modified according to

$$ \Delta W = r(W, m, s, s^*), \tag{2.6} $$
where r() describes the learning rule, and there is a possible additional normalization step to maintain bounded weights. If we define the error vector as δs = s − s∗ and an energy function as E = ‖δs‖², then gradient descent on E with respect to the weights W is given by

$$ \Delta W = -\gamma \frac{\partial E}{\partial W} = -\gamma \left(\frac{\partial P(m)}{\partial m}\right)^T \delta s \,\phi(s^*)^T, \tag{2.7} $$

where γ is the learning rate. Since we usually do not know ∂P(m)/∂m (otherwise, we would have a complete model of the plant and would not need to learn it), algorithms for motor learning often approximate this based on reasonable assumptions about the form of the plant. For example, approximation using a constant linear function K ≈ ∂P(m)/∂m is used in feedback-error learning (Kawato & Gomi, 1992), while others have used backpropagation through a learned approximate forward model (Jordan & Rumelhart, 1992). However, if the plant is nonlinear, then a constant approximation may be valid only in a small region, and a learned forward model will be valid only near observed data points. If the approximation is sufficiently incorrect, then the sign of elements of (∂P(m)/∂m)ᵀδs may be incorrect, and modification by ΔW could actually increase the error E. In the following, we assume that the plant is nonlinear and unknown, so that no approximation to ∂P(m)/∂m is available. In this case, it is reasonable to take samples of observed plant inputs and outputs (m, s) and train on these, since no other information is available. In particular, there is no information about the desired inputs and outputs (m∗, s∗) or about the value or sign of the motor error m − m∗. In this case, the gradient-descent learning rule that converges to the solution m = F(s) is given by the least-mean-squared (LMS) algorithm applied to the observed data,

$$ r(W, m, s) = \Delta W = (m - W\phi(s))\,\phi(s)^T, \tag{2.8} $$
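As a concrete (hypothetical) instance of rule 2.8, the sketch below applies one LMS update with scalar states and gaussian basis functions; the centers, width, and learning rate are illustrative choices, not the article's. It also exhibits the overlap term φ(s)ᵀφ(s∗) that becomes important below: with narrow basis functions, training at s = 0.8 leaves the output at s∗ = 0.2 essentially untouched.

```python
import math

def phi(s, centers, width=0.05):
    """Gaussian basis function outputs for a scalar state s."""
    return [math.exp(-(s - c) ** 2 / (2 * width ** 2)) for c in centers]

def lms_update(W, m, s, centers, rate=0.05):
    """One application of rule 2.8 on an observed pair (m, s):
    delta = m - W phi(s), then W += rate * delta * phi(s)^T."""
    f = phi(s, centers)
    delta = m - sum(w * fi for w, fi in zip(W, f))
    return [w + rate * delta * fi for w, fi in zip(W, f)]

# Train once at s = 0.8 and look at the output at s* = 0.2: for these
# narrow basis functions phi(0.8) and phi(0.2) are (numerically)
# orthogonal, so F(s*) barely changes.
centers = [i / 9 for i in range(10)]
W = [0.5] * 10
W2 = lms_update(W, m=0.8, s=0.8, centers=centers)
F_before = sum(w * f for w, f in zip(W, phi(0.2, centers)))
F_after = sum(w * f for w, f in zip(W2, phi(0.2, centers)))
```

The update changes the network's output near the observed s while F(0.2) moves by an amount on the order of the overlap, here below 1e-15.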
so that the functional form of the update in this case depends only on the product of the basis function outputs φ(s) and the output error vector δ = m − Wφ(s) of the network. After a single learning iteration, the output of the inverse dynamics network is modified so that

$$ F(s^*) \to F(s^*) + \Delta F(s^*) = (W + \Delta W)\phi(s^*). \tag{2.9} $$
The network has converged when there are no further changes in the network in response to the desired inputs. There will be no change in the output F(s∗) if ΔWφ(s∗) = 0. Substituting the gradient descent learning rule, we obtain

$$ \Delta F(s^*) = \Delta W\phi(s^*) = \delta\,\phi(s)^T\phi(s^*) = 0. \tag{2.10} $$
Normally, we say that when ΔF(s∗) = 0, the network has converged, and for many applications of a gradient descent algorithm, this condition guarantees successful solution of the problem. However, in the case of motor learning in our formulation, there are two other possibilities. Since φ(s)ᵀφ(s∗) is a scalar, equation 2.10 will be zero if and only if either

1. δ = 0, or
2. φ(s)ᵀφ(s∗) = 0.

The first case may represent failure of learning, since δ = 0 does not imply that PF(s∗) = s∗. In particular, δ = 0 implies only that F(s) = m or, equivalently, PF(s) = s. Thus, learning stops, but the error at s∗ is nonzero. If we substitute s = PF(s∗), we see that PF(PF(s∗)) = PF(s∗), which means that this situation will occur whenever PF is a projection operator near s∗. The second case represents a failure of learning due to the fact that if the basis function representations φ(s) and φ(s∗) are orthogonal, then training at s will not change the value of F(s∗) despite continuing changes in the network output for other input values. Therefore, whether or not the learning eventually converges (to δ = 0), the value of F(s∗) will not have been modified, and errors will remain. These two failure modes correspond to two different mechanisms that impede learning at s∗. In the first case, the network has already learned the mapping at s (the error δ is zero), and therefore it is unaware of the error at s∗ and makes no further corrections. In the second case, despite continuing errors and modification of the network at s, there is no change in the output at s∗. In both cases, continued attempts at achieving the desired s∗ will fail.

3 Simulations

The first type of failure is produced whenever the combination of the plant and the network forms a projection operator. This can occur if the input
m results in a plant output s = P(m) such that one or more elements φi(s) are always zero, and no learning will occur since the LMS algorithm (equation 2.8) applied to a zero input causes no change in the weights connected to that input. Alternatively, if the network ignores an input such that ∂φ/∂si = 0 for some element si of s, then it will never be possible to control the behavior of si. The network may converge, but it will not achieve the desired values.

In order to demonstrate the first type of failure, we simulate a linear network F that attempts to invert a linear plant P with four inputs and four outputs. The plant is represented by a 4 × 4 random matrix with entries chosen from a uniform distribution between 0 and 1. The network is represented by a 4 × 4 matrix F with initially random entries, but with one of the initial eigenvalues set to zero so that F is degenerate with rank 3. Therefore, PF is a projection operator from R⁴ to a three-dimensional hypersurface. Training occurs according to equation 2.8, with a learning rate of 0.01, and 1000 random examples drawn from a uniform distribution on the positive unit cube in R⁴. The eigenvalues are not constrained during training. The time series of mean-squared command error ‖m − m∗‖ and performance error ‖s − s∗‖ is calculated and smoothed with a 20-point moving average. The process is repeated for 100 different randomly chosen plants and initial weights, and the mean and standard deviations of the command and performance errors are calculated for each point in time. The results are displayed in Figure 2. Note that the command error converges to zero as expected with the LMS algorithm, but that the performance error remains nonzero. The network learns the mapping s = PF(s) almost perfectly on a three-dimensional subspace, but the error for randomly chosen examples in the full four-dimensional space remains significant.
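The projection failure can also be seen in a hand-checkable miniature (a deterministic 2 × 2 toy, not a reproduction of the random 4 × 4 simulation): the plant is the identity, F starts with rank 1, and the LMS error δ is already zero at every observed state, so the weights never move while the performance error persists.

```python
# Miniature of the first failure mode.  P is the identity plant and F
# projects onto the first axis, so PF is a projection: every observed
# state is already mapped correctly by F, delta = 0, and rule 2.8
# leaves the weights exactly unchanged despite 100 practice attempts.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

P = [[1.0, 0.0], [0.0, 1.0]]      # plant: identity
F = [[1.0, 0.0], [0.0, 0.0]]      # inverse model: rank 1 (a projection)

rate = 0.1
s_star = [0.3, 0.8]               # desired state
for _ in range(100):
    m = matvec(F, s_star)         # command produced: (0.3, 0)
    s = matvec(P, m)              # observed state:   (0.3, 0)
    delta = [mi - pi for mi, pi in zip(m, matvec(F, s))]
    for i in range(2):            # LMS rule 2.8: dW = rate * delta s^T
        for j in range(2):
            F[i][j] += rate * delta[i] * s[j]
```

After any number of iterations, F is unchanged, the second component of s is still 0, and the performance error ‖s − s∗‖ = 0.8 remains.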
Because the network training samples s = PF(s∗ ) always lie in the subspace, the matrix F remains low rank despite the lack of any constraint forcing it to do so. Eventually, the error on the subspace becomes sufficiently small that learning effectively stops, yet significant errors in performance remain. In order to show that the second type of failure can occur even for extremely simple problems, we simulate a network that attempts to compute the identity mapping from a single-dimensional s to a single-dimensional command m. For this simulation, a nonlinear network F has 10 narrow gaussian basis functions φi with standard deviation 0.05 and centers evenly spaced between 0.0 and 1.0. Here, we choose the plant to be the identity, so P(m) = s = m, and the correct solution is F(s∗ ) = s∗ . Weights wi are initialized to random values uniformly distributed between 0.45 and 0.75. The desired response is a value of s∗ = 0.2, but during training, some local exploration is allowed so that training examples use s∗ uniformly distributed between 0.1 and 0.3. For each example s∗ , the values of m = Wφ(s∗ ) and s = P(m) = m are calculated. Gradient descent training of W on pairs (s, m) occurs for 1000 examples with a learning rate of 0.05 according to equation 2.8. Figure 3A shows the resulting network after training. The dashed line is the desired function F(s∗ ) = s∗ . The solid line is the learned value of
Figure 2: Simulation of the first type of failure of motor learning when PF is a projection operator. The top graph shows the command (motor) error ‖m − m∗‖, and the bottom graph shows the performance (sensory) error ‖s − s∗‖ (mean-squared error in arbitrary units; time in units of iterations of the learning algorithm). Solid lines show the mean error averaged over 100 trials for 100 different random linear plants P, and the gray region shows ± one standard deviation.
F(s) = Σi wi φi(s), and the thin lines at the bottom show a plot of each basis function multiplied by its corresponding weight, wi φi(s). Note that the result approximates the correct result only in the neighborhood of s, but not near s∗. Figure 3B shows the evolution of the weights during training. Note that only 3 of the 10 weights are modified by the training procedure. This network demonstrates the particular case for which φ(s)ᵀφ(s∗) = 0. Because the basis functions are local, the representation φ(s∗) excites only the first few basis functions, while the representation φ(s) excites only the last few. This situation occurred because the initial values of the weights for basis functions near s∗ were between 0.45 and 0.75, and thus far from the correct value of approximately 0.2. During learning, the only observed values of s used for training were between 0.7 and 0.9 (since s = PF(s∗) = Σi wi φi(s∗)), and so the network learned the correct solution only in this neighborhood.
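The second simulation can be sketched as follows. The parameter values (10 gaussians, width 0.05, weights in [0.45, 0.75], s∗ in [0.1, 0.3], rate 0.05, 1000 examples) follow the text; the random seed and the exact sampling details are guesses.

```python
import math, random

random.seed(1)  # arbitrary seed, not from the article

centers = [i / 9 for i in range(10)]   # centers evenly spaced on [0, 1]
width = 0.05                           # narrow gaussians
W = [0.45 + 0.3 * random.random() for _ in range(10)]
W0 = list(W)                           # remember the initial weights

def phi(s):
    return [math.exp(-(s - c) ** 2 / (2 * width ** 2)) for c in centers]

def net(W, s):
    return sum(w * f for w, f in zip(W, phi(s)))

rate = 0.05
for _ in range(1000):
    s_star = 0.1 + 0.2 * random.random()   # local exploration near 0.2
    m = net(W, s_star)                     # command from the network
    s = m                                  # plant is the identity: s = P(m) = m
    delta = m - net(W, s)                  # output error at the observed s
    f = phi(s)
    W = [w + rate * delta * fi for w, fi in zip(W, f)]
```

Because the initial weights put the observed states far from s∗ = 0.2, the basis functions near s∗ are never excited during training: the corresponding weights stay at their initial values and the error at the desired state never shrinks.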
Figure 3: Simulation of the second type of failure of motor learning when φ(s∗)ᵀφ(s) = 0. See the explanation in the text. (A) The solid line is the learned network Σi wi φi(s), the dashed line is the desired function, and the basis functions wi φi(s) are shown as well. The network approximates the desired function only in the neighborhood of s. (B) The evolution of the weights wi during 1000 steps of training.
Although these are extremely simple examples, they are important because they demonstrate that the phenomenon of failure of motor learning can occur even for networks and plants of low complexity. Because of the larger number of variables, a more realistic plant with a high-dimensional control or state space will be even more likely to suffer this problem if the set of desired values of the states does not overlap with the set of observed values or if some modes of the system are not observed at all.

4 Discussion

The analysis shows that an animal or autonomous robot learning to control an unknown environment by observing the consequences of its own actions may fail to achieve its goal despite the use of a network learning algorithm with normally guaranteed convergence. There are two different
ways in which learning may fail, and in both cases, failure occurs because the network is not trained on the desired states of the system. The first failure mode occurs when the network converges on an undesired set and therefore stops learning. The second failure mode occurs when training on the observed states of the system does not cause changes in performance on the desired states. In both cases, the weights cease to change despite lack of convergence to the globally optimal solution. The learning algorithm therefore behaves as if it has achieved a “local minimum,” but it should be noted that the reason for the failure of convergence may be due less to the nonlinear structure of the problem to be solved and more to the lack of availability of appropriate training examples. These two types of failure occur only if no approximate plant inverse is available and random exploration of the complete control space is not permitted. Although we have performed the analysis for the particular case of a basis function network trained using gradient descent, many other network structures and training algorithms (but not all) will be susceptible to this type of failure of learning. We note that if an approximate plant inverse is available or can be learned, then algorithms that minimize the error s − s∗ may be able to converge even for large initial errors. If the weight matrix W maintains full rank and the plant P is controllable, then the combination PF may never become a projection operator, and thus the first type of failure may not occur. If the basis functions φi are broadly tuned, then orthogonality may be unlikely or impossible, so that the second type of failure may not occur. Therefore, although we show that two types of failure are possible, this will not necessarily occur for all plants or for learning with prior knowledge about the plant. This type of failure can also be avoided if it is feasible to explore the set of possible controls thoroughly. 
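The claim about broad tuning can be checked numerically. In this sketch (made-up centers and widths), the overlap φ(s)ᵀφ(s∗) that couples training at s to changes at s∗ is negligible for narrow gaussians but substantial for broad ones:

```python
import math

def phi(s, centers, width):
    """Gaussian basis function outputs for a scalar state s."""
    return [math.exp(-(s - c) ** 2 / (2 * width ** 2)) for c in centers]

def overlap(s, s_star, centers, width):
    """phi(s)^T phi(s*): the factor in rule 2.8 that couples
    training at s to changes in the output at s*."""
    a = phi(s, centers, width)
    b = phi(s_star, centers, width)
    return sum(x * y for x, y in zip(a, b))

centers = [i / 9 for i in range(10)]
narrow = overlap(0.8, 0.2, centers, width=0.05)  # effectively zero
broad = overlap(0.8, 0.2, centers, width=0.3)    # order one
```

With width 0.05 the overlap is below machine-noise levels, so the second failure mode applies; with width 0.3 it is of order one, and training at s = 0.8 would move the output at s∗ = 0.2.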
If sufficient time is available to choose large numbers of examples of m randomly (and if doing so does not lead to injury), then an accurate plant inverse can be estimated throughout almost all of space, including the desired set S∗ . The number of examples required will depend on the nature of the plant, the dimensionality of the control space, and the nature of the network approximating the inverse. If an approximately correct motor command can be produced, then full exploration may not be necessary. In particular, local exploration or small, random changes in the neighborhood of an approximately correct motor command may allow estimation of the plant inverse in a region that includes the desired performance. In some cases, a forward model of the plant may be available, so that a complete exploration of state space can occur off-line. An approximate forward model can be learned by observation of the plant and used to estimate an inverse model (Jordan & Rumelhart, 1992). However, in the absence of prior information about the plant, a learned forward model will be accurate only in the neighborhood of observed pairs (m, s), and thus an estimated inverse will also be accurate only in such regions and will be
susceptible to the second type of motor learning failure. If an approximate example near the desired (m∗ , s∗ ) has never been observed, then the forward and inverse models may not provide useful information about the region of desired performance. Our development is very similar to analysis of generalization experiments (Thoroughman & Shadmehr, 2000; Shadmehr & Moussavi, 2000; Donchin & Shadmehr, 2003). When studying generalization, a particular movement s is trained until it can be performed accurately. Then it is asked for which s∗ there has been a change in performance. The set of s∗ with changed performance indicates the extent of the representation φ(s) and can be used to investigate the form of the basis functions φi . In contrast, when studying failure of learning, the goal is the movement s∗ , and accurate performance on s is a by-product but not the purpose of training. If training at s does not generalize to s∗ , then learning may fail. Real-world examples of the first type of learning failure occur when a particular movement is not in the repertoire of known actions, or when the results of a particular movement cannot be detected by the controller (in both cases, the system behaves as a projection operator; it ignores or fails to produce certain elements of the desired performance). For example, a nonnative English speaker attempting to produce the th sound may never successfully produce this sound despite years of practice. All attempts at producing the foreign consonant are projected onto the set of consonants that are within the existing motor repertoire. Although the speaker is aware of the error and can produce an incorrect sound repeatedly on command, the sound never approximates the desired one. Another example of a projection operator is that of a nonnative student learning to speak Chinese but who is not yet able to distinguish between two spoken vowels. 
Since the two vowels produce the same sensory response s, the speaker is unable to correct errors and so cannot learn to differentiate them.

Real-world examples of the second type of learning failure occur when all components of a particular movement are achievable but do not occur properly in context, so that the desired goal has never been observed. In this case, the system is not a projection operator, but there are no relevant training examples. For example, a novice high diver who attempts to learn a complex dive may never be successful without coaching, despite having all the necessary muscular ability to achieve the required forces. Repeated practice will produce a result that is so far from the desired result that no successful examples are available for learning. This is also a situation in which random exploration is not a good idea.

How, then, is learning possible at all? In many cases, an approximation to the plant inverse may be available, particularly if the plant obeys uncomplicated Newtonian mechanics. Even when no inverse is available, if the network is able to achieve approximate performance, then s and s∗ may be sufficiently close that there is some overlap in the set of excited basis functions. If φ(s)ᵀφ(s∗) ≠ 0, then training at s will lead to changes in F(s∗). As
1884
T. Sanger
we have shown previously, under reasonable smoothness assumptions on F and P, convergence can then be guaranteed (Sanger, 1994). The broader the tuning of the basis functions, the further apart s and s∗ can be and still allow successful learning. Failure will be more likely when basis functions φi have low overlap for different values of s, as would occur for very narrow locally tuned functions. But this can also occur for broadly tuned functions if the patterns of activity φ(s) and φ(s∗) are orthogonal. For narrowly tuned basis functions, the inner product φ(s)ᵀφ(s∗) is more likely to be zero, particularly when there is a large number of basis functions (e.g., if s is in a high-dimensional space). We therefore expect to see this phenomenon more when complex trajectories need to be learned. This is exactly the situation in which generalization may be most difficult. Thus, there is a relationship between the ability to generalize (from F(s) to F(s∗)) and the ability to learn reliably through iterative practice.

It is interesting to speculate on the role of coaching in this context. A coach could redirect attention to a particular sensory representation (perhaps a kinesthetic rather than a visual representation) for which the desired and actual movements do not lead to orthogonal basis function outputs. A coach could help limit attention to only a few important sensory variables, thereby reducing the dimensionality and reducing the likelihood of orthogonality of the basis functions. A coach could draw attention to errors that might have been ignored, and thereby ensure that learning does not converge prematurely. A coach could guide exploration from areas in which the inverse is known to regions in which it is not known, while maintaining reasonable margins of safety. 
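The overlap argument above can be illustrated numerically. The following sketch assumes Gaussian tuning curves; the 50 basis functions, widths, and states are illustrative choices, not from the letter. It shows that narrowly tuned basis functions give a near-zero inner product φ(s)ᵀφ(s∗) between a practiced state s and a goal state s∗, while broad tuning leaves substantial overlap:

```python
import numpy as np

def phi(s, centers, width):
    """Population of Gaussian basis functions evaluated at state s."""
    return np.exp(-((s - centers) ** 2) / (2.0 * width ** 2))

centers = np.linspace(0.0, 10.0, 50)   # preferred states of 50 basis functions
s_trained, s_goal = 4.0, 6.0           # practiced state s and goal state s*

overlaps = {}
for width in (0.1, 1.0):               # narrow vs. broad tuning
    overlaps[width] = float(phi(s_trained, centers, width)
                            @ phi(s_goal, centers, width))
    print(f"width={width}: overlap phi(s)^T phi(s*) = {overlaps[width]:.4f}")
# With narrow tuning the overlap is essentially zero, so practice at s
# cannot change F(s*); with broad tuning the overlap is substantial.
```

With width 0.1 the two activity patterns are effectively orthogonal, reproducing the failure mode described above; widening the tuning restores a nonzero inner product and hence the possibility of generalization.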
We hope that this formulation of limitations in motor learning from experience may help to further the understanding of why certain types of skill learning can fail, both in advanced skill acquisition and for children and adults with motor disorders. Certainly, there are tasks that are so complex that almost any conceivable learning algorithm will fail to converge to the desired solution. In this letter, we have shown that under certain circumstances, learning even a simple task can fail to converge. Understanding the failure of simple tasks may help to explain why some subjects with motor impairments are unable to perform tasks that others find easy. In the future, this understanding may lead to a mathematical description of the role of coaching that could be helpful for improving motor performance for children and adults with and without motor impairments.
Acknowledgments

I was supported in part by grant K23-NS41243-02 from the NINDS and in part by a faculty scholars award from Pfizer Pharmaceuticals. I also wish to thank the reviewers for their insightful comments and suggestions.
References

Atkeson, C. G. (1989). Learning arm kinematics and dynamics. Ann. Rev. Neurosci., 12, 157–183.
Criscimagna-Hemminger, S. E., Donchin, O., Gazzaniga, M. S., & Shadmehr, R. (2003). Learned dynamics of reaching movements generalize from dominant to nondominant arm. J. Neurophysiol., 89(1), 168–176.
Dingwell, J. B., Mah, C. D., & Mussa-Ivaldi, F. A. (2002). Manipulating objects with internal degrees of freedom: Evidence for model-based control. J. Neurophysiol., 88(1), 222–235.
Donchin, O., & Shadmehr, R. (2003). Linking motor learning to function approximation: Learning in an unlearnable force field. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Flanders, M., Hondzinski, J. M., Soechting, J. F., & Jackson, J. C. (2003). Using arm configuration to learn the effects of gyroscopes and other devices. J. Neurophysiol., 89(1), 450–459.
Gandolfo, F., Li, C., Benda, B. J., Schioppa, C. P., & Bizzi, E. (2000). Cortical correlates of learning in monkeys adapting to a new dynamical environment. Proc. Natl. Acad. Sci. U.S.A., 97(5), 2259–2263.
Goodbody, S. J., & Wolpert, D. M. (1998). Temporal and amplitude generalization in motor learning. J. Neurophysiol., 79(4), 1825–1838.
Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354.
Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural-network model for control and learning of voluntary movement. Biol. Cybern., 57(3), 169–185.
Kawato, M., & Gomi, H. (1992). A computational model of four regions of the cerebellum based on feedback-error learning. Biol. Cybern., 68(2), 95–103.
Krakauer, J. W., Pine, Z. M., Ghilardi, M. F., & Ghez, C. (2000). Learning of visuomotor transformations for vectorial planning of reaching trajectories. J. Neurosci., 20(23), 8916–8924.
Sanger, T. D. (1994). Neural network learning control of robot manipulators using gradually increasing task difficulty. IEEE Trans. Robotics and Automation, 10(3), 323–333.
Sanger, T. D. (2003). Neural population codes. Curr. Opin. Neurobiol., 13(2), 238–249.
Scheidt, R. A., Reinkensmeyer, D. J., Conditt, M. A., Rymer, W. Z., & Mussa-Ivaldi, F. A. (2000). Persistence of motor adaptation during constrained, multi-joint, arm movements. J. Neurophysiol., 84(2), 853–862.
Shadmehr, R., & Moussavi, Z. M. (2000). Spatial generalization from learning dynamics of reaching movements. J. Neurosci., 20(20), 7807–7815.
Thoroughman, K. A., & Shadmehr, R. (1999). Electromyographic correlates of learning an internal model of reaching movements. J. Neurosci., 19(19), 8573–8588.
Thoroughman, K. A., & Shadmehr, R. (2000). Learning of action through adaptive combination of motor primitives. Nature, 407(6805), 742–747.
Wang, T., Dordevic, G. S., & Shadmehr, R. (2001). Learning the dynamics of reaching movements results in the modification of arm impedance and long-latency perturbation responses. Biol. Cybern., 85(6), 437–448.

Received June 9, 2003; accepted February 24, 2004.
LETTER
Communicated by Stephen José Hanson
Fair Attribution of Functional Contribution in Artificial and Biological Networks

Alon Keinan
[email protected]

Ben Sandbank
[email protected]
School of Computer Sciences, Tel-Aviv University, Tel-Aviv, Israel

Claus C. Hilgetag
[email protected]
School of Engineering and Science, International University Bremen, Bremen, Germany

Isaac Meilijson
[email protected]
School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel

Eytan Ruppin
[email protected]
School of Computer Sciences and School of Medicine, Tel-Aviv University, Tel-Aviv, Israel
This letter presents the multi-perturbation Shapley value analysis (MSA), an axiomatic, scalable, and rigorous method for deducing causal function localization from multiple perturbations data. The MSA, based on fundamental concepts from game theory, accurately quantifies the contributions of network elements and their interactions, overcoming several shortcomings of previous function localization approaches. Its successful operation is demonstrated in both the analysis of a neurophysiological model and of reversible deactivation data. The MSA has a wide range of potential applications, including the analysis of reversible deactivation experiments, neuronal laser ablations, and transcranial magnetic stimulation "virtual lesions," as well as in providing insight into the inner workings of computational models of neurophysiological systems.

Neural Computation 16, 1887–1915 (2004). © 2004 Massachusetts Institute of Technology

1 Introduction

How is neural information processing to be understood? One of the fundamental challenges is to identify the individual roles of the network's elements, be they single neurons, neuronal assemblies, or cortical regions, depending on the scale on which the system is analyzed. Localization of
specific functions in the nervous system is conventionally done by recording neural activity during cognition and behavior, mainly using electrical recordings and functional neuroimaging techniques, and correlating the activations of the neural elements with different behavioral and functional observables. However, this correlation does not necessarily identify causality (Kosslyn, 1999). To allow the correct identification of the elements that are responsible for a given function, lesion studies have been traditionally employed in neuroscience, in which the functional performance is measured after different elements of the system are disabled. Most of the lesion investigations in neuroscience have been single lesion studies, in which only one element is disabled at a time (e.g., Squire, 1992; Farah, 1996). Such single lesions are very limited in their ability to reveal the significance of interacting elements, such as elements with functional overlap (Aharonov, Segev, Meilijson, & Ruppin, 2003).

Acknowledging that single lesions are insufficient for localizing functions in neural systems, Aharonov et al. (2003) have presented the functional contribution analysis (FCA). The FCA analyzes a data set composed of numerous multiple lesions that are inflicted on a neural system, along with its performance scores in a given function. In each multiple lesion experiment composing the lesion data set, several elements are lesioned concurrently, and the system's level of performance is measured. The FCA uses these data to yield a prediction of the system's performance level when new, untested, multiple lesion damage is imposed on it. It further yields a quantification of the elements' contributions to the function studied as a set of values minimizing the prediction error (a contribution of an element to a function should measure its importance to the successful performance of the function). 
The definition of the elements' contributions to each function within the FCA framework is an operative one, based on minimizing the prediction error within a predefined prediction model. There is no inherent notion of correctness of the contributions found by this method. In particular, there are instances where several different contribution assignments to the elements yield the same minimum prediction error. In such cases, the FCA algorithm may reach different solutions, providing accurate predictions in all cases but yielding quite different contributions. Furthermore, since the contributions are calculated in a manner embedded with the prediction component, the statistical significance of the results is unknown. In large systems with numerous insignificant elements, testing the statistical significance of the elements' contributions is essential for locating the significant ones and focusing on them for the rest of the analysis. Devoid of this capacity, the FCA approach has serious limitations on the size of networks it can analyze.

This letter presents a new framework, the multi-perturbation Shapley value analysis (MSA), addressing the same challenge of defining and calculating the contributions of network elements from a data set of multiple lesions or other types of perturbations and their corresponding performance scores. A set of multiple perturbation experiments is viewed as a coalitional
game, borrowing relevant concepts from the field of game theory. Specifically, we define the desired set of contributions to be the Shapley value (Shapley, 1953), which stands for the unique fair division of the game's worth (the network's performance score when all elements are intact) among the different players (the network elements). While in traditional game theory the Shapley value is a theoretical tool that assumes full knowledge of the game, we have developed and studied methods to compute it approximately with high accuracy and efficiency from a relatively small set of multiple perturbation experiments. The MSA framework also quantifies the interactions between groups of elements, allowing for higher-order descriptions of the network, further developing previous work on high-dimensional FCA (Segev, Aharonov, Meilijson, & Ruppin, 2003). The MSA has the ability to use a large spectrum of more powerful predictors, compared with the FCA, and to obtain a higher level of prediction accuracy. Finally, it incorporates statistical tests of significance for removing unimportant elements, allowing for the analysis of large networks.

Besides biological experiments, the MSA can be used for the analysis of neurophysiological models. In recent years, such models have been developed in large numbers and in different contexts, providing valuable insight into the modeled systems. The availability of a computer simulation model for a system, however, does not imply that the inner workings of that model are known. Analyzing such models with the MSA can extend their usefulness by allowing for a deeper understanding of their operation. We demonstrate the workings of the MSA for the analysis of the building block of a model of the lamprey swimming controller (Ekeberg, 1993), successfully uncovering the neuronal mechanism underlying its oscillatory activity as well as the redundancies inherent in it. 
Additionally, the contributions of specific synapses to different characteristics of the oscillation, such as frequency and amplitude, are inferred and analyzed. To test the applicability of our approach to the analysis of real biological data, we applied the MSA to reversible cooling deactivation experiments in cats. Our aim was to establish the contributions of parietal cortical and superior collicular sites to the brain function of spatial attention to auditory stimuli, as tested in an orienting paradigm (Lomber, Payne, & Cornwell, 2001). As we shall show, the MSA correctly identifies the elements' contributions and interactions, verifying quantitatively previous qualitative findings.

The remainder of this letter is organized as follows. Section 2 describes the concept of coalitional games and how it is used in the MSA framework for calculating the contribution of network elements, as well as the interactions between them. Section 3 demonstrates the application of the MSA to the neurophysiological model, and section 4 presents an MSA of reversible deactivation experiments. Our results and possible future applications of the MSA are discussed in section 5. The appendix illustrates the workings of the MSA on a toy problem, comparing it with single lesion analysis and the FCA.
2 The Multi-Perturbation Shapley Value Analysis

2.1 Theoretical Foundations. Given a system (network) consisting of many elements, we wish to ascribe to each element its contribution in carrying out the studied function. In single lesion analysis, an element's contribution is determined by comparing the system's performance in the intact state and when the element is perturbed. Such single lesions are very limited in their ability to reveal the significance of interacting elements. One obvious example is provided by two elements that exhibit a high degree of functional overlap, that is, redundancy: lesioning either element alone will not reveal its significance. Another classical example is that of the "paradoxical" lesioning effect (Sprague, 1966; Kapur, 1996). In this paradigmatic case, lesioning an element is harmful, but lesioning it when another specific element is lesioned is beneficial for performing a particular function. In such cases, and more generally in cases of compound processing, where the contribution of an element depends on the state of other elements, single lesion analysis is likely to be misleading, resulting in erroneous conclusions. The MSA presented in this letter aims at quantifying the contribution of the system's elements, while overcoming the inherent shortcomings of the single lesion approaches.

The starting point of the MSA is a data set of a series of multi-perturbation experiments studying the system's performance in a certain function. In each such experiment, a different subset of the system's elements is perturbed concomitantly (denoting a perturbation configuration), and the system's performance following the perturbation in the function studied is measured. Given this data set, our main goal is to assign values that capture the elements' contribution (importance) to the task in a fair and accurate manner. 
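To make the setup concrete, here is a minimal sketch of such a multi-perturbation data set for a hypothetical three-element network (the elements and scores are invented for illustration, not taken from the letter); note how single-lesion analysis misses the redundant pair entirely:

```python
# A perturbation "configuration" maps the set of INTACT elements to the
# measured performance score (hypothetical numbers, for illustration only).
v = {
    frozenset():      0.0,
    frozenset("A"):   1.0,   # A alone suffices: A and B are redundant
    frozenset("B"):   1.0,
    frozenset("C"):   0.0,
    frozenset("AB"):  1.0,
    frozenset("AC"):  1.0,
    frozenset("BC"):  1.0,
    frozenset("ABC"): 1.0,
}

elements = frozenset("ABC")
# Single-lesion analysis: performance drop when one element alone is perturbed.
drops = {i: v[elements] - v[elements - {i}] for i in sorted(elements)}
print(drops)
# Every drop is 0.0: single lesions assign no importance to the redundant
# pair A, B, which is the motivation for analyzing multi-perturbation data.
```

Because lesioning A alone leaves B to carry the function (and vice versa), each single-lesion drop is zero even though A and B clearly matter; only configurations that perturb both elements together reveal their role.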
Note that this assignment is nothing but the classical functional localization goal in neuroscience, but now recast in a formal, multi-perturbation framework. The basic observation underlying the solution presented in this article to meet this goal is that the multi-perturbation setup is essentially equivalent to a coalitional game. That is, the system elements can be viewed as “players” in a game. The set of all elements left intact in a perturbation configuration can be viewed as a “coalition” of players. The performance of the system following the perturbation can then be viewed as the “worth” of that coalition of players in the game. Within such a framework, an intuitive notion of a player’s importance (or contribution) should capture the worth of coalitions containing it (i.e., the system’s performance when the corresponding element is intact), relative to the worth of coalitions that do not (i.e., relative to the system’s performance when this element, perhaps among others, is perturbed). This intuitive equivalence, presented formally below, enables us to harness the pertaining game theoretical tools to solve the problem of function localization in biological systems. Using the terminology of game theory, let a coalitional game be defined by a pair (N, v), where N = {1, . . . , n} is the set of all players and v(S), for
every S ⊆ N, is a real number associating a worth with the coalition S, such that v(∅) = 0.¹ In the context of multi-perturbations, N denotes the set of all elements, and for each S ⊆ N, v(S) denotes the performance measured under the perturbation configuration in which all the elements in S are intact and the rest are perturbed. A payoff profile of a coalitional game is the assignment of a payoff to each of the players. A value is a function that assigns a unique payoff profile to a coalitional game. It is efficient if the sum of the components of the payoff profile assigned is v(N). That is, an efficient value divides the overall game's worth (the network's performance when all elements are intact) between the different players (the network elements). A value that captures the importance of the different players may serve as a basis for quantifying, in the context of multi-perturbations, the contributions of the system's elements. The definitive value in game theory and economics for this type of coalitional game is the Shapley value (Shapley, 1953), defined as follows. Let the marginal importance of player i to a coalition S, with i ∉ S, be

    Δ_i(S) = v(S ∪ {i}) − v(S).    (2.1)
Then the Shapley value is defined by the payoff

    γ_i(N, v) = (1/n!) Σ_{R ∈ 𝓡} Δ_i(S_i(R))    (2.2)
of each player i ∈ N, where 𝓡 is the set of all n! orderings of N and S_i(R) is the set of players preceding i in the ordering R. The Shapley value can be interpreted as follows: Suppose that all the players are arranged in some order, all orders being equally likely. Then γ_i(N, v) is the expected marginal importance of player i to the set of players who precede him. The Shapley value is efficient since the sum of the marginal importance of all players is v(N) in any ordering.

An alternative view of the Shapley value is based on the notion of balanced contributions. For each coalition S, the subgame (S, v_S) of (N, v) is defined to be the game in which v_S(T) = v(T) for any T ⊆ S. A value γ satisfies the balanced contributions property if for every coalitional game (N, v) and for every i, j ∈ N,

    γ_i(N, v) − γ_i(N\{j}, v_{N\{j}}) = γ_j(N, v) − γ_j(N\{i}, v_{N\{i}}),    (2.3)
meaning that the change in the value of player i when player j is excluded from the game is equal to the change in the value of player j when player i is

¹ This type of game is most commonly referred to as a coalitional game with transferable payoff.
excluded. This property implies that objections made by any player to any other regarding the division are exactly balanced by the counterobjections. The unique efficient value that satisfies the balanced contributions property is the Shapley value (Myerson, 1977, 1980). The Shapley value also has an axiomatic foundation. Let player i be a null player in v if Δ_i(S) = 0 for every coalition S (i ∉ S). Players i and j are interchangeable in v if Δ_i(S) = Δ_j(S) for every coalition S that contains neither i nor j. Using these basic definitions, the Shapley value is the only efficient value that satisfies the three following axioms, further pointing to its uniqueness (Shapley, 1953):

Axiom 1 (symmetry). If i and j are interchangeable in game v, then γ_i(v) = γ_j(v).

Intuitively, this axiom states that the value should not be affected by a mere change in the players' "names."

Axiom 2 (null player property). If i is a null player in game v, then γ_i(v) = 0.
This axiom sets the baseline of the value to be zero for a player whose marginal importance is always zero.

Axiom 3 (additivity). For any two games v and w on a set N of players, γ_i(v + w) = γ_i(v) + γ_i(w) for all i ∈ N, where v + w is the game defined by (v + w)(S) = v(S) + w(S).

This last axiom constrains the value to be consistent in the space of all games. In the 50 years since its construction, the Shapley value as a unique fair solution has been successfully used in many fields. Probably the most important application is in cost allocation, where the cost of providing a service should be shared among the different receivers of that service. This application was first suggested by Shubik (1962), and the theory was later developed by many other authors (e.g., Roth, 1979; Billera, Heath, & Raanan, 1978). This use of the Shapley value has received recent attention in the context of sharing the cost of multicast routing (Feigenbaum, Papadimitriou, & Shenker, 2001). In epidemiology, the Shapley value has been utilized as a means to quantify the population impact of exposure factors on a disease load (Gefeller, Land, & Eide, 1998). Other fields where the Shapley value has been used include, among others, politics (starting from the strategic voting framework introduced by Shapley & Shubik, 1954), international environmental problems, and economic theory (see Shubik, 1985, for discussion and references).

The MSA, given a data set of multi-perturbations, uses the Shapley value as the unique fair division of the network's performance between the different elements, assigning to each element its contribution, as its average importance to the function in question.² The higher an element's contribution according to the Shapley value, the larger is the part it causally plays in the successful performance of the function.³ This set of contributions is a unique solution, in contrast to the multiplicity of possible contribution assignments that may plague an error minimization approach like the FCA. In the context of multi-perturbations, axiom 1 of the Shapley value formulation entails that if two elements have the same importance in all perturbation configurations, their contributions will be identical. Axiom 2 ensures that an element that has no effect in any perturbation configuration will be assigned a zero contribution. Axiom 3 indicates that if two separate functions are performed by the network, such that the overall performance of the network in all multi-perturbation configurations is defined to be equal to the sum of the performances in the two functions, then the total contribution assigned to each element for both functions will be equal to the sum of its individual contributions to each of the two functions. It should be noted that other game-theoretical values can be used instead of the Shapley value within the MSA framework. Specifically, Banzhaf (1965) has suggested an analog of the Shapley value, and Dubey, Neyman, and Weber (1981) have later generalized the Shapley value to a whole family of semivalues, all satisfying the three axioms but without the efficiency property, a natural requirement for describing fair divisions. Once a game is defined, its Shapley value is uniquely determined. Yet employing different perturbation methods may obviously result in different values of v and, as a consequence, different Shapley values. Aharonov et al. 
(2003) and Keinan, Meilijson, and Ruppin (2003) discuss different perturbation methods within the framework of neurally driven evolved autonomous agents and the effects that they may have on the elements' contributions found. The workings of the MSA and the intuitive sense of its "fairness" are demonstrated in a simple, toy-like example in the appendix.

² Since v(∅) = 0 does not necessarily hold in practice, as it depends on the performance measurement definition, Shapley value efficiency transcribes to the property according to which the sum of the contributions assigned to all the elements equals v(N) − v(∅).
³ Since no limitations are enforced on the shape of v, a negative contribution is possible, indicating that the element hinders, on the average, the function's performance.

2.2 Methods

2.2.1 Full Information Calculation: The Original Shapley Value. In an ideal scenario, in which the full set of 2ⁿ perturbation configurations along with the performance measurement for each is given, the Shapley value may be straightforwardly calculated using equation 2.2, where the summation runs over all n! orderings of N. Equivalently, the Shapley value can be computed
as a summation over all 2ⁿ configurations, properly weighted by the number of possible orderings of the elements,

    γ_i(N, v) = (1/n!) Σ_{S ⊆ N\{i}} Δ_i(S) · |S|! · (n − |S| − 1)!    (2.4)

Substituting according to equation 2.1 results in

    γ_i(N, v) = (1/n!) Σ_{S ⊆ N, i ∈ S} v(S) · (|S| − 1)! · (n − |S|)!
              − (1/n!) Σ_{S ⊆ N, i ∉ S} v(S) · |S|! · (n − |S| − 1)!,    (2.5)
where each configuration S contributes a summand to either one of the two sums, depending on whether element i is perturbed or intact in S. Thus, the Shapley value calculation consists of going through all perturbation configurations and calculating for each element the two sums in the above equation.

2.2.2 Predicted and Estimated Shapley Value. Obviously, the full set of all perturbation configurations required for the calculation of the Shapley value is often not available. In such cases, one may train a predictor, using the given subset of the configurations, to predict the performance levels of new, unseen perturbation configurations. Given such a predictor, the outcomes of all multi-perturbation experiments may be extracted, and a predicted Shapley value can be calculated on these data according to equation 2.5. The functional uncoupling (as opposed to the FCA) between the predictor component and the contributions calculation provides the MSA with the freedom to use any predictor. When the space of multi-perturbations is too large to enumerate all configurations in a tractable manner, the MSA can sample orderings and calculate an unbiased estimator of each element's contribution as its average marginal importance (see equation 2.1) over all sampled orderings. Additionally, an estimator of the standard deviation of this Shapley value estimator can be obtained, allowing the construction of confidence intervals for the contribution of each of the elements, as well as the testing of statistical hypotheses on whether the contribution of a certain element equals a given value (e.g., zero or 1/n). The multi-perturbation experiments that should be performed in the estimated Shapley value method are dictated by the sampled orderings. But in many biological applications, a data set consisting of multi-perturbation experiments is given in advance and should be analyzed. In such cases, the MSA employs an estimated predicted Shapley computation.
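Both computations, the exact subset-weighted form of equation 2.5 and the sampled-orderings estimator, can be sketched briefly. The toy game below is hypothetical (elements A and B redundant, C a null element) and stands in for a real perturbation data set or a trained performance predictor:

```python
import random
from itertools import combinations
from math import factorial

def all_subsets(elements):
    return [frozenset(c) for r in range(len(elements) + 1)
            for c in combinations(elements, r)]

def shapley_exact(elements, v):
    """Shapley value from all 2^n configurations, weighted as in eq. 2.5."""
    n = len(elements)
    gamma = {}
    for i in elements:
        total = 0.0
        for S in all_subsets(elements):
            if i in S:
                total += v[S] * factorial(len(S) - 1) * factorial(n - len(S))
            else:
                total -= v[S] * factorial(len(S)) * factorial(n - len(S) - 1)
        gamma[i] = total / factorial(n)
    return gamma

def shapley_sampled(elements, v, n_samples=2000, seed=0):
    """Unbiased estimate: average marginal importance over random orderings."""
    rng = random.Random(seed)
    order = list(elements)
    est = dict.fromkeys(order, 0.0)
    for _ in range(n_samples):
        rng.shuffle(order)
        intact = frozenset()
        for i in order:
            est[i] += v[intact | {i}] - v[intact]   # eq. 2.1
            intact = intact | {i}
    return {i: est[i] / n_samples for i in order}

# Hypothetical toy game: A and B redundant, C a null element.
v = {frozenset(s): (1.0 if ("A" in s or "B" in s) else 0.0)
     for s in ("", "A", "B", "C", "AB", "AC", "BC", "ABC")}

exact = shapley_exact("ABC", v)
approx = shapley_sampled("ABC", v)
print(exact)    # A and B split the credit (0.5 each); C gets 0
print(approx)   # close to the exact values
```

The exact computation confirms the fairness properties discussed above: the redundant pair shares the worth equally, the null element receives zero, and the contributions sum to v(N) − v(∅) (efficiency). The sampled estimator converges to the same values as the number of sampled orderings grows.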
A performance predictor is trained using the given set of perturbation configurations and serves
as an oracle supplying performance predictions for any perturbation configuration as dictated by the sampling. Based on the estimation method, the MSA provides another method for handling networks with a large number of elements. The two-phase MSA procedure is motivated by the observation that often only a small fraction of the possibly large number of elements significantly contributes to the specific function being analyzed. Based on a small sample, the method's first phase finds those elements with significant contributions. The second phase then focuses on finding the accurate contributions of those significant elements. This phase may use the same small sample from the first phase, while focusing on a much smaller perturbation space, thus allowing for faster training and an increased scalability. The different MSA prediction and estimation variants, intended for use in various neuroscience applications, have been tested in the theoretical modeling framework of neurally driven evolved autonomous agents, where their accuracy, efficiency, and scalability were established. A detailed account of these results and of the above estimation and two-phase methods used in obtaining them will be published elsewhere. This article presents analyses that utilize the full information Shapley value calculation (in section 3 and the appendix) and the predicted Shapley value (in section 4), demonstrating the applicability of the concepts underlying the MSA.

2.3 Two-Dimensional MSA. The Shapley value serves as a summary of the game, indicating the average marginal importance of an element over all possible element orderings. For complex networks, where the importance of an element strongly depends on the state (perturbed or intact) of other elements, a higher-order description may be necessary in order to capture sets of elements with significant interactions. 
For example, when two elements exhibit a high degree of functional overlap, that is, redundancy, it is necessary to capture this interaction, aside from the average importance of each element. Such high-dimensional analysis provides further insight into the network's functional organization. We focus on the description of two-dimensional interactions. A natural definition of the interaction between a pair of elements is as follows: Let γ_{i,j̄} = γ_i(N\{j}, v_{N\{j}}) be the Shapley value of element i in the subgame of all elements without element j. Intuitively, this is the average marginal importance of element i when element j is perturbed. Let us now define the coalitional game (M, v_M), where M = N\{i, j} ∪ {(i, j)} ((i, j) is a new compound element) and v_M(S), for S ⊆ M, is defined by

    v_M(S) = { v(S)                     if (i, j) ∉ S
             { v(S\{(i, j)} ∪ {i, j})   if (i, j) ∈ S    (2.6)
where v is the characteristic function of the original game with elements N. Then γi,j = γ(i,j) (M, vM ), the Shapley value of element (i, j) in this game, is
1896
A. Keinan, B. Sandbank, C. Hilgetag, I. Meilijson, and E. Ruppin
the average marginal importance of elements i and j when jointly added to a configuration. The two-dimensional interaction between element i and element j, $j \neq i$, is then defined as

$$
I_{i,j} = \gamma_{i,j} - \gamma_{i,\bar{j}} - \gamma_{j,\bar{i}},
\tag{2.7}
$$
which quantifies how much the average marginal importance of the two elements together is larger (or smaller) than the sum of the average marginal importance of each of them when the other one is perturbed. Intuitively, this symmetric definition ($I_{i,j} = I_{j,i}$) states how much the whole is greater than the sum of its parts (synergism), where the whole is the pair of elements. In cases where the whole is smaller than the sum of its parts, that is, when the two elements exhibit functional overlap, the interaction is negative (antagonism). This two-dimensional interaction definition coincides with the Shapley interaction index, which is a more general measure for the interaction among any group of players (Grabisch & Roubens, 1999). The MSA can classify the type of interaction between each pair even further. By definition, $\gamma_{i,\bar{j}}$ is the average marginal importance of element i when element j is perturbed. Based on equation 2.7, $\gamma_{i,\bar{j}} + I_{i,j}$ is the average marginal importance of element i when element j is intact. When both $\gamma_{i,\bar{j}}$ and $\gamma_{i,\bar{j}} + I_{i,j}$ are positive, element i's contribution is positive, regardless of whether element j is perturbed or intact. When both are negative, element i hinders the performance, regardless of the state of element j. In cases where the two measures have opposite signs, we define the contribution of element i as j-modulated. The interaction is defined as positive modulated when $\gamma_{i,\bar{j}}$ is negative while $\gamma_{i,\bar{j}} + I_{i,j}$ is positive, causing a "paradoxical" effect. We define the interaction as negative modulated when the former is positive while the latter is negative. The interaction of j with respect to i may be categorized in a similar way, yielding a full description of the type of interaction between the pair. Classical "paradoxical" lesioning effects, for instance, of the kind reported in the neuroscience literature (Sprague, 1966; Kapur, 1996), are defined when both elements exhibit positive modulation with respect to one another.
As evident, the rigorous definition of the type of interaction presented in this section relies on an average interaction over all perturbation configurations. Thus, it does not necessarily coincide with the type of interaction found by using only single perturbations and a double perturbation of the pair, as conventionally described in the neuroscience literature.

3 MSA of a Lamprey Segment

Computer simulation of neuronal models is becoming an increasingly important tool in neuroscience. Modeling enables exploring the relation between the properties of single elements and the emergent properties of the system as a whole. Neural network models are usually highly complex networks composed of many interacting neural elements. In many cases, it is
Fair Attribution of Functional Contribution
1897
possible to determine an architecture and find parameter values that provide a good fit between the behavior of the model and that of the real physiological system. However, it is usually very difficult to assess each element's function within the model system, that is, to understand how the elements interact with each other to create the emergent properties that the system displays as a whole. In fact, there is a wide gap between the large body of work invested in constructing neural models and the paucity of efforts to analyze and understand the workings of these models in a rigorous manner. Neuronal models may be easily manipulated, allowing measurement of their behavior under different perturbations, which makes them amenable to analysis. Using the MSA, it is straightforward to compute the contribution of each element to each function the model system simulates and the interactions between different elements in each of the different functions. Insights from such an analysis may be utilized for further study of the system being modeled. In the remainder of this section, we demonstrate the application of the MSA to the analysis of a fish swimming model. Section 3.1 briefly introduces the model, and section 3.2 presents the results obtained by the MSA.

3.1 The Oscillatory Segment Model. The lamprey is one of the earliest and most primitive vertebrates. Much like an eel, it swims by propagating an undulation along its body, from head to tail. The neural network located in its spinal cord, which controls the motion, has been extensively studied (see, e.g., Brodin, Grillner, & Rovainen, 1985; Buchanan & Grillner, 1987; Grillner, Wallen, & Brodin, 1991; Grillner et al., 1995; Buchanan, 1999, 2001; Parker & Grillner, 2000). For a recent review, see Buchanan (2002).
Physiological experiments on isolated spinal cords have shown that this network is a central pattern generator (CPG), as it is able to generate patterns of oscillations without oscillatory input from either the brain or sensory feedback. Because small parts of the spinal cord (up to two segments) can be isolated and still produce oscillations when subjected to an excitatory bath, the CPG is thought of as an interconnection of local (segmental) oscillators. The complete controller is formed of approximately 100 identical copies of segmental oscillators, linked together to form a chain. The CPG controlling the lamprey’s swimming movement has been modeled at several levels of abstraction. The model analyzed in this section is a connectionist model of the swimming controller developed by Ekeberg (1993). This model demonstrates that the interneuron connectivity is in itself sufficient for generating much of the lamprey’s range of motions, without resorting to complicated neuronal mechanisms. Following the real lamprey, the model is essentially a chain of identical segments, each capable of independently producing oscillations. Being the basic building block of the model, the properties of the individual segment determine to a large extent the overall behavior of the model. We therefore focus our analysis on the individual segment.
Figure 1A shows a schematic diagram of the neural network of a segmental oscillator. It is a symmetrical network, composed of four neurons on each side. The neurons are modeled as nonspiking leaky integrators, where each unit can be regarded as a representative of a large population of functionally similar (spiking) neurons, its output representing the mean firing rate of the population. There are four types of neurons. The motoneurons (MN) provide the output to the muscles; the excitatory interneurons (EIN) maintain the ipsilateral activity by exciting all of the neurons on their side of the segment; the contralateral inhibitory neurons (CCIN) suppress the contralateral activity; and the lateral inhibitory neurons (LIN) suppress the ipsilateral CCIN. When this network is initialized to an asymmetric state (with all neurons on one side excited) and a constant external bias is applied to all neurons (interpreted within the model as input from the lamprey's brain stem), the network exhibits a stable pattern of oscillations with the left and right motoneurons out of phase, as shown in Figure 1B. When 100 such segments are linked in a long chain as in the full model, these local oscillations give rise to coordinated waves of activity that travel along the chain (corresponding to the lamprey's spinal cord), producing the lamprey's swimming motion.

3.2 Segmental Analysis. In order to understand how the connectivity between the neurons gives rise to the behavior of the segmental network as a whole, the analysis was conducted on the level of the synapses. Because the neural network is completely left-right symmetrical and its task is also inherently symmetrical, the basic neural element in our analysis is a symmetrical pair of synapses. That is, each perturbation configuration defines a set of "synapses" to be perturbed, where each perturbation of a "synapse" denotes the simultaneous perturbation of both the left and right corresponding synapses.
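A minimal sketch of such nonspiking leaky-integrator units, with an option to perturb chosen synapses by clamping the activation they mediate to its mean (the method used in section 3.2.1), is given below. The update rule is a deliberately simplified stand-in, not Ekeberg's exact equations:

```python
import numpy as np

def simulate(weights, bias, clamp=(), T=3000, dt=1.0, tau=30.0):
    """Simulate a small network of nonspiking leaky integrators.

    weights[i][j] is the synaptic weight from neuron j to neuron i.
    A synapse (i, j) listed in `clamp` is perturbed by fixing the
    presynaptic activation it mediates to that activation's mean,
    computed from an unperturbed run. This is a simplified sketch of
    the clamping method, not Ekeberg's model.
    """
    n = len(bias)
    def run(mean_act=None):
        x = np.zeros(n)
        x[0] = 1.0                      # asymmetric initial state
        trace = np.zeros((T, n))
        for t in range(T):
            inp = np.zeros(n)
            for i in range(n):
                for j in range(n):
                    a = x[j]
                    if mean_act is not None and (i, j) in clamp:
                        a = mean_act[j]  # clamp mediated activation to mean
                    inp[i] += weights[i][j] * a
            x = x + dt / tau * (-x + np.clip(inp + bias, 0, 1))
            trace[t] = x
        return trace
    intact = run()                       # unperturbed pass to get means
    return run(mean_act=intact.mean(axis=0))

# Hypothetical two-unit mutual-inhibition example (not the segment itself):
w = [[0.0, -2.0], [-2.0, 0.0]]
trace = simulate(w, np.array([0.5, 0.5]), clamp={(0, 1)})
```

Because the clamp preserves the perturbed synapse's mean activation while removing its fluctuations, it matches the intuition, noted below, of stochastic lesioning at the level of synapses between neuronal populations.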
As a consequence, the MSA quantifies the importance of each such pair of synapses. Throughout this section, the shorter term synapse is used to refer to a synaptic pair.

3.2.1 The Oscillation Generating Mechanism. The first experiment aims to discover which synapses are important in producing the pattern of oscillations in the motoneurons, and hence for generating the motion that propels the lamprey. From viewing the structure of the full network, it is not clear how these oscillations are generated and maintained. Hence, we utilize the MSA to uncover this mechanism. All possible multi-perturbation configurations are inflicted on the model, testing whether the motoneurons exhibit stable oscillations. The perturbation method used is clamping the activation level of a synapse to its mean.4 This allows the MSA to quantify the extent

4 The activation level of a synapse is used here to refer to the activation level of the presynaptic neuron, as mediated by the synapse to the postsynaptic neuron.
Figure 1: The lamprey’s segmental oscillator. (A) The neural network controlling the segmental oscillator. Solid lines indicate excitatory synapses, and dashed lines indicate inhibitory synapses. The number next to each synapse denotes the synaptic weight. The connections leading from the motoneurons to the muscles are not included in the model and are shown for illustration purposes only. (B) The activation levels of the left (solid line) and right (dashed line) motoneurons, after the network has been initialized to an asymmetric state with all left neurons maximally active. The figure focuses on the behavior of the system after it stabilizes, starting from the 200th millisecond.
to which the oscillations mediated by a given synapse contribute to the existence of oscillations in the motoneurons without affecting the level of bias of each neuron. Viewing each model neuron as a representative of a large population of spiking neurons, this perturbation method can be shown to be equivalent to using stochastic lesioning (Aharonov et al., 2003) on the level of the individual synapses connecting different neuronal populations.5 Figure 2A presents the contributions (Shapley value) of the synapses, based on the full information available. Evidently, out of the nine synapses in the model, only five contribute to the generation of oscillations, while the others have a vanishing contribution. Figure 2B shows the oscillatory backbone of the network, including only those five synapses. Viewing just these synapses, it is possible to understand the mechanism generating the oscillations. Essentially, when the left CCIN becomes active, it inhibits the right EIN. The reduced activity of the EIN causes a reduction in the activity of the right LIN. This, in turn, causes increased activation in the right CCIN, which inhibits the left EIN, and so on. The constant bias to the network is what causes the EIN's activation level to swing back up when it becomes too low, thus maintaining the process. The two remaining synapses, CCIN→MN and EIN→MN, mediate the oscillations to the motoneurons. The MSA also quantifies the interaction between pairs of synapses. Specifically, it reveals a functional overlap between the CCIN→MN and EIN→MN synapses, expressed by a strongly negative interaction ($I_{EIN \to MN, CCIN \to MN} = -0.25$). The positive contribution of each of the two synapses when the other is perturbed ($\gamma_{EIN \to MN, \overline{CCIN \to MN}} = \gamma_{CCIN \to MN, \overline{EIN \to MN}} = 0.25$) and the zero contribution of each of them when the other is intact indicate that perturbing one of the synapses nullifies the oscillations in the MNs if and only if the other is perturbed.
In other words, the two synapses are totally redundant with respect to each other. In contradistinction, every other pair of synapses in the backbone exhibits a positive interaction (synergism) indicating that the two synapses comprising it cooperate and rely on each other in order to perform this function. Thus, the MSA successfully uncovers the synapses generating the oscillations, including those that are backed up by other synapses, in which case it also identifies the type of interaction between them. This description of the dynamics underlying the network’s oscillatory activity could not have been obtained by examining the structure of the original network, as one might postulate that a different set of synapses is important for this task. For example, the CCIN→EIN and EIN→LIN synapses might be replaced by a single CCIN→LIN synapse to the same effect. It should also be emphasized that using only single perturbations to analyze the model would not have revealed the importance of the CCIN→MN
5 Stochastic lesioning is performed by randomizing the firing pattern of the perturbed element without changing its mean activation level.
Figure 2: Results of the MSA on the task of motoneuronal oscillations. (A) The contributions of the nine synapses in the network. Synapses are presented in the form presynaptic neuron→postsynaptic neuron. Due to space considerations, only the first letter of each neuron is displayed (e.g., E stands for the EIN). (B) The oscillatory backbone of the network, composed of the five synapses with nonzero contributions. As before, solid lines indicate excitatory synapses, and dashed lines indicate inhibitory synapses.
and EIN→MN synapses, since if only one of them is perturbed, the oscillations are still preserved, as they are mediated to the motoneurons by the remaining synapse. Using multi-perturbation experiments, however, overcomes the intrinsic limitations of single-perturbation experiments and fully reveals the workings of the network.

3.2.2 Oscillation Characteristics. The results of the previous section suggest that only five synapses generate the oscillations in the network, while the others may be perturbed without affecting their existence. This, however, does not mean that the other synapses play no role in determining other aspects of the system's behavior. It is also possible that the five synapses participating in the oscillatory backbone of the network contribute to other functions as well. To examine this issue further, we perform a finer analysis, which identifies and quantifies the synaptic contributions to four characteristics of the network's oscillatory behavior. The first three characteristics pertain to the difference between the left and the right motoneuron's activity level during a time span of 3 seconds. This difference is in itself oscillatory in the intact network, signifying in each moment the extent to which the left muscle innervated by the segment's motoneuron is more stimulated than the right one. The three characteristics we measure are the amplitude and frequency of this oscillation and the standard deviation of the difference in activity around its mean. The fourth characteristic examined is the degree to which the left and right motoneurons are out of phase with each other. In each multi-perturbation configuration, the network's performance is measured in each of these four characteristics, interpreted as functions within the MSA framework, and the contribution of each of the elements to each function is quantified separately by calculating the corresponding Shapley value using full information.
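The four characteristics can be measured from the motoneuron traces roughly as follows. The particular estimators here (peak-to-peak amplitude, mean-crossing frequency, correlation-based antiphase) are plausible stand-ins, not necessarily the article's exact definitions:

```python
import numpy as np

def oscillation_characteristics(left, right, dt=0.001):
    """Measure the four characteristics from the left and right
    motoneuron activation traces (sampled every dt seconds).

    Returns the amplitude and frequency of the left-right difference
    signal, the standard deviation of that difference around its mean,
    and the degree to which the two traces are in antiphase
    (1 = perfect antiphase, 0 = in phase), from their correlation.
    """
    d = np.asarray(left) - np.asarray(right)
    d0 = d - d.mean()
    amplitude = (d.max() - d.min()) / 2.0
    # frequency from mean-crossings: two crossings per cycle
    crossings = np.nonzero(np.diff(np.sign(d0)) != 0)[0]
    duration = len(d) * dt
    frequency = len(crossings) / (2.0 * duration)
    std = d0.std()
    r = np.corrcoef(left, right)[0, 1]
    antiphase = (1.0 - r) / 2.0        # maps r = -1 to 1, r = +1 to 0
    return amplitude, frequency, std, antiphase
```

Applied to two sinusoids a half-cycle apart, this yields an antiphase measure near 1 and recovers the common frequency, as expected for traces like those in Figure 1B.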
This type of analysis requires a more delicate type of perturbation, such that the oscillations in the network are preserved under all perturbation configurations.6 To this end, we use a perturbation method that only slightly shifts a perturbed synapse's activation level toward its mean. Thus, oscillations mediated by the perturbed synapse are somewhat attenuated but not wholly negated. If the magnitude of the shift is small enough, the network retains its oscillatory behavior under all perturbation configurations. The selection of this perturbation method can be justified if we recall that a single neuron stands for a large population of spiking neurons, and hence the perturbation's overall impact is equivalent to applying stochastic informational lesioning (Keinan et al., 2003) to the individual synapses linking different populations.7

6 Another reason for using more delicate perturbations is that they allow the evaluation of the performance of the system closer to its true, intact behavior, and thus gain a more precise sense of the contribution of each synapse (see Keinan et al., 2003, for more on this approach to perturbation).

7 Informational lesioning views a lesioned element as a noisy channel and enables the application of different levels of perturbation, corresponding to different magnitudes of noise in the channel, while maintaining the element's mean activation level.

Figure 3: Results of the MSA with different oscillation characteristics taken as functions (see the main text). Shown are the contributions to the oscillation's amplitude (A), frequency (B), the standard deviation of the difference between the activity of the left and right motoneurons (C), and the degree to which the left and right motoneurons' firing is in antiphase (D). The results are normalized such that the sum of the absolute values of the contributions in each function equals 1.

Several observations can immediately be made on the results of the analysis, shown in Figure 3. First, synapses that contribute nothing to the oscillation-generating function, such as the CCIN→LIN synapse, do make significant contributions to other tasks. The opposite is also true: synapses
in the oscillation-generating backbone make vanishing contributions to other functions. Second, a given synapse may have a positive contribution to one function while effectively hindering another. For example, the CCIN→CCIN synapse has a strong positive contribution to the left-right firing antiphase but a negative contribution to the frequency of the oscillations. This suggests that the network is the result of compromising a number of contradicting demands. Finally, different functions are localized to a different degree. On the one hand, the oscillation amplitude function is very localized, having a single synapse with a contribution of almost 0.9. On the other hand, the frequency function is more distributed, as four synapses have contributions greater than 0.1 in absolute value. An analysis of the contributions to each of the functions may provide further understanding of the network’s behavior. As an example, we discuss the results for the amplitude function. Evidently, the CCIN→MN synapse is the only one with a significant contribution to the amplitude of the MN’s oscillations (see Figure 3A). Interestingly, while the results of section 3.2.1 have shown that the CCIN→MN and the EIN→MN synapses are totally redundant with respect to one another in generating and maintaining the oscillations, the results of the current analysis reveal that the CCIN→MN synapse influences the activity of the MN to a far greater degree than the EIN→MN. The latter synapse is indeed sufficient for preserving the oscillations in the MN when the CCIN→MN synapse is clamped to its mean activation level, but these would be of much lower amplitude than when the CCIN→MN synapse is intact. Another conclusion that can be drawn from these results is that the amplitude of the oscillations of the CCIN is rather robust to perturbations to synapses. 
If it were not true and perturbing some synapse had a large impact on the amplitude of the CCIN’s oscillations, it would also have affected the amplitude of the MN, and hence that synapse would have had a large contribution to this function. Because the MSA reveals no such contributions, the conclusion follows. The contributions to the frequency function (see Figure 3B) demonstrate how the MSA infers the true importance of neural elements, which in this simple network and for this function can be at least partially deduced through an examination of the network’s structure. We focus here on the two synapses that have nonvanishing contributions to the frequency function and do not belong to the oscillatory backbone: CCIN→CCIN and CCIN→ LIN. The CCIN→CCIN’s strong negative contribution suggests that on the average, when it is perturbed, the frequency of the MN’s oscillations is increased. This is due to the fact that as the left side of the segment becomes more excited, the CCIN→CCIN synapse inhibits the activity of the right CCIN. As this CCIN becomes more inhibited, its inhibitory influence on the overall activity of the left side decreases, causing a further increase in the activity of the left side. Therefore, this synapse indeed serves to prolong the bursts of activity, reducing the frequency of the oscillations. In contrast, the CCIN→LIN synapse has a positive contribution to the MN’s frequency
of oscillations, which means that, on the average, when it is perturbed, the frequency is decreased. The dynamics underlying this effect are as follows. As the left side of the segment becomes more excited, the left CCIN inhibits the right LIN, reducing the inhibition effect on the right CCIN. As the right CCIN becomes more active, it serves to terminate the left side burst. As mentioned in section 3.2.1, this synapse could in principle participate in the oscillatory backbone together with (or instead of) the pair of synapses CCIN→EIN and EIN→LIN. But while the CCIN→LIN synapse is not necessary or sufficient for generating oscillations in the network, this finer analysis reveals that it is important in modulating the frequency of those oscillations. Clearly, the MSA provides interesting insights into the identification of neural function even in this relatively simple example. In summary, this section presents and analyzes the basic building block of a model for the neural network responsible for lamprey swimming. Using the MSA, it is possible to understand how the oscillations are generated and maintained in the network, which could not be inferred from the network's structure alone. The MSA succeeds where single perturbation analysis fails and detects the importance of synapses that are completely backed up by other synapses, in which case it also correctly identifies the type of interaction between them and quantifies it. Finally, using the MSA, it is possible to find the contribution of each synapse to various characteristics of the oscillations, revealing different levels of functional localization and providing further insight into the workings of the network.

4 MSA of Reversible Deactivation Experiments

To test the applicability of our approach to the analysis of biological "wetware" network data, we applied the MSA to data from reversible cooling deactivation experiments in the cat.
Specifically, we investigated the brain localization of spatial attention to auditory stimuli (based on the orienting paradigm described in Lomber, Payne, Hilgetag, & Rushmore, 2002). Spatial attention is an essential brain function in many species, including humans, that underlies several other aspects of sensory perception, cognition, and behavior. While attentional mechanisms proceed efficiently, automatically, and inconspicuously in the intact brain, perturbation of these mechanisms can lead to dramatic behavioral impairment. So-called neglect patients, for instance, have great difficulties (or even fail) in reorienting their attention to spatial locations after suffering specific unilateral brain lesions, with resulting severe deficits of sensory (e.g., visual, auditory) perception and cognition (Vallar, 1998). From the perspective of systems neuroscience, attentional mechanisms are particularly interesting, because this function is known to be widely distributed in the brain. Moreover, lesions in the attentional network have resulted in "paradoxical" effects (Kapur, 1996), in which the deactivation of some elements results in a better-than-normal performance (Hilgetag, Theoret, & Pascual-Leone, 2001) or reversed behav-
ioral deficits resulting from earlier lesions (e.g., Sprague, 1966). Such effects challenge traditional approaches for lesion analysis and provide an ideal test bed for novel formal analysis approaches, such as the MSA. In the reversible deactivation experiments analyzed here, auditory stimulus detection and orienting responses in intact and reversibly lesioned cats were tested in a semicircular arena, in which small speakers were positioned bilaterally from midline (0 degrees) to 90 degrees eccentricity, at 15 degree intervals. The animals were first trained to stand at the center of the apparatus, facing and attending to the 0 degree position speaker presenting a white noise hiss. Their subsequent task was to detect and orient toward noise coming from one of the peripheral speakers. They were rewarded with moist (high incentive) or dry (low incentive) cat food, depending on whether they correctly approached the peripheral stimulus or the default central position, respectively. After the animals attained a stable near-perfect baseline performance, cryoloops were surgically implanted over parietal cortical (pMS) and collicular (SC) target structures, following an established standard procedure (Lomber, Payne, & Horel, 1999). Because these regions are found in both halves of the brain, altogether four candidate target sites ($pMS_L$, $pMS_R$, $SC_L$, and $SC_R$) were deactivated, one or two of them in each experiment. Once baseline performance levels were reestablished, the animals were tested during reversible cooling deactivation of unilateral and bilateral cortical and collicular sites (for further details, see Lomber et al., 2001). The technique allows for selective deactivation of just the superficial cortical or collicular layers by keeping the cooling temperature at a level at which the deeper layers are still warm and active.
Deactivation of the deeper layers, on the other hand, also requires deactivation of the superficial ones due to the placement of the cryoloops on top of the cortical or collicular tissue. Nineteen single and multi-lesion experiments were performed (mostly as control experiments for the deactivation of visual structures but also in their own right, e.g., Lomber et al., 2001) and another 14 lesion configurations were deduced by assuming mirror-symmetric effects resulting from lesions of the two hemispheres, yielding data for 33 single and multi-lesion experiments out of the 256 possible configurations. Figure 4A shows the predicted Shapley value (see section 2.2.2) of the different regions involved in the experiments, using projection pursuit regression (PPR) (Friedman & Stuetzle, 1981) for prediction, trained using the 33 configurations. It is evident that only regions SCR -deep and SCL -deep play a role in determining the auditory attentional performance. Both have a contribution equal to half the overall predicted performance of the system (0.2). Due to the outlined limitation of the experimental procedure, when a deep component of the SC is lesioned, the superficial one is lesioned as well (regions 5 and 7 in Figure 4). Nevertheless, the MSA successfully reveals that only the deep SC regions are the ones of significance, which concurs with previous interpretation of collicular deactivation results (Lomber et al., 2001). Interestingly, the FCA using the same experiments assigns different
Figure 4: One- and two-dimensional MSA of reversible deactivation experiments. The eight regions correspond to the pMS and the SC of both sides, with separation between deep and superficial layers in each. (A) Predicted Shapley value of the eight regions. Regions 6 and 8 represent SCR -deep and SCL -deep, respectively. (B) The symmetric interaction Ii,j between each pair of regions (i < j).
contributions in different runs. In most runs, it uncovers the role played by the deep SC regions. Alas, it usually also assigns nonvanishing contributions to the superficial SC regions. Furthermore, in some runs, it assigns significant contributions to the pMS regions. This testifies to the disadvantages of an operative approach such as the FCA, which the MSA overcomes by offering a unique, fair solution for the contributions. We performed a two-dimensional MSA to quantify the interactions between each pair of regions, finding only one significant interaction, between $SC_R$-deep and $SC_L$-deep, $I_{6,8} = 0.8$ (see Figure 4B). Furthermore, observing the negative contribution of each of the two regions when the other one is lesioned ($\gamma_{6,\bar{8}} = \gamma_{8,\bar{6}} = -0.3$) and the positive contribution when the other one is intact ($\gamma_{6,\bar{8}} + I_{6,8} = \gamma_{8,\bar{6}} + I_{8,6} = 0.5$), the MSA concludes that each of the two regions is positively modulated by the other, uncovering the "paradoxical" type of interaction assumed to take place in this system (Hilgetag, Kotter, & Young, 1999; Hilgetag, Lomber, & Payne, 2000; Lomber et al., 2001). This analysis testifies to the usefulness of the MSA in deducing the functionally important regions as well as their significant interactions. The prediction made by the MSA for lesioning configurations that were not performed in the experiments is that lesioning either one of the deep SC regions will result in a large deficit in (contralateral) spatial attention to auditory stimuli, while lesioning both will result in a much smaller deficit. Lesioning any subgroup of the other regions involved in these experiments will not influence the performance in this task. In particular, it is predicted that as long as the deep SC regions are intact, the animal will exhibit full
orientation performance, even when some or all of the superficial collicular layers, the superficial parietal cortical layers, and the deeper parietal cortical layers are lesioned.

5 Discussion

Over the past two years, we have developed and described a new framework for quantitative causal function localization using multi-perturbation experiments (Aharonov et al., 2003; Segev et al., 2003; Keinan et al., 2003). This article presents an important new step within this project, replacing an earlier ad hoc error minimization approach by the MSA: an axiomatic framework based on a rigorous definition of all elements' contributions via the Shapley value. The latter is a fundamental concept borrowed from game theory, which has been classically used for providing fair solutions to cost allocation problems. The MSA accurately approximates the Shapley value in a scalable manner, making it a more accurate and efficient method for function localization than its predecessor, the FCA. The prediction and estimation variants of the MSA are specifically geared toward experimental biological applications, in which only a limited number of multi-perturbation experiments can be performed. As demonstrated in this article, the MSA can provide new insights into the workings of a classical neural model, that of a segment of the lamprey, revealing a "skeletal" subnetwork that is primarily responsible for its CPG activity. We also showed that the MSA framework is capable of dealing with behavioral data from experimental deactivation studies of the cat brain, quantitatively identifying the main interaction underlying a "paradoxical" lesioning phenomenon observed previously in studies of spatial attention (Hilgetag et al., 1999, 2000; Lomber et al., 2001). The presented applications, however, involve relatively simple networks of limited size. The more advanced estimation and prediction methods will be needed in order to meet the challenge of function localization in more complex systems.
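The prediction variant recapped above — fitting a regressor on the measured perturbation configurations and using it as an oracle for the unmeasured ones — can be sketched as follows. An ordinary linear model with pairwise terms stands in here for the projection pursuit regression actually used, and the training game is a hypothetical illustration:

```python
import itertools
import numpy as np

def predicted_shapley(n, train_configs, train_perf):
    """Predicted-Shapley sketch: fit a simple predictor on a few
    measured multi-perturbation experiments, use it as an oracle for
    all 2^n configurations, and compute each element's Shapley value
    from the predictions (exactly, for small n).
    """
    def features(cfg):
        f = list(cfg)
        f += [cfg[i] * cfg[j] for i in range(n) for j in range(i + 1, n)]
        return [1.0] + f
    X = np.array([features(c) for c in train_configs], dtype=float)
    w, *_ = np.linalg.lstsq(X, np.asarray(train_perf, float), rcond=None)

    def v(intact):  # regression oracle over all configurations
        cfg = [1.0 if i in intact else 0.0 for i in range(n)]
        return float(np.array(features(cfg)) @ w)

    phi = np.zeros(n)
    perms = list(itertools.permutations(range(n)))
    for order in perms:
        s, prev = set(), v(set())
        for e in order:
            s.add(e)
            cur = v(s)
            phi[e] += cur - prev
            prev = cur
    return phi / len(perms)
```

When the measured configurations suffice to pin down the underlying performance function, the Shapley values computed from the oracle's predictions coincide with those of the true game.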
As pointed out previously, and as is evident from effects such as "paradoxical" lesion phenomena and functional backup, single-lesion approaches do not suffice to portray the correct function localization in a network. Why, then, has the great majority of lesioning studies in neuroscience relied solely on single lesions? First, it has been very difficult (and in many systems, practically impossible) to reliably create and test multiple lesions. Moreover, single-lesion studies have been perceived as already being fairly successful in providing insights into the workings of neural systems. Naturally, the lack of a more rigorous multi-perturbation analysis has made the testing and validation of this perception practically impossible, and it may well be that the lack of an analysis method has made such experiments seem futile. We hope that at least this last obstacle has been remedied by the introduction of the MSA.
Fair Attribution of Functional Contribution
1909
But the question remains, Have we reached the stage where we can now perform multi-perturbation experiments and corresponding analyses of biological networks? We believe that the answer is affirmative, and newly introduced experimental tools hold great promise, both in neuroscience and for the analysis of genetic and metabolic networks. In neuroscience, there is now an exciting new prospect of carrying out experimental perturbation studies in human subjects using transcranial magnetic stimulation (TMS). This technique allows inducing "virtual lesions" in normal subjects performing various cognitive and perceptual tasks (Pascual-Leone, Wasserman, Davey, & Rothwell, 2002; Rafal, 2001). The methodology can be utilized to co-deactivate doublets of brain sites (Hilgetag et al., 2003) and potentially even triplets. Additionally, recent retrospective lesion studies of stroke patients have reconstructed patients' lesions and analyzed the resulting multi-lesion data using statistical tools (e.g., Adolphs, Damasio, Tranel, Cooper, & Damasio, 2000; Bates et al., 2003). Such data may be more rigorously analyzed by the MSA, processing the multi-lesion data to capture the contributions of, and the significant high-dimensional interactions between, regions or voxels. Going beyond the realm of neuroscience to biology in general, the recent discovery of RNA interference (RNAi) (Hammond, Caudy, & Hannon, 2001; Couzin, 2002) has made the possibility of multiple concomitant gene knockouts a reality. Using RNAi vectors, it is now possible to temporarily block the transcription of specified genes for a certain duration and measure the performance of various cellular and metabolic indices of the cell, including the expression levels of other genes. As with TMS, RNAi is limited at this stage to just a few elements that are knocked out concomitantly, but this is just the beginning.
multi-perturbation studies are a necessity, and they are hence bound to take place starting in the very near future. The MSA framework presented in this article is a harbinger of this new kind of study, offering a novel and rigorous way of making sense of such experiments.

Appendix: A Toy Problem

In this appendix, we analyze a very simple system as a test case for the MSA, for the purpose of illustrating its operation in a concrete, minimal example. After introducing the system, we present the Shapley value obtained and explain its intuitive correctness in this case. We then compare the results of the MSA with both single lesion analysis and the FCA of this system.

A.1 The System. Let us define a system of elements {e_1, . . . , e_n}, where the lifetime of element e_i is exponentially and independently distributed with parameter λ_i (expectation 1/λ_i). We define the performance of the system as the expected time during which at least one of the elements remains functioning, that is, the expectation of the maximum of the individual lifetimes. For clarity, we focus on the case n = 3, but the results presented throughout this appendix hold for any number of elements.
We calculate the performance for each perturbation configuration inflicted on the system, where a perturbed element is simply removed from the system. Obviously, v(∅) = 0 and v({e_i}) = 1/λ_i for i ∈ {1, 2, 3}, since there is only one element in the system in the latter case and none in the former. For i ≠ j, i, j ∈ {1, 2, 3}, v({e_i, e_j}) is the expectation of a random variable whose cumulative distribution function is the product of the two individual cumulative distribution functions, (1 − e^{−λ_i x}) · (1 − e^{−λ_j x}). Calculating the expectation using this cumulative distribution function yields

$$v(\{e_i, e_j\}) = \frac{1}{\lambda_i} + \frac{1}{\lambda_j} - \frac{1}{\lambda_i + \lambda_j}. \tag{A.1}$$
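As a sanity check on equation A.1, the coalition value can be computed both from the inclusion-exclusion closed form and by direct Monte Carlo simulation of the exponential lifetimes. A minimal sketch in Python (the function names and example rates are ours, not from the article):

```python
import itertools
import random

def v(coalition, lam):
    # Closed-form coalition performance via inclusion-exclusion (eqs. A.1-A.2)
    total = 0.0
    for r in range(1, len(coalition) + 1):
        for sub in itertools.combinations(sorted(coalition), r):
            total += (-1) ** (r + 1) / sum(lam[i] for i in sub)
    return total

def v_mc(coalition, lam, trials=200_000, seed=0):
    # Monte Carlo estimate: expected maximum of independent exponential lifetimes
    rng = random.Random(seed)
    return sum(max(rng.expovariate(lam[i]) for i in coalition)
               for _ in range(trials)) / trials

lam = {1: 1.0, 2: 0.5, 3: 0.25}   # illustrative rates lambda_i
exact = v({1, 2}, lam)            # 1/1 + 1/0.5 - 1/1.5 = 2.3333...
approx = v_mc({1, 2}, lam)
print(exact, approx)
```

The two values agree to within Monte Carlo error, confirming that the expectation of the maximum lifetime is exactly the alternating sum of inverse rate sums.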
Similarly,

$$v(\{e_1, e_2, e_3\}) = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} + \frac{1}{\lambda_3} - \frac{1}{\lambda_1 + \lambda_2} - \frac{1}{\lambda_1 + \lambda_3} - \frac{1}{\lambda_2 + \lambda_3} + \frac{1}{\lambda_1 + \lambda_2 + \lambda_3} \tag{A.2}$$

is the performance of the system in its intact state. The formulas for v may also be viewed intuitively in terms of the inclusion-exclusion formula: equation A.2 can be illustrated as an identity over Venn diagrams, with one diagram per term, in the same order. [Venn-diagram illustration not reproduced here.] Each region in such a Venn diagram is the expected time during which all the elements of the corresponding groups are functioning; an intersection of several groups is thus the expectation of the minimum of the corresponding lifetimes. In this case of exponential distributions, the minimum is also exponentially distributed, with parameter equal to the sum of the parameters of the individual distributions, which yields the expectation of each term in equation A.2.

A.2 The Shapley Value. Knowing the performance of each coalition of elements, the Shapley value is obtained using equation 2.5. The contribution of element e_1 according to the Shapley value is
$$\gamma_1(N, v) = \frac{1}{\lambda_1} - \frac{1}{2} \cdot \frac{1}{\lambda_1 + \lambda_2} - \frac{1}{2} \cdot \frac{1}{\lambda_1 + \lambda_3} + \frac{1}{3} \cdot \frac{1}{\lambda_1 + \lambda_2 + \lambda_3}, \tag{A.3}$$
and similarly for the other elements. The meaning of the resulting contribution of e_1 can again be illustrated with Venn diagrams, one per term of equation A.3, in the same order. [Venn-diagram illustration not reproduced here.] The illustration shows that element e_1 is credited with a third of the time during which it functions together with both e_2 and e_3 (the rest being divided equally between the contributions of e_2 and e_3), with half of the time during which it functions with either e_2 or e_3 alone (the other half being attributed to the other element), and with all of the time during which it functions alone. That is, the Shapley value divides the intact performance of the system (given by equation A.2) among the elements such that each term is split equally among the elements composing it, yielding a fair division of the system performance.

A.3 Single Lesion Analysis and FCA. The single lesion approach consists of perturbing one element in each experiment and measuring the decrease in performance. Using the same notation as the MSA, the contribution assigned to element i by single lesion analysis is proportional to v(N) − v(N\{e_i}). For the test-case system, this equals (for i = 1)

$$\sigma_1(N, v) = \frac{1}{\lambda_1} - \frac{1}{\lambda_1 + \lambda_2} - \frac{1}{\lambda_1 + \lambda_3} + \frac{1}{\lambda_1 + \lambda_2 + \lambda_3}, \tag{A.4}$$
and similarly for the other elements. Illustrating σ_1 with Venn diagrams, one per term of equation A.4 and in the same order [illustration not reproduced here], shows that under single lesion analysis each element is credited only with the expected time during which it functions alone, disregarding its contribution while other elements were still functioning. The Shapley value is thus much more informative in capturing the true contribution of the elements than single perturbation experiments alone. Figure 5 compares the Shapley value with the single lesion and FCA contributions for the case n = 4 and λ_i = 1/i, for i = 1, . . . , 4. (A concrete example must be used, since no general formulas for the contributions exist in an FCA analysis.)
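The Shapley value of this toy system can also be computed directly from its definition, as the average marginal contribution of an element over all orderings, and checked against the closed form of equation A.3. A minimal sketch (Python; function names and example rates are ours):

```python
import itertools

def v(coalition, lam):
    # Coalition performance by inclusion-exclusion (eqs. A.1-A.2)
    total = 0.0
    for r in range(1, len(coalition) + 1):
        for sub in itertools.combinations(sorted(coalition), r):
            total += (-1) ** (r + 1) / sum(lam[i] for i in sub)
    return total

def shapley(lam):
    # Average marginal contribution of each element over all orderings
    players = sorted(lam)
    perms = list(itertools.permutations(players))
    phi = {i: 0.0 for i in players}
    for perm in perms:
        before = set()
        for i in perm:
            phi[i] += v(before | {i}, lam) - v(before, lam)
            before.add(i)
    return {i: s / len(perms) for i, s in phi.items()}

lam = {1: 1.0, 2: 0.5, 3: 0.25}   # illustrative rates
phi = shapley(lam)
# Closed form of equation A.3 for element e1:
gamma1 = (1 / lam[1] - 0.5 / (lam[1] + lam[2]) - 0.5 / (lam[1] + lam[3])
          + (1 / 3) / (lam[1] + lam[2] + lam[3]))
print(phi[1], gamma1)
```

The permutation average reproduces the closed form, and the contributions sum to the intact performance v(N), as the efficiency axiom requires.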
Figure 5: Comparison between the FCA (light gray bars), single lesion analysis (dark gray bars), and the MSA (black bars) on the test case. The FCA contributions are means and standard deviations across 10 FCA runs. Both the Shapley value and the FCA are based on the full set of all 2^4 = 16 perturbation configurations. The Shapley value, the single lesion contributions, and the FCA contributions within each of the 10 runs are normalized so that their sum equals 1.
Evidently, the FCA contributions and the Shapley value differ significantly for three of the four elements, even when the large standard deviations of the former are taken into account. The FCA contributions resemble those assigned by single lesion analysis, indicating that the FCA, too, fails to capture a fair attribution of contributions in this case.
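The normalized Shapley and single-lesion contributions for the test case λ_i = 1/i, i = 1, . . . , 4 (the two closed-form curves of Figure 5; the FCA has no closed form) can be recomputed directly. A sketch with illustrative function names:

```python
import itertools

def v(coalition, lam):
    # Coalition performance by inclusion-exclusion (eqs. A.1-A.2)
    total = 0.0
    for r in range(1, len(coalition) + 1):
        for sub in itertools.combinations(sorted(coalition), r):
            total += (-1) ** (r + 1) / sum(lam[i] for i in sub)
    return total

def shapley(lam):
    # Exact Shapley value: average marginal contribution over all orderings
    players = sorted(lam)
    perms = list(itertools.permutations(players))
    phi = {i: 0.0 for i in players}
    for perm in perms:
        before = set()
        for i in perm:
            phi[i] += v(before | {i}, lam) - v(before, lam)
            before.add(i)
    return {i: s / len(perms) for i, s in phi.items()}

def single_lesion(lam):
    # Performance drop when each element is removed on its own (eq. A.4)
    N = set(lam)
    return {i: v(N, lam) - v(N - {i}, lam) for i in lam}

def normalize(d):
    s = sum(d.values())
    return {i: x / s for i, x in d.items()}

lam = {i: 1.0 / i for i in range(1, 5)}   # the test case: lambda_i = 1/i
sh = normalize(shapley(lam))
sl = normalize(single_lesion(lam))
print(sh)
print(sl)
```

Running this reproduces the qualitative pattern of Figure 5: the long-lived element 4 dominates both measures, but single lesion analysis credits it even more heavily because it ignores time shared with other functioning elements.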
Acknowledgments

We acknowledge the valuable contributions and suggestions made by Ranit Aharonov, Shay Cohen, Alon Kaufman, and Ehud Lehrer. This research has been supported by the Adams Super Center for Brain Studies at Tel Aviv University and by the Israel Science Foundation, founded by the Israel Academy of Sciences and Humanities.
References

Adolphs, R., Damasio, H., Tranel, D., Cooper, G., & Damasio, A. R. (2000). A role for somatosensory cortices in the visual recognition of emotion as revealed by three-dimensional lesion mapping. Journal of Neuroscience, 20(7), 2683–2690.
Aharonov, R., Segev, L., Meilijson, I., & Ruppin, E. (2003). Localization of function via lesion analysis. Neural Computation, 15(4), 885–913.
Banzhaf, J. F. (1965). Weighted voting doesn't work: A mathematical analysis. Rutgers Law Review, 19, 317–343.
Bates, E., Wilson, S. M., Saygin, A., Dick, F., Sereno, M. I., Knight, R. T., & Dronkers, N. F. (2003). Voxel-based lesion-symptom mapping. Nature Neuroscience, 6(5), 448–450.
Billera, L. J., Heath, D., & Raanan, J. (1978). Internal telephone billing rates—a novel application of non-atomic game theory. Operations Research, 26, 956–965.
Brodin, L., Grillner, S., & Rovainen, C. M. (1985). N-methyl-D-aspartate (NMDA), kainate and quisqualate receptors and the generation of fictive locomotion in the lamprey spinal cord. Brain Research, 325(1–2), 302–306.
Buchanan, J. T. (1999). Commissural interneurons in rhythm generation and intersegmental coupling in the lamprey spinal cord. Journal of Neurophysiology, 81(5), 2037–2045.
Buchanan, J. T. (2001). Contributions of identifiable neurons and neuron classes to lamprey vertebrate neurobiology. Progress in Neurobiology, 63(4), 441–466.
Buchanan, J. T. (2002). Spinal cord of lamprey: Generation of locomotor patterns. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Buchanan, J. T., & Grillner, S. (1987). Newly identified "glutamate interneurons" and their role in locomotion in the lamprey spinal cord. Science, 236(4799), 312–314.
Couzin, J. (2002). Small RNAs make big splash. Science, 298, 2296–2297.
Dubey, P., Neyman, A., & Weber, R. J. (1981). Value theory without efficiency. Mathematics of Operations Research, 6(1), 122–128.
Ekeberg, O. (1993). A combined neuronal and mechanical model of fish swimming. Biological Cybernetics, 69(5–6), 363–374.
Farah, M. J. (1996). Is face recognition "special"? Evidence from neuropsychology. Behavioral Brain Research, 76, 181–189.
Feigenbaum, J., Papadimitriou, C. H., & Shenker, S. (2001). Sharing the cost of multicast transmissions. Journal of Computer and System Sciences, 63(1), 21–41.
Friedman, J. H., & Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76(376), 817–823.
Gefeller, O., Land, M., & Eide, G. E. (1998). Averaging attributable fractions in the multifactorial situation: Assumptions and interpretation. Journal of Clinical Epidemiology, 51(5), 437–441.
Grabisch, M., & Roubens, M. (1999). An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28, 547–565.
Grillner, S., Deliagina, T., Ekeberg, O., Manira, A. E., Hill, R. H., Lansner, A., Orlovsky, G. N., & Wallen, P. (1995). Neural networks that coordinate locomotion and body orientation in lamprey. Trends in Neurosciences, 18(6), 270–279.
Grillner, S., Wallen, P., & Brodin, L. (1991). Neuronal network generating locomotor behavior in lamprey: Circuitry, transmitters, membrane properties, and simulation. Annual Review of Neuroscience, 14, 169–199.
Hammond, S. M., Caudy, A. A., & Hannon, G. J. (2001). Post-transcriptional gene silencing by double-stranded RNA. Nature Reviews Genetics, 2(2), 110–119.
Hilgetag, C. C., Kotter, R., Theoret, H., Classen, J., Wolters, A., & Pascual-Leone, A. (2003). A bilateral competitive network for visual spatial attention in humans. Neurocomputing, 52–54, 793–798.
Hilgetag, C. C., Kotter, R., & Young, M. P. (1999). Inter-hemispheric competition of sub-cortical structures is a crucial mechanism in paradoxical lesion effects and spatial neglect. Progress in Brain Research, 121, 121–141.
Hilgetag, C. C., Lomber, S. G., & Payne, B. R. (2000). Neural mechanisms of spatial attention in the cat. Neurocomputing, 38, 1281–1287.
Hilgetag, C. C., Theoret, H., & Pascual-Leone, A. (2001). Enhanced visual spatial attention ipsilateral to rTMS-induced virtual lesions of human parietal cortex. Nature Neuroscience, 4(9), 953–957.
Kapur, N. (1996). Paradoxical functional facilitation in brain-behaviour research: A critical review. Brain, 119, 1775–1790.
Keinan, A., Meilijson, I., & Ruppin, E. (2003). Controlled analysis of neurocontrollers with informational lesioning. Philosophical Transactions of the Royal Society of London: Series A, 361(1811), 2123–2144.
Kosslyn, S. M. (1999). If neuroimaging is the answer, what is the question? Philosophical Transactions of the Royal Society of London: Series B, 354, 1283–1294.
Lomber, S. G., Payne, B. R., & Cornwell, P. (2001). Role of the superior colliculus in analyses of space: Superficial and intermediate layer contributions to visual orienting, auditory orienting, and visuospatial discriminations during unilateral and bilateral deactivations. Journal of Comparative Neurology, 441, 44–57.
Lomber, S. G., Payne, B. R., Hilgetag, C. C., & Rushmore, R. J. (2002). Restoration of visual orienting into a cortically blind hemifield by reversible deactivation of posterior parietal cortex or the superior colliculus. Experimental Brain Research, 142, 463–474.
Lomber, S. G., Payne, B. R., & Horel, J. A. (1999). The cryoloop: An adaptable reversible cooling deactivation method for behavioral or electrophysiological assessment of neural function. Journal of Neuroscience Methods, 86, 179–194.
Myerson, R. B. (1977). Graphs and cooperation in games. Mathematics of Operations Research, 2, 225–229.
Myerson, R. B. (1980). Conference structures and fair allocation rules. International Journal of Game Theory, 9, 169–182.
Parker, D., & Grillner, S. (2000). Neuronal mechanisms of synaptic and network plasticity in the lamprey spinal cord. Progress in Brain Research, 125, 381–398.
Pascual-Leone, A., Wasserman, E., Davey, N., & Rothwell, J. (2002). Handbook of transcranial magnetic stimulation. New York: Oxford University Press.
Rafal, R. (2001). Virtual neurology. Nature Neuroscience, 4(9), 862–864.
Roth, A. E. (1979). Axiomatic models of bargaining. Berlin: Springer-Verlag.
Segev, L., Aharonov, R., Meilijson, I., & Ruppin, E. (2003). High-dimensional analysis of evolutionary autonomous agents. Artificial Life, 9(1), 1–20.
Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the theory of games (Vol. 2, pp. 307–317). Princeton, NJ: Princeton University Press.
Shapley, L. S., & Shubik, M. (1954). A method for evaluating the distribution of power in a committee system. American Political Science Review, 48(3), 787–792.
Shubik, M. (1962). Incentives, decentralized control, the assignment of joint costs and internal pricing. Management Science, 8, 325–343.
Shubik, M. (1985). Game theory in the social sciences. Cambridge, MA: MIT Press.
Sprague, J. M. (1966). Interaction of cortex and superior colliculus in mediation of visually guided behavior in the cat. Science, 153, 1544–1547.
Squire, L. R. (1992). Memory and the hippocampus: A synthesis of findings with rats, monkeys, and humans. Psychological Review, 99, 195–231.
Vallar, G. (1998). Spatial hemineglect in humans. Trends in Cognitive Sciences, 2, 87–97.

Received May 30, 2003; accepted February 13, 2004.
LETTER
Communicated by Ennio Mingolla
Recurrent Network with Large Representational Capacity

Drazen Domijan
[email protected]
Department of Psychology, Faculty of Philosophy, University of Rijeka, Trg Ivana Klobucarica 1, HR-51000 Rijeka, Croatia
A recurrent network is proposed with the ability to bind image features into a unified surface representation within a single layer and without capacity limitations or border effects. A group of cells belonging to the same object or surface is labeled with the same activity amplitude, while cells in different groups are kept segregated by lateral inhibition. Labeling is achieved by activity spreading through local excitatory connections. In order to prevent uncontrolled spreading, a separate network computes the intensity difference between neighboring locations and signals the presence of a surface boundary, which constrains local excitation. Owing to self-excitation, the quality of the surface representation is not compromised. The model is also applied to gray-level images. In order to remove small, noisy regions, a feedforward network is proposed that computes the size of surfaces. Size estimation is based on the difference of dendritic inhibition in lateral excitatory and inhibitory pathways, which allows the network to selectively integrate signals only from cells with the same activity amplitude. When the output of the size estimation network is combined with the recurrent network, good segmentation results are obtained. Both networks are based on biophysically realistic mechanisms such as dendritic inhibition and multiplicative integration among different dendritic branches.

1 Introduction

An important area of the study of the neural basis of visual perception is how the external world is represented in the brain. In particular, does the brain construct a detailed representation of the whole visual scene, or does it use a sparse representation with only a few of the most salient objects? The phenomenon of change blindness has been taken as evidence for the sparse representation (Rensink, 2000, 2002). It refers to our inability to detect large changes between successive views of the same scene (Simons & Levin, 1997).
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1917–1942 (2004)

However, recent psychophysical and neuroimaging studies imply that a full representation of the visual scene could be formed despite the failure to detect the change (Simons, 2000). The failure could be the consequence of interference or overwriting of the prechange representation by the postchange representation (Becker, Pashler, & Anstis, 2000; Landman, Spekreijse, & Lamme, 2003). Another possibility is that the failure results from the limited capacity of the comparison process (Hollingworth & Henderson, 2002; Simons, Chabris, Schnur, & Levin, 2002). An fMRI study of change detection revealed activity in category-specific regions of the ventral stream (e.g., fusiform gyrus) even when participants were unable to report any change in the display (Beck, Rees, Frith, & Lavie, 2001), indicating that unnoticed features of the display are processed to some degree. Another phenomenon relevant to the issue of the representational capacity of the visual system is inattentional blindness (Mack & Rock, 1998). It refers to our inability to perceive a clearly visible stimulus when our attention is directed elsewhere. For instance, when participants are engaged in a demanding perceptual task, they often fail to report even unusual objects appearing in the visual field. Wolfe (1999) argued that inattentional blindness does not necessarily imply sparseness of the visual representation but rather reflects limitations of working memory that prevent participants from reporting properties of objects outside the focus of attention. In a similar vein, Moore and Egeth (1997) showed that perceptual grouping is not disrupted when attention is directed elsewhere, because the Müller-Lyer illusion is experienced even when parts of the figures are not attended. In a recent review of findings regarding inattentional blindness, Mack (2003) concluded that inattentional blindness is a well-established phenomenon but that further research should be done in order to assess what happens with unattended objects and surfaces. Other experimental paradigms are also consistent with the view that although a detailed representation exists in the brain, not all aspects of it will enter conscious experience.
For instance, van Rullen and Koch (2003) tried to estimate how many objects in a visual scene could be explicitly represented in the consciousness. They found that memory for briefly presented visual objects is limited to about four items, but ignored objects show significant negative priming. This means that the ignored objects are processed up to the level of identification, but their representation is actively inhibited by attentional selection, which is manifested behaviorally as interference (i.e., negative priming). Driver, Davis, Russell, Turatto, and Freeman (2001) provide a detailed review of evidence concerning the relation between attention and visual segmentation. They suggest that the segmentation could be influenced by the allocation of attention, but this does not mean that segmentation is completely constrained by attention. On the contrary, they provide evidence that segmentation proceeds even in a condition of inattention. Their conclusion is consistent with the theory of perceptual organization advocated by Palmer and Rock (1994), who argued that the segmentation of a visual scene into candidate regions precedes limited capacity attentional selection. Further support for this conclusion is obtained with single-unit recordings from primate visual areas V1 and V2 showing that figure-ground response is not altered by the movement of attention to
Representational Capacity
1919
or away from the target surface (Marcus & van Essen, 2002). A recent visual search study also pointed to the independence of attention and segmentation by showing that figures are segregated from a background in a single preattentive step that is followed by a serial search for targets among the figures (Wolfe, Oliva, Horowitz, Butcher, & Bompas, 2002). Not all agree on this point. For instance, Rensink (2002) suggests that a dynamic visual representation is formed only when attention is focused on a particular surface and that detailed representation might be just an illusion. According to his coherence theory, early visual processing does not have enough spatial and temporal coherence to sustain a detailed representation. Furthermore, influence of attention and object recognition on segmentation is observed in figure-ground assignment tasks, where it has been shown that an ambiguous border is attributed to the surface more familiar to the observer (Peterson, 1994; Peterson & Gibson, 1994a, 1994b). Recently, Vecera, Flevaris, and Filapek (2004) showed that exogenous cues also influence figure-ground assignment and can even override bottom-up Gestalt cues. Based on these findings, Vecera (2000; Vecera & O'Reilly, 1998) proposed an interactive architecture in which attention, object recognition, and segmentation mutually influence one another in an attempt to recover surface borders. Although the interactive account explains familiarity effects in figure-ground segregation, it is not clear how segmentation is performed when there are no familiarity cues or there is conflict between bottom-up cues. Also, a number of methodological and conceptual issues regarding the relationship between vision and top-down processes complicate the interpretation of behavioral data. A detailed review of these issues, and a more general discussion of the cognitive penetrability of visual perception, is provided by Pylyshyn (1999) and his commentators.
Although the data do not provide a definite answer, they do suggest that it is plausible to assume that a detailed representation of the visual scene is formed in the brain. From a computational point of view, it is important to elucidate the candidate neural mechanisms that are able to support such a representation. According to the temporal correlation hypothesis, the timing of neural firing is used to represent different functional groups (Milner, 1974; von der Malsburg, 1981; von der Malsburg & Schneider, 1986; Gray, 1999; Singer, 1999; Singer & Gray, 1995). Cells that are activated at the same time are assumed to code the features of the same object or surface. Based on this idea, Wang and Terman (1997; Terman & Wang, 1995) introduced a locally excitatory, globally inhibitory oscillatory network (LEGION) that segments input through synchronization between oscillatory cells that code the same object and desynchronization between cells that code different objects. Although LEGION has been successfully tested on realistic images and applied to texture and motion segmentation (Cesmeli & Wang, 2000, 2001), it has several limitations. Because a single oscillator cannot be active or inactive for an arbitrarily long time, LEGION suffers from segmentation capacity limitations (i.e., an inability to represent an arbitrary number of different surfaces). It is estimated that LEGION cannot segment more than five objects (Wang & Terman, 1997). The second problem is that the time to segment the image depends on the number of objects in the scene. Furthermore, the quality of pattern separation at region borders is compromised due to local coupling, and additional mechanisms (for example, dynamic normalization) are invoked in order to improve it (Wang & Terman, 1997). Moreover, the biological plausibility of the temporal correlation hypothesis is still questionable (Shadlen & Movshon, 1999). Another possibility is to use the amplitude of cell activity to represent image segments. Opara and Wörgötter (1998) used the Potts spin model to represent the results of segmentation. In the Potts model, a spin can assume a number of discrete states, and spins that share the same state are assumed to code a single object. The problem with this approach is that it is not clear how discrete activity labels, such as spin states, could arise in continuous-time network dynamics. Wersing, Steil, and Ritter (2001) proposed a competitive layer model (CLM) that separates image regions into different network layers. The capacity issue is avoided at the cost of introducing many identical network layers. In the context of figure-ground separation, Grossberg and Wyse (1991) suggested a similar solution, called a feature-boundary-feature (FBF) network. Like LEGION, the CLM also suffers from a border effect, which is resolved using a time-consuming annealing procedure. The aim of this letter is to describe a single-layer lateral inhibitory network for pattern separation, based on a firing-rate code, that simultaneously represents an arbitrary number of image segments without border effects. This is achieved by incorporating mechanisms of dendritic inhibition and multiplicative interactions at dendritic branches into the description of the cell dynamics.

2 Model Description

2.1 Labeling Network.
The model is based on a standard network architecture for competition in sensory and cognitive processing, with self-excitation and lateral inhibition (Grossberg, 1988; Kaski & Kohonen, 1994). The standard model is augmented with dendritic inhibition, which modulates the strength of lateral inhibition. The basic elements of the network are excitatory neurons with a self-feedback connection and two inhibitory interneurons, denoted lateral and dendritic. Network interactions are illustrated in Figure 1A. The lateral interneuron, with extended dendrites, mediates lateral inhibition from the network to the excitatory neuron. Every dendrite receives excitation from one excitatory cell in the network and inhibition from the dendritic interneuron, which serves as protection from lateral inhibitory influences. The dendritic interneuron receives input from an excitatory cell and distributes inhibition to all dendrites of the lateral
Figure 1: Labeling network for image segmentation. (A) Recurrent network with self-excitatory and lateral inhibitory connections. Open circles are excitatory cells, and filled circles are inhibitory interneurons. Lines with T-endings are dendrites. (B) Boundary detection system with local excitatory pathways to dendritic interneurons.
interneuron. In this way, the dendritic interneuron disables any inhibitory influence from less active cells; cells more active than the target excitatory cell will override the dendritic inhibition and deliver their inhibitory signals. The mechanism of dendritic inhibition assumes that dendrites operate as independent computational devices with their own input-output relationships (Häusser & Mel, 2003; Poirazi, Brannon, & Mel, 2003; Spratling & Johnson, 2001, 2002). The output of their computation is passed to the cell, which integrates signals from the different dendrites. In order to form a group of cells that represent the same object with the same activity value, excitatory signals are spread from more active cells to less active cells. Activity spreading is achieved through local links from the excitatory cell to the dendritic interneurons of nearest-neighbor excitatory cells. Activation spreading is allowed if there is no boundary between the cells. A separate set of nodes detects boundary presence and interacts multiplicatively with the local excitatory links to prevent activity spreading across the boundary, as shown in Figure 1B. Boundary presence is signaled if there is a sufficiently large difference in input intensities between neighboring locations. The boundary detection system can be understood as a simplified variant of the boundary contour system (BCS) proposed by Grossberg and Mingolla (1985). Mathematically, the model is expressed as a set of nonlinear differential equations:

$$\frac{dx_{ij}}{dt} = -A\,x_{ij} + (B - x_{ij})\left(I(t)\,w_{ij} + f(x_{ij})\right) - (C + x_{ij})\,h(y_{ij}), \tag{2.1}$$

$$\frac{dy_{ij}}{dt} = -y_{ij} + \sum_{pq \neq ij} g\left[\,f(x_{pq}) - h(z_{ij}) - T_y\,\right], \tag{2.2}$$

$$\frac{dz_{ij}}{dt} = -z_{ij} + f(x_{ij}) + S_{ij}, \tag{2.3}$$
where x_ij, y_ij, and z_ij denote, respectively, the excitatory cell, the lateral interneuron, and the dendritic interneuron at spatial position {i, j}, for all i = 1, . . . , M and j = 1, . . . , N. The excitatory cell is modeled with a nonlinear shunting equation that has a physiological interpretation (Grossberg, 1988). Parameter A in equation 2.1 is the passive decay, which drives activity toward zero if there is no input; B (C) defines the excitatory (inhibitory) saturation point, that is, the upper (lower) bound on the activity level that can be attained. In order to obtain full representational capacity, the excitatory bound B should be set equal to M × N. The transient input I(t) initializes network activity; it is defined as I(t) = 0.1 if t < t_m and I(t) = 0 for t ≥ t_m, where t_m > 0. Synaptic weights from the transient node to the excitatory cells are set to

$$w_{ij} = i + (j - 1)N \tag{2.4}$$
for all i and j. Therefore, nodes in the network are activated with a gradient of input values, where every node has a distinct activity amplitude that corresponds to its position in the network. This is equivalent to the random Gaussian noise for oscillatory units in LEGION (Wang & Terman, 1997), to the dye injections in the FBF network (Grossberg & Wyse, 1991), or to the different visual latencies for pixels with different luminance in the spin-lattice model (Opara & Wörgötter, 1998). For instance, suppose that M = N = 5. Then,
      | 1  6 11 16 21 |
      | 2  7 12 17 22 |
wij = | 3  8 13 18 23 |.  (2.5)
      | 4  9 14 19 24 |
      | 5 10 15 20 25 |
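For concreteness, the gradient of equation 2.4 can be generated with a few lines of Python (a minimal sketch; the 1-based indexing follows the text):

```python
# Transient-input weights of equation 2.4: each cell receives a distinct
# label w_ij = i + (j - 1) * N, using 1-based indices i, j as in the text.
M = N = 5
w = [[i + (j - 1) * N for j in range(1, N + 1)] for i in range(1, M + 1)]

# Every cell receives a distinct value between 1 and M * N.
labels = {v for row in w for v in row}
assert labels == set(range(1, M * N + 1))
assert w[0] == [1, 6, 11, 16, 21]  # first row of the matrix in equation 2.5
```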
This type of coding seems too artificial to be implemented in the brain because it is independent of any visual attributes. However, it could be enhanced by the inclusion of direct input values (i.e., luminance), as in I(t)(wij + Jij). In that case, the network becomes sensitive to the magnitude of the input Jij, but it remains sensitive to the abstract distinction between different surfaces. Therefore, lighter surfaces will receive labels with higher amplitudes due to Jij, but if there are two or more distinct surfaces with the same input, they will still be labeled with different activity values. Unfortunately, interference may occur between the two codes, and the weights (wij) should be normalized to some small interval in order to prevent it. Another possibility is to use a two-layer architecture, where the first layer is sensitive only to luminance and its output is sent to the second layer, which uses the abstract code for the objects' identity. Linear output functions with half-rectification for excitatory and inhibitory cells are defined as f(r) = h(r) = r for r > 0 and f(r) = h(r) = 0 for r ≤ 0. Rectification is necessary in a biologically plausible model because it prevents excitatory connections from becoming inhibitory, and vice versa. Self-excitation is included in the model in order to prevent activation decay after the transient input is shut down and to improve the quality of the object representation. The function g() describes the dendritic output function; it is defined as g(r) = 1 if r > 0 and g(r) = 0 if r ≤ 0. Other functions may be used as well (e.g., a sigmoid function), but the important point is that g() must have an upper bound, because too strong an inhibition would lead to winner-take-all behavior. The threshold Ty controls the minimal difference between the activity of the excitatory cell and the activity of the dendritic interneuron that is necessary to influence cell yij.
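Stated as code, the output functions just described are simply (a trivial sketch):

```python
# Half-rectified linear outputs for excitatory (f) and inhibitory (h) cells,
# and the bounded step function g() used as the dendritic output.
def f(r):
    # excitatory output: half-rectified linear
    return r if r > 0 else 0.0

h = f  # the inhibitory output h() is defined identically

def g(r):
    # dendritic output: 1 for positive input, 0 otherwise (bounded above)
    return 1.0 if r > 0 else 0.0

assert f(-2.0) == 0.0 and f(1.5) == 1.5   # rectification
assert g(0.01) == 1.0 and g(0.0) == 0.0   # bounded output prevents winner-take-all
```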
The network is fully connected, and the sum in equation 2.2 is taken over all network cells p, q except i, j. The lateral and dendritic inhibitory connections are assumed to be of unit strength, and their weights are not explicitly represented in the
1924
D. Domijan
model. The term Sij in equation 2.3 is defined as

Sij = Σ_{mn} νmnij 2k[ f(xmn) − h(zij) ],  (2.6)

where m and n range over the four nearest-neighbor locations {(i + 1, j), (i − 1, j), (i, j + 1), (i, j − 1)}. It describes excitatory activity arriving from the local surround xmn to the dendritic interneuron through its dendrites. This activity is also subject to dendritic inhibition from the dendritic interneuron (zij), which prevents its excessive excitation (see Figure 1B). The dendrites of the dendritic interneuron have a half-rectified linear output function, k(r) = r for r > 0 and k(r) = 0 for r ≤ 0. Furthermore, the excitatory links are multiplicatively gated by

νmnij = g(Jmn − Jij + Tν) g(Jij − Jmn + Tν).  (2.7)
This is a simplified variant of the BCS, which detects the intensity difference between neighboring pixels Jmn and Jij of the input image J. The terms in equation 2.7 correspond to two dendritic branches, which independently compute a signal difference between excitatory and inhibitory input, as shown in Figure 1B. They interact multiplicatively in order to achieve contrast sensitivity without sensitivity to contrast polarity (Grossberg & Mingolla, 1985). The nonlinear dendritic output function g() is defined as for equation 2.2. When the intensity difference is larger than the excitatory tonic signal Tν, one of the terms in equation 2.7 drops to 0, and activity spreading is disabled. If the difference is smaller than Tν, both terms will be 1, which allows an exchange of excitatory signals between neighboring cells. The tonic signal Tν controls the sensitivity of the boundary detection system to intensity differences. It may be considered an internal property of the dendrites, or it may arise from a common external source. Multiplicative interactions are assumed to arise at junctions of dendritic branches (Mel, 1994). The input image J does not influence the excitatory cells directly; rather, it controls activity spreading through the boundary detection system (other possibilities are discussed above). The network description can be simplified if we assume that the inhibitory interneurons quickly reach their equilibrium values. Then we can replace the system of equations 2.1 to 2.3 with a single equation:

dxij/dt = −Axij + (B − xij)(I(t)wij + f(xij)) − (C + xij) Σ_{pq≠ij} g[ f(xpq) − h(xij) − Sij ].  (2.8)
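The gating rule of equation 2.7 is easy to state in code (a minimal sketch; the intensity values and the choice Tν = .5 are illustrative):

```python
def g(r):
    # dendritic output function: 1 for positive input, 0 otherwise
    return 1.0 if r > 0 else 0.0

def nu(J_mn, J_ij, T_nu):
    # Equation 2.7: two dendritic branches combined multiplicatively,
    # giving contrast sensitivity without sensitivity to contrast polarity.
    return g(J_mn - J_ij + T_nu) * g(J_ij - J_mn + T_nu)

T_nu = 0.5
assert nu(10.0, 10.2, T_nu) == 1.0  # small difference: spreading allowed
assert nu(10.0, 12.0, T_nu) == 0.0  # large difference: boundary blocks spreading
assert nu(12.0, 10.0, T_nu) == 0.0  # ...regardless of contrast polarity
```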
Further simplification is useful when applying the network to real images. Solving the large system of ordinary differential equations is time-consuming, and an algebraic approximation of the real-time process greatly
reduces the computational costs. Spreading activation in the labeling network is approximated with the following algorithm:

1. Initialize xij = wij;  (2.9)

2. Iterate until convergence: xij = Max(xij, xmn νmnij);  (2.10)
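As a concrete illustration, this two-step procedure can be sketched with NumPy (a sketch under stated assumptions: the grid size, the boundary placement, and the helper name `label_regions` are illustrative, and the gates ν are precomputed per edge rather than per cell pair):

```python
import numpy as np

def label_regions(w, nu_h, nu_v, max_iter=500):
    # Step 1 (equation 2.9): start from the distinct gradient labels.
    x = w.astype(float)
    # Step 2 (equation 2.10): repeatedly take the max over gated neighbors.
    for _ in range(max_iter):
        nxt = x.copy()
        nxt[:, :-1] = np.maximum(nxt[:, :-1], x[:, 1:] * nu_h)   # from right neighbor
        nxt[:, 1:] = np.maximum(nxt[:, 1:], x[:, :-1] * nu_h)    # from left neighbor
        nxt[:-1, :] = np.maximum(nxt[:-1, :], x[1:, :] * nu_v)   # from lower neighbor
        nxt[1:, :] = np.maximum(nxt[1:, :], x[:-1, :] * nu_v)    # from upper neighbor
        if np.array_equal(nxt, x):  # converged: every region carries its max label
            break
        x = nxt
    return x

# Toy example: a 3 x 4 grid split into two regions by a vertical boundary.
M, N = 3, 4
w = np.arange(1, M * N + 1, dtype=float).reshape(M, N)  # distinct labels (eq. 2.4)
nu_h = np.ones((M, N - 1))  # gates on horizontal edges: 1 = open, 0 = boundary
nu_v = np.ones((M - 1, N))  # gates on vertical edges
nu_h[:, 1] = 0.0            # boundary between columns 1 and 2
x = label_regions(w, nu_h, nu_v)
```

After convergence, each closed region carries a single label, the maximum of its initial gradient values: here the left region settles to 10 and the right region to 12.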
where indices m and n are defined above, in equation 2.6. The algorithm does not exactly reproduce the activity amplitudes obtained with the real-time version of the model, but it labels distinct closed regions with distinct activity values. All locations within the same region are labeled with the same activity value.

2.2 Size Estimation Network. Application of the labeling network to gray-level images results in fragmented output with many small regions that do not correspond to significant objects in the visual scene. In order to remove this noise, an additional network is proposed that estimates the size of the surfaces segmented by the labeling network. It is analogous to the lateral potential in LEGION (Wang & Terman, 1997). While the lateral potential is directly embedded in the LEGION dynamics, the size estimation network operates independently of the labeling network. The size estimation network is shown in Figure 2. It combines excitatory and inhibitory connections to distinguish signals originating from different objects. Each cell in the size estimation network receives excitatory input from a corresponding input cell and excitatory and inhibitory signals from all other input cells. Input arrives from the labeling network. As in the labeling network, an additional dendritic inhibitory pathway is introduced, which influences the direct excitatory and inhibitory pathways before they reach the cell. Mathematically, the model is described by

dXij/dt = −Xij + g(xij) + Σ_m Σ_n g(xmn − Zij + TE) − Yij,  (2.11)

dYij/dt = −Yij + Σ_m Σ_n g(xmn − Zij + TI),  (2.12)

dZij/dt = −Zij + xij,  (2.13)
where Xij, Yij, and Zij denote, respectively, the excitatory cell, the lateral interneuron, and the dendritic interneuron at spatial position {i, j} for all i = 1, …, M and j = 1, …, N in the size estimation network.

Figure 2: Size estimation network. Open circles are excitatory cells, and filled circles are inhibitory interneurons. Lines with T-endings are dendrites. The bottom layer of excitatory cells represents input from the labeling network (xij). The connection pattern is shown for a single cell in the size estimation network.

The dynamics of the cells is described using a linear additive model (Grossberg, 1988). The term xij is the direct input to the cell from the labeling network and is not subject to any activity modulation; xmn represents lateral input, which is modulated by dendritic inhibition (−Zij) before it can perturb the cell. Indices m and n represent all spatial locations except i and j; the function g(r) is the dendritic output function defined above; TE (TI) is the threshold for the lateral excitatory (inhibitory) pathway. The constraint on the threshold values is given by the following inequality:

0 = TI < TE < D,  (2.14)
where D is a positive constant representing a measure of the distinctiveness of the labeled object representation, defined as the smallest difference between object labels. This constraint allows cells to be sensitive to lateral input with the same label as the direct input, since the threshold for the excitatory pathway is set to a positive value. When the activity amplitude is equal in an excitatory pathway and in a dendritic inhibitory pathway, a positive output is produced, since the threshold in the inhibitory pathway prevents lateral inhibition from influencing the target cell. All weights in the network are set to unit value, and they are not explicitly represented. All cells in the size estimation network receive input from all cells in the labeling network. There is no intralayer connectivity, so activity spreads in a strictly feedforward manner. Therefore, the activity of the
size estimation network can be approximated with an algebraic form:

Xij = g(xij) + Σ_m Σ_n g(xmn − xij + TE) − Σ_m Σ_n g(xmn − xij + TI).  (2.15)
In order to remove small, noisy regions, the output of the size estimation network is thresholded and multiplicatively combined with the output of the labeling network to produce the final segmentation result (Segij):

Segij = g(Xij − TX) xij,  (2.16)
where TX represents a threshold for the size estimation network. Therefore, all surfaces with a size smaller than TX will not be represented in the final output.

3 Computer Simulations

3.1 Synthetic Images. The behavior of the real-time version of the labeling network (equations 2.4–2.8) is illustrated using small synthetic images (see Figure 3). The input is a 15 × 15 array of pixels whose intensities vary in the range 0 to 225. Network parameters are set to A = .01; B = 225; C = 0; Ty = .1; Tν = .5; tm = .1. The differential equations are solved using the Euler method with a time step of .001. In Figure 3A, four squares of equal size are presented. Initially, all cells have different activity values due to the transient tonic stimulation. Dendritic inhibition of the lateral inhibitory signals preserves the initial activity differences in the absence of local excitation. The node that receives the strongest activation will inhibit all other nodes, but through its interneuron it will indirectly inhibit all inhibitory signals from other nodes, and therefore it will not receive any inhibition. Excitatory self-feedback will drive its activity close to the maximum possible level B (here it is assumed that the decay parameter A is negligible). The second largest node will also prevent inhibition from all other units except the largest one. The largest node will override the dendritic inhibition and deliver one unit of inhibition to the second largest node, which will settle approximately at the activity level B − 1. In the same manner, the third largest node will settle at B − 2 because it receives two units of inhibition, one from the largest node and one from the second largest node; the fourth largest node will settle at B − 3; and so on. However, local excitatory signals override inhibition from nearby locations and allow activity to spread through a whole region bounded by signals from the boundary detection system.

Figure 3: Computer simulations that illustrate the behavior of the labeling network. The first row is the input image (Jij), with maximal intensity (=225) denoted as white and minimal intensity (=1) as black. The second row is the response of the boundary detection system (νmnij), depicted as a set of lines between neighboring locations {m, n} and {i, j}; line length encodes the magnitude of the response. The bottom row is the output of the excitatory cells of the network (xij), where maximal activity is denoted as white and minimal as black. (A) Four identical squares with background. (B) Four connected squares with a separate central region. (C) Overlapping squares. (D) Fragmented image with a distinct intensity value for every pixel.

All cells in the interior of a single square will converge on the value of the cell that initially has the strongest activation in the group. Since inhibition from cells representing different squares cannot be overridden, different squares are painted with different activity amplitudes, denoted by different shades of gray. The background is painted white because the cell at location {15, 15}, which is initialized with the strongest input, reaches the maximal possible activity value (225) and spreads it to all network locations outside the closed borders belonging to the squares. An important point is that, due to the self-excitation and the unit inhibition from nodes belonging to different segments, there is no edge effect; that is, the activity level of cells representing a single image region is the same at all corresponding locations. If four squares are connected, they will be treated as a unitary object, as shown in Figure 3B. There is no edge between the squares, and activity spreads
from the square with the highest activity to all other squares. Although the interior of the figure belongs to the background, the labeling network treats it as a separate surface. Therefore, the labeling network is sensitive to the topological structure of the input image in the same manner as LEGION (Wang, 2000). Figure 3C illustrates the network's ability to segment overlapping surfaces. This is a consequence of the boundary detection system, which prevents spreading between squares through multiplicative interactions with the local excitatory signals. However, the model does not have the ability to perform figure-ground separation and amodal completion of partially overlapped surfaces like the FBF network, because all segmented regions are represented in the same network layer. The problem arises from the model's inherent inability to represent depth relations. In order to achieve this, a multilayer architecture should be employed, with different layers corresponding to different depths. Nevertheless, an important point is that the number of network layers would depend on the number of different depth planes, not on the number of objects to be segmented, as in the FBF network (Grossberg & Wyse, 1991). The FBF network would require a different network layer for every surface even if they are at the same depth plane. It would be interesting to establish whether the FBF network and the labeling network could be combined in order to achieve greater flexibility in representing three-dimensional space. Figure 3D shows that the labeling network can represent even maximally fragmented images, where each pixel is treated as a separate object. Such network capacity depends on the choice of parameter B, which is set to a value corresponding to the dimensionality of the network.
However, even if B is restricted to a value considerably lower than the total number of pixels in the network, the capacity can be preserved if the positive value of the function g() is lowered to compensate for the compression. For instance, if B is set to a value that is 10% of the dimensionality of the input image, g() should be defined as g(r) = .1 if r > 0 and g(r) = 0 if r ≤ 0.

3.2 Real Images. The full architecture, including the labeling network and the size estimation network, is applied to real images. Before analyzing the segmentation results, the behavior of the size estimation network is examined in more detail. Consider a simple example including three rectangles of different extent and a background. Suppose that each rectangle is labeled with a different activity level as an output from the labeling network:

    | 1 0 2 0 3 |
    | 1 0 2 0 3 |
x = | 1 0 2 0 0 |.  (3.1)
    | 1 0 2 0 0 |
    | 1 0 0 0 0 |
    | 1 0 0 0 0 |
Given this input, the size estimation network produces the following output:

    |  6 18  4 18  2 |
    |  6 18  4 18  2 |
X = |  6 18  4 18 18 |.  (3.2)
    |  6 18  4 18 18 |
    |  6 18 18 18 18 |
    |  6 18 18 18 18 |
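Equations 2.15 and 2.16 can be checked against this example with a short sketch (an illustrative reconstruction, not the author's code; TX = 3 is chosen here just to filter out the smallest surface, and the exact value obtained for the background depends on how g(0) is treated):

```python
import numpy as np

def g(r):
    # Dendritic output function: 1 for a positive argument, 0 otherwise.
    return (r > 0).astype(float)

def size_estimate(x, T_E=0.5, T_I=0.0):
    # Equation 2.15: each cell adds lateral inputs above threshold T_E and
    # subtracts lateral inputs above threshold T_I; with T_I < T_E < D, only
    # inputs carrying the cell's own label survive the subtraction.
    flat = x.ravel()
    X = np.zeros_like(flat)
    for k in range(flat.size):
        lateral = np.delete(flat, k)              # all other cells
        X[k] = ((flat[k] > 0)                     # direct input g(x_ij)
                + np.sum(lateral - flat[k] + T_E > 0)
                - np.sum(lateral - flat[k] + T_I > 0))
    return X.reshape(x.shape)

def segment(x, X, T_X):
    # Equation 2.16: suppress surfaces whose estimated size is below T_X.
    return g(X - T_X) * x

# The labeled input of equation 3.1 (three rectangles plus background).
x = np.array([[1, 0, 2, 0, 3],
              [1, 0, 2, 0, 3],
              [1, 0, 2, 0, 0],
              [1, 0, 2, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float)

X = size_estimate(x)           # labeled surfaces come out as 6, 4, and 2 cells
seg = segment(x, X, T_X=3.0)   # removes the smallest (2-cell) surface
```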
Parameters for the size estimation network are set to TE = .5 and TI = .0. The same parameter set is used for all results reported below. As can be seen, all cells that represent the same surface have the same level of activity, which corresponds to the size of the given rectangle. This is true even for the background, which was labeled with the zero activity level. In order to understand how such output is produced, three possible cases should be considered. First, if xij = xmn for some spatial locations {i, j} and {m, n}, dendritic inhibition will suppress the inhibitory signal traveling along the pathway from xmn to Xij. On the other hand, dendritic inhibition will not be strong enough to prevent the excitatory signal, because of the higher threshold (TE) in the excitatory pathway. The result will be an excitation of Xij. The amount of excitation is directly related to the number of cells xmn that have the same activity level as xij. In other words, Xij simply counts the number of cells that belong to the same surface as xij. Second, if xij > xmn, dendritic inhibition will prevent activity spreading along the excitatory and inhibitory pathways, and those xmn will not be able to influence the target cell Xij. Third, if xij < xmn, dendritic inhibition is too weak to prevent activity spreading along the excitatory and inhibitory pathways. Both excitatory and inhibitory signals will be delivered to the target cell. However, they will not have any influence on the final cell activity, because they have the same strength and cancel each other. The third case would not be necessary if surfaces were labeled with amplitudes according to their sizes, that is, if larger surfaces were labeled with higher activity levels. In that case, the inhibitory pathways could be excluded from the model. However, it is not reasonable to expect that a labeling procedure could use size information before it is computed by the size estimation network.
Therefore, inhibitory pathways are included in the model to prevent cells with higher activity labels from destroying the size computation of cells with lower activity levels. As a consequence, the model allows arbitrary labeling of the input, as shown in the demonstration, where the largest surface is painted with activity level 1 while the smallest surface is painted with activity value 3. An important constraint of the model is that the input labels should be distinctive (the difference between activity levels of different surfaces should be large enough). In the demonstration, it is assumed to be one unit of activity. Therefore, the threshold on the excitatory pathways should be set to a value above 0 but below D. Otherwise, the model will not correctly distinguish different surfaces.

Figure 4: A gray-level image of an outdoor scene with a resolution of 192 × 128 pixels. (A) Input image. (B) Activation of the labeling network (xij) with Tν = 2.5. (C) Activation of the size estimation network (Xij). (D) Final segmentation result (Segij) with TX = 100.

Figure 4 shows an application of the algorithmic version of the model (equations 2.9–2.10 and 2.15–2.16) to an outdoor image. The image is taken from a database of natural images used to test the properties of independent component filters (available online at http://hlab.phys.rug.nl/imlib/index.html). The database was formed using a Kodak DCS420 digital camera (details of the acquisition procedure can be found in van Hateren and van der Schaaf, 1998). The thumbnail versions of the images consist of 192 × 128 pixels with an intensity range from 0 to 255. Parameters are set to Tν = 2.5 and TX = 100. Different activity amplitudes are represented as different shades of gray. As can be seen in Figure 4D, the algorithm segments several major regions in the image, such as the central path, the left and right meadows, larger and smaller bushes on the right, several bushes on the left, and the sky. In
total, 10 segments are produced, plus a background, which is denoted by the black area. Due to natural illumination, there is a gradient of intensity values from the bottom of the image to the center, as can be seen on the path and on the meadows. However, the algorithm treats these regions as single units because the illumination variations are regular across space and the boundary system does not detect them. The building in the center of the image is not represented as a single segment because it contains irregular variations in intensity values over a small area due to the dark windows on the white coat. This illustrates the model's inability to cope with textured surfaces, because small surfaces are treated as noise and the threshold for the size estimation network, TX, eliminates them. Figure 5 shows an application of the algorithm with the same parameter settings to another image from the same database. This image shows leaves in a forest. There are many leaves that occlude each other and are illuminated
from the upper right corner, making the image very difficult to segment. The combined output of the labeling network and the size estimation network is shown in Figure 5D. There are 10 segmented regions plus the background. The segments correspond to several larger leaves, while smaller leaves are filtered out. The segment in the upper right corner corresponds to open space.

Figure 5: A gray-level image showing leaves in a forest. It is from the same database as the image in Figure 4 and also consists of 192 × 128 pixels. (A) Input image. (B) Activation of the labeling network (xij) with Tν = 2.5. (C) Activation of the size estimation network (Xij). (D) Final segmentation result (Segij) with TX = 100.

Figure 6 shows the effect of a change in the parameter settings for the same image as in Figure 5. The threshold Tν is set to 3.5 and TX = 100 in Figure 6D. Now the algorithm produces 18 segments. If, in addition, the threshold for the size estimation network (TX) is lowered to 50, the output is even more fragmented, with 47 segments (see Figure 6E). Increasing Tν and decreasing TX leads to a more fragmented output. The threshold Tν influences the size of the segments and, implicitly, their number in the final output. A higher value of Tν implies that larger differences between neighboring pixels are necessary to prevent activity spreading in the labeling network. Therefore, the activity spreading will not get stuck on small, insignificant segments. However, there is always a risk that too high a value will produce activity leakage and compromise the segmentation results. On the other hand, TX acts simply as a size filter, which suppresses the representation of smaller segments.

In order to facilitate comparison with LEGION, the algorithm is tested on an MRI (magnetic resonance imaging) image of a human head (available online from the Digital Anatomist Interactive Atlases, http://www9.biostr.washington.edu/cgi-bin/DA/imageform/). The image consists of 450 × 450 pixels (see Figure 7). Due to the larger scale of this image, the thresholds are set to larger values: Tν = 6.5 and TX = 500. The algorithm successfully segments several important regions such as the cerebral cortex, the brain stem, the corpus callosum, the fornix, the septum pellucidum, and the bone marrow.
The entire image was partitioned into 20 regions. There are also several failures. For instance, the cerebellum is not represented as a segment, the white matter is not distinguished from the gray matter, and parts of the cerebral cortex, including the striate and parietal regions, are not included in a single segment. The failures are a consequence of the large variations in image intensity in these regions, which result in a large number of small segments, as indicated by the activity gradients in Figure 7B. Despite these shortcomings, the algorithm exhibits good segmentation performance, comparable with LEGION. However, an important difference between the models is that the labeling network has the inherent capacity to represent arbitrarily many visual segments, while LEGION achieves this only with an algorithmic approximation.

Figure 6: The same gray-level image as in Figure 5, but with different parameter settings. (A) Input image. (B) Activation of the labeling network (xij) with Tν = 3.5. (C) Activation of the size estimation network (Xij). (D) Final segmentation result (Segij) with TX = 100. (E) Segmentation result with TX = 50.

Figure 7: A gray-level image of an MRI scan of a human head. It consists of 450 × 450 pixels (available online from the Digital Anatomist Interactive Atlases). (A) Input image. (B) Activation of the labeling network (xij) with Tν = 6.5. (C) Activation of the size estimation network (Xij). (D) Final segmentation result (Segij) with TX = 500.

4 Discussion

Natural images contain a large number of objects, which poses a representational problem for image segmentation algorithms based on neural networks. Models based on oscillatory correlation, such as LEGION, have a limited capacity to separate patterns (Wang & Terman, 1997). Even if a large capacity were possible, it would require a very long time to complete segmentation. On the other hand, models based on an amplitude code, such as FBF or CLM, avoid the capacity issue by projecting different segments to different network layers (Grossberg & Wyse, 1991; Wersing et al., 2001). This strategy
implies either that the number of objects is known in advance or that the number of network layers is very large. The model proposed here is based on an amplitude code, but it requires only one network layer (with the size estimation network as an auxiliary mechanism for removing small, noisy regions in real images). Pattern separation is achieved by labeling different objects with different activity levels. Pixels that belong to the same object are painted with the same activity value due to the local excitatory links. Self-excitation ensures a uniform representation of segments without any auxiliary mechanisms. Pixels that belong to different objects are segregated by lateral inhibitory signals, which maintain an amplitude difference between different groups. In contrast with the spin-lattice model (Opara & Wörgötter, 1998), which assumes units with discrete activation states (i.e., Potts spins), the continuous activation dynamics of the present model converges to a set of discrete labels. The flow of local excitatory signals is regulated by a separate set of units that compute boundaries between objects. The proposed network deals only with the issue of representation; surface boundaries are computed in other networks, such as the BCS. Although the model's behavior resembles the interaction between surface and boundary computation proposed by Grossberg and Todorović (1988), there are two important differences. In the present approach, surfaces are constructed from an abstract code provided by a single time-varying cell, while the feature contour system (FCS) receives input from contrast-sensitive cells similar to lateral geniculate nucleus (LGN) cells, which are also used for boundary computation. Another difference is that the interaction between the surface and boundary systems occurs at dendrites, which allows for a sharper surface representation in the present formulation.
The labeling network shares similar architectural assumptions with previously proposed models because it depends on local excitatory links and lateral inhibitory links (Grossberg, 1988; Terman & Wang, 1995; Wersing et al., 2001). The standard architecture is augmented with dendritic inhibition of the lateral inhibitory pathways, which enables the network to represent an arbitrary number of image segments. Dendritic inhibition is closely related to the mechanism of presynaptic inhibition previously proposed by Yuille and Grzywacz (1989) in a model of winner-take-all behavior. Spratling and Johnson (2001, 2002) provide a review of arguments for dendritic inhibition as a plausible mechanism. They showed that dendritic inhibition improves network capacity for storing patterns. Here it is shown that dendritic inhibition also increases the representational capacity of neural networks. Multiplicative interactions at dendritic branches may arise as a consequence of the electrical properties of real dendritic membranes (Mel, 1994). The assumption that object identity is coded with the amplitude of neural activity is in accord with a recent proposal by Roelfsema, Lamme, and Spekreijse (2000; Roelfsema, Lamme, Spekreijse, & Bosch, 2002). However, the present approach does not require attention or any high-level visual process (e.g., associative memory) in order to form a segmented representation of an input scene.
Several neurophysiological studies support the idea of labeling different surfaces or objects with different activity levels. Lamme (1995) studied neural activity in the primary visual cortex in response to stimuli in which a figure is distinguished from the background by a difference in texture orientation or motion direction. He observed a difference in firing rate depending on whether a neuron's receptive field was inside or outside the region belonging to the figure. The rate enhancement was uniform over the whole figure, regardless of whether the receptive field was on the border or in the interior of the figure. In a subsequent study, Zipser, Lamme, and Schiller (1996) found firing-rate enhancement for figures defined by differences in luminance, color, or disparity. Furthermore, in both studies, the rate enhancement related to the figure-ground relationship had a longer latency than the initial response to local image features. Also, texture boundaries are processed before the texture interior is enhanced, suggesting different network mechanisms for surfaces and borders (Lamme, Rodriguez-Rodriguez, & Spekreijse, 1999). Based on these studies, Lamme and Roelfsema (2000) proposed a model of sensory processing in the brain with two distinct modes. The first is a fast feedforward mode, in which cells respond to the local image feature to which they are tuned. The second is a slower feedback mode, which represents a more abstract code related to figure-ground segregation and possibly other aspects of visual perception. Unfortunately, Lamme and his colleagues used displays with only one figure and a background. Extensions to multi-element displays remain to be done. In that case, the present model makes a testable prediction: distinct surfaces should be represented with different firing rates, and all cells coding the same surface should exhibit the same firing rate.
It is not unreasonable to expect such an outcome given the recent finding that cells in the prefrontal cortex use an amplitude code for representing an abstract property, namely, the serial position of motor commands in a sequence of movements to be performed (Averbeck, Chafee, Crowe, & Georgopoulos, 2002). However, a study from another laboratory failed to find evidence for enhanced activity in the interior of the texture in a texture segregation task (Rossi, Desimone, & Ungerleider, 2001). Enhanced activity was observed only at the texture borders. Rossi et al. argued that the discrepancy in results could be accounted for by the different behavioral paradigms employed in these studies. On the other hand, Albright and Stoner (2002) suggest that textured stimuli do not provide strong cues for segmentation and that configurations with depth cues should be tested instead. For instance, two overlapping surfaces will promote stronger segmentation because they contain T-junctions. Relevant findings are summarized by Lee (2003), who also argued that enhanced activity in the interior of surfaces reflects important information about figure-ground organization.

Results of segmentation need to be interpreted in some way by higher-level visual areas. How surface selection and recognition could be performed using temporal correlations is not entirely clear (Shadlen & Movshon, 1999). Wang (1999) proposed an architecture for surface selection that takes LEGION as an input. On the other hand, the present approach offers a straightforward method for selecting a particular segment or performing a visual search. The network dynamics forces one surface to attain the maximal activity amplitude due to self-excitation. Therefore, a surface is selected simply by imposing a threshold on the output of the labeling network. The threshold should assume a value just slightly below the maximal amplitude in the network. Moreover, the threshold could be lowered in order to select more than one object at a time. This is consistent with recent psychophysical evidence (Davis, Driver, Pavani, & Shepherd, 2000; Davis, Welch, Holmes, & Shepherd, 2001). When the surface (or surfaces) is selected, it is evaluated in higher-level visual areas, and if it is not suitable for the task at hand, it can be actively inhibited. When the representation of the most active surface is removed, the next most active representation is freed from inhibition, because it received inhibition only from the previously selected surface. Therefore, its activity grows until it reaches the maximal amplitude, which means that it is above the threshold and becomes the next selected surface. This process can be continued until a target surface is found. In this way, a serial search for a visual target is implemented in the network (Wolfe et al., 2002). Visual search models usually assume a saliency map as input, which is computed as a weighted difference over all registered visual features and drives the search process (Itti & Koch, 2000, 2001; Wolfe, 1994). The labeling network assigns arbitrary activity amplitudes to the surfaces due to the abstract identity code. However, it is possible that the labeling network receives input directly from the saliency map.
In that case, the activity amplitude of the particular surface would correspond with its saliency, but if two or more surfaces have the same saliency measure, they will receive the same activity amplitude and be selected together. The disadvantage of the proposed approach is the extensive wiring necessary for proper network operation. Besides full connectivity of the lateral inhibitory pathways, every location has its own set of connections, from the dendritic interneuron to dendrites of lateral interneuron. Further research will explore whether it is possible to achieve the same behavior in a structurally less demanding architecture. Another task is to test the network on noisy images. In that case, a more complex boundary detection algorithm should be used (e.g., the BCS). Recent variants of the BCS have been shown to be very effective in detecting surface borders even in the presence of high noise (Mingolla, Ross, & Grossberg, 1999; Ross, Grossberg, & Mingolla, 2000).
References

Albright, T. D., & Stoner, G. R. (2002). Contextual influences on visual processing. Annual Review of Neuroscience, 25, 339–379.
Averbeck, B. B., Chafee, M. V., Crowe, D. A., & Georgopoulos, A. P. (2002). Parallel processing of serial movements in prefrontal cortex. Proceedings of the National Academy of Sciences USA, 99, 13172–13177.
Beck, D. M., Rees, G., Frith, C. D., & Lavie, N. (2001). Neural correlates of change detection and change blindness. Nature Neuroscience, 4, 645–650.
Becker, M. W., Pashler, H., & Anstis, S. M. (2000). The role of iconic memory in change detection tasks. Perception, 29, 273–286.
Cesmeli, E., & Wang, D. L. (2000). Motion segmentation based on motion/brightness integration and oscillatory correlation. IEEE Transactions on Neural Networks, 11, 935–947.
Cesmeli, E., & Wang, D. L. (2001). Texture segmentation using Gaussian-Markov random fields and neural oscillator networks. IEEE Transactions on Neural Networks, 12, 394–404.
Davis, G., Driver, J., Pavani, F., & Shepherd, A. (2000). Reappraising the apparent costs of attending to two separate visual objects. Vision Research, 40, 1323–1332.
Davis, G., Welch, V. L., Holmes, A., & Shepherd, A. (2001). Can attention select only a fixed number of objects at a time? Perception, 30, 1227–1248.
Driver, J., Davis, G., Russell, C., Turatto, M., & Freeman, E. (2001). Segmentation, attention and phenomenal visual objects. Cognition, 80, 61–95.
Gray, C. M. (1999). The temporal correlation hypothesis of visual feature integration: Still alive and well. Neuron, 24, 31–47.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception and Psychophysics, 38, 141–171.
Grossberg, S., & Todorović, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241–277.
Grossberg, S., & Wyse, L. (1991).
A neural network architecture for figure-ground separation of connected scenic figures. Neural Networks, 4, 723–742.
Häusser, M., & Mel, B. W. (2003). Dendrites: Bug or feature? Current Opinion in Neurobiology, 13, 372–383.
Hollingworth, A., & Henderson, J. M. (2002). Accurate visual memory for previously attended objects in natural scenes. Journal of Experimental Psychology: Human Perception and Performance, 28, 113–136.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.
Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2, 1–11.
Kaski, S., & Kohonen, T. (1994). Winner-take-all networks for physiological models of competitive learning. Neural Networks, 7, 973–984.
Lamme, V. A. F. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. Journal of Neuroscience, 15, 1605–1615.
Lamme, V. A. F., Rodriguez-Rodriguez, V., & Spekreijse, H. (1999). Separate processing dynamics for texture elements, boundaries and surfaces in primary visual cortex of the macaque monkey. Cerebral Cortex, 9, 406–413.
Lamme, V. A. F., & Roelfsema, P. R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neuroscience, 23, 571–579.
Landman, R., Spekreijse, H., & Lamme, V. A. F. (2003). Large capacity storage of integrated objects before change blindness. Vision Research, 43, 149–164.
Lee, T. S. (2003). Computations in the early visual cortex. Journal of Physiology-Paris, 97, 121–139.
Mack, A. (2003). Inattentional blindness: Looking without seeing. Current Directions in Psychological Science, 12, 180–184.
Mack, A., & Rock, I. (1998). Inattentional blindness. Cambridge, MA: MIT Press.
Marcus, D. S., & van Essen, D. C. (2002). Scene segmentation and attention in primate cortical areas V1 and V2. Journal of Neurophysiology, 88, 2648–2658.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Milner, P. M. (1974). A model of visual shape recognition. Psychological Review, 81, 521–535.
Mingolla, E., Ross, W. D., & Grossberg, S. (1999). A neural network for enhancing boundaries and surfaces in synthetic aperture radar images. Neural Networks, 12, 499–512.
Moore, C. M., & Egeth, H. (1997). Perception without attention: Evidence of grouping under conditions of inattention. Journal of Experimental Psychology: Human Perception and Performance, 23, 339–352.
Opara, R., & Wörgötter, F. (1998). A fast and robust cluster update algorithm for image segmentation in spin-lattice models without annealing—Visual latencies revisited. Neural Computation, 10, 1547–1566.
Palmer, S., & Rock, I. (1994). Rethinking perceptual organization: The role of uniform connectedness. Psychonomic Bulletin and Review, 1, 29–55.
Peterson, M. A. (1994). Object recognition processes can and do operate before figure-ground organization. Current Directions in Psychological Science, 3, 105–111.
Peterson, M. A., & Gibson, B. S. (1994a). Must figure-ground organization precede object recognition? An assumption in peril.
Psychological Science, 5, 253–259.
Peterson, M. A., & Gibson, B. S. (1994b). Object recognition contributions to figure-ground organization: Operations on outlines and subjective contours. Perception and Psychophysics, 56, 551–564.
Poirazi, P., Brannon, T., & Mel, B. W. (2003). Pyramidal neuron as two-layer neural network. Neuron, 37, 989–999.
Pylyshyn, Z. (1999). Is vision continuous with cognition? The case for cognitive impenetrability of visual perception. Behavioral and Brain Sciences, 22, 341–423.
Rensink, R. A. (2000). Seeing, sensing, and scrutinizing. Vision Research, 40, 1469–1487.
Rensink, R. A. (2002). Change detection. Annual Review of Psychology, 53, 245–277.
Roelfsema, P. R., Lamme, V. A. F., & Spekreijse, H. (2000). The implementation of visual routines. Vision Research, 40, 1385–1411.
Roelfsema, P. R., Lamme, V. A. F., Spekreijse, H., & Bosch, H. (2002). Figure-ground segregation in a recurrent network architecture. Journal of Cognitive Neuroscience, 14, 525–537.
Ross, W. D., Grossberg, S., & Mingolla, E. (2000). Visual cortical mechanisms of perceptual grouping: Interacting layers, networks, columns and maps. Neural Networks, 13, 571–588.
Rossi, A. F., Desimone, R., & Ungerleider, L. G. (2001). Contextual modulation in primary visual cortex of macaques. Journal of Neuroscience, 21, 1698–1709.
Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77.
Simons, D. J. (2000). Current approaches to change blindness. Visual Cognition, 7, 1–15.
Simons, D. J., Chabris, C. F., Schnur, T., & Levin, D. T. (2002). Evidence for preserved representations in change blindness. Consciousness and Cognition, 11, 78–97.
Simons, D. J., & Levin, D. T. (1997). Change blindness. Trends in Cognitive Sciences, 1, 261–267.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24, 49–65.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Spratling, M. W., & Johnson, M. H. (2001). Dendritic inhibition enhances neural coding properties. Cerebral Cortex, 11, 1144–1149.
Spratling, M. W., & Johnson, M. H. (2002). Preintegration lateral inhibition enhances unsupervised learning. Neural Computation, 14, 2157–2179.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of coupled oscillators. Physica D, 81, 148–176.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society London B, 265, 359–366.
van Rullen, R., & Koch, C. (2003). Competition and selection during visual processing of natural scenes and objects. Journal of Vision, 3, 75–85.
Vecera, S. P. (2000). Toward a biased competition account of object-based segregation and attention. Brain and Mind, 1, 353–384.
Vecera, S.
P., Flevaris, A. V., & Filapek, J. C. (2004). Exogenous spatial attention influences figure-ground assignment. Psychological Science, 15, 20–26.
Vecera, S. P., & O'Reilly, R. C. (1998). Figure-ground organization and object recognition processes: An interactive account. Journal of Experimental Psychology: Human Perception and Performance, 24, 441–462.
von der Malsburg, C. (1981). The correlation theory of brain function (Int. Rep. 81-2). Göttingen: Max Planck Institute for Biophysical Chemistry.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biological Cybernetics, 54, 29–40.
Wang, D. (1999). Object selection based on oscillatory correlation. Neural Networks, 12, 579–592.
Wang, D. L. (2000). On connectedness: A solution based on oscillatory correlation. Neural Computation, 12, 131–139.
Wang, D. L., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836.
Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 14, 357–387.
Wolfe, J. M. (1994). Guided search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 1, 202–238.
Wolfe, J. M. (1999). Inattentional amnesia. In V. Coltheart (Ed.), Fleeting memories (pp. 71–94). Cambridge, MA: MIT Press.
Wolfe, J. M., Oliva, A., Horowitz, T. S., Butcher, S. J., & Bompas, A. (2002). Segmentation of objects from backgrounds in visual search tasks. Vision Research, 42, 2985–3004.
Yuille, A. L., & Grzywacz, N. M. (1989). A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Computation, 1, 334–347.
Zipser, K., Lamme, V. A. F., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. Journal of Neuroscience, 16, 7376–7389.

Received March 25, 2003; accepted March 9, 2004.
LETTER
Communicated by Michael Arnold
A Solution for Two-Dimensional Mazes with Use of Chaotic Dynamics in a Recurrent Neural Network Model Yoshikazu Suemitsu [email protected] Graduate School of Natural Science and Technology, Okayama University, Okayama 700-8530, Japan
Shigetoshi Nara [email protected] Department of Electrical and Electronic Engineering, Okayama University, Okayama, 700-8530, Japan
Chaotic dynamics introduced into a neural network model is applied to solving two-dimensional mazes, which are ill-posed problems. A moving object moves from its position at t to t + 1 by a simply defined motion function calculated from the firing patterns of the neural network model at each time step t. We have embedded in our neural network model several prototype attractors that correspond to simple motions of the object toward several directions in two-dimensional space. Introducing chaotic dynamics into the network gives outputs sampled from intermediate state points between the embedded attractors in state space, and these dynamics enable the object to move in various directions. System parameter switching between a chaotic and an attractor regime in the state space of the neural network enables the object to move to a set target in a two-dimensional maze. Results of computer simulations show that the success rate for this method over 300 trials is higher than that of a random walk. To investigate why the proposed method gives better performance, we calculate and discuss statistical data with respect to dynamical structure.

Neural Computation 16, 1943–1957 (2004)  © 2004 Massachusetts Institute of Technology

1 Introduction

Since Poincaré discovered chaos, chaotic phenomena have been observed in various scientific and engineering fields. Deterministic but complex behavior, which is caused by internal instability and nonlinearity, has been attracting strong interest among scientists. A variety of phenomenological and dynamical properties of chaos have been investigated and reported. Chaos has been discovered even in living systems, including the brain, with the help of great advances in measuring technologies and has had a great impact on the effort to understand biological systems (Babloyantz
& Destexhe, 1986; Skarda & Freeman, 1987). There is a possibility that chaos is important in realizing highly advanced information processing and control functioning. If this is so and the mechanism is understood, then chaos could be applied to a novel information processing or control system based on nonconventional control principles. In biological systems, local systems and the total system couple to each other with quite strong nonlinearity. This prevents us from decoupling them by conventional reductionism, that is, decomposing them into individual parts or elements. In other words, the difficulty of "combinatorial explosion" or "divergence of algorithmic complexity" occurs because the degrees of freedom are too large. In such situations, large-scale computer simulations or heuristic methods are useful for approaching the problem. Artificial neural networks in which chaotic dynamics can occur are a typical approach, and the relation between chaos and function has been discussed (Aihara, Takabe, & Toyoda, 1990; Tsuda, 1991, 2001; Fujii, Itoh, Ichinose, & Tsukada, 1996). The whole picture, however, is not yet revealed. Nara and Davis found that chaotic dynamics can occur in a recurrent neural network, and they investigated functional aspects of chaos by applying it to solving, for instance, a memory search set under an ill-posed context (Nara & Davis, 1992, 1997; Nara, Davis, Kawachi, & Totuji, 1995; Kuroiwa, Nara, & Aihara, 1999; Suemitsu & Nara, 2003). From the perspective of control, chaos has been considered to spoil control systems; consequently, a large number of methods have been proposed to avoid it. However, we consider chaos to be useful not only in solving ill-posed problems but also in controlling systems with large but finite degrees of freedom. To show the potential of chaotic dynamics generated in neural networks, we apply it to control functioning, using a maze in two-dimensional space as an example.
Here, an object is assumed to move in a two-dimensional maze and approach the target by employing chaotic dynamics. One reason that we consider a maze is that the process involved in solving a maze can be easily visualized; we can understand how the dynamical structures are effectively utilized in controlling. Let us state the essence of the model's construction. The object is assumed to move in discrete time steps. The state vector of the network is translated into the object's motion, as will be described in a later section. In addition, several limit cycle attractors, which are regarded as prototypes of simple motion, are embedded in the network. They correspond to movement toward several directions in two-dimensional space via the motion function. If the network state converges into a prototypical attractor, the object moves constantly in one of several directions in two-dimensional space. Introducing chaotic dynamics into the network generates nonperiodic output. The nonperiodic state vector, which includes internal instability, is translated into chaotic motion of the object through the motion function. Chaotic dynamics spontaneously samples novel motions among the attractors; this is the first important point. In addition, system parameter switching by a simple evaluation with a certain tolerance between chaotic and stable phases of the network can realize complex motion of the object. The interaction between spontaneous motion generated by chaotic dynamics and an external simple evaluation that includes uncertainty gives constrained chaos; this is the second important point. In the computer simulations, we find that the present method utilizing chaos gives performance superior to a stochastic method based on a random walk. To understand the mechanism of the better performance, we have investigated the dynamical hysteresis based on statistical data.

2 Network Model and Motion Function

In this letter, a binary neural network is employed, defined by

    s_i(t + 1) = sgn( Σ_{j=1}^{N} ε_{r;ij} w_{ij} s_j(t) ),  (2.1)

    sgn(u) = +1 (if u ≥ 0), −1 (if u < 0),
where s_i(t) = ±1 is the state of neuron i at time t, N is the total number of neurons, sgn(u) is the signum function, {ε_{r;ij}} is a binary connectivity mask whose entries are 0 or 1 and satisfy Σ_j ε_{r;ij} = r, r is called the fan-in number or connectivity of each neuron, and {w_{ij}} is the synaptic connection matrix. It is determined by

    w_{ij} = Σ_{µ=1}^{L} Σ_{λ=1}^{M} ξ_i^{µ,λ+1} †ξ_j^{µ,λ},  (2.2)

    †ξ_j^{µ,λ} = Σ_{α=1}^{L} Σ_{β=1}^{M} (O^{−1})_{α,β}^{µ,λ} ξ_j^{α,β},  (2.3)

    O_{α,β}^{µ,λ} = Σ_j ξ_j^{µ,λ} ξ_j^{α,β},  (2.4)

where {ξ_i^{µ,λ} | i = 1, . . . , N; µ = 1, . . . , L; λ = 1, . . . , M} is the attractor pattern set described below. The periodic sequence ξ^{µ,λ} = ξ^{µ,λ+M} is employed in this model, where †ξ^{µ,λ} is the adjoint vector of ξ^{µ,λ}, which satisfies †ξ^{α,ρ} ξ^{β,σ} = δ_{α,β} δ_{ρ,σ}, where δ is Kronecker's delta. This enables us to embed L attractors with M-step maps into the network while avoiding spurious attractors when the connectivity r = N (Nara & Davis, 1992, 1997; Nara et al., 1993, 1995; Suemitsu & Nara, 2003). In our computer simulations, the case N = 400, L = 4, M = 6 is employed. If N is too small, chaotic dynamics does not occur, whereas if N is oversized, it results in
excessive computing time. This is why we selected N = 400. The meaning of L and M is explained below. In two-dimensional space, an object is assumed to move from position (p_x(t), p_y(t)) to (p_x(t + 1), p_y(t + 1)) via a set of motion functions. Using the network state s(t), the motion functions f_x(s(t)), f_y(s(t)) are defined by

    f_x(s(t)) = (4/N) Σ_{i=1}^{N/4} s_i(t) · s_{i+N/2}(t),  (2.5)

    f_y(s(t)) = (4/N) Σ_{i=1}^{N/4} s_{i+N/4}(t) · s_{i+3N/4}(t),  (2.6)

where f_x and f_y range from −1 to +1 due to the normalization by 4/N. Note that each motion function is calculated as an inner product between two parts of the network state s(t). The actual motion of the object in two-dimensional space is given by

    p_x(t) = p_x(t − 1) + f_x(s(t)),  (2.7)

    p_y(t) = p_y(t − 1) + f_y(s(t)).  (2.8)
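The update rule of equation 2.1 and the motion functions of equations 2.5 through 2.8 can be sketched together as follows. This is a minimal Python sketch: the random weight matrix and connectivity mask are placeholders for illustration, not the pseudo-inverse construction of equations 2.2 through 2.4.

```python
import numpy as np

def step(s, w, mask):
    """One synchronous update, equation 2.1:
    s_i(t+1) = sgn(sum_j eps_{r;ij} w_ij s_j(t)), with sgn(u) = +1 for u >= 0."""
    u = (mask * w) @ s
    return np.where(u >= 0, 1, -1)

def motion(s):
    """Motion functions, equations 2.5-2.6: inner products between quarters
    of the state vector, normalized by 4/N so that fx, fy lie in [-1, +1]."""
    N = len(s)
    q = N // 4
    fx = (4.0 / N) * np.dot(s[:q], s[2 * q:3 * q])   # parts 1 and 3
    fy = (4.0 / N) * np.dot(s[q:2 * q], s[3 * q:])   # parts 2 and 4
    return fx, fy

def move(p, s):
    """Position update, equations 2.7-2.8."""
    fx, fy = motion(s)
    return (p[0] + fx, p[1] + fy)

# Placeholder network: random weights, and a mask with fan-in r per neuron.
rng = np.random.default_rng(0)
N, r = 400, 40
w = rng.standard_normal((N, N))
mask = np.zeros((N, N))
for i in range(N):
    mask[i, rng.choice(N, size=r, replace=False)] = 1.0

# A state whose third quarter is the negative of its first quarter gives fx = -1.
s = np.ones(N)
s[2 * N // 4:3 * N // 4] = -1
fx, fy = motion(s)            # fx = -1.0, fy = +1.0
s_next = step(s, w, mask)     # next binary state, entries in {-1, +1}
```

The 4/N factor makes each motion function an average over N/4 products of ±1 values, which is what confines f_x and f_y to [−1, +1].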
In our computer simulations, two-dimensional space is digitized with a resolution of 0.02, owing to the binary state vector s(t) with 400 elements and the definition of f_x and f_y in equations 2.5 and 2.6. The set of attractor patterns in equation 2.2 is determined as follows. Each attractor pattern is divided into two parts. One is a random component, in which each element ξ_i^{µ,λ} (i = 1, . . . , N/2) becomes +1 or −1 with probability 0.5. The other part (ξ_i^{µ,λ}, i = N/2 + 1, . . . , N) is determined so that the relations

    (f_x(ξ^{1,λ}), f_y(ξ^{1,λ})) = (−1, −1),  (2.9)

    (f_x(ξ^{2,λ}), f_y(ξ^{2,λ})) = (−1, +1),  (2.10)

    (f_x(ξ^{3,λ}), f_y(ξ^{3,λ})) = (+1, −1),  (2.11)

    (f_x(ξ^{4,λ}), f_y(ξ^{4,λ})) = (+1, +1),  (2.12)

hold. Under the condition λ = 1, . . . , 6, four limit cycle attractors, each of which has M (= 6) patterns, are embedded in the synaptic connection matrix of the network. Each limit cycle attractor corresponds to a constant motion of the object toward one of the four directions (+1, +1), (+1, −1), (−1, +1), (−1, −1).
3 Chaotic Dynamics

If the connectivity r is sufficiently large, the network state s(t) converges to one of the cyclic attractors. This corresponds to a conventional associative memory. Under the condition s(t) = ξ^{µ,λ}, the output at the next time step becomes s(t + 1) = ξ^{µ,λ+1}. Even if the network state s(t) is only near one of the attractor patterns ξ^{µ,λ}, the output sequence s(t + kM) (k = 1, 2, 3, . . .) generated by the M-step map will converge to the pattern ξ^{µ,λ}. This fact suggests that each attractor pattern has an associated set of network states, the so-called attractor basin B_{µ,λ}: a network state s(t + kM) (k = 1, 2, 3, . . .) converges to the attractor pattern ξ^{µ,λ} if s(t) is in the attractor basin B_{µ,λ}. Estimating basin volume is one way to characterize this network; to estimate basin volume exactly, one would have to check the final state (lim_{k→∞} s(kM)) for every initial state of the network. Calculating all the final states is infeasible, however, because the total number of initial states is 2^N. Therefore, the approximate basin volume is estimated by the following method. First, a large number of random initial states is generated; these cover the state space uniformly if the number is sufficiently large. By updating each initial state, its final state lim_{k→∞} s(kM) can be determined, and the ratio between the number of initial states that converge to a certain attractor and the total number of initial states can then be computed. This rate of convergence to each attractor pattern is proportional to its basin volume and is regarded as the approximate basin volume of that attractor. Figure 1 shows the basin volumes of the network. The estimation indicates that almost all initial states converge to one of the attractor patterns and that the basin volumes are approximately equal to one another. Let us now consider introducing chaotic dynamics by decreasing the connectivity r.
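The Monte Carlo basin-volume estimate described above can be sketched with a toy associative memory. This sketch substitutes fixed-point Hebbian patterns for the letter's cyclic pseudo-inverse attractors; the sampling logic (random initial states, iterate, count where each one ends up) is the same.

```python
import numpy as np

# Toy network: two orthogonal fixed-point patterns stored with Hebbian weights.
N = 16
p1 = np.ones(N)
p2 = np.tile([1.0, -1.0], N // 2)              # orthogonal to p1
w = (np.outer(p1, p1) + np.outer(p2, p2)) / N
np.fill_diagonal(w, 0.0)

def run(s, steps=50):
    """Iterate the synchronous sign update toward a final state."""
    for _ in range(steps):
        s = np.where(w @ s >= 0, 1.0, -1.0)
    return s

# Sample random initial states and count where they end up; the fraction of
# samples reaching each pattern approximates its basin volume.
rng = np.random.default_rng(2)
n_samples = 500
counts = {"p1": 0, "p2": 0, "other": 0}
for _ in range(n_samples):
    final = run(rng.choice([-1.0, 1.0], size=N))
    if abs(np.dot(final, p1)) == N:            # p1 or its mirror -p1
        counts["p1"] += 1
    elif abs(np.dot(final, p2)) == N:
        counts["p2"] += 1
    else:
        counts["other"] += 1                   # cycles or spurious states
volumes = {k: c / n_samples for k, c in counts.items()}
```

As in the letter's Figure 1, the "other" bin collects samples that do not settle on any stored pattern.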
When r is reduced, the network state wanders chaotically in the state space, since the stability of the embedded attractors is lost. In our previous article (Nara et al., 1995), we discussed whether the generated dynamics is chaotic. We observed orbital instability for low connectivity, and we calculated the invariant measure in terms of basin visiting frequency. This means that dynamical structure still exists in the state space after the reduction of connectivity (Nara & Davis, 1992, 1997; Nara et al., 1993, 1995; Suemitsu & Nara, 2003). In particular, it is confirmed that the network state visits all basins of the embedded attractors chaotically within a certain range of small r. In our model, each embedded attractor corresponds to a prototypical motion in two-dimensional space. Stable output of the network with large r provides a constant motion of the object in two-dimensional space; Figure 2 shows the motion of the object with large r. Furthermore, if the network becomes unstable at low connectivity r (< 70), the object moves chaotically; Figure 3 shows a chaotic orbit generated by the chaotic dynamics of the network.

Figure 1: Basin volume: The horizontal axis represents the memory pattern number (1–24). Basin 25 corresponds to samples that converged into cyclic outputs having a period of six steps that do not converge to any memory attractor. Basin 26 corresponds to samples that do not belong to any other case (1–25). The vertical axis represents the ratio of each sample count to the total number of samples. Hatching and nonhatching are used alternately to distinguish different cyclic attractors.
4 Motion Control

In this section, we propose a method for controlling the object by switching the connectivity r according to a simple evaluation. A target is assumed to be set in two-dimensional space; the exact coordinates of the target (q_x, q_y), however, are not given to the control method proposed below. As an example of an external response with uncertainty, a rough direction to the target, D_1(t), with a certain tolerance is defined and made available to the object. For instance, D_1(t) becomes 1 if the direction from the object to the target lies at an angle between 0 and π/2, where the angle is measured from the x-axis of the two-dimensional space. More generally, D_1(t) becomes n (= 1, 2, 3, 4) if the direction lies between the angles (n − 1)π/2 and nπ/2. Second, a direction of
Figure 2: An example of the object orbit from start point (0, 0) in two-dimensional space with an attracting network state (r = 400). One of the motions embedded in the synaptic connection matrix {w_ij} is shown.
the object motion, D_2(t), from time t − 1 to t is defined as

    D_2(t) = 1 (if c_x(t) = +1 and c_y(t) = +1),
             2 (if c_x(t) = −1 and c_y(t) = +1),
             3 (if c_x(t) = −1 and c_y(t) = −1),
             4 (if c_x(t) = +1 and c_y(t) = −1),

where c_x(t) and c_y(t) are given as

    c_x(t) = (p_x(t) − p_x(t − 1)) / |p_x(t) − p_x(t − 1)|,  (4.1)

    c_y(t) = (p_y(t) − p_y(t − 1)) / |p_y(t) − p_y(t − 1)|.  (4.2)

Finally, using D_1(t) and D_2(t), a time-dependent connectivity r(t) of the network is determined:

    r(t) = R_L (= N)  (if D_1(t − 1) = D_2(t − 1)),
           R_S (≪ N)  (otherwise),  (4.3)
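The switching rule can be sketched as follows. This is a minimal sketch: the quadrant computation via atan2 is one convenient way to implement the D_1/D_2 labels, and boundary angles are resolved arbitrarily.

```python
import math

def quadrant(dx, dy):
    """Map a displacement (dx, dy) to a quadrant label n = 1..4, meaning the
    direction's angle from the x-axis lies between (n-1)*pi/2 and n*pi/2."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle // (math.pi / 2)) + 1

def connectivity(p_prev, p_now, target, R_L, R_S):
    """Switching rule of equation 4.3: keep full connectivity R_L while the
    last move (D2) and the target direction (D1) share a quadrant; otherwise
    drop to R_S, letting the network go chaotic."""
    d1 = quadrant(target[0] - p_now[0], target[1] - p_now[1])   # D1(t-1)
    d2 = quadrant(p_now[0] - p_prev[0], p_now[1] - p_prev[1])   # D2(t-1)
    return R_L if d1 == d2 else R_S

# The object just moved up-right and the target is up-right: keep R_L = N.
r_t = connectivity((0, 0), (1, 1), (5, 5), R_L=400, R_S=40)   # -> 400
```

Repeating this evaluation at every time step, with the network updated at the chosen r(t), reproduces the alternation between stable goal-directed motion and chaotic exploration described in the text.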
Figure 3: An example of the object orbit from the start point (0, 0) in twodimensional space with chaotic network states (r = 10).
where R_L is a sufficiently high connectivity (= N in our experiments) and R_S is a low connectivity. After determining the connectivity r(t), we calculate the motion of the object from the network state updated with r(t). The motion provides the next calculation of D_1(t) and D_2(t), and by repeating this process, in which the connectivity is switched between R_L and R_S, the object moves in two-dimensional space. In this control method, the high connectivity r(t) = R_L is maintained as long as the object moves toward the target within a tolerance of π/2, as this provides stable motion toward the target. Figure 4 shows an example of the orbit of the object under the proposed control method. In the figure, the object moves along the horizontal axis, a motion that is not embedded as a prototypical attractor in the network. Figures 5 and 6 show examples where walls, through which the object is not permitted to pass, are set in two-dimensional space to form a maze. We have assumed that the object always knows which quadrant (= D_1(t)) the target is in at each time step. In our experiments, both f_x(s(t)) and f_y(s(t)) are set to zero if the moving object hits a wall, which means that the object cannot penetrate the wall. It is possible to observe various solutions to the problem of the object approaching the target in a two-dimensional maze.
Figure 4: An example of the object orbit in two-dimensional space with the simple control method from start point (0, 0) to target point (20, 0). Using simple switching between RS and RL , motion in the right direction appears, although that motion is not embedded in the synaptic connection matrix.
5 Result of the Two-Dimensional Maze Problem

To quantify the performance of the proposed control method in solving a two-dimensional maze, we have estimated the success rate over 300 random initial states. The result of the control method is compared to that of a random method (random walk), in which a random bit-pattern generator is used instead of chaotic dynamics: the network state s(t) is replaced at every step by a random pattern in the case D_1(t − 1) ≠ D_2(t − 1), and if D_1(t − 1) = D_2(t − 1), then r(t) is switched to R_L = N. The problem in the actual computer simulation is the maze shown in Figure 6, where 300 random patterns are set as the initial states of the network. The rate of reaching the target within 5000 time steps is taken as the success rate in Figure 7. The success rate for the same problem with the random walk is 0.01; the success rate of the control method with chaotic dynamics in the network between R_S = 40 and R_S = 50, however, is significantly higher than that of the random walk. The configuration of {ε_{r;ij}} is randomly chosen under the condition Σ_j ε_{r;ij} = r. Differing configurations of {ε_{r;ij}} provide a variety of results, even if
Figure 5: An example of the object orbit in two-dimensional space with a wall from start point (0, 0) to target point (20, 0). After colliding with the wall, the object escapes the wall and moves to the target.

Table 1: Mean and the Standard Deviation of the Success Rate for R_S = 30, 40, 50.

                        R_S = 30   R_S = 40   R_S = 50
Average                  0.084      0.139      0.129
Standard Deviation       0.130      0.157      0.089
the same initial condition is given. The mean and the standard deviation of success rate over six different configurations {εr;ij } for RS = 30, 40, 50 are shown in Table 1. Note that this result is typical, although it strongly depends on the configuration {εr;ij } in equation 2.1. 6 Discussion To show one of the reasons that our model gives better performance than a random walk, let us discuss hysteresis of the network state s(t) from a dynamical viewpoint. In the proposed controlling method, the connectivity of the network is set to a low connectivity r(t) = RS when the object collides against a wall in two-dimensional space. Making a detour to approach
A Solution for Two-Dimensional Mazes
1953
Figure 6: An example of the object orbit in two-dimensional space with walls, from start point (1, 9) to target point (19, 19) (RS = 30).
Figure 7: Three hundred random patterns are set as the initial states of the network. The rate of successfully reaching the target within 5000 time steps is estimated as the success rate. The horizontal axis is RS , and the vertical axis is the success rate. The broken line represents the result of a random search (= 0.01).
the target, the object must move in a direction that does not point toward the target. This implies that with low connectivity RS, the high-dimensional orbits in the network must stay in each attractor basin, which is defined at full connection, within certain successive time intervals. We hypothesize that the origin of the successful result lies in itinerancy between attractor ruins. To confirm this hypothesis, we estimate the residence time in a certain basin. A distribution p(l) is defined as

p(l) = {the number of intervals of length l with s(t) ∈ β_µ in τ ≤ t ≤ τ + l, and s(τ − 1) ∉ β_µ and s(τ + l + 1) ∉ β_µ},   (6.1)

β_µ = ∪_{λ=1}^{M} B_{µ,λ},   (6.2)

K = Σ_l l p(l),   (6.3)
where l is the length of successive time steps staying in each embedded basin, and p(l) represents the distribution of l within K steps, which is taken to be K = 10^5 in our actual computer simulation. From statistical considerations, the distribution p(l) for a random process is expected to fit an exponential function.

Figures 8 (semilog plot) and 9 (log-log plot) show the frequency distribution p(l) for the cases of random walk (R.W.), r = 10, r = 40(a), and r = 40(b), where the two cases r = 40(a) and r = 40(b) have different {ε_r;ij} configurations. They are calculated in the free runs of the object shown in Figure 3. The distribution for the random walk (R.W. in the figures) fits a line on the semilog scale. Since r = 10 also fits a line on the semilog scale, highly developed chaos is considered to occur near r = 10. For the random walk and for r = 10, better performance in solving a two-dimensional maze is not obtained in the success rates shown in Figure 7, because of the small values of p(l) in the region of large l. In contrast, no exponential decrease is found for r = 40(a) and r = 40(b) on the semilog scale (see Figure 8). They fit a power law well within l = 10 on the log-log scale (see Figure 9), which indicates that the distribution p(l) has a long tail near r = 40 and suggests that characteristic dynamical properties are included in the chaotic dynamics generated by the network. The case of r = 40 gives a broader distribution in the region of long l than that of the random walk or r = 10, which is one reason that chaotic dynamics near r = 40 gives superior performance, since the object must generate various motions to make a detour after colliding with a wall. The residence time in a certain basin might be considered to help the object generate the appropriate motion.

An unsolved problem still remains.
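The residence-time distribution of equations 6.1 to 6.3 can be estimated directly from a simulated orbit once each state s(t) has been classified into one of the embedded basins. The sketch below assumes such a classifier is available and encodes its output as a sequence of basin labels, with −1 for states outside every basin; the function name and the label encoding are illustrative, not taken from the original simulation code:

```python
def residence_time_distribution(basin_labels):
    """Estimate p(l) of equation 6.1: the frequency of maximal runs of
    length l during which the state s(t) stays inside one basin.
    basin_labels[t] is the basin index of s(t), or -1 if s(t) lies
    outside every embedded basin beta_mu."""
    counts = {}
    run = 1
    for prev, cur in zip(basin_labels, basin_labels[1:]):
        if cur == prev and cur >= 0:
            run += 1          # still inside the same basin
        else:
            if prev >= 0:     # a maximal run of length `run` just ended
                counts[run] = counts.get(run, 0) + 1
            run = 1
    if basin_labels and basin_labels[-1] >= 0:
        counts[run] = counts.get(run, 0) + 1
    return counts

# Example: two visits to basin 0 (lengths 3 and 1) and one to basin 1 (length 2).
p = residence_time_distribution([0, 0, 0, -1, 1, 1, 0])
# Total observed steps inside basins, as in equation 6.3: sum of l * p(l).
K_inside = sum(l * n for l, n in p.items())
```

On a semilog plot, an exponential p(l) (random walk or r = 10) appears as a straight line, whereas the long-tailed distributions near r = 40 do not.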
In spite of the similarity of the distribution p(l) between r = 40(a) and r = 40(b) in Figures 8 and 9, the case with r = 40(a) performs better (success rate: 0.373) than that with r = 40(b)
Figure 8: Semilog plot of the frequency distribution of staying length l. The horizontal axis represents the basin-visiting time length, and the vertical axis is the observed frequency from t = 1 to t = 10^5. r = 10 is the case of highly developed chaos with connectivity r = 10. r = 40(a) is the case with r = 40 that gives good performance for the maze shown in Figure 6. r = 40(b) differs from r = 40(a) in the configuration {ε_r;ij} and gives poor performance. R.W. denotes the case in which the chaotic dynamics of the neural network model is replaced with a random number generator.
Figure 9: Log-log plot of the frequency distribution of visiting time length l. Each axis corresponds to the axis in Figure 8.
(success rate: 0.013) in solving the maze in Figure 6. This indicates that the performance depends heavily on the configuration of {ε_r;ij}. Generally, arbitrarily generated chaos is not always useful in solving a given problem. A functional aspect of chaotic dynamics is still context dependent, which is an issue for future study.

7 Summary

We proposed a simple method to solve two-dimensional mazes with the use of chaotic dynamics. Generally, arbitrarily generated chaos is not always useful in solving a given problem; however, better results were often observed with the use of chaotic dynamics. This is why we have confined ourselves in this article to comparing results obtained with chaotic dynamics against those obtained with a random pattern generator. It would be desirable to compare a colored random pattern generator based on reinforcement learning with chaos learning; however, we have not yet dealt with this problem, and it remains a future issue. A summary of our results is given below:

• We proposed a motion control method in two-dimensional space with chaotic dynamics in a recurrent neural network.
• Utilizing chaotic dynamics gives better performance than a random walk for solving a two-dimensional maze.
• The distribution p(l) shows that chaotic dynamics leads to longer stays in a certain basin in the network state space than does a random walk.
• We found itinerancy between attractor ruins in the chaotic dynamics.
• Low connectivity gives highly developed chaos, which is almost equivalent to a random walk in the distribution p(l).
Received August 12, 2003; accepted March 9, 2004.
LETTER
Communicated by Jerome Friedman
Online Adaptive Decision Trees

Jayanta Basak
[email protected]
IBM India Research Lab, Indian Institute of Technology, New Delhi–110048, India
Decision trees and neural networks are widely used tools for pattern classification. Decision trees provide a highly localized representation, whereas neural networks provide a distributed but compact representation of the decision space. Decision trees cannot be induced in the online mode, and they are not adaptive to a changing environment, whereas neural networks are inherently capable of online learning and adaptivity. Here we provide a classification scheme called online adaptive decision trees (OADT), which is a tree-structured network like the decision trees and is capable of online learning like neural networks. A new objective measure is derived for supervised learning with OADT. Experimental results validate the effectiveness of the proposed classification scheme. Also, with certain real-life data sets, we find that OADT performs better than two widely used models: the hierarchical mixture of experts and the multilayer perceptron.

1 Introduction

Two major tools for classification (Duda & Hart, 1973; Jain, Duin, & Mao, 2000) are neural networks (Haykin, 1999) and decision trees (Breiman, Friedman, Olshen, & Stone, 1983; Quinlan, 1993). The first, especially feedforward networks, embeds a distributed representation of the inherent decision space, whereas the decision tree always makes a localized representation of the decision. The decision tree (Duda, Hart, & Stork, 2001; Durkin, 1992; Fayyad & Irani, 1992; Friedman, 1991; Breiman et al., 1983; Quinlan, 1993, 1996; Brodley & Utgoff, 1995) is a widely used tool for classification in various domains, such as text mining (Yang & Pedersen, 1997), speech (Riley, 1989; Chien, Huang, & Chen, 2002), bioinformatics (Salzberg, Delcher, Fasman, & Henderson, 1998), web intelligence (Zamir & Etzioni, 1998; Cho, Kim, & Kim, 2002), and many other fields that need to handle large data sets.
Decision trees are in general constructed by recursively partitioning the training data set into subsets such that finally a certain objective error (such as impurity) over the entire data is minimized. Although the very nature of the construction of axis-parallel decision trees always tries to reduce the bias at the cost of variance (which can incur a high generalization error), axis-parallel decision trees are quite popular in terms of their interpretability; each class

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1959–1981 (2004)
1960
J. Basak
can be specifically described by a set of rules defined by the path from the root to the leaf nodes of the tree. Variations of decision trees also exist in the literature, such as SVDT (Bennett, Wu, & Auslender, 1998) and oblique decision trees (Murthy, Kasif, & Salzberg, 1994), which take into account combinations of more than one attribute in partitioning the data set at any level (at the cost of interpretability).

Decision trees usually perform hard splits of the data set, and if a mistake is committed in splitting the data set at a node, it cannot be corrected further down the tree. To address this problem, attempts have been made (Friedman, Kohavi, & Yun, 1996) that in general employ various kinds of look-ahead operators to guess the potential gain in the objective after more than one recursive split of the data set. Second, a decision tree is always constructed in the batch mode; that is, the branching decision at each node of the tree is induced based on the entire set of training data. Different variations have been proposed for an efficient construction mechanism (Utgoff, Berkman, & Clouse, 1997; Mehta, Agrawal, & Rissanen, 1996). Online algorithms (Kalai & Vempala, n.d.; Albers, 1996), on the other hand, sometimes prove to be more useful in the case of streaming data, very large data sets, and situations where memory is limited.

Unlike decision trees, neural networks (Haykin, 1999) make decisions based on the activation of all nodes, and therefore even if a node is faulty (or makes mistakes) during training, it does not affect the performance of the network very much. Neural networks also provide a computational framework for online learning and adaptivity. In short, a decision tree can be viewed as a token-passing system: a pattern is treated as a token that follows a path in the tree from the root node to a leaf node, and each decision node behaves as a gating channel, depending on the attribute values of the pattern token.
Neural networks, on the other hand, resemble the nervous system in the sense of distributed representation: a pattern is viewed as a phenomenon of activation, and the activation propagates through the links and nodes of the network to reach the output layer.

Some attempts have been made to combine neural networks with decision trees to obtain different classification models. For example, in the neural tree (Strömberg, Zrida, & Isaksson, 1991; Golea & Marchand, 1990), multilayer perceptrons are used at each decision node (intermediate node) of the tree for partitioning the data set. Attempts have also been made to construct decision trees (Boz, 2000) from trained neural networks, as well as to construct neural networks (Lee, Jone, & Tsai, 1995) from trained decision trees. However, all of these methods hybridize concepts borrowed from both paradigms and learn locally within each node; none of them addresses learning in terms of achieving overall classification performance. Possibly the most elegant approach in this line is the hierarchical mixture of experts (HME) (Jordan & Jacobs, 1993, 1994) and the associated learning algorithm. In HME, a tree-structured network (similar to a decision tree)
Online Adaptive Decision Trees
1961
is viewed as a hierarchy of the mixture of experts, and the outputs of the individual experts are combined selectively through gating networks in a bottom-up manner through the hierarchy, such that the root node always represents the final decision. HME provides a general framework for both classification and regression, and a statistical framework, the expectation-maximization (EM) algorithm, has been used for estimating the network parameters in the batch mode. Jordan and Jacobs (1993) demonstrated that HME is a more powerful classification tool than feedforward neural networks and decision trees. Online learning algorithms have also been derived for HME.

Here we provide a scheme for online classification that preserves the structure of decision trees and the learning paradigm of neural networks. We call our model an online adaptive decision tree (OADT); it can learn in the online mode and can exhibit adaptive behavior (i.e., adapt to a changing environment). Online adaptive decision trees are complete binary trees of a specified depth in which a subset of the leaf nodes is assigned to each class, in contrast to HME, where the root node always represents the class. In the online mode, for each training sample, the tree adapts the decision hyperplane at each intermediate node from the response of the leaf nodes and the assigned class label of the sample. Unlike HME, OADT uses the top-down structure of a decision tree; instead of probabilistically weighting the decisions of the experts and combining them, OADT allocates a data point to a set of leaf nodes (ideally one) representative of a class. Since OADT operates in a top-down manner, it requires a different objective error measure. The decision in this case is collectively represented by a set of leaf nodes. We provide a different objective error measure for OADT and subsequently derive the learning rules based on the steepest gradient descent criterion.

We organize the rest of the letter as follows.
In section 2, we show how the leaf nodes should respond to a given sample in OADT for a specified structure of the tree. We then formulate an objective error measure to reflect the discrepancy between the collective responses of the leaf nodes and the assigned class label of the training sample. We derive the learning rule by following the gradient descent of the error measure. In section 3, we demonstrate the performance of the OADT on real-life data sets. Finally, we discuss and conclude this letter in sections 4 and 5, respectively.

2 Description of the OADT

An OADT is a complete binary tree of depth l, where l can be specified beforehand. The tree has a set of intermediate nodes D, with |D| = 2^l − 1, and a set of leaf nodes L, with |L| = 2^l, such that the total number of nodes in the tree is 2^{l+1} − 1. Each intermediate node i in the tree stores a decision hyperplane in the form (w_i, θ_i), a vector lr (the form of lr is explained later, in equation 2.4), and a sigmoidal activation function f(·). The weight vector w_i is n-dimensional for an n-dimensional input and is constrained as ‖w_i‖ = 1
for all i, such that (w_i, θ_i) together have n free variables. As a convention, we number the nodes in OADT in breadth-first order starting from the root node, numbered 1, and this convention is followed throughout this article.

2.1 Activation of OADT. As input, each intermediate node can receive activation from its immediate parent and the input pattern x. The root node has no parent and receives only the input pattern. As output, each intermediate node produces two outputs: one for the left child and the other for the right child. According to our convention of numbering the nodes, node i has two children, numbered 2i and 2i + 1, where node 2i is the left child of node i and node 2i + 1 is the right child. If u(i) denotes the activation received by node i from its parent, then the activation values received by the left and right children of node i are given as

u(2i) = u(i) · f(w_i·x + θ_i)   (2.1)

and

u(2i + 1) = u(i) · f(−(w_i·x + θ_i)).   (2.2)
In other words, node i takes the product of the activation value received from its parent and the activation produced by itself and sends it to the child nodes. The activation values produced by node i itself are different for the two child nodes. The value of u(1) for the root node is considered to be unity. The leaf nodes do not see the input pattern x and reproduce the activation received from their parent. Thus, from equations 2.1 and 2.2, the activation value of a leaf node p is given as

u(p) = ∏_{n=1}^{l} f((−1)^{mod(⌊p/2^{n−1}⌋, 2)} (w_{⌊p/2^n⌋}·x + θ_{⌊p/2^n⌋})),   (2.3)

that is, the activation of a leaf node is decided by the product of the activations produced by the intermediate nodes on the path from the root node to the concerned leaf node. We define a function lr(i, p) to represent whether the leaf node p is on the left path or the right path of an intermediate node i, such that

lr(i, p) = { 1 if p is on the left path of i; −1 if p is on the right path of i; 0 otherwise }.   (2.4)

Using the function lr(·), the activation of a leaf node p can be represented as

u(p) = ∏_{i∈P_p} f(lr(i, p)(w_i·x + θ_i)),   (2.5)
where P_p is the path from the root node to the leaf node p. The function lr(·) is stored in the form of a vector at each intermediate node i (the respective lr(i, :)).

In OADT, each intermediate node can observe the input pattern, unlike in conventional neural networks: in feedforward networks, there usually exist hidden nodes that can observe only the input provided by other nodes but cannot observe the actual input pattern. During the supervised training phase of OADT, it is difficult to decide which leaf node should receive the highest activation. However, it is observed that the total activation of the leaf nodes is always constant if the activation function is a sigmoid or of a similar nature.

Claim 1. The total activation of the leaf nodes is constant if the activation function of each intermediate node is sigmoidal in nature.

Proof. Let us assume one form of the sigmoid function as

f(v; m) = 1 / (1 + exp(−mv)).   (2.6)

Then

f(v; m) + f(−v; m) = 1.   (2.7)
This property in general holds true for any kind of symmetric sigmoid function (including the S-function used in the literature of fuzzy set theory). For any child node p, the activation is given as

u(p) = u(⌊p/2⌋) · f((−1)^{mod(p, 2)} (w_{⌊p/2⌋}·x + θ_{⌊p/2⌋})).   (2.8)

Thus, for any two leaf nodes p and p + 1 having a common parent node ⌊p/2⌋, we have

u(p) + u(p + 1) = u(⌊p/2⌋).   (2.9)
Thus, the total activation of the leaf nodes is

Σ_{p=2^l}^{2^{l+1}−1} u(p) = Σ_{q=2^{l−1}}^{2^l−1} u(q).   (2.10)

By similar reasoning,

Σ_{q=2^{l−1}}^{2^l−1} u(q) = Σ_{r=2^{l−2}}^{2^{l−1}−1} u(r).   (2.11)
As we go up the tree until we reach the root node, the sum of the activation becomes equal to u(1). Since we consider that u(1) = 1 by default, the total activation of the leaf nodes is
Σ_p u(p) = 1.   (2.12)
Therefore, decreasing the output of one leaf node automatically enforces an increase in the activation of the other leaf nodes, given that the activation of the root node is constant.

2.2 Activation Function. The activation of each leaf node is given as

u(p) = ∏_{n=1}^{l} f((−1)^{mod(⌊p/2^{n−1}⌋, 2)} (w_{⌊p/2^n⌋}·x + θ_{⌊p/2^n⌋})),   (2.13)

and the activation function f(·) is

f(v) = 1 / (1 + exp(−mv)).   (2.14)
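Equations 2.1 to 2.5, together with the sigmoid of equation 2.14, can be computed in a single breadth-first pass, using f(−v) = 1 − f(v) from equation 2.7 for the right child. The following Python sketch is our own illustration (not the author's MATLAB code); the array layout for W and theta is an assumption:

```python
import numpy as np

def sigmoid(v, m):
    return 1.0 / (1.0 + np.exp(-m * v))

def leaf_activations(W, theta, x, m):
    """Leaf activations of an OADT of depth l (equations 2.1-2.5).
    W has shape (2**l - 1, n) and theta shape (2**l - 1,); row i-1 stores
    the hyperplane of intermediate node i, with nodes numbered 1..2**l - 1
    in breadth-first order and children of node i numbered 2i and 2i + 1."""
    n_int = len(theta)                 # 2**l - 1 intermediate nodes
    u = {1: 1.0}                       # root activation u(1) = 1
    for i in range(1, n_int + 1):
        a = sigmoid(W[i - 1] @ x + theta[i - 1], m)
        u[2 * i] = u[i] * a            # left child, equation 2.1
        u[2 * i + 1] = u[i] * (1 - a)  # right child, eq. 2.2 via f(-v) = 1 - f(v)
    # leaves are the nodes numbered 2**l .. 2**(l+1) - 1
    return np.array([u[p] for p in range(n_int + 1, 2 * n_int + 2)])

# Depth-2 example: 3 intermediate nodes, 4 leaves, 2-dimensional input.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # ||w_i|| = 1 constraint
u_leaves = leaf_activations(W, rng.normal(size=3), np.array([0.5, -0.2]), m=2.0)
```

By claim 1, `u_leaves.sum()` equals 1 regardless of W, theta, and x, which makes a convenient numerical check of the implementation.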
The value of m is to be selected in such a way that a leaf node representing the true class should be able to produce an activation value greater than a certain threshold, that is,

u(p) ≥ 1 − ε,   (2.15)

where p is the activated leaf node representing the true class. Note that if the condition in equation 2.15 holds, then the total activation of the rest of the leaf nodes is less than ε by claim 1. For example, ε can be chosen as 0.1. Let us further assume that

min_i |w_i·x + θ_i| = δ,   (2.16)
that is, δ is a margin between a pattern and the closest decision hyperplane. In that case, the minimum activation of a maximally activated leaf node representing the true class is

u_p(x) ≥ [1 / (1 + exp(−mδ))]^l,   (2.17)

considering that the pattern x satisfies all conditions governed by all hyperplanes w_i on the path from the root to the leaf node p,

(w_i·x + θ_i) lr(i, p) > 0.   (2.18)
In that case, from equation 2.15,

1 / (1 + exp(−mδ))^l ≥ 1 − ε,   (2.19)

such that

l log(1 + exp(−mδ)) ≤ −log(1 − ε).   (2.20)

Assuming exp(−mδ) and ε to be small, we have

l exp(−mδ) ≤ ε,   (2.21)

that is,

m ≥ (1/δ) log(l/ε).   (2.22)

For example, if we choose ε = 0.1 and δ = 1, then m = log(10l).
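The bound of equation 2.22 amounts to a one-line rule for choosing the sigmoid slope; the function below is a small illustration of it (the names are ours, not from the source):

```python
import math

def slope_parameter(l, eps=0.1, delta=1.0):
    """Smallest m satisfying equation 2.22, m >= (1/delta) * log(l/eps):
    with margin delta from every hyperplane on the correct path, the true
    leaf then attains activation of at least 1 - eps (approximately)."""
    return math.log(l / eps) / delta

# With eps = 0.1 and delta = 1, this reduces to m = log(10 * l),
# the setting used later in the experiments.
m = slope_parameter(4)
```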
For example, if we choose = 0.1 and δ = 1, then m = log(10l). 2.3 Loss Function. For a two-class problem, we partition the leaf nodes into two categories 1 and 2 , corresponding to two different classes, C1 and C2 . We assign the odd-numbered leaf nodes to 1 and the even-numbered leaf nodes to 2 such that for a trained OADT, it is desired that (x ∈ C1 ) and (p is activated)
⇒
p ∈ 1
(x ∈ C2 ) and (p is activated)
⇒
p ∈ 2 .
(2.23)
For a supervised learning machine with a single output, if g(x) is the output for an underlying model y(x), then y(x) can be described as (Amari, 1998)

y(x) = g(x; W) + ε(x),   (2.24)

where W is a matrix defining the set of parameters and ε is a gaussian noise such that

p(ε) = (1/(√(2π) σ)) exp(−(1/(2σ²)) (y − g(x; W))²).   (2.25)

The effective loss function derived from the likelihood measure can thus be expressed as

E = (1/2) Σ_x (y − g(x; W))².   (2.26)
However, in OADT, we do not have any underlying function analogous to y(x) for each leaf node, since it is not known a priori which leaf node will be activated for a given input. In the framework of decision trees, it is difficult to assign any desired activation to any particular leaf node. For a given pattern x, we can say that one node from a group of nodes should be activated. We define an indicator function,

y(x) = { 1 if x ∈ C1; −1 if x ∈ C2 }.   (2.27)

We model the noise as

p(ε) = (1/(√(2π) σ)) exp(−(1/(2σ²)) [(1 − y) Σ_{j∈Ω1} u_j(x; W) + (1 + y) Σ_{j∈Ω2} u_j(x; W)]²),   (2.28)

where Ω1 and Ω2 are the sets of leaf nodes assigned to class 1 and class 2, respectively. Here, W represents all weight vectors corresponding to all intermediate nodes of OADT. Equation 2.28 tells us that for any sample x ∈ C1, if any node from Ω2 is activated, or vice versa, then there is an error. Considering that Σ_j u_j(x; W) = 1, the objective error measure can be expressed as

E = (1/2) Σ_x [y(x) − (Σ_{j∈Ω1} u_j(x; W) − Σ_{j∈Ω2} u_j(x; W))]²   (2.29)

since y²(x) = 1 for all x. Considering that Σ_{j∈Ω1} u_j(x; W) + Σ_{j∈Ω2} u_j(x; W) = 1, the objective error measure is

E = 2 Σ_x (Σ_{j∈Ω_c} u_j(x; W))²,   (2.30)

where Ω_c is the set of leaf nodes assigned for class c.
2.4 Learning Rule. Following steepest gradient descent, we have

Δw_i = −η (Σ_{j∈Ω_c} u_j(x; W)) (Σ_{j∈Ω_c} ∂u_j(x; W)/∂w_i)
Δθ_i = −η (Σ_{j∈Ω_c} u_j(x; W)) (Σ_{j∈Ω_c} ∂u_j(x; W)/∂θ_i),   (2.31)

where c is either 1 or 2, depending on the sample belonging to class 2 or class 1, respectively. The activation of a leaf node j can in general be expressed as u_j(x; W) = v_{1j} ··· v_{ij} ··· v_{⌊j/2⌋j}, where 1, …, i, …, ⌊j/2⌋ are the nodes on the path from the root to the leaf node j, and

v_ij = f((w_i·x + θ_i) lr(i, j))   (2.32)

is the contribution of node i to the activation of the leaf node j. Therefore, from equation 2.31,

Δw_i = −ηm (Σ_{j∈Ω_c} u_j(x; W)) (Σ_{j∈Ω_c} u_j(x; W)(1 − v_ij) lr(i, j)) x
Δθ_i = −ηm (Σ_{j∈Ω_c} u_j(x; W)) (Σ_{j∈Ω_c} u_j(x; W)(1 − v_ij) lr(i, j)).   (2.33)

Since we impose that ‖w_i‖ = 1 for all i, we have w_iᵀΔw_i = 0 for a small η; that is, Δw is normal to w. Further, from equation 2.33, we observe that w changes in the direction of x for steepest descent. Taking the projection of x onto the normal hyperplane of w, we get

Δw ∝ (I − w wᵀ) x.   (2.34)

The modified learning rule is therefore given as

Δw_i = −ηm U_c z_ic (I − w_i w_iᵀ) x
Δθ_i = −ηm U_c z_ic,   (2.35)

where

U_c = Σ_{j∈Ω_c} u_j(x; W)   (2.36)
and

z_ic = Σ_{j∈Ω_c} u_j(x; W)(1 − v_ij(x)) lr(i, j).   (2.37)
We select η to ensure steepest descent such that the residual error under a first-order approximation goes to zero, that is,

Σ_{j∈Ω_c} Δu_j(x; W) = −U_c(x).   (2.38)

Expanding the first-order approximation of Δu_j, from equation 2.35,

Δu_j = −ηm² Σ_{i∈D} [U_c z_ic u_j(x; W)(1 − v_ij(x)) lr(i, j)(1 + ‖x‖² − (w_iᵀx)²)],   (2.39)

where D is the set of intermediate nodes. Therefore,

Σ_j Δu_j = −ηm² U_c Σ_{i∈D} z_ic² (1 + ‖x‖² − (w_iᵀx)²).   (2.40)

From equation 2.38,

η = 1 / (m² Σ_{i∈D} z_ic² (1 + ‖x‖² − (w_iᵀx)²)).   (2.41)
In this analysis, we assumed the same rate for updating w and θ, although w is updated with the additional constraint ‖w‖ = 1. In order to make the rule more flexible, we can use different learning rates, η and λη, for updating w and θ, respectively. The overall learning rule is therefore given as

Δw_i = −α U_c z_ic (I − w_i w_iᵀ) x
Δθ_i = −λα U_c z_ic,   (2.42)

where

α = 1 / (m Σ_{k∈D} z_kc² (λ + ‖x‖² − (wᵀx)²)).   (2.43)

It can be noted here that near the optimal point in the parameter space, the value of z_kc becomes very small, and as a result, the steepest descent can take relatively large steps. In order to take care of this problem, we define the denominator of the learning rule as

α = 1 / (m [1 + Σ_{k∈D} z_kc² (λ + ‖x‖² − (wᵀx)²)]).   (2.44)
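One full online step, combining the forward pass (equations 2.1 and 2.2), the error terms U_c and z_ic (equations 2.36 and 2.37), and the normalized update of equations 2.42 and 2.44, can be sketched as follows. This is our own reconstruction, not the author's code: we evaluate (wᵀx)² per node with w_i, and we renormalize w_i after each step to maintain ‖w_i‖ = 1; both choices are assumptions where the text leaves details open.

```python
import numpy as np

def sigmoid(v, m):
    return 1.0 / (1.0 + np.exp(-m * v))

def oadt_update(W, theta, x, y, m, lam=1.0):
    """One online OADT update for a two-class problem; y is +1 (class 1,
    odd-numbered leaves) or -1. Returns (W, theta, U_c), with U_c the
    total undesired leaf activation before the update (equation 2.36)."""
    n_int = len(theta)                        # 2**l - 1 intermediate nodes
    u = np.zeros(2 * n_int + 2); u[1] = 1.0   # node activations
    v = np.zeros(2 * n_int + 2)               # v[j]: parent's factor toward child j
    for i in range(1, n_int + 1):             # forward pass, eqs. 2.1-2.2
        a = sigmoid(W[i - 1] @ x + theta[i - 1], m)
        u[2 * i], v[2 * i] = u[i] * a, a
        u[2 * i + 1], v[2 * i + 1] = u[i] * (1 - a), 1 - a
    leaves = range(n_int + 1, 2 * n_int + 2)
    # Omega_c: leaves NOT assigned to the sample's class (odd <-> class 1)
    wrong = [p for p in leaves if (p % 2 == 1) != (y == 1)]
    U_c = sum(u[p] for p in wrong)            # eq. 2.36
    z = np.zeros(n_int + 1)
    for p in wrong:                           # eq. 2.37, climbing leaf -> root
        j = p
        while j > 1:
            i = j // 2
            side = 1.0 if j % 2 == 0 else -1.0   # lr(i, p)
            z[i] += u[p] * (1.0 - v[j]) * side
            j = i
    for i in range(1, n_int + 1):             # eqs. 2.42 and 2.44
        alpha = 1.0 / (m * (1.0 + np.sum(z[1:] ** 2)
                            * (lam + x @ x - (W[i - 1] @ x) ** 2)))
        W[i - 1] -= alpha * U_c * z[i] * (x - (W[i - 1] @ x) * W[i - 1])
        W[i - 1] /= np.linalg.norm(W[i - 1])  # re-impose ||w_i|| = 1 (our choice)
        theta[i - 1] -= lam * alpha * U_c * z[i]
    return W, theta, U_c

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))                   # depth-2 tree, 2-dimensional input
W /= np.linalg.norm(W, axis=1, keepdims=True)
theta = 0.1 * rng.normal(size=3)
x = np.array([0.5, -0.3]); m = np.log(20.0)
errs = []
for _ in range(30):                           # repeat on one labeled sample
    W, theta, Uc = oadt_update(W, theta, x, 1, m)
    errs.append(Uc)
```

Repeating the step on a fixed labeled sample drives U_c, and hence the error E = 2 U_c² of equation 2.30, toward zero.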
It is interesting to observe the behavior of the learning rule in equation 2.42. The changes in the weights are always modulated by the total undesired response of the leaf nodes, U_c, that is, the total activation of the leaf nodes not representing the sample's class (equation 2.36). The factor z_ic denotes the contribution of the leaf nodes that are descendants of node i and do not represent the sample's class (see equation 2.37). If such a leaf node j is on the right path of node i, then j contributes negatively to z, which in effect is a positive contribution in the learning rule of equation 2.42. This is due to the fact that if the activation of the left child of a node increases, then that of the right child always decreases (claim 1), and vice versa.

Looking at the expression for z_ic in equation 2.37, we see that z is modulated by (1 − v), where v is the contribution of that node to the resultant activation of the leaf nodes. In other words, if a node is highly active, then it is penalized more, or the weight parameters are adjusted by a relatively larger step than if the node were less active; that is, a highly active intermediate node is taken to be more responsible for producing the erroneous responses in its descendants if those descendants are found to be erroneous in their activities.

Observing the learning rule in equation 2.42, we find that every intermediate node can adjust its parameter values regardless of the other nodes, so long as it can observe the activation values of the leaf nodes. This is unlike the class of backpropagation learning algorithms used in feedforward networks, where there is an inherent dependency of the lower-layer nodes on the higher-layer nodes for learning due to error propagation.
However, in OADT learning (see equations 2.42 and 2.44), since z_ic represents the accumulated effect of all descendant leaf nodes of node i, it ensures the accumulation of activities from a larger number of leaf nodes as i goes up the hierarchy. Thus, coarse-level changes in the parameter values occur at the higher layers (the root node and the nodes closer to the root), and the finer details are captured at the lower layers, closer to the leaf nodes.

3 Simulation and Experimental Results

We simulated OADT in MATLAB 5.3 on a Pentium III machine. We experimented with both synthetic and real-life data sets, and here we demonstrate the performance of OADT in both cases.

3.1 Protocols. We trained OADT in the online mode. The entire batch of samples is repeatedly presented in the online mode to OADT, and we call the number of times the batch of samples is presented to an OADT the number of epochs. If the size of the data set is large (i.e., the data density is high), it is observed that OADT takes fewer epochs to converge; for relatively smaller data sets, OADT takes a larger number of epochs. On average, we found that OADT converges near its local optimal solution
within 10 to 20 epochs, although the required number of epochs increases with the depth of the tree. We report all results for 200 epochs. The performance of OADT depends on the depth l of the tree and the parameter m. We have chosen the parameters ε and δ (see equation 2.22) as 0.1 and 1, respectively; that is, m = log(10l). Since m is determined by l in our setting, the performance is solely dependent on l. We report the results for different depths of the tree.

We normalize all input patterns such that any component of x lies in the range [−1, +1]. Note that it is not always required to normalize the input within this specified range; in that case, however, the parameter λ (see equation 2.42) needs to be chosen properly, although we observed that the performance is not very sensitive to variations in the choice of λ. We have chosen λ as unity in the case of normalized input values.

In order to test the performance of OADT on the test samples, we provide an input pattern x and obtain the activation of all leaf nodes. We find the class label corresponding to the maximally activated leaf node and assign that class label to the test sample. In all experiments, we assigned the odd-numbered leaf nodes to class 1 (Ω1 is the set of odd-numbered leaf nodes) and the even-numbered leaf nodes to class 2.

3.2 Real-Life Data. We experimented with different real-life data sets available in the UCI machine learning repository (Merz & Murphy, 1996). We computed the 10-fold cross-validation score for each data set. The experimental protocols for OADT are exactly as described in section 3.1. Table 1 demonstrates the performance of OADT. As a comparison, we report the classification performance obtained with HME for each data set. We implemented HME using software available on the web (Martin, n.d.), which is also cited in the Bayes net toolbox (Murphy, 2001, 2003).
We used 100 epochs for HME with the EM algorithm, although it converges within 30 epochs for every data set. Table 1 demonstrates that OADT performs better (sometimes marginally) than HME. We also implemented a multilayer perceptron (MLP) network with different numbers of hidden nodes using the WEKA software package (Garner, 1995; Witten & Frank, 2000) in Java. We observed that both HME and OADT outperform MLP in terms of classification score for each data set. Jordan and Jacobs (1993) reported that HME performs better than MLP in most cases in terms of classification score. Also, for data sets having a highly nonlinear structure (e.g., the liver disease data), MLP often gets stuck in local minima if the parameters (such as the learning rate and the momentum factor) are not properly tuned. With different experiments, we obtained a maximum score of 57.97% for this data set using an MLP with 1 hidden layer and 15 hidden nodes. Possibly an increase in the number of hidden layers may increase the score. We therefore do not report the classification scores obtained with MLP separately. In addition to comparing with HME using the EM algorithm, we compare the performance of OADT with certain batch-mode algorithms, including a decision tree, k-nearest neighbor, naive Bayes, support vector machine, and bagged
Online Adaptive Decision Trees
1971
Table 1: Ten-Fold Cross-Validation Scores of OADT on Four Real-Life Data Sets for Three Depths of 3, 4, and 5.

                       Breast Cancer   Breast Cancer   Liver Disease   Diabetes
                       (Diagnostic)    (Prognostic)    (Bupa)          (Pima)
Number of features     30              32              6               8
Number of instances    569             198             345             768
HME, depth = 3         92.98           73.11           63.42           69.78
HME, depth = 4         92.44           70.17           66.47           68.87
HME, depth = 5         93.15           74.56           65.81           69.78
HME, depth = 6         94.90           71.56           59.14           69.00
HME, depth = 7         94.55           68.72           64.86           71.21
OADT, depth = 3        97.37           77.78           66.67           76.69
OADT, depth = 4        97.19           76.89           66.24           75.39
OADT, depth = 5        96.49           77.78           64.86           74.35
C4.5 score             92.44           76.77           66.38           74.09
|D|                    25              19              51              43
k-NN                   96.13           72.22           64.35           70.44
SVM                    97.19           78.28           70.43           76.69
Naive Bayes            93.15           66.67           55.94           75.78
Bagging C4.5           94.73           80.81           72.46           74.87
AdaBoost C4.5          95.96           75.76           67.54           71.35

Notes: The data sets are available in the UCI machine learning repository (Merz & Murphy, 1996). A comparison with HME is provided for different depths, along with comparisons to several batch-mode algorithms. |D| denotes the number of decision nodes generated by the decision tree C4.5. For the k-NN classifier, the value of k is set automatically by the leave-one-out principle. SVM uses cubic polynomial kernels. The classifiers k-NN, SVM, naive Bayes, C4.5 (J4.8), and bagged and boosted trees are available in the WEKA software package.
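The ten-fold protocol behind Table 1 can be sketched with a generic harness of this form (a sketch; `train` and `predict` stand in for any of the compared classifiers, and the fold-splitting details of the original experiments are not specified):

```python
import numpy as np

def cross_val_score(X, y, train, predict, n_folds=10, seed=0):
    """Shuffle the data, split into n_folds parts, and average the
    held-out classification accuracy over the folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[test]) == y[test]))
    return 100.0 * np.mean(scores)  # percentage, as in Table 1
```

The original experiments plug OADT, HME, and the WEKA classifiers into the `train`/`predict` slots of such a harness.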
and boosted decision trees. We implemented all of these classification tools using the WEKA software package (Garner, 1995; Witten & Frank, 2000) in Java. For the decision tree, we used C4.5, which is available as J4.8 in the WEKA package. For the k-NN algorithm, the value of k is set automatically by the leave-one-out principle. For the SVM, we chose cubic polynomial kernels. We observe that OADT outperforms the first three of these classification algorithms in terms of classification score. However, SVM performs best on two data sets, and the bagged trees perform best on the two others, as compared to all other algorithms. It is worth noting that OADT learns strictly in the online mode and still performs better than its online counterparts. Overall, in the paradigm of online learning, we find that OADT performs reasonably well.

3.3 Synthetic Data. It is well known that parity problems are difficult for standard feedforward neural networks. Various investigations (Lavretsky, 2000; Hohil, Liu, & Smith, 1999) are available in the
1972
J. Basak
Table 2: Performance of OADT for the n-Parity Problem with Different Amounts of Gaussian Noise.

                  Recognition Score (%) of OADT of Depth       Bayes Score
Task       SD     2       3       4       5       6            (Theoretical)
2-parity   0.2    97.40   97.90   98.10   98.40   98.10        98.77
           0.3    88.30   88.20   89.50   89.30   88.70        90.90
           0.4    77.90   76.40   79.90   81.30   81.30        81.10
3-parity   0.2    82.40   96.30   97.00   95.90   96.30        98.16
           0.3    70.40   82.10   83.30   81.50   84.40        86.99
           0.4    57.40   67.20   75.00   74.50   71.80        74.53
4-parity   0.2    56.55   71.43   84.23   89.78   96.33        97.56
           0.3    55.16   65.77   79.07   78.67   83.04        83.45
           0.4    53.27   61.90   62.20   66.57   68.75        69.35
5-parity   0.2    62.30   65.73   74.60   83.17   87.70        96.97
           0.3    50.60   56.55   62.40   67.24   71.37        80.26
           0.4    52.92   57.06   58.97   61.59   64.11        65.26

Notes: The dimensionality n varies from 2 to 5, and the standard deviation (SD) of the gaussian noise varies from 0.2 to 0.4. The performance of OADT degrades gracefully with noise.
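The gaussian parity data summarized in Table 2 can be generated along the following lines (a sketch; the labeling convention follows the even-parity corner assignment used in the appendix, and the function name is ours):

```python
import numpy as np

def gaussian_parity(n, sigma, n_points=1000, seed=0):
    """n-parity data: pick corners of the unit hypercube in R^n uniformly,
    add N(0, sigma^2 I) noise, and label each point by the parity of the
    sum of its corner's coordinates (even parity -> class 1)."""
    rng = np.random.default_rng(seed)
    corners = rng.integers(0, 2, size=(n_points, n))
    X = corners + rng.normal(0.0, sigma, size=(n_points, n))
    y = np.where(corners.sum(axis=1) % 2 == 0, 1, 2)
    return X, y
```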
literature that address solving N-parity problems in connectionist frameworks and provide the design of different architectures to deal with this class of problems. In this section, we show that OADT is also able to handle gaussian parity problems and provides reasonable classification scores. Table 2 demonstrates the performance of OADT for gaussian parity problems, where we generated 1000 data points in a unit hypercube and perturbed them with gaussian noise. We show the classification results for different sizes of the gaussian kernels along with the corresponding theoretical upper bound (the Bayes risk derived in the appendix). We find that in most cases, OADT performs close to the theoretical optimum. In higher dimensions, the performance falls short of the theoretical bound due to the decrease in data density. For example, in the case of the 5-parity problem, there are 32 corners of the hypercube, so we have only about 31 points on average at each corner. Decision trees are usually not used for such problems because it is difficult to classify such patterns with decision trees without smart look-ahead operators; however, decision trees prove very suitable in real-life scenarios because of their interpretability.

4 Discussion

4.1 Issues of Online Learning. We have used the steepest gradient descent algorithm for training OADT and adjusted the learning rate parameter by observing the error that needs to be reduced at each step. In the updating rule, since we rely only on steepest descent, we do not explicitly incorporate any factor to distinguish differing priors; rather, the weights are updated depending on the frequency of different class labels. In order to incorporate probability models, we would need to compute an inverse Hessian matrix, as used in the online version of the hierarchical mixture of experts (Jordan & Jacobs, 1993, 1994), where the inverse covariance matrix (Ljung & Söderström, 1986) is computed, and in the natural gradient descent algorithm (Amari, 1998; Amari, Park, & Fukumizu, 2000), where the inverse Fisher information matrix is computed. Since it is difficult to derive the inverse of a matrix incorporating the statistical structure in the online mode, a stochastic approximation is made using a decay parameter. However, in this class of algorithms, two learning rate parameters need to be adjusted, as described in Amari et al. (2000). In our learning algorithm, we do not require any explicit learning rate parameter to be defined by the user; only the depth of the tree needs to be specified before training. Nonetheless, the performance of OADT could possibly be improved by considering smarter gradient learning algorithms, which constitutes a scope for further study. During the training of OADT, we normalized the data set so that the maximum and minimum values of each component of the input patterns are within the range [−1, 1] (see section 3.1). We did the normalization in order to select λ = 1 (see equation 2.42). However, in the case of a streaming data set, it may not be possible to know the maximum and minimum values that the attributes can take, and thus normalization may not always be possible, although an educated guess can work in this case. In general, λ can also be selected greater than unity in order to enforce a larger step size for the parameter θ than for w, although it may be difficult to derive any theoretical guideline for selecting λ.
In order to avoid this situation in the general case, the learning in equation 2.42 can be performed in two steps, analogous to the learning presented in Basak (2001): the weight vector w is updated first, and θ is then updated based on the residual error. In the two-step learning algorithm, however, the responses of the leaf nodes must be computed twice for the same pattern. The two-stage learning algorithm can be formulated as follows. First, update w without changing θ, based on the response of the leaf nodes, as

    Δw_i = − [U_c z_ic (I − w_i w_i^T) x] / [m (1 + Σ_{k∈D} z_kc²) (‖x‖² − (w_i^T x)²)].    (4.1)

After updating w for all intermediate nodes (which can be performed simultaneously), recompute the activation of the leaf nodes for the same pattern x. From the recomputed activation values of the leaf nodes, update θ as

    Δθ_i = − U_c z_ic / [m (1 + Σ_{k∈D} z_kc²)].    (4.2)
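The two-stage update can be transcribed as follows (a sketch under stated assumptions: the helper `leaf_responses`, the indexing of the leaf responses z by intermediate node, and the exact denominator all follow our reading of equations 4.1 and 4.2 and are not spelled out in the text):

```python
import numpy as np

def two_stage_update(nodes, x, leaf_responses, U_c, m):
    """Two-step learning: update every w_i with theta fixed (first update),
    recompute the leaf activations, then update every theta_i (second update).
    `nodes` maps node index -> dict with 'w' and 'theta'; `leaf_responses`
    (an assumed helper) returns (z, D): the leaf responses relevant to each
    node and the index set D of leaf nodes."""
    z, D = leaf_responses(nodes, x)
    denom = m * (1.0 + sum(z[k] ** 2 for k in D))
    for i, node in nodes.items():
        w = node['w']
        tangent = (np.eye(len(x)) - np.outer(w, w)) @ x       # (I - w w^T) x
        scale = denom * (x @ x - (w @ x) ** 2)
        node['w'] = w - U_c * z[i] * tangent / scale          # first update
    z, D = leaf_responses(nodes, x)                           # recompute leaves
    denom = m * (1.0 + sum(z[k] ** 2 for k in D))
    for i, node in nodes.items():
        node['theta'] = node['theta'] - U_c * z[i] / denom    # second update
```

Note that the leaf activations are recomputed between the two loops, which is exactly the extra cost of the two-stage scheme mentioned in the text.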
Note that in the two-stage training algorithm for OADT, selecting the parameter λ of equation 2.42 is not required. By its inherent nature, OADT can adapt to a changing environment, since we do not freeze the learning during the training process. Our learning algorithm does not depend on any decreasing learning rate parameter, and therefore the changing characteristics of a data set are reflected in the changes in the parameters of the intermediate nodes. This enables OADT to behave adaptively so long as the depth of the tree does not need to change. We have used an OADT of fixed depth for training; however, it can be grown by observing the performance. If the performance of the tree (such as the decrease in error rate or some other measure) is not satisfactory at a certain depth, the depth can be increased by adding child nodes to the existing leaf nodes without disturbing the learned weight vectors. The process is analogous to growing neural networks, that is, adding hidden nodes to a feedforward network. However, we have not yet studied the performance of a growing OADT; it constitutes an area for future study. Similarly, for better generalization performance, it may be necessary to prune the tree selectively, which is also an area for further study. The depth of an OADT depends approximately on the number of convex polytopes that can describe the class structure. Analogous to model selection techniques in neural networks, model selection for OADT can be investigated in the future.

4.2 Behavior of the Model. In Figure 1, we illustrate the different decision regions generated by OADT when trained on a two-class problem. For multidimensional patterns, it is difficult to visualize the regions formed, and therefore we show the behavior of OADT for only two attributes. We generated data from a mixture of bivariate gaussian distributions such that the classification task is highly nonlinear.
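A two-class mixture of bivariate gaussians of this kind can be generated as follows (a sketch; the exact mixture parameters used for Figure 1 are not specified in the text, so the XOR-like layout below is only illustrative):

```python
import numpy as np

def xor_gaussian_mixture(n_per_blob=100, sigma=0.3, seed=0):
    """A highly nonlinear two-class problem from a mixture of bivariate
    gaussians: blobs on one diagonal belong to class 1 ('x' in Figure 1),
    blobs on the other diagonal to class 2 ('o')."""
    rng = np.random.default_rng(seed)
    centers = {1: [(-1, -1), (1, 1)], 2: [(-1, 1), (1, -1)]}
    X, y = [], []
    for label, mus in centers.items():
        for mu in mus:
            X.append(rng.normal(mu, sigma, size=(n_per_blob, 2)))
            y.append(np.full(n_per_blob, label))
    return np.vstack(X), np.concatenate(y)
```

No linear boundary separates the two classes here, which is what forces the decision regions of Figure 1 to become more complex as the depth grows.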
In Figure 1, we observe the regions obtained from OADT after 50 iterations for depths from one to six. As the depth of OADT increases, the complexity of the regions increases. However, after a certain depth, the complexity of the regions does not change significantly. This suggests that for this particular problem, the complexity of the decision region saturates beyond a certain depth of OADT: even if the depth is increased beyond the desired value, OADT does not overfit, which may provide good generalization capability. We have illustrated the decision space for only two attributes; however, further investigation is required to understand whether this saturating tendency holds in general. Conventional decision tree learning algorithms are more elegant in inducing axis-parallel trees (Breiman et al., 1983; Quinlan, 1993), where at each decision node, a decision is induced based on the value of a particular attribute. An OADT by its very nature induces oblique decision trees, since
Figure 1: The regions obtained by OADT for (a) depth = 1, (b) depth = 2, (c) depth = 3, (d) depth = 4, (e) depth = 5, and (f) depth = 6. Two different class labels are marked as x and o.
if the steepness of the activation function (m) of the intermediate nodes is increased, the OADT starts behaving like an oblique decision tree, and in the limit m → ∞, it is exactly an oblique decision tree. Thus, OADT can be viewed as being as interpretable as oblique decision trees. However, in comparison with existing axis-parallel decision trees (such as C4.5), OADT falls short in terms of interpretability. In feedforward networks, there are usually hidden nodes that cannot observe the input and the output; the weight update for each hidden node depends on the error propagated from the output layer. In the case of OADT, each intermediate node can directly compute the change in its weights from the response of the leaf nodes. Thus, the weight update of each intermediate node can be performed independently and simultaneously with the other intermediate nodes, unlike in feedforward networks, where there is an inherent dependency on the error propagation through layers. The memory required to store an OADT is O(d(n + 1)), where d is the number of intermediate nodes and n is the dimension of the input space. In the case of a two-layer perceptron with one output node, the required storage is O(h(n + 1) + h + 1), where h is the number of hidden nodes, considering that all hidden nodes are connected to all input nodes. Since d = 2^(l−1), where l is the depth of OADT, the storage required for OADT
can exceed that of a feedforward network if d > h. The hyperplane set by one intermediate node in OADT interacts only with those created by the descendants of that node in forming the decision regions. In feedforward networks, all decision hyperplanes formed by the hidden nodes can interact with each other. Thus, feedforward networks can in general provide a more compact representation of the decision region than OADT.

4.3 Multiclass Input Label. Here we provided an algorithm for learning a two-class problem. Any k-class problem can be handled using (k − 1) trained classifiers able to deal with two-class problems. However, it is always desirable to have a single classifier able to classify multilabel input data. In the case of a k-class problem, we can modify the network into a k-ary tree such that node i sends an activation g_il to its lth child (l ∈ {1, ..., k}) (in equations 2.1 and 2.2), where

    g_il(x) = f(w_il · x + θ_il) / Σ_{j=1}^{k} f(w_ij · x + θ_ij),    (4.3)
and the activation of the lth child node is u(i) · g_il(x). The activation is analogous to that of the gating networks used in the mixture-of-experts framework (Jordan & Jacobs, 1993, 1994), although the purpose here is to make the activation flow from the root node to the leaf nodes. The rest of the learning algorithm can be derived in exactly the same way as in the two-class case, where we assign each leaf node i a class index c = mod(i, k) + 1 if the nodes in the tree are numbered in breadth-first order. However, such networks require extra storage space compared to the two-class case. In general, if the depth is d, then we require k^d + 1 nodes in a k-ary OADT for a k-class problem and (k − 1)(2^d + 1) nodes if we employ k − 1 binary OADTs. Since OADT operates in a top-down manner, it does not have an analogous way of handling multiple class labels with extra output nodes, as in the case of HME and the MLP.

5 Conclusions

We presented the OADT, a classification method that maintains the structure of a tree and, like neural networks, employs a gradient descent learning algorithm for supervised learning in the online mode. OADT is useful for streaming data and limited-memory situations where data sets cannot be stored explicitly. OADT is similar to the HME framework in the sense that the latter also employs a tree-structured model. However, in OADT, the leaf nodes rather than the root node represent the decision, which differs from the HME framework. Thus, in OADT, the decision region is always explicitly formed by the collective activation of a set of leaf nodes. An analog of the difference between OADT and HME can be drawn from the difference between divisive and agglomerative clustering in the unsupervised framework, where in the first case, the patterns are successively partitioned down the hierarchy, whereas in the second case, they are successively agglomerated into larger clusters up the hierarchy. Since in OADT the decision is represented by the collective activation of the leaf nodes, the objective error measure is different, and therefore new learning rules are derived to train OADT in the online mode. We also provided guidelines for selecting the activation function of each node of an OADT for a given depth, so the performance of our model depends on only one parameter: the depth of the tree. We also observe that if we increase the depth of OADT, the complexity of the decision region generated by OADT almost saturates after a certain depth, an indication that the model may not suffer much from overfitting even if the depth is increased gradually. We used steepest gradient descent learning in OADT, as opposed to the EM framework in HME, although it is possible to employ gradient descent online learning in the latter framework as well; our scheme is analogous to the class of gradient descent algorithms used in feedforward neural networks. In formulating the learning algorithms for OADT, we do not view the activation of the nodes in terms of posteriors, as is done in HME, where a multinomial logit probability model (Jordan, 1995) is used to interpret the activation of the nodes in terms of class posteriors. However, analogous to the behavior of feedforward neural networks, where the activation of the output nodes corresponds to class posteriors after convergence (Barron, 1993), further investigation can be carried out to interpret the activation of the leaf nodes of OADT in a statistical framework. Also, smarter gradient descent algorithms beyond steepest descent can be incorporated in OADT for improved performance.
With the real-life data sets, we observe that OADT performs better than HME. However, further exploration is required to determine whether OADT provides better scores than HME in general. Further exploration in a statistical framework may provide better insight into the behavior of OADT and its relationship with HME.

Appendix: Bayes Risk for the Parity Problem

For a two-class (C1 and C2) symmetric gaussian parity problem, the Bayes risk is given as

    R = ∫_{P(C2|x) > P(C1|x)} p(x | C1) dx,    (A.1)

where p(x | C1) = (1/N) Σ_i N(μ_i, Σ) is the conditional distribution, with the μ_i representing corners of a unit n-dimensional hypercube, Σ = σ²I being the covariance matrix, and N = 2^(n−1) being the number of corners of the hypercube allocated to class C1.

The conditional distribution of a class is given as

    p(x | C1) = (1/N) Σ_i N(μ_i, Σ),    (A.2)

where the μ_i represent the corners of a unit hypercube such that for class C1,

    mod(Σ_j μ_ij, 2) = 0    (A.3)

for all i, and Σ = σ²I is the covariance matrix. Each attribute is considered to be independent of the others. The Bayes error for the patterns generated by the first kernel (0, 0, ..., 0) when the first component of the patterns exceeds 1/2 is given as

    r = (1/N) ∫_{1/2}^{∞} (1/(√(2π)σ)) e^(−x₁²/(2σ²)) dx₁ × ∫_{−∞}^{1/2} (1/(√(2π)σ)) e^(−x₂²/(2σ²)) dx₂ ∫_{−∞}^{1/2} (1/(√(2π)σ)) e^(−x₃²/(2σ²)) dx₃ ··· ,    (A.4)

where N = 2^(n−1). Assuming the independence of the components,

    r = (1/N) [∫_{1/2}^{∞} (1/(√(2π)σ)) e^(−x²/(2σ²)) dx] [∫_{−∞}^{1/2} (1/(√(2π)σ)) e^(−x²/(2σ²)) dx]^(n−1).    (A.5)

Computing the integrals,

    r = (1/N) (1/2^n) (1 − K)(1 + K)^(n−1),    (A.6)

where K = erf(1/(2√2 σ)). If we consider all possible ways by which one component of a pattern generated from the kernel at (0, 0, ..., 0) can exceed the value 1/2, we get the risk as

    r₁ = (1/(2^n N)) C(n, 1) (1 − K)(1 + K)^(n−1).    (A.7)

Similarly, if we consider all possible ways by which three components of a pattern generated from the kernel at (0, 0, ..., 0) can exceed the value 1/2, we get the risk as

    r₂ = (1/(2^n N)) C(n, 3) (1 − K)³(1 + K)^(n−3),    (A.8)

and so on. Summing up the terms r₁, r₂, ..., we get the risk for patterns generated from the kernel at (0, 0, ..., 0) exceeding the value 1/2 in an odd number of components as

    r = (1/(2^n N)) [C(n, 1)(1 − K)(1 + K)^(n−1) + C(n, 3)(1 − K)³(1 + K)^(n−3) + ···].    (A.9)

After simplification,

    r = (1/(2N)) (1 − K^n).    (A.10)

For the N such kernels, the total Bayes risk is given as

    R = (1/2) (1 − erf^n(1/(2√2 σ))).    (A.11)
It can be noted from equation A.11 that with the same variance of the kernel distribution (i.e., the same noise), the Bayes error increases with the increase in dimension. Acknowledgments I gratefully acknowledge the anonymous reviewers for their comments, which improved this article considerably. References Albers, S. (1996). Competitive online algorithms (Tech. Rep. No. BRICS Lecture Series LS-96-2). Aarhus, Denmark: University of Aarhus. Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276. Amari, S. I., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6), 1399–1409. Barron, A. R. (1993). Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.
Basak, J. (2001). Learning Hough transform: A neural network model. Neural Computation, 13, 651–676. Bennett, K. P., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing (Tech. Rep. No. RPI Math Report 98-100). Troy, NY: Rensselaer Polytechnic Institute. Boz, O. (2000). Converting a trained neural network to a decision tree: DecText—decision tree extractor. Unpublished doctoral dissertation, Lehigh University. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1983). Classification and regression trees. New York: Chapman & Hall. Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77. Chien, J., Huang, C., & Chen, S. (2002). Compact decision trees with cluster validity for speech recognition. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (pp. 873–876). Orlando, FL: IEEE Signal Processing Society. Cho, Y. H., Kim, J. K., & Kim, S. H. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications, 23, 329–342. Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley. Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). New York: Wiley. Durkin, J. (1992). Induction via ID3. AI Expert, 7, 48–53. Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87–102. Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141. Friedman, J. H., Kohavi, R., & Yun, Y. (1996). Lazy decision trees. In H. Shrobe & T. Senator (Eds.), Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference (pp. 717–724). Menlo Park, CA: AAAI Press. Garner, S. (1995). Weka: The Waikato environment for knowledge analysis. In Proc. of the New Zealand Computer Science Research Students Conference (pp. 57–64). Hamilton, New Zealand: University of Waikato. Golea, M., & Marchand, M. (1990). A growth algorithm for neural network decision trees. Europhysics Letters, 12, 105–110. Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall. Hohil, M. E., Liu, D., & Smith, S. H. (1999). Solving the N-bit parity problem using neural networks. Neural Networks, 12, 1321–1323. Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4–37. Jordan, M. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Available online at: http://citeseer.nj.nec.com/jordan95why.html. Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical mixtures of experts and the EM algorithm (Tech. Rep. No. AI Memo 1440). Cambridge, MA: MIT. Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Online Adaptive Decision Trees
1981
Kalai, A., & Vempala, S. (n.d.). Efficient algorithms for online decision problems. Available online at: http://citeseer.nj.nec.com/585165.html. Lavretsky, E. (2000). On the exact solution of the parity-N problem using ordered neural networks. Neural Networks, 13, 643–649. Lee, S.-J., Jone, M.-T., & Tsai, H.-L. (1995). Construction of neural networks from decision trees. Journal of Information Science and Engineering, 11, 391–415. Ljung, L., & Söderström, T. (1986). Theory and practice of recursive identification. Cambridge, MA: MIT Press. Martin, D. (n.d.). Hierarchical mixture of experts. Available online at: http://www.cs.berkeley.edu/∼dmartin/software/. Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In P. M. G. Apers, M. Bouzeghoub, & G. Gardarin (Eds.), Advances in database technology—EDBT'96 (pp. 18–32). Avignon, France. Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases (Tech. Rep.). Irvine: University of California at Irvine. Murphy, K. (2001). The Bayes net toolbox for Matlab. Computing Science and Statistics, 33, 1–20. Murphy, K. (2003). Bayes net toolbox for Matlab. Available online at: http://www.ai.mit.edu/∼murphyk/software/index.html. Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann. Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. Riley, M. D. (1989). Some applications of tree based modeling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop (pp. 339–352). San Francisco: Morgan Kaufmann. Salzberg, S., Delcher, A. L., Fasman, K. H., & Henderson, J. (1998). A decision tree system for finding genes in DNA. Journal of Computational Biology, 5, 667–680. Strömberg, J. E., Zrida, J., & Isaksson, A. (1991). Neural trees—using neural nets in a tree classifier structure. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 137–140). Toronto: IEEE Signal Processing Society. Utgoff, P. E., Berkman, N. C., & Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning, 29(1), 5–44. Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann. Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In D. H. Fisher (Ed.), Fourteenth Int. Conference on Machine Learning (ICML97) (pp. 412–420). San Francisco: Morgan Kaufmann. Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. SIGIR '98: Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval (pp. 46–54). New York: ACM.

Received August 11, 2003; accepted February 23, 2004.
LETTER
Communicated by Ning Qian
Understanding the Cortical Specialization for Horizontal Disparity Jenny C.A. Read [email protected]
Bruce G. Cumming [email protected] Laboratory of Sensorimotor Research, National Eye Institute, National Institutes of Health, Bethesda, MD 20892, U.S.A.
Because the eyes are displaced horizontally, binocular vision is inherently anisotropic. Recent experimental work has uncovered evidence of this anisotropy in primary visual cortex (V1): neurons respond over a wider range of horizontal than vertical disparity, regardless of their orientation tuning. This probably reflects the horizontally elongated distribution of two-dimensional disparity experienced by the visual system, but it conflicts with all existing models of disparity selectivity, in which the relative response range to vertical and horizontal disparities is determined by the preferred orientation. Potentially, this discrepancy could require us to abandon the widely held view that processing in V1 neurons is initially linear. Here, we show that these new experimental data can be reconciled with an initial linear stage; we present two physiologically plausible ways of extending existing models to achieve this. First, we allow neurons to receive input from multiple binocular subunits with different position disparities (previous models have assumed all subunits have identical position and phase disparity). Then we incorporate a form of divisive normalization, which has successfully explained many response properties of V1 neurons but has not previously been incorporated into a model of disparity selectivity. We show that either of these mechanisms decouples disparity tuning from orientation tuning and discuss how the models could be tested experimentally. This represents the first explanation of how the cortical specialization for horizontal disparity may be achieved. 1 Introduction Our remarkable ability to deduce depth from slight differences between the left and right retinal images appears to begin in primary visual cortex (V1), where inputs from the two eyes first converge and where many neurons are sensitive to binocular disparity (Barlow, Blakemore, & Pettigrew, 1967; Nikara, Bishop, & Pettigrew, 1968). 
Neural Computation 16, 1983–2020 (2004) © 2004 Massachusetts Institute of Technology

Since the discovery of these neurons, it
has always seemed obvious that their disparity tuning should reflect their tuning to orientation: a cell should be most sensitive to disparities orthogonal to its preferred orientation. As Figure 1 shows, this is a natural consequence of a linear-oriented filter preceding binocular combination. The fact that disparities in natural viewing are almost always close to horizontal, due to the horizontal displacement of our eyes, led to the expectation that cells tuned to vertical orientations should be the most useful for stereopsis, because such cells would be most sensitive to horizontal disparities (DeAngelis, Ohzawa, & Freeman, 1995a; Gonzalez & Perez, 1998; LeVay & Voigt, 1988; Maske, Yamane, & Bishop, 1986a; Nelson, Kato, & Bishop, 1977; Ohzawa & Freeman, 1986b). However, it has recently been demonstrated that the “obvious” premise was wrong. The response of V1 neurons was probed using random dot stereograms with disparity applied along different axes to obtain the cell’s firing rate as a function of the two-dimensional disparity of the stimulus. The resulting disparity tuning surfaces were generally elongated along the horizontal axis, independent of the cell’s preferred orientation (Cumming, 2002). Paradoxically, this means that V1 neurons are more sensitive to small changes in vertical than in horizontal disparity—precisely the opposite of the expected anisotropy. In this article, we consider why the observed anisotropy might be functionally advantageous for the visual system and then how individual cortical neurons may be wired up to achieve this. We begin, in part A, by estimating the distribution of real two-dimensional disparities encountered in natural viewing. We show that while large vertical disparities do occur, they are extremely rare. The probability distribution of two-dimensional disparity is highly elongated along the horizontal axis, reflecting the much higher probability of large horizontal disparity compared to vertical disparity. 
We argue that the horizontal elongation observed in the disparity tuning of individual neurons may plausibly reflect the horizontal elongation of this probability distribution. This would be functionally useful because in stereopsis, the brain has to solve the correspondence problem: to match up features in the two eyes that correspond to the same object in space. The initial step in this process appears to be the computation, in V1, of something close to a local cross-correlation of the retinal images as a function of disparity (Fleet, Wagner, & Heeger, 1996). The cross-correlation function should peak when it compares points in the retinas that are viewing the same object in space. The problem faced by the brain is that not all peaks in the cross-correlation function correspond to real objects in space; there are a multitude of false matches where the interocular correlation is high by chance, even though no object is present at the corresponding position in space. To distinguish true matches from false, the brain has to consider the image over large scales and use additional constraints, such as the expectation that disparity generally varies smoothly across the image (Julesz, 1971; Marr & Poggio, 1976). The horizontal elongation of the probability distribution for two-dimensional disparity represents another
Understanding Horizontal Disparity Specialization
important constraint. Because correct matches almost always have disparities very close to horizontal, local correlations between retinal positions that are separated vertically are likely to prove false matches (Stevenson & Schor, 1997). This may explain why the disparity-sensitive modulation of V1 neurons is abolished by small departures from zero vertical disparity: this property immediately weeds out a large number of false matches and simplifies the solution of the correspondence problem. Thus, although the global solution of the correspondence problem appears to be achieved after V1 (Cumming & Parker, 1997, 2000), the recently observed anisotropy (Cumming, 2002) may represent important preprocessing performed by V1. We next, in parts B and C, address the puzzling issue of how this functionally useful preprocessing can be achieved by V1 neurons. In all existing models (Ohzawa, DeAngelis, & Freeman, 1990; Qian, 1994; Read, Parker, & Cumming, 2002), disparity selectivity is tightly coupled to orientation tuning, with the direction of maximum disparity sensitivity orthogonal to preferred orientation. This coupling arises as a direct result of orientation tuning at the initial linear stage (see Figure 1). It seems impossible to minimize disparity-sensitive responses to nonzero vertical disparities, independent of orientation tuning. Thus, at present, we have the undesirable situation that the best models of the neuronal computations supporting depth perception conflict with the best evidence that these computations do support depth perception (Uka & DeAngelis, 2002). In this letter, we present two possible ways in which existing models can be simply modified so as to achieve horizontally elongated disparity tuning, regardless of orientation tuning. The first depends on position disparities between individual receptive fields feeding into a V1 binocular cell, the second on a form of divisive normalization. 
This demonstrates that the specialization for horizontal disparity exhibited by V1 neurons can be straightforwardly incorporated into existing models and is thus compatible with an initial linear stage.

2 Materials and Methods

2.1 The Probability Distribution of Two-Dimensional Disparity. In part A of the Results (section 3.1), we estimate the probability distribution of two-dimensional disparity encountered by the visual system within the central 15 degrees of vision.

2.1.1 Eye Position Coordinates. We specify eye position using Helmholtz coordinates, since these are particularly convenient for binocular vision (Tweed, 1997a, 1997b). In the Helmholtz system, eye position is specified by a series of rotations away from a reference position, in which the gaze is straight ahead. Starting with the eye in the reference position, we make a rotation through an angle T clockwise (from the eye's point of view) about the reference gaze direction; then we make a rotation left through
J. Read and B. Cumming
[Figure 1 graphic: panels A and B show the left- and right-eye receptive fields with stimulus positions 0–2; panel C plots the stimulus disparities in disparity space; panels D and E show cross-sections of the disparity tuning surface along disparity parallel to the RF and disparity orthogonal to the RF, respectively.]
Figure 1: Existing models predict that neurons should be more sensitive to disparities orthogonal to their preferred orientation. (A, B) The receptive fields of a binocular neuron. The left and right eye receptive fields (RFs) are identical, both consisting of a single long ON region oriented at 45 degrees to the horizontal (shown in white) and flanked by OFF regions (black). We consider stimuli with three different disparities. In each case, the left eye's image falls in the center of the left receptive field (circle in A). The different disparities occur because the right eye's image falls at different positions for the three stimuli. These are labeled with the numbers 0–2 in B. (C) The disparities of the three stimuli are plotted in disparity space. Stimulus 0 has zero disparity; its images fall in the middle of the central ON region of the receptive field in each eye, and so it elicits a strong response. Stimulus 1 has disparity parallel to the receptive field orientation. Although its image is displaced in the right eye, its images still fall within the ON region in both eyes, so the cell still responds strongly to both stimuli. In consequence, the disparity tuning curve showing response as a function of disparity parallel to the receptive field is broad (D, representing a cross-section through the disparity tuning surface along the dashed line). Conversely, stimulus 2 has the same magnitude of disparity, but in a direction orthogonal to the receptive field orientation. Its image in the right eye falls on the OFF flank of the receptive field (B), so the binocular neuron does not respond. This leads to a much narrower disparity tuning curve; the response falls off rapidly as a function of disparity orthogonal to the receptive field (E, representing a cross-section through the disparity tuning surface along the dotted line).
This means that the neuron is more sensitive to disparity orthogonal to its preferred orientation, in the sense that its response falls off more rapidly as a function of disparity in this direction.
an angle H; then we make a rotation down through an angle V. Thus, the reference direction has H = V = T = 0 by definition. In general, the positions of the two eyes have 6 degrees of freedom, expressed by the angles HL, HR, VL, VR, TL, TR. However, we are interested only in the case where both eyes are fixated on a single point in space, so that the gaze lines intersect (possibly at infinity). This means that both eyes must have the same Helmholtz elevation, which we write V (= VL = VR), and positive or zero vergence angle D, defined by D = HR − HL. Specifying the fixation point further constrains HL and HR, leaving just 2 degrees of freedom, TL and TR. These describe the torsion state of the eyes, corresponding to the freedom of each eye to rotate about its line of sight without changing the fixation point. Donders' law states that whenever the eyes move to look at a particular fixation point, specified by V, HL, and HR, they always adopt the same torsional state. So Helmholtz torsion can be expressed as a function of the Helmholtz elevation and azimuth: TL(V, HL, HR), TR(V, HL, HR). Current experimental evidence (Bruno & Van den Berg, 1997; Mok, Ro, Cadera, Crawford, & Vilis, 1992; Van Rijn & Van den Berg, 1993) suggests that the function TR(V, HL, HR) has the form

tan(TR/2) = tan(V/2) · {tan[λR + µR(HR − HL)] − tan(HR/2)} / {1 − tan(HR/2) tan[λR + µR(HR − HL)]}.   (2.1)

The corresponding expression for the left eye's torsion is obtained by interchanging L and R throughout. Estimates of µ range from 0.12 to 0.43 (Minken & Van Gisbergen, 1996; Mok et al., 1992; Somani, DeSouza, Tweed, & Vilis, 1998; Tweed, 1997b; Van Rijn & Van den Berg, 1993), and estimates of λ are around +2 degrees for the left eye and −2 degrees for the right (Bruno & Van den Berg, 1997). In our simulations, we use µ = 0.2 and |λ| = 2°.

2.1.2 The Retinal Coordinate System and Two-Dimensional Disparity. Next, we need a coordinate system for describing the position of images on the retinas. We assume that the retinas are hemispherical in shape. We define a Cartesian coordinate system (x, y) on the retina by projecting each retina onto the plane that is tangent to it at the fovea. The coordinates of a point P on the physical retina are given by the point where the line through P and the nodal point of the eye intersects this tangent plane. By definition, the fovea lies at the origin (0, 0). We define the x-axis to be the retinal horizon, that is, the intersection of the horizontal plane through the fovea with the retina in its reference position. The y-axis is the retinal vertical meridian: the intersection of the sagittal plane through the fovea with the retina in its reference position. Clearly, these axes change their orientation in space as the eyes move. However, we shall continue to refer to the x- and y-axes as the "horizontal" and "vertical" meridians, with the understanding that this terminology is
based on their orientation when the eyes are in the reference position. We shall define the horizontal and vertical disparity of an object to be the difference between the x- and y-coordinates, respectively, of its images in the two retinas. Expressed in angular terms,
δx = arctan(xL/f) − arctan(xR/f),
δy = arctan(yL/f) − arctan(yR/f),   (2.2)
where f is the distance from the nodal point of the eye to the fovea.

2.1.3 The Epipolar Line. Given a particular point (xL, yL) in the left retina, the locus of possible matches in the right retina defines the right epipolar line. This depends not only on (xL, yL), but also on the position of the eyes. On our planar retina, the epipolar lines are straight but of finite extent. Two points on the epipolar line are the epipole, which is where the interocular axis intersects the tangent plane of the right retina, and the infinity point, which is the image in the right retina of the point at infinity that projects to (xL, yL) in the left retina. Written as vectors, these are, respectively,

q0 = (f / tan HR) (−cos TR, sin TR),   (2.3)

and

q∞ = [f / (f cos D − xL sin D cos TL − yL sin D sin TL)] × (xL[cos D cos TR cos TL + sin TR sin TL] − yL[cos D cos TR sin TL − sin TR cos TL] + f sin D cos TR, −xL[cos D sin TR cos TL − cos TR sin TL] + yL[cos D sin TR sin TL + cos TR cos TL] − f sin D sin TR),   (2.4)
where f is the distance from the nodal point of the eye to the fovea. We use bold type to denote a two-dimensional vector on the retina. Not all points on the straight line through q0 and q∞ correspond to possible objects in space (we do not allow objects to be inside or behind the eyes). When the eye is turned inward, the epipolar line runs from the epipole q0 to the infinity point q∞ . When the eye is turned outward, the epipolar line starts at the infinity point. Formally:
If HR = 0, the right-epipolar line is specified by the vector equation r = (xR, yR) = q∞ + βε, where β ranges from 0 to ∞ and

ε = [2If / (xL sin HL cos TL − yL sin HL sin TL + f cos HL)] × (−cos HR cos TR, sin TR),

where I is half the interocular distance. If HR ≠ 0, the right-epipolar line is r = q0 + βε, where ε = q∞ − q0; if sin HR / (f cos D − xL sin D cos TL + yL sin D sin TL) > 0, then β goes from 0 to 1, while if this ratio is negative, β goes from 1 to ∞.   (2.5)
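As an illustration, the line parametrization r = q_start + βε used above, combined with the angular disparity of equation 2.2, can be sketched as follows (a simplified sketch with our own function names; q_start and eps stand for the vectors q0 or q∞ and ε, taken as given rather than recomputed from equations 2.3 and 2.4):

```python
import numpy as np

def angular_disparity(pL, pR, f=1.0):
    """Equation 2.2: angular disparity (degrees) between image points pL, pR."""
    dx = np.degrees(np.arctan(pL[..., 0] / f) - np.arctan(pR[..., 0] / f))
    dy = np.degrees(np.arctan(pL[..., 1] / f) - np.arctan(pR[..., 1] / f))
    return dx, dy

def epipolar_points(q_start, eps, betas):
    """Sample points r = q_start + beta * eps along an epipolar line (equation 2.5)."""
    return q_start[None, :] + betas[:, None] * eps[None, :]
```

A point whose two images coincide has zero disparity in both components; sampling β over its allowed range traces out the candidate matches whose disparities are then accumulated into the distribution described next.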
2.1.4 Distribution of Horizontal and Vertical Disparities for a Given Eye Posture. For a given eye posture, we sought the distribution of physically possible disparities for all objects whose images fall within an angle 15 degrees of the fovea in both retinas. For example, we consider an object whose image falls on (xL, yL) in the left retina and calculate the epipolar line in the right retina, as described above (see equations 2.3–2.5). We discard all points on the epipolar line that fall further than 15 degrees from the right eye fovea. With equation 2.2, we can then calculate the set of possible two-dimensional disparities (δx, δy) for all objects whose image in the left retina falls at (xL, yL), and whose image on the right retina falls within 15 degrees of the fovea. On disparity axes (δx, δy), this set forms an infinitely narrow line segment, which we convolved with an isotropic two-dimensional gaussian in order to obtain a smooth distribution. We repeated this calculation for a set of 420 points (xL, yL) within the central 15-degree region of the left retina and then redid the calculation the other way around: starting with a set of 420 points in the right retina and finding the epipolar lines in the left retina. The starting points (x, y) were equally spaced in latitude and longitude on the hemispherical retina: x = f tan φ cos θ, y = f tan φ sin θ, where θ covered the range from 0 to 360 degrees in steps of 18 degrees and φ ran from 0 to 15 degrees in steps of 0.75 degrees. For each starting point, we obtained another line segment on the disparity axes, and so gradually built up a picture of the distribution of possible disparities for the chosen eye position. The value of the angular disparities (δx, δy) so obtained does not depend on the interocular distance 2I or the eye's focal length f individually, but only on the ratio I/f, which we took to be 3.
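The grid of 420 starting points described above can be reproduced as follows (a minimal sketch with f = 1; variable names are ours):

```python
import numpy as np

f = 1.0
theta = np.radians(np.arange(0, 360, 18))       # 20 polar angles, 0 to 342 degrees
phi = np.radians(np.linspace(0.0, 15.0, 21))    # 21 eccentricities, 0 to 15 degrees
th, ph = np.meshgrid(theta, phi)

# Project the hemispherical retina onto the tangent plane at the fovea
x = f * np.tan(ph) * np.cos(th)
y = f * np.tan(ph) * np.sin(th)
points = np.column_stack([x.ravel(), y.ravel()])  # 20 x 21 = 420 starting points
```

The maximum radius of any point on the tangent plane is f tan 15°, corresponding to the 15-degree eccentricity limit.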
2.1.5 Distribution of Eye Postures. Finally, we wanted to find the mean distribution of disparity after averaging over likely eye positions. To do this, we repeated the above procedure for 1000 different randomly chosen eye positions. The eyes' common Helmholtz elevation V, mean Helmholtz azimuth Hc, and the vergence angle D were drawn from independent gaussian distributions with mean 0 degree (for D, the absolute value was then taken, since D must be positive for fixation). The Helmholtz azimuths of the two eyes are given by HL = Hc − D/2, HR = Hc + D/2. The Helmholtz torsion in each eye, TL and TR, was then set according to equation 2.1. The disparity distribution for this eye position was then calculated as above.

2.2 Models of Disparity Tuning in V1 Neurons. In parts B and C of the Results (section 3), we investigate how existing models of disparity-selective V1 neurons can be modified to produce horizontally elongated disparity tuning surfaces, regardless of preferred orientation. Most of the results presented here use our model of V1 disparity selectivity (Read et al., 2002), which is a modified version of the stereo energy model (Ohzawa et al., 1990; Ohzawa, DeAngelis, & Freeman, 1997). As in the original energy model, disparity-selective cells are built from binocular subunits characterized by a receptive field in the left and right eye (see Figure 2A). Each binocular subunit (BS in Figure 2A) sums its inputs and outputs the square of this sum. However, whereas the original energy model is entirely linear until inputs from the two eyes have been combined, our version postulates that inputs from left and right eyes pass through a threshold before being combined at the binocular subunit. For instance, this could occur if the initial linear operation represented by the receptive field is performed in monocular simple cells (MS in Figure 2A).
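The eye-posture sampling of section 2.1.5, combined with the torsion formula of equation 2.1, can be sketched as follows (function and parameter names are our own; µ = 0.2 and |λ| = 2° as in the simulations, with gaze and vergence SDs corresponding to the narrower of the two regimes explored in the Results):

```python
import numpy as np

def helmholtz_torsion(V, HL, HR, mu=0.2, lam=2.0):
    """Equation 2.1: Helmholtz torsion (degrees) of the left and right eyes."""
    def one_eye(H, H_other, lam_eye):
        a = np.tan(np.radians(lam_eye + mu * (H - H_other)))
        h2 = np.tan(np.radians(H) / 2.0)
        t = np.tan(np.radians(V) / 2.0) * (a - h2) / (1.0 - h2 * a)
        return np.degrees(2.0 * np.arctan(t))
    # lambda is about +2 degrees for the left eye and -2 degrees for the right
    return one_eye(HL, HR, +lam), one_eye(HR, HL, -lam)

def sample_eye_postures(n, sd_gaze=10.0, sd_verg=3.0, seed=0):
    """Draw n random eye postures; all angles in degrees."""
    rng = np.random.default_rng(seed)
    V = rng.normal(0.0, sd_gaze, n)          # common Helmholtz elevation
    Hc = rng.normal(0.0, sd_gaze, n)         # mean Helmholtz azimuth
    D = np.abs(rng.normal(0.0, sd_verg, n))  # vergence must be non-negative
    HL, HR = Hc - D / 2.0, Hc + D / 2.0
    TL, TR = helmholtz_torsion(V, HL, HR)    # torsion set by equation 2.1
    return V, HL, HR, D, TL, TR
```

Note that with V = 0 both torsions vanish, as equation 2.1 requires, and that taking the absolute value of a zero-mean gaussian gives a vergence distribution with mean σ√(2/π), about 2.4 degrees for σ = 3 degrees.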
Because these cannot output a negative firing rate, they introduce a threshold before inputs from the two eyes are combined. This modification allows better quantitative agreement with the data in a number of respects. It was introduced to account for the weaker disparity tuning observed with anticorrelated stereograms (Cumming & Parker, 1997). It also naturally accounts for disparity-selective cells that do not respond to monocular stimulation in one of the eyes, which are not possible in the original energy model, and explains why the dominant spatial frequency in the Fourier spectrum of the disparity tuning curve is usually lower than the cell's preferred spatial frequency tuning, even though the energy model predicts that these should be equal (Read & Cumming, 2003b). All the models considered in this article have an initial linear stage in which we calculate the inner product of the retinal image I(x, y) with the receptive field function ρ(x, y) in that eye:

v = ∫_{−∞}^{+∞} dx ∫_{−∞}^{+∞} dy I(x, y) ρ(x, y).   (2.6)
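A discretized sketch of this linear stage, using the Gabor receptive-field parameters given in section 2.2.4 (isotropic gaussian envelope of SD 0.2 degree, carrier 2.5 cycles per degree) and the 41 × 41-pixel grid at 30 pixels per degree used in part B; function names and the choice of orientation are our own:

```python
import numpy as np

def gabor_rf(size=41, px_per_deg=30, sd=0.2, freq=2.5, ori_deg=45.0, phase=0.0):
    """2D Gabor receptive field on a size x size pixel grid (values in degrees)."""
    half = (size - 1) / 2.0
    c = (np.arange(size) - half) / px_per_deg          # pixel coordinates in degrees
    X, Y = np.meshgrid(c, c)
    u = X * np.cos(np.radians(ori_deg)) + Y * np.sin(np.radians(ori_deg))
    envelope = np.exp(-(X**2 + Y**2) / (2.0 * sd**2))  # isotropic gaussian
    return envelope * np.cos(2.0 * np.pi * freq * u + phase)

def linear_response(image, rf):
    """Discrete approximation to the inner product v of equation 2.6."""
    return np.sum(image * rf)
```

The response v is then equally likely to be positive or negative for a random dot pattern, which is what makes the threshold discussed below meaningful.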
Figure 2: (A) Circuit diagram for our modified version of the energy model (Read et al., 2002). Binocular subunits (BS) are characterized by receptive fields in left and right eyes. The initial processing of the retinal images is linear, but input from left and right eyes is thresholded before being combined together. Here, this thresholding is represented by a layer of monocular simple cells (MS). In our simulations, the receptive field functions are two-dimensional Gabor functions (shown with gray scale: white = ON region, black = OFF region). The dots in the receptive field profiles mark the centers of the receptive fields, which are taken to be the peak of the gaussian envelope. (B) Disparity tuning surface for this binocular subunit. The gray scale indicates the mean firing rate of a simulated binocular subunit to 50,000 random dot patterns with different disparities: white = high firing rate (normalized to 1), gray = baseline firing rate obtained for binocularly uncorrelated patterns; black = low firing rate. The dot shows the difference between the receptive field centers, which is the position disparity of this binocular subunit (zero in this example).
In our modified version of the energy model (Read et al., 2002), the response of a binocular subunit is

B = [Θ(vL) + Θ(vR)]^2,   (2.7)

where the subscripts L and R indicate left and right eyes, and Θ denotes a threshold operation: Θ(v) = v − q if v exceeds some threshold level q, and 0 otherwise. In part C of the Results (section 3.3), the threshold q was set to zero (half-wave rectification). Since any given random dot pattern is as likely to occur as its photographic negative, the inner product in equation 2.6 is equally likely to be positive or negative. Thus, a threshold of zero implies that the monocular simple cell fired to only half of all random dot patterns.
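In code, the threshold nonlinearity and the subunit response of equation 2.7 might look like the following (a sketch with our own function names; the inhibitory variant corresponds to equation 2.8 in section 2.2.1 below):

```python
import numpy as np

def thresh(v, q=0.0):
    """Threshold operation: v - q if v > q, else 0 (half-wave rectification at q=0)."""
    return np.maximum(v - q, 0.0)

def binocular_response(vL, vR, q=0.0):
    """Modified energy-model subunit, equation 2.7: B = [T(vL) + T(vR)]^2."""
    return (thresh(vL, q) + thresh(vR, q)) ** 2

def binocular_response_inhibitory(vL, vR, q=0.0):
    """Subunit with inhibitory input from the right eye, equation 2.8."""
    return np.maximum(thresh(vL, q) - thresh(vR, q), 0.0) ** 2
```

The inhibitory variant gives zero response to stimulation of the inhibitory eye alone, matching the "monocular" but disparity-selective cells discussed below.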
In part B of the Results, the threshold was raised above zero, such that the monocular simple cell fired to only 30% of all random dot patterns.

2.2.1 Inhibitory Input from One Eye. We also consider the case where input from one eye influences the binocular subunit via an inhibitory synapse. In the expression for the response of the binocular subunit, the inhibitory synapse is represented by a minus sign:

B = [Pos{Θ(vL) − Θ(vR)}]^2,   (2.8)
where we have explicitly included half-wave rectification (Pos(x) = x if x > 0, and 0 otherwise), since it is now possible for the net input to the binocular subunit to be negative, and it should not fire in this case. A subunit of this type is disparity selective while still being "monocular" in that it does not respond to stimulation in the inhibitory eye alone. Many such cells are observed (Ohzawa & Freeman, 1986a; Poggio & Fischer, 1977; Read & Cumming, 2003a, 2003b).

2.2.2 The Original Energy Model. We also consider the response of the original energy model (Ohzawa et al., 1990). This assumes that each binocular subunit receives bipolar linear input from both eyes, so its response is

B = [vL + vR]^2.   (2.9)

This circuitry is sketched in Figure 8A.

2.2.3 Response Normalization. In part C of the Results (section 3.3), we consider the effect of incorporating a form of response normalization. This postulates that monocular inputs are gated by inhibitory zones outside the classical receptive field, before they converge on a binocular subunit (see Figure 3). The response of a binocular subunit is now

B = [Θ(vL zL) + Θ(vR zR)]^2,   (2.10)
where the "gain factor" z describes the total amount of inhibition from the inhibitory zones. z ranges from 1 (inhibitory zones are not activated; output from the monocular subunit is allowed through unimpeded) to 0 (inhibitory zones are highly activated; the monocular subunit is silenced). To work out how much a particular retinal image activates the inhibitory zones, we calculate the inner product of the retinal image with each inhibitory zone's weight function w(x, y) (analogous to a receptive field):

u = ∫_{−∞}^{+∞} dx ∫_{−∞}^{+∞} dy I(x, y) w(x, y).
Figure 3: Circuit diagram for a binocular subunit in which the response from a classical receptive field is gated by input from inhibitory regions. As in Figure 2, monocular subunits, each characterized by a receptive field, feed into a binocular subunit. However, now the input from each eye may be suppressed, prior to binocular combination, by activity in horizontally elongated inhibitory regions. Excitatory connections are shown with an arrowhead, and quasi-divisive inhibitory input is shown with a filled circle. In the plots of the inhibitory regions, the gray scale shows several inhibitory zones in a ring around the receptive field in each eye.
The weight functions are gaussians, which ensures that the inhibition is spatially broadband; this is designed to approximate the total input from a large pool of neurons with different spatial frequency tunings. In order to obtain the desired horizontally elongated disparity tuning surfaces, we postulate that the inhibitory zones are elongated along the horizontal axis (the elliptical contours in Figure 10A). This means that each inhibitory zone individually is orientation selective; however, we include multiple inhibitory zones, so arranged that their total effect is independent of orientation. Thus, the inhibition does not alter the orientation tuning of the original receptive field. The amount of inhibition contributed by each zone is a sigmoid function of u. The total inhibition from n zones is

z = (1/n) Σ_{j=1}^{n} 1 / (1 + exp[(|uj| − c)/s]).   (2.11)
The parameters c and s are chosen so that when |u| is small, the sigmoid function is close to 1, allowing monocular input to pass unimpeded, but as the magnitude of u increases, it falls to zero, meaning that more and more of the monocular input is blocked. This is qualitatively similar to divisive normalization (Heeger, 1992, 1993). c and s are small relative to the typical activation of each inhibitory zone, so that the sigmoid is less than half for 85% of the random dot patterns. We take the absolute value |u| so that the inhibitory zones are activated by both bright and dark features. In the simulation of Figure 10, we used 10 inhibitory zones, which were gaussians with standard deviations 0.31 degree in the horizontal direction and 0.04 degree vertically. The inhibitory zone centers were placed at 0.5 degree from the origin, at polar angles that are integer multiples of 36 degrees. As explained in section 3, these precise values are unimportant. Again, the input from one eye can be purely subtractive inhibition; the output of the binocular subunit is then

B = {Pos[Θ(vL zL) − Θ(vR zR)]}^2.   (2.12)
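Equation 2.11 is straightforward to implement (a minimal sketch; the function name and the parameter values are illustrative, not those of the simulations):

```python
import numpy as np

def inhibitory_gain(u, c, s):
    """Equation 2.11: total gain z from the activations u of the inhibitory zones.

    Near 1 when all |u| are small (monocular input passes unimpeded),
    near 0 when they are large (monocular input is blocked).
    """
    u = np.asarray(u, dtype=float)
    return np.mean(1.0 / (1.0 + np.exp((np.abs(u) - c) / s)))
```

Because the gain is an average of per-zone sigmoids of |u|, both bright and dark features in any zone pull z toward zero, giving the quasi-divisive suppression sketched in Figure 3.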
2.2.4 Receptive Fields. Model receptive fields are two-dimensional Gabor functions with isotropic gaussian envelopes of standard deviation 0.2 degree and carrier frequency 2.5 cycles per degree. Spatial frequency and orientation tuning are experimentally observed to be similar in the two eyes (Bridge & Cumming, 2001; Ohzawa, DeAngelis, & Freeman, 1996; Read & Cumming, 2003b), so in our simulations, these are the same in both eyes. Left- and right-eye model receptive fields differ only in their position on the retina; none of the models in this article has phase disparity (although it would be easy to incorporate it).

2.2.5 Stimuli. Orientation tuning was assessed using sinusoidal luminance grating stimuli with a spatial frequency of 2.5 cycles per degree. Disparity tuning was assessed using random dot stereograms with dot density
25%, as in the relevant experimental studies (cf. Cumming, 2002). The model response varies depending on the pattern of dots in the individual random dot stereogram. We therefore show the mean response, averaged over 50,000 random dot patterns generated with different seeds. This simulation was repeated for every combination of horizontal and vertical stimulus disparity on a two-dimensional grid, so as to obtain the model's disparity tuning surface (analogous to the one-dimensional disparity tuning curve obtained when disparity is applied along a single axis). All simulated disparity tuning surfaces are shown normalized, so that 1 is the largest mean response at any disparity. The random dot patterns used dots of 2 × 2 pixels. The size of the image and model retina was either 41 × 41 pixels, representing 1.4° × 1.4° (30 pixels per simulated degree; see part B of the Results, section 3.2), or 221 × 131 pixels, representing 2.2° × 1.3° (100 pixels per simulated degree; see part C of the Results, section 3.3). For part C, the monocular-suppression model, a higher resolution was needed because disparity tuning is suppressed at large disparities, so we need to be able to study the disparities around zero in more detail. All simulations were performed in Matlab 6.1.

3 Results

3.1 A: The Probability Distribution of Two-Dimensional Disparity. Stereopsis appears to be a phenomenon mainly of central vision; stereoscopic thresholds rise sharply for eccentric stimuli. We therefore consider only the central 15 degrees of cyclopean vision (i.e., the region of space projecting within 15 degrees of the fovea in both retinas). Obviously, 30 degrees is an upper bound for the disparity of any object in the central 15 degrees of vision. Most points on the retina will be associated with much smaller disparities; the precise distribution depends on the position of the eyes. To find the distribution for a given eye position, we have to consider each point PL on the left retina in turn.
For each PL , we find the set of points PR in the right retina that are possible physical correspondences for PL . This is the epipolar line, and because we are restricting ourselves to central vision, we take only that portion of the epipolar line that lies within 15 degrees of the fovea. This gives us a set of possible disparities that could be associated with the point PL . So every point in the left retina gives us a set of possible disparities. If we repeat this for all points in both retinas, we obtain the set of all the physical correspondences that are possible within the central visual field for this eye position (see section 2). Figure 4A shows the probability distribution of retinal disparity for the eye position specified by a Helmholtz elevation V of 15 degrees, mean Helmholtz azimuth Hc of 15 degrees, and vergence D of 10 degrees. That is, the eyes are converged and looking down and off to the left. Note that the range of the horizontal axis is 10 times wider than that of the vertical axis. The dotted contour line marks the extent of possible physical correspon-
Figure 4: The probability distribution of horizontal and vertical disparities for a vergence angle of 10 degrees. (A) Helmholtz elevation V = 15 degrees, mean Helmholtz azimuth Hc = 15 degrees. (B) Helmholtz elevation V = 0 degree, mean Helmholtz azimuth Hc = 0 degree. In each case, the outer dotted contour marks the limit of possible disparities (beyond this contour, the probability density is zero). In A, this contour is only approximate: its irregularities reflect the relatively small number of retinal positions investigated. The solid contour marks the median (50% of randomly chosen possible correspondences lie within this iso-probability contour). In A, the SD of the isotropic gaussian used for smoothing was 0.04 degree; in B, it was 0.01 degree. The width of the distribution’s central ridge is limited by this smoothing.
dences. Outside this boundary, the probability density is zero; that is, for the given eye position, there can be no physical correspondences with that disparity. For this eye position, the maximum vertical disparity is 2.3 degrees, but this occurs with a horizontal disparity of 20 degrees, far too large for fusion. If we restrict ourselves to horizontal disparities within Panum's fusion limit, |δx| < 0.5°, the maximum vertical disparity is around 1 degree. The solid contour line marks the median; that is, 50% of possible physical correspondences fall within the solid line. Within this boundary, the range of vertical disparities is even smaller. Figure 4A demonstrates that large, vertical disparities can occur. But in fact, during natural viewing, it must be relatively uncommon to have the eyes elevated as much as 15 degrees and turned as much as 15 degrees to the side. We flick our eyes around a scene, but we do not usually spend a long time directing our eyes eccentrically; if that became necessary, we would move our head or the object of interest so that it could be viewed with Hc ∼ V ∼ 0°, where large, vertical disparities are even rarer. Figure 4B shows the distribution of retinal disparity for Hc = V = 0° with the same vergence as before, D = 10 degrees. Now, the largest vertical disparity is 0.35 degree. As before, the solid contour line encloses 50% of the possible physical correspondences. While this contour extends to horizontal disparities as large as 15 degrees, it never reaches a vertical disparity as large as 0.05 degree. Of course, which physical correspondences actually occur depends on the scene being viewed. But since there are far more possible correspondences with vertical disparities less than 0.1 degree, it seems reasonable to assume that with the given eye position (D = 10 degrees, Hc = V = 0 degree), it would be rare for a natural scene to contain an object with a vertical disparity more than 0.1 degree. In contrast, with this eye position, extremely large horizontal disparities are common. To estimate the incidence of vertical disparities experienced by the visual system, we need to find the disparity distribution averaged not only over retinal position, as in Figure 4, but over all eye positions. For simplicity, we assume that Hc, V, and D are independently and normally distributed with mean zero (for D, since fixation requires a positive vergence angle, we use only the positive half of the gaussian). Figure 5 shows the resulting distributions for two different sets of values for the standard deviations of these distributions. In these plots, we have homed in on the small range of horizontal disparities that can actually be fused rather than the full range of possible horizontal disparities. Note that again the range of the horizontal axis is 20 times that of the vertical. In Figure 5A, gaze direction is assumed
Figure 5: The distribution of horizontal and vertical disparities for normal viewing, averaged over 1000 random eye positions. Helmholtz elevation V and mean Helmholtz azimuth Hc are picked from a gaussian with mean 0 degree and SD either 10 degrees (A) or 20 degrees (B). To obtain vergence D, a random number is picked from a gaussian with mean 0 degree and SD either 3 degrees (A) or 8 degrees (B), and then vergence is set to the absolute value of this number. The resulting vergence distribution has a mean 2.4 degrees (A) or 6.4 degrees (B), and SD 1.8 degrees (A) or 4.8 degrees (B). The contour line marks the median for the section of the probability distribution shown (50% of randomly chosen possible correspondences with |δx| < 0.5 and |δy| < 0.05 lie within this iso-probability contour).
to be fairly tightly distributed about Hc = V = 0 degree (with a standard deviation of 10 degrees), while vergence is generally small (mean 2.4 degrees, SD 1.8 degrees). Figure 5B gives more weight to eccentric and converged gaze (the standard deviation of Hc and of V is 20 degrees, meaning that the person spends 5% of his or her time gazing more than 40◦ off to the side, while the vergence angle D has mean 6.4 degrees, SD 4.8 degrees, meaning that the person spends 20% of the time looking at objects closer than about 34 cm). These distributions are chosen to reflect opposite extremes of behavior, to demonstrate that our conclusions do not depend critically on the assumptions made about the distribution of eye position during normal viewing. The only critical assumption is that binocular eye movements are coordinated such that the gaze lines always intersect. Under both sets of assumptions, the most striking feature in Figure 5 is the extreme narrowness of the disparity distributions in the vertical dimension (especially given the much larger scale on the vertical axis): less than 0.01 degree in both cases. At first sight, this extreme narrowness is hard to reconcile with Figure 4A, where the probability distribution peaks for −0.5◦ < δy < 0◦ . The solution is that the plots in Figure 5 represent the average of many plots like those shown in Figure 4. These all have a rodlike structure, with the orientation of the rod depending on the particular eye posture. But all the rods pass close to zero disparity. Thus, when we average over all eye postures, we end up with a strong peak close to zero disparity. This peak is stronger in Figure 5B, where the gaze and vergence are more variable. The rods are thus more widely dispersed, and we end up with a more pronounced peak at their common intersection. In Figure 5A, the eyes are assumed to stay fairly close to primary position, and so the rods are almost all close to horizontal. 
The averaged distribution is thus more elongated horizontally. The slight bias toward crossed (positive) disparities particularly evident in Figure 5A stems from an obvious geometric reason: the amount of uncrossed disparity, −δx, can never exceed the vergence angle D. (This restriction is visible as a sharp boundary in Figure 4.) No such restriction exists for crossed disparity. We began this discussion by restricting ourselves to central vision—the central 15 degrees around the fovea. If we widen this to the central 45 degrees—essentially the entire retina—then obviously larger disparities, both horizontal and vertical, become possible. However, the shape of the disparity distribution remains highly elongated horizontally (not shown). This analysis has shown that independent of the precise assumptions made, the distribution of two-dimensional disparity experienced by the visual system in natural viewing is highly elongated. Large, horizontal disparities are far more likely to be encountered than vertical disparities of the same magnitude. We would therefore expect disparity detectors in V1 to be tuned to a considerably wider range of horizontal than vertical disparities. Following conflicting reports in earlier studies (Barlow et al., 1967; Joshua & Bishop, 1970; von der Heydt, Adorjani, Hanny, & Baumgartner, 1978),
recent experimental evidence shows that this is indeed the case. Cumming (2002) extracted a single preferred two-dimensional disparity for each of 60 neurons and found that the distribution of this preferred disparity was horizontally elongated, with an SD of 0.225 degree in the horizontal direction and only 0.064 degree vertically. This observation is simple to incorporate within existing models: for example, if a cell’s preferred disparity reflects the position disparity between its receptive field locations in the two eyes, then the horizontally elongated distribution means that there is greater scatter in the horizontal than in the vertical location of the two eyes’ receptive fields. However, Cumming also found that the disparity tuning for individual neurons was horizontally elongated, independent of their orientation tuning. While this too makes sense in view of the extreme horizontal elongation of the disparity distribution shown in Figure 5, it presents problems for existing models of individual neurons. In the next section, we explain why and investigate how these models can be altered so as to reconcile them with the data.

In existing physiological models of disparity selectivity, tuning to two-dimensional binocular disparity must reflect the monocular tuning to orientation. This is illustrated in Figure 2, which sketches the circuitry underlying our model of disparity selectivity. In this example, the left and right eye receptive fields are oriented at 45 degrees to the horizontal (see Figure 2A). The disparity tuning surface (see Figure 2B) is approximately the cross-correlation of the two receptive fields (the thresholds prior to binocular combination in our model mean that it is not exactly the cross-correlation, but the approximation is accurate enough to be helpful). It is therefore elongated along the same axis as the receptive field.
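The approximation that a linear binocular subunit's disparity tuning surface is roughly the cross-correlation of its two receptive fields can be checked numerically. The Python sketch below is illustrative only (the Gabor parameters are our choices, not the model's); it confirms that for identical oblique receptive fields the surface is elongated along the preferred orientation.

```python
import numpy as np

def gabor(x, y, theta_deg=45.0, sigma=0.25, freq=2.0):
    """Oblique Gabor receptive field; theta_deg is the bar orientation.
    All parameter values here are illustrative, not taken from the paper."""
    phi = np.radians(theta_deg + 90.0)      # carrier runs orthogonal to bars
    u = x * np.cos(phi) + y * np.sin(phi)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * u)

ax = np.linspace(-1, 1, 129)
X, Y = np.meshgrid(ax, ax)
rf = gabor(X, Y)                            # identical RFs in the two eyes

# For identical left/right RFs, the cross-correlation is the autocorrelation;
# compute it via the FFT (wrap-around is negligible since rf decays to ~0).
F = np.fft.fft2(rf)
surface = np.fft.fftshift(np.fft.ifft2(F * np.conj(F)).real)

c = len(ax) // 2                            # index of zero disparity
s = 8                                       # a small disparity step
along = surface[c + s, c + s]               # displaced along the 45 deg bars
ortho = surface[c - s, c + s]               # displaced orthogonal to the bars
print(along > 0 and along > ortho)          # elongated along the RF axis
```

Moving the same distance orthogonal to the preferred orientation crosses the carrier's oscillation, so the response falls off rapidly there, exactly as described in the text.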
This means that as we move away from the cell’s preferred disparity along an axis orthogonal to the preferred orientation, the response falls off rapidly, whereas it falls off only slowly along an axis parallel to the preferred orientation. In this sense, therefore, the cell is most sensitive to disparities applied orthogonal to its preferred orientation. The challenge is to find a way of making the cell most sensitive to vertical disparity, while keeping its oblique orientation tuning. We consider two possible mechanisms that achieve this.

3.2 B. Multiple Binocular Subunits with Horizontal Position Disparity. Tuning to nonzero disparity may arise if the left and right eye receptive fields differ in their position on the retina, or in the arrangement of their ON and OFF subregions. In recent years, there has been intense debate about the relative contribution of the two mechanisms, known as position and phase disparity (Anzai, Ohzawa, & Freeman, 1997, 1999; DeAngelis, Ohzawa, & Freeman, 1991; DeAngelis et al., 1995a; Fleet et al., 1996; Ohzawa et al., 1997; Prince, Cumming, & Parker, 2002; Tsao, Conway, & Livingstone, 2003; Zhu & Qian, 1996). Both mechanisms predict a disparity tuning surface elongated along the preferred orientation of the cell, and thus preclude the experimentally observed specialization for horizontal disparity. However,
all previous analyses have assumed that position and phase disparity are the same for all binocular subunits feeding into a given complex cell. Here, we relax this assumption and show that this breaks the relationship between disparity tuning surface and orientation tuning, allowing a specialization for horizontal disparity independent of orientation selectivity. In Figure 2A, the left and right eye receptive fields are identical and occupy corresponding positions on the retina. The receptive fields are represented by Gabor functions; the dot indicates the receptive field center, which is in the same place in each retina. Thus, the disparity tuning surface has its peak at zero disparity (the dot in Figure 2B). If the receptive fields were offset from one another, they would have a position disparity, and the peak of the disparity tuning surface would be offset accordingly. We now consider a complex cell that sums the outputs of several binocular subunits that differ in their horizontal position disparity. Figure 6A shows the centers of the left eye receptive fields for 18 binocular subunits that all feed into the same complex cell. (For comparison, one receptive field is indicated with contour lines.) The gray scale shows the sum of the gaussian envelopes for all 18 subunits, which is almost circularly symmetrical. Thus, no anisotropy is visible when the cell is probed monocularly. The disparity tuning of the complex cell depends on how pairs of monocular receptive fields are wired together into binocular subunits. We choose
to pair together receptive fields that have the same vertical position on the retina but differ in their horizontal position. This means that the binocular subunits all have zero vertical position disparity but differ in their horizontal position disparity. Three examples are shown in Figure 7; the position disparities for these three subunits are indicated with different symbols in Figure 6C. For example, the top subunit in Figure 7, marked with a diamond, has zero position disparity, whereas the bottom subunit in Figure 7, marked with a triangle, has a position disparity of −0.6 degree. Because of the horizontal scatter of the subunits, the disparity tuning surface of the complex cell (see Figure 6C) is elongated along the horizontal axis, even though the monocular receptive field envelope is isotropic (see Figure 6A), and the preferred orientation remains at 45 degrees (see Figure 6B), reflecting the individual receptive fields. This demonstrates that disparity tuning
Figure 6: Facing page. Response properties for a multiple subunit complex cell. (A) Monocular receptive field envelope. The gray scale shows the sum of the gaussian envelopes for all 18 receptive fields in one eye. The dots indicate the center of the receptive fields. The square is the center of the receptive field shown in Figure 2 (marked here with contour lines; solid lines show ON regions and broken lines OFF regions). (B) Orientation tuning curve. This shows the mean response of the complex cell to drifting grating stimuli at the optimal spatial frequency, as a function of the grating’s orientation. The preferred orientation is 45 degrees, reflecting the structure of the individual receptive fields. (C) Disparity tuning surface. This shows the mean response of the complex cell to random dot stereograms as a function of two-dimensional disparity. The disparity tuning surface is clearly elongated along the horizontal axis. The responses have been normalized to one, as indicated with the scale bar. The superimposed dots show the position disparities of the individual binocular subunits; the 18 subunits have 9 different disparities. The circuitry for three subunits is sketched in Figure 7; the position disparity for the three subunits shown there is here indicated with matching symbols. In the top subunit in Figure 7, the receptive fields in the two eyes are identical; this subunit therefore has zero position disparity (the diamond at (0,0)). The middle subunit in Figure 7 has receptive fields at different positions in each retina. It therefore has a horizontal position disparity, marked with a triangle pointing up. The bottom subunit in Figure 7 has an even larger horizontal position disparity, marked with a triangle pointing right. The tuned-excitatory disparity tuning surface shown here is the sum of 18 disparity tuning surfaces like that in Figure 2, but offset from one another horizontally. 
Thus, although the individual tuning surfaces were elongated along an oblique axis, the resulting disparity tuning surface is elongated horizontally. (D) Disparity tuning surface for a tuned-inhibitory complex cell. This was obtained with exactly the same receptive fields and subunits as in C, except that now, one out of each pair of monocular subunits made an inhibitory synapse onto the binocular subunit (see equation 2.8).
Figure 7: Circuit diagram indicating how multiple binocular subunits (BS) can be combined to yield a complex cell whose disparity tuning surface is elongated along the horizontal axis. Our model complex cell receives input from 18 binocular subunits, of which 3 are shown. The gray scale plots on the left show the monocular receptive field functions for monocular simple cells (MS). These are combined into binocular subunits (BS); note that the monocular simple cells apply an output threshold prior to binocular combination. The centers of the receptive field envelopes are shown with the symbols used to identify these disparities in Figure 6. As shown by the positions of these symbols, the three subunits have different horizontal position disparities (0 degree, −0.3 degree, −0.6 degree) and no vertical disparity. The receptive field phase is chosen randomly for each subunit, but within each subunit, it is the same for each eye.
and orientation tuning can be decoupled in a cell that receives input from multiple subunits with different position disparities. In Figure 7, all the binocular subunits receive excitatory input from the monocular subunits feeding into them, and the left and right receptive fields for each subunit are in phase. This results in a “tuned-excitatory” type of disparity tuning surface (Poggio & Fischer, 1977), in which the cell’s firing rate rises above the baseline level for a range of preferred disparities (see Figure 6C). If for each subunit one eye sends excitatory input and the other inhibitory (Read et al., 2002), then we obtain a “tuned-inhibitory” complex cell, whose firing rate falls below the baseline for a particular range of disparities (see Figure 6D). The position disparity of the individual subunits works in exactly the same way as before, so this cell’s disparity tuning surface is also elongated along the horizontal axis. Thus, this model works for both tuned-excitatory and tuned-inhibitory cells.

3.2.1 Energy Model Subunits Tend to Cancel Out. In the simulations above, we used our modified version of the stereo energy model (Ohzawa et al., 1990) because this is in better quantitative agreement with experimental data than the original energy model (Read & Cumming, 2003b; Read et al., 2002). As we shall see, this modification is also key to the success of the multiple-subunit model shown in Figures 6 and 7. The receptive fields in the example cells in Figures 2 and 7 were chosen to be similar to the type of receptive fields derived from reverse correlation mapping (DeAngelis et al., 1991; DeAngelis, Ohzawa, & Freeman, 1995b; Jones & Palmer, 1987; Ohzawa et al., 1996). They are spatially bandpass, containing several ON and OFF subregions and little power at DC. In the energy model, bandpass receptive fields must yield bandpass disparity tuning curves (i.e., curves with several peaks and troughs).
One problem for the energy model is that multiple peaks are not often seen in experimental disparity tuning curves. These are frequently close to gaussian in shape, with only weak side lobes, even when the spatial frequency tuning is bandpass. That is, real disparity tuning curves seem to be more low-pass than predicted from the energy model. This discrepancy has been noted by various authors (Ohzawa et al., 1997; Prince, Pointon, Cumming, & Parker, 2002), and quantified in detail by us (Read & Cumming, 2003b). Our modification to the energy model, introducing thresholds prior to binocular combination, helps fix this problem by removing side lobes from the disparity tuning curves, making them more low-pass and in better agreement with experimental data (Read & Cumming, 2003b). It turns out that this reduction of the side lobes is also what enables us to construct a horizontally elongated disparity tuning surface by combining multiple subunits with different horizontal position disparities. To see why, consider what happens without the monocular thresholds. Figure 8A shows the circuit diagram of a binocular subunit according to the original energy model; the key feature is that bipolar input from the
Figure 8: (A) Circuit diagram for a single binocular subunit from the original energy model (Ohzawa et al., 1990). In contrast to our modified version (see Figure 2), inputs from the two eyes are combined linearly (though the binocular subunit still applies an output nonlinearity of half-wave rectification and squaring). This results in more bandpass disparity tuning. (B) The side lobes in the disparity tuning surface shown here are much deeper than in Figure 2. This surface has almost no power at DC. As a consequence, when several binocular subunits with different horizontal position disparity combine to form a complex cell, their disparity tuning surfaces cancel out over most of the range (cf. Figure 9A). (C) The disparity tuning surface for a complex cell receiving input from 18 energy model binocular subunits whose position disparities are indicated by the symbols. Instead of being elongated horizontally as in Figure 6C, all but the two end subunits are canceled out by their neighbors, so it has two separate regions of oscillatory response. This does not resemble the responses recorded from real cells.
two eyes is combined linearly. The disparity tuning surface for this subunit is shown in Figure 8B. Note that the central peak is flanked by two deep troughs where firing falls well below baseline, so the surface as a whole has very little power at DC (compare Figure 8B to Figure 2B, which shows the equivalent surface for our modified version). As a result, when we add together several copies of such disparity tuning surfaces, most of the power cancels out, leaving a highly unrealistic tuning surface containing two distinct regions of modulation (see Figure 8C). To see why this happens, consider Figure 9A. The thin, broken lines show five hypothetical disparity tuning curves (in one dimension for clarity) that have no power at DC, because their deep side troughs have the same area as their central peak. When the five offset curves are averaged (heavy line), the peak of one subunit is canceled out by the side troughs of the two adjacent subunits, so that the response averages out to the baseline level everywhere except at the two ends, where the subunits have only one neighbor. In Figure 9B, the five disparity tuning curves are gaussians, representing the effect of a
[Figure 9, panels A and B: response plotted against disparity (deg).]
Figure 9: Problems encountered in combining subunits tuned to different disparities. Thin, broken curves: hypothetical disparity tuning curves from five binocular subunits. They are identical in form but offset from one another horizontally. Heavy curve: the mean of the thin curves. (A) These binocular subunits have bandpass disparity tuning (like those obtained from the original energy model). The thin curves have near-zero DC component, since the central peak is nearly equal in area to the two side troughs. When the subunits are combined, their disparity tuning curves cancel out over most of the range, leaving an oscillatory region at each end. This does not resemble real disparity tuning curves. (B) These binocular subunits have low-pass disparity tuning (like those obtained from our modified version of the energy model). Now, combining many subunits tuned to different disparities results in a broadened disparity tuning curve resembling experimental results. However, the disparity tuning is weakened. Whereas the tuning curves for individual subunits (thin lines) have amplitude equal to their baseline, the amplitude for the combination (heavy line) is only 25% of the baseline. This problem could be solved by imposing a final output threshold on the complex cell. If the region of the plot below the dotted line could be removed by an output threshold, then the amplitude of the disparity tuning curve would be a larger fraction of the new baseline.
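The comparison drawn in Figure 9 can be reproduced in one dimension. In the Python sketch below, the Ricker and gaussian curve shapes, widths, and offsets are illustrative stand-ins for the two kinds of subunit tuning curve, not the model's actual functions: zero-mean bandpass curves cancel almost completely in the interior, while low-pass curves combine into a broad, above-baseline curve whose amplitude is only a fraction of a single subunit's.

```python
import numpy as np

d = np.linspace(-3, 3, 1201)                      # disparity axis (deg)
offsets = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # subunit preferred disparities
baseline = 1.0
sigma = 0.3

def bandpass(x):
    # Zero-mean Ricker shape: central peak equal in area to the side troughs,
    # standing in for an energy-model disparity tuning curve (Fig. 9A).
    return (1 - (x / sigma) ** 2) * np.exp(-x**2 / (2 * sigma**2))

def lowpass(x):
    # Near-gaussian shape, standing in for the modified model's curve (Fig. 9B).
    return np.exp(-x**2 / (2 * sigma**2))

bp = baseline + np.mean([bandpass(d - o) for o in offsets], axis=0)
lp = baseline + np.mean([lowpass(d - o) for o in offsets], axis=0)

interior = np.abs(d) < 0.7
print(np.abs(bp[interior] - baseline).max())  # bandpass: cancels to ~baseline
print(lp.max() - baseline)                    # low-pass: small fraction of a subunit
print(bp.min() < baseline)                    # oscillatory dips at the ends
```

The weak modulation of the low-pass combination is exactly the problem the final output threshold (the dotted line in Figure 9B) is introduced to solve.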
high threshold before binocular combination (Read & Cumming, 2003b). Because there are no inhibitory side lobes, cancellation does not occur, and the resultant disparity tuning curve (heavy curve) is above baseline for an extended range. In our simulations, the threshold was set such that each monocular simple cell fired on average for 3 out of every 10 random-dot patterns; for comparison, this figure would be 5 out of 10 for a threshold at zero (half-wave rectification). This does not entirely remove the side lobes (see Figure 2), but has enough of an effect to prevent cancellation. Thus our modification, introduced for quite different reasons, also enables us to obtain horizontally elongated disparity tuning surfaces by combining multiple subunits with horizontal position disparities. For bandpass receptive
fields, this is not possible in the energy model, although it is possible for receptive fields with a substantial DC response.

3.2.2 An Additional Output Threshold is Needed. Figure 9B also highlights a problem with this multiple-subunit model as it stands. Summing multiple subunits inevitably reduces the amplitude of the modulation with disparity. In Figure 9B, the individual disparity tuning curves have amplitude equal to the baseline response, while the amplitude of the averaged curve (black) is just 20% of the baseline. This effect is apparent on examining the scale bars in Figure 6: the amplitude of the disparity tuning is only about 5% of the baseline (the response at large disparities, where the random dot patterns in the two eyes are effectively uncorrelated). In an experiment, a cell with such weak modulation would probably not even pass the initial screening for disparity selectivity. This problem can be solved if we postulate that the complex cell applies an output threshold: that is, it fires only when its inputs exceed a certain value (cf. the dotted line in Figure 9B). For the tuned-excitatory model, it would be equally valid to postulate that the necessary threshold is applied by the individual binocular subunits. However, for tuned-inhibitory cells, the threshold must be applied by the complex cell that sums the individual subunits (this is apparent on redrawing Figure 9B with tuned-inhibitory tuning curves).

3.3 C. Monocular Suppression from Horizontally Elongated Inhibitory Zones. We now turn to an alternative way of decoupling disparity and orientation tuning. A well-known feature of V1 neurons is that they can be inhibited by activity outside their classical receptive field (Cavanaugh, Bair, & Movshon, 2002; Freeman, Ohzawa, & Walker, 2001; Gilbert & Wiesel, 1990; Jones, Wang, & Sillito, 2002; Maffei & Fiorentini, 1976; Sillito, Grieve, Jones, Cudeiro, & Davis, 1995; Walker, Ohzawa, & Freeman, 1999, 2000, 2002).
Phenomena such as end stopping, side stopping, and cross-orientation inhibition have been explained by suggesting that individual V1 neurons are subject to a form of gain control from the rest of the population, for example, by divisive normalization (Carandini, Heeger, & Movshon, 1997; Heeger, 1992, 1993). Yet existing models of disparity selectivity ignore these aspects of V1 neurons’ behavior. We now extend our model (Read et al., 2002) to include these effects. We postulate that inputs from each eye are suppressed by activity from inhibitory zones prior to binocular combination. We shall show that if the individual inhibitory zones are elongated horizontally on the retina, the disparity tuning surface will be horizontally elongated. If there are several inhibitory zones arranged isotropically, the total inhibition is independent of orientation. Thus, this horizontal elongation in disparity tuning is achieved without altering the cell’s orientation selectivity.

3.3.1 Monocular Suppression Can Decouple Disparity and Orientation Tuning. As before, we consider a binocular subunit with identical obliquely
oriented receptive fields in left and right eyes. However, now the monocular inputs are gated by inhibitory zones before being combined in a binocular subunit. This is sketched in Figure 3. If the retinal image does not stimulate the inhibitory zones, the inhibitory synapses in Figure 3 are inactive, and the output of the monocular simple cell is passed to the binocular subunit just as in the previous model. But if the retinal image stimulates the inhibitory zones, the inhibitory synapses become active, and the firing rate of the monocular cell is reduced or even silenced (see section 2 for details). This is very similar to the divisive normalization proposed to explain nonspecific suppression and response saturation (Carandini et al., 1997; Heeger, 1992, 1993; Muller, Metha, Krauskopf, & Lennie, 2003): the firing rate of a fundamentally linear neuron is suppressed by activity in a “normalization pool” of cortical neurons. In the simple model presented here, the horizontal elongation of the individual inhibitory zones, chosen in order to obtain horizontally elongated disparity tuning surfaces, means that they are tuned to horizontal orientation. This is a side effect of the computationally cheap way we have chosen to implement divisive normalization by a pool of neurons: in a full model, each inhibitory zone would represent input from a multitude of sensors tuned to different spatial frequencies and orientations, so that the response of each inhibitory zone individually would be independent of orientation. Equally, it would be possible to construct the surround such that it was sensitive to orientations other than horizontal. So long as the region over which these subunits are summed remains horizontally elongated, it would still produce the desired effect on disparity tuning. 
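The gating step described above can be sketched generically. The code below is not the paper's equation 2.10; it is a standard divisive-normalization stand-in, with all names and parameter values ours, showing how activity falling in horizontally elongated inhibitory zones suppresses one eye's signal before binocular combination.

```python
import numpy as np

def monocular_response(image, rf, zones, k=1.0):
    """Thresholded receptive-field output, divisively suppressed by the
    summed activity of inhibitory zones (all names and values hypothetical)."""
    drive = max(np.sum(rf * image), 0.0)              # half-wave rectified
    pool = sum(np.sum(z * image**2) for z in zones)   # local energy in zones
    return drive / (k + pool)                         # divisive suppression

rng = np.random.default_rng(1)
rf = rng.standard_normal((32, 32))                    # one eye's receptive field

# Two horizontally elongated inhibitory zones, above and below the RF.
zones = [np.zeros((32, 32)), np.zeros((32, 32))]
zones[0][:4, :] = 1.0
zones[1][-4:, :] = 1.0

# A stimulus confined to the classical RF passes through ungated...
center_only = np.zeros((32, 32))
center_only[12:20, 12:20] = rf[12:20, 12:20]
# ...while the same stimulus plus contrast in an inhibitory zone is suppressed.
with_surround = center_only.copy()
with_surround[:4, :] = 1.0

print(monocular_response(center_only, rf, zones) >
      monocular_response(with_surround, rf, zones))   # True
```

Note the gating is purely monocular, as in the model: each eye's zones see only that eye's image.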
In this way, it would be possible to construct a model using the same principles that matched the reported orientation selectivity of surround suppression (Cavanaugh et al., 2002; Levitt & Lund, 1997; Muller et al., 2003). However, the simple model presented here is designed to be nonspecific in that the suppression contributed by all the inhibitory zones together is independent of orientation. This means that the suppression in our model affects the disparity tuning while leaving the orientation tuning unchanged. One way of achieving this is to place the inhibitory zones in a ring around the original receptive field, as shown in Figure 10A, so that while the individual inhibitory zones are horizontally elongated, the total inhibitory region is roughly isotropic. Figure 10B shows the orientation tuning curves obtained with no suppression (solid curve) and after including the effect of suppression from the inhibitory end zones (broken curve). Despite the horizontal orientation of each inhibitory zone individually, the cell still responds best to a luminance grating aligned at 45 degrees. The ring pattern is consistent with experimental evidence indicating that suppressive regions are located all around the receptive field (Walker et al., 1999). In fact, as far as disparity tuning to random dot patterns is concerned, similar results are obtained no matter where the inhibitory zones are placed, provided that the overall arrangement is not tuned to orientation. We also ran our simulations with inhibitory end zones placed at the top and bottom of the original recep-
Figure 10: Horizontally elongated inhibitory end zones need not alter the orientation tuning. (A) Receptive field and inhibitory end zones. The gray scale shows the receptive field—a Gabor function in which a central ON region (white) is flanked by two OFF regions (black). The contours show the ten inhibitory end zones. Each inhibitory zone is a gaussian, and the contour marks the half-maximum. (B) Orientation tuning curve for the model with (dashed line) and without (solid line) gating by the inhibitory zones. This demonstrates that the horizontally elongated inhibitory zones do not alter the orientation tuning of the cell: it still responds best to a grating oriented at 45 degrees.
tive field (not shown), and obtained essentially the same results. Similarly, the gaps between the inhibitory zones in Figure 10 are not important. We chose a sparse arrangement of inhibitory zones so that the location of the individual zones would be clear; in fact, the same results are obtained with a much denser array of overlapping inhibitory zones forming a complete ring around the original receptive field (not shown). Although the inhibitory zones have little effect on orientation tuning, they have a profound effect on the two-dimensional disparity tuning observed with random dot stereograms. Figure 11A shows the disparity tuning surface that would be obtained from this binocular subunit in the absence of inhibitory zones. It is, as expected, elongated along an oblique axis corresponding to the orientation tuning. Figures 11B and 11C show the disparity tuning surface after incorporating inhibition from the ten inhibitory zones (in Figure 11B, for the same range of disparities as in Figure 11A, and in Figure 11C, focusing in on a smaller range of disparities). Comparing Figures 11A and 11B, it is apparent that the original disparity tuning surface has been greatly reduced. Suppression from the inhibitory zones has removed or weakened the cell’s responses to most disparities away from the preferred disparity (zero, in this example), especially to disparities with
Figure 11: Monocular suppression by horizontally oriented inhibitory regions can result in a horizontally elongated disparity tuning surface. (Top row) Disparity tuning surface with and without monocular suppression. (A) Disparity tuning surface for a single binocular subunit without any suppression from inhibitory zones, as sketched in Figure 2 (equation 2.7, with the threshold set at zero, so that the output nonlinearity is simple half-wave rectification). (B, C) Disparity tuning surface for the same binocular subunit after incorporating monocular suppression from inhibitory zones prior to binocular combination, as sketched in Figure 3 (see equation 2.10; again, the threshold is set at zero, giving half-wave rectification). At large scales (B), some trace of the original oblique structure remains, but in the central region where sensitivity to disparity is strongest (shown expanded in C), the structure is horizontal. (Bottom row) Correlation between left and right eye inputs as a function of disparity. (D) Correlation between output of classical receptive fields in left and right eyes (see equation 2.6). (E) Correlation between total inhibition from end zones in left and right eyes (see equation 2.11). (F) Correlation between output of classical receptive fields after inhibition from end zones.
significant vertical components. However, the response to horizontal disparities has been relatively spared. This is especially clear in Figure 11C, where we focus in on the smallest disparities. What remains of the disparity tuning surface is now elongated along a near-horizontal axis.

3.3.2 The Monocular Suppression Vetoes Vertical-Disparity Matches. The inhibitory zones effectively veto local correlations with vertical disparities. Functionally, this would be useful since such correlations are likely to be false matches, which will only hamper a solution of the correspondence problem. Yet at first sight, it is hard to understand how this can occur. In
our model, the inhibition is purely monocular: the inhibitory zones receive input from only one eye and suppress the response only in that eye. How, then, can they detect interocular correlation for vertical disparities in order to suppress it? To understand this, we first note that the output of an ordinary binocular subunit (without gating from inhibitory zones) depends on the correlation between the terms from left and right eyes, which depends on disparity. If we plot the correlation between the inputs from left and right eyes, before normalization by the inhibitory zones, as a function of horizontal and vertical disparity, the resulting correlation surface has the same orientation as the receptive fields (see Figure 11D). Similarly, the correlation between the suppressive input from the inhibitory zones in left and right eyes is oriented along the same axis as the inhibitory zones, that is, horizontally (see Figure 11E). It was to achieve this result that we made the inhibitory zones horizontally elongated. Regions of weaker correlations at disparities corresponding to the separation between nearby inhibitory zones are also visible. Finally, Figure 11F homes in on the same small range of disparities as in Figure 11C and shows the correlation between left and right eye inputs, after suppression from the inhibitory zones. In order for these to be tightly correlated, we need strong correlation between both the left and right receptive field inputs and the left and right eye inhibition. Thus, Figure 11F can be thought of as the product of Figures 11D and 11E. For disparities with a significant vertical component in Figure 11F, the lack of correlation in the inhibitory zone responses (see Figure 11E) has largely canceled out the correlation between the inputs from the receptive fields (see Figure 11D). The original obliquely oriented correlation function has been weakened, leaving a band of high correlation for small horizontal disparities. 
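The product intuition in this paragraph can be illustrated with toy correlation surfaces. In the Python sketch below, the two gaussian "rods" are stand-ins for the surfaces in Figures 11D and 11E (not the model's actual correlation functions); their product is elongated horizontally near zero disparity, as in Figure 11F.

```python
import numpy as np

dx = np.linspace(-1, 1, 201)          # horizontal disparity axis (arbitrary units)
DX, DY = np.meshgrid(dx, dx)          # DY: vertical disparity

u = (DX + DY) / np.sqrt(2)            # along a 45-degree oblique axis
v = (DX - DY) / np.sqrt(2)            # orthogonal to it
rf_corr = np.exp(-(u**2 / 0.5 + v**2 / 0.02))      # oblique rod (cf. Fig. 11D)
inh_corr = np.exp(-(DX**2 / 0.5 + DY**2 / 0.02))   # horizontal rod (cf. Fig. 11E)
gated = rf_corr * inh_corr                         # their product (cf. Fig. 11F)

# The gated correlation falls off far more slowly along horizontal disparity
# than along vertical disparity of the same magnitude.
c = 100                                # index of zero disparity
horiz = gated[c, c + 20]               # displacement of 0.2 units horizontally
vert = gated[c + 20, c]                # displacement of 0.2 units vertically
print(horiz > vert)                    # True
```

Both displacements cut across the narrow axis of the oblique rod equally; it is the horizontal inhibition rod that breaks the symmetry and vetoes the vertical-disparity match.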
This horizontally elongated correlation translates into the horizontally elongated disparity tuning surface we saw earlier (see Figure 11C). Figure 12 shows that this works for tuned-inhibitory cells (see equation 2.12) too. Thus, end stopping can decouple disparity tuning from orientation tuning, for both tuned-excitatory and tuned-inhibitory cells. This basic idea has been suggested before (Freeman & Ohzawa, 1990; Maske et al., 1986a; Maske, Yamane, & Bishop, 1986b), but no one has previously developed a quantitative model and examined its properties.

4 Discussion

Because our eyes are displaced horizontally, most disparities we encounter are horizontal. We have quantified this by producing estimates of the probability distribution of two-dimensional disparity encountered during natural viewing, making plausible assumptions for the distribution of eye posture. The results are independent of the precise assumptions used and show clearly that the probability distribution of experienced disparity is highly
Understanding Horizontal Disparity Specialization
2011
Figure 12: Monocular suppression by horizontally oriented inhibitory regions can also yield tuned-inhibitory disparity tuning. As for Figures 11A to 11C, but for a model incorporating a tuned-inhibitory synapse, so the disparity tuning is of the tuned-inhibitory type.
elongated along the horizontal axis. Psychophysically, this anisotropy mirrors our different sensitivities to horizontal and vertical disparity (Stevenson & Schor, 1997). Physiologically, it is reflected in V1 at two different levels. At the population level, there is a wider scatter in the horizontal than in the vertical component of preferred disparities (Barlow et al., 1967; Cumming, 2002; DeAngelis et al., 1991). At the level of individual cells, disparity tuning surfaces are usually elongated along the horizontal axis, regardless of orientation tuning (Cumming, 2002). While the specialization at the population level was expected and is straightforward to incorporate into existing models, the specialization found in individual cells is at first sight extremely surprising. It is not only incompatible with all existing models of the response properties of disparity-tuned cells; it is the exact opposite of the behavior previously predicted to be functionally useful for stereopsis. Many workers have argued that cells mediating stereopsis should be most sensitive to horizontal disparity; their firing rate should change more steeply as a function of horizontal than of vertical disparity (DeAngelis et al., 1995a; Gonzalez & Perez, 1998; LeVay & Voigt, 1988; Maske et al., 1986a; Nelson et al., 1977; Ohzawa & Freeman, 1986b). Given the expectation that the direction of maximum disparity sensitivity would be orthogonal to the cell’s preferred orientation, this led to the expectation that cells tuned to vertical orientations should be most useful for stereopsis, since these were expected to be the most sensitive to horizontal disparity. It was therefore puzzling why real disparity-tuned cells occur with the full range of preferred orientations, with no obvious tendency to prefer vertical orientations. 
The recent discovery that the direction of maximum disparity sensitivity is in fact independent of orientation tuning (Cumming, 2002) would have satisfactorily solved this puzzle if it had not been for the fact that the direction of maximum disparity sensitivity was found to be vertical rather than horizontal. From the conventional point of view, this anisotropy of individual V1 neurons looks more like a specialization for
2012
J. Read and B. Cumming
vertical than for horizontal disparity. It is important, therefore, before we discuss how we have been able to reproduce the horizontally elongated disparity tuning surfaces of V1 neurons, to consider why this anisotropy occurs. Armed with hindsight and recent experimental evidence, we shall argue that the previous expectations were flawed and attempt to suggest plausible reasons that the observed anisotropy is in fact a useful specialization for horizontal disparity. The previous expectation of vertically elongated disparity tuning surfaces was based on the fact that these are most sensitive to horizontal disparity. Yet given that stereo information is usually thought to be represented in the activity of a whole population of sensors tuned to different disparities, this is not particularly relevant. The sensitivity with which the population encodes horizontal disparity is not limited by the sensitivity of the individual sensors. It is not possible, therefore, to deduce the shape of the individual disparity tuning surfaces from the need for sensitivity to horizontal disparity. A population of sensors with either horizontally elongated or vertically elongated surfaces would be equally capable of achieving high sensitivity to horizontal disparity. Imagine a population of neurons whose disparity tuning surfaces are isotropic (equally sensitive to vertical and horizontal disparity). If the scatter of center positions (preferred disparity) is equal in vertical and horizontal directions, the population carries equivalent information concerning the two disparity directions. The population response could be made more sensitive to horizontal than vertical disparity by reducing the scatter in the horizontal direction. However, the population would show such sensitivity over a narrower range of horizontal than vertical disparities. 
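The range/sensitivity argument can be made concrete with a toy population of Gaussian disparity detectors (a sketch with made-up tuning widths and scatters, not a model from this article): halving the scatter of preferred disparities raises a Fisher-information-like measure of discriminability near zero disparity while shrinking the range over which any detector responds.

```python
import numpy as np

def population(preferred, width=0.5):
    """Return a disparity axis d and Gaussian tuning curves f_i(d)."""
    d = np.linspace(-4, 4, 801)
    return d, np.exp(-(d[None, :] - preferred[:, None])**2 / (2 * width**2))

def sensitivity_at_zero(curves, d):
    """Summed squared slope at d = 0: a crude Fisher-information-like proxy."""
    i0 = len(d) // 2                       # index of d = 0
    slopes = np.gradient(curves, d, axis=1)[:, i0]
    return np.sum(slopes**2)

def encoded_range(curves, d, threshold=0.5):
    """Span of disparities where at least one detector exceeds threshold."""
    active = d[(curves > threshold).any(axis=0)]
    return active.max() - active.min()

wide = np.linspace(-2, 2, 9)      # widely scattered preferred disparities
narrow = np.linspace(-1, 1, 9)    # same number of cells, half the scatter

results = {}
for name, prefs in (("wide", wide), ("narrow", narrow)):
    d, curves = population(prefs)
    results[name] = (sensitivity_at_zero(curves, d), encoded_range(curves, d))
print(results)
```

Under these illustrative parameters, the narrow-scatter population is more sensitive near zero but covers a smaller range, exactly the trade-off described in the text.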
For simple pooling over a population of independent disparity detectors, there is a trade-off between the range of disparity encoded in any one direction and the sensitivity to small changes, regardless of the shape of individual filters. That said, the horizontal displacement of the eyes does place constraints on the maximum useful extent of the individual disparity tuning surfaces. There is no point in having any one filter encompass a larger range than is required of the population. As our simulation of the probability distribution of the two-dimensional disparities encountered during normal viewing demonstrates, large, vertical disparities are rare in comparison to large, horizontal disparities. The iso-probability contours (see Figures 4 and 5) span a greater range of horizontal than vertical disparities. Suppose that the brain ignores disparities outside a certain iso-probability contour and that this is the functional reason that we cannot fuse large disparities outside Panum’s fusional limit: such large disparities are too rare to be worth encoding. The range of disparities to be encoded in each direction places an upper limit on the useful extent of an individual cell’s disparity tuning surface in that direction. Due to the horizontally elongated shape of the iso-probability contours, this upper limit is much lower in the vertical direction than in the horizontal direction. While this does not prove that individual disparity
tuning surfaces must be horizontally elongated (it would be theoretically possible to cover the desired horizontally elongated range with vertically elongated disparity sensors), it nevertheless suggests that horizontally elongated surfaces may be more likely than vertically elongated ones. The characteristic feature of horizontally elongated disparity tuning surfaces is that they continue responding to features across a range of horizontal disparities, while their response falls quickly back to baseline if the vertical disparity changes. This may be useful, given that features with significant vertical disparity are likely to be “false matches.” There is some evidence that the brain does not explicitly search for nonzero vertical disparities when solving the stereo correspondence problem. If the eyes are staring straight ahead (more precisely, are in primary position), then there are no vertical disparities, and the epipolar line of geometrically possible matches for a given point in the other retina is parallel to the horizontal retinal meridian. When the eyes move, the epipolar lines shift on the retina, so they are not in general exactly horizontal. However, rather than recompute the position of the epipolar lines every time the eyes move, the brain seems to approximate epipolar lines by horizontally oriented search zones of finite vertical extent. The search zones are horizontal because the epipolar lines are usually close to horizontal, while their finite vertical width allows for the fact that nonzero vertical disparities do occur. Computationally, this strategy is an enormous saving. The cost is that when the eyes adopt an extreme position that gives rise to vertical disparities larger than the vertical width of the search zones, matches are not found and the correspondence problem cannot be solved (Schreiber, Crawford, Fetter, & Tweed, 2001). 
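The search-zone strategy described above amounts to restricting the stereo correspondence search to a horizontally oriented band of candidate matches. A minimal sketch (the function name and threshold are hypothetical, for illustration only):

```python
# Candidate matches are accepted only within a horizontal search zone of fixed
# vertical extent; horizontal disparity is left unrestricted, so epipolar lines
# never need to be recomputed when the eyes move.

def matches_in_search_zone(left_point, right_candidates, max_vertical_disparity=0.5):
    """Keep candidate right-eye points whose vertical disparity relative to
    left_point falls within the search zone."""
    lx, ly = left_point
    return [(rx, ry) for (rx, ry) in right_candidates
            if abs(ry - ly) <= max_vertical_disparity]

# A match with large horizontal disparity is kept; one with large vertical
# disparity (e.g., from an extreme eye position) is rejected, so correspondence
# fails for it, as in Schreiber et al. (2001).
candidates = [(3.0, 0.1), (0.2, 1.5), (-1.0, -0.3)]
print(matches_in_search_zone((0.0, 0.0), candidates))  # [(3.0, 0.1), (-1.0, -0.3)]
```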
If this theory is correct, the horizontally elongated disparity tuning surfaces in V1 (Cumming, 2002) may be a neural correlate of the psychophysics reported by Schreiber et al. (2001). The vertical extent of the search zones should then be determined by the vertical width of the disparity tuning surface and the scatter in preferred vertical disparity. The tolerance of vertical disparity estimated from Cumming’s (2002) data is ∼0.4 degree at eccentricities between 2 and 9 degrees, while the psychophysics suggests that vertical disparities of ∼0.7 degree should be tolerated at 3 degrees eccentricity (Schreiber et al., 2001; Stevenson & Schor, 1997). Given the uncertainties, these are in reasonable agreement. Of course, nonzero vertical disparities do sometimes occur, especially when the eyes are looking up, down, or off to one side, and there is evidence that they influence perception (first shown by Helmholtz, 1925). Thus, the brain must presumably contain sensors that detect vertical disparity. However, the relative rarity of vertical disparities in perceptual experience suggests that these sensors should be correspondingly sparse. This may explain why vertical disparities appear to influence perception at a more global level, being integrated across the whole image, whereas horizontal disparities appear to act more locally, contributing to a local depth map (Rogers & Bradshaw, 1993, 1995; Stenton, Frisby, & Mayhew, 1984).
The above arguments give some insight into the functional constraints underpinning V1’s observed specialization for horizontal disparity at both the population level and the level of individual cells. The next problem is how to reconcile the individual cell specialization—the horizontally elongated disparity tuning surfaces—with existing models of disparity tuning, in which disparity tuning surfaces must be elongated parallel to the preferred orientation. Thanks to a multitude of electrophysiological, psychophysical, and mathematical studies, stereopsis is one of the best-understood perceptual systems. Our understanding of the early stages of cortical processing is encapsulated in the famous energy model of disparity selectivity, which has successfully explained many aspects of the behavior of real cells (Cumming & Parker, 1997; Ohzawa et al., 1990, 1997). Although it has long been known that there are quantitative discrepancies between the energy model and experimental data (Ohzawa et al., 1997; Prince, Cumming et al., 2002; Prince, Pointon et al., 2002; Read & Cumming, 2003b), the decoupling of disparity and orientation observed in real cells (Cumming, 2002) violates a key energy model prediction in a dramatic, qualitative way. This forces us to reconsider the model’s most fundamental assumptions, such as the underlying linearity (Uka & DeAngelis, 2002). This initial linear stage has been a cornerstone of a whole class of models, not only of disparity selectivity but of V1 processing in general. Fortunately, as we demonstrate in this article, it turns out that the existing models can be fairly readily modified to allow the observed separation of disparity and orientation tuning. We present two possible ways of solving the problem: (1) multiple subunits with different horizontal position disparities and (2) monocular suppression from inhibitory zones. 
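The first of these mechanisms can be sketched with a toy version of the binocular energy model (the Gabor parameters, grid, and pixel shifts below are illustrative, not the simulations reported in this article). Summing energy subunits whose right-eye receptive fields carry different horizontal position disparities keeps the pooled response high across a range of horizontal disparities where a single subunit's response collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid: 65 samples spanning [-4, 4] degrees, so 0.125 deg per pixel.
x, y = np.meshgrid(np.linspace(-4, 4, 65), np.linspace(-4, 4, 65))

def gabor(shift_px=0):
    """Vertically oriented Gabor (carrier along x), shifted horizontally."""
    g = np.exp(-(x**2 + y**2) / 2.0) * np.cos(2 * np.pi * 0.5 * x)
    return np.roll(g, shift_px, axis=1)

def subunit(left, right, pos_px):
    """Binocular energy subunit with a horizontal position disparity (pixels)."""
    l = np.sum(gabor(0) * left)
    r = np.sum(gabor(pos_px) * right)
    return (l + r) ** 2

def complex_cell(left, right, pos_list=(-8, -4, 0, 4, 8)):
    """Sum of subunits with different horizontal position disparities."""
    return sum(subunit(left, right, p) for p in pos_list)

def mean_response(model, stim_px, trials=300):
    total = 0.0
    for _ in range(trials):
        left = rng.choice([-1.0, 1.0], size=x.shape)
        right = np.roll(left, stim_px, axis=1)   # stimulus horizontal disparity
        total += model(left, right)
    return total / trials

single = [mean_response(lambda L, R: subunit(L, R, 0), d) for d in (0, 8)]
multi = [mean_response(complex_cell, d) for d in (0, 8)]
print(single[1] / single[0], multi[1] / multi[0])
```

Averaged over random-dot stimuli, the single subunit's response at a 1 degree stimulus disparity falls far below its response at zero disparity, whereas the pooled response does not: the multiple subunits extend the tuning along the horizontal disparity axis.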
Both yield horizontally elongated disparity tuning surfaces: the first by extending the response to a wider range of horizontal disparities and the second by preventing response to any but a narrow range of vertical disparities. Both involve further modifications of the energy model, building on previous work in which we introduced a threshold prior to binocular combination (Read et al., 2002). The models are not necessarily alternative possibilities; some combination of the two may occur. For instance, complex cells in V1 may represent the sum of multiple subunits with different horizontal position disparities, with the monocular inputs to each subunit gated by horizontally elongated inhibitory zones.

4.1 Position versus Phase Disparity. The multiple subunit model relies on horizontal position disparities to achieve horizontally elongated disparity tuning surfaces. Phase disparities alone could not achieve this because differences in receptive field phase produce disparities only along an axis orthogonal to the cell’s preferred orientation. Of course, phase disparities could occur alongside the position disparities postulated here. Furthermore, phase disparities may be important in producing the wider range of preferred disparities along the horizontal than along the vertical axis that is
observed at the population level (Anzai et al., 1999; Cumming, 2002; DeAngelis et al., 1991, 1995a).

4.2 Model Tests and Predictions. When postulating new physiological models, it is important to consider how they could be tested and falsified. One important point is that, at least as developed here, the multiple subunit model works only for complex cells. The model would respond to all phases of a grating stimulus at the optimum spatial frequency, since as the grating moves, it simply passes from one subunit to the next. Thus, the observation of simple cells with horizontally elongated disparity tuning surfaces would present difficulties for the multiple subunit model: it would require a rather contrived relationship between the phase of a subunit’s receptive field and its position within the array of subunits. Since the monocular suppression model achieves horizontally elongated disparity tuning by postulating horizontally elongated inhibitory regions, the obvious way to test this model is to check whether the suppressive zones of real cells are horizontally elongated. However, if there are several suppressive zones, as in our simulations, this could be difficult to establish. For example, in our simulations, the inhibitory zones are arranged in a ring. An experimental determination of the regions where stimuli exerted a suppressive effect would discover a halo of suppression around the receptive field center. In fact, this halo is made up of many horizontally elongated components, but this might not be apparent experimentally. Thus, it is hard to test this feature of the model. Importantly, both models predict that with appropriately chosen stimuli, the horizontally elongated disparity tuning should vanish, and the coupling between disparity tuning and preferred orientation should reemerge.
For example, the monocular suppression model predicts that when the disparate stimulus is a single dot, the two-dimensional disparity tuning should be elongated along the cell’s preferred orientation, as in the energy model. This is because—at least if the inhibitory zones do not substantially overlap the classical receptive field—when the dot stimulus activates the inhibitory zones, it is already largely outside the classical receptive field, so further monocular suppression has little effect. The disparity tuning is therefore elongated along the preferred orientation, as it would be in a model without horizontally elongated inhibitory zones. In contrast, random dot patterns—and most natural stimuli—activate both the receptive field and the inhibitory regions at the same time, allowing the inhibitory zones to suppress the response to vertical disparities. In this context, it is interesting to note that Pack, Born, and Livingstone (2003), using a single disparate dot stimulus, did not report the specialization for horizontal disparity found by Cumming (2002); in their data, in both V1 and MT, disparity tuning appeared to agree well with direction tuning. Our monocular suppression model explains how this apparent discrepancy could be due to the different stimuli used. It would thus be valuable to test disparity tuning
in the same cell with both a disparate dot stimulus and with random-dot patterns. The multiple subunit model still shows a horizontally elongated disparity tuning surface when probed with a single dot stimulus, as with random dot patterns. However, with suitable stimuli, the disparity tuning of this model too can be made to reveal the oriented structure of the receptive fields. For example, consider extended one-dimensional noise stimuli (“bar codes”). The disparity of such a stimulus is effectively one-dimensional: disparities applied parallel to the bar code have no effect, so all disparities can be reduced to their component orthogonal to the bar code. Now consider a stimulus made up of two such bar codes superimposed, one aligned with the cell’s preferred orientation and the other orthogonal to it. Any disparity can be resolved into a component parallel to the preferred orientation (which would affect only the orthogonal bar code) and a component orthogonal to the preferred orientation (which would affect only the parallel bar code). Thus, disparity would effectively be applied only along the two axes orthogonal and parallel to the cell’s preferred orientation, and the disparity tuning would be elongated along these axes. This is not just an inevitable consequence of the stimulus but reflects the underlying linearity of the models presented in this article. If V1 neurons implement more sophisticated nonlinearities than proposed here, they could potentially show horizontally elongated disparity tuning surfaces even with this crossed bar code stimulus. Thus, experiments with this stimulus could be a powerful way of probing the computations performed by V1 neurons.

5 Conclusion

This work provides the first explanation of the newly observed specialization for horizontal disparity in V1.
It demonstrates that although this observation initially seemed completely at odds with all previous models, it can in fact be fairly straightforwardly incorporated into existing models of disparity selectivity. This represents an important step forward in our understanding of how cortical circuitry is specialized for binocular vision.

Acknowledgments

This work was supported by the National Eye Institute.

References

Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proc. Natl. Acad. Sci. U.S.A., 94, 5438–5443.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999). Neural mechanisms for encoding binocular disparity: Receptive field position versus phase. J. Neurophysiol., 82, 874–890.
Barlow, H. B., Blakemore, C., & Pettigrew, J. D. (1967). The neural mechanisms of binocular depth discrimination. J. Physiol., 193, 327–342.
Bridge, H., & Cumming, B. G. (2001). Responses of macaque V1 neurons to binocular orientation differences. J. Neurosci., 21(18), 7293–7302.
Bruno, P., & Van den Berg, A. V. (1997). Relative orientation of primary positions of the two eyes. Vision Res., 37(7), 935–947.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. J. Neurosci., 17(21), 8621–8644.
Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002). Selectivity and spatial distribution of signals from the receptive field surround in macaque V1 neurons. J. Neurophysiol., 88(5), 2547–2556.
Cumming, B. G. (2002). An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418(6898), 633–636.
Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.
Cumming, B., & Parker, A. (2000). Local disparity not perceived depth is signaled by binocular neurons in cortical area V1 of the macaque. J. Neurosci., 20(12), 4758–4767.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialised receptive field structure. Nature, 352, 156–159.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1995a). Neuronal mechanisms underlying stereopsis: How do simple cells in the visual cortex encode binocular disparity? Perception, 24(1), 3–31.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1995b). Receptive-field dynamics in the central visual pathways. Trends in Neuroscience, 18(10), 451–458.
Fleet, D., Wagner, H., & Heeger, D. (1996). Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1857.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organisation of binocular vision. Vision Res., 30(11), 1661–1676.
Freeman, R. D., Ohzawa, I., & Walker, G. (2001). Beyond the classical receptive field in the visual cortex. Prog. Brain Res., 134, 157–170.
Gilbert, C. D., & Wiesel, T. N. (1990). The influence of contextual stimuli on the orientation selectivity of cells in primary visual cortex of the cat. Vision Res., 30(11), 1689–1701.
Gonzalez, F., & Perez, R. (1998). Neural mechanisms underlying stereoscopic vision. Prog. Neurobiol., 55(3), 191–224.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci., 9(2), 181–197.
Heeger, D. J. (1993). Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol., 70(5), 1885–1898.
Helmholtz, H. von (1925). Treatise on physiological optics. Rochester, NY: Optical Society of America.
Jones, H. E., Wang, W., & Sillito, A. M. (2002). Spatial organization and magnitude of orientation contrast interactions in primate V1. J. Neurophysiol., 88(5), 2796–2808.
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol., 58(6), 1187–1211.
Joshua, D., & Bishop, P. O. (1970). Binocular single vision and depth discrimination: Receptive field disparities for central and peripheral vision and binocular interaction of peripheral single units in cat striate cortex. Exp. Brain Res., 10, 389–416.
Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press.
LeVay, S., & Voigt, T. (1988). Ocular dominance and disparity coding in cat visual cortex. Vis. Neurosci., 1(4), 395–414.
Levitt, J. B., & Lund, J. S. (1997). Contrast dependence of contextual effects in primate visual cortex. Nature, 387(6628), 73–76.
Maffei, L., & Fiorentini, A. (1976). The unresponsive regions of visual cortical receptive fields. Vision Res., 16(10), 1131–1139.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Maske, R., Yamane, S., & Bishop, P. O. (1986a). End-stopped cells and binocular depth discrimination in the striate cortex of cats. Proc. R. Soc. Lond. B Biol. Sci., 229(1256), 257–276.
Maske, R., Yamane, S., & Bishop, P. O. (1986b). Stereoscopic mechanisms: Binocular responses of the striate cells of cats to moving light and dark bars. Proc. R. Soc. Lond. B Biol. Sci., 229(1256), 227–256.
Minken, A. W., & Van Gisbergen, J. A. (1996). Dynamical version-vergence interactions for a binocular implementation of Donders’ law. Vision Res., 36(6), 853–867.
Mok, D., Ro, A., Cadera, W., Crawford, J. D., & Vilis, T. (1992). Rotation of Listing’s plane during vergence. Vision Res., 32(11), 2055–2064.
Muller, J. R., Metha, A. B., Krauskopf, J., & Lennie, P. (2003). Local signals from beyond the receptive fields of striate cortical neurons. J. Neurophysiol., 90, 822–831.
Nelson, J. I., Kato, H., & Bishop, P. O. (1977). Discrimination of orientation and position disparities by binocularly activated neurons in cat striate cortex. J. Neurophysiol., 40(2), 260–283.
Nikara, T., Bishop, P. O., & Pettigrew, J. D. (1968). Analysis of retinal correspondence by studying receptive fields of binocular single units in cat striate cortex. Exp. Brain Res., 6(4), 353–372.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat’s visual cortex. J. Neurophysiol., 75(5), 1779–1805.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat’s visual cortex. J. Neurophysiol., 77(6), 2879–2909.
Ohzawa, I., & Freeman, R. D. (1986a). The binocular organization of complex cells in the cat’s visual cortex. J. Neurophysiol., 56(1), 243–259.
Ohzawa, I., & Freeman, R. D. (1986b). The binocular organization of simple cells in the cat’s visual cortex. J. Neurophysiol., 56(1), 221–242.
Pack, C. C., Born, R. T., & Livingstone, M. S. (2003). Two-dimensional substructure of stereo and motion interactions in macaque visual cortex. Neuron, 37(3), 525–535.
Poggio, G. F., & Fischer, B. (1977). Binocular interaction and depth sensitivity of striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol., 40(6), 1392–1405.
Prince, S. J., Cumming, B. G., & Parker, A. J. (2002). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87(1), 209–221.
Prince, S. J., Pointon, A. D., Cumming, B. G., & Parker, A. J. (2002). Quantitative analysis of the responses of V1 neurons to horizontal disparity in dynamic random-dot stereograms. J. Neurophysiol., 87, 191–208.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404.
Read, J. C. A., & Cumming, B. G. (2003a). Ocular dominance predicts neither strength nor class of disparity selectivity with random-dot stimuli in primate V1. J. Neurophysiol., 91, 1271–1281.
Read, J. C. A., & Cumming, B. G. (2003b). Testing quantitative models of binocular disparity selectivity in primary visual cortex. J. Neurophysiol., 90(5), 2795–2817.
Read, J. C. A., Parker, A. J., & Cumming, B. G. (2002). A simple model accounts for the reduced response of disparity-tuned V1 neurons to anti-correlated images. Vis. Neurosci., 19, 735–753.
Rogers, B. J., & Bradshaw, M. F. (1993). Vertical disparities, differential perspective and binocular stereopsis. Nature, 361(6409), 253–255.
Rogers, B. J., & Bradshaw, M. F. (1995). Disparity scaling and the perception of frontoparallel surfaces. Perception, 24(2), 155–179.
Schreiber, K., Crawford, J. D., Fetter, M., & Tweed, D. (2001). The motor side of depth vision. Nature, 410(6830), 819–822.
Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378(6556), 492–496.
Somani, R. A., DeSouza, J. F., Tweed, D., & Vilis, T. (1998). Visual test of Listing’s law during vergence. Vision Res., 38(6), 911–923.
Stenton, S. P., Frisby, J. P., & Mayhew, J. E. (1984). Vertical disparity pooling and the induced effect. Nature, 309(5969), 622–623.
Stevenson, S. B., & Schor, C. M. (1997). Human stereo matching is not restricted to epipolar lines. Vision Res., 37(19), 2717–2723.
Tsao, D. Y., Conway, B., & Livingstone, M. S. (2003). Receptive fields of disparity-tuned simple cells in macaque V1. Neuron, 38, 103–114.
Tweed, D. (1997a). Kinematic principles of three-dimensional gaze control. In M. Fetter, T. P. Haslwanter, H. Misslisch, & D. Tweed (Eds.), Three-dimensional kinematics of eye, head and limb movements. Amsterdam: Harwood Academic Publishers.
Tweed, D. (1997b). Visual-motor optimization in binocular control. Vision Res., 37(14), 1939–1951.
Uka, T., & DeAngelis, G. C. (2002). Binocular vision: An orientation to disparity coding. Curr. Biol., 12(22), R764–R766.
Van Rijn, L. J., & Van den Berg, A. V. (1993). Binocular eye orientation during fixations: Listing’s law extended to include eye vergence. Vision Res., 33(5–6), 691–708.
von der Heydt, R., Adorjani, C., Hanny, P., & Baumgartner, G. (1978). Disparity sensitivity and receptive field incongruity of units in the cat striate cortex. Exp. Brain Res., 31(4), 523–545.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (1999). Asymmetric suppression outside the classical receptive field of the visual cortex. J. Neurosci., 19(23), 10536–10553.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (2000). Suppression outside the classical cortical receptive field. Vis. Neurosci., 17(3), 369–379.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (2002). Disinhibition outside receptive fields in the visual cortex. J. Neurosci., 22(13), 5659–5668.
Zhu, Y. D., & Qian, N. (1996). Binocular receptive field models, disparity tuning, and characteristic disparity. Neural Computation, 8(8), 1611–1641.

Received October 1, 2003; accepted April 6, 2004.
LETTER
Communicated by Daniel Wolpert
Different Predictions by the Minimum Variance and Minimum Torque-Change Models on the Skewness of Movement Velocity Profiles Hirokazu Tanaka Center for Neurobiology and Behavior and Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, U.S.A.
Meihua Tai Department of Mechanical Engineering, Polytechnic University, Brooklyn, NY 11201, U.S.A.
Ning Qian [email protected] Center for Neurobiology and Behavior and Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, U.S.A.
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2021–2040 (2004)

We investigated the differences between two well-known optimization principles for understanding movement planning: the minimum variance (MV) model of Harris and Wolpert (1998) and the minimum torque change (MTC) model of Uno, Kawato, and Suzuki (1989). Both models accurately describe the properties of human reaching movements in ordinary situations (e.g., nearly straight paths and bell-shaped velocity profiles). However, we found that the two models can make very different predictions when external forces are applied or when the movement duration is increased. We considered a second-order linear system for the motor plant that has been used previously to simulate eye movements and single-joint arm movements and were able to derive analytical solutions based on the MV and MTC assumptions. With the linear plant, the MTC model predicts that the movement velocity profile should always be symmetrical, independent of the external forces and movement duration. In contrast, the MV model's prediction strongly depends on the movement duration and the system’s degree of stability; the latter in turn depends on the total forces. The MV model thus predicts a skewed velocity profile under many circumstances. For example, it predicts that the peak location should be skewed toward the end of the movement when the movement duration is increased in the absence of any elastic force. It also predicts that with appropriate viscous and elastic forces applied to increase system stability, the velocity profile should be skewed toward the beginning of the movement. The velocity profiles predicted by the MV model can even show oscillations when the plant becomes highly oscillatory. Our
analytical and simulation results suggest specific experiments for testing the validity of the two models.

1 Introduction

The problem of motor planning is ill posed: for a given initial and target position, the required control signals cannot be uniquely determined (Bernstein, 1967). Conceptually, the problem can be divided into a few subproblems (Hollerbach, 1982; Kawato, Furukawa, & Suzuki, 1987): (1) the determination of a desired trajectory given an initial and a target position in the visual coordinate system, (2) the coordinate transformation of the trajectory from the visual coordinate system to the body-oriented coordinate system, (3) the determination of muscle torques for producing the desired trajectory, and (4) the specification of neuronal control signals for realizing the muscle torques. In general, each of these subproblems is ill posed due to kinematic or dynamic redundancies. To understand why the brain chooses a particular motor plan among infinitely many possibilities, it is usually assumed that the motor system tries to optimize a certain quantity related to movements. One such optimization principle is the minimum jerk (MJ) model proposed by Flash and Hogan (1985). The model determines a unique trajectory by minimizing jerk, the third temporal derivative of the trajectory, in the task-oriented coordinate system. It thus prefers paths with smooth acceleration and predicts the straight paths and bell-shaped velocity profiles often observed in human reaching movements. The MJ model is purely kinematic, and it provides a solution to the first subproblem mentioned above. Later, dynamic models were proposed that attempt to solve more than one subproblem simultaneously. We will analyze and compare two representative dynamic models in this article: the minimum torque change (MTC) model (Uno, Kawato, & Suzuki, 1989) and the minimum variance (MV) model (Harris & Wolpert, 1998).
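For reference, the MJ trajectory has a standard closed form; the sketch below uses that textbook fifth-order polynomial (nothing specific to the models analyzed in this letter). Its velocity profile is bell shaped and exactly symmetric about mid-movement, which is the baseline behavior against which skewed profiles are compared.

```python
import numpy as np

def min_jerk(x0, x1, T, n=1001):
    """Minimum-jerk point-to-point trajectory (Flash & Hogan, 1985):
    the unique quintic minimizing integrated squared jerk with zero
    endpoint velocity and acceleration."""
    t = np.linspace(0.0, T, n)
    s = t / T
    pos = x0 + (x1 - x0) * (10 * s**3 - 15 * s**4 + 6 * s**5)
    vel = np.gradient(pos, t)
    return t, pos, vel

t, pos, vel = min_jerk(0.0, 0.3, 1.0)   # a 30 cm reach over 1 s

peak_time = t[np.argmax(vel)]
print(peak_time)   # velocity peaks at mid-movement: 0.5
```

The velocity 30 s²(1 − s)² (x1 − x0)/T is symmetric under s → 1 − s, so the MJ profile can never be skewed; this is the property that dynamic factors can break in the MV model.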
The MTC model is a dynamical extension of the MJ model; it prefers smooth muscle torques instead of smooth accelerations. In contrast, the MV model introduces a signal-dependent noise term into the control signals and requires a minimum postmovement variance around the target position. These dynamic models explain a wider range of experimental observations than the MJ model does (Uno, Kawato, et al., 1998; Harris & Wolpert, 1998). Several authors have discussed how these optimization processes may be implemented in a biologically plausible network (Massone & Bizzi, 1989; Kawato, Maeda, Uno, & Suzuki, 1990; Hoff & Arbib, 1993). The MTC and MV models make similar predictions for movements under normal situations (Uno, Kawato, & Suzuki, 1989; Harris & Wolpert, 1998), such as nearly straight paths and bell-shaped velocity profiles (Kelso, Southard, & Goodman, 1979; Morasso, 1981; Abend, Bizzi, & Morasso, 1982). This is remarkable, since the two models employ very different optimization criteria: the MTC model prefers the smoothness of muscle torques during the movement, while the MV model focuses on the accuracy after the movement.

The main purpose of this article is to explore the conditions under which the two models make divergent predictions. Obviously, this is important for determining which optimization principle, if any, may be used by the brain for motor planning. Specifically, we formulate these optimization processes with a linear motor plant and show that the velocity profiles predicted by the MV model can be highly asymmetrical when we change the movement duration or the balance between viscous and elastic forces, while the MTC model always predicts a symmetrical velocity profile. In section 2, we provide an analytical formulation of the two models with a linear plant and explain the underlying reasons for their different predictions. We also determine the conditions under which the velocity profiles predicted by the MV model become skewed toward the beginning or the end of a movement. In section 3, we verify our analyses with numerical simulations of representative cases. We discuss in section 4 some specific experiments for testing the models. Since neither model includes sensory feedback, we also briefly discuss potential effects of feedback on the velocity profile in the framework of Todorov and Jordan's recent model (Todorov & Jordan, 2002; Todorov, 2004). Preliminary results have been presented in abstract form (Tanaka & Qian, 2003).

2 Analyses

Let us first define a motor plant to be used by both the MTC and the MV models. We consider a second-order linear system whose dynamics is specified by

\ddot\theta(t) + a_1\dot\theta(t) + a_0\theta(t) = \tau(t).   (2.1)
This equation is based on Newtonian mechanics and has been used for modeling eye movements (Robinson, Gordon, & Gordon, 1986) and single-joint arm movements (Hogan, 1984). (The nonlinear equations for a two-joint arm (Luh, Walker, & Paul, 1980) reduce to this linear form when the upper joint is assumed to be fixed.) Here, a_0 and a_1 are elastic and viscous constants, respectively, θ is a state variable representing eye position or joint angle, and a dot over a variable denotes the temporal derivative. τ(t) is the muscle torque. In our analysis of the MV model, we will make the simplifying assumption that τ is also the neural control signal. In other words, we assume that there is no delay between neural activity and muscle responses. This simplification allows us to use the same dynamic equation for the MTC and MV models and derive analytical solutions. In reality, the muscle torque should be a low-pass-filtered version of the control signal (Winters & Stark, 1985). We will introduce this more realistic relationship in our numerical simulations in section 3 and show that it does not change the conclusions of our analysis.
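As a concrete illustration (ours, not part of the original article), the plant in equation 2.1 can be simulated directly; the constants and the classification helper below are our own sketch, anticipating the stability discussion that follows:

```python
import numpy as np

# Minimal sketch of the plant in equation 2.1:
#   theta'' + a1*theta' + a0*theta = tau(t)
# Its eigenvalues are the roots of s^2 + a1*s + a0 = 0.

def plant_eigenvalues(a0, a1):
    """Eigenvalues governing the plant's stability."""
    return np.roots([1.0, a1, a0])

def damping_regime(a0, a1):
    """Classify a stable plant (a0, a1 > 0) by the sign of a1^2 - 4*a0."""
    disc = a1**2 - 4.0 * a0
    if disc > 0:
        return "overdamped"
    if disc == 0:
        return "critically damped"
    return "underdamped (oscillatory)"

def simulate(a0, a1, tau, theta0=0.0, dt=1e-3, t_end=1.0):
    """Forward-Euler integration of equation 2.1 starting at rest at theta0."""
    n = int(round(t_end / dt))
    theta, omega = theta0, 0.0
    traj = np.empty(n)
    for i in range(n):
        acc = tau(i * dt) - a1 * omega - a0 * theta   # theta''
        theta += dt * omega
        omega += dt * acc
        traj[i] = theta
    return traj
```

For a constant torque τ and a stable plant, θ(t) settles at the static balance τ/a_0, which is also the postmovement torque condition used later in the MV analysis.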
The second-order differential equation 2.1 can be reduced to first order by introducing a vector representation of the state:

x \equiv \begin{pmatrix} \theta \\ \dot\theta \end{pmatrix}.   (2.2)
Then the equation becomes

\dot x(t) = A x(t) + B \tau(t),   (2.3)

where the matrices A and B are defined as

A \equiv \begin{pmatrix} 0 & 1 \\ -a_0 & -a_1 \end{pmatrix}, \qquad B \equiv \begin{pmatrix} 0 \\ 1 \end{pmatrix}.   (2.4)
The stability of the plant is determined by the eigenvalues of the matrix A,

\lambda_\pm = \left(-a_1 \pm \sqrt{a_1^2 - 4a_0}\right)/2.   (2.5)
The system is stable if the real parts of both eigenvalues are negative, and unstable otherwise. This stability condition is satisfied if both a_0 and a_1 are positive. The stable system is called over-, critically, or underdamped according to whether a_1^2 > 4a_0, a_1^2 = 4a_0, or a_1^2 < 4a_0, respectively. We can control the system's degree of stability by adjusting a_0 and a_1 to alter the real-part magnitude of the eigenvalues λ±. The system can also be made oscillatory (underdamped) by increasing the elastic force relative to the viscous force.

Equation 2.3 can be integrated after a coordinate transform that diagonalizes A, and θ(t) and \dot\theta(t) can then be obtained via the inverse transform. The results are

\theta(t) = \frac{(\lambda_+ e^{\lambda_- t} - \lambda_- e^{\lambda_+ t})\theta_0 + \int_0^t \left(e^{\lambda_+(t-t')} - e^{\lambda_-(t-t')}\right)\tau(t')\,dt'}{\lambda_+ - \lambda_-}   (2.6)

\dot\theta(t) = \frac{\lambda_+\lambda_-\left(-e^{\lambda_+ t} + e^{\lambda_- t}\right)\theta_0 + \int_0^t \left(\lambda_+ e^{\lambda_+(t-t')} - \lambda_- e^{\lambda_-(t-t')}\right)\tau(t')\,dt'}{\lambda_+ - \lambda_-}   (2.7)

under the initial conditions θ(0) = θ_0 and \dot\theta(0) = 0. Thus, θ(t) and \dot\theta(t) are essentially filtered versions of τ(t). Equation 2.7 will be used in our numerical simulations of the MTC model's velocity profiles in section 3.

2.1 The Minimum Torque-Change Model. We show analytically that with the second-order linear plant, the MTC model always predicts symmetrical velocity profiles of movement.
2.1.1 The Cost Function. The cost function of the MTC model is defined as an integration of the torque changes over the movement duration [0, t_f] (Uno, Kawato, et al., 1998),

C_{MTC} = \frac{1}{2}\int_0^{t_f} \dot\tau^2(t)\,dt.   (2.8)
If there are no elastic and viscous forces in equation 2.1, this cost function reduces to that of the MJ model. By minimizing the cost function, a unique trajectory is specified among all trajectories that satisfy equation 2.1 and the appropriate boundary conditions (see below). The optimization problem can be solved by using either the Lagrange multiplier method (Uno, Kawato, et al., 1989) or the Euler-Lagrange equation (Wada, Kaneko, Nakano, Osu, & Kawato, 2001). The latter method is more convenient for proving the symmetry of the velocity profile, as we show below. The solution based on the Lagrange multiplier method is provided in the appendix and is used in our numerical simulations in section 3.

2.1.2 Symmetric Velocity Profiles Predicted by the MTC Model. By substituting equation 2.1 into equation 2.8 and applying the variational procedure with respect to θ, we obtain the following Euler-Lagrange equation (Wada et al., 2001):

\frac{d^6\theta(t)}{dt^6} - a_1^2\frac{d^4\theta(t)}{dt^4} + a_0^2\frac{d^2\theta(t)}{dt^2} = 0.   (2.9)
The boundary terms of the partial integrations vanish because of the zero velocity and acceleration conditions at the initial and final times. This equation is clearly invariant under time translation. We can thus let the movement duration run from −t_f/2 to t_f/2, so that a symmetrical velocity profile is equivalent to an even function of time. The boundary conditions are θ(−t_f/2) = θ_0, θ(t_f/2) = θ_f, and \dot\theta(-t_f/2) = \dot\theta(t_f/2) = \ddot\theta(-t_f/2) = \ddot\theta(t_f/2) = 0. The differential equation above is reduced to fifth order if we use the velocity ω(t) ≡ \dot\theta(t) as the variable:

\frac{d^5\omega(t)}{dt^5} - a_1^2\frac{d^3\omega(t)}{dt^3} + a_0^2\frac{d\omega(t)}{dt} = 0,   (2.10)
and the original boundary conditions become five new conditions: ω(−t_f/2) = ω(t_f/2) = \dot\omega(-t_f/2) = \dot\omega(t_f/2) = 0, and \int_{-t_f/2}^{t_f/2} \omega(t)\,dt = \theta_f - \theta_0. The eigenvalues of equation 2.10 are zero and the four nonzero values α_i (i = 1, 2, 3, 4) given by the characteristic polynomial

\alpha^4 - a_1^2\alpha^2 + a_0^2 = 0.   (2.11)
The solution of equation 2.10 is a linear combination of the eigenstates, \omega(t) = \beta_0 + \sum_{i=1}^{4}\beta_i e^{\alpha_i t}, where the coefficients β_i are determined by the five conditions mentioned above. Since equation 2.11 is a function of α², only two of the four eigenvalues, say α_1 and α_3, are independent; the other two are given by α_2 = −α_1 and α_4 = −α_3. It is then straightforward, though tedious, to solve for the β_i and show that they satisfy the following simple relations:

\beta_1 = \beta_2, \qquad \beta_3 = \beta_4.   (2.12)
Using this result, the velocity can be expressed as

\omega(t) = \beta_0 + \beta_1\left(e^{\alpha_1 t} + e^{-\alpha_1 t}\right) + \beta_3\left(e^{\alpha_3 t} + e^{-\alpha_3 t}\right).   (2.13)
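The pairing in equation 2.12 and the evenness of equation 2.13 can be checked numerically. The sketch below (our own illustration; the plant constants, duration, and displacement are arbitrary) solves the five boundary conditions for the β's and verifies that ω(−t) = ω(t):

```python
import numpy as np

# Solve for the coefficients of omega(t) = b0 + sum_i b_i exp(alpha_i t)
# (the general solution of equation 2.10) under the five boundary conditions,
# then verify equations 2.12-2.13. The constants are illustrative only.
a0, a1 = 1.0, 2.0          # elastic and viscous constants
tf, D = 1.0, 1.0           # movement duration and displacement theta_f - theta_0
T = tf / 2.0               # movement runs over [-T, T]

# nonzero eigenvalues from alpha^4 - a1^2*alpha^2 + a0^2 = 0 (equation 2.11)
s = np.sqrt(a1**4 - 4.0 * a0**2)
al1 = np.sqrt((a1**2 + s) / 2.0)
al3 = np.sqrt((a1**2 - s) / 2.0)
al = np.array([al1, -al1, al3, -al3])   # alpha_2 = -alpha_1, alpha_4 = -alpha_3

# rows: omega(T)=0, omega(-T)=0, omega'(T)=0, omega'(-T)=0, integral = D
M = np.zeros((5, 5))
M[0] = [1.0, *np.exp(al * T)]
M[1] = [1.0, *np.exp(-al * T)]
M[2] = [0.0, *(al * np.exp(al * T))]
M[3] = [0.0, *(al * np.exp(-al * T))]
M[4] = [2.0 * T, *((np.exp(al * T) - np.exp(-al * T)) / al)]
beta = np.linalg.solve(M, [0.0, 0.0, 0.0, 0.0, D])

def omega(t):
    """MTC velocity profile; even in t by equations 2.12-2.13."""
    t = np.atleast_1d(t)
    return beta[0] + np.exp(np.outer(t, al)) @ beta[1:]
```

The symmetry here is a consequence of the time-translation invariance of equation 2.10 together with boundary conditions that are symmetric about the midpoint of the movement.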
This solution is an even function of time, ω(−t) = ω(t). Therefore, with the linear plant, the velocity profile predicted by the MTC model is always exactly symmetrical, regardless of the parameter values.

2.2 The Minimum Variance Model. We now consider the MV model with the second-order linear plant and determine the conditions under which the velocity profiles are asymmetrical.

2.2.1 The Cost Function. According to the MV model (Harris & Wolpert, 1998), a signal-dependent noise term ξ should be added to the dynamic equation 2.3:

\dot x(t) = A x(t) + B\left[\tau(t) + \xi(t)\right].   (2.14)

ξ is assumed to be gaussian white noise with zero mean and a variance proportional to the square of the control signal τ (Harris & Wolpert, 1998):

E[\xi(t)] = 0, \qquad E[\xi(t)\xi(t')] = K\tau^2(t)\,\delta(t - t').   (2.15)
Here K is a proportionality constant that scales the amplitude of the noise, and E[·] denotes an average over the noise distribution. The dynamic equation 2.14 is stochastic, so it is not particularly informative to consider each individual trajectory. Instead, we consider the expected value of the state vector and the covariance matrix, obtained by averaging over the signal-dependent noise ξ:

E[x(t)] = e^{At}x_0 + \int_0^t e^{A(t-t')}B\,\tau(t')\,dt',   (2.16)

\mathrm{Cov}[x(t)] = K\int_0^t \left(e^{A(t-t')}B\right)\left(e^{A(t-t')}B\right)^T \tau^2(t')\,dt'.   (2.17)
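Equation 2.17 can be sanity-checked by Monte Carlo simulation of equation 2.14. The sketch below (our own illustration, with arbitrary constants and a constant control signal) compares the sample variance of θ at the final time with the (1,1) component of equation 2.17:

```python
import numpy as np

# Monte Carlo check of the covariance formula (equation 2.17) for a constant
# control signal. All constants are illustrative, not from the article.
rng = np.random.default_rng(0)
a0, a1 = 0.75, 2.0                  # stable plant, eigenvalues -0.5 and -1.5
K, tau, tf, dt, trials = 0.01, 1.0, 1.0, 1e-3, 20000
n = int(round(tf / dt))

# Euler-Maruyama simulation of equation 2.14; the white noise xi has
# variance K*tau^2/dt per step, so each increment is sqrt(K*dt)*tau*N(0,1).
theta = np.zeros(trials)
omega = np.zeros(trials)
for _ in range(n):
    noise = np.sqrt(K * dt) * tau * rng.standard_normal(trials)
    theta, omega = (theta + dt * omega,
                    omega + dt * (tau - a1 * omega - a0 * theta) + noise)
var_mc = theta.var()

# Equation 2.17, (1,1) component: the position row of exp(A(t-t'))B is the
# impulse response (exp(l+ s) - exp(l- s))/(l+ - l-) from equation 2.6.
lp, lm = np.roots([1.0, a1, a0]).real   # both eigenvalues are real here
t = np.linspace(0.0, tf, 2001)
g = (np.exp(lp * (tf - t)) - np.exp(lm * (tf - t))) / (lp - lm)
h = K * g**2 * tau**2
var_theory = np.sum((h[:-1] + h[1:]) / 2) * (t[1] - t[0])   # trapezoid rule
```

With these constants the two estimates agree to within a few percent, the residual gap coming from the Euler discretization and the finite number of trials.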
Here x_0 = (θ_0, 0)^T is the initial state vector at t = 0. These expressions are obtained by integrating the equation of motion, equation 2.14, and then taking the average over the noise variable according to equation 2.15. Consider a movement over an interval 0 ≤ t ≤ t_f. The cost function to be minimized by the MV model is the variance of the position summed over a certain postmovement duration [t_f, t_f + t_p] (Harris & Wolpert, 1998). We first consider the simplified case in which the cost function is the variance of the position at the final time point t_f only. The problem, then, is to minimize the variance of position θ at t_f under the constraint that the expected value of the state x at t_f is the target state x_f = (θ_f, 0)^T (Harris & Wolpert, 1998). The variance of the position is the (1,1) component of the 2 × 2 covariance matrix, equation 2.17. The constrained optimization can be solved by introducing a 2 × 1 Lagrange multiplier µ, with the resulting augmented cost function

C_{MV} = K\int_0^{t_f} f(t; t_f)\,\tau^2(t)\,dt - \mu^T\left(E[x(t_f)] - x_f\right),   (2.18)

where

f(t; t_f) \equiv \left[e^{A(t_f-t)}B\left(e^{A(t_f-t)}B\right)^T\right]_{1,1}.   (2.19)
The subscript denotes the (1,1) component of the matrix. f(t; t_f) is a weighting factor that determines how much the noise in the control signal τ(t) at time t contributes to the final variance of the position. By applying the variational principle with respect to τ(t) to equation 2.18, we obtain an analytical solution for the optimal control signal:

\tau(t) = \frac{\mu^T e^{A(t_f-t)}B}{2K f(t; t_f)}.   (2.20)
The Lagrange multiplier vector µ is a constant that can be determined via equation 2.16 and the boundary condition E[x(t_f)] = x_f. Note that the control signal τ(t) is inversely related to the weighting factor f(t; t_f). This makes sense because, as we mentioned above, f(t; t_f) determines how much the noise in the control signal at time t will contribute to the variance at the final time t_f. A large f means a large contribution, and the corresponding noise (and thus the control signal) should be small in order to minimize the final variance. A problem with the above simplification is that since f(t_f; t_f) = 0 according to equation 2.19, τ(t) given by equation 2.20 diverges at t_f. The problem can be avoided by using the integral of the variance (and also of the constraint) over a postmovement period [t_f, t_f + t_p] as the cost function (Harris & Wolpert, 1998). The reason is that noise at t_f does not affect the positional variance at t_f but does contribute to the variance after t_f. It can then be
shown that equation 2.20 becomes

\tau(t) = \frac{\int_{t_f}^{t_f+t_p} \mu(t')^T e^{A(t'-t)}B\,dt'}{2K\int_{t_f}^{t_f+t_p} f(t; t')\,dt'} \qquad (0 \le t \le t_f),   (2.21)

\tau(t) = a_0\theta_f \qquad (t_f < t \le t_f + t_p),   (2.22)
and the divergence problem disappears. Here, the Lagrange multiplier vector µ(t) is a function of time. The meaning of equation 2.21 is similar to that of equation 2.20, with the torque inversely proportional to the weighting factor for the noise. Equation 2.22 is the torque needed to balance the elastic force over the postmovement period in order to keep the position at θ_f.

2.2.2 Velocity Profiles Predicted by the MV Model. A precise discussion of the shape of the velocity profile predicted by the MV model requires determining the Lagrange multiplier vector µ from the boundary conditions and then substituting equation 2.20 into equation 2.7. However, since the result would be complicated, involving integrals with no closed-form solutions, we take a more heuristic approach here and verify our conclusions via numerical simulations in section 3. Specifically, we focus on the weighting factor f(t; t_f) because it determines how much the noise in the control signal τ(t) at time t contributes to the positional variance at the end of the movement, t_f. If, for example, f(t; t_f) is small at the start of a movement and becomes large near the end, then the signal-dependent noise during the early phase of the movement does not matter as much as the noise near the end, and we expect a large initial control signal and a velocity profile skewed toward the beginning. Using the same diagonalization procedure used to obtain equation 2.7, we can express the weighting factor, equation 2.19, in terms of the two eigenvalues λ± of the matrix A. For the nondegenerate case λ_+ ≠ λ_−, the weighting factor is given by

f(t; t_f) = \frac{\left(e^{\lambda_+(t_f-t)} - e^{\lambda_-(t_f-t)}\right)^2}{(\lambda_+ - \lambda_-)^2} \qquad (0 \le t \le t_f),   (2.23)

and for the degenerate (i.e., critical damping) case where the two eigenvalues λ± take the same value λ,

f(t; t_f) = (t_f - t)^2\, e^{2\lambda(t_f - t)} \qquad (0 \le t \le t_f).   (2.24)
We first consider a stable case in which both eigenvalues are real and negative. By differentiating equations 2.23 and 2.24, the weighting factor is found to have a single peak at

t^* = t_f - \frac{1}{\lambda_+ - \lambda_-}\ln\frac{\lambda_-}{\lambda_+}   (2.25)
for the nondegenerate plant and

t^* = t_f + \frac{1}{\lambda}   (2.26)
for the degenerate plant. For real and negative eigenvalues, we have |λ_−| ≥ |λ_+| according to equation 2.5, and it is easy to see that t* is always less than t_f. The system is most stable when the two negative eigenvalues have similar and large magnitudes. Under this condition, t* is very close to t_f, indicating that noise in the early phase of the movement tends to decay away quickly and does not contribute much to the final variance. This means that the control signal can be large around the start of the movement, and the velocity profile should thus be skewed toward the movement onset. As one or both eigenvalues become smaller in magnitude, the system's degree of stability decreases, and t* shifts toward the start of the movement. (With very small eigenvalue magnitudes, t* can even be less than 0, and the weighting factor will be a monotonically decreasing function over [0, t_f].) Consequently, the early noise becomes more and more important, the control signal should thus be weaker near the movement onset, and the velocity profile will eventually become more skewed toward the end of the movement.

Next, we consider the unstable case where the eigenvalues are real but at least one of them is positive. This implies a negative viscous constant, which could be realized with a robotic manipulandum. In this case, it can be shown that the weighting factor, equation 2.23 or equation 2.24, is a monotonically decreasing function of time. Again, this means that the velocity profile should be skewed toward the end of the movement.

In the case of complex eigenvalues λ± = µ ± iν, the system is oscillatory. The analytical expression for the weighting factor becomes

f(t; t_f) = \frac{e^{2\mu(t_f-t)}}{2\nu^2}\left(1 - \cos 2\nu(t_f - t)\right),   (2.27)
which is also oscillatory, with frequency ν/π. When this frequency is sufficiently high, one expects to see oscillation in the control signal and thus in the velocity profile as well. We plot in Figure 1 the weighting factor as a function of time for a few representative cases. Figure 1A shows curves corresponding to four negative eigenvalues for the degenerate (critical damping) plant. As mentioned above, the peak shifts to the left as the system's degree of stability (the magnitude of the eigenvalue) decreases. A stable overdamped case, an unstable case, and a stable underdamped (oscillatory) case are shown in Figure 1B. Finally, since the above arguments are based on how the signal-dependent noise propagates through time, one expects that the movement duration should affect the shape of the velocity profile. For a very stable system where the noise decays with time, a longer duration
Figure 1: The weighting factor f(t; t_f) plotted as a function of normalized time t/t_f. (A) Four cases of stable, critical damping with λt_f equal to −1.5, −2.0, −2.5, and −3.0 from left to right. (B) A stable overdamped case (λ±t_f = −1, −2), an unstable case (λ±t_f = 0.4, 0.2), and a stable underdamped (oscillatory) case (λ±t_f = −1 ± 5i).
will make the early noise contribute even less to the final variance, and the velocity profile should be skewed more toward the movement onset with increasing duration. This explains a result in Harris and Wolpert (1998) that the velocity profile of a saccadic eye movement model becomes more skewed toward the beginning with longer duration. Likewise, if the system is less stable or unstable, such that the velocity profile is skewed toward the end, the skewing can be slight for a short duration but should become more pronounced with increasing duration. Note that duration is a prefixed parameter in both the MV and MTC models that does not depend on movement amplitude, even though actual movement duration usually increases with amplitude (Fitts, 1954). For the linear plant we used, both models predict that when the movement amplitude is changed, the velocity profile will simply be scaled up or down without changing its duration or shape. One can avoid this problem by an ad hoc adjustment of the movement duration according to the amplitude prior to the optimization procedure.

We summarize our arguments for the MV model as follows. When we change the viscous and elastic constants of the system by applying external viscous and elastic forces, the shape of the movement velocity profiles should change. In particular, when the system is made more (or less) stable, the peak of the velocity profile should be shifted toward the beginning (or end) of the movement. The skewing should be more pronounced with increasing duration. Intuitively, a more stable system tends to attenuate the signal-dependent noise more over time; the control signal (and the associated noise) can thus afford to be large around the movement onset, and the velocity profile will consequently be skewed toward the beginning. In contrast, a less stable or unstable system may amplify noise through time; the control signal should start small, and the velocity profile will be skewed toward the end. When the system is made strongly oscillatory, the noise effect oscillates in time, and the velocity profile should also show oscillation.

3 Simulations

We have conducted extensive numerical simulations in order to confirm the analyses above, particularly when the simplifying assumptions for the MV model are removed. We consider single-joint movements of the forearm with the dynamic equation

I_0\ddot\theta(t) + (b_0 + b)\dot\theta(t) + (k_0 + k)L_0^2\,\theta(t) = \tau(t).   (3.1)
This equation can be derived by fixing the shoulder angle in the two-joint arm model examined by Uno, Kawato, et al. (1989). θ represents the elbow angle, and τ is the muscle torque exerted on the elbow. I_0, b_0, k_0, and L_0 are the moment of inertia, the intrinsic viscosity, the intrinsic elasticity, and the length of the forearm, respectively, and we adopt the standard values of 0.25 kg·m², 0.20 kg·m²/s, 0 N/m, and 0.35 m for these parameters (Harris & Wolpert, 1998). We also included externally applied viscosity b and elasticity k for altering the system's degree of stability and oscillation. This single-joint system does not have any kinematic redundancy but does have dynamic redundancies. For the MTC model, we used the Lagrange multiplier method presented in the appendix to find the torque numerically and then used equation 2.7 to obtain the velocity. For the MV model, we made the simplifying assumptions in section 2 that the control signal is the muscle torque and that the skewing of the velocity profile can be qualitatively understood through the weighting factor f. These limitations are removed in our simulations. We considered a second-order muscle model to relate the neural control signal u to the muscle torque τ (Winters & Stark, 1985):

\left(1 + t_a\frac{d}{dt}\right)\left(1 + t_e\frac{d}{dt}\right)\tau(t) = u(t),   (3.2)
where ta and te are muscle activation and excitation time constants, and their values are 40 and 30 ms, respectively. This equation implies that the torque is acquired by low-pass filtering the control signal. Consequently, we expect a smooth rise of the muscle torque instead of a sudden onset. We used the original cost function of Harris and Wolpert (1998) by integrating the positional variance over a period of 400 ms after the movement
(longer postmovement durations do not alter the results), and numerically calculated the velocity profiles via quadratic programming. We considered only stable cases in our simulations, since unstable cases generate divergent movement trajectories that are difficult to test experimentally. Based on our analyses, we examined how the system's degree of stability, the movement duration, and the plant oscillation affect the shape of the predicted velocity profiles.

3.1 Effects of the System's Degree of Stability. The system has different degrees of stability when its eigenvalues have different magnitudes but the same (negative) sign. The simplest way to change the degree of stability is to apply external viscous and elastic forces to the forearm. In psychophysical experiments, arbitrary elastic and viscous forces can be introduced by a robotic manipulandum. One could also add the elastic force by attaching a spring to the hand. The movement duration was fixed at 400 ms in all simulations. We first considered the baseline simulation with no external forces (b = k = 0). We then increased the viscous coefficient b from 1 to 5 in steps of 1 kg·m²/s. This range of viscous force can be achieved by a manipulandum (Shadmehr & Mussa-Ivaldi, 1994). We also imposed an external elastic force for each b according to k = (b_0 + b)²/(4I_0L_0²) such that the plant was in the critical damping condition; k ranged from 0.33 to 220 N/m. Based on our analyses in section 2, the velocity profile of the MV model should skew more toward the beginning as the system becomes more stable, while the velocity profile of the MTC model should always remain symmetrical. This is confirmed by the simulations in Figure 2. Figure 2A shows the normalized velocity profiles predicted by the MV model. The rightmost profile is the baseline case without any external forces, and it skews slightly toward the end of the movement.
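For reference, the external elasticity used here follows from the critical damping condition a_1² = 4a_0 applied to equation 3.1 (a small sketch of our own, using the plant constants quoted above):

```python
# Critical damping requires a1^2 = 4*a0 with a1 = (b0+b)/I0 and
# a0 = (k0+k)*L0^2/I0 (equation 3.1), giving k0+k = (b0+b)^2/(4*I0*L0^2).
I0, b0, k0, L0 = 0.25, 0.20, 0.0, 0.35   # plant constants from the text

def critical_k(b):
    """External elasticity (N/m) that puts the forearm plant in critical damping."""
    return (b0 + b)**2 / (4.0 * I0 * L0**2) - k0

# for b between 0 and 5 kg m^2/s this spans roughly 0.33 to 220 N/m,
# matching the range quoted in the text
```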
The profile gradually shifts toward the movement onset as the system's degree of stability increases with the external forces. To show the peak shift more clearly, we plot in Figure 2B the normalized peak locations of the profiles in Figure 2A as a function of the external viscosity. Along the vertical axis, 0, 0.5, and 1 indicate the beginning, midpoint, and end of the movement, respectively. The corresponding simulations for the MTC model are shown in Figures 2C and 2D. As expected, the predicted velocity profiles are always symmetrical regardless of the external forces; the six normalized velocity profiles almost completely overlap.

3.2 Effects of the Movement Duration. We next simulated the effects of the movement duration on the shape of the velocity profile. We varied the duration from 200 ms to 1000 ms in steps of 200 ms while keeping all other parameters at their standard values. No external forces were applied in these simulations. The results for the MV model are shown in Figures 3A and 3B. The peak of the velocity profile shifts more toward the end of the
Figure 2: Velocity profiles and peak locations under different degrees of system stability. (A) Normalized velocity profiles predicted by the MV model. The rightmost curve corresponds to no external force applied to the arm model. The other four curves, from right to left, are for a critical damping system with increasing system stability via external viscous and elastic forces. (B) Normalized peak locations of the velocity profiles in A. The peak locations are divided by the total movement duration. (C) Normalized velocity profiles predicted by the MTC model with the same plant parameters as in A. (D) Normalized peak locations of the profiles in C.
movement with increasing duration. In contrast, the MTC model always predicts symmetrical profiles (see Figures 3C and 3D). These results are again in agreement with our analyses. The movement duration can easily be manipulated in psychophysical experiments by requiring different accuracy at the target location or varying the movement amplitude (Fitts, 1954).

3.3 Effects of the Plant Oscillation. Finally, we compared the two models when the plant has complex eigenvalues and is thus oscillatory. This happens when the elastic force dominates the viscous force and can be achieved by imposing a strong external elastic force or partially canceling the intrinsic viscous force via a manipulandum. In our simulations, we used elastic coefficients k of 200, 300, and 400 N/m while keeping all
Figure 3: Velocity profiles and peak locations under various movement durations. The format is the same as in Figure 2, with A and B for the MV model and C and D for the MTC model.
the other parameters at their standard values. The corresponding values of νt_f/π were 1.26, 1.53, and 1.78, respectively. The movement duration was 400 ms in all simulations. The simulated velocity profiles for the MV model are shown in Figure 4A. The rightmost curve corresponds to the largest k value. It is clear from the figure that as k increases, the profile becomes more oscillatory, and the peak shifts toward the end of the movement. With the largest k we used, the predicted velocity is initially negative, indicating that the movement should first be in the opposite direction of the target. In contrast, the velocity profiles for the MTC model shown in Figure 4B are always symmetrical. Again, these simulations are consistent with our analytical considerations. Note that the MTC model solution, equation 2.13, can also be oscillatory when the α's are complex. However, we found through simulations that k has to be larger than about 1000 N/m for that oscillation to be observed. In addition, the MTC solution is always symmetrical, with or without oscillation.
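The quoted values of νt_f/π follow from the complex eigenvalues λ± = µ ± iν of the plant in equation 3.1. This sketch (our own check, using the parameter values from the text) reproduces them to within rounding:

```python
import numpy as np

# Oscillation parameter nu*tf/pi for the underdamped forearm plant
# (equation 3.1 with external elasticity k and no external viscosity).
I0, b0, L0, tf = 0.25, 0.20, 0.35, 0.4

def nu_tf_over_pi(k):
    """nu*tf/pi, where lambda = mu +/- i*nu are the plant eigenvalues."""
    a0 = k * L0**2 / I0          # elastic constant (k0 = 0)
    a1 = b0 / I0                 # viscous constant
    nu = np.sqrt(4.0 * a0 - a1**2) / 2.0
    return nu * tf / np.pi

vals = [nu_tf_over_pi(k) for k in (200.0, 300.0, 400.0)]
```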
Figure 4: Velocity profiles under different degrees of plant oscillation. (A) Prediction of the MV model when the external elasticity k takes values of 200, 300, and 400 N/m, respectively. (B) Prediction of the MTC model under the same condition.
4 Discussion

The main goal of this study is to compare two well-known models of motor planning: the minimum variance model and the minimum torque change model. We have focused on the different velocity profiles predicted by the two models and found, through analyses and simulations, that with a second-order linear plant, the MTC model always predicts a perfectly symmetrical velocity profile, while the MV model predicts skewed velocity profiles under many conditions.

Our results suggest the following specific experiments for testing the validity of the two models. Subjects should be instructed to make single-joint movements of the forearm in a horizontal plane; this is the condition under which the second-order linear plant used in our study is valid. The movement velocity profiles should be measured under each of the following manipulations:

1. Different levels of viscous forces are applied through a manipulandum.

2. Different movement durations are imposed by asking the subjects to hit the target position with various degrees of precision.

3. Different elastic forces are applied through either a manipulandum or the attachment of springs.

If the observed velocity profiles under these manipulations are all symmetrical, the MTC model is supported. But if the velocity profiles become more skewed toward the movement onset with increasing viscosity in condition 1, the skewing is more pronounced with movement duration in condition 2, and the velocity profile becomes more skewed toward the end (and more
oscillatory) with elastic force in condition 3, the MV model is favored. All other possible outcomes would suggest that neither model is correct. In condition 1, the movement duration may increase with viscosity. This confound should not matter for testing purposes, because a concurrent increase of duration with viscosity would only make the skewing predicted by the MV model stronger. When the applied elastic force is very strong in condition 3, the MV model makes the counterintuitive prediction that the movement should first be in the opposite direction of the target (see Figure 4A).

In this study, we considered the second-order linear plant that is often used in motor control modeling and can be realized with single-joint arm movements. This simplification allowed us to gain analytical insights into the two models. The simplification is justified because we are mainly interested in searching for specific conditions where the two models can be clearly differentiated. It may be desirable to extend our studies to nonlinear plants, such as the two-link arm model, because many extant experimental studies involve multijoint movements. However, nonlinear systems are not only difficult to analyze but also difficult to simulate, due to the local-minimum problem in the optimization process. Our simulations (not shown) indicate that the problem becomes worse when large external forces are applied. Predictions from the two models based on nonlinear simulations may thus be less reliable for distinguishing the models than the results in this article. In addition, the MTC and MV models' predictions have a complex dependence on the system parameters in general. This makes experimental tests of the models difficult, due to uncertainties in the estimation of the arm's intrinsic parameters. In contrast, we show here that for the special case of single-joint movements, the symmetry of the velocity profiles predicted by the MTC model holds for any parameter combination.
Likewise, the trend of velocity peak shift predicted by the MV model when duration and external forces are manipulated is valid regardless of the intrinsic parameter values. The simple case we considered in this article may thus provide a more conclusive test of the models. It is known that large-amplitude saccadic eye movements show skewed velocity profiles (Collewijn, Erkelens, & Steinman, 1988; Harris & Wolpert, 1998). To the extent that the eye can be approximated by a linear plant, this observation argues against the MTC model for eye movements. Indeed, the MTC model was proposed only for arm movement, not eye movement (Uno, Kawato, et al., 1989), while the MV model was proposed for both (Harris & Wolpert, 1998). Skewed velocity profiles in arm movements have also been reported (Nagasaki, 1989; Milner & Ijaz, 1990). However, these studies usually employ multijoint arm movements, a condition under which the MTC model can also predict skewed velocity profiles (Uno, Kawato, et al., 1989). We therefore conclude that the extant data cannot distinguish the two models, and new experiments as outlined above should be performed. Our analyses can be readily generalized to higher-order linear systems, and we expect that our conclusions will still be valid.
Minimum Variance and Minimum Torque-Change Models
It would also be interesting to examine the implications of the other models developed by Kawato et al. that are closely related to the MTC model considered here. They include the minimum muscle-tension-change model (Uno, Suzuki, & Kawato, 1989) and the minimum motor-command-change model (Kawato, 1992). Since all of these models require smoothness in dynamic variables, and the smoothness criterion is independent of the system stability or duration, we speculate that they may all predict nearly symmetrical velocity profiles under a linear plant.

Both the MTC and MV models are purely feedforward and are perhaps most appropriate for understanding rapid, well-learned movements. However, sensory feedback, when present, has been shown to have profound effects on trajectory formation (Keele & Posner, 1968; Carlton, 1981; Sheth & Shimojo, 2002). Todorov and Jordan recently proposed an extended linear-quadratic-gaussian (LQG) model that includes sensory feedback during movement execution (Todorov & Jordan, 2002; Todorov, 2004). We have performed some simulations with their model and the linear plant used in this article and found that sensory feedback can strongly influence the shape of the predicted velocity profile. For example, for single-joint movements without external forces, the feedback can cause the velocity profile to change from skewing toward the end to skewing toward the beginning of a movement. Since the extended LQG model contains many parameters not present in the MTC or MV model, and some of those parameters (such as the relative weighting between the error and the control cost) can also influence the shape of the velocity profile, we will present a full account of the extended LQG model predictions in a future publication.

Appendix A: A Solution of the MTC Model by the Lagrange Multiplier Method

We present a solution of the MTC model based on the Lagrange multiplier method, following the original work of Uno, Kawato, et al. (1989).
We used this method for our numerical simulations of the MTC model in section 3, because the Euler-Lagrange method can be numerically less stable under some parameters. The problem is to minimize the torque change, equation 2.8, under the constraint that the dynamic equation, equation 2.3, should be satisfied. The augmented cost function after introducing a Lagrange multiplier vector p_x is thus

C^{MTC} = \int_0^{t_f} dt \left[ \frac{1}{2} \dot{\tau}^2 - p_x^T (\dot{x} - Ax - B\tau) \right]. \qquad (A.1)
The variation procedure would give rise to a second-order differential equation. Instead, by introducing an auxiliary variable z(t) as the temporal
derivative of the torque τ(t), we obtain a new cost function,

\tilde{C}^{MTC} = \int_0^{t_f} dt \left[ \frac{1}{2} z^2 - p_x^T (\dot{x} - Ax - B\tau) - p_\tau (\dot{\tau} - z) \right], \qquad (A.2)
that will give rise to a set of first-order differential equations. Here, p_τ is another Lagrange multiplier to enforce the constraint that z should be the derivative of τ. Applying the variation principle with respect to x, z, τ, p_x, and p_τ, we have the following set of equations:

\dot{x} = Ax + B\tau, \quad z = -p_\tau, \quad \dot{\tau} = -p_\tau, \quad \dot{p}_x = -A^T p_x, \quad \dot{p}_\tau = -B^T p_x. \qquad (A.3)
The boundary conditions of the state variable at the initial and final time are

\theta(0) = \theta_0, \quad \dot{\theta}(0) = 0, \quad \ddot{\theta}(0) = 0; \qquad \theta(t_f) = \theta_f, \quad \dot{\theta}(t_f) = 0, \quad \ddot{\theta}(t_f) = 0. \qquad (A.4)
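A minimal numerical sketch of this boundary-value problem (our illustration, not the authors' code): for a scalar second-order plant I·q'' + b·q' + k·q = τ, the state is x = (q, q') with A = [[0, 1], [−k/I, −b/I]] and B = (0, 1/I)^T. Equations A.3 are integrated by forward Euler; because the system is linear, the final state depends linearly on (p_x(0), p_τ(0)), so three unit-response runs yield a 3 × 3 linear system for the multipliers that satisfy the boundary conditions A.4. All parameter values below are placeholders, not the paper's.

```python
def simulate(px0, ptau0, I=0.25, b=0.2, k=1.0, q0=0.0, tf=1.0, n=4000):
    """Forward-Euler integration of the first-order system (A.3)."""
    dt = tf / n
    q, w = q0, 0.0                  # angle and angular velocity
    tau = k * q0                    # tau(0) = a0*q0 makes q''(0) = 0
    p1, p2, pt = px0[0], px0[1], ptau0
    for _ in range(n):
        acc = (tau - b * w - k * q) / I
        d1 = (k / I) * p2           # p_x' = -A^T p_x
        d2 = -p1 + (b / I) * p2
        dtau = -pt                  # tau' = -p_tau
        dpt = -(1.0 / I) * p2       # p_tau' = -B^T p_x
        q += dt * w; w += dt * acc; tau += dt * dtau
        p1 += dt * d1; p2 += dt * d2; pt += dt * dpt
    return q, w, (tau - b * w - k * q) / I   # q(tf), q'(tf), q''(tf)

def solve3(M, rhs):
    """Tiny Gauss-Jordan solver with partial pivoting for a 3x3 system."""
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(3):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * g for a, g in zip(aug[r], aug[c])]
    return [aug[i][3] / aug[i][i] for i in range(3)]

def mtc_multipliers(qf=1.0, **kw):
    """Find p_x(0), p_tau(0) so the final boundary conditions are met."""
    base = simulate((0.0, 0.0), 0.0, **kw)
    cols = []
    for j in range(3):
        u = [0.0, 0.0, 0.0]; u[j] = 1.0
        f = simulate((u[0], u[1]), u[2], **kw)
        cols.append([f[i] - base[i] for i in range(3)])
    M = [[cols[j][i] for j in range(3)] for i in range(3)]
    rhs = [qf - base[0], -base[1], -base[2]]
    return solve3(M, rhs)           # [p_x1(0), p_x2(0), p_tau(0)]
```

With the multipliers so obtained, re-running `simulate` reproduces the target angle with zero final velocity and acceleration; the resulting torque trajectory corresponds to the closed form A.5.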
The initial conditions of the Lagrange multipliers, p_x(0) and p_τ(0), are determined so that the boundary conditions A.4 are satisfied. Finally, by integrating equations A.3, the muscle torque is obtained as

\tau(t) = a_0 \theta_0 - p_\tau(0)\, t + B^T (A^T)^{-1} \left[ t\, 1_2 + (A^T)^{-1} \left( e^{-A^T t} - 1_2 \right) \right] p_x(0). \qquad (A.5)

Here 1_2 is the 2 × 2 unit matrix. The optimal trajectory is obtained by solving the dynamical equation with the muscle torque, equation A.5.

Acknowledgments

We thank Emanuel Todorov and Daniel Wolpert for answering our inquiries about their models and providing us their code. We are also grateful to John Krakauer, Pietro Mazzoni, Claude Ghez, and two anonymous reviewers for their helpful comments and suggestions. This work is supported by NIH grant MH54125.

References

Abend, W., Bizzi, E., & Morasso, P. (1982). Human arm trajectory formation. Brain, 105, 331–348.
Bernstein, N. (1967). The coordination and regulation of movements. London: Pergamon.
Carlton, L. G. (1981). Processing visual feedback information for movement control. Journal of Experimental Psychology: Human Perception and Performance, 7, 1019–1030.
Collewijn, H., Erkelens, C. J., & Steinman, R. M. (1988). Binocular coordination of human horizontal saccadic eye movements. Journal of Physiology, 404, 157–182.
Fitts, P. M. (1954). Information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47, 381–391.
Flash, T., & Hogan, N. (1985). The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5, 1688–1703.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
Hoff, B., & Arbib, M. A. (1993). Models of trajectory formation and temporal interaction of reach and grasp. Journal of Motor Behavior, 25, 175–192.
Hogan, N. (1984). An organizing principle for a class of voluntary movements. Journal of Neuroscience, 4, 2745–2754.
Hollerbach, J. M. (1982). Computers, brains and the control of movement. Trends in Neurosciences, 5, 189–192.
Kawato, M. (1992). Optimization and learning in neural networks for formation and control of coordinated movement. In D. Meyer & S. Kornblum (Eds.), Attention and performance XIV (pp. 821–849). Cambridge, MA: MIT Press.
Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural-network model for control and learning of voluntary movement. Biological Cybernetics, 57, 169–185.
Kawato, M., Maeda, Y., Uno, Y., & Suzuki, R. (1990). Trajectory formation of arm movement by cascade neural network model based on minimum torque-change criterion. Biological Cybernetics, 62, 275–288.
Keele, S. W., & Posner, M. I. (1968). Processing visual feedback in rapid movements. Journal of Experimental Psychology, 77, 155–158.
Kelso, J. A. S., Southard, D. L., & Goodman, D. (1979). On the nature of human interlimb coordination. Science, 203, 1029–1031.
Luh, J. Y. S., Walker, M. W., & Paul, R. P. C. (1980). Online computational scheme for mechanical manipulators. Journal of Dynamic Systems, Measurement, and Control, 102, 69–76.
Massone, L., & Bizzi, E. (1989). A neural network model for limb trajectory formation. Biological Cybernetics, 61, 417–425.
Milner, T. E., & Ijaz, M. M. (1990). The effect of accuracy constraints on three-dimensional movement kinematics. Neuroscience, 35, 365–374.
Morasso, P. (1981). Spatial control of arm movements. Experimental Brain Research, 42, 223–227.
Nagasaki, H. (1989). Asymmetric velocity and acceleration profiles of human arm movements. Experimental Brain Research, 74, 319–326.
Robinson, D. A., Gordon, J. L., & Gordon, S. E. (1986). A model of smooth pursuit eye movement system. Biological Cybernetics, 55, 43–47.
Shadmehr, R., & Mussa-Ivaldi, F. (1994). Adaptive representation of dynamics during learning of a motor task. Journal of Neuroscience, 14, 3208–3224.
Sheth, B. R., & Shimojo, S. (2002). How the lack of visuomotor feedback affects even the early stages of goal-directed pointing movements. Experimental Brain Research, 143, 181–190.
Tanaka, H., & Qian, N. (2003). Different predictions by the minimum variance and minimum torque-change models on the skewness of movement velocity profiles. Program No. 492.7. 2003 Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Available online at: http://sfn.scholarone.com/itin2003/index.html.
Todorov, E. (2004). Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Manuscript submitted for publication.
Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235.
Uno, Y., Kawato, M., & Suzuki, R. (1989). Formation and control of optimal trajectory in human multijoint arm movement. Biological Cybernetics, 61, 89–101.
Uno, Y., Suzuki, R., & Kawato, M. (1989). Minimum muscle-tension-change model which reproduces human arm movement. In Proceedings of the 4th Symposium on Biological and Physiological Engineering (pp. 299–302). Fukuoka, Japan.
Wada, Y., Kaneko, Y., Nakano, E., Osu, R., & Kawato, M. (2001). Quantitative examinations for multijoint arm trajectory planning. Neural Networks, 14, 381–393.
Winters, J. M., & Stark, L. (1985). Analysis of fundamental human movement patterns through the use of in-depth antagonistic muscle models. IEEE Transactions on Biomedical Engineering, 32, 826–839.

Received September 5, 2003; accepted April 2, 2004.
LETTER
Communicated by Eero Simoncelli
Disambiguating Visual Motion Through Contextual Feedback Modulation

Pierre Bayerl
[email protected]
Heiko Neumann
[email protected]
Department of Neural Information Processing, University of Ulm, D-89069 Ulm, Germany
Motion of an extended boundary can be measured locally by neurons only orthogonal to its orientation (aperture problem), while this ambiguity is resolved for localized image features, such as corners or nonocclusion junctions. The integration of local motion signals sampled along the outline of a moving form reveals the object velocity. We propose a new model of V1-MT feedforward and feedback processing in which localized V1 motion signals are integrated along the feedforward path by model MT cells. Top-down feedback from MT cells in turn emphasizes model V1 motion activities of matching velocity by excitatory modulation and thus realizes an attentional gating mechanism. The model dynamics implement a guided filling-in process to disambiguate motion signals through biased on-center, off-surround competition. Our model makes predictions concerning the time course of cells in areas MT and V1 and the disambiguation process of activity patterns in these areas and serves as a means to link physiological mechanisms with perceptual behavior. We further demonstrate that our model also successfully processes natural image sequences.

1 Introduction

The analysis and interpretation of moving shapes and objects based on motion estimations is a major task in everyday vision. However, motion can locally be measured only orthogonal to an extended contrast (aperture problem), while this ambiguity can be resolved at localized image features, such as corners or junctions from nonoccluding geometrical configurations. The models previously suggested to solve the aperture problem integrate localized motion signals to generate coherent form motion. For example, the vector sum approach averages movement vectors measured for a coherent shape (Wilson, Ferrera, & Yo, 1992). Local motion signals of an object define possible solutions of the motion constraint equation (Horn & Schunck, 1981).
Neural Computation 16, 2041–2066 (2004)  © 2004 Massachusetts Institute of Technology

If several distinct measures are combined, their associated constraint lines in the velocity space intersect and thus yield the velocity common to the individual measures (intersection of constraints, IOC) (Adelson & Movshon, 1982). Bayesian models combine different probabilities for velocities with a preference for, e.g., slower motions (Weiss & Fleet, 2001; Weiss, Simoncelli, & Adelson, 2002). Such preferences are encoded as statistical priors in the Bayesian estimation process. As with the IOC, Bayesian models mostly assume that motion estimates belonging to distinct objects have already been grouped together. Unambiguous motion signals can be measured at locations of significant 2D image structure, such as curvature maxima, corners, or junctions.1 These sparse features can be tracked over several frames to yield robust movement estimates and predictions (feature tracking) (Del Viva & Morrone, 1998). Coherent motion is often computed by smoothing and interpolating sparse measures estimated at localized image contrasts. Thus, the inverse problem of motion estimation is regularized by minimizing a smoothing constraint over surface regions (Horn & Schunck, 1981; Nagel, 1987) or along boundaries (Hildreth, 1984).

In this letter, we investigate the mechanisms of the primate visual system to process fields of visual motion induced by moving objects or self-motion. Motion information is primarily processed along the dorsal pathway in the visual system, but mutual interactions exist at different stages between the dorsal and ventral paths (Van Essen & Gallant, 1994). Our modeling of mechanisms of cortical motion processing focuses on the integration and segregation of visual motion in reciprocally connected areas V1 and MT. Our model dynamics are defined by spatially local isotropic interactions of velocity-tuned cells.
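The IOC combination described above can be sketched as a least-squares intersection of normal-flow constraint lines (an illustrative sketch of the geometric idea, not a component of the model): each local measurement constrains the object velocity u to a line n·u = c in velocity space, and two or more non-parallel constraints determine u via the 2 × 2 normal equations.

```python
def ioc_velocity(constraints):
    """Least-squares intersection of constraint lines n . u = c.

    constraints: list of ((nx, ny), c) pairs, with n the (unit) normal of
    a local contrast and c the measured normal speed.
    """
    sxx = sxy = syy = bx = by = 0.0
    for (nx, ny), c in constraints:
        sxx += nx * nx; sxy += nx * ny; syy += ny * ny
        bx += c * nx; by += c * ny
    det = sxx * syy - sxy * sxy      # singular if all normals are parallel
    return ((syy * bx - sxy * by) / det,
            (sxx * by - sxy * bx) / det)
```

For consistent constraints the least-squares solution coincides with the exact intersection; with noisy normal-flow measurements it returns the best-fitting common velocity.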
Additional signals indicating luminance boundaries or explicitly detected motion discontinuities could easily be used to improve our results, but they would make it impossible to analyze the basic functionality of our system, which we claim to represent a fundamental mechanism of motion processing in the visual system. Despite its simplicity, the model is able to explain experimental data and, without parameter changes, to successfully process real-world data used for model benchmarking (Barron, Fleet, & Beauchemin, 1994). The new contribution of this work is a unified mechanism of motion disambiguation that deals with localized features as well as elongated boundaries. It is demonstrated how recurrent feedback processing between two model areas with similar dynamics operating on two different scales stabilizes local feature measurements. In addition, we show how such features trigger a filling-in process along boundaries to resolve the aperture problem and thereby arrive at globally consistent motion estimates through local lateral interactions.

1 In case of spatial object occlusions, junctions may signal incorrect velocities, since their boundaries belong to different object shapes (McDermott, Weiss, & Adelson, 2001). We do not particularly consider the corresponding segmentation problem, but briefly discuss consequences in section 4.

Such local lateral processes are biased and controlled by feedforward-feedback interactions that
can be viewed as scale-invariant mechanisms of hypothesis testing and correction. The letter is structured as follows. In section 2, we describe the model and the model dynamics. In section 3, we present computational simulations, followed by a discussion of the model and the results in section 4.

2 The Model

Motion analysis in visual cortex starts with primary visual area V1 and is subsequently followed by parietal areas such as MT/MST and beyond. These areas communicate with a bidirectional flow of information via feedforward and feedback connections. The mechanisms of this feedforward and feedback processing between model areas V1 and MT can be described by a unified architecture of lateral inhibition and modulatory feedback whose elements have been developed in the context of shape processing and boundary completion (Neumann & Sepp, 1999). In this section, we present the model dynamics within and between the model cortical areas involved in realizing the integration and segregation of inherently ambiguous input patterns. In a nutshell, the model consists of two areas with similar architecture that implement the following mechanisms:

1. Feedback modulation. Cells in area V1 are modulated by cell activations from model area MT. Cells in MT can, in principle, also be modulated by higher areas such as MST or attention. Since we focus here on the two stages of V1-MT interaction, the feedback to model area MT is set to zero.

2. Feedforward integration.

3. Lateral shunting inhibition enhancing unambiguous signals (see Figure 1).

2.1 Input Stage of Model V1 (Initial Motion Detection). The input stage in V1 consists of a set of modified elaborated Reichardt detectors (ERD) (Adelson & Bergen, 1985) to measure local motion for a specific range of velocities at each location (population code) (Pouget, Zemel, & Dayan, 2000). The functionality of the correlation-based detector is described in appendix A.
The activities of this input stage, c(3)(x, ∆x), for different velocities (encoded by ∆x) at different locations x (see Figure 1 and appendix A) indicate unambiguous motion at corners and line endings, ambiguous motion along contrasts, and no motion for homogeneous regions. Physiological findings suggest that motion-sensitive cells in V1 respond to orientation as well (component motion) and that only MT cells become (more) invariant to orientation, indicating "pattern motion" (Movshon, Adelson, Gizzi, & Newsome, 1985; Movshon & Newsome, 1996). We focus our investigation on the
Figure 1: Overview of model areas V1 and MT. The architecture defines the dynamics of interaction between both areas, consisting of modulatory feedback (v(1)), feedforward integration (v(2)), and lateral shunting inhibition (v(3)). Differences in the parameterization concern receptive field sizes and the wiring of feedforward input and feedback (see Table 1). Σ1 denotes an integration with isotropic gaussian kernels in space and in the velocity domain (see Table 1). Σ2 denotes the sum over all velocities at each individual location. c(3) is the input stage of model V1 and is realized as a correlation detector based on complex cell responses (see appendix A). Note that the input stage of model MT is just a copy of v(3) from model V1.
role of feedback in motion disambiguation and therefore employ motion-sensitive cells in model V1 that are not orientation selective (see the discussion in section 4.1). The input stage is divided into two steps. The first concerns cells selective to static oriented contrasts at a fixed spatial frequency and independent of contrast polarity. Consistent with Movshon and Newsome (1996), these cells also respond to very weak contrasts, which is realized through shunting normalization (see appendix A). The second concerns direction-selective cells, pooling over all orientation-selective cells at different time steps, which yields a representation of visual motion independent of contrast orientation. This simplification does not contradict basic properties of motion-selective cells in V1, since immediate response properties of model cells concerning visual motion basically indicate component motion, except at very few locations in the image with intrinsic two-dimensional structure, such as corners or line endings (see sections 3 and 4.1). We claim that the important difference between V1 and MT is the spatial size of receptive fields and that the proposed mechanisms within and between the areas reveal properties consistent with physiological and psychophysical observations and yield highly accurate visual motion estimations (see section 3). In the following, we describe how different parts of the model contribute to the disambiguation of ambiguous motion signals in a biased competition process, while motion discontinuities remain preserved.
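As a rough illustration of such a correlation-based, population-coded input stage, the sketch below correlates two 1-D frames over a range of candidate shifts (a generic correlation detector under our own simplifications, not the elaborated Reichardt detector of appendix A; the function name and parameters are hypothetical):

```python
def correlation_input(f0, f1, max_shift=2, patch=1):
    """Population-coded motion evidence from two 1-D frames.

    For each location x and each candidate shift dx, a small patch of
    frame f0 around x is correlated with the patch of frame f1 shifted
    by dx, yielding a population response over velocities at x.
    """
    n = len(f0)
    out = []
    for x in range(n):
        pop = []
        for dx in range(-max_shift, max_shift + 1):
            acc = 0.0
            for o in range(-patch, patch + 1):
                i, j = x + o, x + dx + o
                if 0 <= i < n and 0 <= j < n:
                    acc += f0[i] * f1[j]   # match patch against shifted patch
            pop.append(max(acc, 0.0))      # half-wave rectified evidence
        out.append(pop)
    return out
```

A single feature moving rightward by one pixel produces a population peaked at dx = +1, while an extended homogeneous contrast would produce a flat, ambiguous population response.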
Table 1: Parameterization of Equations 2.1–2.3 for Model Areas V1 and MT.

Model Area   netIN      netFB      C     Kernel Size σ1 (spatial)   Kernel Size σ2 (velocity)
V1           c(3)       v(3)_MT    100   0 (dirac)                  0.75
MT           v(3)_V1    0          0     7                          0.75

Notes: Spatial RF ratio V1:MT = 1:5 (the RF size of cells in model V1 is defined by the RF sizes of the correlation detector: √(2σ²) = 1.41 (σ = 1), and the RF size of MT cells is influenced by the RF sizes in V1: √(1.41² + 7²) = 7.14).
2.2 Motion Processing in Model Areas V1 and MT (Motion Integration). Two areas, model V1 and model MT, with similar dynamics subsequently process the initial motion signal in a recurrent loop (an outline is given in Figure 1). Here we outline the details of the architecture of both model areas, which consists of three steps: feedback modulation, feedforward integration, and lateral inhibition (see Table 1 for the parameterizations of the processing stages in model V1 and MT). The computational logic of recurrent feedback processing between two areas is that the higher area integrates localized measures from the lower area over a larger spatial range, or neighborhood, and thus evaluates these signals in a much larger context (Hupé et al., 2001; Friston & Büchel, 2000). Consider the higher area, MT, as one in which cells evaluate their input via their feedforward input connection strength. The resulting activity can then be interpreted as a signature of the degree of match between the expected input (encoded by the input weights of receptive field [RF] kernels) and the current input signal. Feedback in turn functions as a predictor that enhances those signals in the lower area that are compatible with respect to feature specificity by way of top-down modulation (Grossberg, 1980; Ullman, 1995; Mumford, 1994). Such excitatory modulatory feedback interaction gives activations in V1 that match the "expectations" of MT a competitive advantage in subsequent mutual inhibition. Modulatory feedback has the further advantage that only compatible patterns get emphasized in the lower area and no activity is produced where no signal is provided by the input. Thus, our model adopts the "no strong loop" hypothesis developed by Crick and Koch (1998). Also, modulatory feedback can be compared with the intersection-of-constraints (IOC) analysis of motion integration, since the input signal constrains the feedback signal and the intersection of both gets emphasized.2

2 In our model, we particularly focus on the feedforward-feedback interaction between areas V1 and MT. This does not deny, however, the existence of (modulatory) feedback connections from, for example, area MST also. Such an extension can be incorporated by adopting the model mechanisms of MT-V1 feedback modulation also at the stage of model MT to include MST-MT feedback as well.

Before integrating the modified input
signal, it is squared to sharpen the distribution. Then the signal is processed by cells with isotropic spatial and isotropic directional gaussian RFs (see equation 2.2). This feedforward integration can be compared with a vector average if the population code is interpreted as the sum of its components. Cell activities are subsequently normalized by lateral shunting inhibition (see equation 2.3). The sum of activations of cells sensitive to any velocity at a specific location normalizes the total energy (Simoncelli & Heeger, 1998). Therefore, unambiguous signals (indicating the presence of only one or a few velocities) get emphasized, while ambiguous signals (indicating many possible velocities) lead to flat population responses. This step is similar to a feature tracking mechanism, whose behavior in our model emerges from the dynamic behavior of the system without the employment of any explicit feature detection mechanism. Each model area consists of three stages of processing whose dynamics are defined by the following equations (compare Figure 1; see appendix B for the steady-state solutions used for the simulations):

\partial_t v^{(1)} = -v^{(1)} + \mathrm{net}_{IN} \cdot (1 + C \cdot \mathrm{net}_{FB}), \qquad (2.1)

\partial_t v^{(2)} = -v^{(2)} + \left( v^{(1)} \right)^2 \ast G_{\sigma_1}^{(x,\ \mathrm{space})} \ast G_{\sigma_2}^{(\Delta x,\ \mathrm{velocity})}, \qquad (2.2)

\partial_t v^{(3)} = -0.01 \cdot v^{(3)} + v^{(2)} - \frac{1}{2n}\, v^{(3)} \cdot \sum_{\Delta x} v^{(2)}, \qquad (2.3)
where n denotes the number of cells tuned to different velocities at any specific location. netIN is the input of the model area (e.g., the output of the correlation detector for model V1), and netFB is the feedback signal (e.g., the output of model MT for model V1). ∗ denotes the convolution operation, and Gσi are gaussian kernels in space (Gσ1) and in the velocity domain (Gσ2). Velocity is coded by a spatial shift ∆x per frame. Compared to model V1, the enlarged RF sizes in model MT (RF ratio V1:MT = 1:5; see Table 1) lead to less ambiguous signals, since the influence of unambiguous motion features generated by, e.g., line endings is larger due to the larger aperture. We propose a solution to the aperture problem by feedback of activities from a higher stage, which provides a larger context of motion integration. Thus, less ambiguous MT responses from large spatial regions are combined with ambiguous but spatially localized V1 cell responses such that these signals in turn become less ambiguous. The temporal evolution of this disambiguation process in turn spreads, or fills in, unambiguous motion information along the moving shape outline like a traveling wave triggered by localized features. Again, we emphasize that this disambiguation is achieved by the network dynamics without the employment of specialized feature detectors and explicit mechanisms for decision making.
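A minimal sketch of one feedforward pass through these three stages, using the steady-state reading of equations 2.1 to 2.3 (1-D space, pure-Python convolutions, and the steady-state form of 2.3 are our simplifications of the published dynamics):

```python
import math

def gauss_kernel(sigma, radius):
    """Normalized 1-D gaussian kernel; sigma == 0 is the dirac of Table 1."""
    if sigma == 0:
        return [1.0]
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve(sig, ker):
    """1-D convolution with zero padding at the borders."""
    r = len(ker) // 2
    out = []
    for i in range(len(sig)):
        acc = 0.0
        for j, w in enumerate(ker):
            t = i + j - r
            if 0 <= t < len(sig):
                acc += w * sig[t]
        out.append(acc)
    return out

def area_step(net_in, net_fb, C, sig_space, sig_vel):
    """One pass through a model area; net_in[x][v] indexes location x, velocity v."""
    nx, nv = len(net_in), len(net_in[0])
    # (2.1) modulatory feedback: input is enhanced, never created
    v1 = [[net_in[x][v] * (1.0 + C * net_fb[x][v]) for v in range(nv)]
          for x in range(nx)]
    # (2.2) squaring sharpens the population; gaussian integration in
    # space and in the velocity domain
    sq = [[a * a for a in row] for row in v1]
    ks = gauss_kernel(sig_space, max(1, round(3 * sig_space)))
    kv = gauss_kernel(sig_vel, max(1, round(3 * sig_vel)))
    tmp = [[0.0] * nv for _ in range(nx)]
    for v in range(nv):
        col = convolve([sq[x][v] for x in range(nx)], ks)
        for x in range(nx):
            tmp[x][v] = col[x]
    v2 = [convolve(row, kv) for row in tmp]
    # (2.3) steady state of the shunting inhibition:
    # v3 = v2 / (0.01 + (1/(2n)) * sum over velocities of v2)
    v3 = []
    for row in v2:
        total = sum(row)
        v3.append([a / (0.01 + total / (2 * nv)) for a in row])
    return v3
```

With this normalization, a location carrying a single strong velocity hypothesis ends up with a much higher peak than a location with the same total energy spread over many velocities, which is the emphasis of unambiguous signals described above.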
To handle continuous sequences of images (instead of iterating just over the correlation results of one image pair), the feedback signal has to "follow" the detected motion. This is realized by shifted synaptic feedback connections corresponding to the velocity (represented by ∆x) in accordance with the cells' maximum sensitivity and thus represents a kind of prediction mechanism.

3 Results

The model dynamics described in the previous section emerge from a layered architecture of cells, or groups of cells, in which each area is represented by three layers. Model cells were considered single-compartment entities whose gradual activation dynamics follow shunting, or membrane, properties. The network can be shown to obey bounded input-output stability. Since the equations equilibrate quickly, we simulate the dynamics using the steady-state equations in order to save computing time. Computational simulations demonstrate the ability of the presented neural architecture to cope with synthetic image sequences as well as with natural sequences. The same set of parameters (see Table 1) was used for all simulations. The number of iterations varied, depending on how many frames were available. The number of cells to represent the velocity space was chosen to be 225 (∆x ∈ {−7, . . . , 7}², including ∆x = 0). We investigated the model by probing it with artificially generated image sequences utilized in physiological and psychophysical experiments. Furthermore, we demonstrate that the same model using the same parameter settings is able to process realistic sequences. These test cases were also used in benchmark tests for technical motion detection approaches.

3.1 Empirical Data. Simulations of model V1 and MT qualitatively confirm the results from neurophysiological recordings of the time course of MT cells (Pack & Born, 2001).
Model MT cell responses along the extended contrasts initially signal normal flow (perpendicular to the contrast orientation and corresponding to component motion) as in experimental observations. In the psychophysical experiment, the cells’ selectivity gradually switches after approximately 150 ms to signal the true velocity. In our model, this change is facilitated by filling in correct motion interpretations derived from intrinsic feature signals at line endings or corners (see Figure 2). As a consequence, our model suggestions are twofold: (1) the spatial distance of MT cells from unambiguous motion features directly influences the time until the signal is completely disambiguated and (2) scale invariance is achieved through temporal integration of disambiguated motion estimates along boundaries. Note that the filling-in process preserves spatially localized responses due to modulatory feedback. Our model generates disambiguated motion cell responses in MT as well as in V1. This predicts that recordings of motion-sensitive cells in V1 should result in similar time pat-
2048
P. Bayerl and H. Neumann
terns as for recordings in MT. There is no clear evidence that V1 cells behave similar to MT cells. This may be a consequence of the RF size of V1 cells, which makes it extremely difficult to obtain spatially accurate physiological measurements over time for moving objects. However, recent physiological studies show that subpopulations of V1 cells near line endings actually encode pattern motion and show similar dynamics as cells in MT (Pack, Livingstone, Duffy, & Born, 2003) (see section 4.1). The following experiment shows how feedback processing causes the model to generate neural hysteresis and emphasizes the role of feedback for the disambiguation process. Figure 3 illustrates how neural hysteresis is generated by processing a random dot cinematogram, gradually changing from rightward to leftward motion or vice versa. Such displays induce perceptual hysteresis, indicating interactions with a short-term memory (Williams & Phillips, 1987). Such an interaction may help to lock in and keep the initially detected velocity (e.g., leftward motion) until the stimu-
Disambiguating Visual Motion
2049
lus has changed enough to switch to another alternative (e.g., rightward motion). The feedback signal in our model can be interpreted as short-term memory, since it contains information from preceding time steps. Model simulations show that hysteresis is generated as a consequence of feedback processing. Without feedback, our model generates no hysteresis behavior, and the motion signal never gets completely disambiguated (a maximum of approximately 80% correct flow information for purely left- or rightward motion; see Figure 3B). The latter effect occurs because the correlation detector used in the input stage cannot distinguish between the different stimulus dots, which leads to high ambiguity in the correlation signal (as in most other motion detectors). If feedback is applied, motion is completely disambiguated and 100% right- or leftward motion is indicated (even in the presence of dots with switched direction), until the signal represented by MT cells switches nearly immediately to 100% of the other direction. This behavior can be interpreted as neural decision making.

Figure 2: Facing page. Example of processing an artificial test sequence (100 × 100 pixel). (A) Temporal development (t = 1, . . . , 9) of the direction indicated by local model MT cell populations at three different image locations (indicated by simulated "electrodes"). The direction of motion measured at the corner (electrode 1, filled circle) points toward the true direction of motion (45 degrees) from the beginning, while on the edge (electrodes 2 and 3, open circles and triangles) flow initially measured orthogonal to the orientation of local contrast (90 degrees) gradually changes to the correct direction. Cell populations spatially more distant from locations with initially unambiguous estimates take longer to become disambiguated. (B) Real data showing the temporal course of direction indicated by a cell population in macaque area MT located on a moving bar pattern, redrawn and adapted from Pack and Born (2001): (1) 60 ms after stimulus onset, motion is orthogonal to the orientation of the bar. (2) 150 ms after stimulus onset, correct motion is indicated. The temporal activity pattern of model cell populations (A) qualitatively matches the temporal pattern of activity in macaque MT (B) after rescaling the temporal axis by an appropriate factor. (C) Temporal development of V1 cell population responses. Populations of motion-sensitive cells representing the local velocity space are illustrated at given locations indicated by small circles (dark = low activity, light = high activity). t = 1: Initial flow estimates (V1) at different (subsampled) cell locations and activity patterns of selected cell populations (large arrows: true direction of motion). t = 2, . . . , 6: Demonstration of the disambiguation process.

P. Bayerl and H. Neumann

3.2 Real-World Data

3.2.1 Artificial Sequences. Figures 4 and 6 illustrate the ability of the model to segregate different regions of visual motion. The artificially generated sequence of a flight over Yosemite National Park simulates self-motion over a static environment, which leads to an expanding flow pattern, with the exception of clouds moving horizontally to the right. The resulting flow estimate demonstrates that the model is able to extract motion induced by gray-level sequences of different frequencies, although the correlation detector is based on only one small scale (the clouds are represented on a much coarser scale than the ground).
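Flow-field accuracy for such sequences is commonly scored with the angular-error measure of Barron et al. (1994), in which each flow vector (u, v) is embedded as a 3-D vector (u, v, 1) and the angle between the normalized estimated and true vectors is reported. A minimal sketch (the function and array names are ours, for illustration only):

```python
import numpy as np

# Angular error of Barron et al. (1994) between estimated and true flow.
# Each flow vector (u, v) is embedded as (u, v, 1); the angle between the
# normalized 3-D vectors is returned in degrees.
def angular_error_deg(u_est, v_est, u_true, v_true):
    num = u_est * u_true + v_est * v_true + 1.0
    den = np.sqrt(u_est ** 2 + v_est ** 2 + 1.0) * \
          np.sqrt(u_true ** 2 + v_true ** 2 + 1.0)
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))

# A perfect estimate scores ~0 degrees; estimating (1, 0) where the true
# flow is zero scores 45 degrees.
print(angular_error_deg(np.array([1.0]), np.array([0.0]),
                        np.array([1.0]), np.array([0.0]))[0])  # ~0.0
print(angular_error_deg(np.array([1.0]), np.array([0.0]),
                        np.array([0.0]), np.array([0.0]))[0])  # ~45.0
```

Averaging (`np.mean`) or taking the median (`np.median`) of this quantity over a flow field yields summary statistics of the kind reported in the following.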
Disambiguating Visual Motion
Figure 4 shows the quantitative analysis of the processing results for the Yosemite sequence. We obtain a flow field of 100% density with an average angular error of 6.20 degrees after 10 iterations. This result compares well with technical solutions, which produce less accurate estimates in most cases (Barron et al., 1994; Nicolescu & Medioni, 2003). The median error of 2.95 degrees after 10 iterations illustrates the robust model performance when outliers near region boundaries, which occur exclusively at the horizon, are excluded. Note that Barron et al. (1994) did not use iterations and multiresolution to refine the results of the computational models they compared, specifically in those approaches where iterations and multiresolution processing would be optional. Iterations were used, however, for example to propagate smoothness constraints (Horn & Schunk, 1981), which can be compared to our relaxation process. Again, we point out that our approach relies on a single-scale estimate of visual motion, which is subsequently integrated and in turn updated by context information.

Figure 3: Facing page. Perceptual hysteresis effect and model feedback for motion disambiguation. The proportion of MT cell activities indicating rightward motion (activities_right / (activities_left + activities_right)) is plotted for each frame while processing two random dot cinematograms (each sequence shows 60 moving dots and consists of 60 frames, 40 × 40 pixel). In the first sequence (sequence a, the solid line), the dots are initialized with a random position and a velocity of three pixels per frame to the right. In each frame, one right-moving dot changes its direction by moving to the left. In the second sequence (sequence b, the dashed line), the dots have an initial direction to the left, switching one after the other to the right. (A) Feedback processing disambiguates the signal and generates a directional hysteresis effect that reflects the inertia generated by locking in the prediction from top-down feedback of a motion direction measured over time. Both sequences show an initial ambiguity in estimated motion (relative activity is 80% and 20% for correct and incorrect motion, respectively), because the correlation detector confounds corresponding dots within a certain neighborhood. Therefore, wrong velocity cues are detected in addition to the correct ones. After a few iterations of feedback processing, the MT cells are disambiguated and indicate a perfectly coherent motion signal (100% or 0% rightward). The response for sequence a (solid line) switches from rightward to leftward motion between 60% and 75% of changed dot directions, whereas for sequence b (dashed line), it switches from leftward to rightward motion between 40% and 25% of changed dot directions (hysteresis). (B) Without feedback (C = 0; see Table 1), no hysteresis is generated. The sum of cell activities indicating rightward motion is proportional to the relative number of dots moving to the right. The initial ambiguity of 80% versus 20% correct and incorrect motion responses is not resolved.

3.2.2 Real Camera Images. Natural image sequences recorded by cameras are noisy, and there is a high probability of encountering complex
imaging conditions like occlusions and nonrigid motion. The proposed architecture successfully deals with a large set of natural sequences, including traffic sequences and animals in motion. Figure 5 illustrates the results for two examples: a traffic sequence with a moving taxi and a moving zebra. Feedback processing combines temporal information in order to segregate motion and eliminate outliers and noise. The spreading of unambiguous flow is especially visible where the aperture problem arises (along the stripes of the zebra) and where occlusion falsifies initial motion estimates (at the borders of the cars). The object segregation capabilities of the model are sketched in Figure 6. Concurrent interaction (shunting inhibition) in velocity space (including zero velocity, ∆x = 0) segregates regions of different motion from each other as well as from regions with stationary contrast configurations.

4 Discussion

In this letter, we presented a new computational framework of recurrent motion processing to integrate and segregate visual motion signals. In the following, we discuss the model's biological plausibility, compare it with existing models of motion processing, and summarize the major contributions.

4.1 Relevance and Biological Plausibility. There is both structural and functional evidence for the mechanisms and layered organization of our model. Anatomical and physiological studies support the inter- and intra-areal connections used in our model (Van Essen & Gallant, 1994; Maunsell, 1995). Motion-sensitive cells can be found in MT as well as in V1 (Maunsell & Van Essen, 1983). Physiological studies (Movshon et al., 1985) have shown that cells in V1 are sensitive to component motion (motion along oriented contrasts), whereas cells in MT are less sensitive to oriented components but signal pattern motion. Since our focus is on the investigation of feedback in the processing of motion stimuli, we kept the input stage of the model as simple as possible.
As a consequence, we assumed motion-sensitive cells in V1 to be independent of contrast orientation. We claim that this does not constrain the capability of our model to resolve ambiguities among motion-sensitive V1 cells even if they were sensitive to orientation as well, because motion signals indicated by such cells inherently contain ambiguities by the definition of component motion. Such an approach would, however, obscure the disambiguation process described by our model, which is a consequence of lateral and feedback interactions between the two model areas, which differ in the spatial scale of integration. We claim that model V1 could be adapted by incorporating further details to match physiological properties of V1, such as spatial frequency tuning, contrast polarity, and orientation tuning (Hubel & Wiesel, 1968). There is recent physiological evidence that V1 cells can partially encode pattern motion (that is, motion independent of orientation). Pack et al. (2003) showed that the time course of a subpopulation of V1 cells is similar to the time course of cells in MT solving the aperture problem near line endings. This is consistent with the prediction of our model that cells in V1 and MT are disambiguated simultaneously as a consequence of feedback and local competitive interaction. Lateral inhibition (normalization) is accomplished in our model by an isotropic center-surround interaction, which represents the simplest way to realize that functionality. The shunting equation (see equation 2.3) reflects neuronal activity saturation. As a result of our modeling investigations, we provide evidence that the experimentally observed behavior is generated by network dynamics that emerge from several elementary operations and functional principles of layering and connectivity. Given the level of detail of the employed model components, it was not our primary focus to quantitatively fit physiological data. Instead, our model serves as a link between physiology and perceptual behavior, which also enables processing real-world image sequences for benchmarking.

The model suggests key principles of motion disambiguation and integration. The proposed modulatory feedback mechanism is supported by recent physiological investigations of feedback connections between early visual areas (V1, V2, and V3) and MT (Hupé et al., 2001; Friston & Büchel, 2000). For example, Hupé et al. (2001) show that cell activities in V1 are strongly affected by feedback from MT in an excitatory manner shortly after stimulus onset. This is consistent with our model, in which only excitatory feedback modulation is used (1 + netFB ≥ 1; see equation 2.1). As a result of recurrent processing, and consistent with physiological recordings of the time course of MT neurons (Pack & Born, 2001), our model disambiguates the motion signal shortly after stimulus onset. Here, the time to establish the final percept is influenced by the strength of the feedback connections, the RF size ratio between V1 and MT, and, as a prediction of our model, the spatial extent of the region of ambiguous motion (see Figure 2). The time course of MT cell populations was also investigated by Pack and Born (2001) for different bar lengths (2-8 degrees). Consistent with our results, the time required to disambiguate such stimuli was roughly proportional to the bar length (Pack, personal communication, December 2003).

Figure 4: (A) Detected motion using an artificially generated sequence of four frames simulating a flight over Yosemite National Park (316 × 252 pixel resolution); shown: motion indicated by V1 cell populations. Expanding flow is caused by self-motion through a static scene (the ground) and rightward translational flow by the moving clouds (the region above the horizon). Motion is detected even though the underlying gray-level structure has strongly varying spatial frequency content, or scale. Both regions of motion (ground and clouds) are clearly segregated from each other after a few iterations of V1-MT feedback processing. (B, C) Angular error of the optical flow direction indicated by MT cell populations processing the first two frames of the Yosemite sequence. Since there is no ground truth available for the movement of the clouds, we assumed horizontal motion to the right (according to Barron et al., 1994). (B) Spatial configuration of the angular error at different time steps (dark = large errors, light = small errors). Error peaks occur at flow discontinuities (clouds and ground). Note that the regions containing these outliers are much smaller than the RF size of MT cells (indicated at the top left of each image). (C) Temporal development of the average and the median angular error (over the entire flow field). The average error nearly converges after five iterations and slightly improves to 6.20 degrees after 10 iterations. The median error of 2.95 degrees (after 10 iterations) illustrates the accuracy of the model excluding outliers, such as errors of approximately 180 degrees, which occur at region boundaries along the horizon.

Figure 5: Detected motion using real-world sequences (four frames). (A) The Hamburg taxi sequence (256 × 190 pixel resolution). (B) A walking zebra (http://www.junglewalk.com, 320 × 240 pixel). Shown: motion indicated by MT cell populations. Initially erroneous flow information is eliminated where no motion occurs, and slight directional corrections can be observed where flow information is affected by the aperture problem (at the borders of the cars and the stripes of the zebra).

Figure 6: Object segregation capabilities of the model. Display of maximal MT/V1 activities, normalized against global minima and maxima, for the sequences in Figures 4 and 5A (dark = low activation, light = high activation). Low activation appears due to strong competition at locations of high motion contrast. (A) In the taxi sequence, the cars are clearly segregated from the background and from each other. (B, C) The Yosemite sequence reveals two major regions of motion: ground and sky/clouds. It shows that the motion signal in V1 (C) contains some structural information that is averaged in MT (B).

4.2 Comparison with Existing Models of Motion Processing.
Different stages and computational mechanisms utilized by our model have also been used in other biologically inspired as well as computational models. For better readability, we organize our discussion according to two categories of existing models of motion estimation: pure feedforward models and recurrent models.

4.2.1 Feedforward Models. Simoncelli and Heeger (1998) proposed a model of detecting motion energy in areas V1 and MT using linear spatiotemporal filters. Individual motion estimates are normalized by dividing individual responses by the average response in a spatial neighborhood. This center-surround mechanism is also employed in our model. We achieve such a normalization by an antagonistic mechanism that involves shunting inhibition. The net effect is a divisive inhibition at individual locations by average activities integrated over a neighborhood in the space-velocity domain. Unlike Simoncelli and Heeger, we have incorporated a mechanism of modulatory feedback that disambiguates the motion signal and spreads activities over longer distances, solving the aperture problem. It is worth mentioning that their filters in V1 and MT differ from our mechanisms and model the fact that motion-sensitive cells in V1 have little or no speed tuning. Such filters could also be used in our model, but in order to focus on the influence of feedback processing, we omitted any additional parameters.

Nowlan and Sejnowski (1994, 1995) described a model of motion integration that utilizes an explicit selection signal, computed to determine the regions of the visual field where velocity estimates are most reliable. Motion-sensitive cells are then gated by this signal to produce the final estimate. Note that the way they learn to compute the selection signal is an elegant method that might be applied to learn a normalization process like the one described by Simoncelli and Heeger (1998). Our model differs in several ways from Nowlan's approach. While their approach utilizes a feedforward scheme, our model combines feedforward estimates with feedback integration and prediction. As a consequence, initial rough estimates are integrated and evaluated over time within a recurrent loop of matching velocities and motion predictions generated in area MT. As a by-product, reliable motion estimates are determined implicitly in our model instead of by explicitly generating a decision-like selection signal.

Weiss and Fleet (2001) and Weiss, Simoncelli, and Adelson (2002) solved the problem of determining the velocity of a moving object using a Bayesian estimation approach. They estimate the coherent motion of a moving shape by maximizing the posterior probability of velocity votes given the detected image motion. This formulation leads to a probability representation in velocity space for all measurements of single moving objects. In the spirit of IOC computation, all probability distributions are multiplicatively combined, including a given prior of expected velocities in the scene.
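This multiplicative combination can be sketched on a discretized velocity space (a toy reimplementation in our own notation; the grid size, measurement noise, and slow-motion prior width are arbitrary choices of ours, not parameters from Weiss et al.):

```python
import numpy as np

# Toy IOC-style Bayesian combination over a discretized velocity space:
# each local measurement contributes a likelihood, all likelihoods and a
# prior favoring slow motion are multiplied, and the MAP velocity is read out.
vx, vy = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))

def line_likelihood(nx, ny, c, sigma=0.3):
    # Likelihood from one 1-D motion measurement n . v = c (the aperture
    # problem: only the component normal to a contour is observed).
    return np.exp(-((nx * vx + ny * vy - c) ** 2) / (2 * sigma ** 2))

prior = np.exp(-(vx ** 2 + vy ** 2) / (2 * 1.5 ** 2))  # prior favoring slow motion

# Two contours of one shape translating with velocity (1, 0):
posterior = prior * line_likelihood(1, 0, 1.0) * line_likelihood(0, 1, 0.0)
iy, ix = np.unravel_index(np.argmax(posterior), posterior.shape)
print(vx[iy, ix], vy[iy, ix])  # the MAP velocity, close to (1, 0)
```

With only one constraint present, the same code reproduces the aperture problem: the posterior forms a ridge, and the prior pulls the MAP estimate toward the slowest velocity on it.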
The aperture problem is solved implicitly by maximizing the posterior over all individual measurements. In our model, we do not directly combine all initial estimates, since this would require a priori knowledge about which moving parts in the stimulus belong together. Instead, we let initial motion signals be modulated by a predictive signal from the higher processing stage of area MT, which serves as a local prior. In order to achieve a globally consistent estimate, this process is iterated to allow propagation of disambiguated motion signals along extended shape boundaries.

4.2.2 Recurrent Models. Grossberg, Mingolla, and Viswanathan (2001) and Mingolla (2003) presented a model of motion integration and segmentation in MT and MST based on inputs from the form pathway (modeled by the FACADE framework: Form-And-Color-And-DEpth; Grossberg, 1994). They studied how motion signals from partly occluded patterns can be integrated and segregated in a recurrent fashion. In contrast to our approach, their feedback signal (from MST) inhibits MT activities and has a more global character due to the RF size of MST cells (depending on the stimulus, these RFs cover 50% up to 100% of the entire stimulus). The authors suggest that such a mechanism of feedback inhibition and selection also helps to solve the aperture problem (Mingolla, 2003). Unlike in our model, this is realized by a decision-like mechanism through the inhibitory influence of global context information delivered by large-spanning kernels. We predict, therefore, that in their model any resolution of uncertainty from the aperture problem should be independent of the length of the bar stimulus. Instead, our model propagates salient motion information along extended boundaries through recurrent interaction of MT and V1 cells with different RF sizes. This filling-in mechanism achieves size-invariance properties. Also, its temporal properties concur with experimental observations, as the time needed for disambiguation increases with distance from locations of unambiguous motion.

Lidén and Pack (1999) proposed a model of recurrent lateral motion interactions that is able to produce a traveling wave of motion activation to solve the aperture problem. Like our model, they use a normalization similar to the mechanism described by Simoncelli and Heeger (1998) to emphasize salient motion estimates. In contrast to our model, their normalization mechanism is not isotropic in velocity space. The propagation is done by recurrent lateral excitation, leading to an unbounded filling-in process that has to be constrained by long-range inhibition between motion cells of different directional selectivity and by a separately processed motion boundary signal. In the absence of concurrent motion signals from multiple objects, their model leads to completely filled-in motion fields, which must be gated by multiplying with the input signal in order to display only relevant motion patterns. Conversely, our model implements a kind of "soft gating" by biasing the input signal during feedback processing (see equation 2.1) and therefore produces spatially restricted motion estimates at all time steps without an explicit computation of motion or form boundaries.
Koechelin, Anton, and Burnod (1999) describe a model of motion integration along the V1-MT pathway that utilizes mechanisms of recurrent lateral interactions. Their model utilizes a multiplicative combination of feedforward input and the result of lateral integration, which leads the authors to claim that their approach implements a neural mechanism of Bayesian estimation. Salient motion features are emphasized through normalization, and the results of recurrent lateral modulation (gating) are used to propagate these features. Though these mechanisms seem rather similar to those proposed in our model, the realization and behavior differ in many respects. For example, their gating process leads to strong inhibition of the input signal once the model has focused on one specific velocity while the stimulus changes to another velocity. Such lateral multiplication intensifies the winner-takes-all characteristic of their model (Koechelin et al., 1999) and makes it more vulnerable to errors. Our model follows a gradual prediction-and-correction philosophy, realized by an exclusively excitatory modulation of the feedforward input through the feedback signal, which is followed by a center-surround competitive mechanism to realize a biased competition. Essential to our model is the decoupling into different areas with different RF sizes. This provides a larger context to the higher area and thus the ability to correct (bias) and disambiguate cell activities in earlier areas with higher spatial accuracy. Another important point is that Koechelin et al. did not include cells sensitive to zero velocity or an interaction with static form information in their model. This makes it impossible to segregate moving objects from a static background in a spatially localized fashion. Combined with the winner-takes-all property, their model may even form regions of motion in a static image sequence with spatiotemporal noise that are not actually present. Finally, the results published in Koechelin et al. (1999) show that they fail to solve the aperture problem for moving bars that are longer than the size of their RFs. Contrary to this behavior, in our model a traveling wave of activation emerges that helps to disambiguate motion signals along extended bars and shape outlines independent of the ratio between shape size and RF size. This concurs with the temporal evolution of MT cell activities investigated by Pack and Born (2001). Our proposed architecture has been demonstrated to deal with large variations in shape and object size, so as to provide a mechanism of size-invariant motion integration.

5 Conclusion

In sum, we presented a model of motion processing in areas V1 and MT capable of handling synthetic as well as real-world image sequences. The model shows the following key properties: initial detection of raw flow information, temporal spreading of reliable motion signals to gradually correct uncertain flow estimates, and the ability to sharply segregate regions of individual visual motion. We showed how the aperture problem can be solved by contextual modulation, how feedback acts as short-term memory to account for hysteresis effects in motion disambiguation, and how global consistency is achieved by local interactions.
Our model is unique in the sense that it combines mechanisms of local lateral interaction with modulatory, purely excitatory feedback to resolve ambiguities in detected visual motion. Our approach makes several new contributions. First, we propose a model of cortical feedforward and feedback processing in the dorsal pathway of motion integration implementing a neural hypothesis-test cycle of computation. Most important, the feedback mechanism is part of a top-down modulatory enhancement of initial activities that match signal properties at a higher processing stage. Second, the disambiguation of initial estimates is solved by the interplay between top-down modulation and subsequent lateral competition. Consequently, the network dynamics propagate disambiguated motion signals along shape boundaries, thus realizing a guided filling-in process (Neumann, 2003). This mechanism is important in that it provides a means to process objects of different sizes in an invariant fashion. Third, the model serves as a link between physiological recordings (e.g., Pack & Born, 2001) and psychophysical investigations of perceptual motion integration (Williams & Phillips, 1987). Beyond this, the model is able to process real-world stimulus sequences to yield accurate motion estimates. We believe that this further supports the explanatory competence of the key computational elements of the model, as most other biologically inspired models do not compare the quality of their results against other technical or nontechnical models. Based on model simulations, we make several predictions concerning the computational mechanisms involved in early motion perception:

1. The disambiguation process observed for macaque MT cells (Pack & Born, 2001) should also be observed in V1 cells as a direct consequence of feedback interactions. This model prediction is partly confirmed by the findings of Pack et al. (2003).

2. The time to disambiguate regions of ambiguous motion depends on the distance of such regions from unambiguous motion features (e.g., induced by corners or line ends). This time is consumed by the increased number of feedforward-feedback cycles necessary to bias responses that cohere with the apparent motion direction.

3. Without feedback, no perceptual hysteresis is generated, and motion activity patterns remain ambiguous.

At the current stage of modeling, the process of motion grouping is based solely on proximity in the spatial and velocity domains. Interactions with the form pathway could enhance model performance by grouping motion information preferentially along contours. Ownership cues arising from occlusion can also be deduced from the ventral form pathway. Psychophysical investigations suggest that motion features are integrated only when they are intrinsic to the moving boundary, while extrinsic signals should be suppressed (Shimojo, Silverman, & Nakayama, 1989). This topic needs further investigation, since it is as yet unclear how signals from the form and motion pathways are integrated using mainly excitatory interactions between cortical areas.
However, even without these extensions, the model already yields psychophysically and physiologically consistent results for a broad range of motion stimuli. In addition, the quality of estimated motion direction in real-world sequences compares well with technical solutions, without explicit parameter tuning for each type of sequence. In all, the proposed model provides further evidence for key computational principles involved in the cortical processing of sensory stimuli, their integration, and their segregation. Neumann and Sepp (1999) proposed the basic mechanisms of feedforward feature detection, subsequent integration and matching, and modulation through feedback to implement a neural hypothesis-testing paradigm for boundary integration in static form perception. Model simulations of V1-V2 interaction demonstrated context-dependent changes of orientation selectivity, texture density suppression, and subjective contour completion as observed in physiological and psychophysical experiments. An extension of this architecture by Thielscher and Neumann (2003) incorporates a model area V4 to investigate the segregation of textures generated from bars of different orientations. Whereas these models focus on inter-areal feedforward-feedback interaction, we have also proposed a model of intra-areal recurrent V1 contour processing (Hansen & Neumann, 2004). These neurodynamical models of static form perception utilize the same core mechanisms of layered processing in cortical architecture. In this letter, we have proposed the same core mechanisms to account for the processing of temporally varying stimuli in the cortical motion pathway. Given the evidence gathered from our computational experiments, we claim that the early processing stages in visual cortex along the ventral and the parietal pathways are organized in a homologous fashion. Modulatory feedback and subsequent divisive inhibition realize a mechanism of biased competition already at an early stage, with behavior similar to that proposed by Desimone and Duncan (1995) for attention mechanisms that filter out irrelevant information. We have proposed a concept of feedback as part of a layered structure and representation and presented an implementation of multiple loops of recurrent interaction whose dynamics realize multilevel cortical hypothesis-testing cycles.

Appendix A: Equations for Initial Motion Detection (Input Stage)

Here we describe the equations used to generate an initial raw motion estimate, which serves as the input stage to V1. We use (oriented) model complex cells to compute a spatiotemporal correlation (modified elaborated Reichardt detectors (ERD), similar to Adelson & Bergen, 1985) to measure local motion for a specific range of velocities at each location. Equations A.1 to A.4 (illustrated in Figure 7A) generate a raw motion signal (encoded as a population code in c^{(3)}), which is computed from a pair of images and is further processed by the main model equations, B.1 to B.3.
c^{(1)}_{t,x,\alpha} represents oriented complex cell responses normalized through shunting inhibition. Eight orientations (\alpha) were used for the simulations (* denotes the convolution operator, G_\sigma is a gaussian function, and \partial^2_{x(\alpha)} the second directional derivative in direction \alpha). The convolution with the second derivative of an isotropic gaussian filter (with approximately \sigma = 1) was computed by applying a sampled first derivative of an isotropic gaussian filter (\sigma = 0.75) twice, in order to preserve (numerically) a dc component of zero. Figure 7B illustrates some examples of sampled first derivatives of a gaussian. Figure 7E shows a zoomed part of a frame from an example sequence to compare the size of the kernels (printed at the same scale) with typical image structures processed by the model:

c^{(1)}_{t,x,\alpha} = \frac{I_{t,x} * \partial^2_{x(\alpha)} G_\sigma}{0.01 + \sum_\beta \left| I_{t,x} * \partial^2_{x(\beta)} G_\sigma \right| * G_\sigma}.    (A.1)
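The two-pass construction of the second-derivative kernel described above can be checked numerically (a short sketch; the sigma follows the text, while the kernel radius is our choice): each pass has zero-sum coefficients, so the dc response stays at zero.

```python
import numpy as np

# A sampled first derivative of a gaussian (sigma = 0.75) applied twice
# approximates the second derivative while preserving a dc component of
# zero, since each pass has zero-sum coefficients.
def gauss_deriv1(sigma=0.75, radius=3):
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    d = -x * g / sigma ** 2       # sampled d/dx of a gaussian
    return d - d.mean()           # enforce an exactly zero dc response

k1 = gauss_deriv1()
k2 = np.convolve(k1, k1)          # two passes ~ second-derivative kernel
print(abs(k1.sum()), abs(k2.sum()))  # both numerically zero
```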
Figure 7: (A) Schematic overview of the correlation detector used as the input stage of V1 (for details, see the text). (B) First-derivative gaussian kernels for different orientations. (C) Isotropic gaussian kernel with σ = 1 used in model V1 and the correlation detector (see the text). (D) Gaussian kernel (σ = 7) used in model MT (see the text). (E) Zoomed part of a frame from an example sequence, printed at the same scale as the kernels in B-D.
c^{(2+)}_{t,x,\Delta x} and c^{(2-)}_{t,x,\Delta x} represent half-detectors for a specific velocity defined by a shift \Delta x = (x, y) between two successive frames and can be interpreted as a raw correlation of model complex cell responses. These results are obtained by summing over all orientations (as a consequence, the results are no longer orientation specific) and pooling over a small spatial neighborhood with an isotropic gaussian kernel (\sigma = 1), illustrated in Figure 7C:

c^{(2+)}_{t,x,\Delta x} = \left[ \sum_\alpha c^{(1)}_{t,x,\alpha} \cdot c^{(1)}_{t+1,x+\Delta x,\alpha} \right] * G_\sigma    (A.2)

c^{(2-)}_{t,x,\Delta x} = \left[ \sum_\alpha c^{(1)}_{t,x+\Delta x,\alpha} \cdot c^{(1)}_{t+1,x,\alpha} \right] * G_\sigma.    (A.3)
c^{(3)}_{t,x,\Delta x} builds the final population code at each location, indicating raw motion estimates ([\cdot]_+ is a rectification operator realizing \max(\cdot, 0)). A shunting inhibition normalizes these activities, which differs from the standard implementation of the full Reichardt detector, which uses subtractive inhibition:

c^{(3)}_{t,x,\Delta x} = \frac{[c^{(2+)}_{t,x,\Delta x}]_+ - 0.5 \cdot [c^{(2-)}_{t,x,\Delta x}]_+}{1.0 + [c^{(2-)}_{t,x,\Delta x}]_+}.    (A.4)
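Equations A.1 to A.4 can be condensed into a short sketch (our own minimal reimplementation; the oriented second derivatives and the parameter choices are illustrative, not the exact filters of the model):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_responses(I, sigma=1.0, n_orient=8):
    """Equation A.1 (sketch): oriented second-derivative responses with
    divisive (shunting) normalization over orientations."""
    gy, gx = np.gradient(gaussian_filter(I, sigma))
    resp = []
    for k in range(n_orient):
        a = np.pi * k / n_orient
        d1 = np.cos(a) * gx + np.sin(a) * gy            # first directional derivative
        d1y, d1x = np.gradient(d1)
        resp.append(np.cos(a) * d1x + np.sin(a) * d1y)  # ~ second directional derivative
    resp = np.stack(resp)
    return resp / (0.01 + gaussian_filter(np.abs(resp).sum(0), sigma))

def motion_population(I0, I1, shifts, sigma=1.0):
    """Equations A.2-A.4 (sketch): correlate oriented responses of two frames
    for each candidate shift and normalize into a velocity population code."""
    c0, c1 = oriented_responses(I0, sigma), oriented_responses(I1, sigma)
    relu = lambda a: np.maximum(a, 0.0)
    code = {}
    for dx, dy in shifts:
        c1_shift = np.roll(c1, (-dy, -dx), axis=(1, 2))         # c1 sampled at x + dx
        c0_shift = np.roll(c0, (-dy, -dx), axis=(1, 2))         # c0 sampled at x + dx
        c2p = gaussian_filter((c0 * c1_shift).sum(0), sigma)    # equation A.2
        c2m = gaussian_filter((c0_shift * c1).sum(0), sigma)    # equation A.3
        code[(dx, dy)] = (relu(c2p) - 0.5 * relu(c2m)) / (1.0 + relu(c2m))  # A.4
    return code

# A random texture shifted one pixel rightward yields more evidence for the
# shift (1, 0) than for the opposite direction.
rng = np.random.default_rng(0)
I0 = rng.random((32, 32))
I1 = np.roll(I0, 1, axis=1)
code = motion_population(I0, I1, shifts=[(1, 0), (-1, 0)])
print(code[(1, 0)].mean() > code[(-1, 0)].mean())  # True
```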
The key features of the mechanism are that (1) the responses of normalized complex cells of all orientations are involved in motion estimation and that (2) even stationary patterns can produce (weak) responses in a spatial velocity code (e.g., induced by a periodic gray-level pattern). The resulting activities c^{(3)}_{(x,\Delta x)} for different velocities (encoded by \Delta x) at different locations (x) indicate unambiguous motion at corners and line endings, ambiguous motion along contrasts, and no motion for homogeneous regions.

Appendix B: Equations for Motion Integration

Here we present the steady-state equations corresponding to equations 2.1 through 2.3. v^{(1)} samples the input signal (net^{IN}, depending on the model area; see Table 1) and multiplicatively enhances incoming signals matching the feedback signal (net^{FB}, depending on the model area; see Table 1):

v^{(1)} = net^{IN} \cdot (1 + C \cdot net^{FB}).    (B.1)
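A toy numerical illustration of equation B.1 (the bin layout and the value of C here are arbitrary choices of ours; see Table 1 for the model's settings): feedback is purely modulatory and excitatory, so it amplifies feedforward activity that matches the MT prediction but cannot create activity where the input is zero.

```python
import numpy as np

# Equation B.1 on three velocity bins: two equally ambiguous hypotheses in
# the input; the feedback prediction favors the second and third bins.
C = 100.0
net_in = np.array([0.2, 0.2, 0.0])
net_fb = np.array([0.0, 1.0, 1.0])

v1 = net_in * (1.0 + C * net_fb)  # equation B.1
print(v1)  # [0.2, 20.2, 0.0]: matching bin amplified, silent bin stays silent
```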
v^{(2)} realizes the integration in space and in the velocity domain by isotropic gaussian filters. In the velocity domain, the filter size is identical in both model areas (\sigma = 0.75). Note that the resolution of the velocity space was set to 15 × 15 for all simulations. The spatial kernel is a Dirac impulse (a gaussian with \sigma = 0) in model V1 and a gaussian filter with \sigma = 7 in model MT (illustrated in Figure 7D):

v^{(2)} = (v^{(1)})^2 * G^{(x,space)}_{\sigma_1} * G^{(\Delta x,velocity)}_{\sigma_2}.    (B.2)
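Equation B.2 can be sketched as separable gaussian filtering over a four-dimensional activity array indexed (y, x, v_y, v_x) (our own sketch; the sigmas follow the text, while the 8 × 8 spatial grid is an illustrative choice):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Equation B.2 (sketch): squared activities smoothed by separable gaussians,
# one sigma for the two spatial axes and one for the two velocity axes.
def integrate(v1, sigma_space, sigma_vel=0.75):
    return gaussian_filter(v1 ** 2,
                           sigma=(sigma_space, sigma_space, sigma_vel, sigma_vel))

v1 = np.random.default_rng(1).random((8, 8, 15, 15))  # 15 x 15 velocity space
v2_v1 = integrate(v1, sigma_space=0)  # model V1: Dirac spatial kernel
v2_mt = integrate(v1, sigma_space=7)  # model MT: broad spatial pooling
print(v2_mt.shape)                    # (8, 8, 15, 15)
```

The broader spatial pooling in model MT is what gives the feedback signal its larger context relative to model V1.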
v^{(3)} represents normalized velocity estimates in both model areas, realized by a shunting inhibition with the sum over all velocities. As a consequence, ambiguous signals are weakened, while unambiguous signals are emphasized:

v^{(3)} = \frac{v^{(2)} - \frac{1}{2n} \sum_{\Delta x} v^{(2)}}{0.01 + \sum_{\Delta x} v^{(2)}}.    (B.3)

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299.
Adelson, E., & Movshon, J. (1982). Phenomenal coherence of moving visual patterns. Nature, 300, 523–525.
Barron, J. L., Fleet, D. J., & Beauchemin, S. S. (1994). Performance of optical flow techniques. International Journal of Computer Vision, 12(1), 43–77.
Crick, F., & Koch, C. (1998). Constraints on cortical and thalamic projections: The no-strong-loop hypothesis. Nature, 391, 245–250.
Del Viva, M. M., & Morrone, M. C. (1998). Motion analysis by feature tracking. Vision Research, 38, 3633–3653.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Friston, K. J., & Büchel, C. (2000). Attentional modulation of effective connectivity from V2 to V5/MT in humans. PNAS, 97(13), 7591–7596.
Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87, 1–51.
Grossberg, S. (1994). 3-D vision and figure-ground separation by visual cortex. Perception and Psychophysics, 55, 48–120.
Grossberg, S., Mingolla, E., & Viswanathan, L. (2001). Neural dynamics of motion integration and segmentation within and across apertures. Vision Research, 41, 2521–2553.
Hansen, T., & Neumann, H. (2004). Neural mechanisms for the robust detection of junctions. Neural Computation, 16(5), 1013–1037.
Hildreth, E. C. (1984). Computations underlying the measurement of visual motion. Artificial Intelligence, 23, 309–354.
Horn, B. K. P., & Schunk, B. G. (1981). Determining optical flow. Artificial Intelligence, 17, 185–203.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195, 215–243.
Hupé, J. M., James, A. C., Girard, P., Lomber, S. G., Payne, B. R., & Bullier, J. (2001). Feedback connections act on the early part of the responses in monkey visual cortex. Journal of Neurophysiology, 85, 134–145.
Disambiguating Visual Motion
P. Bayerl and H. Neumann
Received August 8, 2003; accepted April 1, 2004.
LETTER
Communicated by Geoffrey Goodhill
Exact Solution for the Optimal Neuronal Layout Problem Dmitri B. Chklovskii [email protected] Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.
Evolution perfected brain design by maximizing its functionality while minimizing costs associated with building and maintaining it. The assumption that brain functionality is specified by neuronal connectivity, implemented by costly biological wiring, leads to the following optimal design problem. For a given neuronal connectivity, find a spatial layout of neurons that minimizes the wiring cost. Unfortunately, this problem is difficult to solve because the number of possible layouts is often astronomically large. We argue that the wiring cost may scale as wire length squared, reducing the optimal layout problem to a constrained minimization of a quadratic form. For biologically plausible constraints, this problem has exact analytical solutions, which give reasonable approximations to actual layouts in the brain. These solutions make the inverse problem of inferring neuronal connectivity from neuronal layout more tractable. 1 Introduction Wiring up distant neurons in the brain is costly to an organism (Ramón y Cajal, 1899/1999). The cost of wiring arises from its volume (Cherniak, 1992; Mitchison, 1991), metabolic requirements (Attwell & Laughlin, 2001), signal delay and attenuation (Rall et al., 1992; Rushton, 1951), or possible guidance defects in development (Dickson, 2002). Whatever the origin of the wiring cost, it must grow with the distance between connected neurons. Therefore, placing connected neurons as close as possible reduces the wiring cost and, for a given connectivity, confers a selection advantage to an organism. In principle, this evolutionary argument allows one to predict neuronal placement from connectivity data by solving an optimal layout problem. In practice, solving this problem for many neurons with nonstereotypical connectivity is complicated, and often impossible, due to the large number of possible neuronal permutations, which grows exponentially in the number of neurons.
Neural Computation 16, 2067–2078 (2004) © 2004 Massachusetts Institute of Technology
In this letter, we argue that the wiring cost may scale approximately as the wire length squared (see section 2). In this approximation, the optimal layout problem reduces to the minimization of a quadratic form (see section 3). The trivial solution is ruled out by biological constraints that can be classified into external and internal. For both classes of constraints, the optimal layout
can be found in analytical form (see sections 4 and 5). To test the quadratic placement optimization, we compare its predictions in two cases where both connectivity and layout are known: prefrontal cortical areas in macaque (see section 6) and Caenorhabditis elegans ganglia (see section 7). The solution of the quadratic optimal layout problem gives a reasonable approximation to the actual placement of these multineuron complexes. Thus, the quadratic wire length cost function promises to be a powerful tool for solving optimal neuronal layout problems. 2 Wiring Cost May Scale as Wire Length Squared Because the exact origin of the wiring cost is not known, one can only guess its dependence on the distance between neurons. In this section, we consider several plausible hypotheses for the wiring cost function and argue that the wire length squared may serve as a reasonable approximation. Previous work suggests that the cost of wiring is proportional to its volume (Cherniak, 1992; Chklovskii, Schikorski, & Stevens, 2002; Mitchison, 1991; Stepanyants, Hof, & Chklovskii, 2002), which scales with the distance times wire diameter squared. If the wire diameter is fixed, the wiring cost grows linearly with distance. But if the cost is proportional to volume, why not make all the axons infinitesimally thin? My collaborators and I have argued (Chklovskii et al., 2002; Chklovskii & Stepanyants, 2003) that the observed axon diameter may result from the trade-off between the wire volume cost, which grows with wire diameter, and signal propagation delay cost, which decreases with wire diameter because of an increase in conduction speed. This trade-off is captured by the wiring cost function, C, which contains two terms—one proportional to the signal propagation delay, T, to power n, and the other proportional to the wire volume, V:
C = \alpha T^n + \beta V, \qquad (2.1)
where α and β are unknown constants. If the wires are myelinated axons, the signal propagation speed, s, scales linearly with the wire diameter, d, s = kd, leading to the following expression for the cost function:
C = \alpha \left( \frac{L}{kd} \right)^n + \beta \frac{\pi}{4} d^2 L, \qquad (2.2)
where L is the wire length. The cost function is minimized by the wire diameter that solves the equation ∂C/∂d = 0. By substituting this optimal wire diameter into the cost function 2.2, I get the following dependence of the cost on the wire length, L:

C = \left( 1 + \frac{n}{2} \right) \left( \frac{\pi \alpha^{2/n} \beta}{2 n k^2} \right)^{n/(n+2)} L^{3n/(n+2)}. \qquad (2.3)
If the exponent n = 1, then the wiring cost scales linearly with the wire length. It is possible, however, that the signal propagation delay is a hard constraint (n = ∞). Then the wiring cost scales as the length cubed. Intermediate values of the exponent give cost functions that lie between the linear and the cubic dependence on length. Therefore, it is reasonable to approximate the actual wire length cost function by a quadratic expression (the n = 4 case of equation 2.3):
C = \frac{3}{4} \left( \frac{\pi^2 \alpha \beta^2}{k^4} \right)^{1/3} L^2. \qquad (2.4)
It is possible that the actual cost function is noticeably different from the quadratic form (e.g., if the exponent, n < 1). Still, the quadratic cost function can be very useful due to the exact solvability of the quadratic layout problem, as demonstrated in sections 4 and 5. The validity of the quadratic cost function is established by the comparison of theoretical predictions with experimental data (see sections 6 and 7). Thus, the quadratic cost function may play a role in neuronal layout optimization similar to the harmonic oscillator in physics or the fruit fly in genetics. 3 Optimal Layout Problem Requires Constraints In order to formulate the quadratic optimal layout problem, we represent a neuronal circuit as a nondirected weighted graph. Nodes of the graph correspond to neurons (or multineuron complexes), and edges correspond to connections between neurons (or between multineuron complexes). The weight of each edge represents the connection strength and is given by the (constant) coefficient in front of the wire length squared (see equation 2.4) times the multiplicity of the connection. In turn, the multiplicity of the connection is given by the number of parallel wires between the given pair of neuronal complexes or perhaps by the number of synapses between the given pair of neurons. The directionality of connections can be ignored because the cost of a wire does not depend on the direction of signal propagation. The graph is specified algebraically by the adjacency matrix (or the wiring diagram), A, where weights Aij give (nondirectional) connection strengths between neurons i and j. This matrix is symmetric (Aij = Aji ), nonnegative (Aij ≥ 0), with all diagonal elements equal to zero (Aii = 0). With the help of the adjacency matrix, the quadratic wire length cost function for the neuronal circuit can be written as
C = \frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2, \qquad (3.1)
where ri , rj are coordinates of the nodes i and j. The quadratic optimal layout problem is to find coordinates, which minimize the cost function for given constraints. Constraints exclude the trivial solution, ri = 0, and may
be classified by their biological origin into external and internal. External constraints arise from the fact that the brain is not an isolated network of neurons but is connected with the sensory and motor organs, the placement of which is determined by functional requirements unrelated to wiring (see section 4). Internal constraints arise from the volume exclusion by neuron bodies and axons, meaning that no two neurons or axons can occupy the same point in space (see section 5). The quadratic wire length cost function 3.1 has a simple physical analogy. If neurons are connected by stretched rubber bands of zero length at rest, then equation 3.1 represents their elastic energy. The weights A_{ij} in equation 3.1 represent elasticity of connections. Then the minimal energy state is achieved when all neurons are in one location and all rubber bands have zero length. This trivial solution is ruled out by external or internal constraints. Previously, the rubber band analogy inspired the elastic net algorithm (Durbin & Mitchison, 1990; Durbin & Willshaw, 1987; Goodhill & Willshaw, 1990), whose relationship to the wiring minimization is discussed in Goodhill and Sejnowski (1997). 4 Exact Solution Under External Constraints The function of the brain is to bridge sensory input and motor output. Communications between sensory and motor organs, on the one hand, and the brain, on the other, require biological wires. The cost of these wires must be included in the overall cost function,

C = \frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2 + \sum_{i,j} B_{ij} (r_i - f_j)^2, \qquad (4.1)

where the first term represents the cost of wiring between neurons in the brain and the second term represents the cost of wiring between the brain and sensory and motor organs. Weight B_{ij} represents connection strength between neuron i and organ j, and f_j is the coordinate of organ j.
As various functional requirements determine organ placement in the body plan (e.g., frontal eyes, forward nose, muscles attached to bones), it is reasonable to formulate the optimal neuronal layout problem with the organ coordinates fixed. To solve the optimal layout problem, we search for the minimum of the wiring cost function 4.1, while varying the locations of the brain neurons, r_i. An elegant way to do this is by first rewriting the two terms of the cost function in a matrix form (Hall, 1970),

\frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2 = \frac{1}{2} \sum_{i,j} A_{ij} (r_i^2 - 2 r_i r_j + r_j^2) = \sum_i r_i^2 \sum_j A_{ij} - \sum_{i,j} r_i A_{ij} r_j = r^T (D^A - A) r = r^T L r, \qquad (4.2)
where r is a column vector {r_i}, matrix D^A_{ij} = \delta_{ij} \sum_k A_{ik}, and L is called the Laplacian of matrix A, and

\sum_{i,j} B_{ij} (r_i - f_j)^2 = \sum_{i,j} B_{ij} (r_i^2 - 2 r_i f_j + f_j^2) = \sum_i r_i^2 \sum_j B_{ij} - 2 \sum_{i,j} r_i B_{ij} f_j + \sum_j f_j^2 \sum_i B_{ij} = r^T D^B r - 2 r^T B f + \mathrm{const}, \qquad (4.3)

where matrix D^B_{ij} = \delta_{ij} \sum_k B_{ik}. The minimum of the quadratic wire length cost function 4.1 can be found by taking a derivative with respect to r_i and setting it to zero:

\frac{dC}{dr} = 2(L + D^B) r - 2 B f = 0. \qquad (4.4)
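The matrix identities above can be checked numerically; the sketch below (random data, purely illustrative) verifies equation 4.2, that the pairwise wiring cost equals the quadratic form r^T L r:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric adjacency matrix with zero diagonal, as in section 3.
n = 5
A = rng.random((n, n))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)

r = rng.standard_normal(n)               # one-dimensional node coordinates

# Left-hand side of equation 4.2: the pairwise sum.
C_pairwise = 0.5 * sum(A[i, j] * (r[i] - r[j]) ** 2
                       for i in range(n) for j in range(n))

# Right-hand side: the quadratic form with the graph Laplacian L = D^A - A.
Lap = np.diag(A.sum(axis=1)) - A
C_quadratic = r @ Lap @ r

assert np.isclose(C_pairwise, C_quadratic)
```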
Then the optimal layout is given by the following matrix equation:

r = (L + D^B)^{-1} B f. \qquad (4.5)
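Equation 4.5 can be evaluated directly for a toy wiring diagram (all numbers below are illustrative assumptions, not anatomical data):

```python
import numpy as np

# Toy network: 3 movable neurons, 2 fixed "organs" at coordinates f.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # neuron-neuron connection strengths
B = np.array([[2., 0.],
              [0., 0.],
              [0., 2.]])              # neuron-organ connection strengths
f = np.array([0., 1.])                # fixed one-dimensional organ coordinates

Lap = np.diag(A.sum(axis=1)) - A      # graph Laplacian L = D^A - A
DB = np.diag(B.sum(axis=1))
r = np.linalg.solve(Lap + DB, B @ f)  # equation 4.5: r = (L + D^B)^{-1} B f

print(r)  # approximately [0.167, 0.5, 0.833]
```

As expected, each neuron that is strongly tied to a fixed organ is pulled toward it, while the middle neuron, connected only internally, settles between its neighbors.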
This solution for the layout problem can be easily generalized to d spatial dimensions. Because the cost function 4.1 is separable into d terms, each containing distances along different dimensions, equation 4.5 gives the projection of the layout vector onto the corresponding spatial dimension. In the rubber bands analogy, this solution is reminiscent of a cobweb, where the network nodes’ location is determined by the forces exerted by external and internal connections. If external connections, B_{ij}, are much stronger than internal, A_{ij}, the location of the nodes is determined mainly by the balance of external forces. For example, if each node makes a strong connection with only one fixed organ, it should be located on that organ. In the opposite limit, when internal connections dominate, all nodes cluster together near the center of mass. Then the optimal layout problem can be broken into two steps. First, the location of the center of mass, r_{cm}, corresponds to the minimum of

C_{cm} = \sum_j (r_{cm} - f_j)^2 \sum_i B_{ij}. \qquad (4.6)
Second, coordinates of the nodes relative to the center of mass can be found by using the spectral decomposition of the Laplacian (see the next section). This two-step solution allows one to predict the center of mass location even when the internal connections are not completely known. 5 Spectral Analysis Emulates Internal Constraints The finite size of neuronal bodies and axons places constraints on the possible layouts because of volume exclusion, or congestion. Inclusion of these
constraints is in general a difficult problem. Here we present an approximate treatment of the internal constraints, which yields an aesthetically appealing exact solution (Hall, 1970). In order to avoid the trivial solution, the norm of vector r is fixed, yielding the following optimization problem (see equation 4.2 for derivation):

\text{minimize } C = \frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2 = r^T L r, \quad \text{subject to } r^T r = 1. \qquad (5.1)
This minimization problem is solved by the eigenvector of L corresponding to its lowest eigenvalue. However, the lowest eigenvalue of L is 0, and the corresponding eigenvector is 1/N1/2 , which means that all nodes are at the same point. This led Hall (1970) to introduce an additional constraint by requiring that the minimization problem solution be orthogonal to that eigenvector: rT 1 = 0 or ri = 0. (5.2) i
Then the solution to the one-dimensional optimal layout problem is given by the eigenvector of the Laplacian, v_2, corresponding to the second lowest eigenvalue, λ_2. If the problem is d-dimensional, then the solution is given by the d eigenvectors, corresponding to the 2nd to d + 1st lowest eigenvalues (Hall, 1970). The layout problem with internal constraints also admits a physical analogy. In addition to the elastic force exerted on the nodes by massless rubber bands, there is a repulsive force proportional to the distance from the origin. The role of the additional constraint 5.2 is to pin the center of mass to the origin. Alternatively, one can view this problem as finding the configuration with minimum elastic energy for fixed moment of inertia 5.1 and center of mass 5.2. Unfortunately, the above solution for internal constraints cannot be combined straightforwardly with that for external constraints presented in section 4. For example, the center of mass coordinates is determined by incompatible considerations in the two cases: arbitrary placement at the origin versus force balance depending on external constraints. Yet the Laplacian spectrum and the corresponding eigenvectors may approximate the solution obtained with external constraints if the weights of internal connections dominate those of the external ones. This relationship between the two formulations can be formalized by rewriting the scalar product between the ith eigenvector, v_i, and the external constraint solution, r:

v_i r = \frac{v_i B f}{\lambda_i + \beta_i}, \qquad (5.3)
where βi are coefficients of the spectral decomposition of DB in the projection basis.
6 Layout of Prefrontal Cortical Areas in Macaque Cerebral cortex consists of multiple areas whose spatial arrangement and interconnectivity are reproducible from animal to animal. Previous work suggests that the arrangement of cortical areas is determined by minimizing the total length of the interconnections between them (Cherniak, 1994; Mitchison, 1991). Recently, this suggestion has been put to a direct test in the macaque prefrontal cortex (Klyachko & Stevens, 2003), where most of the interconnections and the layout of areas are known (Carmichael & Price, 1996; see Figure 1). Brute force enumeration of all possible area layouts shows that the wiring in the actual layout is the shortest (Klyachko & Stevens, 2003). Here we test the quadratic wire length approach on the data set used in Klyachko and Stevens (2003) by using both external and internal constraints formulation. In the external constraints formulation, the 14 areas on the periphery of the prefrontal cortex are treated as fixed organs, their locations being the actual ones (the crosses in Figure 1A and 1B). Locations of the remaining 10 areas are continuously varied to minimize the sum of connection lengths squared. Equation 4.5 yields the placement shown in Figure 1B. Although the predicted locations are closer together than in reality, the predicted ordering is close to the actual one. The only exception to the correct ordering is the placement of 14r, which should be lower. Interestingly, this placement corresponds to one of the close-to-optimal placements reported in Klyachko and Stevens (2003). There are two possible explanations for why the areas are predicted to bunch up more than they do in reality. First, the external constraint formulation neglects volume exclusion, that is, the fact that the areas have finite size and cannot overlap. Second, there may be connections between the areas that were considered in Carmichael and Price (1996) and the areas that were not considered. 
Because these areas lie outside the considered fixed areas, their inclusion would pull apart the movable areas. The internal constraints formulation applied to the 18 areas included in prefrontal orbital and medial networks (Carmichael & Price, 1996) yields the arrangement shown in Figure 1C. This placement has approximately correct ordering of the areas. Moreover, this analysis correctly clusters cortical areas into two clusters distinguished by the sign of the second eigenvector component. These clusters correspond to the known (Carmichael & Price, 1996) subdivisions of the prefrontal cortex orbital (blue labels) and medial (red labels) networks. Thus, the predictions of the internal constraints formulation are consistent with anatomical data. 7 Layout of Ganglia in C. elegans Neurons in the C. elegans nervous system are clustered into several ganglia distributed along its body. Most connections between the ganglia are known, making this system a natural choice for testing the wiring optimiza-
Figure 1: Comparison of the actual cortical area arrangement (A) with the predictions of the quadratic layout optimization under external (B) and internal (C) constraints. (A) Cortical area centers in the coordinate frame of the flattened prefrontal cortex, taken from Klyachko and Stevens (2003) and labeled according to Carmichael and Price (1996). Crosses indicate areas that were fixed in the external constraint formulation, and circles indicate movable areas. (B) Area locations predicted by the external constraint formulation. Blue lines show internal connections (Aij ), and red lines external ones (Bij ). (C) Area locations predicted by the internal constraint formulation. Areas cluster by the sign of the second eigenvector component (negative versus positive abscissa) in accordance with the known division of the prefrontal cortex into orbital (brown labels) and medial (green labels) networks (Carmichael & Price, 1996).
Figure 2: Solid dots (connected in the anterior-posterior order) show predicted versus actual positions of C. elegans ganglia normalized by the distance from head to tail. Deviations from the diagonal line correspond to differences in the actual versus predicted ganglia positions. Although predicted ganglia positions differ from the actual ones, their order is predicted correctly with the exception of the dorsorectal ganglion (actual position: 0.88 of body length).
tion approach. The layout problem is essentially one-dimensional because of the large aspect ratio (more than 10:1) of the worm body. Brute force enumeration of all permutations of 11 movable components (including nerve ring in addition to ganglia; Cherniak, 1994) shows that the actual ordering minimizes the total wire length. Here I show that solving a quadratic placement problem can largely reproduce the actual order of ganglia. In the external constraints formulation, the locations of ganglia are given by equation 4.5, where the wiring diagram and fixed locations of sensors and muscles are the same as in Cherniak (1994). Figure 2 shows the predicted positions of C. elegans ganglia versus the actual ones. The predicted order agrees with the actual one with the exception of a single ganglion. This is a reasonably good agreement considering that there are 11! alternative orderings. However, predicted ganglia locations deviate from the actual ones. These deviations may be due to missing information in the wiring diagram (e.g., the lack of neuromuscular connections); deviations of the cost function from the quadratic form; or
other factors that should be included in the cost function. Future work will determine which of these factors are responsible for the disagreement. Internal constraints are unlikely to play a significant role in the placement of ganglia due to the relative sparseness of the C. elegans nervous system. Therefore, I skip this analysis. 8 Discussion In this letter, I argue that the wire length squared may approximate the wiring cost, thus reducing the optimal layout problem to the constrained minimization of a quadratic form. For two types of constraints, external and internal, exact analytical solutions exist, allowing straightforward and intuitive analysis. To test the quadratic optimization approach, I revisit two known cases of wire length minimization, where previous solutions relied on brute force complete enumeration. Minimization of wire length squared approximates the actual layouts reasonably well. One recurring problem with the external constraint formulation is the bunching of graph nodes in the solution. This happens because the number of internal connections usually exceeds that of the external ones. The bunching does not happen in actual brains because of the "volume exclusion" of multineuron complexes or internal constraints. The spectral method emulates these constraints and eliminates the bunching problem. However, exclusion of the external connections may lead to the overall rotation of the graph or incorrect positioning of some multineuron complexes. As in any other theoretical analysis, the optimal neuronal layout solution relies on several simplifying assumptions. The central assumption, the quadratic form of the cost function, is supported by the argument in section 2, showing that the quadratic cost function may be a reasonable approximation. The utility of this approximation is due to its exact solvability (see sections 4 and 5).
The validity of this approximation is supported by the fact that its predictions are consistent with experimental data (see sections 6 and 7). Another assumption is that wiring consists of point-to-point (nonbranching) axons. This assumption is valid for connections between cortical areas and can serve as a first-order approximation in other cases. Future work will analyze the impact of axonal branching and the presence of dendrites on brain design. Also, in the real brain, axonal branches are not exactly straight lines. Their curvature is itself a result of internal constraints, which are included here only on the mean-field level (see section 5). A more detailed treatment of internal constraints, or congestion, will be presented elsewhere. Although the quadratic cost function is an approximation, it yields optimal layouts reasonably close to those obtained by minimizing total wiring length in realistic situations. While complete enumeration of possible layouts is limited to a small number of movable components, the quadratic layout problem yields exact solutions in analytical form for wiring diagrams as big as computers can handle. These analytical solutions can be readily and
intuitively investigated, making the inverse problem (predicting connectivity from neuronal layout) more tractable. Since solving this problem may complement the existing experimental methods for establishing neuronal connectivity, the quadratic cost function promises to be an important tool for understanding brain design and function. Acknowledgments I thank Natarajan Kannan for bringing Hall’s work to my attention and Armen Stepanyants, Anatoli Grinshpan, and all the members of my group for helpful discussions. I am grateful to Charles Stevens for making Klyachko and Stevens (2003) available prior to publication. This work was supported by the Lita Annenberg Hazen Foundation and the David and Lucile Packard Foundation. References Attwell, D., & Laughlin, S. B. (2001). An energy budget for signaling in the grey matter of the brain. J. Cereb. Blood Flow Metab., 21, 1133–1145. Carmichael, S. T., & Price, J. L. (1996). Connectional networks within the orbital and medial prefrontal cortex of macaque monkeys. J. Comp. Neurol., 371, 179–207. Cherniak, C. (1992). Local optimization of neuron arbors. Biol. Cybern., 66, 503– 510. Cherniak, C. (1994). Component placement optimization in the brain. J. Neurosci., 14, 2418–2427. Chklovskii, D. B., Schikorski, T., & Stevens, C. F. (2002). Wiring optimization in cortical circuits. Neuron, 34, 341–347. Chklovskii, D. B., & Stepanyants, A. (2003). Power-law for axon diameters at branch point. BMC Neurosci., 4, 18. Dickson, B. J. (2002). Molecular mechanisms of axon guidance. Science, 298, 1959–1964. Durbin, R., & Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343, 644–647. Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691. Goodhill, G. J., & Sejnowski, T. J. (1997). A unifying objective function for topographic mappings. Neural Computation, 9, 1291–1303. Goodhill, G. J., & Willshaw, D. J. 
(1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network, 1, 41–59. Hall, K. (1970). An r-dimensional quadratic placement algorithm. Management Science, 17, 219–229. Klyachko, V. A., & Stevens, C. F. (2003). Connectivity optimization and the positioning of cortical areas. Proc. Natl. Acad. Sci. USA, 100, 7937–7941. Mitchison, G. (1991). Neuronal branching patterns and the economy of cortical wiring. Proc. R. Soc. Lond. B Biol. Sci., 245, 151–158.
Rall, W., Burke, R. E., Holmes, W. R., Jack, J. J., Redman, S. J., & Segev, I. (1992). Matching dendritic neuron models to experimental data. Physiol. Rev., 72, S159–186. Ramón y Cajal, S. (1999). Textura del sistema nervioso del hombre y de los vertebrados (Texture of the nervous system of man and the vertebrates). New York: Springer-Verlag. (Original work published 1899). Rushton, W. A. (1951). Theory of the effects of fibre size in medullated nerve. J. Physiol., 115, 101–122. Stepanyants, A., Hof, P. R., & Chklovskii, D. B. (2002). Geometry and structural plasticity of synaptic connectivity. Neuron, 34, 275–288. Received October 21, 2003; accepted March 17, 2004.
LETTER
Communicated by Dean Buonomano
Decoding a Temporal Population Code

Philipp Knüsel
[email protected]

Reto Wyss
[email protected]

Peter König
[email protected]

Paul F.M.J. Verschure
[email protected]

Institute of Neuroinformatics, University/ETH Zürich, Zürich, Switzerland
Encoding of sensory events in internal states of the brain requires that this information can be decoded by other neural structures. The encoding of sensory events can involve both the spatial organization of neuronal activity and its temporal dynamics. Here we investigate the issue of decoding in the context of a recently proposed encoding scheme: the temporal population code. In this code, the geometric properties of visual stimuli become encoded into the temporal response characteristics of the summed activities of a population of cortical neurons. For its decoding, we evaluate a model based on the structure and dynamics of cortical microcircuits that has been proposed for computations on continuous temporal streams: the liquid state machine. Employing the original proposal of the decoding network results in only moderate performance. Our analysis shows that the temporal mixing of subsequent stimuli results in a joint representation that compromises their classification. To overcome this problem, we investigate a number of initialization strategies. Whereas a deterministically initialized network yields the best performance, we find that when the network is never reset, that is, when it continuously processes the sequence of stimuli, the classification performance is greatly hampered by the mixing of information from past and present stimuli. We conclude that this problem of the mixing of temporally segregated information is not specific to this particular decoding model but relates to a general problem that any circuit processing continuous streams of temporal information needs to solve. Furthermore, as both the encoding and decoding components of our network have been independently proposed as models of the cerebral cortex, our results suggest that the brain could solve the problem of temporal mixing by applying reset signals at stimulus onset, leading to a temporal segmentation of a continuous input stream.
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2079–2100 (2004)
2080
P. Knüsel, R. Wyss, P. König, and P. Verschure
1 Introduction

The processing of sensory events by the brain requires the encoding of information in an internal state. This internal state can be represented by the brain using a spatial code, a temporal code, or a combination of both. For further processing, however, this encoded information requires decoding at later stages. Hence, any proposal on how a perceptual system functions must address both the encoding and the decoding aspects. Encoding requires the robust compression of the salient features of a stimulus into a representation that has the essential property of invariance. The decoding stage involves the challenging task of decompressing this invariant and compressed representation into a high-dimensional representation that facilitates further processing steps such as stimulus classification. Here, based on a combination of two independently proposed and complementary encoding and decoding models, we investigate sensory processing and the properties of a decoder in the context of a complex temporal code. Previously we have shown that visual stimuli can be invariantly encoded in a so-called temporal population code (Wyss, König, & Verschure, 2003). This encoding was achieved by projecting the contour of visual stimuli onto a cortical layer of neurons that interact through excitatory lateral couplings. The temporal evolution of the summed activity of this cortical layer, the temporal population code, encodes the stimulus-specific features in the relative spike timing of cortical neurons on a millisecond timescale. Indeed, physiological recordings in area 17 of cat visual cortex support this hypothesis by showing that cortical neurons can produce feature-specific phase lags in their activity (König, Engel, Roelfsema, & Singer, 1995). The encoding of visual stimuli in a temporal population code has a number of advantageous features.
First, it is invariant to stimulus transformation and robust to both network and stimulus noise (Wyss, König, & Verschure, 2003; Wyss, Verschure, & König, 2003). Thus, the temporal population code satisfies the properties of the encoding stage outlined above. Second, it provides a neural substrate for the formation of place fields (Wyss & Verschure, in press). Third, it can be implemented without violating known properties of cortical circuits such as the topology of lateral connectivity and transmission delays (Wyss, König, & Verschure, 2003). Thus, the temporal population code provides a hypothesis on how a cortical system can invariantly encode visual stimuli. Different approaches for decoding temporal information have been suggested (Kolen & Kremer, 2001; Mozer, 1994; Buonomano & Merzenich, 1995; Buonomano, 2000). A recently proposed approach is the so-called liquid state machine (Maass, Natschläger, & Markram, 2002; Maass & Markram, 2003). We evaluate the liquid state machine as a decoding stage since it is a model that aims to explain how cortical microcircuits solve the problem of the continuous processing of temporal information. The general structure of this approach consists of two stages: a transformation and a readout stage.
The transformation stage consists of a neural network, the liquid, which performs real-time computations on time-varying continuous inputs. It is a generic circuit of recurrently connected integrate-and-fire neurons coupled with synapses that show frequency-dependent adaptation (Markram, Wang, & Tsodyks, 1998). This circuit transforms temporal patterns into highdimensional and purely spatial patterns. A key property of this model is that there is an interference between subsequent input signals, so that they are mixed and transformed into a joint representation. As a direct consequence, it is not possible to separate consecutively applied temporal patterns from this spatial representation. The second stage of the liquid state machine is the readout stage, where the spatial representations of the temporal patterns are classified. Whereas most previous studies considered Poisson spike trains as inputs to the liquid state machine, in this article, we investigate the performance of this model in classifying visual stimuli that are represented in a temporal population code. Although the liquid state machine was originally proposed for the processing of continuous temporal inputs, it is unclear how this generalizes to the continuous processing of a sequence of stimuli that are temporally encoded. By analyzing the internal states of the network, we show that in its original setup, it tends to create overlaps among the stimulus classes. This suggests that in order to improve its performance, a reset locked to the onset of a stimulus could be required. We compare different strategies on preparing this network to the presentation of a new stimulus, ranging from random and deterministic initialization strategy to pure continuous processing with no stimulus-triggered resets. We find a large range of classification performance, showing that the no-reset strategy is significantly outperformed by the different types of stimulus-triggered initializations. 
Building on these results, we discuss possible implementations of such mechanisms by the brain.

2 Methods

2.1 Temporal Population Code. We analyze the classification of visual stimuli encoded in a temporal population code as produced by a cortical-type network proposed earlier (Wyss, König, & Verschure, 2003). This network consists of 40 × 40 integrate-and-fire cells that are coupled with symmetrically arranged excitatory connections having distance-specific transmission delays. The inputs to this network are artificially generated "visual" patterns (see Figure 1). Each of the 11 stimulus classes consists of 1000 samples. The output of the network (see Figure 2) is the sum of activities recorded during 100 ms with a temporal resolution of 1 ms, that is, a temporal population code. We are exclusively interested in assessing the information in the temporal properties of this code. Thus, each population activity pattern is rescaled such that the peak activity is set to one. The resulting population activity patterns (which we also refer to as temporal activity patterns) constitute the input to the decoding stage, the liquid state machine (see Figure 3). Based on a large set of synthetic stimuli consisting of 800 classes and using mutual information, we have shown that the information content of the temporal population code is 9.3 bits given a maximum of 9.64 bits (Wyss, König, & Verschure, 2003; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997).

Figure 1: Prototypes of the synthetic "visual" input patterns used to generate the temporal population code. There are 11 different classes where each class is composed of 1000 samples. The resolution of a pattern is 40 × 40 pixels. The prototype pattern of each class is generated by randomly choosing four vertices and connecting them by three to five lines. Given a prototype, 1000 samples are constructed by randomly jittering the location of each vertex using a two-dimensional gaussian distribution (σ = 1.2 pixels for both dimensions). All samples are then passed through an edge detection stage and presented to the network of Wyss, König, & Verschure (2003).
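The construction of a rescaled population activity pattern described in section 2.1 can be sketched as follows. This is a minimal illustration: a random toy raster stands in for the output of the actual cortical network.

```python
import numpy as np

def temporal_population_code(spikes):
    """Sum a spike raster over cells and rescale so the peak activity is one.

    spikes: binary array of shape (n_cells, n_timesteps), e.g. 1600 cells
            (40 x 40) recorded for 100 ms at 1 ms resolution.
    """
    activity = spikes.sum(axis=0).astype(float)  # number of active cells per ms
    return activity / activity.max()             # rescale peak activity to one

rng = np.random.default_rng(0)
raster = (rng.random((1600, 100)) < 0.05).astype(int)  # toy 40 x 40 raster, 100 ms
pattern = temporal_population_code(raster)
```

The resulting `pattern` is a 100-sample trace in [0, 1], the same format as the mean traces shown in Figure 2.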
Figure 2: Temporal population code of the 11 stimulus classes. Shown are the mean traces of the population activity patterns encoding the number of active cells as a function of time (1 ms temporal resolution, 100 ms length) after rescaling.
2.2 Implementation of the Liquid State Machine. The implementation of the liquid state machine evaluated here, including the readout configuration, is closely based on the original proposal (Maass et al., 2002; see the appendix). The liquid is formed by 12 × 12 × 5 = 720 leaky integrate-and-fire neurons (the liquid cells) that are located on the integer points of a cubic lattice, where 30% randomly chosen liquid cells receive input and 20% randomly chosen liquid cells are inhibitory (see Figure 3). The simulation parameters of the liquid cells are given in Table 1. The probability of a synaptic connection between two liquid cells located at a and b is given by a gaussian distribution, p(a, b) = C · exp(−(|a − b|/λ)^2), where |.| is the Euclidean norm in R^3 and C and λ are constants (see Table 2). The synapses connecting the liquid cells show frequency-dependent adaptation (Markram et al., 1998; see the appendix).
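The gaussian connectivity rule above can be sketched as follows; the single values of C and λ used here are illustrative placeholders for the class-dependent entries of Table 2.

```python
import itertools
import numpy as np

def connection_matrix(shape=(12, 12, 5), C=0.3, lam=2.0, seed=0):
    """Sample a random adjacency matrix for liquid cells on a cubic lattice.

    The probability of a synapse between cells at positions a and b is
    p(a, b) = C * exp(-(|a - b| / lambda)**2), with |.| the Euclidean norm.
    """
    coords = np.array(list(itertools.product(*(range(s) for s in shape))),
                      dtype=float)
    # Pairwise Euclidean distances between all lattice positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    prob = C * np.exp(-((dist / lam) ** 2))
    rng = np.random.default_rng(seed)
    return rng.random(prob.shape) < prob  # boolean adjacency matrix

conn = connection_matrix()
```

Because the probability falls off as a gaussian of distance, the sampled matrix is dominated by short-range connections, matching the locality of the lattice.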
Figure 3: General structure of the implementation of the liquid state machine. A single input node provides a continuous stream to the liquid that consists of recurrently connected integrate-and-fire neurons that are fully connected with 11 readout groups. Each of the readout groups consists of 36 integrate-and-fire neurons. Weights of the synaptic connections projecting to the readout groups are trained using a supervised learning rule.

Table 1: Simulation Parameters of the Neurons of the Liquid.

Name                     Symbol    Value
Background current       I_bg      13.5 nA
Leak conductance         g_leak    1 µS
Membrane time constant   τ_mem     30 ms
Threshold potential      v_θ       15 mV
Reset potential          v_reset   13.5 mV
Refractory period        t_refr    3 ms

Note: The parameters are identical to Maass et al. (2002).
The readout mechanism is composed of 11 neuronal groups consisting of 36 integrate-and-fire neurons with a membrane time constant of 30 ms (see Figure 3 and the appendix). All readout neurons receive input from the liquid cells and are trained to classify a temporal activity pattern at a specific point in time after stimulus onset, tL . Thus, training occurs only once during the presentation of an input. A readout cell fires if and only if its membrane potential is above threshold at t = tL ; that is, the readout cell is not allowed to fire at earlier times. This readout setup is comparable to the original proposal of the liquid state machine (Maass et al., 2002). Each readout group represents a response class, and the readout group with the highest number of firing cells is the selected response class. Input classes are mapped to response classes by changing the synapses projecting from the
Table 2: Simulation Parameters of the Synapses Connecting the Liquid Cells.

                                                   Value
Name                                     Symbol    EE        EI        IE        II
Average length of connections            λ         2 (independent of neuron type)
Maximal connection probability           C         0.4       0.2       0.5       0.1
Postsynaptic current time constant       τ_syn     3 ms      3 ms      6 ms      6 ms
Synaptic efficacy (weight)               w_liq     20 nA     40 nA     19 nA     19 nA
Utilization of synaptic efficacy         U         0.5       0.05      0.25      0.32
Recovery from depression time constant   τ_rec     1.1 s     0.125 s   0.7 s     0.144 s
Facilitation time constant               τ_fac     0.05 s    1.2 s     0.02 s    0.06 s

Notes: The neuron type is abbreviated with E for excitatory and I for inhibitory neurons. The values of w_liq, U, τ_rec, and τ_fac are taken from a gaussian distribution whose mean values are given in the table. The standard deviation of the distribution of the synaptic efficacy is equal to the mean value, and it is half of the mean value for the last three parameters. The parameters are identical to Maass et al. (2002).
liquid onto the readout groups. A supervised learning rule changes these synaptic weights only when the selected response class is incorrect (see the appendix). In this case, the weights of the synapses to firing cells of the incorrect response class are weakened, whereas those to the inactive cells of the correct response class are strengthened. As a result, the firing probability of cells in the former group, given this input, is reduced while that of the latter is increased. The synapses evolve according to a simplified version of the learning rule proposed in Maass et al. (2002) and Auer, Burgsteiner, and Maass (2001), the main difference being that the clear margin term has been ignored. (Control experiments have shown that this had no impact on the performance.) The 1000 stimulus samples of each class are divided into a training and test set of 500 samples each. The simulation process is split into two stages. In the first stage, the synaptic weights are updated while all training samples are presented in a completely random order until the training process converges. In the second stage, the training and test performance of the network is assessed. Again, the sequence of the samples is random, and each sample is presented only once. In both stages, the samples are presented as a continuous sequence of temporal activity patterns where each stimulus is started exactly after the preceding one. Regarding the initialization of the network, any method used can reset either the neurons (membrane potential) or the synapses (synaptic utilization and fraction of available synaptic efficacy), or both. A reset of any of those components of the network can be deterministic or random. 
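The winner-take-all readout and the error-driven weight update described above can be sketched with rate-like units. This is a deliberately simplified sketch: the threshold, learning rate, and random liquid state are illustrative, and the integrate-and-fire readout cells with the margin-free rule of the appendix are reduced here to a perceptron-style update.

```python
import numpy as np

N_CLASSES, CELLS_PER_GROUP, N_LIQUID = 11, 36, 720
THRESHOLD, LEARNING_RATE = 1.0, 0.05      # illustrative values
rng = np.random.default_rng(1)
# weights[c] projects the liquid state onto the 36 cells of readout group c
weights = rng.normal(0.0, 0.1, size=(N_CLASSES, CELLS_PER_GROUP, N_LIQUID))

def readout(liquid_state):
    """A cell fires iff its drive exceeds threshold at t = t_L; the group
    with the most firing cells is the selected response class."""
    firing = (weights @ liquid_state) > THRESHOLD
    return firing, int(firing.sum(axis=1).argmax())

def train_step(liquid_state, target_class):
    """Update only on errors: weaken synapses onto firing cells of the wrong
    winner, strengthen synapses onto silent cells of the correct class."""
    firing, winner = readout(liquid_state)
    if winner != target_class:
        weights[winner][firing[winner]] -= LEARNING_RATE * liquid_state
        weights[target_class][~firing[target_class]] += LEARNING_RATE * liquid_state

state = rng.random(N_LIQUID)              # stand-in for a liquid state vector
for _ in range(20):                       # error-driven training on one sample
    train_step(state, target_class=3)
selected = readout(state)[1]
```

After a few error-driven steps on this single sample, the target group accumulates the most firing cells and wins the vote, mirroring the mapping of input classes to response classes described in the text.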
Combining some of these constraints, we apply five different methods to initialize the network at stimulus onset: entire-hard-reset, partial-hard-reset, entire-random-reset (control condition), partial-random-reset (as used in Maass et al., 2002; Maass, Natschläger, & Markram, 2003), and no-reset (see Table 3 for
Table 3: Initialization Values of the Liquid Variables: Membrane Potential, Synaptic Utilization, and Fraction of Available Synaptic Efficacy.

Reset Method           Membrane Potential   Synaptic Utilization   Fraction of Available Synaptic Efficacy
Entire-hard-reset      13.5 mV              U                      1
Partial-hard-reset     13.5 mV              –                      –
Entire-random-reset    [13.5 mV, 15 mV]     [0, U]                 [0, 1]
Partial-random-reset   [13.5 mV, 15 mV]     –                      –
No-reset               –                    –                      –

Notes: Five different methods are used to initialize these variables. The symbol [ , ] denotes initialization values drawn from a uniform distribution within the given interval.
the corresponding initialization values). Whereas only the neurons are initialized by means of the partial reset, the entire reset initializes both the neurons and the synapses. The initialization values are deterministic with the hard-reset methods and random with the random-reset methods. The random initialization is used to approximate the history of past inputs; the validity of this approximation will be checked below. Finally, the network is not reset in case of the no-reset method.

2.3 Liquid State and Macroscopic Liquid Properties. The state of the network is formally defined as follows: Let $z(t)$ be a time-dependent vector that represents the active cells at time $t$ in the network with a 1 and all inactive cells with a 0. We call $z \in \mathbb{R}^p$ the liquid output vector (with $p$ the number of liquid cells). The liquid state vector $\tilde{z}$ (usually called simply the liquid state) is defined as the component-wise low-pass-filtered liquid output vector using a time constant of $\tau = 30$ ms. We introduce three macroscopic liquid properties. In all of the following equations, $\tilde{z}_{ijk} \in \mathbb{R}^p$ denotes the liquid state after the $k$th presentation of sample $j$ from class $i$, where $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $k = 1, \ldots, r$, with $n$ the number of classes, $m$ the number of samples per class, $r$ the number of presentations of the same sample, and $p$ the number of liquid cells. For simplicity, we omit the time dependence in the following definitions. We compute a principal component analysis by considering all the vectors $\tilde{z}_{ijk}$ as $n \cdot m \cdot r$ realizations of a $p$-dimensional random vector. The macroscopic liquid properties are defined on the basis of the new coordinates $\hat{z}_{ijk}$ of the liquid state vectors in the principal component system. The center of class $i$, $c_i$, and the center of sample $j$ from class $i$, $s_{ij}$, are defined as the average values of the appropriate liquid state vectors:
$$c_i = \frac{1}{mr} \sum_{j=1}^{m} \sum_{k=1}^{r} \hat{z}_{ijk},$$

$$s_{ij} = \frac{1}{r} \sum_{k=1}^{r} \hat{z}_{ijk}.$$
Since these vectors are defined as average values over several presentations of the same sample, the liquid noise (see below) is not contained in these values if the number of repetitions $r$ is large enough. The liquid-noise $\sigma^{liq}$ is defined as the average value of the vectorial standard deviation (the standard deviation is computed for each component separately) over all presentations of a sample,

$$\sigma^{liq} = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \operatorname{std}_k(\hat{z}_{ijk}),$$

and can be interpreted as the average scattering of a sample around its center $s_{ij}$. The average distance vector between the centers of all classes, the liquid-class-distance $d_C^{liq}$, is defined as

$$d_C^{liq} = \frac{2}{n(n-1)} \sum_{i<j}^{1,\ldots,n} |c_i - c_j|,$$

where $|\cdot|$ is the (component-wise) absolute value. The liquid-sample-distance $d_S^{liq}$ is defined as the vectorial standard deviation of the sample centers of one class, averaged over all classes,

$$d_S^{liq} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{std}_j(s_{ij}),$$
where the subscript $S$ stands for sample.

3 Results

In the first experiment, we investigate the performance of the liquid state machine in classifying the temporal activity patterns by initializing the network according to the control condition (entire-random-reset; see section 2). The readout cell groups are trained to classify a sample at 100 ms after stimulus onset. We run 10 simulations, each using the complete training and testing data sets. Each simulation comprises a network in which the synaptic arborization and the parameters controlling the synaptic dynamics are randomly initialized (see Table 2). We find that after training, 60.6 ± 2% (mean ± standard deviation) of the training samples and 60.2 ± 2% of the test samples are classified correctly. The corresponding values of the mutual information
between stimulus and response class are 1.725 ± 0.056 bits using the training data set and 1.696 ± 0.053 bits using the test data set. The maximum value of the mutual information is log2(11) ≈ 3.459 bits. Thus, although the generalization capability of the network is excellent (the performances on the test and training sets are virtually identical), it achieves only a moderate overall performance compared with a statistical clustering of the temporal activity patterns, which yields 83.8% correct classifications (Wyss, König, & Verschure, 2003). In order to elucidate the mechanisms responsible for this moderate performance, we take a closer look at how the temporal activity patterns are represented in the network. Since we always triggered the training of the readout cell groups at 100 ms after stimulus onset, we are particularly interested in the liquid state (see section 2) at this point. Because the liquid state is high-dimensional, we employ a principal component analysis to investigate the representation of the temporal activity patterns in the network. The first 50 samples of each class are presented 20 times to the network, which results in 20 repeats per sample × 50 samples per class × 11 classes = 11,000 liquid state vectors. Each of these 720-dimensional vectors is considered as a realization of 720 random variables, and a principal component analysis is applied to these data. Based on the new coordinates of the liquid states in the principal component system, we compute the three macroscopic liquid properties: the liquid-class-distance, the liquid-sample-distance, and the liquid-noise (see section 2). For the projection of the liquid states onto each principal component, these three properties describe the average distance between the centers of the classes, the average distance between the centers of the samples of one class, and the average variability of the liquid states of one particular sample.
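The computation of the three macroscopic liquid properties on the principal component coordinates can be sketched as follows; small random arrays stand in for the recorded liquid states, and the PCA is done via an SVD of the centered data.

```python
import numpy as np

def macroscopic_properties(z):
    """Liquid-class-distance, liquid-sample-distance, and liquid-noise,
    one value per principal component.

    z: liquid states of shape (n, m, r, p) = (classes, samples per class,
       repeats per sample, liquid cells).
    """
    n, m, r, p = z.shape
    flat = z.reshape(-1, p)
    flat = flat - flat.mean(axis=0)
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)  # PCA via SVD
    zhat = (flat @ Vt.T).reshape(n, m, r, -1)            # PC coordinates

    c = zhat.mean(axis=(1, 2))                  # class centers c_i
    s = zhat.mean(axis=2)                       # sample centers s_ij
    noise = zhat.std(axis=2).mean(axis=(0, 1))  # liquid-noise
    i, j = np.triu_indices(n, k=1)              # all class pairs i < j
    d_class = np.abs(c[i] - c[j]).mean(axis=0)  # liquid-class-distance
    d_sample = s.std(axis=1).mean(axis=0)       # liquid-sample-distance
    return d_class, d_sample, noise

rng = np.random.default_rng(2)
z = rng.normal(size=(11, 10, 5, 60))  # toy stand-in for recorded liquid states
d_class, d_sample, noise = macroscopic_properties(z)
```

Averaging over the upper-triangular class pairs is exactly the 2/(n(n−1)) normalization of the liquid-class-distance definition, since there are n(n−1)/2 such pairs.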
Thus, by means of the liquid-sample-distance and the liquid-noise, the extent of all samples of one class along each principal axis can be assessed. This extent is bounded below by the average distance between the samples of one class (the liquid-sample-distance) and above by the sum of this distance and the liquid-noise, the average variability of the liquid states of one sample. Hence, the projection of the liquid states of different classes onto a principal component is separated if the corresponding liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise. Conversely, the projection of liquid states onto a particular principal component overlaps if the corresponding liquid-sample-distance is greater than the liquid-class-distance. On the basis of this interpretation of the macroscopic liquid properties, we are able to quantitatively assess the separation among the classes. The above analysis of the liquid states is summarized in Figure 4. First, we find that the liquid-noise exceeds the liquid-class- and the liquid-sample-distance for dimensions greater than or equal to 26. Thus, there is little stimulus- or class-specific information but mostly noise in these components. Second, the liquid-sample-distance is greater than the liquid-class-distance for all dimensions greater than or equal to 5; the liquid states of
Figure 4: Liquid state distances versus principal component dimensions. The network is initialized using the entire-random-reset method. The solid line shows the liquid-class-distance, the dashed line the liquid-sample-distance, the dotted line the liquid-noise, and the dash-dotted line the sum of the liquid-sample-distance and the liquid-noise. For dimensions greater than 26, the liquid-noise is greater than the liquid-sample-distance, which is greater than the liquid-class-distance. For dimensions 1 to 4, the liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise.
different classes overlap for these dimensions. Third, for dimensions less than or equal to 4, the liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise. As a result of this, the liquid states of different classes have little overlap for these dimensions. Fourth, as a consequence of the third point, the liquid-class-distance is also greater than the liquid-sample-distance for dimensions between 1 and 4. Given these macroscopic liquid properties, we can conclude, from the third observation, that the projection of the liquid states onto principal components 1 to 4 has little overlap. Therefore, class-specific information can be found only in the first four principal components. This finding is somewhat surprising, given the dimensionality of the temporal population code, which is of the order of 20 (Wyss, Verschure, & König, 2003), and considering that the liquid states were originally proposed to provide a very high-dimensional representation of the input (Maass et al., 2002). Finally, it follows from the second observation that the liquid states projected onto principal components greater than or equal to 5 do not carry class-specific information, with or without liquid-noise. Therefore, the liquid state machine appears to encode the stimuli into a low-dimensional representation.
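The per-dimension separation test used in this analysis amounts to the following check; the numerical values here are made up purely for illustration.

```python
import numpy as np

def separated_dimensions(d_class, d_sample, noise):
    """Principal components on which the classes have little overlap: the
    liquid-class-distance must exceed the sum of the liquid-sample-distance
    and the liquid-noise."""
    return np.flatnonzero(d_class > d_sample + noise)

# Illustrative values: only dimension 0 satisfies the criterion.
d_class = np.array([5.0, 1.0, 0.5])
d_sample = np.array([1.0, 1.5, 1.0])
noise = np.array([1.0, 1.0, 1.0])
dims = separated_dimensions(d_class, d_sample, noise)
```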
Taking the above observations into account, how can the liquid state machine be modified in order to increase its ability to separate the stimulus classes? Since the liquid state machine does not a priori contain class-specific information (that is, class-specific features cannot be detected), the liquid-class- and the liquid-sample-distance cannot be changed independently. Thus, it is not possible to selectively increase the liquid-class-distance while decreasing the liquid-sample-distance. However, as a result of the entire-random-reset method used to initialize the network, the liquid-noise is independent of the stimuli and could be eliminated by resetting the liquid variables to predefined values at stimulus onset. According to the macroscopic liquid properties, this would lead to an increased separation between the classes, which improves the classification performance of the liquid state machine. We examine the classification performance of the liquid state machine using four reset methods: the entire-hard-, partial-hard-, partial-random-, and no-reset methods (see section 2 and Table 3). We use the same experimental protocol as above, and the results are summarized in Figure 5. First, initializing the network with the entire-hard-reset method yields a better performance than with the entire-random-reset method, as predicted above. Quantitatively, this amounts to approximately a 10% increase in performance. Second, comparing the performance of the entire-hard-/entire-random-reset methods to their partial counterparts, we find that initializing the network with the partial-hard-/partial-random-reset method results in a higher performance (see Figure 5). Employing a two-way ANOVA on the classification performance using the results of the testing data set, we find for α = 0.01, p_entire/partial ≈ 0.0002, p_hard/random ≈ 4 · 10^−15, and p_interaction ≈ 0.11.
Thus, both the entire/partial and the hard/random distinctions result in significant differences in the average classification performance and the mutual information. The only difference between the partial and the entire reset is that the former does not reset the synapses (see Table 3); that is, the synaptic utilization and the available fraction of synaptic efficacy are never reset. Thus, this difference has to account for the observed improvement of the classification performance. Third, using the no-reset method, the network yields a performance that is significantly lower than that of a network initialized with any other reset method (for instance, comparing entire-random-reset and no-reset, t-test of the mutual information on the testing data set, α = 0.01, p ≈ 2 · 10^−16). Thus, resetting the network is required to achieve a satisfying classification performance. We investigate in more detail the performance difference yielded by the entire and the partial reset methods. As we found above, entire and partial reset render approximately the same performance. Since the only difference between them is that the synapses are not reset in case of the partial reset method, this suggests that synaptic short-term plasticity has no effect on the performance of the network. Consequently, the decoding of the temporal activity pattern would be a result of the membrane time constant only.
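The role of short-term depression can be illustrated with the dynamic synapse model of Markram et al. (1998). The following is our own sketch of the commonly used discrete update equations (the exact formulation used in the simulations is given in the appendix of Maass et al., 2002), with the EE parameters of Table 2 as defaults. It shows why shrinking τ_rec toward 1 ms removes depression: the fraction of available efficacy R then recovers almost fully between spikes.

```python
import math

def synaptic_responses(isis, U=0.5, tau_rec=1.1, tau_fac=0.05, w=1.0):
    """Amplitudes w*u*R of a dynamic synapse driven at the given inter-spike
    intervals (in seconds). Between spikes, R recovers toward 1 with tau_rec
    and u decays toward U with tau_fac; each spike uses a fraction u of the
    currently available efficacy R.
    """
    u, R = U, 1.0
    amps = [w * u * R]
    for isi in isis:
        R = 1.0 + (R * (1.0 - u) - 1.0) * math.exp(-isi / tau_rec)
        u = U + u * (1.0 - U) * math.exp(-isi / tau_fac)
        amps.append(w * u * R)
    return amps

train = [0.025] * 9                                     # 40 Hz spike train
depressed = synaptic_responses(train)                   # EE parameters: depresses
recovered = synaptic_responses(train, tau_rec=0.001)    # tau_rec = 1 ms
```

With the EE parameters the amplitudes decay over the train, whereas with τ_rec = 1 ms the response no longer depresses; in the full network this unchecked excitation is what saturates the activity.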
Figure 5: Evaluation of different reset mechanisms. (a) Classification performance and (b) mutual information of the readout cell groups trained 100 ms after stimulus onset with the input classes. Five different initialization methods are used (see section 2). Gray and white bars show the performance for the training and test data set, respectively, and 10 simulations are run per reset condition.
Figure 6: Liquid state distances versus principal component dimensions. The network is initialized using the no-reset method. The solid line shows the liquid-class-distance, the dashed line the liquid-sample-distance, the dotted line the liquid-noise, and the dash-dotted line the sum of the liquid-sample-distance and the liquid-noise. The liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise only for dimensions 2 and 6. All other dimensions are dominated by the liquid-noise.
Hence, we effectively remove synaptic short-term depression by setting the recovery time constant, τrec , for all synapse types to 1 ms. This results in a training and testing performance of 10.0 ± 2.8%, which is almost chance level. A further analysis of this very low performance reveals that it is caused by the saturation of activity within the network. Thus, synaptic short-term depression is required for the proper operation of the network as it balances the amount of excitation. Since a reset of the network has a large effect on its classification performance, we again explore the representation of the temporal activity patterns in the network in order to explain this effect quantitatively (see Figure 6). However, here we use the no-reset method to record the liquid states at 100 ms after stimulus onset. We apply the same analysis as before to plot the three macroscopic liquid properties versus the principal components (see section 2 and Figure 4). This analysis shows that the liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise only for dimensions 2 and 6. As this difference is only marginal for dimension 6, virtually only the projection of the liquid states onto principal component 2 have a small overlap. Hence, only the second principal component carries class-specific information. Comparing this result with the previous analysis
of the liquid states obtained using the entire-random-reset method (see Figure 4), we find that not resetting the network results in an enormous increase in the overlap of the liquid states between different classes. Thus, in case of the no-reset method, there is a rather critical dependence between the initial state of the network at stimulus onset and the liquid state recorded after stimulus presentation. In all previous experiments, we trained the readout cell groups exactly at 100 ms after stimulus onset. Since it was shown that the information encoded in the temporal population code increases rapidly over time and already 60% of the total information is available after 25 ms (Wyss, König, & Verschure, 2003), it is of interest to investigate how fast the classification performance of the liquid state machine rises. Moreover, it is unclear from the previous experiments whether the classification performance is better at earlier times after stimulus onset. In the following experiment, this will be examined by training the readout cell groups at one particular time between 2 and 100 ms with respect to stimulus onset. The network is initialized using the entire-hard-reset, the entire-random-reset, or the no-reset method. For each fixed training and testing time and initialization method, 10 simulations are performed (as in previous experiments). The results depicted in Figure 7 show that up to 26 ms after stimulus onset, the classification performance stays at chance level (0.09) and 0 bits of mutual information for both training and testing. Thus, the first two bursts of the temporal activity pattern do not give rise to class-specific information in the network. The best performance is achieved by initializing the network with the entire-hard-reset method, whereas the no-reset method again results in the lowest classification performance.
As already shown in Wyss, König, & Verschure (2003), here we also find a rapid increase in the classification performance (see Figure 7). The performance does not increase after 55 ms but rather fluctuates at a maximum level. Consequently, processing longer temporal activity patterns does not augment the mutual information or the classification performance.

2094

P. Knüsel, R. Wyss, P. König, and P. Verschure

Figure 7: Performance of the liquid state machine at varying temporal length of the temporal population code using different reset methods. (a, b) Classification performance and (c, d) mutual information during training and testing as a function of the time of training and testing (chance level in a and b is 1/11 ≈ 0.091). Up to 50 ms, the performance shows an oscillation, which results from the strong onset response in the temporal population code.

4 Discussion

In this study, we investigated the constraints on the continuous-time processing of temporally encoded information using two complementary networks: the encoding network compresses its spatial inputs into a temporal code by virtue of highly structured lateral connections (Wyss, König, & Verschure, 2003), while the decoding network decompresses its input into a high-dimensional space by virtue of unstructured lateral connections (Maass et al., 2002). Our analysis of the decoding network showed that it did not sufficiently separate the different stimulus classes. We investigated different strategies to reset the decoding network before stimulus presentation. While resetting the network leads to a maximal performance of 75.2%, the no-reset method performs dramatically below the other methods investigated: 35.4 ± 1.9% correct classifications. A quantitative analysis showed that this performance difference is caused by overlaps of the classes in the high-dimensional space of the decoding network. Thus, in order to decode and classify temporal activity patterns with the liquid state machine successfully, the latter needs to be clocked by the presentation of a new stimulus as opposed to using a true continuous mode of operation.

The liquid state machine has been successfully applied to the classification of a variety of temporal patterns (Maass et al., 2002; Maass & Markram, 2003). In this study, we investigate yet another type of stimulus: a temporal population code. While our input stimuli are continuously varying, most other studies consider spike trains. Given the maximal performance of 75.2% correct classifications of the test stimuli, however, we believe that, in principle, the liquid state machine is capable of processing this kind of temporal activity pattern. Since the liquid state machine was put forward as a general and
Decoding a Temporal Population Code
2095
biologically based model for computations on continuous temporal inputs, it should be able to handle these kinds of stimuli.

The liquid state machine was originally proposed for processing continuous streams of temporal information. This is a very difficult task, as any decoder of temporal information has to maintain an internal state of previously applied inputs. However, continuous streams of information can often be divided into short sequences, that is, temporally confined stimuli. Knowledge of the onset of a new stimulus would certainly be beneficial for such a network, as the single stimuli could be processed separately and the network could be specifically initialized and reset prior to their presentation. Thus, as opposed to a regime where a continuous stream of information is processed, interference between stimuli in the internal state of the network could be avoided, and the network should therefore perform better. However, while the performance difference between continuous and stimulus-triggered processing of temporal information is intuitive, it is unclear how large its effect on the performance and the internal representation of the information in the decoding network would be. Moreover, in previous work on liquid state machines, this difference was not assessed (Maass et al., 2002, 2003; Maass & Markram, 2003; Legenstein, Markram, & Maass, 2003).

Here, we quantitatively investigated this difference in the context of the temporal population code, where the input is not a continuous stream but is composed of temporally confined stimuli. The initial hypothesis was that the decoding network can process a continuous stream of temporal activity patterns generated by the encoding network. We found, however, that for the decoding network to perform reasonably, it required a reset of its internal states at stimulus onset. The resulting percentage of correctly classified stimuli practically doubled for both hard and random reset.
A mathematical analysis revealed a critical dependence between the initial state of the network at stimulus onset and its internal state after stimulus presentation. Whereas this dependence is fairly low for any of the reset methods, not resetting the network drastically increases it, which results in much larger overlaps of the internal states between different stimulus classes. Our analysis suggests that although the mixing of previously temporally segregated information is of central importance for the proper operation of the liquid state machine, the mixing of information across stimuli leads to an inevitable degradation of its classification performance and of the internal representation of the stimuli. In the original studies, the decoding network was actually initialized with a method that is similar to the partial-random-reset method used here (Maass et al., 2002, 2003). This raises the question of whether the liquid state machine operated in a true continuous mode in the cited studies.

In conclusion, our results suggest that a reset mechanism is an essential component of the proposed encoding-decoding system. Any reset system consists of two components: a signal that mediates the onset of a stimulus and a mechanism, triggered by this signal, that allows resetting the components of the processing substrate. Potential candidate
neural mechanisms for signaling a new stimulus are the thalamic suppression found during saccades (Ramcharan, Gnadt, & Sherman, 2001) or the hyperpolarization observed in the projection neurons of the antennal lobe of moths coinciding with the presentation of a new odor (Heinbockel, Christensen, & Hildebrand, 1999). The temporal population code naturally generates such a signal, characterized by a pronounced first burst that can easily be detected and used to trigger a reset mechanism.

Regarding reset mechanisms, we have distinguished several approaches that differ mainly in how they could be implemented by the brain. While the hard-reset method could be implemented by strongly inhibiting all cells in the processing substrate, implementing the random-reset method appears difficult. It could possibly be approximated by driving the processing substrate into a transient chaotic state, which could be achieved by selectively increasing and decreasing the synaptic transmission strength in a short time window after stimulus onset. This approach has similarities with simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) as well as with the annealing mechanism presented in Verschure (1991), where chaotic behavior in a neural network is attained by adaptively changing the learning rate. Comparing the classification performance with and without resetting the synapses (entire versus partial reset) reveals that the latter outperforms the former. Thus, not resetting the synapses is an advantage rather than a shortcoming of the proposed mechanisms. Furthermore, these considerations suggest that such a reset system could be implemented in a biologically plausible way.

From a general point of view, not only the liquid state machine but any decoder of continuous streams of temporal information has to maintain previously applied inputs in an internal state. Thus, inputs applied at temporally segregated times are mixed into a joint internal state.
Our results demonstrate that in the absence of a stimulus-locked reset of this internal state, the effect of mixing strongly degrades the specificity of this internal state, which results in a significant decrease of the network's performance. Thus, since the liquid state machine is seen as a model of cortical microcircuits, this raises the question of how these circuits solve the problem of the mixing of temporally segregated information. On the basis of our results, we predict that it is solved by employing stimulus-onset-specific reset signals that minimize the mixing of information from past and present stimuli. Although some evidence exists that could support this hypothesis, further experimental work is required to identify whether the brain makes use of a form of temporal segmentation to divide a continuous input stream into smaller "chunks" that are processed separately.

Appendix: Details of the Implementation

We consider the time course of a temporal activity pattern of the encoding network, Iinp(t), as a synaptic input current to the decoding network. This
current is arbitrarily normalized to a maximal value of 1 nA. The dimensionless weights of the input synapses, winp, are chosen from a gaussian distribution with mean value and standard deviation of 90 that is truncated at zero to avoid negative values. As only 30% of the liquid cells receive input from the temporal activity pattern (see section 2), a random subset of 70% of the input weights is set to zero.

The recurrent synapses in the liquid show short-term plasticity (Markram et al., 1998). Let Δt be the time between the (n − 1)th and the nth spike in a spike train terminating on a synapse; then un, the running value of the utilization of synaptic efficacy, U, follows

u_n = u_{n-1} e^{-\Delta t/\tau_{fac}} + U \left( 1 - u_{n-1} e^{-\Delta t/\tau_{fac}} \right), \qquad (A.1)

where τfac is the facilitation time constant. The available synaptic efficacy, Rn, is updated according to
R_n = R_{n-1} \left( 1 - u_n \right) e^{-\Delta t/\tau_{rec}} + 1 - e^{-\Delta t/\tau_{rec}}, \qquad (A.2)
where τrec is the recovery-from-depression time constant. The peak synaptic current, Îsyn, is defined as

\hat{I}_{syn} = w_{liq} R_n u_n, \qquad (A.3)
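The spike-to-spike updates of equations A.1 to A.3 can be sketched as follows (a minimal sketch; function and variable names are ours, and the actual values of U, τfac, τrec, and wliq are listed in Table 2, which is not reproduced here):

```python
import math

def stp_update(u_prev, R_prev, dt_isi, U, tau_fac, tau_rec):
    """One spike-to-spike update of short-term plasticity (eqs. A.1-A.2).

    u_prev, R_prev: utilization and available efficacy after the previous spike
    dt_isi: interval between the previous and the current spike (ms)
    """
    decay_f = math.exp(-dt_isi / tau_fac)
    u = u_prev * decay_f + U * (1.0 - u_prev * decay_f)       # eq. A.1
    decay_r = math.exp(-dt_isi / tau_rec)
    R = R_prev * (1.0 - u) * decay_r + 1.0 - decay_r          # eq. A.2
    return u, R

def peak_current(w_liq, u, R):
    """Peak synaptic current of the updated synapse (eq. A.3)."""
    return w_liq * R * u
```

For long interspike intervals, u relaxes back to U and R recovers to 1; rapid firing depresses R and facilitates u, which is the balancing effect of depression discussed above.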
where wliq is the weight of the synapses connecting the liquid cells. The excitatory and inhibitory postsynaptic currents, Isyn,exc(t) and Isyn,inh(t), are given by an alpha function,

I_{syn,x}(t) = \hat{I}_{syn} \, \frac{t\,e}{\tau_{syn,x}} \, e^{-t/\tau_{syn,x}}, \qquad (A.4)
where the subscript x stands for exc or inh, Îsyn is the peak synaptic current (see equation A.3), and τsyn,x is the time constant of the postsynaptic potential. Finally, the connection probability, p(a, b), of two liquid cells located at the integer points a and b of a cubic lattice follows a gaussian distribution,

p(a, b) = C \cdot e^{-\left( |a-b| / \lambda \right)^2}, \qquad (A.5)
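Equations A.4 and A.5 can be sketched as follows (our names; `C`, `lam`, and `tau_syn` stand in for the constants of Table 2, which is not reproduced here):

```python
import math
import random

def alpha_psc(t, I_peak, tau_syn):
    """Alpha-function postsynaptic current (eq. A.4).
    Peaks at exactly I_peak when t == tau_syn, since (e*t/tau) * exp(-t/tau)
    attains its maximum of 1 there."""
    return I_peak * (math.e * t / tau_syn) * math.exp(-t / tau_syn)

def connect(a, b, C, lam):
    """Bernoulli connectivity draw with a gaussian distance profile (eq. A.5).
    a, b: integer 3-tuples on the cubic lattice."""
    d = math.dist(a, b)  # Euclidean distance |a - b|
    return random.random() < C * math.exp(-(d / lam) ** 2)
```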
where |·| is the Euclidean norm in R^3 and C and λ are constants. The values of all the synaptic parameters listed above are given in Table 2.

The liquid cells are simulated as leaky integrate-and-fire neurons. The membrane potential, v(t), is updated according to

v(t + dt) = v(t) + \frac{dt}{\tau_{mem}\, g_{leak}} \left( I_{bg} + w_{inp} I_{inp}(t) + I_{syn,exc}(t) - I_{syn,inh}(t) - g_{leak}\, v(t) \right), \qquad (A.6)
where dt is the simulation time step, τmem the membrane time constant, gleak the leak conductance, Ibg the background current, winp Iinp(t) the synaptic input current from the temporal activity pattern, and Isyn,exc(t) and Isyn,inh(t) the synaptic currents (see equation A.4). If v(t) > vθ, that is, if the membrane potential exceeds the threshold potential, a spike is generated, v(t) is set to the reset potential, vreset, and the neuron is quiescent until the refractory period of duration trefr has elapsed. The values of the parameters listed above are given in Table 1.

The readout neurons are simulated as leaky integrate-and-fire neurons. Let i = 1, ..., I be the index of a readout group (I = 11), j = 1, ..., J the index of a readout neuron in group i (J = 36), and k = 1, ..., K the index of a liquid neuron (K = 720). Then the membrane potential of readout neuron j of readout group i, rij(t), follows

r_{ij}(t + dt) = r_{ij}(t) + \frac{dt}{\tau_{mem,R}} \left( r_{ij,syn}(t) - r_{ij}(t) \right), \qquad (A.7)
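A single Euler step of the liquid-cell dynamics (eq. A.6) can be sketched as follows. The parameter values are placeholders, since Table 1 is not reproduced in this excerpt, and the refractory bookkeeping is left to the caller:

```python
def lif_step(v, I_bg, I_inp, I_exc, I_inh, dt=0.1, tau_mem=30.0,
             g_leak=1.0, w_inp=90.0, v_theta=15.0, v_reset=13.5):
    """One Euler step of the liquid-cell membrane potential (eq. A.6).
    All parameter values here are placeholder assumptions, not Table 1.
    Returns (new_v, spiked); the caller should hold the cell for t_refr
    after a spike."""
    dv = dt / (tau_mem * g_leak) * (
        I_bg + w_inp * I_inp + I_exc - I_inh - g_leak * v)
    v = v + dv
    if v > v_theta:
        return v_reset, True   # spike: reset the membrane potential
    return v, False
```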
where dt is the simulation time step, τmem,R = 30 ms the readout-neuron membrane time constant, and rij,syn(t) the postsynaptic potential, given by

r_{ij,syn}(t) = s \sum_{k=1}^{K} g_{ijk} a_k(t). \qquad (A.8)
s = 0.03 is an arbitrary, constant scaling factor, gijk are the synaptic weights from liquid cell k to readout neuron j of readout group i, and ak(t) is the activity of liquid cell k, which is 1 if the liquid cell fired an action potential at time t and 0 otherwise. A readout cell may fire only if its membrane potential is above threshold, rθ = 20 mV, at t = tL, that is, rij(tL) > rθ; tL is a specific point in time after stimulus onset. After a spike, the readout cell membrane potential, rij, is reset to 0 mV and the readout cell response, qij, is set to 1 (qij is zero otherwise). The readout group response, qi, of readout group i is then

q_i = \sum_{j=1}^{J} q_{ij}. \qquad (A.9)
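The readout dynamics of equations A.7 to A.9 can be sketched as follows (our names; nested lists stand in for the weight and potential arrays):

```python
def readout_step(r, g, a, s=0.03, dt=0.1, tau_mem_R=30.0):
    """Euler step of all readout membrane potentials (eqs. A.7-A.8).

    r[i][j]:    potential of readout neuron j in group i (updated in place)
    g[i][j][k]: weight from liquid cell k to readout neuron j of group i
    a[k]:       liquid-cell activity (1 if cell k spiked, else 0)
    """
    for i in range(len(r)):
        for j in range(len(r[i])):
            r_syn = s * sum(g[i][j][k] * a[k] for k in range(len(a)))  # eq. A.8
            r[i][j] += dt / tau_mem_R * (r_syn - r[i][j])              # eq. A.7
    return r

def group_responses(r, r_theta=20.0):
    """Number of suprathreshold readout cells per group at t = t_L (eq. A.9)."""
    return [sum(1 for rij in row if rij > r_theta) for row in r]
```

The selected class is then the index of the largest entry of `group_responses(r)`, as in the learning rule below.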
A simplified version of the learning rule described in Maass et al. (2002) and Auer et al. (2001) is used to update the synaptic weights gijk. Let N be the index of the stimulus class (the correct response class) and M the index of the selected response class, that is, M = \arg\max_{i=1,...,I} q_i is the readout group with the highest number of activated readout cells. Two cases are distinguished: if N = M, that is, the selected response class is correct, the synaptic weights are not changed. If N ≠ M, then for all j = 1, ..., J and
k = 1, ..., K, the synaptic weights are updated according to the following rule:

g_{Mjk} \leftarrow g_{Mjk} +
\begin{cases}
\eta \left( -1 - g_{Mjk} \right) & \text{if } r_{Mj}(t_L) > r_\theta \text{ and } a_k(t_L) \neq 0 \\
0 & \text{else}
\end{cases} \qquad (A.10)

g_{Njk} \leftarrow g_{Njk} +
\begin{cases}
\eta \left( 1 - g_{Njk} \right) & \text{if } r_{Nj}(t_L) < r_\theta \text{ and } a_k(t_L) \neq 0 \\
0 & \text{else}
\end{cases} \qquad (A.11)
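The weight update of equations A.10 and A.11 can be sketched as follows. Names are ours, and we read the activity condition as ak(tL) ≠ 0, that is, only synapses from liquid cells active at tL are updated; the inequality is hard to make out in the source text:

```python
def update_weights(g, r, a, N, M, eta=0.01, r_theta=20.0):
    """Sketch of the readout learning rule (eqs. A.10-A.11).

    g[i][j][k]: weight from liquid cell k to readout neuron j of group i
    r[i][j]:    readout membrane potential at t = t_L
    a[k]:       liquid-cell activity at t_L (1 or 0)
    N, M:       correct and selected class indices
    """
    if N == M:                     # selected class is correct: no change
        return g
    J, K = len(g[0]), len(a)
    for j in range(J):
        for k in range(K):
            if a[k] != 0:
                if r[M][j] > r_theta:   # eq. A.10: weaken the wrong, firing group
                    g[M][j][k] += eta * (-1.0 - g[M][j][k])
                if r[N][j] < r_theta:   # eq. A.11: strengthen the correct, silent group
                    g[N][j][k] += eta * (1.0 - g[N][j][k])
    return g
```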
where η is a learning parameter. Thus, synapses to firing readout cells of the incorrect response class M are weakened (see equation A.10), whereas those to the inactive readout cells of the correct response class N are strengthened (see equation A.11).

References

Auer, P., Burgsteiner, H., & Maass, W. (2001). The p-delta learning rule for parallel perceptrons. Manuscript submitted for publication.
Buonomano, D. (2000). Decoding temporal information: A model based on short-term synaptic plasticity. Journal of Neuroscience, 20(3), 1129–1141.
Buonomano, D., & Merzenich, M. (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science, 267, 1028–1030.
Heinbockel, T., Christensen, T., & Hildebrand, J. (1999). Temporal tuning of odor responses in pheromone-responsive projection neurons in the brain of the sphinx moth Manduca sexta. Journal of Comparative Neurology, 409(1), 1–12.
Kirkpatrick, S., Gelatt, C., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kolen, J., & Kremer, S. (Eds.). (2001). A field guide to dynamical recurrent networks. New York: IEEE Press.
König, P., Engel, A., Roelfsema, P., & Singer, W. (1995). How precise is neuronal synchronization? Neural Computation, 7, 469–485.
Legenstein, R. A., Markram, H., & Maass, W. (2003). Input prediction and autonomous movement analysis in recurrent circuits of spiking neurons. Reviews in the Neurosciences, 14(1–2), 5–19.
Maass, W., & Markram, H. (2003). Temporal integration in recurrent microcircuits. In M. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed., pp. 1159–1163). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Maass, W., Natschläger, T., & Markram, H. (2003). A model for real-time computation in generic neural microcircuits. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 213–220). Cambridge, MA: MIT Press.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proceedings of the National Academy of Sciences, USA, 95, 5323–5328.
Mozer, M. (1994). Neural net architectures for temporal sequence processing. In A. Weigend & N. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 243–264). Reading, MA: Addison-Wesley.
Ramcharan, E., Gnadt, J., & Sherman, S. (2001). The effects of saccadic eye movements on the activity of geniculate relay neurons in the monkey. Visual Neuroscience, 18(2), 253–258.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Verschure, P. (1991). Chaos-based learning. Complex Systems, 5, 359–370.
Wyss, R., König, P., & Verschure, P. (2003). Invariant representations of visual patterns in a temporal population code. Proceedings of the National Academy of Sciences, USA, 100(1), 324–329.
Wyss, R., & Verschure, P. (in press). Bounded invariance and the formation of place fields. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Wyss, R., Verschure, P., & König, P. (2003). On the properties of a temporal population code. Reviews in the Neurosciences, 14(1–2), 21–33.

Received July 9, 2003; accepted February 26, 2004.
LETTER
Communicated by Bard Ermentrout
Minimal Models of Adapted Neuronal Response to In Vivo–Like Input Currents

Giancarlo La Camera
[email protected]

Alexander Rauch
[email protected]

Hans-R. Lüscher
[email protected]

Walter Senn
[email protected]

Stefano Fusi
[email protected]

Institute of Physiology, University of Bern, CH-3012 Bern, Switzerland
Rate models are often used to study the behavior of large networks of spiking neurons. Here we propose a procedure to derive rate models that take into account the fluctuations of the input current and firing-rate adaptation, two ubiquitous features of the central nervous system that have previously been overlooked in constructing rate models. The procedure is general and applies to any model of firing unit. As examples, we apply it to the leaky integrate-and-fire (IF) neuron, the leaky IF neuron with reversal potentials, and the quadratic IF neuron. Two mechanisms of adaptation are considered, one due to an afterhyperpolarization current and the other to an adapting threshold for spike emission. The parameters of these simple models can be tuned to match experimental data obtained from neocortical pyramidal neurons. Finally, we show how the stationary model can be used to predict the time-varying activity of a large population of adapting neurons.

Neural Computation 16, 2101–2124 (2004)
© 2004 Massachusetts Institute of Technology

1 Introduction

Rate models are often used to investigate the collective behavior of assemblies of cortical neurons. One early and seminal example was given by Knight (1972a), who described the difference between the instantaneous firing rate of a neuron and the instantaneous rate of a homogeneous population of neurons in response to a time-varying input. Ever since, more refined analyses have been developed, some making use of the so-called population density approach (see, e.g., Abbott & van Vreeswijk, 1993; Treves, 1993; Fusi & Mattia, 1999; Brunel & Hakim, 1999; Gerstner, 2000; Nykamp & Tranchina, 2000; Mattia & Del Giudice, 2002). The population activity
2102
G. La Camera, A. Rauch, H.-R. Lüscher, W. Senn, and S. Fusi
at time t is the fraction of neurons of the network emitting a spike at that time. In a homogeneous network of identical neurons and in stationary conditions, this population activity is the single-neuron current-frequency (f-I) curve (the response function), which is accessible experimentally and has been the subject of many theoretical (Knight, 1972a; Amit & Tsodyks, 1991, 1992; Amit & Brunel, 1997; Ermentrout, 1998; Brunel, 2000; Mattia & Del Giudice, 2002) and in vitro studies (see, e.g., Knight, 1972b; Stafstrom, Schwindt, & Crill, 1984; McCormick, Connors, Lighthall, & Prince, 1985; Poliakov, Powers, Sawczuk, & Binder, 1996; Powers, Sawczuk, Musick, & Binder, 1999; Chance, Abbott, & Reyes, 2002; Rauch, La Camera, Lüscher, Senn, & Fusi, 2003). The f-I curve is central to much theoretical work on networks of spiking neurons and is a powerful tool in data analysis, where the population activity can be estimated by the peristimulus time histogram but requires many repetitions of the recordings under the same conditions. A suitable rate model would avoid this cumbersome and time-consuming procedure (see, e.g., Fuhrmann, Markram, & Tsodyks, 2002).

When rate models are derived from detailed model neurons, the predictions of network behavior can be very accurate. Recently, Shriki, Hansel, and Sompolinsky (2003) fitted a rate model to the f-I curve of a conductance-based neuron with Hodgkin-Huxley sodium and potassium conductances and an A-current. The A-current was included to linearize the f-I curve, as observed in experiment. With their model, Shriki et al. (2003) can predict, in several case studies, the behavior of the network of neurons from which the rate model was derived. A similar approach was taken by Rauch et al. (2003), who fitted the response functions of white noise–driven IF neurons to the f-I curves of rat pyramidal neurons recorded in vitro. They found that firing-rate adaptation had to be included in the model in order to fit the data.
As opposed to the approach of Shriki et al. (2003), the contribution of the input fluctuations to the output rate was taken into account explicitly. Here we present a general scheme to derive adapting rate models in the presence of noise, of which the model used by Rauch et al. (2003) is a special case.

The general scheme that we introduce may be considered a generalization of a model due to Ermentrout (1998). This author introduced a general rate model in which firing-rate adaptation is due to a feedback current; that is, the adapted rate f is given as the self-consistent solution of the equation f = Φ(I − αf), where I is the applied current, Φ is the f-I curve in stationary conditions, and α is a parameter that quantifies the strength of adaptation. In this model, the effect of noise is not considered. For most purposes, the input to a cortical neuron can be decomposed into an average component, m, and a fluctuating gaussian component whose amplitude is quantified by its standard deviation, s, and the response of the neuron can be expressed as a function of these two parameters (see, e.g., Ricciardi, 1977; Amit & Tsodyks, 1992; Amit & Brunel, 1997; Destexhe, Rudolph, Fellous, & Sejnowski, 2001).
Adapting Rate Models
2103
We show that under such conditions, Ermentrout's model can be easily generalized, and the adapted response can be obtained as the fixed point of f = Φ(m − αf, s). For this result to hold, it is necessary that adaptation be slow compared to the timescale of the neural dynamics. In such a case, the feedback current αf is a slowly fluctuating variable and does not affect the value of s. Note that slow adaptation is a minimal requirement in the absence of noise (Ermentrout, 1998).

The proposed model is very general, but it can be used to full advantage only if the response function Φ is known analytically. This is the case for simple model neurons, for which the rate function can be calculated, or for more complex neurons whose f-I curve can be fitted by a suitable model function (e.g., Larkum, Senn, & Lüscher, in press).

In section 2, the adapting rate model is introduced and tested on several versions of IF neurons, whose rate functions are known and easily computable. The resulting rate models are checked against simulations of the full models from which they are derived, including the leaky IF (LIF) neuron with conductance-based synaptic inputs. Only slight variations are needed if a different mechanism of adaptation is considered, such as an adapting threshold for spike emission, which is dealt with in section 2.2. Evidence is also provided that the LIF neuron with an adapting threshold is able to fit the response functions of rat pyramidal neurons (see section 2.3), a result that parallels those of Rauch et al. (2003) obtained with an afterhyperpolarization current. Finally, in section 3, we show how the stationary response function can be used to predict the time-varying activity of a large population of adapting neurons.

2 Adapting Rate Models in the Presence of Noise

Firing-rate adaptation is a complex phenomenon characterized by several timescales and affected by different ion currents.
At least three phases of adaptation have been documented in many in vitro preparations: initial or one-interspike-interval (ISI) adaptation, which affects the first or at most the first two ISIs (Schwindt, O'Brien, & Crill, 1997); early adaptation, involving the first few seconds; and late adaptation, shown in response to prolonged stimulation (see Table 1 of Sawczuk, Powers, & Binder, 1997, for references and a list of possible mechanisms). Initial adaptation depends largely on a Ca2+-dependent K+ current (Sah, 1996; Powers et al., 1999), although other mechanisms may also play a role (Sawczuk et al., 1997). The early and late phases of adaptation are not well understood, and several mechanisms have been put forward: in neocortical neurons, it seems that Na+-dependent K+ currents (Schwindt, Spain, & Crill, 1989; Sanchez-Vives, Nowak, & McCormick, 2000) and slow inactivation of Na+ channels (Fleidervish, Friedman, & Gutnick, 1996) may play a major role; in motoneurons, evidence is accumulating for an interplay between slow inactivation of Na+ channels, which tends to decrease the firing rate,
and the slow activation or facilitation of a calcium current, which tends to increase the discharge rate (Sawczuk et al., 1997; Powers et al., 1999).

Despite the many mechanisms involved, we derive in the following a simple model for the adapted rate at steady state, which synthetically describes the overall phenomenology. The cellular mechanism is inspired by those mentioned above: upon emission of a spike, a quantity AN of a given ion species N (one can think of Ca2+ or Na+) enters the cell and modifies the intracellular ion concentration [N]i, which then decays exponentially to its resting value with a characteristic time τN. Its dynamics are described by

\frac{d[N]_i}{dt} = -\frac{[N]_i}{\tau_N} + A_N \sum_k \delta(t - t_k), \qquad (2.1)
where the sum is taken over all the spikes emitted by the neuron up to time t. As a consequence, an outward, N-dependent current Iahp = −gN [N]i, proportional to [N]i through the average peak conductance gN, results and causes a decrease in the discharge rate. Following the literature, we call this current the afterhyperpolarization (AHP) current (e.g., Sah, 1996).

Experimentally, the time constant τN of the dynamics underlying AHP summation (see equation 2.1) is found to range from tens of milliseconds (fast AHP) and hundreds of milliseconds to a few seconds (medium-duration AHP) to seconds (slow AHP) (see, e.g., Powers et al., 1999). Values often used in modeling studies are of the order of 100 ms (Wang, 1998; Ermentrout, 1998; Liu & Wang, 2001). In all cases, the N-dynamics is typically slower than the average ISI. This fact can be exploited to obtain the stationary, adapted output frequency by noticing that, from equation 2.1, the steady-state (ss) intracellular concentration would be proportional to the neuron's output frequency in a time window of a few τN:

[N]_{i,ss} = \tau_N A_N \Big\langle \sum_{t_k} \delta(t - t_k) \Big\rangle_T \approx \tau_N A_N f,

where the average is taken over a time window T ≫ τN. This causes a steady-state feedback current proportional to [N]i,ss, Iahp,ss = −gN [N]i,ss, which is in general a fluctuating variable, as the output spike train is. We assume the neuron to be driven by a fluctuating current that can be described as an average component m plus a fluctuating component with zero mean and variance s^2. If the current is gaussian distributed, as we always assume in this article, m and s^2 are sufficient to characterize the current. Since the [N]i dynamics are slow, Iahp,ss is only weakly fluctuating compared to the input current, so that only m is significantly affected. The total current felt by the neuron, spiking at frequency f, is then m − αf, with α = gN τN AN, plus the fluctuating component, which is unaffected by adaptation. This would cause the neuron to fire at a reduced frequency f1, which in turn causes the mean current to be affected as m − αf1, and so on. At equilibrium, the adapted frequency can be obtained numerically by solving
the self-consistent equation

f = \Phi(m - \alpha f, s), \qquad (2.2)

which requires only the knowledge of the response function Φ (the actual dynamics leading to equation 2.2 is dealt with in more detail in section 3). It is easy to prove that the adapted firing rate, fα, is always a stable fixed point of equation 2.2: the condition for stability reads

\Lambda \equiv \partial_f \Phi(m - \alpha f, s)\big|_{f_\alpha} < 1;

for equation 2.2, one has

\Lambda = -\alpha\, \partial_m \Phi(m - \alpha f, s)\big|_{f_\alpha} < 0,

as α > 0 and Φ is an increasing function of m.

2.1 Examples. We checked the rate model, equation 2.2, against full simulations for several versions of IF neurons. In each case, the rate function in the presence of noise, Φ, is known.

2.1.1 Leaky Integrate-and-Fire Neuron. The leaky integrate-and-fire (LIF) neuron is the best known and most widely used of all IF neurons (see section A.1 for details of the model). Its response function in the presence of noise has been known for a long time; it reads

\Phi_{LIF}(m, s) = \left[ \tau_r + \tau \int_{\frac{C V_r - m\tau}{\sigma\sqrt{\tau}}}^{\frac{C\theta - m\tau}{\sigma\sqrt{\tau}}} \sqrt{\pi}\, e^{x^2} \left( 1 + \operatorname{erf}(x) \right) dx \right]^{-1} \qquad (2.3)

(Siegert, 1951; Ricciardi, 1977; Amit & Tsodyks, 1991). Vr and τr are the reset potential and the absolute refractory period after spike emission, which is said to occur when the membrane potential hits a fixed threshold θ; τ is the membrane time constant; C is the membrane capacitance; and \operatorname{erf}(x) = (2/\sqrt{\pi}) \int_0^x e^{-t^2}\, dt is the error function. m and \sigma = \sqrt{2\tau'}\, s are the average current and the amplitude of the fluctuations, with s in units of current and τ′ a time constant introduced to preserve units (set equal to 1 ms throughout).

The equivalence between the rate model and the full simulation for different values of the noise is shown in Figure 1A. We also report the lower bound for the time constant (around τN ∼ 80 ms) for which the result holds (see Figure 1B). For irregular spike trains, the agreement is remarkable even at very low frequencies, where the condition that the average ISI be smaller than τN is violated. This may be explained by the fact that although ⟨ISI⟩ > τN, the ISI distribution is skewed toward smaller values, and Iahp ∼ −αf is still a good approximation.

2.1.2 Quadratic Integrate-and-Fire Neuron. The quadratic integrate-and-fire (QIF) neuron (see section A.2) is related to a canonical model of type I membrane (see, e.g., Ermentrout & Kopell, 1986; Ermentrout, 1996). As such, it is expected to reproduce the firing behavior of any type I neuron close to
Figure 1: Adapting rate model from the LIF model neuron, theory versus simulations. (A) Rate functions of the adapted LIF neuron. Plots show the firing rate as a function of the mean current m at constant noise s = 0, 200, and 600 pA (from the rightmost to the leftmost curve). Lines: Self-consistent solutions fth of the equation f = ΦLIF(m − αf, s), with ΦLIF given by equation 2.3. Dots: Simulations of the full model, equations 2.1 and A.1 (fsim). Adaptation parameters: τN = 500 ms, gN AN = 8 pA (so that α = 4 pA·s). Neuron parameters: τr = 5 ms, C = 500 pF, θ = 20 mV, Vr = 10 mV, Vrest = 0, τ = 20 ms. Inset: Sample of membrane voltage (mV, top trace) and feedback current Iahp (pA, bottom trace) as a function of time (in seconds) for the input point (m, s) = (550, 200) pA. Note how equilibrium is reached after ≈ 1 s = 2τN. (B) Dependence of (fsim − fth)/fsim on τN. As τN is varied, AN is rescaled so that the total amount of adaptation (α = 4 pA·s) is kept constant. Parameters of the current: m = 600 pA (full symbols) and m = 800 pA (empty symbols); s = 0 (circles) and s = 200 pA (triangles). All other parameters as in A. Mean spike frequencies assessed across 50 s, after discarding a transient of 10τN. Integration time step dt = 0.1 ms. For s > 0, finite-noise effects have to be expected, but the error is always below 3%. For τN < 80 ms, the error is positive, meaning that fsim > fth: the neuron adapts only slightly because N decays too quickly (vertical dotted line: 80 ms).
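The self-consistent solutions plotted as lines in Figure 1A can be computed by damped fixed-point iteration on equation 2.2, using the Siegert formula of equation 2.3. The sketch below uses the parameters of Figure 1 and a crude trapezoidal quadrature, so values are approximate; function names are ours:

```python
import math

def phi_lif(m, s, tau_r=5.0, tau=20.0, C=500.0, theta=20.0, V_r=10.0):
    """Siegert response function of the LIF neuron (eq. 2.3), in Hz.
    Units: ms, mV, pA, pF; sigma = sqrt(2 * 1 ms) * s, as in the text.
    Requires s > 0."""
    sigma = math.sqrt(2.0) * s
    lo = (C * V_r - m * tau) / (sigma * math.sqrt(tau))
    hi = (C * theta - m * tau) / (sigma * math.sqrt(tau))

    def g(x):
        # sqrt(pi) * exp(x^2) * (1 + erf(x)), written with erfc for stability
        return math.sqrt(math.pi) * math.exp(x * x) * math.erfc(-x)

    n = 2000                                   # crude trapezoidal quadrature
    h = (hi - lo) / n
    integral = h * (0.5 * g(lo) + 0.5 * g(hi)
                    + sum(g(lo + i * h) for i in range(1, n)))
    return 1000.0 / (tau_r + tau * integral)   # 1/ms -> Hz

def adapted_rate(m, s, alpha=4.0, damp=0.5, iters=80):
    """Damped fixed-point iteration for f = Phi_LIF(m - alpha*f, s) (eq. 2.2).
    alpha in pA*s and f in Hz, so alpha*f is a current in pA."""
    f = 0.0
    for _ in range(iters):
        f = (1.0 - damp) * f + damp * phi_lif(m - alpha * f, s)
    return f
```

Since Λ < 0 at the fixed point, the damped iteration converges; the adapted rate always lies below the unadapted rate ΦLIF(m, s).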
Adapting Rate Models
2107
Figure 2: Adapting rate model from the QIF neuron, theory versus simulations. f-I curves plotted as in Figure 1A. Lines: Self-consistent solutions of equation f = Φ_QIF(µ − αf, σ), with Φ_QIF given by equation 2.4. Dots: Simulations of the full model, equations 2.1 and A.2 (f_sim). Parameters: τN = 500 ms, gN AN = 0.06 (α = 0.03 s), τ = 20 ms, τr = 0; σ = 0, 0.8, 1.6 (from right to left). Inset: Same as in Figure 1A, for the point (µ, σ) = (−0.2, 1.6). Mean spike frequencies f_sim assessed across 50 s, after discarding a transient of 10τN.
bifurcation, where the firing rates are low. Its firing rate in the presence of white noise reads

Φ_QIF = [ τr + √π τ ∫_{−∞}^{+∞} dx exp(−µx² − σ²x⁶/48) ]^{−1},  (2.4)
(see Brunel & Latham, 2003). µ and σ quantify the drift and the fluctuations of a (dimensionless) gaussian input current. The adapted response in stationary conditions is given by f = Φ_QIF(µ − αf, σ). The match between the predictions of the adapting rate model and simulations of the QIF neuron is presented in Figure 2.

2.1.3 Conductance-Based IF Neuron. The scenario outlined so far considered only current-based model neurons—models in which the input current does not depend on the state of the membrane voltage. A more realistic description takes into account the dependence of the current on the reversal potentials, VE,I, for the excitatory and inhibitory inputs, respectively. The adapting rate model for this class of neurons can be derived in the same way as done for the current-based models. For the LIF neuron with reversal potentials defined in section A.3, and referred to as the conductance-based (CB) neuron in the following, one finds that the adapted response is the
fixed point of the self-consistent equation,

f = Φ_CB(m0 − αf, s0),  (2.5)

where

Φ_CB = [ τr + τ* ∫_{(CVr − m0 τ*)/(s0 √τ*)}^{(Cθ − m0 τ*)/(s0 √τ*)} √π e^{x²} (1 + erf(x)) dx ]^{−1}.  (2.6)
Φ_CB is the rate function of the CB neuron (Burkitt, Meffin, & Grayden, 2003). Note that the only difference with the response function of the current-based LIF neuron, equation 2.3, is in the dependence of the quantities m0, s0, τ*, given, respectively, by equations A.10, A.11, and A.6, upon the input parameters ḡE,I, νE,I. Here ḡE,I are the excitatory and inhibitory peak conductances and νE,I the mean frequencies of the input spike trains, modeled as Poisson processes. Equations A.10 and A.11 have to be compared with

m = ḡE νE − ḡI νI,  s² = ḡ²E νE + ḡ²I νI,
valid for the current-based neuron. Note that one way to obtain the plots of Figure 1A is to increase νE while scaling ḡE as A/√νE, with A held constant and for constant inhibition (i.e., for constant ḡI, νI). This in fact corresponds to increasing m as A√νE − ḡI νI at constant noise s² = A² + ḡ²I νI. To make a comparison with the current-based neuron, we plotted Φ_CB as a function of µE = ḡE νE according to such a procedure in Figure 3, which presents the match between the predictions of the rate model and simulations of the adapting CB neuron.

2.1.4 Other Model Neurons. A similar agreement is obtained for other IF model neurons (results not shown). Particularly worth mentioning is the constant leakage IF neuron with a floor (CLIFF) (Fusi & Mattia, 1999; Rauch et al., 2003), whose response function is very simple and does not require any integration:

Φ(m, s) = [ τr + σ²/(2(m − λ)²) (e^{−2Cθ(m−λ)/σ²} − e^{−2CVr(m−λ)/σ²}) + C(θ − Vr)/(m − λ) ]^{−1}.
The input parameters m, σ are defined as for the LIF neuron. The CLIFF neuron is an LIF neuron with constant leakage (i.e., with the term −(V − Vrest)/τ in equation A.1 replaced by the constant −λ/C) and a reflecting barrier for the membrane potential (Fusi & Mattia, 1999). However, the scheme proposed here to derive the adapting rate models does not apply to IF neurons only. For example, the response function of a Hodgkin-Huxley neuron with an A-current can be fitted by the simple model Φ₁ = a[m − mth]+ − b[m − mth]²+ (Shriki et al., 2003), where a, b
Figure 3: Adapting rate model from the conductance-based LIF neuron, theory versus simulations. Lines: Self-consistent response of equation 2.5 plotted as ḡE νE → f at constant inhibition, with νI = 500 Hz, ḡI = 1 nS throughout. Dots: Simulations of the full models (f_sim). Each curve is obtained moving along νE and scaling ḡE so that σE² ≡ ḡ²E νE is constant, to allow comparison with the current-based neurons in Figure 1A. ḡE [nS] as a function of νE [Hz] is shown in the right inset as ḡE vs log10(νE). The resulting σE values were 7.0, 16.9, 33.1 nS/√s (from right to left). Adaptation and neuron parameters as in Figure 1A, plus VE = 70 mV, VI = −10 mV. Left inset as in Figure 1 with µE = 783 nS/s, σE = 33.1 nS/√s. Mean spike frequencies f_sim assessed across 50 s, after discarding a transient of 10τN.
are two positive constants (such that Φ₁ ≥ 0 always), the rheobase mth is an increasing function of the leak conductance gL (Holt & Koch, 1997), and [x]+ = x if x > 0, and zero otherwise. The adapting rate model that corresponds to Φ₁, that is, f = Φ₁(m − αf; a, b, gL), could be interpreted as the rate model for the Hodgkin-Huxley neuron underlying it. Another example is given by the function Φ₂ ∝ √[m − mth]+, which describes the firing behavior of a type I membrane close to bifurcation and has been fitted (Ermentrout, 1998) to the in vitro response of cells from cat neocortex in the absence of noise (Stafstrom et al., 1984), and to the Traub model (Traub & Miles, 1991). It is easily seen that Φ₂ is the rate function of the QIF neuron when σ = 0. Like Φ₁, this model does not take fluctuations explicitly into account.

2.1.5 IF Neurons with Synaptic Dynamics. The adapting rate model also works in the presence of synaptic dynamics, provided that the appropriate response function is used. For example, for the LIF neuron with fast synaptic dynamics, this is equation 2.3 with {θ, Vr} replaced by {θ, Vr} + 1.03 s_v √(τs/τ),
where s_v is the standard deviation of the input in units of voltage and τs ≪ τ is the synaptic time constant (Brunel & Sergi, 1998; Fourcaud & Brunel, 2002). This model will be used in section 3, which deals with time-dependent inputs. For slow synaptic dynamics, the response of the LIF neuron has been given by Moreno-Bote and Parga (2004b), while the response of the QIF neuron in both regimes has been given by Brunel and Latham (2003).

2.2 The Adapting Threshold Model. The above construction can be applied to other models in which the adaptation mechanism is of a different type. Among those is a model in which the threshold for spike emission adapts (see, e.g., Holden, 1976; Wilbur & Rinzel, 1983; Liu & Wang, 2001). In the adapting threshold model, the emission of an action potential causes the threshold θ for spike emission to step increase by an amount Bθ and then exponentially decay to its resting value θ0 with a time constant τθ:

dθ/dt = −(θ − θ0)/τθ + Bθ Σ_k δ(t − tk).  (2.7)
There is indeed evidence that the spike threshold rises after the onset of a current step (Schwindt & Crill, 1982; Powers et al., 1999). Note that the feedback now affects the threshold, not the current. This case can be handled in a similar way as done for AHP adaptation: after a transient of the order of τθ, the neuron will have an average threshold proportional to its own output frequency, θ ≈ θ0 + Bθ τθ f ≡ θ0 + βf, with β = Bθ τθ, and the adapted response f is given by the self-consistent solution of

f = Φ(θ0 + βf; m, s).  (2.8)
Also, this reduction is expected to work for slow enough threshold dynamics, τθ ≳ 100 ms, but apart from that, at any output frequency. This is confirmed by Figure 4, in which the prediction of the rate model is checked against the simulations of the full model for the LIF neuron (that is, equations 2.7 and A.1 with Iahp = 0). A similar agreement is obtained for the conductance-based LIF neuron (not shown). The qualitative differences between the two adapting models for the LIF neuron are illustrated in the next section, where we report the results of fitting the response functions to those of cortical pyramidal neurons.
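For s = 0 the LIF rate function reduces to the closed form Φ = [τr + τ ln((µτ − Vr)/(µτ − θ))]^{−1} (valid for µτ > θ), which makes equation 2.8 easy to solve by bisection. A sketch under that noiseless assumption (parameter values roughly as in Figure 4; the names are ours):

```python
import math

def phi_det(m, theta, tau=0.02, C=500e-12, Vr=0.010, tau_r=0.005):
    """Noiseless LIF rate; nonzero only if the asymptotic voltage m*tau/C > theta."""
    v_inf = m * tau / C
    if v_inf <= theta:
        return 0.0
    return 1.0 / (tau_r + tau * math.log((v_inf - Vr) / (v_inf - theta)))

def adapted_rate_at(m, beta=0.25e-3, theta0=0.020, tau=0.02, C=500e-12):
    """Solve f = phi_det(m, theta0 + beta*f) (equation 2.8) by bisection."""
    f_lo = 0.0
    f_hi = (m * tau / C - theta0) / beta  # the moving threshold cannot exceed m*tau/C
    for _ in range(60):
        f = 0.5 * (f_lo + f_hi)
        if phi_det(m, theta0 + beta * f) > f:
            f_lo = f  # rate function still above the diagonal
        else:
            f_hi = f
    return 0.5 * (f_lo + f_hi)

f_at = adapted_rate_at(600e-12)  # adapted rate with a moving threshold, m = 600 pA
```

Bisection is safe here because Φ(θ0 + βf; m, 0) is decreasing in f, so f = Φ has a unique crossing; for m = 600 pA the fixed point comes out near 14 Hz, i.e. an equilibrium threshold θ0 + βf ≈ 23.5 mV.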
Figure 4: LIF neuron with an adapting threshold, theory versus simulations. f-I curves plotted as in Figure 1A. Lines: Self-consistent solutions of equation f = Φ(θ0 + βf; m, s), with Φ given by equation 2.3. Dots: Simulations of the full model, equations 2.7 and A.1 (f_sim). Adaptation parameters: τθ = 500 ms, Bθ = 0.5 mV (so that β = 0.25 mV·s). Neuron parameters and s values as in Figure 1A. Inset: Sample of membrane potential and θ(t) for the point (m, s) = (550, 200) pA (time in seconds). Note that for this point, the output frequency is f ≈ 14 Hz, so that after the transient θ = θ0 + βf ≈ 23.5 mV, as shown in the inset. Mean spike frequencies f_sim assessed across 50 s, after discarding a transient of 10τθ.
2.3 Quantitative Comparison with Experimental Data. We have shown previously that the LIF neuron with AHP adaptation can be fitted to the experimental response functions of rat pyramidal neurons under noisy current injection (Rauch et al., 2003). We extended the analysis to the LIF neuron with an adapting threshold and found basically the same result. The results for the 26 rat pyramidal neurons considered for the analysis are summarized in Table 1 (see appendix B for details). Two examples are shown in Figure 5. The two adapting models can be made equivalent in the region of low frequencies, both being threshold-linear with slopes 1/α (AHP) and τ/Cβ (adapting threshold; see appendix C for details). This is confirmed by Table 1, which shows that C and τ are the same for the two models and Cβ/τ = 4.46 pA·s, consistent with α = 4.3 ± 2.2. The two models differ in the value of the refractory period, which is much shorter for the adapting threshold model. In fact, imposing τr = 0 only marginally affects the quality of the fits. This is because the LIF neuron with an adapting threshold has a square root behavior in m at intermediate and large frequencies (see equation C.1). On the contrary, a refractory period
Figure 5: Best fits of different models to the rate functions of two rat pyramidal cells (see appendix B). Models: LIF neuron with AHP adaptation (AHP) and with an adapting threshold (AT). Theoretical curves (full lines) and experimental points (dots) plotted as in Figure 1A. Best-fit parameters are: Cell A: AHP: τr = 6.6 ms, τ = 27.1 ms, C = 260 pF, Vr = 1.7 mV, α = 5.1 pA·s, P = 0.32; AT: τr = 2.2 ms, τ = 28.8 ms, C = 270 pF, Vr = 14 mV, β = 0.58 mV·s, P = 0.38. Cell B: AHP: τr = 19.8 ms, τ = 41.1 ms, C = 440 pF, Vr = −2 mV, α = 2.8 pA·s, P = 0.85; AT: τr = 6.8 ms, τ = 40.1 ms, C = 430 pF, Vr = −12.8 mV, β = 0.30 mV·s, P = 0.80. In all the fits, the threshold (θ0 in AT) was kept fixed to 20 mV. P equals the probability that χ² is larger than the observed minimum χ²_min. The fit was accepted whenever P > 0.01. Amplitude of the fluctuating current: cell A: s = 0, 200, and 400 pA; cell B: s = 50, 200, 400, and 500 pA.
is required to bend the response of the model with AHP, otherwise linear in that region.

3 Adapting Response to Time-Dependent Inputs

The stationary response function can be used also to predict the time-varying activity of a population of adapting neurons, as shown in this section. Consider an input spike train of time-varying frequency νx(t), targeting each cell of the population through x-receptor mediated channels. Each spike contributes a postsynaptic current of the form ḡx e^{−t/τx}, where ḡx is the peak conductance of the channels. In the diffusion approximation, such an input Ix is an Ornstein-Uhlenbeck (OU) process with average m̄x = ḡx νx(t) τx, variance s̄²x(t) = (1/2) ḡ²x νx(t) τx, and correlation length τx:

dIx = −[(Ix − m̄x)/τx] dt + s̄x √(2dt/τx) ξt,  (3.1)

where ξt is the unitary gaussian process defined in section A.1.
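Equation 3.1 integrates directly with an Euler–Maruyama step; with this normalization the stationary mean and variance of Ix should come out as m̄x and s̄²x. A quick sketch (our names):

```python
import math
import random

def simulate_ou(m_bar, s_bar, tau_x, dt=1e-4, n_steps=200000, seed=0):
    """Euler-Maruyama integration of equation 3.1 for the input current Ix."""
    rng = random.Random(seed)
    I = m_bar  # start at the stationary mean
    total = total_sq = 0.0
    for _ in range(n_steps):
        I += (-(I - m_bar) / tau_x) * dt \
             + s_bar * math.sqrt(2.0 * dt / tau_x) * rng.gauss(0.0, 1.0)
        total += I
        total_sq += I * I
    mean = total / n_steps
    var = total_sq / n_steps - mean * mean
    return mean, var

# stationary statistics: mean ~ m_bar, variance ~ s_bar**2 (the sqrt(2/tau_x)
# factor in the noise term makes the OU variance equal s_bar**2 exactly)
mean_I, var_I = simulate_ou(m_bar=500e-12, s_bar=200e-12, tau_x=0.005)
```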
Table 1: Summary of the Results of the Fit of the LIF Neuron to the Experimental Rate Functions of 26 Rat Neocortical Pyramidal Cells.

                          AHP            AT
N                         14             13
α [pA·s], β [mV·s]        4.3 ± 2.2      0.29 ± 0.13
τr [ms]                   9.0 ± 6.5      3.0 ± 4.0
τ [ms]                    33.2 ± 9.4     32.5 ± 9.2
C [nF]                    0.50 ± 0.18    0.50 ± 0.18
Vr [mV]                   0.1 ± 11.2     1.1 ± 13.7
P                         0.40 ± 0.30    0.33 ± 0.29
Notes: N is the number of fitted cells that required an adaptation parameter (α or β) > 0. Two cells could be fitted without adaptation and were not included in the analysis. The parameters (left-most column) are defined in section 2.1, and their best-fit values are reported as average ± SD. The threshold for spike emission was held fixed to 20 mV. P is the probability P[χ² > χ²_min] across fitted cells requiring adaptation. A fit was accepted whenever P > 0.01. AT: adapting threshold model.
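The low-frequency equivalence quoted in section 2.3 can be checked directly on the Table 1 averages: Cβ/τ carries units of pA·s and should be comparable with α of the AHP model. A two-line arithmetic check:

```python
# Table 1 averages for the adapting threshold (AT) column
C = 0.50e-9     # membrane capacitance [F]
beta = 0.29e-3  # adaptation strength [V*s]
tau = 32.5e-3   # membrane time constant [s]

alpha_equiv = C * beta / tau * 1e12  # C*beta/tau expressed in pA*s
# ~4.46 pA*s, inside the AHP estimate alpha = 4.3 +/- 2.2 pA*s
```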
The population activity of noninteracting neurons is well predicted by f(t) = Φ(mx, s²x), where Φ is the stationary response function, and mx, s²x are the time-varying average and variance of Ix (see, e.g., Renart, Brunel, & Wang, 2003). These evolve according to the first-order dynamics (ẏ ≡ dy/dt),

τx ṁx = −(mx − m̄x),  (3.2)
and analogously for s²x, with τx replaced by τx/2 (e.g., Gardiner, 1985). We now include adaptation in the following way:

f = Φ(mx − Iahp, s²x),
τN İahp = −Iahp + αf,  (3.3)
where Iahp is the AHP current, which follows the instantaneous output rate with time constant τN. Note that for a stationary stimulus, that is, νx constant, after a transient of the order of max{τx, τN}, one recovers the stationary model, equation 2.2, with m = m̄x, s = s̄x. In the case of several independent components, they follow their own synaptic dynamics and sum up in the argument of the response function to give the time-varying firing rate:

f = Φ( Σx mx − Iahp, Σx s²x ).
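Equations 3.2 and 3.3 are just two coupled first-order filters wrapped around the static response function. A sketch of the time-dependent model using, purely for illustration, a toy threshold-linear Φ in place of equation 2.3 (the gain, threshold, and units below are placeholders, not fitted values):

```python
def run_adapting_rate(nu_x, g_x=1.0, tau_x=0.005, tau_N=0.5, alpha=4.0, dt=1e-4):
    """Integrate tau_x*dm/dt = -(m - m_bar) and tau_N*dI/dt = -I + alpha*f
    (equations 3.2 and 3.3) with a toy rate function f = 0.1*[m - I - 100]+."""
    m, I_ahp, rates = 0.0, 0.0, []
    for nu in nu_x:
        m_bar = g_x * nu * tau_x                    # stationary mean of the input
        m += (m_bar - m) * dt / tau_x               # synaptic filtering (eq. 3.2)
        f = 0.1 * max(m - I_ahp - 100.0, 0.0)       # static response function (toy)
        I_ahp += (-I_ahp + alpha * f) * dt / tau_N  # AHP feedback (eq. 3.3)
        rates.append(f)
    return rates

# step stimulus: the rate overshoots, then adapts to a lower plateau
nu_x = [0.0] * 1000 + [300e3] * 40000
rates = run_adapting_rate(nu_x)
```

With these placeholder numbers the early peak approaches the non-adapted value 0.1·(m̄ − 100) = 140, while the plateau relaxes to the self-consistent rate f = 0.1·(m̄ − αf − 100), i.e. 100 here: the same overshoot-and-adapt shape as the step response in Figure 6.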
Figure 6: Time-varying activity of a population of independent, adapting LIF neurons in response to a noisy, broadband stimulus. (Top) Prediction of the adapting rate model, equation 3.4 (gray), compared to the simulations of 20,000 neurons (black), as described in section 3. Shown is the activity after a transient of 200 ms. The short horizontal bar indicates a pulselike increase of 1 ms duration in ν0, of strength 12ν0. The long horizontal bar indicates a 50% steplike increase in ν0, during which the inhibitory rate step increases by 0.7ν0. (Bottom) Average time course of the stimulus. Neuron parameters as in Figure 1A, apart from τr = 2 ms. Other parameters (refer to the text): ν0 = 350 Hz, νinh = 1.2ν0, G_ampa,gaba = 200 pA, G_nmda = 10 pA, τ_ampa,gaba = 5 ms, τ_nmda = 100 ms. Bin size of the PSTH: 0.5 ms. Integration time step: 0.01 ms.
In Figure 6 we show an example with two fast (τx = 5 ms) components, one excitatory (AMPA-like), the other inhibitory (GABAA-like), plus a third component mimicking a slow (NMDA-like) current (with τ_nmda = 100 ms). The latter component is only slowly fluctuating, so that its variance can be neglected (Brunel & Wang, 2001), as in the case of the adaptation current. The output rate was calculated as

f(t) = Φ(m_ampa + m_nmda − m_gaba − Iahp, s²_ampa + s²_gaba),  (3.4)

where Φ is equation 2.3 corrected for fast synaptic dynamics (Fourcaud & Brunel, 2002); see section 2.1.5. The excitatory stimulus was of the form ν0(1 + ε(t)), with ν0 = 350 Hz and ε(t) a rectified superposition of 10 sinusoidal components with random frequencies ωi/2π, phases φi, and amplitudes Ai drawn from uniform distributions (the latter between 0 and 0.5):

ε(t) = [ Σi Ai sin(ωi t + φi) ]+ .
The maximum value used for ω/2π was 150 Hz. νI was held constant at 1.2ν0. This gives m̄_ampa,nmda = ḡ_ampa,nmda ν0 (1 + ε(t)) τ_ampa,nmda and m̄_gaba = 1.2 ḡ_gaba ν0 τ_gaba, with mx given by equation 3.2, and analogously for s²_ampa,gaba. The input currents used in the simulations were [Ix]+, where Ix evolved according to the OU process, equation 3.1. Each neuron received an independent realization of the stochastic currents. The time-varying population activity was assessed through the peristimulus time histogram (PSTH) with a bin size of 0.5 ms. The model makes a good prediction of the population activity, even during fast transients, as in response to the impulse of 1 ms duration at t = 250 ms and the step increase in ν0, νinh that occurred at t = 400 ms (horizontal bars in Figure 6). The small discrepancies are due to finite size effects (Brunel & Hakim, 1999; Mattia & Del Giudice, 2002), and to the approximation used for the stationary response function. Similar results were obtained with the adapting threshold model (i.e., with Iahp = 0 and equation 3.3 replaced by τθ θ̇ = −(θ − θ0) + βf), and for the CB neuron (not shown). It has to be noticed that the condition τ_ampa,gaba ≪ τ*, where τ* is the effective time constant of the CB neuron (see equation A.6), is usually more difficult to fulfill, as τ* can reach values as small as a few ms, depending on the input (see, e.g., Destexhe et al., 2001). However, when τ* < τ_ampa,gaba, a good approximation to the response function has been given by Moreno-Bote and Parga (2004a).

4 Discussion

We have proposed a general scheme to derive adapting rate models in the presence of fluctuating input currents. The adapting rate model is a reduced two-dimensional model, the minimal model needed to fit the response of cortical neurons to in vivo–like currents with stationary distributions.
The rate models are easily computable; reproduce well the simulations of the full models from which they are derived; are of wide applicability (e.g., apply to model neurons with conductance-based synaptic inputs); allow for different mechanisms of adaptation to be used; and can be used to predict the activity of a large population of adapting spiking neurons in response to time-varying inputs. We considered an AHP current and an adapting threshold for spike emission, two mechanisms for which there is experimental evidence and which are often used in modeling studies. The only requirement is that adaptation be slow compared to the scale of the neural dynamics. Since the adapting rate models (with either model of adaptation) considered in this study are able to fit the rate functions of neocortical neurons, they offer a synthetic way to describe cortical firing, and the diversity of possible models can be used for a quantitative (and possibly functional) classification of cortical neurons based on their observed response to in vivo–like currents.
Since the stationary model predicts the time-varying activity of a large population of independent neurons, the response function measured by Rauch et al. (2003) can be used to make quantitative analyses of the population dynamics, and not only of the stationary states (Amit & Brunel, 1997; Mattia & Del Giudice, 2002). This is particularly the case for networks of neurons in the fluctuation-dominated regime (e.g., Shadlen & Newsome, 1998; Fusi & Mattia, 1999), when spikes are emitted because of large fluctuations of the membrane potential around an average value that is below the threshold. Such a regime could be the consequence of a high-conductance state (Destexhe et al., 2001), as observed in vivo (Paré, Shink, Gaudreau, Destexhe, & Lang, 1998). Otherwise, the network may fall into a synchronized regime, for example, in response to a step current, causing the population activity to oscillate around that predicted by the rate model. Additional work is required to investigate the predictions of the time-varying model in more complex situations, as, for example, in populations of interacting neurons or in the case of voltage-gated, saturating conductances (as, e.g., NMDA-mediated). In the absence of adaptation, the predictions of the model have been shown to be good in these more complex cases as well (e.g., Renart et al., 2003). We expect no difference in performance when adaptation is included as done in section 3.

Appendix A: Model Neurons

A.1 Leaky Integrate-and-Fire Neuron. The leaky IF neuron is a single-compartment model characterized by its membrane voltage V and with a fixed threshold θ for spike emission. Upon crossing of θ from below, a spike is said to occur, and the membrane is clamped to a reset potential Vr for a refractory time τr, after which normal dynamics resume. We assume that a large number of postsynaptic potentials of small amplitude are required to reach the threshold.
In such a condition, prevalent in the cortex, the subthreshold membrane dynamics can be assimilated to a continuous random walk, the Ornstein-Uhlenbeck (OU) process (this is the diffusion approximation; see, e.g., Lánský & Sato, 1999). Taking into account the effect of Iahp, the subthreshold dynamics of the membrane potential obeys the stochastic differential equation,

dV = −[(V − Vrest)/τ] dt + µ dt + σ ξt √dt,  (A.1)

where

µ = (m + Iahp)/C,  σ = √(2τ) s/C

are the average and standard deviation in unit time of the membrane voltage. m and s² are the average and the variance of the synaptic input current,
Iahp = −gN [N] (see equation 2.1), and √(2τ) is a factor to preserve units (see, e.g., Rauch et al., 2003). Vrest = 0 is the resting potential, C is the capacitance of the membrane, τ = RC, and R is the membrane resistance. ξt is a gaussian process with flat spectrum and unitary variance, ⟨ξt ξt′⟩ = δ(t − t′) (white noise; see, e.g., Tuckwell, 1988, or Gardiner, 1985, for more details). In a nonadapting neuron, Iahp ≡ 0. In the adapting threshold model (see section 2.2), Iahp = 0 and the threshold θ ≥ θ0 is a dynamical variable obeying equation 2.7.

A.2 Quadratic Integrate-and-Fire Neuron. The dimensionless variable V, to be interpreted as the membrane potential of the white noise–driven QIF neuron, obeys

τ dV = (V² + µ) dt + σ ξt √(τ dt),  (A.2)
where τ is a time constant that mimics the effect of the membrane time constant, and µ, σ² are the average and variance per unit time of the input current. A spike is said to occur whenever V = +∞, after which V is clamped to V = −∞ for a refractory period τr. In practice, in the simulations, V is reset to −50 whenever V = +50. This gives an accurate approximation for the parameters chosen in Figure 2. On the other hand, in the rate function, equation 2.4, the actual values used for the integration limits do not matter, provided they are larger than +10 and smaller than −10, respectively.

A.3 Conductance-Based LIF Neuron. The membrane potential of the conductance-based LIF neuron obeys

dV = −g̃L (V − Vrest) dt + gE (VE − V) dPE + gI (VI − V) dPI,

where gE,I = τḡE,I/C are dimensionless peak conductances, g̃L = 1/τ is the leak conductance in appropriate units (1/ms), VE,I are the excitatory and inhibitory reversal potentials, and dPE,I = Σj δ(t − t_j^{E,I}) dt are Poisson spike trains with intensity νE,I. In the diffusion approximation (dPx → νx dt + √(νx dt) ξt), the equation can be put in a form very similar to equation A.1 (see, e.g., Hanson & Tuckwell, 1983; Lánský & Lánská, 1987; Burkitt, 2001):

dV = −(V/τ*) dt + µ0 dt + σ0(V) √dt ξt,  (A.3)

where

µ0 = g̃L Vrest + (gE VE νE + gI VI νI)  (A.4)
σ0²(V) = g²E (VE − V)² νE + g²I (VI − V)² νI  (A.5)
τ* = (g̃L + gE νE + gI νI)^{−1}.  (A.6)
The main differences with respect to the current-based IF neuron are (1) the fluctuations depend on the membrane voltage; (2) an input-dependent, effective time constant τ* appears; (3) the parameter µ0 is not the average of the total input current (e.g., part of the input contributes to the leak term −V/τ* and is not considered in µ0); and (4) the voltage is bounded from below by the inhibitory reversal potential (below VI, inhibitory inputs become excitatory). Usually the last point is taken care of by imposing a reflecting barrier at VI (Hanson & Tuckwell, 1983; Lánský & Lánská, 1987). The rate function of the CB neuron,

Φ_CB = [ τr + τ* ∫_{(Vr − µss)/(√2 σss)}^{(θ − µss)/(√2 σss)} √π e^{x²} (1 + erf(x)) dx ]^{−1},  (A.7)
has been given in Burkitt et al. (2003) in the absence of a reflecting barrier. Figure 3 shows that in the typical case, it works also in the presence of a reflecting barrier at VI. The constants µss, σss² appearing in equation A.7 are the stationary average and variance of the free (i.e., spikeless) membrane voltage, which are (Hanson & Tuckwell, 1983; Burkitt, 2001):

µss = µ0 τ* = [g̃L Vrest + (gE VE νE + gI VI νI)] / (g̃L + gE νE + gI νI)  (A.8)

and

σss² = (τ*/2) σ0²(µss)/(1 − η) ≈ (τ*/2) σ0²(µss),  (A.9)
where η ≡ (g²E νE + g²I νI) τ*/2. The approximation in equation A.9 follows from the fact that η is negligible in the typical case. For example, when ḡE,I ∼ 1 nS, C ∼ 500 pF, and νE,I ∼ 10³ Hz, then η ∼ 10⁻⁴–10⁻³. (A convenient way to obtain the result, equation A.9, is to make use of the equality σss² = τ* ⟨σ0²(V)⟩/2, where the average ⟨·⟩ is taken on the free process. This is a generalization of a well-known property of the OU process in which σ0 is constant.) Equation A.7 can therefore be written as equation 2.6,

Φ_CB = [ τr + τ* ∫_{(CVr − m0 τ*)/(s0 √τ*)}^{(Cθ − m0 τ*)/(s0 √τ*)} √π e^{x²} (1 + erf(x)) dx ]^{−1},
where m0 ≡ Cµ0 and s0 ≡ Cσ0(µss), so that m0 has units of current:

m0 = C g̃L Vrest + C(gE VE νE + gI VI νI)  (A.10)
s0² = C² g²E (VE − µss)² νE + C² g²I (VI − µss)² νI,  (A.11)
with µss given by equation A.8 and gE,I ≡ τḡE,I/C. The adapted frequency is given by the solution of either f = Φ_CB(m0 − αf, s0) (AHP-based model) or f = Φ_CB(θ0 + βf; m0, s0) (adapting threshold model).

Appendix B: Fitting Procedure

Here we briefly summarize the experimental procedure and the data analysis that led to the results of Table 1. Full details can be found in Rauch et al. (2003). Pyramidal neurons from layer 5 of rat somatosensory cortex were injected with an OU process with a correlation time constant τ = 1 ms to resemble white noise. Stimuli were delivered in random order from a preselected pool, which depended on the cell. The time length of the stimulus was between 6 and 12 seconds. The spike trains of 26 selected cells were analyzed to assess their mean spike frequencies. A transient ranging from 0.5 to 2 seconds (depending on stimulus duration) was discarded to deal with the stationary spike train only. On balance, the stationary response was usually adapted with respect to the transient one. The model rate functions were fitted to the experimental ones using a random least-square procedure, that is, a Monte Carlo minimization of the function (see, e.g., Press, Teukolsky, Vetterling, & Flannery, 1992),

χ²_{N−M} = Σ_{i=1}^{N} [ (Φ_MODEL(mi, si; Θ) − fi) / Δi ]²,

where i runs over the experimental points, fi are the experimental spike frequencies, Θ is the parameter set, and the weights Δi correspond approximately to a confidence interval of 68% for the output rate. M is the number of parameters to be tuned and N the number of experimental points. The best fit was accepted if the probability of a variable χ²_{N−M} to be larger than the observed χ²_min was larger than 0.01. The parameter set includes five parameters: τr, Vr, C, τ, and α [pA·s] for the AHP adaptation or β [mV·s] for the adapting threshold mechanism. Note that since Φ_LIF, equation 2.3, is invariant under the scaling θ → θh, Vr → Vr h, C → C/h (h constant), only two out of these three parameters are independent. Therefore, the threshold for spike emission (θ0 in the case of an adapting threshold) was set to 20 mV throughout. The results are summarized in Table 1 and discussed in the text.

Appendix C: The Effects of Adaptation on the LIF Neuron

Here we summarize and compare the properties of the LIF neuron endowed with the two models of adaptation, which are mentioned in the analysis of the experimental data in section 2.3. We will refer to the self-consistent solutions of equations 2.2 and 2.8 as fα(m, s) and fβ(m, s), respectively. We consider the regions of low and intermediate-to-large rates in turn.
1. At low frequencies, the two models can be made equivalent by choosing βC/τ = α. This is because the response at rheobase, otherwise highly nonlinear, is linearized by either model of adaptation. One obtains fα,β ≈ ρα,β [m − mth]+ as m → m+th, where mth = Cθ/τ is the rheobase current, ρα = α⁻¹ for AHP adaptation, and ρβ = τ/Cβ for the adapting threshold. ([x]+ = x if x > 0 and zero otherwise, and m → y+ means that the limit is performed for values of m larger than y.) The linearization argument is due to Ermentrout (1998) for AHP-like adaptation and can be easily generalized to the case of an adapting threshold for the LIF neuron. The slope, ρ, can be obtained by looking at how the two forms of adaptation affect the rheobase mth = Cθ/τ, that is, by requiring that m − αfα − Cθ/τ = m − C(θ + βfβ)/τ. One finds fα/fβ = Cβ/ατ, so that for Cβ = ατ the output frequencies at the rheobase (hence, the slopes of the linearized rate functions) are the same.

2. For τr = 0, the rate functions of the two adapting models differ away from the rheobase. This can be seen most easily for large inputs, where the nonadapted response is approximately linear, f ∼ m/C(θ − Vr). It is easy to derive that AHP adaptation preserves this linearity,

fα ∼ m / [C(θ − Vr) + α],

while for an adapting threshold,

fβ = [(θ − Vr)/(2β)] [ √(1 + 4βm/(C(θ − Vr)²)) − 1 ],  (C.1)

with asymptotic behavior fβ ∼ √(m/Cβ). The introduction of a finite refractory period makes the AHP model bend in this region,

fα ∼ (1/τr) [1 − α/(τr m)],

allowing the two models to match on the entire range of observed output rates.

Acknowledgments

We thank Paolo Del Giudice and Maurizio Mattia for useful discussions. This work was supported by the Swiss National Science Foundation (grants 31-61335.00 and 3152-065234.01) and the Silva Casa Foundation.
References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured (learned) delay activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate-spikes, rates and neuronal gain. Network, 2, 259–273.
Amit, D. J., & Tsodyks, M. V. (1992). Effective neurons and attractor neural networks in cortical environment. Network, 3, 121–137.
Brunel, N. (2000). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Brunel, N., & Latham, P. (2003). Firing rate of the noisy quadratic integrate-and-fire neuron. Neural Computation, 15, 2281–2306.
Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic current dynamics. J. Theor. Biol., 195, 87–95.
Brunel, N., & Wang, X. J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. Journal of Computational Neuroscience, 11, 63–85.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biol. Cybern., 85, 247–255.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory input. Biol. Cybern., 89, 119–125.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Destexhe, A., Rudolph, M., Fellous, J. M., & Sejnowski, T. J. (2001). Fluctuating dynamic conductances recreate in-vivo like activity in neocortical neurons. Neuroscience, 107, 13–24.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Ermentrout, B. (1998). Linearization of f-I curves by adaptation. Neural Computation, 10(7), 1721–1729.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Fleidervish, I., Friedman, A., & Gutnick, M. J. (1996). Slow inactivation of Na+ current and slow cumulative spike adaptation in mouse and guinea-pig neocortical neurones in slices. J. Physiol. (Cambridge), 493, 83–97.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Fuhrmann, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophysiology, 88, 761–770.
2122
G. La Camera, A. Rauch, H.-R. Luscher, ¨ W. Senn, and S. Fusi
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate and fire neurons. Neural Computation, 11, 633–652. Gardiner, C. W. (1985). Handbook of stochastic methods. New York: SpringerVerlag. Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–90. Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximation for neural activity including synaptic reversal potentials. J. Theor. Neurobiol., 2, 127–153. Holden, A. V. (1976). Models of stochastic activity of neurons. New York: SpringerVerlag. Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Computation, 9, 1001–1013. Knight, B. W. (1972a). Dynamics of encoding of a population of neurons. Journal of General Physiology, 59, 734–736. Knight, B. W. (1972b). The relationship between the firing rate of a single neuron and the level of activity in a network of neurons. Experimental evidence for resonance enhancement in the population response. Journal of General Physiology, 59, 767. L´ansky, ´ P., & L´ansk´a, V. (1987). Diffusion approximation of the neuronal model with synaptic reversal potentials. Biol. Cybern., 56, 19–26. L´ansky, ´ P., & Sato, S. (1999). The stochastic diffusion models of nerve membrane depolarization and interspike interval generation. Journal of the Peripheral Nervous System, 4, 27–42. Larkum, M., Senn, W., & Luscher, ¨ H.-R. (in press). Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cerebral Cortex. Liu, Y. H., & Wang, X. J. (2001). Spike-frequency adaptation of a generalized leaky integrate-and-fire model neuron. Journal of Computational Neuroscience, 10, 25–45. Mattia, M., & Del Giudice, P. (2002). Population dynamics of interacting spiking neurons. Phys. Rev. E, 66, 051917. McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. (1985). 
Comparative electrophysiology of pyramidal and sparsely stellate neurons of the neocortex. J. Neurophysiology, 54, 782–806. Moreno-Bote, R., & Parga, N. (2004a). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Manuscript submitted for publication. Moreno-Bote, R., & Parga, N. (2004b). Role of synaptic filtering on the firing response of simple model neurons. Physical Review Letters, 92, 028102. Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–30. Par´e, D., Shink, E., Gaudreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical pyramidal neurons in vivo. Journal of Neurophysiology, 11, 1450–1460. Poliakov, A. V., Powers, R. K., Sawczuk, A., & Binder, M. D. (1996). Effects of background noise on the response of rat and cat motoneurons to excitatory current transients. Journal of Physiology, 495.1, 143–157.
Adapting Rate Models
2123
Powers, R. K., Sawczuk, A., Musick, J. R., & Binder, M. D. (1999). Multiple mechanisms of spike-frequency adaptation in motoneurones. J. Physiol. (Paris), 93, 101–114. Press, W., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press. Rauch, A., La Camera, G., Luscher, ¨ H.-R., Senn, W., & Fusi, S. (2003). Neocortical cells respond as integrate-and-fire neurons to in vivo–like input currents. J. Neurophysiol., 90, 1598–1612. Renart, A., Brunel, N., & Wang, X. J. (2003). Mean-field theory of recurrent cortical networks: From irregularly spiking neurons to working memory. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach. Boca Raton, FL: CRC Press. Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag. Sah, P. (1996). Ca2+ -activated K+ currents in neurons: Types, physiological roles and modulation. Trends Neurosci., 19, 150–154. Sanchez-Vives, M. V., Nowak, L. G., & McCormick, D. A. (2000). Cellular mechanisms of long-lasting adaptation in visual cortical neurons in vitro. J. Neuroscience, 20, 4286–4299. Sawczuk, A., Powers, R. K., & Binder, M. D. (1997). Contribution of outward currents to spike frequency adaptation in hypoglossal motoneurons of the rat. Journal of Physiology, 78, 2246–2253. Schwindt, P. C., & Crill, W. E. (1982). Factors influencing motoneuron rhythmic firing: Results from a voltage-clamp study. J. Neurophysiol., 48, 875– 890. Schwindt, P., O’Brien, J. A., & Crill, W. E. (1997). Quantitative analysis of firing properties of pyramidal neurons from layer 5 of rat sensorimotor cortex. J. Neurophysiol., 77, 2484–2498. Schwindt, P. C., Spain, W. J., & Crill, W. E. (1989). Long-lasting reduction of excitability by a sodium-dependent potassium current in cat neocortical neurons. J. Neurophysiol., 61, 233–244. Shadlen, M. N., & Newsome, W. T. (1998). 
The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896. Shriki, O., Hansel, D., & Sompolinsky, H. (2003). Rate models for conductancebased cortical neural networks. Neural Computation, 15, 1809–1841. Siegert, A. J. F. (1951). On the first passage time probability function. Phys. Rev., 81, 617–623. Stafstrom, C. E., Schwindt, P. C., & Crill, W. E. (1984). Repetitive firing in layer V from cat neocortex in vitro. J. Neurophysiol., 52, 264–277. Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press. Treves, A. (1993). Mean field analysis of neuronal spike dynamics. Network, 4, 259–284. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
2124
G. La Camera, A. Rauch, H.-R. Luscher, ¨ W. Senn, and S. Fusi
Wang, X. J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79, 1549–1566. Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distribution. J. Theor. Biol., 105, 345–368. Received March 28, 2003; accepted April 6, 2004.
LETTER
Communicated by Anthony Burkitt
Including Long-Range Dependence in Integrate-and-Fire Models of the High Interspike-Interval Variability of Cortical Neurons
B. Scott Jackson
scott [email protected]
Institute for Sensory Research and Department of Bioengineering and Neuroscience, Syracuse University, Syracuse, NY 13244, U.S.A.
Many different types of integrate-and-fire models have been designed in order to explain how it is possible for a cortical neuron to integrate over many independent inputs while still producing highly variable spike trains. Within this context, the variability of spike trains has been almost exclusively measured using the coefficient of variation of interspike intervals. However, another important statistical property that has been found in cortical spike trains and is closely associated with their high firing variability is long-range dependence. We investigate the conditions, if any, under which such models produce output spike trains with both interspike-interval variability and long-range dependence similar to those that have previously been measured from actual cortical neurons. We first show analytically that a large class of high-variability integrate-and-fire models is incapable of producing such outputs because their output spike trains are always mathematically equivalent to renewal processes. This class of models subsumes a majority of previously published models, including those that use excitation-inhibition balance, correlated inputs, partial reset, or nonlinear leakage to produce outputs with high variability. Next, we study integrate-and-fire models that have (non-Poissonian) renewal point process inputs instead of the Poisson point process inputs used in the preceding class of models. The confluence of our analytical and simulation results implies that the renewal-input model is capable of producing high variability and long-range dependence comparable to that seen in spike trains recorded from cortical neurons, but only if the interspike intervals of the inputs have infinite variance, a physiologically unrealistic condition. Finally, we suggest a new integrate-and-fire model that does not suffer any of the previously mentioned shortcomings.
By analyzing simulation results for this model, we show that it is capable of producing output spike trains with interspike-interval variability and long-range dependence that match empirical data from cortical spike trains. This model is similar to the other models in this study, except that its inputs are fractional-gaussian-noise-driven Poisson processes rather than renewal point processes. In addition to this model's success in producing realistic output spike trains, its inputs have long-range dependence similar to that found in most subcortical neurons in sensory pathways, including the inputs to cortex. Analysis of output spike trains from simulations of this model also shows that a tight balance between the amounts of excitation and inhibition at the inputs to cortical neurons is not necessary for high interspike-interval variability at their outputs. Furthermore, in our analysis of this model, we show that the superposition of many fractional-gaussian-noise-driven Poisson processes does not approximate a Poisson process, which challenges the common assumption that the total effect of a large number of inputs on a neuron is well represented by a Poisson process.
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2125–2195 (2004)
1 Introduction
The integrate-and-fire (IF) neuron is a common model of general neuronal processing when simplicity is of the essence. Its simplicity is particularly beneficial when one wishes to model a large network of interconnected neurons, such as portions of the cortex. Each neuron in the cortex receives a very large number of inputs but produces a relatively low output firing rate, which is on the order of that of each individual input. In order for the standard, nonleaky IF model with excitatory inputs to have a large number of inputs and an output spike rate similar to its inputs, its threshold must be very large compared to the potential caused by a single input spike. In this case, a large number of input spikes must converge on the IF neuron before an output spike will be produced. Thus, although the inputs may contain a high degree of variability, the output will typically be very regular due to the averaging effect of the integrator. However, it has been known for some time that the spike trains of cortical neurons are quite variable. This discrepancy can be easily explained with a leaky IF model if the decay time of the potential is short relative to the time between output spikes.
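The averaging effect of the nonleaky integrator can be made concrete with a minimal sketch (illustrative code, not from this article): when the threshold equals N unit EPSPs and the input is a single Poisson stream, each output interspike interval is a sum of N exponential interarrival times, so the output CV falls as 1/√N.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_cv(n_inputs_per_spike, n_output_spikes=20_000):
    """CV of output ISIs for a nonleaky IF neuron whose threshold equals
    n_inputs_per_spike unit EPSPs, driven by a unit-rate Poisson stream."""
    # Each output ISI is a sum of exponential input interarrival times,
    # i.e., a gamma variate with CV = 1/sqrt(n_inputs_per_spike).
    isi = rng.exponential(1.0, size=(n_output_spikes, n_inputs_per_spike)).sum(axis=1)
    return isi.std() / isi.mean()

for n in (1, 16, 256):
    print(f"threshold = {n:3d} inputs  ->  CV = {output_cv(n):.3f}")
```

As the threshold grows, the output regularizes (CV ≈ 1/√N), which is exactly the discrepancy with the high CV values measured in cortex.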
This explanation seems reasonable for the conditions under which the variability of cortical neurons has typically been measured, when the firing rates have been low. However, Softky and Koch (1992, 1993) showed that this high variability, as measured by the coefficient of variation (the standard deviation divided by the mean) of the interspike intervals (CVISI ), persists even when cortical neurons produce high firing rates, with interspike intervals that are short relative to reasonable decay times resulting from membrane leakage. This inconsistency between the IF model and physiological measurements suggests that the IF model may not represent the essential nature of neuronal dynamics. Softky and Koch (1992, 1993) responded to this problem by arguing that cortical neurons act as coincidence detectors instead of as integrators, where the coincidence detection mechanism is likely to be located in the dendrites (see also Softky, 1994, 1995). Others, however, have attempted to preserve the essential elements of the IF neuron through reasonable modifications
of the model. Shadlen and Newsome (1994) suggested that IF neurons that integrate over many inputs are able to produce very irregular spike trains if the average amount of inhibition at the input is equal to the average amount of excitation. But other studies (Brown & Feng, 1999; Feng & Brown, 1998a; Feng, 1999; Shadlen & Newsome, 1998; Burkitt, 2000, 2001) have shown that even when the amount of inhibition is less than, but very close to, the amount of excitation, the CVISI of the output from the IF neuron can match measurements made in real cortical neurons. In addition, realistic CVISI values can be produced in the IF neuron by using long-tailed interspike-interval distributions for the input processes (Feng, 1997; Feng & Brown, 1998a, 1998b; Feng, Tirozzi, & Brown, 1998), by using correlated inputs (Stevens & Zador, 1998; Sakai, Funahashi, & Shinomoto, 1999; Shinomoto & Tsubo, 2001; Feng & Brown, 2000a; Feng, 2001; Feng & Zhang, 2001; Destexhe & Pare, 1999, 2000), or by using a postspike reset potential near threshold (Troyer & Miller, 1997; Bugmann, Christodoulou, & Taylor, 1997; Lansky & Smith, 1989). Furthermore, several studies have created networks of IF neurons where the network dynamics create highly variable outputs at the level of single neurons (Usher, Stemmler, Koch, & Olami, 1994; Usher, Stemmler, & Olami, 1995; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998). The primary measure that has been used to compare the variability of IF models to that of cortical neurons is the CVISI. But the CVISI is a limited measure of the variability of spike trains, since it is computed only from the variance and mean of the interspike intervals. It is not, for example, sensitive to temporal correlations between interspike intervals, which can drastically alter the overall variability of a spike train without altering the variability of the interspike intervals.
However, not only are temporal correlations present in the sequence of interspike intervals of many, if not most, of the neurons in subcortical and cortical sensory pathways, but these correlations persist for an extraordinarily long time (subcortical: Teich, 1989, 1996; Teich & Lowen, 1994; Lowen & Teich, 1996b; Teich, Johnson, Kumar, & Turcott, 1990; Turcott et al., 1994; Teich, Heneghan, Lowen, Ozaki, & Kaplan, 1997; Lowen, Ozaki, Kaplan, Saleh, & Teich, 2001; cortical: Teich, Turcott, & Siegel, 1996; Gruneis, Nakao, Yamamoto, Musha, & Nakahama, 1989; Gruneis, Nakao, & Yamamoto, 1990; Gruneis et al., 1993; Wise, 1981). The common mathematical term for these persistent correlations is long-range dependence (LRD). But particularly in less theoretical areas of study, LRD processes are often called second-order self-similar processes, fractal processes, or processes with power-law statistics. LRD produces behavior in the variance of a process that is unusual with regard to standard statistical procedures and, as will become clear, is an important component in understanding the variability of neural spike trains. However, its importance to the study of highly variable integrator models of cortical processing has rarely been acknowledged (but see Usher et al., 1994, 1995), and then only with regard to networks of interconnected IF neurons.
In this study, we investigate, using both mathematical analysis and computer simulations, whether single-neuron "high-variability" IF models can produce LRD while their CVISI values remain close to in vivo cortical measurements (usually within the range of 0.5 to 1.5). A similar investigation of network models would also be worthwhile but is beyond the scope here. In addition to the single-neuron models mentioned above, we also describe and investigate a new model for the high variability of cortical neurons, which is motivated by the fact that LRD is present in neurons that project into cortex. Before looking at these models, however, we must have a foundation for studying LRD in point processes, the mathematical processes usually employed to represent neural spike trains. In the following section, we describe the theory of LRD in point processes and methods for analyzing LRD in point process–like data, such as spike times in neural spike trains.
2 Long-Range Dependence: Definitions and Theory
2.1 Long-Range Dependence in Stochastic Processes. Most classical statistical estimators and tests are based on the assumption that the stochastic processes to which they are applied are short-range dependent. A stationary stochastic process {Xi : i ∈ Z} is said to be short-range dependent if the autocovariance Cov{X0, Xj} (and the autocorrelation) decays fast enough, as the separation, or lag, j is increased, that the infinite sum of these autocovariances converges to a finite constant. On a practical level, this means that an arbitrarily large, though perhaps less than total, amount of the influence of the past can be contained in a sufficiently large finite history of the process. In the frequency domain, the spectral density (or power spectrum) of a short-range dependent process has a finite limiting value as the frequency approaches zero.
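The convergence of the summed autocovariances can be checked numerically for a concrete short-range dependent process. The sketch below (illustrative, not from this article) simulates an AR(1) process, whose autocovariances decay geometrically, and shows that the partial sums of the sample autocovariances settle near the finite theoretical limit γ_0/(1 − φ).

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n = 0.5, 200_000

# Stationary AR(1): X_t = phi * X_{t-1} + Z_t, with unit-variance innovations.
z = rng.normal(size=n)
x = np.empty(n)
x[0] = z[0] / np.sqrt(1 - phi**2)   # draw X_0 from the stationary distribution
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]

# Sample autocovariances Cov{X_0, X_j} up to lag 200 and their partial sums.
xc = x - x.mean()
acov = np.array([np.dot(xc[: n - j], xc[j:]) / n for j in range(201)])
partial_sums = acov.cumsum()

# Theoretical limit: gamma_0 / (1 - phi) = [1 / (1 - phi**2)] / (1 - phi) = 8/3.
print(partial_sums[[10, 50, 200]])
```

For a long-range dependent process, these partial sums would instead grow without bound as the maximum lag increases.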
Although most “classical” types of stochastic processes are short-range dependent, observations from many “physical” processes are not (see Beran, 1994, for some examples). Such processes are said to be long-range dependent (LRD). The infinite sum of the autocovariances of an LRD process is unbounded since these autocovariances decay very slowly (Beran, 1994; Cox, 1984). Equivalently, the spectral density of an LRD process has a pole at zero frequency. Therefore, any finite history of such a process, no matter how large, will necessarily leave out an infinitely large amount of the influence from the past. LRD can result in grossly inaccurate results if classical statistical methods are applied to empirical data derived from processes that possess it (Beran, 1994; Cox, 1984). Therefore, it is important to recognize LRD when it occurs in stochastic processes. The divergence of the infinite sum of the autocovariances,
lim_{n→∞} Σ_{j=0}^{n} Cov{X_0, X_j} = ∞,   (2.1)
is perhaps most helpful in forming an intuitive understanding of LRD. It is important to note that LRD is an asymptotic property, related to the asymptotic form of the autocovariance function; consequently, no individual autocovariance is critical by itself. In other words, the autocovariance at any fixed finite lag can be arbitrarily large or arbitrarily small for either a short-range or a long-range dependent process. The factor that determines the "range" of the dependence is the relationship among the autocovariances at arbitrarily large lags. Thus, LRD results from the joint effect of these covariances, not from the effect of any single covariance. There are a number of different possibilities for defining LRD for arbitrary stochastic processes. Either the unbounded infinite sum of the autocovariances or the spectral density pole at zero frequency would suffice as a definition (see, e.g., Beran, 1994). But following Daley and Vesilo (1997), and because of its ease of use both analytically and empirically, we will use the asymptotic behavior of the variance of the sum of consecutive random variables in the process to define LRD.
Definition 1. A stationary stochastic process {Xi : i ∈ Z}, for which each random variable Xi has finite variance, exhibits LRD when

lim sup_{n→∞} Var{Σ_{i=1}^{n} X_i} / n = ∞.

2.2 Long-Range Dependence in Stochastic Point Processes. Spike trains are not usually modeled with stochastic processes like those considered in section 2.1. Instead, they are typically represented mathematically by point processes, where each point represents the time of occurrence of a spike. A point process on the real line (and a neural spike train in time) can be described either by its interpoint distances or by the numbers of points in any arbitrary set of intervals on the real line.
If the points of a point process are given by the increasing sequence {τi : i ∈ Z}, then the sequence of intervals is given by {Yi = τi − τi−1 : i ∈ Z}. The number of points in an arbitrary interval (a, b], a < b, is given by N(a, b] = N((a, b]) = #{i : τi ∈ (a, b]}. Then, for example, the counts in a sequence of adjacent intervals of length T > 0 would be given by {N(a + (i − 1)T, a + iT] : i ∈ Z}, where a is any real number. Since either the sequence of interpoint intervals or the sequence of counts in adjacent intervals (or both) can be LRD, there are two ways in which a point process can exhibit LRD. Again following Daley and Vesilo (1997), we will call these two types of LRD long-range interval dependence and long-range count dependence, respectively. Thus, we have the following two definitions:
Definition 2. (Daley & Vesilo, 1997)1 A stationary point process N(·) on the real line exhibits long-range interval dependence (LRiD) when the stationary sequence of interpoint intervals {Yi}, with finite variances, is LRD in the sense that

lim sup_{n→∞} Var{Σ_{i=1}^{n} Y_i} / n = ∞.
Definition 3. (Daley & Vesilo, 1997) A second-order stationary point process N(·) on the real line exhibits long-range count dependence (LRcD) when

lim sup_{t→∞} Var{N(0, t]} / t = ∞.
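Both descriptions used in these definitions are easy to extract from a list of spike times; a sketch with numpy (toy data, names illustrative):

```python
import numpy as np

spike_times = np.array([0.8, 2.1, 2.7, 5.0, 5.4, 7.9, 9.2])   # seconds (toy data)

# Interval description: Y_i = tau_i - tau_{i-1}.
intervals = np.diff(spike_times)

# Count description: N(a + (i-1)T, a + iT] for adjacent windows of length T.
T, a = 2.0, 0.0
edges = np.arange(a, spike_times.max() + T, T)
counts, _ = np.histogram(spike_times, bins=edges)

print(intervals)
print(counts)
```

Definition 2 concerns the variance of sums of consecutive entries of `intervals`; definition 3 concerns the variance of the entries of `counts` as the observation window grows.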
The definition of LRiD is a direct analog of the general definition for LRD (see definition 1). The definition for LRcD, however, takes a slightly different form in order to avoid the use of a counting-interval-length parameter T. This parameter is superfluous with regard to the definition of LRcD since

Σ_{i=1}^{n} N((i − 1)T, iT] = N(0, nT],

and nT goes to infinity as n goes to infinity.
2.3 Long-Range Dependence in Stochastic Renewal Point Processes. The conditions under which LRD is present in renewal point processes, point processes in which all interpoint intervals are mutually independent, will play a critical role in many of the arguments in this study. Clearly, a renewal point process cannot be LRiD, since by definition there is no dependence between the intervals,2 but it can be LRcD (e.g., Daley & Vesilo, 1997; Lowen & Teich, 1993b, 1993c). But under what conditions, if any, can a renewal point process be LRcD? As it turns out, this question has a straightforward answer: a renewal point process is LRcD when the variance of the interpoint intervals is infinite. This was stated by Daley (1999), but only a terse argument was included there. Due to the abbreviated nature of those comments and some confusing elements in their argument, we will precisely state this result next. The proof, which follows Daley's argument, may be found in the appendix.
1 The original definition of Daley and Vesilo (1997) does not contain the phrase "with finite variances," though the finiteness of the variances of the intervals is necessary and may have been implied.
2 This is true even if the variance of the interval distribution is infinite, even though definition 2 is not applicable to that case. If an extension of the concept of LRiD to the infinite interval variance case is to be meaningful, a point process with independent intervals cannot also be LRiD.
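The claim above, that infinite interval variance alone produces LRcD in a renewal process, can be probed numerically. The sketch below (illustrative, not from this article) compares Monte Carlo estimates of Var{N(0, t]}/t for a renewal process with Pareto-type intervals of tail index 1.5 (finite mean, infinite variance) against a Poisson process of the same rate: the ratio stays flat for the Poisson process but grows with t in the heavy-tailed case, the signature of LRcD.

```python
import numpy as np

rng = np.random.default_rng(2)

def count_variance_ratio(draw_intervals, t, n_trials=400):
    """Monte Carlo estimate of Var{N(0, t]} / t over independent realizations."""
    counts = np.empty(n_trials)
    for k in range(n_trials):
        times = np.cumsum(draw_intervals(rng))
        counts[k] = np.searchsorted(times, t)   # number of points in (0, t]
    return counts.var() / t

# Pareto-type intervals, tail index 1.5: mean 3, infinite variance.
heavy = lambda g: g.pareto(1.5, size=60_000) + 1.0
# Exponential intervals with the same mean: an ordinary Poisson process.
poisson = lambda g: g.exponential(3.0, size=60_000)

for t in (1e3, 1e4, 1e5):
    print(f"t = {t:.0e}   heavy-tailed: {count_variance_ratio(heavy, t):9.2f}"
          f"   Poisson: {count_variance_ratio(poisson, t):6.3f}")
```

For the Poisson process the ratio is simply the rate (1/3 here); the heavy-tailed estimates are noisy, as expected near an infinite-variance regime, but their growth with t is evident.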
Theorem 1. A stationary renewal point process with distribution function F of the generic interpoint-interval random variable X, which has F(0) = 0 and finite mean µ = E{X}, is LRcD if and only if E{X^2} = ∞.3
2.4 The Hurst Indices. The strength of LRD may be measured using the Hurst index. Strictly speaking, the Hurst index is a measure of the self-similarity of a stochastic process (e.g., Beran, 1994; Samorodnitsky & Taqqu, 1994; Jackson, 2003, sect. 3.2.3). A (nonconstant) self-similar process Y(t) is necessarily nonstationary (Beran, 1994), but its increments X(t) ≡ Y(t) − Y(t − s), for some fixed s > 0, may form a stationary, long-range dependent process. If we associate the Hurst index H of the self-similar process Y with its incremental process X, then the Hurst index becomes a measure of LRD. If the covariances of the stationary increments of a self-similar process exist and decay to zero as the lag is increased, then 0 < H < 1 (Beran, 1994). For the present purposes, we can ignore the range 0 < H < 0.5, since the conditions for this to occur in a physical process are very unstable (Beran, 1994). When H = 0.5, the increment process has no dependence or memory. An example of a self-similar process with stationary increments and a Hurst index of H = 0.5 is Brownian motion, of which "white" gaussian noise is the stationary increment process. But when 0.5 < H < 1, the increment process is LRD, and larger values of H are associated with stronger LRD in the increment process. In general, however, the Hurst index does not need to be associated with a self-similar process or its increments. Instead of being a "self-similarity parameter," it may be thought of as a "long-memory parameter" for stationary processes.4 Further discussion of this connection may be found in Jackson (2003). An alternate definition of the Hurst index is

H = sup{ h : lim sup_{n→∞} Var{Σ_{i=1}^{n} X_i} / n^{2h} = ∞ }.

This definition is more closely related to the concept of long-range dependence and is applicable to all stationary stochastic processes, but it is still consistent with the prior interpretation of the Hurst index in the context of self-similar processes. For point processes, Daley (1999) has defined the Hurst index in an analogous manner with respect to its counting process.
3 Theorem 1 is restricted to stationary processes. However, since LRcD is a property of the limiting behavior of the process, the renewal point process need only be asymptotically stationary for the result to apply. This proves useful, for example, since we often start a renewal point process with a point at the origin. Such a process is not stationary but is asymptotically stationary.
4 In fact, the Hurst index derives its name from the exponent parameter that Hurst (1951) used to demonstrate the existence of long-range dependence in records of the level of the Nile River and records of other geophysical processes.
Definition 4. (Daley, 1999) The Hurst index of a stationary, ergodic,5 orderly6 point process N(·) with finite second moment is

H = sup{ h : lim sup_{t→∞} Var{N(0, t]} / t^{2h} = ∞ }.

Since for the process assumed in this definition, Var{N(0, t]} = o(t^2) for t → ∞, the Hurst index must be no larger than one. On the other hand, as long as it is not the case that the point process is zero with probability one,

lim sup_{t→∞} Var{N(0, t]} / t^{2h} = ∞   for any h ≤ 0.
So the Hurst index for this point process must be nonnegative. Thus, 0 ≤ H ≤ 1, which is the same range that we had for self-similar processes, save the possible inclusion of the end points. Furthermore, according to definition 3, if a point process is LRcD, then H ≥ 0.5. In practice, however, most non-LRcD stochastic point processes have H = 0.5, while LRcD point processes have H > 0.5. Continuous physical processes with H < 0.5 are statistically unstable and are therefore rare in practice. Point processes with H < 0.5 should share this property and are not useful, as far as we know, for modeling neural spike trains. The Hurst index of definition 4, however, quantifies the strength of only LRcD in a point process. In order to quantify the strength of LRiD, we need an index of the dependence between the interpoint intervals. The Hurst index H of a point process was defined on the basis of the counts of the process, but we may also define HI, the Hurst index of the intervals of the process.
Definition 5. Let N(·) be a stationary, ergodic, orderly point process with finite second moment, and let {Yi} be the stationary sequence of its interpoint intervals, also with finite second moment. Then the interval Hurst index of this point process is

HI = sup{ h : lim sup_{n→∞} Var{Σ_{i=1}^{n} Y_i} / n^{2h} = ∞ }.

As for the (count) Hurst index, 0 ≤ HI ≤ 1 (Jackson, 2003), and, according to definition 2, if the point process is LRiD, then HI ≥ 0.5. Furthermore, larger values of HI signify stronger LRiD.
5 A stationary point process N(·) with finite mean density m = E{N(0, 1]} is ergodic if Pr{lim_{x→∞} N(0, x]/x = m} = 1 (Daley, Rolski, & Vesilo, 2000).
6 A point process N(·) on the real line is orderly if Pr{N(t, t+δ] > 1} = o(δ), for all t ∈ R. This implies that the point process has no multiple simultaneous occurrences (see, e.g., Cox & Isham, 1980; Daley & Vere-Jones, 1988).
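In practice these indices must be estimated from data. One common estimator (the aggregated-variance method, a standard technique rather than anything specific to this article) uses the fact that for a stationary sequence with Hurst index H, the variance of block means of size m scales as m^{2H−2}, so a log-log fit gives H from the slope:

```python
import numpy as np

rng = np.random.default_rng(3)

def aggregated_variance_hurst(x, block_sizes):
    """Estimate H from the scaling Var(block mean of size m) ~ m^(2H - 2):
    the slope of a log-log fit of aggregated variance against m is 2H - 2."""
    log_m, log_v = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        block_means = x[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(block_means.var()))
    slope, _ = np.polyfit(log_m, log_v, 1)
    return 1.0 + slope / 2.0

# White gaussian noise has no memory, so the estimate should be near H = 0.5;
# an LRD sequence would instead give an estimate between 0.5 and 1.
x = rng.normal(size=2**18)
h = aggregated_variance_hurst(x, [2**k for k in range(2, 12)])
print(f"H estimate for white noise: {h:.3f}")
```

Applied to interspike intervals, this estimates HI; applied to counts in adjacent windows, it estimates the count index H.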
2.5 Relationship Between the Different Types of Long-Range Dependence and the Variability of Intervals. The two types of LRD, LRcD and LRiD, are not independent. In this section, we give a brief overview of the relationship between these two types of LRD and their relationship to the variability of the interpoint intervals. Jackson (2003) contains a more thorough discussion of these relationships and makes explicit what has been proven mathematically and what has not. A complete mathematical theory of this relationship, however, is not currently available. In general, LRcD can be associated with LRiD, infinite-variance interpoint intervals, or both. Presumably, LRcD cannot exist in a point process without the presence of at least one of these two properties of the interpoint intervals. Even in these seemingly straightforward statements, a complication in the theory of LRD in point processes arises. How can the intuitively plausible circumstance of simultaneous LRiD and infinite-variance interpoint intervals occur? LRiD is defined (according to definition 2) only when the interpoint intervals have finite variance, which derives from the fact that LRD is defined only for finite-variance processes (see definition 1). Unfortunately, at the present time, a general definition of LRD for infinite-variance processes does not exist. However, several methods for handling this situation have been applied in the past (see Jackson, 2003). One method for distinguishing LRD in infinite-variance processes makes use of the sample correlation. Although, strictly speaking, the correlation is not defined for infinite-variance random variables, the sample correlation can obviously be calculated for a sample from an infinite-variance process. Assuming that the sample correlation retains pertinent properties when applied to an infinite-variance process, its asymptotic properties should be useful for distinguishing short-range and long-range dependence. 
The study of Davis and Resnick (1986) supports this argument. They found that in the case of the moving average process,
X_n = Σ_{j=−∞}^{∞} c_j Z_{n−j},
where {Zi} is a sequence of independent and identically distributed random variables called the innovations, the sample correlation converges to the same function of the coefficients {cj} whether the innovations {Zi} have finite or infinite variance. In this study, we will essentially use this strategy for handling infinite-variance processes when we consider LRiD in both simulated and empirical neural spike trains. Thus, we will assume that the sample statistics will asymptotically behave in a similar manner whether the intervals have infinite variance or not. Furthermore, we will see that results based on this assumption are coherent.
2134
B. Jackson
2.6 Statistical Procedures for Recognizing Long-Range Dependence in Point Processes 2.6.1 Shuffled Surrogate Data. Shuffled surrogate data can prove useful in determining the relative contributions of infinite-variance intervals and LRiD to LRcD in a point process or a neural spike train. Teich, Lowen, and their coworkers (Teich, Turcott, & Lowen, 1990; Lowen & Teich, 1992; Teich & Lowen, 1994; Turcott, Barker, & Teich, 1995; Lowen & Teich, 1996b; Teich et al., 1996, 1997; Turcott & Teich, 1996) have made extensive use of this method to “distinguish those properties of the data that arise from correlation among intervals from those properties inherent in the form of the [distribution of interval lengths]” (Turcott & Teich, 1996). The recent work of Daley and his coworkers (Daley & Vesilo, 1997; Daley, 1999; Daley et al., 2000) and Kulik and Szekli (2001) further develops an understanding of the information that is available from the shuffled surrogate data. A set of shuffled surrogate data is formed by randomly shuffling the interpoint intervals of a finite sample from a stochastic point process or the interspike intervals of a finite spike train. The surrogate point process formed from the shuffled intervals has the same distribution of interpoint intervals as the original, but the dependency structure of the interpoint intervals has been disrupted. Thus, the surrogate point process is essentially equivalent to a sample from a renewal point process that has an interpoint interval distribution equivalent to the marginal distribution of the intervals in the original point process. Therefore, the surrogate point process cannot be LRiD, and any LRcD present is due to the high variability of the interspike interval distribution (Daley, 1999; theorem 1). Consequently, we expect that for a point process with no LRiD, the “amount” of LRcD in the surrogate data would be equivalent to that in the original data, since no LRcD would be destroyed by the shuffling procedure. 
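The shuffling procedure just described is straightforward to implement. A minimal sketch (function and variable names are mine): the interspike intervals are randomly permuted and the spike times rebuilt, which preserves the interval distribution while destroying any serial dependence among the intervals.

```python
import numpy as np

def shuffled_surrogate(spike_times, rng):
    """Surrogate spike train formed by randomly permuting the interspike
    intervals of a finite sample; the marginal interval distribution is
    preserved, but the dependency structure of the intervals is disrupted."""
    spike_times = np.asarray(spike_times)
    intervals = np.diff(spike_times)
    surrogate_intervals = rng.permutation(intervals)
    # Rebuild spike times from the shuffled intervals, keeping the first spike.
    return np.concatenate(([spike_times[0]],
                           spike_times[0] + np.cumsum(surrogate_intervals)))

rng = np.random.default_rng(1)
original = np.cumsum(rng.exponential(scale=0.02, size=1000))  # a sample spike train
surrogate = shuffled_surrogate(original, rng)
```

The surrogate has exactly the same multiset of interspike intervals (and hence the same total duration) as the original train.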
Furthermore, for a point process with finite-variance interpoint intervals, we expect that the surrogate data would have no LRcD, since only infinite-variance intervals can create LRcD in the shuffled data. The results of these arguments are presented in Table 1 as the four possible combinations of finite- or infinite-variance intervals and LRiD or no LRiD. 2.6.2 Statistical Functions for the Variance of Aggregations. In order to investigate the presence of LRD in neural spike trains, we need to examine the behavior of the variance of aggregations of random variables as the aggregation size increases. Consider a discrete, stochastic time series X1 , X2 , X3 , . . . , and let
X^(M) = Σ_{i=1}^{M} X_i
Long-Range Dependence in Models of Cortical Variability
2135
Table 1: The Four Different Scenarios for the Presence or Absence of Long-Range Dependence in a Point Process and Their Effect on Shuffled Surrogate Data.

Variance of   LRiD in          LRcD in          LRcD in           "Amounts"
Intervals     Original Data?   Original Data?   Surrogate Data?   of LRcD
-----------------------------------------------------------------------------------
Finite        No               No               No                Original = Surrogate
Finite        Yes              Yes              No                Original > Surrogate
Infinite      No               Yes              Yes               Original = Surrogate
Infinite      Yes              Yes              Yes               Original > Surrogate

Note: The four possibilities are determined by whether the interpoint intervals have finite variance and whether the point process is LRiD.
be the aggregation of M adjacent random variables in the time series. Then we have chosen to plot

(1/M) Var{X^(M)}   (2.2)
versus M. This is slightly different from the standard variance-time curve but has some practical advantages. Our method results in a curve that will have a slope of zero if there is no dependence between the random variables, a positive slope if there is positive dependence, and a negative slope if there is negative dependence. This facilitates simple qualitative interpretation of these graphs. Furthermore, equation 2.2 yields expressions that differ by only a constant from the limit arguments in the definitions of LRcD and LRiD. We will make this statement more explicit as we introduce the precise functions that we have chosen to graph.

2.6.3 The Index-of-Dispersion Curves. There are two index-of-dispersion curves that prove useful in investigating LRD properties in point processes and spike trains: the index-of-dispersion curve of the counts, also known as the Fano factor curve (FFC), and the index-of-dispersion curve of the interpoint intervals (IDC). Each is the graph of the variance of an aggregation of random variables, divided by a function of the mean of the aggregated variables, versus the aggregation size. The FFC is the graph of the function (e.g., Fano, 1947; Teich, 1989; Lowen & Teich, 1991; Thurner et al., 1997; Cox & Isham, 1980)

F(t) = Var{N(0, t]} / E{N(0, t]} = (Var{N(0, t]} / t) · (1 / E{N(0, 1]}),   (2.3)
while the IDC is the graph of the function (e.g., Cox & Isham, 1980),
I(k) = Var{Σ_{i=1}^{k} X_i} / (k (E{X})²) = (Var{Σ_{i=1}^{k} X_i} / k) · (1 / (E{X})²),   (2.4)
where {X_i} is the sequence of interpoint intervals and X is the generic interpoint-interval random variable. Both F(t) and I(k) are equal to one for all t and k, respectively, when the point process is a Poisson process. In both equations 2.3 and 2.4, the last expression splits the definition into two parts. The second part of each of these expressions is a constant, while the first is the limit argument in the definition of either LRcD or LRiD. Thus, the FFC will increase without bound with increasing t when the point process is LRcD, and the IDC will increase without bound with increasing k when the point process is LRiD. Furthermore, when the intervals of a stationary point process have finite variance, the FFC and IDC are asymptotically related by the equation (Cox & Isham, 1980)

lim_{k→∞} I(k) = lim_{t→∞} F(t).   (2.5)
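Equations 2.3 and 2.4 suggest direct estimators from a finite spike train; note that I(k) is exactly the aggregated-interval variance statistic of equation 2.2 normalized by the squared mean interval. For a homogeneous Poisson process both estimates should sit near one at every scale. A minimal sketch (names are mine):

```python
import numpy as np

def fano_factor(spike_times, t, T):
    """F(t): variance over mean of spike counts in adjacent windows of length t."""
    counts, _ = np.histogram(spike_times, bins=np.arange(0.0, T + t, t))
    return counts.var() / counts.mean()

def interval_dispersion(intervals, k):
    """I(k): variance of sums of k adjacent intervals over k * (mean interval)^2."""
    x = np.asarray(intervals)[: (len(intervals) // k) * k]
    sums = x.reshape(-1, k).sum(axis=1)          # non-overlapping aggregates
    return sums.var() / (k * np.mean(intervals) ** 2)

rng = np.random.default_rng(3)
T, rate = 2000.0, 50.0
isi = rng.exponential(scale=1.0 / rate, size=int(1.2 * rate * T))
times = np.cumsum(isi)
times = times[times < T]                          # a Poisson spike train on (0, T]

F = fano_factor(times, t=1.0, T=T)
I_k = interval_dispersion(np.diff(times), k=10)
```

For an LRcD or LRiD process, the same estimators evaluated over a range of t or k values trace out the diverging curves discussed in the text.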
From this we can prove the following relationship between LRcD and LRiD:

Proposition 1. Let N(·) be a stationary, ergodic, orderly point process with finite variance (i.e., E{[N(A)]²} < ∞ for bounded A) and a stationary sequence of intervals {X_i : i ∈ N} that has finite variance. Then N is LRcD if and only if it is LRiD.

Proof. This result follows directly from equations 2.5, 2.4, and 2.3 and the definitions of LRcD and LRiD.

Regardless of whether the point process is in any way LRD, the FFC and IDC can also convey information about other (short-range) dependencies. The FFC and IDC begin, at low values of t and k, respectively, at particular values and then increase and decrease depending on the correlation properties of the process. The limit of the FFC as t approaches zero is always one if the point process is stationary and orderly (Jackson, 2003). The IDC has a realizable lower limit at k = 1, where it is equal to Var{X}/(E{X})². Decreases in the FFC or IDC are caused by negative correlations in the process itself or between the interpoint intervals, respectively, while increases are caused by positive correlations. Since correlations in the interpoint intervals of a process create same-sign correlations in the process itself, increases or decreases in the FFC can be accompanied by similar increases or decreases in the IDC. However, decreases in the FFC may also be caused by a marginal interval distribution with a lower variance than that of the exponential distribution (the interval distribution for a Poisson process) with an equivalent mean,
while increases may be caused by a marginal interval distribution with a higher variance than that of the equivalent-mean exponential distribution. These different effects on the shape of the FFC may be teased apart using the IDC and surrogate data FFC. As already noted, the IDC will convey information about the correlations between interpoint intervals, but the surrogate data FFC will not be affected by such correlations in the original process. Instead, increases and decreases in the surrogate data FFC will be dependent on only the shape of the marginal interval distribution. Figure 1 contains sets of estimated FFCs and IDCs for several different types of LRD point processes. These curves are plotted using a logarithmic scale on both the abscissa and the ordinate since LRD processes typically produce curves that approximate a power law (a straight line on these “loglog” plots) at large counting intervals or large interpoint interval aggregations. Furthermore, processes with stronger LRD (and higher Hurst indices) have curves with larger asymptotic “log-log” slopes.
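The power-law growth just described is easy to reproduce for the scenario of Figure 1a: a renewal process whose intervals have infinite variance (here a classical Pareto distribution with tail index α = 1.5, my illustrative choice) has a Fano factor curve that grows roughly like t^(2−α) at large counting intervals. A sketch:

```python
import numpy as np

def fano_factor(spike_times, t, T):
    """F(t): variance over mean of spike counts in adjacent windows of length t."""
    counts, _ = np.histogram(spike_times, bins=np.arange(0.0, T + t, t))
    return counts.var() / counts.mean()

rng = np.random.default_rng(6)
alpha = 1.5                                         # 1 < alpha < 2: finite mean,
intervals = rng.pareto(alpha, size=1_000_000) + 1.0  # infinite variance
times = np.cumsum(intervals)                        # an LRcD renewal "spike train"
T = times[-1]

# The FFC of this process grows without bound, approximately as t^(2 - alpha),
# so the Fano factor at a long counting interval dwarfs that at a short one.
F_short = fano_factor(times, t=10.0, T=T)
F_long = fano_factor(times, t=10_000.0, T=T)
```

Because the process is renewal, shuffling its intervals would leave these curves statistically unchanged, as in Figure 1a.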
[Figure 1 appears here. Each of the three panels (a, b, c) shows a Fano factor curve plotted against counting interval (sec) and an index-of-dispersion curve plotted against number of aggregated intervals, on log-log axes, with original data as solid lines and shuffled data as dashed lines.]
Figure 1: Fano factor curves and curves of the index of dispersion of intervals for finite-length samples from three point-process types related to neural spike trains and models. Shuffled surrogate data (dashed lines) are formed by randomly shuffling the interpoint intervals of the original sample data from the point process. (a) Curves for a renewal process with an interval distribution possessing infinite variance. (b) Curves for a point process with intervals that are long-range dependent and have a distribution with finite variance. The three Fano factor curves for shuffled surrogate data and three sets of index-of-dispersion curves are from processes with interval variability greater than a Poisson process (top), equal to a Poisson process (middle), and less than a Poisson process (bottom). (c) Curves for a point process with intervals that are long-range dependent and have a distribution with infinite variance.
Figure 1a depicts the FFCs and IDCs for samples from an LRcD renewal point process, which, of course, is not LRiD. In this case, the interpoint intervals are completely independent, and LRcD is present only if the interval distribution has infinite variance (Daley, 1999; theorem 1). Thus, shuffling the intervals does not change the statistical properties of the point process, and the FFCs for the original and surrogate data are nearly the same, as are the IDCs for the original and surrogate data. Furthermore, the IDCs are essentially horizontal lines, owing to the independence of the intervals. Theoretically, the values of the index of dispersion of aggregated intervals should be infinite, since the variance of the interpoint interval distribution is infinite. But since they are derived from estimated values from a limited sample of the process, the IDCs are at finite values. These estimated values, however, are high, being approximately equivalent to the highest values of the corresponding FFCs. If the length of the sample from the renewal process is gradually increased, then the trend should be for these calculated IDCs to move upward without bound. Figure 1b illustrates representative FFCs and IDCs for point processes with LRiD and finite-variance interpoint intervals. In this case, although the FFC for the original data increases without bound, the FFCs for surrogate data approach a constant limiting value. If the standard deviation of the intervals is equal to their mean, as in the case of exponentially distributed intervals, then the asymptotic value of the surrogate data FFC will be one. If the interval variance is larger than in this case, then the surrogate data FFC will asymptotically approach a value greater than one, and if it is less, the FFC will approach a value less than one. The LRiD of the point process is evident in the ever-increasing original data IDCs. 
The IDCs for the shuffled surrogate data, on the other hand, are horizontal lines, since the shuffling procedure destroys the serial dependence between the intervals and the intervals have finite variance. The three sets of IDCs (top, middle, and bottom) correspond to the three surrogate data FFCs (top, middle, and bottom, respectively). So if the standard deviation of the intervals is equal to their mean, as in the case of exponentially distributed intervals, then the IDC values for a single “aggregated” interval are equal to one. If the interval variance is larger than the mean, then the IDCs have a value greater than one for a single interval, and if it is less than the mean, then the IDCs have a value less than one for a single interval. Finally, Figure 1c depicts the FFCs and IDCs for an LRcD point process that both is LRiD and has intervals distributed with infinite variance. In this case, the FFC for the shuffled surrogate data increases without bound due to the infinite variance of the interval distribution. However, since the intervals are also LRD, this FFC is not the same as the FFC for the original data. Instead, the surrogate data FFC has a shallower slope than the original data FFC due to the loss of LRiD caused by the shuffling procedure. The IDCs in Figure 1c do not differ significantly in shape from those in Figure 1b, although the point processes are statistically different in these two cases.
These IDCs convey the fact that the point process is LRiD, as is evident from the ever-increasing IDC for the original data. Furthermore, whereas the slopes of the original data IDCs in Figure 1b are approximately the same as those of the corresponding FFCs, the slope of the original data IDC in Figure 1c is shallower than the original data FFC. In terms of the Hurst indices, this means that in Figure 1b HI ≈ H, while in Figure 1c, HI < H. This implies that in the latter case, only some of the LRcD is due to LRiD. The rest is due to the infinite variance of the intervals. However, as in Figure 1a, the infinite variance of the interval distribution is not evident in the IDCs alone due to the finite sample time. But as the length of the sample increases, the vertical position of these curves should, on average, move upward without bound. Thus, we have defined a statistical procedure, the FFC, for recognizing LRD in the counts of point process data, and another, the IDC, for recognizing LRD in the intervals of point process data.7 Together, these two statistical curves can often detect the presence of infinite variance in the interpoint intervals as well. Furthermore, by comparing these statistical curves for an original set of data with those for a set of data obtained by randomly shuffling the original interpoint intervals, a more robust and sensitive indicator of the presence of LRcD, LRiD, and infinite interval variance is produced. Hence, using the original data FFC and IDC and the surrogate data FFC and IDC in combination is a good strategy for discerning among the four potential scenarios of Table 1 for LRD in point process data, such as those encountered when studying neural spike trains. In addition, this analysis provides information on the strength of LRcD in the data and the relative contributions of LRiD and infinite interval variance to its presence. 2.7 Long-Range Dependence in Neural Spike Trains. 
The spike trains of many neurons possess long-range dependence, although this term is infrequently used in the neurophysiological literature. Sometimes this property is called long-term correlation, a designation that is quite similar to long-range dependence, but in other contexts, more specialized terms such as second-order self-similarity, fractal behavior, and power law (second-order) statistics are used. The use of the latter terms derives from the fact that the correlation function (and other equivalent second-order statistical functions) of a long-range-dependent process usually approximates the form of a power law function. Also, since the power law form of the power spectral density, the Fourier transform of the correlation function, is 1/f^α for 0 < α < 2 under these conditions, long-range-dependent spike trains are sometimes designated as having 1/f fluctuations.

A large majority of the spike trains from mammalian sensory systems, at both the subcortical and cortical levels, that have been assayed for long-range dependence also exhibit it, and long-range dependence has also been found in the visual system of certain insects (Turcott et al., 1995). Long-range dependence has been found at many different levels of the mammalian visual and auditory systems, including the primary auditory nerve (Teich, 1989; Teich & Lowen, 1994; Lowen & Teich, 1996b), the lateral superior olive (Teich et al., 1990; Turcott et al., 1994), the retina and lateral geniculate nucleus (Teich, 1996; Teich et al., 1997; Lowen et al., 2001), and the visual cortex (Teich et al., 1996). Furthermore, long-range dependence has been found in other sensory and nonsensory systems: in ventrobasal neurons of the thalamus (Kodama, Mushiake, Shima, Nakahama, & Yamamoto, 1989), the somatosensory cortex (Wise, 1981), the mesencephalic reticular formation (Yamamoto, Nakahama, Shima, Kodama, & Mushiake, 1986; Gruneis et al., 1989, 1990, 1993), and the hippocampus (Mushiake, Kodama, Shima, Yamamoto, & Nakahama, 1988; Kodama et al., 1989). One notable exception to the existence of long-range dependence in mammalian sensory neurons is the peripheral vestibular system (Teich, 1989). But since the peripheral vestibular system is the only system, to the best of our knowledge, that has produced a negative result in assays for LRD, LRD seems to be a pervasive property of spike trains in most sensory neural systems and may be common in other neural systems as well. The articles mentioned above primarily analyze the count statistics of spike trains, but none of these papers refers to long-range count dependence as formulated in definition 3. However, they all illustrate statistical characteristics of spike trains that are related to long-range dependence. Often the Fano factor curves for these spike trains are shown to diverge for large counting times, implying that the spike trains exhibit LRcD. Other studies used the power spectral density instead, showing that it diverges as frequency decreases to zero.

7 Further discussion and more detailed mathematical analysis of the FFC and IDC may be found in Jackson (2003).
This also implies that a spike train is LRcD (see lemma 2.3 in Jackson, 2003). Although the presence of LRcD can be supported by their analyses, many of the studies mentioned above did not make use of procedures that can discern the presence of LRiD. Nevertheless, Teich, Lowen, and their coworkers often calculated FFCs for shuffled surrogate data, which can be informative with respect to LRiD, in their studies. In subcortical auditory and visual neurons, they found that FFCs for shuffled surrogate data asymptote to a constant less than one (Teich, 1989, 1996; Teich et al., 1990, 1997; Teich & Lowen, 1994; Turcott et al., 1994; Lowen & Teich, 1996b; Lowen et al., 2001). Thus, the LRcD in these systems is a result of LRiD, and the intervals have sub-Poissonian variability. In primary visual cortex, Teich et al. (1996) also found that the FFCs for shuffled surrogate data asymptote to a constant, but in this case the constant was greater than one. These results signify that as in subcortical neurons, the LRcD in the spike trains is due to LRiD. However, unlike the interspike intervals in subcortical neurons, those in cortical neurons are more highly variable than the intervals of a Poisson process. Therefore, although the FFCs for the original spike trains from both subcortical and cortical neurons are similar to that in Figure 1b, the FFCs for shuffled surrogate data are different between these two sections of the nervous system. The surrogate data FFCs for subcortical neurons are similar to the lowermost dashed curve in Figure 1b, whereas those for cortical neurons are similar to the uppermost dashed curve.

3 Integrate-and-Fire Models That Produce Renewal Point Process Outputs

3.1 The Balanced Excitation-Inhibition Model. Gerstein and Mandelbrot (1964) first proposed that the output of a neuron is highly variable if there is a balance between the amounts of excitation and inhibition in its synaptic inputs. Later, Shadlen and Newsome (1994) used this idea as a solution to the apparent incompatibility, noted by Softky and Koch (1992, 1993), of temporal integration and high variability for cortical neurons. Work on this solution was further extended in several subsequent studies (Brown & Feng, 1999; Feng & Brown, 1998a; Feng, 1999; Shadlen & Newsome, 1998; Burkitt, 2000, 2001). In this section, we evaluate whether excitation-inhibition balance can produce LRcD, in addition to high variability, in simple IF models. We restrict ourselves to the simplest of these models since (1) they can be handled analytically, (2) they include the primary models considered in the literature, and (3) more complex models are covered by the arguments for generic renewal point processes in the subsequent section. This exercise will serve to illustrate the conflict between LRcD and realistic values of CVISI as it occurs in a particular type of renewal model. The model that we are considering here is the basic IF model, without leakage or reversal potentials, that has both excitatory and inhibitory inputs, all of which are Poisson processes and mutually independent.
Each event, or "spike," arriving on an input causes a postsynaptic potential (PSP): a depolarizing potential (EPSP) for an excitatory input or a hyperpolarizing potential (IPSP) for an inhibitory input. Both EPSPs and IPSPs are Dirac delta functions, causing instantaneous jumps in the voltage, and, for simplicity and because exact analytical results are available in this case, we begin with EPSPs and IPSPs that are equal in amplitude. We denote the EPSP amplitudes by aE > 0 and the IPSP amplitudes by aI > 0. Thus, in the present case, we can let a = aE = aI. Furthermore, we let ME and MI denote the number of excitatory and inhibitory inputs, respectively, and λE and λI denote the input rate for each excitatory and inhibitory fiber, respectively. Since all inputs are independent Poisson processes, they can be combined such that only two effective inputs need to be considered: an excitatory input of rate E = ME λE and an inhibitory input of rate I = MI λI. The integral, with respect to time, of all of the resulting postsynaptic potentials is a random walk, which forms the time-varying potential, V(t), of the IF model. When this potential crosses a predetermined, constant threshold, Vth, an output spike is initiated. Following an output spike, the voltage is reset to its resting level, v0, and the process starts anew. Since the inputs are Poisson processes, the postsynaptic potentials are delta functions, and the IF neuron completely resets at each occurrence of an output spike, the output of the present model is clearly a renewal process. Hence, the output is completely specified by its interval distribution or, equivalently, the first passage time of the potential V(t) to level Vth from V(0) = v0. Tuckwell (1988) has derived the interval density, f(t), for the output of this model:

f(t) = (θ̂/t) (E/I)^{θ̂/2} e^{−(E+I)t} I_{θ̂}(2√(E·I) t),   t > 0,   (3.1)
where θ̂ = ⌈(Vth − v0)/a⌉ is the number of excitatory inputs required for V(t) to cross threshold, ⌈x⌉ is the least integer greater than or equal to x, and I_ρ(x) is the modified Bessel function

I_ρ(x) = Σ_{k=0}^{∞} (1/(k! Γ(k + ρ + 1))) (x/2)^{2k+ρ}.
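As a numerical sanity check on equation 3.1, the density can be integrated directly: with E > I it should integrate to one and have mean θ̂/(E − I) (equations 3.2 and 3.3 below). A sketch, with parameter values of my own choosing, using an exponentially scaled Bessel function so the large exponentials cancel stably:

```python
import numpy as np
from scipy.special import ive      # ive(v, x) = iv(v, x) * exp(-x): overflow-safe
from scipy.integrate import quad

def interval_density(t, theta_hat, rate_e, rate_i):
    """Equation 3.1, with e^{-(E+I)t} * I(2*sqrt(EI)t) rewritten as
    e^{-(sqrt(E)-sqrt(I))^2 t} * ive(theta_hat, 2*sqrt(EI)t)."""
    x = 2.0 * np.sqrt(rate_e * rate_i) * t
    prefactor = (theta_hat / t) * (rate_e / rate_i) ** (theta_hat / 2.0)
    return (prefactor
            * np.exp(-(np.sqrt(rate_e) - np.sqrt(rate_i)) ** 2 * t)
            * ive(theta_hat, x))

theta_hat, rate_e, rate_i = 1, 100.0, 50.0    # E > I: intervals finite a.s.
total, _ = quad(interval_density, 0, np.inf, args=(theta_hat, rate_e, rate_i))
mean, _ = quad(lambda t: t * interval_density(t, theta_hat, rate_e, rate_i),
               0, np.inf)
```

With these values, `total` should be very close to 1 and `mean` very close to θ̂/(E − I) = 0.02.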
Now let X be an arbitrary interspike interval with density function f(t). Then, according to Tuckwell (1988), the crossing probability, mean, and variance of X are

Pr{X < ∞} = 1 if E ≥ I, and (E/I)^{θ̂} if E < I,   (3.2)

E{X} = θ̂/(E − I) if E > I, and ∞ if E ≤ I,   (3.3)

and

Var{X} = θ̂(E + I)/(E − I)³ if E > I, and ∞ if E ≤ I.   (3.4)
Thus, in the case when E ≤ I, the coefficient of variation of the interspike intervals, CVISI, does not exist. However, if E > I, the coefficient of variation is

CV{X} = √((E + I) / (θ̂ (E − I))).   (3.5)
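Equations 3.3 through 3.5 can be checked by direct simulation of the underlying random walk. Because the jumps are ±a and the walk resets to zero at each exact threshold crossing, spike k occurs when the un-reset cumulative walk first reaches k·θ̂, which allows a vectorized simulation; event times form a Poisson stream of rate E + I, so an interval containing K steps has a Gamma(K, 1/(E + I)) duration. A sketch, with parameter values of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
rate_e, rate_i, theta_hat = 100.0, 50.0, 5      # E > I, threshold of 5 net steps
total_rate = rate_e + rate_i

# Each input event moves V by +a with prob E/(E+I) or -a with prob I/(E+I).
steps = rng.choice([1, -1], size=600_000,
                   p=[rate_e / total_rate, rate_i / total_rate])
walk = np.cumsum(steps)
running_max = np.maximum.accumulate(walk)

n_spikes = running_max[-1] // theta_hat
targets = theta_hat * np.arange(1, n_spikes + 1)
# First step index at which the walk reaches each successive threshold level.
hit_steps = np.searchsorted(running_max, targets)
steps_per_interval = np.diff(np.concatenate(([0], hit_steps + 1)))

# Duration of an interval with K input events: sum of K Exp(E+I) gaps.
intervals = rng.gamma(shape=steps_per_interval, scale=1.0 / total_rate)

mean_hat = intervals.mean()
cv_hat = intervals.std() / mean_hat
mean_theory = theta_hat / (rate_e - rate_i)                              # eq. 3.3
cv_theory = np.sqrt((rate_e + rate_i) / (theta_hat * (rate_e - rate_i)))  # eq. 3.5
```

With these rates the simulated mean and CVISI should land close to 0.1 and √0.6 ≈ 0.775, respectively.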
Therefore, by adjusting I within the interval [0, E), the CVISI can be set to any value in the interval [1/√θ̂, ∞), with arbitrarily large values occurring as I approaches E. In other words, as the amount of inhibition is increased to bring the model into perfect excitation-inhibition balance, the CVISI goes to infinity.

Now we wish to know when the output of this process exhibits LRcD. Since the output of the model is a renewal point process, from theorem 1 we know that if E{X} < ∞, then the output will be LRcD if and only if E{X²} = ∞. Furthermore, if the mean E{X} is infinite, then the second moment E{X²} must be infinite as well. Hence, according to equation 3.4, a necessary condition for the model to be LRcD is that E ≤ I.8 Thus, in trying to fit this IF model to both the CVISI and LRcD properties of real neurons, we find ourselves at an impasse. With the amount of inhibition less than the amount of excitation, the model can match the CVISI values measured from real neurons, but under these conditions, it is not LRcD, since both the mean and variance are finite. On the other hand, by making the amount of inhibition equal to or greater than the amount of excitation, we may create LRcD in the model, but the CVISI becomes infinite. Therefore, this model of highly variable cortical neurons cannot manifest LRcD while still producing interspike-interval variability consistent with empirical measurements.

A common generalization of this model is produced by relaxing the condition that the EPSPs and IPSPs have equal magnitude. In other words, consider the case when aE ≠ aI. This is similar to the model proposed by Shadlen and Newsome (1994) for matching the variability of real cortical neurons, except that their model had a lower limit, or elastic barrier, on the voltage of the IF neuron. This model, unlike the case when aE = aI, cannot be analyzed directly.
However, the potential V(t) in this case can be approximated by a Wiener process with drift, an approximation that becomes exact as the input rates E and I go to infinity and the PSP amplitudes aE and aI go to zero. Expressions for the density, distribution, and first three central moments of the output intervals for this model are also available in Tuckwell (1988) and have been reproduced in Jackson (2003). From these equations, we find that for the approximation to this more general model, the coefficient of variation of the interspike intervals, CVISI, does not exist when aE E ≤ aI I. However, if aE E > aI I, the coefficient of variation is

CV{X} = √((aE² E + aI² I) / (θ (aE E − aI I))),   (3.6)
where θ = Vth − v0 does not depend on the PSP amplitudes, since we no longer have the simplifying assumption that they are all equal. Therefore,

8 Since, according to equation 3.3, E{X} is also infinite when E ≤ I, we cannot conclude from theorem 1 that this is also a sufficient condition.
we see that the CVISI may take any value in the interval [√(aE/θ), ∞) when aE E > aI I, with arbitrarily large values occurring as aI I approaches aE E. Thus, we are in the same predicament as before. If aE E > aI I, then the model can be adjusted to match the CVISI values measured from real neurons. However, under this condition, both the mean and variance are finite, and therefore the model is not LRcD. But if aE E ≤ aI I, the process may be LRcD, but the CVISI is infinite. Therefore, even when the EPSP and IPSP magnitudes are unequal, the IF model is incapable of matching both the interspike-interval variability and the LRcD measured in real cortical neurons.

3.2 Renewal Integrate-and-Fire Models and Their General Properties. In the previous section, we showed that the basic high-variability IF model that requires balanced excitation and inhibition cannot produce both a finite CVISI and LRcD at the same time. However, this result is easily extended to all renewal models, a class that includes a significant portion of the single-neuron, high-variability IF models in the literature. By using the term single-neuron, we are excluding consideration of network models, where the statistical nature of the entire set of inputs to each neuron is not explicitly specified but instead consists, at least partially, of outputs from other similar neurons within an interconnected network. If each component of an IF model is renewal (i.e., has no memory of the process prior to the time of the last output spike), then the output of this model must be renewal. Hence, the following conditions are jointly sufficient to render the output of such a model renewal:

1. All inputs to the IF neuron are Poisson processes (i.e., are stationary with no autocorrelations).

2. The cross-covariance between any set of inputs is zero for any nonzero lag.

3. The PSPs are direct changes in the IF potential, and either the IF potential is reset to a fixed value after the occurrence of each output spike or these reset values form a set of independent and identically distributed random variables.

4. All other parameters of the model (e.g., threshold, leakage conductance, or reversal potentials) are either constant, set to a fixed value upon the occurrence of an output spike, or have post-output-spike values that form a set of independent and identically distributed random variables.

These conditions are not entirely general, but they are practically general for the high-variability IF models that have been studied in the literature. Note that if the above conditions are met, it does not matter whether the
model has leakage or reversal potentials or a dynamic threshold, as long as their parameters meet condition 4. Furthermore, the PSPs may have nonzero duration as long as condition 3 is met. Many of the nonnetwork high-variability IF models in the literature meet these four conditions and are therefore renewal. Excitation-inhibition balance models that fit this category were considered by Shadlen and Newsome (1994), Brown and Feng (1999), Feng and Brown (1998a, 1999), Feng (1999), and Burkitt (2000, 2001). Renewal models that produce highly variable outputs due to correlations between inputs9 were considered by Feng and Brown (2000a), Feng (2001), Feng and Zhang (2001), and Salinas and Sejnowski (2000). Other models that are renewal as well are the partial reset models considered in Troyer and Miller (1997) and Bugmann et al. (1997), the time-varying threshold model considered in Wilbur and Rinzel (1983), and the nonlinear leakage model considered in Feng and Brown (2000b). According to theorem 1, each of these models that has an interval distribution with finite mean either has an interval distribution with finite variance and is not LRcD or has an interval distribution with infinite variance and is LRcD. In the first case, it might be possible to match empirically measured values of CVISI , but the LRcD property of real neurons is unattainable. However, in the second case, the model will be LRcD, but it will be impossible to match empirically measured CVISI values. Thus, just like the model that was considered in section 3.1, all renewal models with finite-mean intervals fail to match both the interspike-interval variability and LRcD measured in real cortical neurons. The preceding argument assumes that the interval distribution has a finite mean. 
The example in section 3.1, where the limit of CVISI as the model approached the infinite-mean condition could be determined analytically, shows that at least some renewal models with infinite mean intervals fail to produce the required variability and LRcD properties, and fail in a manner similar to that described in the preceding argument for the finite mean case. This, however, does not prove that all such models fail in this way. Nevertheless, models with infinite mean intervals have at least two degenerate properties. First, such a model cannot be stationary. This means, for instance, that if we were to analyze or simulate a renewal point process with an infinite mean interval, we would need to begin with a point at some specified time—for instance, at the origin. Second, since an interval distribution with an infinite mean will also have an infinite variance, the CVISI has no meaning for such a model. Therefore, regardless of whether the model is LRcD, a direct comparison of the interval variability of the model with that of real cortical neurons is impossible using a single measure.
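The degeneracy of the infinite-mean case shows up immediately in simulation: for heavy-tailed intervals the sample CVISI never settles, unlike the finite-variance case. A sketch (the distributions and parameters are my own illustrative choices):

```python
import numpy as np

def sample_cv(x):
    """Sample coefficient of variation: sample standard deviation / sample mean."""
    return x.std() / x.mean()

rng = np.random.default_rng(5)
n = 1_000_000

# Finite-variance control: exponential intervals, whose true CV is 1.
cv_exponential = sample_cv(rng.exponential(scale=1.0, size=n))

# Pareto-type intervals with tail index 0.8: infinite mean and variance,
# so the sample CV grows without bound as the sample size increases.
alpha = 0.8
pareto_intervals = rng.uniform(size=n) ** (-1.0 / alpha)   # inverse-CDF sampling
cv_pareto = sample_cv(pareto_intervals)
```

Recomputing `cv_pareto` over successively longer prefixes of the data would show the drift described in the text, whereas `cv_exponential` converges to one.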
9 The inputs of these models are correlated only at zero lag and do not therefore violate condition 2.
2146
B. Jackson
To test whether such a model is reasonable, we should consider the behavior of both the sample mean and the sample variance of interspike intervals from physiological recordings as an increasing number of interspike intervals are included in the calculation. If both increase without bound for long recordings from cortical neurons, then it is possible that a renewal model with infinite mean interval would be reasonable. Even so, unless the sample CVISI values (the sample standard deviation divided by sample mean) converge to a finite value as the amount of data included in the calculations is increased, these values are unfit as model constraints. However, to the best of our knowledge, this convergence analysis has not been carried out on spike trains recorded from cortical neurons.

4 Integrate-and-Fire Models with Renewal Point Process Inputs

4.1 The Integrate-and-Fire Models of Feng and His Coworkers. In the previous section, we showed that any model of cortical neurons that produces a renewal output cannot exhibit both CVISI values and LRcD properties that are similar to those seen in real cortical neurons. However, there are a number of high-variability models that produce output spike trains that are not renewal point processes (RPPs). One type of nonrenewal, high-variability model includes the models analyzed by Feng and his coworkers (Feng, 1997, 1999; Feng & Brown, 1998a, 1998b; Feng et al., 1998). These models are identical to the class of models considered in section 3, except that the inputs are not Poisson point processes. Instead, the inputs can be any other type of RPP, and hence the cumulative inputs consisting of the superposition of all excitatory inputs and the superposition of all inhibitory inputs are no longer memoryless. In particular, Feng and his coworkers considered RPP inputs with positive gaussian and Pareto interval distributions.
The former has a tail that decreases to zero faster than the exponential distribution (i.e., the interval distribution of a Poisson process), and the latter has a tail that decreases to zero more slowly than the exponential distribution. In general, they found that both longer-tailed input distributions and more closely balanced amounts of excitation and inhibition increased the CVISI of the output. This suggests that the CVISI of an IF model with RPP inputs can be within the physiologically realistic range, regardless of the interval distributions of the RPPs, if the ratio between the amounts of excitation and inhibition is properly adjusted. The necessary range of inhibition-excitation ratios, however, does depend on the interval distribution of the inputs. In particular, for long-tailed input distributions, less inhibition is required to produce realistic CVISI values, whereas for short-tailed input distributions, the amount of inhibition needs to be much closer to the amount of excitation (Feng & Brown, 1998a). The IF model with Poisson inputs raises enough analytical difficulties that we do not expect that complete analytical results can be obtained for the case of general RPP inputs. However, a few general observations can
Long-Range Dependence in Models of Cortical Variability
be made. First, due to the integration mechanism, the likelihood of the occurrence of an output spike immediately following another output spike is low but will increase as the time since the last output spike increases. Thus, at small counting windows, we expect the counts to be negatively correlated. This will not, however, affect the correlation structure of the intervals, since the IF mechanism completely resets at the occurrence of each output spike. Therefore, the IF mechanism cannot by itself create memory in the model that is longer than the interspike intervals of the output. Hence, the medium- and long-term memory properties of the output must be governed by the inputs and the mechanisms by which they are combined. The combination of the inputs may be considered as two separate component processes: superposition and excitation-inhibition interaction. For Poisson inputs, the excitation-inhibition interaction can affect long-term memory by producing high interval variability (see section 3; Tuckwell, 1988), but with concomitant increases in the mean interval length. Balancing the amounts of excitation and inhibition will presumably have a similar type of effect when the inputs are RPPs. Any additional memory properties of the output, in particular those that are longer than the interspike intervals, must therefore originate in the superpositions of the input point processes. Thus, we should be able to gain some further intuition about the memory properties of the IF model with RPP inputs by considering the dependency structures of superpositions of RPPs.

4.2 Analytical Results for the Superposition of Renewal Point Processes.

4.2.1 The Positive Gaussian and Pareto Distributions. In the following, we will review some results relating the statistical properties of component RPPs to the properties of their superpositions and apply them for component RPPs with positive gaussian and Pareto interval distributions.
These are the two distributions considered by Feng and his coworkers (Feng, 1997, 1999; Feng & Brown, 1998a, 1998b; Feng et al., 1998), and they represent two different classes of interval distributions: those with superexponential tails and those with subexponential tails. A more complete treatment, containing additional mathematical derivations, is contained in Jackson (2003, sect. 3.5.2). The positive gaussian distribution is an example of a distribution that has a support of [0, ∞) and a tail that is shorter than the exponential distribution. If X is a random variable with a gaussian, or normal, distribution and a mean of zero, then Y = |X| has a positive gaussian, or “folded” gaussian, distribution. The probability density function of the positive gaussian distribution is 2 t2 exp − , if t ≥ 0; 2 πµ f (t) = π µ (4.1) 0, otherwise,
where µ > 0 is the expected value. Since its tail is shorter than the exponential distribution, which has a variance of µ² when its mean is µ, the positive gaussian distribution must have a variance that is less than µ². Specifically, its variance is (π/2 − 1)µ². The Pareto distribution is an example of a distribution that has a support of [0, ∞) and a tail that is longer than the exponential distribution. The probability density function of the Pareto distribution is

\[ f(t) = \begin{cases} \alpha K^{\alpha} (t + K)^{-\alpha-1}, & \text{if } t \ge 0; \\ 0, & \text{otherwise}, \end{cases} \tag{4.2} \]
with parameters K > 0 and α > 0. K is essentially a normalization constant, and α determines the length of the tail of the Pareto distribution. The tail probability Pr{X > x} of a Pareto-distributed random variable decays as x^(−α), and only the moments less than α exist. The mean of the Pareto distribution, if α > 1, is K/(α − 1), while if α > 2, the variance is 2K²/[(α − 1)(α − 2)].

4.2.2 The Interval Distribution of the Superposition of Renewal Point Processes. It is well known that as the number of component RPPs increases, under proper normalization, their superposition approaches a Poisson process (Cox & Smith, 1954; Khintchine, 1960; Cox, 1967). Thus, as the number of inputs to the IF model increases, the model becomes more similar to that considered in section 3, where the output was LRD if and only if the variance of the intervals of the output was infinite. Clearly, then, as the number of RPP inputs increases, the model will become less and less likely to possess both a finite CVISI and LRD. The forms of the interpoint interval distribution and the serial dependence between the intervals in the superposition, for any fixed, finite number of inputs, will indicate the direction from which it approaches the Poisson process as more component processes are added. First, consider the interval distribution of the superposition process. Let G(t) be the marginal (cumulative) distribution function of the intervals of the superposition of p independent RPPs, each with intervals distributed according to F(t) with a mean of µ. Then the distribution functions of the components and the superposition are related by (Cox & Smith, 1954; Lawrance, 1973),

\[ 1 - G(t) = (1 - F(t)) \left[ \int_t^\infty \frac{1 - F(s)}{\mu}\, ds \right]^{p-1}. \tag{4.3} \]
From equations 4.1 and 4.3, we find that the tail of the interval distribution for a superposition of positive-gaussian RPPs is (Jackson, 2003, sect. 3.5.2)

\[ 1 - G(t) = \operatorname{erfc}\!\left(\frac{t}{\mu\sqrt{\pi}}\right) \left[ e^{-\frac{t^2}{\pi\mu^2}} - \frac{t}{\mu}\,\operatorname{erfc}\!\left(\frac{t}{\mu\sqrt{\pi}}\right) \right]^{p-1}, \quad \text{for } t \ge 0, \]
where

\[ \operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-s^2}\, ds \]

is the complementary error function. Since the complementary error function is always between zero and one, the second term in the difference above must be positive, and p is a positive integer,

\[ 1 - G(t) \le e^{-\frac{(p-1)t^2}{\pi\mu^2}}. \tag{4.4} \]
The right-hand side of equation 4.4 decreases faster than an exponential function. Thus, the tail of the interval distribution for the superposition process also decreases faster than an exponential function, as is the case for the component processes. In particular, this result implies that the intervals of the superposition process have finite means and variances. Using equations 4.2 and 4.3, we find that the tail of the interval distribution for a superposition of Pareto RPPs, when α > 1, is (Jackson, 2003, sect. 3.5.2)

\[ 1 - G(t) = K^{(\alpha-1)p+1} (t + K)^{-[(\alpha-1)p+1]}, \quad \text{for } t \ge 0. \tag{4.5} \]
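Equation 4.5 can be checked against the general relation 4.3 by direct numerical integration. In the sketch below (ours; the parameter values are arbitrary), the substitution u = 1/(s + K) turns the infinite integral in equation 4.3 into a finite one.

```python
import numpy as np

alpha, K, p = 2.5, 1.0, 5
mu = K / (alpha - 1)                      # mean interval of a component RPP

def tail_F(t):
    """1 - F(t) for a Pareto component process (equation 4.2)."""
    return K**alpha * (t + K) ** (-alpha)

def tail_G_via_43(t):
    """1 - G(t) computed from equation 4.3 by trapezoidal quadrature."""
    # Substitute u = 1/(s + K): integral_t^inf (1 - F(s))/mu ds
    #   = (K^alpha / mu) * integral_0^{1/(t+K)} u^(alpha - 2) du.
    u = np.linspace(0.0, 1.0 / (t + K), 20001)
    g = (K**alpha / mu) * u ** (alpha - 2)
    integral = np.sum((g[1:] + g[:-1]) / 2) * (u[1] - u[0])
    return tail_F(t) * integral ** (p - 1)

def tail_G_via_45(t):
    """1 - G(t) from the closed form of equation 4.5."""
    a = (alpha - 1) * p + 1
    return K**a * (t + K) ** (-a)

for t in (0.0, 0.5, 2.0, 10.0):
    assert np.isclose(tail_G_via_43(t), tail_G_via_45(t), rtol=1e-3)
print("equations 4.3 and 4.5 agree")
```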
Comparing equation 4.5 to the tail of the interval distribution for the component processes,

\[ 1 - F(t) = K^{\alpha} (t + K)^{-\alpha}, \quad \text{for } t \ge 0, \]
we see that the superposition of p independent Pareto RPPs, each with parameters K and α > 1, has a Pareto interval distribution with parameters K and α′ = (α − 1)p + 1 > 1. Thus, the intervals in the superposition have finite variance as long as (α − 1)p + 1 > 2 or, equivalently, if p > 1/(α − 1). Thus, as α, the parameter for the component processes, approaches the value of one, an increasing number of inputs are required if their superposition is to have intervals with finite variance. Furthermore, if the intervals of the component processes have finite variance, i.e., α > 2, then the intervals in their superposition do as well.

4.2.3 Short-Range Dependence Between the Intervals of the Superposition of Renewal Point Processes. The dependency structure of the intervals of the superposition of RPPs is much more difficult to derive than their distribution. However, Enns (1970) found that the covariance between the lengths of two adjacent intervals, τn and τn+1, in the superposition of p independent RPPs is

\[ \lim_{n\to\infty} \operatorname{Cov}(\tau_n, \tau_{n+1}) = \mu^2 I(p), \tag{4.6a} \]
where

\[ I(p) = \int_0^\infty (h(t) * h(t))\,(1 - H(t))^{p-1}\, dt - \frac{1}{p^2}, \tag{4.6b} \]
H(t) is the distribution of the forward recurrence time, with density

\[ h(t) = \frac{1 - F(t)}{\mu}, \]
for each RPP, and the symbol ∗ denotes convolution. If there is only one component process, then I(1) = 0, which is consistent with the “superposition” process being an RPP. In addition, lim_{p→∞} I(p) = 0. This means that the superposition process approaches an RPP as the number of components is increased, in agreement with the result that the superposition approaches a Poisson process in this case. Under certain conditions, the hazard rate of the component RPPs can be used to easily determine the sign of I(p), and thus, by equation 4.6a, of the covariance of adjacent intervals in the superposition process. The hazard rate of an RPP having an interval distribution F(t), with probability density function f(t) = dF(t)/dt, is

\[ z(t) = \frac{f(t)}{1 - F(t)}. \]
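The hazard rate is easy to evaluate numerically. A quick sketch (ours; µ, α, and K are arbitrary) for the two interval distributions of section 4.2.1 shows the positive gaussian hazard increasing and the Pareto hazard decreasing:

```python
import math

import numpy as np

mu, alpha, K = 1.0, 2.5, 1.0
t = np.linspace(0.0, 5.0, 500)

# Positive gaussian: f(t) = (2/(pi*mu)) exp(-t^2/(pi*mu^2)),
# with tail 1 - F(t) = erfc(t / (mu*sqrt(pi))).
f_gauss = (2.0 / (np.pi * mu)) * np.exp(-t**2 / (np.pi * mu**2))
tail_gauss = np.array([math.erfc(x / (mu * math.sqrt(math.pi))) for x in t])
z_gauss = f_gauss / tail_gauss

# Pareto: the hazard rate reduces to alpha / (t + K).
z_pareto = alpha / (t + K)

assert np.all(np.diff(z_gauss) > 0)    # monotone increasing hazard
assert np.all(np.diff(z_pareto) < 0)   # monotone decreasing hazard
print("gaussian hazard increases; Pareto hazard decreases")
```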
If the hazard rate of the component RPPs is monotone, then the sign of its derivative can be used to determine the sign of I(p) in equation 4.6a. When the hazard rate of the inputs is monotone nondecreasing, then I(p) ≤ 0, and when it is monotone nonincreasing, then I(p) ≥ 0 (Enns, 1970). The exponential distribution, which has a constant hazard rate (i.e., both nondecreasing and nonincreasing), is the boundary case, yielding I(p) ≡ 0, and uncorrelated intervals in the superposition, which is consistent with the fact that the superposition of a number of Poisson processes is also a Poisson process. Thus, if the hazard rate of the inputs is monotone nondecreasing but also nonconstant, then I(p) < 0, and adjacent intervals in the superposition process are negatively correlated. If the hazard rate of the inputs is monotone nonincreasing and nonconstant, then I(p) > 0, and adjacent intervals in the superposition process are positively correlated. The hazard rate of an RPP with positive-gaussian-distributed intervals is monotone increasing (Jackson, 2003, appendix A). Hence, adjacent intervals in the superposition of such RPPs are negatively correlated. This is a common result for RPPs that are used to model real phenomena. If the hazard rate is increasing, then as time passes since the last event, it becomes more likely, per unit time, that the next event will occur. On the other hand, the hazard rate of a Pareto RPP (Jackson, 2003, sect. 3.5.2),

\[ z(t) = \frac{\alpha}{t + K}, \]
is monotone decreasing for all allowable (i.e., positive) values of α and K, so adjacent intervals in the superposition of these RPPs are positively correlated. Thus, although the marginal interval distribution of the superposition of Pareto RPPs is also Pareto, the superposition process is not an RPP when there are multiple component processes.

4.2.4 Long-Range Dependence in the Superposition of Renewal Point Processes. We are particularly concerned with the variability and long-term memory properties of the output of the IF model with RPP inputs. As discussed earlier, LRD in the output of this model is likely to be produced by LRD in the superposition of the inputs. In order to evaluate the LRD of a superposition of independent RPPs, we can consider the variance function V(t) ≡ Var{N(0, t]}, which is the variance of the number of events in an interval of length t. For a single RPP with a finite-variance interval distribution, it is known (e.g., Cox & Smith, 1954; Cox, 1967; Feller, 1971) that

\[ V(t) \sim \frac{\sigma^2 t}{\mu^3} \quad \text{as } t \to \infty, \]
where µ and σ² are, respectively, the mean and variance of the interval distribution. Thus, the variance function of the superposition of p independent, statistically identical RPPs is

\[ V_S(t) \sim \frac{p\sigma^2 t}{\mu^3} \quad \text{as } t \to \infty, \tag{4.7} \]
where µ and σ² are the mean and variance of the component processes. Therefore, if the component processes have finite interval variances, and thus are not LRcD, then their superposition has finite variance and is not LRcD. Actually, the connection between the interval variance of the component processes and the LRcD of their superposition is even stronger than the preceding discussion indicates. Theorem 1 states that an RPP is LRcD if and only if its intervals have infinite variance. The proof given there can be readily adapted to obtain a similar result for the superposition of a finite number of independent, statistically identical RPPs.10

Theorem 2. Suppose that N1(·), N2(·), . . . , Np(·) are p independent stationary RPPs, each with the same distribution function F of their generic interpoint interval random variable X, which has F(0) = 0 and finite mean µ = E{X}.11 Then

10 See Jackson (2003, theorem 3.4) for the proof of this result.
11 This theorem can easily be generalized to the case where the component processes have different interval distribution functions. In that case, the superposition process is LRcD if and only if one or more of the component processes has intervals with infinite variance. However, this generalization is unnecessary for the results of this study.
the superposition of these p RPPs, N_S(t) = \sum_{i=1}^{p} N_i(t), is LRcD if and only if E{X²} = ∞, that is, the variance of the intervals in the component processes is infinite.

Since any distribution with a tail that is shorter than the exponential distribution has finite variance, theorem 2 asserts that the superposition of RPPs with this kind of interval distribution is not LRcD. For component processes with intervals distributed according to a positive gaussian distribution, equation 4.7 implies that the asymptotic behavior of the variance function of the superposition process is given by

\[ V_S(t) \sim \frac{p\left(\frac{\pi}{2} - 1\right) t}{\mu} \quad \text{as } t \to \infty. \]
Thus, in accordance with theorem 2, the limit of V_S(t)/t, as t → ∞, is a finite constant. More instructive is the asymptotic behavior of the FFC. For the superposition of positive gaussian RPPs, as the counting interval length increases, the Fano factor approaches

\[ \lim_{t\to\infty} F_S(t) = \lim_{t\to\infty} \frac{V_S(t)}{pt/\mu} = \frac{\pi}{2} - 1 \approx 0.5708 < 1, \]
which is independent of the number of component processes, p, and the parameter µ of their interval distribution. The fact that the limiting value of the Fano factor is less than one is correlated with the prior finding that (at least) adjacent intervals are negatively correlated in the superposition.

The Pareto distribution has finite variance only if α > 2. Thus, according to theorem 2, the superposition of Pareto RPPs will be LRD if and only if α ≤ 2. For component processes with intervals distributed according to a Pareto distribution with α > 2, equation 4.7 implies that the asymptotic behavior of the variance function of the superposition process is given by

\[ V_S(t) \sim \frac{p \, \frac{2K^2}{(\alpha-1)(\alpha-2)}\, t}{\left(\frac{K}{\alpha-1}\right)^3} = \frac{2p(\alpha-1)^2 t}{K(\alpha-2)} \quad \text{as } t \to \infty. \]
Thus, as expected, the limit of V_S(t)/t, as t → ∞, is a finite constant when α > 2. Also, for the superposition of Pareto RPPs, with α > 2, the Fano factor approaches

\[ \lim_{t\to\infty} F_S(t) = \lim_{t\to\infty} \frac{V_S(t)}{pt(\alpha-1)/K} = \frac{2(\alpha-1)}{\alpha-2}, \]
which, as in the case of positive gaussian RPPs, is independent of the number of component processes, p. However, in this case, the limiting value of the Fano factor is dependent on a parameter, α, of the distribution. But for all
α > 2, the limiting value of the Fano factor is greater than one, which is correlated with the fact that (at least) adjacent intervals in the superposition of Pareto RPPs are positively correlated. In fact, this value is always greater than two for α > 2, with it approaching the value of two as α → ∞ and growing without bound as α → 2.

4.3 Simulation Results for the IF Model with Renewal Point Process Inputs. The analytical results of section 4.2, by describing the superposition of the inputs, provide some clues to the properties that should be expected of the output of the IF model with RPP inputs. In addition, the arguments in section 4.1 furnish additional insight into characteristics that the output is likely to possess. However, a complete analytical treatment of this model, save when the inputs are Poisson processes, is not available. Therefore, in order to test the ability of IF models with renewal point process inputs to produce realistic CVISI values and LRcD, we have run computer simulations of these models. For each model and set of parameters, we ran 10 simulations, each with a duration of 100,000 seconds. Each simulation used 100 excitatory inputs and 100r inhibitory inputs, where r was one parameter of the simulations that was varied. Thus, r represents the ratio of the number of inhibitory inputs to the number of excitatory inputs. The value of 100 was chosen to match the work of Feng and his coworkers. Moreover, this is a relatively large number of inputs but should not produce the trivial situation where the superpositions of the RPPs are nearly Poisson processes. If the results from the RPP inputs matched those from the Poisson process inputs, then this number would have to be reduced. The IF mechanism had a reset potential of zero and a threshold of unity, and each spike that occurred in either an excitatory or inhibitory input caused the potential to instantaneously increase or decrease, respectively, by 1/40 = 0.025.
When the potential reached the threshold value, an output spike was registered, and the potential was reset. The value of the appropriate parameter of the interval distribution of the inputs was chosen to yield a nominal output rate of 2.5 spikes per second according to the following formula:

\[ \text{input rate} = \frac{\text{output rate}}{(\text{\# of excitatory inputs})(\text{postsynaptic potential})(1 - r)} = \frac{2.5}{100 \cdot 0.025 \cdot (1 - r)}, \tag{4.8} \]
where postsynaptic potential is the amount by which input spikes increased or decreased the potential in the IF mechanism. If possible, the interval from time zero to the first spike in each input process was chosen according to the distribution of the forward recurrence time. This ensured that each input was stationary (or at equilibrium). If this was not possible, then a pseudostationary state was created by generating a random interval length from the
interval distribution and choosing the origin to be a uniformly distributed point within this interval.

We analyzed the output spike trains of these models using estimators of the CVISI, the FFC, and the IDC. The IDC required calculation of the sample variance of the length of, say, M adjacent intervals for many values of M. Since the IDC was plotted on a double logarithmic graph, we began with a value of M = 1 and used 10 values of M per decade, which were as equally spaced in logarithmic coordinates as possible, given that M has to be an integer. No M values greater than one-fifth of the total number of interspike intervals in the output of the simulation were used. Similarly, the FFC required calculation of the sample variance of the counts in windows of size T, say, for many values of T. The values of T ranged from 0.01 second to one-fifth of the total simulation length, with 10 logarithmically spaced values per decade. Finally, the interspike intervals of the output spike train from each of the 10 simulations for a given model and set of parameters were randomly shuffled once to produce a set of 10 surrogate spike trains. The sample variances of the counts and aggregated intervals were then calculated, in the same manner as before, for these new spike trains in order to produce the surrogate data FFCs and IDCs.

4.3.1 Simulations of the IF Model with Poisson Process Inputs. For comparison, we ran simulations of the IF model with Poisson process inputs with the same parameters as our other simulations. For the IF model with Poisson process inputs, it is possible to derive, in closed form, the theoretical value of the CVISI of its output as a function of the inhibition-excitation ratio, r. Using equation 3.5, we find that this function is

\[ \text{CV}_{\text{ISI}}(r) = \sqrt{\frac{1 + r}{\hat{\theta}(1 - r)}}, \tag{4.9} \]

where

\[ \hat{\theta} = \frac{1 - 0}{1/40} = 40 \]
for our simulations. This function is plotted as a dashed line in each graph of Figure 2. In Figure 2a, the values of CVISI estimated from 10 simulations of the Poisson input model at each value of r are individually plotted. However, since the variability of these estimates across simulations is so low, the 10 symbols at each value of r fall almost exactly on top of each other. It is apparent from this graph that the results of our simulations agree well with the theoretical values of CVISI for this model.
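This agreement can be reproduced with a compact simulation (our sketch, not the author's code). Because the potential moves in exact steps of ±0.025 and resets to zero on reaching the threshold of 1, the kth output spike occurs the first time the running maximum of the cumulative step sum reaches 40k, which permits a fully vectorized simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
r = 0.5            # inhibition-excitation ratio
theta = 40         # threshold / PSP = 1 / 0.025 steps to threshold
n_events = 3_000_000

# Each merged input event is excitatory with probability 1/(1+r); the
# merged stream is Poisson, and its rate only rescales time, leaving
# the CV of the output intervals unchanged.
steps = np.where(rng.random(n_events) < 1.0 / (1.0 + r), 1, -1)
event_times = np.cumsum(rng.exponential(1.0, n_events))

# kth spike = first time the cumulative step sum reaches k * theta.
running_max = np.maximum.accumulate(np.cumsum(steps))
n_spikes = running_max[-1] // theta
spike_idx = np.searchsorted(running_max, theta * np.arange(1, n_spikes + 1))
isi = np.diff(event_times[spike_idx])

cv = isi.std() / isi.mean()
cv_theory = np.sqrt((1 + r) / (theta * (1 - r)))   # equation 4.9
print(cv, cv_theory)    # both near 0.27
```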
Figure 2: The coefficient of variation of interpoint intervals as a function of the inhibition-excitation ratio r for the output of an integrate-and-fire model with inputs that are (a) Poisson point processes and (b) positive gaussian distributed renewal point processes. There were 10 simulation runs at each value of r, and the calculated value of the coefficient of variation for each of these runs is plotted with a symbol. The dashed line in each graph is the theoretical curve for the model with Poisson process inputs, and the solid line in b connects the means calculated at each value of r. For each value of r, the parameter controlling the rate of the inputs was set so that the nominal rate at the output was 2.5 spikes per second.
4.3.2 Simulations of the IF Model with Positive Gaussian Renewal Inputs. For simulations of the IF model with positive gaussian RPP inputs, the sole parameter of the interval distribution, the mean µ, was set to be the inverse of the nominal input rate from the calculation in equation 4.8. Thus, the only parameter that we varied in these simulations was the inhibition-excitation ratio r. For this model, we ran simulations with r-values of 0.0, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.97, 0.98, and 0.99. In Figure 2b, the calculated value of CVISI for each simulation is plotted versus the value of r. Again, at each location, there are actually 10 symbols, but the low variability of the estimates causes the symbols to fall almost exactly on top of each other at all but the highest values of r. The solid line connects the means of the 10 estimates of CVISI at each value of r, and the dashed line is the theoretical curve for the Poisson input model. These results are in close agreement with those of Figure 2a of Feng and Brown (1998a). As we expect, due to the negative correlation in the superpositions of the inputs, the CVISI values for the positive gaussian model are always less than the corresponding values for the Poisson model. However, the CVISI for the positive gaussian model does increase, apparently without bound, as r approaches one, in the same manner as that for the Poisson model. In particular, equation 4.9 specifies that the CVISI function with respect to r for the Poisson input model is the square root of a rational function with a pole
at one, and the increasing, convex shape of the CVISI curve for the gaussian input model is well approximated by the same type of function. Thus, as observed by Feng and Brown (1998a), the positive gaussian model requires more balance than the Poisson model in order to achieve any particular CVISI for its outputs. Only for values of r greater than about 0.9 is the CVISI greater than 0.5, which is the minimal value needed to match physiological data.

Although the positive gaussian model may be able to achieve any arbitrary value of CVISI, albeit perhaps at the cost of a high degree of excitation-inhibition balance, its success as a model of cortical neurons quickly diminishes when the possibility of LRD is considered. Figure 3 contains examples of FFCs and IDCs for the output of the IF model with positive gaussian renewal inputs for several representative values of the inhibition-excitation ratio r; the FFCs and IDCs for all of our simulations may be found in Jackson (2003, appendix B). The sub-Poissonian variability of the interspike intervals in the output caused by the integration mechanism and the negative correlation in the superpositions of the inputs is clearly evident in the general downward trend of both the original data and surrogate data FFCs and their asymptotic approach to a constant less than one for low r-values. Furthermore, as r is increased (i.e., as the amount of inhibition is brought into balance with the amount of excitation), the effect of the excitation-inhibition interaction is apparent in the rising value of the asymptotes of the FFCs. At a value of about r = 0.97 (not shown in Figure 3), the excitation-inhibition interaction completely cancels the two variance-decreasing effects, and the intervals of the output of the model have Poisson-like variability. Thus, for larger values of r, the asymptotic values of the FFCs are greater than one. Even at these large r-values, however, the FFCs are still below one for small counting intervals.
For instance, the FFCs for r = 0.99 do not exceed one until the counting interval is about 0.1 second in length. A small amount of negative correlation between the interspike intervals of the output is also evident at low r-values in the differences between the original data FFCs and surrogate data FFCs, as well as in the difference between the original data IDCs and surrogate data IDCs. The fact that the original data FFCs are lower than the surrogate data FFCs implies that the intervals in the original data must be negatively correlated, since the intervals in the surrogate data are uncorrelated but have the same distribution as the original data intervals. This argument is supported by the original data IDCs, which are decreasing at low aggregation levels and are below the corresponding surrogate data IDCs. Since this negative correlation in the output intervals is apparent, and its effect is, in fact, largest, at r = 0, where there is no excitation-inhibition interaction, it must be the result of the negative correlation between intervals in the superposition of the inputs. These effects of the negative correlation, however, gradually disappear as r increases. Thus, for r greater than about 0.9, the original data curves and the surrogate data curves are nearly equivalent, implying that the output
[Figure 3 appears here: four rows of log-log panels, one row for each inhibition-excitation ratio r = 0.00, 0.50, 0.95, 0.99. In each row, the left panel plots the Fano factor against the counting interval (sec), and the right panel plots the index of dispersion of intervals against the number of aggregated intervals.]
Figure 3: FFCs and IDCs estimated from simulations of the IF model with positive gaussian inputs. Each set of axes contains 10 curves calculated from original data (black) and 10 curves calculated from the corresponding shuffled surrogate data (gray). For each value of the inhibition-excitation ratio r, each individual FFC in the left set of axes was calculated from the same data as one of the IDCs in the right set of axes. Also, for each value of r, the parameter µ of the positive gaussian distribution, which controls the rate of the inputs, was set so that the nominal rate at the output was 2.5 spikes per second.
of the model is roughly a renewal process for closely balanced amounts of excitation and inhibition. Several additional characteristics of the FFCs and IDCs in Figure 3 are consistent with theoretical predictions. First, the surrogate data IDCs are all close to being horizontal lines, and the mean surrogate data IDC at each r-value is, for all practical purposes, a horizontal line. This is to be expected since surrogate data form a renewal process. Next, the value of each FFC at the smallest counting intervals is nearly one, in agreement with the theoretical result mentioned in section 2.6.3, and each FFC has the same asymptotic value as the corresponding IDC, as predicted by equation 2.5. This also means that the vertical position of the IDCs increases with increasing r.

From this analysis, it is apparent that the positive gaussian input IF model is no better at producing high interval variability and LRcD than the Poisson input IF model. Not only does this model require perfect excitation-inhibition balance in order to produce LRcD, but at any given level of balance, the gaussian input model is further from being LRcD than the Poisson input model. Furthermore, if the level of balance is high enough to produce values of CVISI above 0.5 in the gaussian input model, then the output of this model is nearly renewal, which is precisely the characteristic that was used to expose the inadequacy of the Poisson input model. Moreover, since the gaussian input model will more closely approximate the Poisson input model as more inputs are included, increasing the number of inputs will never completely overcome the shortcomings of the gaussian input model that we have found. In sum, the positive gaussian input IF model does not solve the high-variability problem while producing LRD and is, in fact, worse in this regard than the more tractable Poisson-input model.

4.3.3 Simulations of the IF Model with Pareto Renewal Inputs.
The IF model with Pareto RPP inputs has three parameters: the inhibition-excitation ratio r, the "tail" parameter α, and the normalization constant K. For any given values of r and α > 1, the value of K was determined by setting the mean of the Pareto distribution equal to the inverse of the input rate determined by equation 4.8. Thus, we set K = (α − 1)/(input rate) when α > 1. When 0 < α ≤ 1, the Pareto distribution does not have a mean, and hence the previous calculation is not justified. The only value in this range that we used was α = 1, and in this case we set K = 0.1/(input rate), where the input rate was still determined by equation 4.8; this is the same value that we would have calculated for K at α = 1.1. We found empirically that this value yielded an output spike rate in the vicinity of 2.5 spikes per second in our simulations. Thus, for the Pareto input model simulations, we varied two parameters: α and r. For α, we used values of 1.0, 1.25, 1.5, 1.75, 1.9, 2.0, 2.1, 2.5, and 3.0, and for r, we used values of 0.0, 0.5, 0.7, 0.8, 0.9, and 0.95. Ten simulations were run for each combination of these parameter values. For comparison, Feng and his coworkers (Feng, 1997; Feng & Brown, 1998a, 1998b; Feng et al., 1998) used only the values α = 1.0 and α = 2.1 in their simulations.
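The input-generation rule just described is easy to reproduce. The sketch below is a minimal illustration, not the article's code: it assumes the shifted (Lomax-type) form of the Pareto distribution, whose mean is K/(α − 1) and which therefore matches the K = (α − 1)/(input rate) rule above, and the input rate, sample size, and seed are all made up.

```python
import random
import math

def pareto_intervals(alpha, K, n, rng):
    """Draw n interspike intervals from a shifted Pareto (Lomax)
    distribution with tail parameter alpha and scale K, by inverse-CDF
    sampling.  For alpha > 1 the mean interval is K / (alpha - 1)."""
    return [K * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0)
            for _ in range(n)]

def cv_isi(intervals):
    """Coefficient of variation of the intervals: std / mean."""
    n = len(intervals)
    m = sum(intervals) / n
    v = sum((x - m) ** 2 for x in intervals) / (n - 1)
    return math.sqrt(v) / m

rng = random.Random(42)
input_rate = 100.0                    # spikes/s for one input (illustrative)
alpha = 2.5                           # finite-variance regime (alpha > 2)
K = (alpha - 1.0) / input_rate        # sets the mean interval to 1/input_rate
ivs = pareto_intervals(alpha, K, 200_000, rng)
print(round(sum(ivs) / len(ivs), 3))  # mean interval ≈ 1/input_rate = 0.01
print(cv_isi(ivs) > 1.0)              # heavier-tailed than exponential ISIs
```

For α ≤ 2 the same sampler produces infinite-variance intervals, the regime the article exploits below.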
Long-Range Dependence in Models of Cortical Variability
2159
Besides spanning a wide range, our additional values for α were chosen for two specific reasons. First, we wanted to use some values of α strictly between 1.0 and 2.0, where the Pareto distribution has infinite variance but still has a finite mean. Second, we wanted to use some values significantly greater than two: although the Pareto distribution does have finite mean and variance at α = 2.1, this value is close enough to 2.0 that the differences between finite and infinite variance will not be very obvious. Figure 4 shows the estimated values of CVISI from these simulations plotted versus the inhibition-excitation ratio r. The graphs in each column contain the same data, only at different scales. Column a contains graphs of the mean values of CVISI for each combination of parameter values for r and α. Thus, each symbol represents the mean calculated over 10 simulations. Means calculated from simulations with the same value of α are connected by solid lines, while the dashed line is the theoretical curve for the Poisson input model. The curves all have the same increasing, convex shape, which can be well fit by the square root of a rational function with a pole at r = 1. Also, as we expect from the positive correlations in the superpositions of the inputs, the CVISI values for the Pareto model are always greater than those for the Poisson model. Thus, in accord with the findings of Feng and Brown (1998a), the Pareto model requires less balance than the Poisson model to achieve any particular output CVISI. Furthermore, as α is decreased, a particular value of CVISI can be attained with less balance between the excitation and inhibition. Indeed, at low values of α, the CVISI is significantly larger than physiological measurements in the cortex at even moderate degrees of balance. Column b of Figure 4 shows, for a subset of the values of α, the CVISI estimates for all simulation runs.
The solid lines are identical to those in column a, connecting the mean for each set of 10 simulation runs. Although at low r values and high α values the symbols are tightly grouped, like the CVISI estimates for the positive gaussian model, the variance of these estimates of CVISI increases significantly as r increases and as α decreases. The most dramatic effect is seen for medium to high values of r and low values of α. These increases in estimator variance are almost certainly caused by the progressive movement of the model toward a state of nonstationarity as r or α approach one. We saw in section 3 that when r is equal to one, the mean of the intervals in the output of the Poisson input model is infinite, and no point process with infinite mean intervals can be stationary. This is likely to be the case with the present model as well. On the other hand, when α equals one, the input processes have infinite mean intervals, also forcing the model to be nonstationary. Comparing Figure 4 with Figure 2a of Feng and Brown (1998a), we see that our results for α = 2.1 are in good agreement with their corresponding results. In contrast, our CVISI values for α = 1 are substantially larger than those in their Figure 2b. The different values of K used can explain some,
[Figure 4 appears here: two columns of axes, (a) and (b), each plotting the coefficient of variation against the inhibition-excitation ratio r (0.0 to 1.0) at three vertical scales, with one curve per value of α (3.0, 2.5, 2.1, 2.0, 1.9, 1.75, 1.5, 1.25, 1.0) and a dashed Poisson curve.]
Figure 4: The coefficient of variation of interpoint intervals as a function of the inhibition-excitation ratio r for the output of an IF model with inputs that are Pareto-distributed renewal point processes. For each combination of the parameters, r and the “tail” parameter α of the Pareto distribution, 10 separate simulation runs were conducted with the parameter K of the Pareto distribution set so that the nominal rate at the output was 2.5 spikes per second. (Column a) The mean values of the estimates are plotted for each combination of parameters. Each solid line connects the means for a single value of α, and the dashed line is the theoretical curve for the model with Poisson process inputs. All three sets of axes contain the same data, but each displays them on a different vertical scale. (Column b) The estimates calculated from each individual simulation run are plotted for a subset of the values of α. Solid and dashed lines are the same as in column a, and all three sets of axes contain the same data.
but not all, of this disagreement. In simulations with K set to one in order to match the value used by Feng and Brown (1998a), our CVISI values were closer to, but still larger than, their results. For instance, at r = 0.9, the most balanced condition that they used, they have CVISI ≈ 3; our simulations for Figure 4 resulted in CVISI ≈ 80, and our simulations with K = 1 produced CVISI ≈ 8.3. Since Feng and Brown (1998a) do not give many details regarding their simulations, it is difficult to determine the source of the remaining difference between their results and ours. We speculate that their lower values may be the result of either a shorter simulation duration or different starting conditions for the simulation of this nonstationary process. Figures 5 through 7 contain FFCs and IDCs calculated from simulated output of the IF model with Pareto RPP inputs for several different representative combinations of values for the parameters α and r. The curves for α = 2.5, α = 1.75, and α = 1.0 are shown in Figures 5, 6, and 7, respectively. For each of these values of α, FFC and IDC sets are shown for each of four values of r ranging from zero to 0.90 or 0.95. The complete set of FFCs and IDCs for all of our simulations may be found in Jackson (2003, appendix B). These graphs possess some common characteristics across all parameter values. First, as should be the case in all situations, the FFCs approach the value of one at very small counting interval lengths, and the mean surrogate data IDCs are horizontal lines. Second, like those of the positive gaussian model, the FFCs have an initially negative slope as a result of the relative unlikelihood of the occurrence of very short intervals, which is caused by the integration mechanism. Third, unlike those of the positive gaussian model, the original data IDCs generally have an initial positive slope and remain above the corresponding surrogate data IDCs. 
This is a result of the positive correlation between intervals in the superposition of Pareto renewal processes, in contrast to the negative correlation in the superposition of positive gaussian renewal processes. These positively correlated intervals also produce positive slopes in the original data FFCs subsequent to their initial decline from one. This produces original data FFCs that have an asymptotic value larger than one, except when the value of α is large and the value of r is small. In the latter case, the positive correlation in the superpositions, combined with the variance-producing effect of the excitation-inhibition interaction, is not strong enough to overcome the effect of the integration process. As for the positive gaussian model, the original data and corresponding surrogate data curves for the Pareto model become progressively more similar as the excitation and inhibition are brought into better balance, that is, as r is increased. This means that the surrogate data FFCs change more quickly with increases in r than do the original data FFCs and that the original data IDCs better approximate horizontal lines as r increases. In the current model, however, this is the result of a loss of positive, not negative, correlation in the output intervals. Nevertheless, this loss of correlation implies that the output of the model resembles a renewal point process at values of r close to one.
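The three statistics used throughout this comparison can be sketched compactly. The sketch below uses the standard definitions (Fano factor FF(T) = variance/mean of counts in disjoint windows of length T; index of dispersion of intervals IDI(n) = n·Var(S_n)/E[S_n]², which is flat in n for a renewal process; surrogates made by shuffling interval order); the homogeneous Poisson train used for the sanity check is purely illustrative.

```python
import random
import math

def fano(spike_times, T, t_end):
    """Fano factor: variance / mean of spike counts in disjoint windows
    of length T (one point on an FFC)."""
    nwin = int(t_end // T)
    counts = [0] * nwin
    for s in spike_times:
        i = int(s // T)
        if i < nwin:
            counts[i] += 1
    m = sum(counts) / nwin
    v = sum((c - m) ** 2 for c in counts) / (nwin - 1)
    return v / m

def idi(intervals, n):
    """Index of dispersion of intervals for sums of n adjacent intervals,
    n * Var(S_n) / E[S_n]^2: equals CV^2 at n = 1 and is constant in n
    for a renewal process (one point on an IDC)."""
    sums = [sum(intervals[i:i + n])
            for i in range(0, len(intervals) - n + 1, n)]
    m = sum(sums) / len(sums)
    v = sum((s - m) ** 2 for s in sums) / (len(sums) - 1)
    return n * v / m ** 2

def shuffled_surrogate(intervals, rng):
    """Randomly permute the intervals: the interval distribution is kept,
    but any serial dependence (and hence any LRcD) is destroyed."""
    s = list(intervals)
    rng.shuffle(s)
    return s

# Sanity check on a homogeneous Poisson train at 2.5 spikes/s (illustrative):
rng = random.Random(1)
rate, t_end = 2.5, 20_000.0
intervals, times, t = [], [], 0.0
while t < t_end:
    dt = rng.expovariate(rate)
    intervals.append(dt)
    t += dt
    times.append(t)
print(round(fano(times, 10.0, t_end), 2))  # ≈ 1 for a Poisson process
print(round(idi(intervals, 100), 2))       # ≈ 1 (CV^2 of exponential ISIs)
```

For a renewal process the surrogate leaves both curves statistically unchanged, which is exactly the diagnostic used in the text.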
[Figure 5 appears here: for r = 0.00, 0.50, 0.80, and 0.95, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals, for α = 2.50.]
Figure 5: FFCs and IDCs estimated from simulations of the IF model with Pareto inputs and parameter α = 2.5. Each set of axes contains 10 curves calculated from original data (black) and 10 curves calculated from the corresponding shuffled surrogate data (gray). For each value of the inhibition-excitation ratio r, each individual FFC in the left set of axes was calculated from the same data as one of the IDCs in the right set of axes. Also, for each value of r, the parameter K of the Pareto distribution was set so that the nominal rate at the output was 2.5 spikes per second.
[Figure 6 appears here: for r = 0.00, 0.50, 0.80, and 0.95, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals, for α = 1.75.]
Figure 6: FFCs and IDCs estimated from simulations of the IF model with Pareto inputs and parameter α = 1.75. Curves for original data are shown in black, and those for surrogate data are shown in gray. The parameter K of the Pareto distribution was set so that the nominal rate at the output was 2.5 spikes per second for each simulation.
[Figure 7 appears here: for r = 0.00, 0.50, 0.70, and 0.90, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals, for α = 1.00.]
Figure 7: FFCs and IDCs estimated from simulations of the IF model with Pareto inputs and parameter α = 1.0. Curves for original data are shown in black, and those for surrogate data are shown in gray. The parameter K of the Pareto distribution was set so that the nominal rate at the output was 2.5 spikes per second for each simulation.
In section 4.1, we identified two components of the input combination in the IF model, each of which could possibly produce LRD. First, LRD may be produced through the balance of excitation and inhibition. In the Poisson input model this could occur, but since the output of the model was a renewal process, the inhibition-excitation balance also created output intervals with infinite variance. Our simulations suggest that the balance of excitation and inhibition has a similar effect on the Pareto input model, whose output becomes more RPP-like with increasing r, and whose CVISI apparently increases without bound as r increases. Thus, the balance of excitation and inhibition is also unable to produce a realistic form of LRD in this model. The second way that LRD can be present in the output of the RPP-input IF model is if the superpositions of the inputs are LRD. According to theorem 2, this will occur when, and only when, the intervals of the input have infinite variance. For Pareto RPP inputs, this condition is met when α ≤ 2. From our simulations, it is clear that when α ≤ 2, the output of the model is LRD, and the nature of the LRD is different from that produced by inhibition-excitation balance. Increasing the value of r tends to pull the IDC closer to horizontal, while increasing the variability of the inputs (i.e., decreasing α) tends to increase the slope of the IDC. Thus, while more balance reduces the dependence between the intervals of the output of the model, making the output process more similar to an LRcD RPP (with infinite interval variance), lowering the value of α strengthens the interval dependence, ultimately producing an LRcD process with LRiD and finite interval variance. These two different effects can be seen as well in the combination of original data and surrogate data FFCs. As the excitation and inhibition become more balanced, the difference between these two FFCs disappears.
And as the variability of the inputs increases, this difference becomes larger, primarily because of the increasing slope of the original data FFC. This trend may break down as α approaches one, where the mean interspike interval of the input processes becomes infinite. In general, then, our simulations of the IF model with Pareto RPP inputs show that, with proper adjustment of the parameters r and α, this model can produce outputs that simultaneously exhibit both LRD and high interspike interval variability like that found in cortical neurons. More specifically, from the data plotted in Figure 4, we see that the CVISI of the output of this model lies between 0.5 and 1.5, a typical range of CVISI estimated from recordings of cortical neurons, for the ranges of r shown in Table 2 for each value of α used in our simulations. Combining this information with comparisons between the FFCs obtained from simulations of this model, examples of which are shown in Figures 5 through 7, and the FFCs obtained from the physiological recordings shown in Teich et al. (1996), we can evaluate the ability of the Pareto input model to match the variability and LRD of in vivo cortical neurons.
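The balance effect that underlies Table 2 can be illustrated with a deliberately simplified model. The sketch below is not the article's Pareto-input model: it uses a perfect (non-leaky) integrate-and-fire counter driven by merged Poisson event streams, and all parameters (rates, threshold, duration, seed) are made up. It shows only the qualitative point that the output CVISI grows as the inhibition-excitation ratio r approaches one.

```python
import random
import math

def if_random_walk(rate_e, r, theta, t_end, rng):
    """Perfect (non-leaky) integrate-and-fire counter.  Events arrive as
    a merged Poisson stream at total rate rate_e * (1 + r); each event is
    excitatory (+1) with probability 1/(1 + r), else inhibitory (-1, with
    the counter floored at 0).  A spike is emitted and the counter reset
    when the threshold theta is reached.  Returns the output ISIs."""
    total = rate_e * (1.0 + r)
    p_exc = 1.0 / (1.0 + r)
    t, v, last, isis = 0.0, 0, 0.0, []
    while t < t_end:
        t += rng.expovariate(total)
        v = v + 1 if rng.random() < p_exc else max(0, v - 1)
        if v >= theta:
            isis.append(t - last)
            last, v = t, 0
    return isis

def cv(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return math.sqrt(v) / m

rng = random.Random(7)
results = {}
for r in (0.0, 0.8):
    isis = if_random_walk(rate_e=500.0, r=r, theta=50, t_end=1000.0, rng=rng)
    results[r] = cv(isis)
    print(r, round(results[r], 2))  # output CV_ISI grows as r approaches 1
```

With r = 0 the output is a gamma-like renewal process with CV near 1/√theta; with r = 0.8 the net drift is small and the first-passage variability is much larger, mirroring the convex curves in Figure 4.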
Table 2: Range of the Inhibition-Excitation Ratio r, for Each Value of α, That Leads to Values of CVISI Between 0.5 and 1.5 for the IF Model with Pareto Renewal Process Inputs.

α        Range of r
3.0      0.75 < r < 0.95
2.5      0.70 < r < 0.93
2.1      0.68 < r < 0.91
2.0      0.66 < r < 0.90
1.9      0.64 < r < 0.89
1.75     0.61 < r < 0.86
1.5      0.57 < r < 0.82
1.25     0.47 < r < 0.75
1.0      0.20 < r < 0.56
Note: The ranges of r were approximated from plots of CVISI versus r (see Figure 4) obtained from simulations of the model.
For 2.0 < α ≤ 3.0, when the interval variance of the inputs is finite, a value of r between about 0.7 and 0.9 is required to match physiological CVISI estimates. For r values near 0.9, these FFCs resemble those of cell 4 and cell 7 in Teich et al. (1996), although neither the simulation nor the physiological results suggest the presence of LRcD. At somewhat lower α-values, between 1.5 and 2.0, the simulation FFCs with r at the high end of their CVISI-matching range tend to resemble those of cells 3 and 19 of Teich et al. (1996), which do suggest the presence of LRcD. At higher values of r, the FFCs still retain a resemblance to cells 3 and 19, but their CVISI values are too large. Of course, if r were made close enough to 1.0, the output of the model would eventually be too renewal-like and no longer similar to the results from these cells. For values of α close to 1.0, the output of the model is still LRcD, but the surrogate data FFCs' asymptotes are too low at r-values that lead to physiological values of CVISI. Slightly larger values of r lead to FFCs resembling those of cells 3 and 19 of Teich et al. (1996) but produce too large a CVISI value. In addition, as α approaches one, the inputs approach a condition of nonstationarity, which results in a significant increase in the variability of our estimates of the Fano factor and the index of dispersion of intervals. This can be seen in the increased spread of the curves for α = 1.0 in Figure 7. Although the Pareto input model is able to produce outputs that share common statistical features with the spike trains of cortical neurons, the type of inputs required is not justified physiologically. In order to produce LRD in the output of the model, the inputs must have infinitely variable
Long-Range Dependence in Models of Cortical Variability
2167
intervals, which is not the case for the neurons that might serve as inputs to cortical neurons: other cortical neurons or subcortical, especially thalamic, neurons. This appears to be the fundamental weakness of the Pareto input model. There is, however, another potential problem with this model. As mentioned in section 4.2.2, the Pareto input model will more closely approximate the Poisson input model as more inputs are included. Thus, increasing the number of inputs will eventually undermine the success of this model, and the Pareto input model may not succeed as an explanation for high variability and LRD in neurons that receive an exceptionally large number of inputs. In the next section, we describe and explore another relatively simple model that remedies both of these problems.

5 Integrate-and-Fire Models with Fractional-Gaussian-Noise-Driven Poisson Process Inputs

In the previous section, the output of the IF model was LRD when the input was LRD. If this is a general property that does not depend on the inputs being RPPs, then LRD inputs with finite interval variance should produce LRD in the output of the IF model as well. In this section, we consider just such a model, which will thus have inputs that are more statistically compatible with known properties of in vivo neurons. More specifically, we have chosen to use inputs that are fractional-gaussian-noise-driven Poisson processes, which were developed to model LRD in biological neurons.

5.1 The Fractional-Gaussian-Noise-Driven Poisson Process. The stationary, or homogeneous, Poisson process is the simplest stochastic point process. The probability of occurrence of a point in this model is uniform throughout time, parameterized by the rate of occurrence λ, and is completely independent of the past. In addition, the occurrence of multiple points at any one time instant is virtually impossible.
Because of its simplicity and analytical tractability, this is customarily the first model used for situations that can be represented as a series of events, and the inputs of the high-variability IF models of cortical neurons are no exception. Nevertheless, with more study of the process to be modeled, one usually finds that more complicated models are eventually required. The inputs of the models considered in section 4 retained some of the temporal independence of the Poisson process but allowed modification of the interval distribution. The Poisson process can, however, be generalized in such a way that temporal dependence can be specified, but certain distributional properties are retained. This generalization involves replacing the (constant) rate parameter with a time-dependent function λ(t), resulting in a nonstationary, or nonhomogeneous, Poisson process. However, this model is useful only when there is a known deterministic trend in the rate of occurrence of points that can be used to define the function λ(t).
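A nonhomogeneous Poisson process with a known rate function λ(t) can be sampled by Lewis-Shedler thinning: draw candidate points from a homogeneous process at a rate that bounds λ(t), and keep each candidate with probability λ(t) divided by that bound. The sinusoidal trend below is made up purely for the demonstration.

```python
import random
import math

def thinning(rate_fn, rate_max, t_end, rng):
    """Sample a nonhomogeneous Poisson process with rate rate_fn(t)
    (bounded above by rate_max) on [0, t_end) by Lewis-Shedler thinning."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_max)   # candidate from homogeneous process
        if t >= t_end:
            return events
        if rng.random() < rate_fn(t) / rate_max:
            events.append(t)             # accepted with prob rate_fn/rate_max

# A rate with a known deterministic trend (assumed here for illustration):
rate = lambda t: 5.0 + 4.0 * math.sin(2.0 * math.pi * t / 10.0)
rng = random.Random(3)
ev = thinning(rate, 9.0, 1000.0, rng)
print(round(len(ev) / 1000.0, 2))  # empirical rate ≈ time average of rate(t), 5.0
```

The same accept-reject step works unchanged when λ(t) is itself a realization of a stochastic rate process, which is how the doubly stochastic processes below can be simulated.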
2168
B. Jackson
Often a more useful point process model is produced by replacing the rate parameter with a stochastic process Λ(t), yielding what is called a doubly stochastic Poisson process (DSPP), sometimes called a "Cox process" since it was proposed in a seminal paper by Cox (1955). As long as the stochastic process Λ(t) is stationary, the DSPP will be stationary but can still be dependent on its past. In essence, the DSPP has a dependency structure equivalent to that of its stochastic rate process. In particular, for our purposes, if its stochastic rate process is LRD, then so is the DSPP. Fractional gaussian noise is an LRD stochastic process that was introduced by Mandelbrot and coworkers (Mandelbrot, 1965; Mandelbrot & Wallis, 1968, 1969a, 1969b, 1969c; Mandelbrot & van Ness, 1968) in order to stochastically model the fluctuations in the water levels of the Nile River, which, Hurst (1951) had previously discovered, possess long memory. Fractional gaussian noise (fGn) is the increment process of fractional Brownian motion (fBm), which is the only gaussian process that is both self-similar and has stationary increments. fBm is parameterized by the self-similarity index 0 < H < 1, where H = 0.5 corresponds to ordinary Brownian motion with independent increments. Thus, fGn can be parameterized by the self-similarity index of the corresponding fBm, with H = 0.5 corresponding to ordinary "white" gaussian noise. When H < 0.5, fGn has (long-range) negative dependence, which is not useful for modeling neural spike trains. However, when H > 0.5, fGn is LRD, so we will be interested only in fGn with H in the range 0.5 ≤ H < 1. In all cases, individual random variables in the process are distributed according to a gaussian distribution, but as H approaches one, the strength of the dependence, or correlation, between these random variables increases.
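For reference (a standard property of fGn, stated here for completeness rather than taken from the text), the autocovariance of standard fGn makes this dependence structure explicit:

```latex
\gamma(k) = \operatorname{Cov}\bigl(G_H(j),\, G_H(j+k)\bigr)
          = \tfrac{1}{2}\left(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H}\right)
          \sim H(2H-1)\,|k|^{2H-2} \quad \text{as } |k| \to \infty.
```

For H > 0.5 the autocovariances are positive and decay so slowly that their sum diverges, which is precisely the long-range dependence referred to above; for H < 0.5 they are negative at all lags k ≠ 0; and for H = 0.5 they vanish, recovering white gaussian noise.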
The fractional-gaussian-noise-driven Poisson process (fGnDP) is created by using a modified form of fGn as the stochastic rate process of a DSPP. Modification of the fGn is necessary for several reasons. First, the rate of a Poisson process cannot be negative, whereas samples of fGn can certainly assume negative values. Second, fGn has mean zero, so that simple truncation of the fGn at zero would significantly change its statistical properties. Third, fGn, like ordinary "white" gaussian noise, is inherently a discrete process, while the rate process for a DSPP should be defined in continuous time. Let {G_H(k), k ∈ Z} denote standard fGn, with a mean of zero and a variance of one. Then, in order to remedy the aforementioned incongruencies, we let the rate process of a DSPP be^12

Λ(t) = max{0, λ + σ G_H(⌊t/τ⌋)},   (5.1)
12 This rate function is not actually a stationary stochastic process, since time zero is always at the beginning of a τ-length sampling interval. However, this "small" nonstationarity facilitated mathematical analysis and did not affect our results in test simulations, owing to the length of the simulations and the analysis methods used.
where λ and σ are positive constants and ⌊x⌋ is the largest integer not exceeding x. The max function keeps Λ(t) from being negative, and the λ parameter makes the mean nonzero. The expression ⌊t/τ⌋ converts continuous time t to discrete "time," so that the value of Λ(t) is constant on each τ-length time interval. Assuming that σ ≪ λ, so that the probability that λ + σG_H < 0 is negligible, the mean rate of the fGnDP with the rate process in equation 5.1 is E{Λ(t)} ≈ λ. In addition, if N(t) is a DSPP with the rate process in equation 5.1, then the variance of N(t) will be approximately λt plus a term proportional to σ². Teich et al. (1990) suggested using "fractal-noise-driven" Poisson process models for neural spike trains of the auditory nerve in order to account for the long-range dependence, or "fractal behavior," that had been previously discovered (Teich, 1989). In subsequent work, the theory and application of this model to the auditory nerve and other peripheral sensory-system neurons were further developed (Teich, 1992; Teich & Lowen, 1994; Lowen & Teich, 1993a, 1995, 1996a, 1997; Kumar & Johnson, 1993; Thurner et al., 1997). The fractional-gaussian-noise-driven Poisson process is, arguably, the simplest of these models.^13 Furthermore, since the theory of fractional gaussian noise has been well developed in many different contexts, and since the other "fractal" noises that were suggested as the driving noise in these models converge to fractional gaussian noise under appropriate conditions, the fGnDP is a natural starting point for using "fractal-noise-driven" Poisson process models. The fGnDP was used as a model of the spike trains of the auditory nerve because it is LRcD (for H > 0.5) but, unlike LRD renewal processes, it is also LRiD and has finite interval variance similar to that of a Poisson process. These properties are shared by the auditory nerve, as well as by most subcortical neurons that have been studied with respect to such properties.
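Equation 5.1 can be simulated directly. The sketch below, a minimal illustration rather than the article's procedure, generates fGn by the exact but O(n³) Cholesky method applied to the standard fGn autocovariance γ(k) = ½(|k+1|^{2H} − 2|k|^{2H} + |k−1|^{2H}), then draws Poisson counts bin by bin from the truncated rate. The rate, σ, H, and bin count are made-up values, not ones used in the article.

```python
import random
import math

def fgn_cholesky(n, H, rng):
    """Exact (but O(n^3)) generation of n samples of standard fractional
    gaussian noise: Cholesky-factor the covariance matrix built from the
    fGn autocovariance gamma(k), then color i.i.d. gaussian draws."""
    gamma = lambda k: 0.5 * (abs(k + 1) ** (2 * H) - 2 * abs(k) ** (2 * H)
                             + abs(k - 1) ** (2 * H))
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(gamma(0) - s)
            else:
                L[i][j] = (gamma(i - j) - s) / L[j][j]
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

def poisson_knuth(lam, rng):
    """Knuth's multiplicative Poisson sampler (fine for moderate lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def fgndp_counts(n, tau, lam, sigma, H, rng):
    """Counts of an fGn-driven Poisson process on n bins of width tau:
    the rate on bin i is max(0, lam + sigma * G_H(i)), as in equation 5.1."""
    g = fgn_cholesky(n, H, rng)
    return [poisson_knuth(max(0.0, lam + sigma * gi) * tau, rng) for gi in g]

rng = random.Random(11)
counts = fgndp_counts(n=200, tau=1.0, lam=50.0, sigma=10.0, H=0.7, rng=rng)
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
print(round(m), round(v / m, 1))  # mean ≈ lam; counts overdispersed (FF > 1)
```

Because σ ≪ λ here, the truncation rarely binds, the mean rate stays close to λ, and the count variance exceeds the Poisson value by roughly σ²τ², in line with the approximations above.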
Thus, a further justification of the IF model with fGnDP inputs is that the thalamic neurons that have been studied with respect to LRD, and that also project into the cortex, possess these properties as well.

5.2 The Superposition of Fractional-Gaussian-Noise-Driven Poisson Processes. The only difference between the present model and the model of section 4 is the form of the input processes. Thus, much of the reasoning in that section regarding general properties of the model applies here as well. In particular, we still expect the fGnDP-input model to exhibit negative correlation on small timescales, due to the integration mechanism, and this negative correlation will presumably manifest itself in a negatively sloped FFC for small counting windows. Also, excitation-inhibition balance will
13 In this literature, the fractional-gaussian-noise-driven Poisson process is called the fractal-gaussian-noise-driven Poisson process. We have used the former in accordance with mainstream mathematical and analytical literature on fractional gaussian noise.
most likely still produce memory over longer ranges by producing high interval variability. Finally, additional memory properties of the output, if present, should be evident in the superposition of the inputs. Therefore, we will next consider the properties of the superposition of fGnDPs.

The superposition of two independent DSPPs is also a DSPP, as proved in the following theorem:

Theorem 3. Let N_1(t) and N_2(t) be independent doubly stochastic Poisson processes with (independent) rate processes Λ_1(t) and Λ_2(t), respectively. Then the superposition N(t) = N_1(t) + N_2(t) is a doubly stochastic Poisson process with rate process Λ(t) = Λ_1(t) + Λ_2(t).

Proof. The probability generating functional of a doubly stochastic Poisson process N(t) with stochastic rate process Λ(t) is (e.g., Cox & Isham, 1980; Daley & Vere-Jones, 1988)

G_N[ξ] = E_Λ { exp( ∫_{−∞}^{∞} (ξ(t) − 1) Λ(t) dt ) },

where E_Λ denotes expectation with respect to Λ. Therefore, if N_1(t) and N_2(t) are independent doubly stochastic Poisson processes with independent stochastic rate processes Λ_1(t) and Λ_2(t), respectively, then the probability generating functional of their superposition N(t) = N_1(t) + N_2(t) is

G_N[ξ] = G_{N_1}[ξ] G_{N_2}[ξ]
       = E_{Λ_1} { exp( ∫_{−∞}^{∞} (ξ(t) − 1) Λ_1(t) dt ) } · E_{Λ_2} { exp( ∫_{−∞}^{∞} (ξ(t) − 1) Λ_2(t) dt ) }
       = E_Λ { exp( ∫_{−∞}^{∞} (ξ(t) − 1) (Λ_1(t) + Λ_2(t)) dt ) }.

Thus, N(t) is a doubly stochastic Poisson process with rate process Λ(t) = Λ_1(t) + Λ_2(t).

This result is easily extended to any finite sum of independent DSPPs by repeated application of the previous theorem, yielding the following corollary:

Corollary 1. Let N_1(t), N_2(t), . . . , N_m(t) be independent doubly stochastic Poisson processes with (independent) rate processes Λ_1(t), Λ_2(t), . . . , Λ_m(t), for finite integer m. Then the superposition N(t) = Σ_{i=1}^{m} N_i(t) is a doubly stochastic Poisson process with rate process Λ(t) = Σ_{i=1}^{m} Λ_i(t).
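Theorem 3 is easy to check numerically: the counts of a superposition of two independent DSPPs should match, in distribution, the counts of a single DSPP whose rate is the sum of two independent rate draws. The sketch below uses a made-up two-state rate, redrawn independently on each unit-width bin, and compares the first two moments of the counts.

```python
import random
import math

def poisson_knuth(lam, rng):
    """Knuth's multiplicative Poisson sampler."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def count_stats(draw, nbins, rng):
    """Sample mean and variance of nbins independent count draws."""
    cs = [draw(rng) for _ in range(nbins)]
    m = sum(cs) / nbins
    v = sum((c - m) ** 2 for c in cs) / (nbins - 1)
    return m, v

rng = random.Random(5)
states = (2.0, 8.0)  # per-bin rate states, redrawn independently on each bin

# Superposition of two independent DSPPs ...
superposed = lambda r: (poisson_knuth(r.choice(states), r)
                        + poisson_knuth(r.choice(states), r))
# ... versus a single DSPP whose rate is the sum of two independent draws.
summed = lambda r: poisson_knuth(r.choice(states) + r.choice(states), r)

m1, v1 = count_stats(superposed, 100_000, rng)
m2, v2 = count_stats(summed, 100_000, rng)
print(round(m1, 1), round(m2, 1))  # both ≈ 10 = E[rate sum] per unit bin
print(round(v1), round(v2))        # both ≈ 28 = 10 + Var(rate sum) = 10 + 18
```

Both constructions give the same overdispersed count law, which is exactly the content of the theorem.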
Another theorem, concerning the sum of independent fractional gaussian noises, is therefore useful with regard to the fGnDP-input IF model:

Theorem 4. Let G_1(k) and G_2(k) be two independent standard (i.e., zero mean, unit variance) fractional gaussian noises with identical Hurst indices H. Then their weighted sum G(k) = σ_1 G_1(k) + σ_2 G_2(k), σ_1, σ_2 > 0, is also a fractional gaussian noise with Hurst index H, mean zero, and variance σ_1² + σ_2².

Proof. Let Z = {Z_j, j = . . . , −1, 0, 1, . . .} be any stationary sequence. The sequence of transforms {T_N, N = 1, 2, 3, . . .} defined, for each N, as T_N : Z → T_N Z = {(T_N Z)_i, i = . . . , −1, 0, 1, . . .}, where

(T_N Z)_i = (1/N^H) Σ_{j=iN+1}^{(i+1)N} Z_j,   i = . . . , −1, 0, 1, . . . ,

is called the renormalization group with index H (Samorodnitsky & Taqqu, 1994). T_N transforms the original sequence into a sequence composed of renormalized sums of N adjacent components of the original sequence. According to corollary 7.2.13 of Samorodnitsky and Taqqu (1994), fractional gaussian noise is the only gaussian fixed point of the renormalization group, where Z is by definition a fixed point of the renormalization group if T_N Z =^d Z (equality in distribution) for all N ≥ 1. Now let G_1 = {G_1(k), k = . . . , −1, 0, 1, . . .} and G_2 = {G_2(k), k = . . . , −1, 0, 1, . . .} be two standard fractional gaussian noises, each with Hurst index H, and let G = σ_1 G_1 + σ_2 G_2 = {σ_1 G_1(k) + σ_2 G_2(k), k = . . . , −1, 0, 1, . . .}, for constants σ_1 and σ_2. Also let {T_N, N = 1, 2, 3, . . .} be the renormalization group with Hurst index H equal to the Hurst indices of G_1 and G_2. Each T_N is clearly a linear transformation, so for each N ≥ 1,

T_N G = T_N(σ_1 G_1 + σ_2 G_2) = σ_1 (T_N G_1) + σ_2 (T_N G_2).
2172
B. Jackson
Since G_1 and G_2 are independent and are both fixed points of the renormalization group,

T_N G =_d σ_1 G_1 + σ_2 G_2 = G.

Thus, G is also a fixed point of the renormalization group. Furthermore, for each k = . . . , −1, 0, 1, . . . , σ_1 G_1(k) has a gaussian distribution with mean zero and variance σ_1², and σ_2 G_2(k) has a gaussian distribution with mean zero and variance σ_2². Thus, G(k), for each k, also has a gaussian distribution with mean zero, and its variance is σ_1² + σ_2². Finally, since G is gaussian and a fixed point of the renormalization group, it must be fractional gaussian noise.

This theorem can also be extended to any finite sum by its repeated application.

Corollary 2. Let G_i(k), i = 1, 2, . . . , m, be m independent standard fractional gaussian noises with identical Hurst indices H. Then their weighted sum G(k) = Σ_{i=1}^m σ_i G_i(k), where σ_i > 0 for i = 1, 2, . . . , m, is also a fractional gaussian noise with Hurst index H, mean zero, and variance Σ_{i=1}^m σ_i².

Nevertheless, according to equation 5.1, the rate process of an fGnDP is not a linear function of an fGn. In order to produce a valid rate process, the linear function of fGn,

λ + σ G_H(t/τ),

must be truncated below at zero. If this truncation were unnecessary, then, using corollaries 1 and 2, the sum of m independent fGnDPs with rate processes

Λ_i(t) = λ + σ G_{H,i}(t/τ),   i = 1, 2, . . . , m,

would be an fGnDP with rate process

Λ(t) = Σ_{i=1}^m Λ_i(t) = mλ + √m σ G_H(t/τ).   (5.2)

Hence, the mean rate of this fGnDP would be mλ and the variance mσ². Statistically, however, it is desirable that the rate process of an fGnDP be as much like fGn as possible. The truncation was used only because the rate of a Poisson process cannot be negative. Thus, ideally, we could consider the superposition of m fGnDPs to be an fGnDP with the rate process
5.2 truncated at zero. In other words, the rate process of the superposition would be

Λ(t) = max(0, mλ + √m σ G_H(t/τ)).   (5.3)

In order to compare this "ideal" situation with the "real" superposition of fGnDPs having rate processes of the form 5.1, consider the sum of two independent fGnDPs. Suppose that the rate processes of these fGnDPs are

Λ_i(t) = max(0, λ + σ G_{H,i}(t/τ)),   i = 1, 2.

Then the rate process of their superposition is

Λ(t) = Λ_1(t) + Λ_2(t)
     = max(0, λ + σ G_{H,1}(t/τ)) + max(0, λ + σ G_{H,2}(t/τ))
     = max(0, λ + σ G_{H,1}(t/τ), λ + σ G_{H,2}(t/τ), 2λ + σ(G_{H,1}(t/τ) + G_{H,2}(t/τ))),   (5.4)

while the "ideal" situation yields a rate process of

Λ(t) = max(0, 2λ + √2 σ G_H(t/τ)).   (5.5)
Thus, since G_{H,1} + G_{H,2} is distributed like √2 G_H, the "ideal" case and the "real" case differ when either σ G_{H,1}(t/τ) < −λ or σ G_{H,2}(t/τ) < −λ, but not when both are true. As we assumed previously, these occurrences should be rare. In addition, when they do occur, the differences that they produce should usually be small, since large differences would require that λ + σ G_{H,i}(t/τ) be far below zero for the process i that has an untruncated negative rate. Therefore, the superposition of m independent fGnDPs with rate processes of the form 5.1 is well approximated by a single fGnDP with the rate process in equation 5.3.

5.3 Simulation Procedures for the IF Model with fGnDP Inputs. For the IF model with fGnDP inputs, we ran simulations similar to those for the renewal input model. Again, the simulations were 100,000 seconds in duration, and 10 independent simulations were run for each set of parameter values. Each simulation used 100 excitatory inputs and 100r inhibitory inputs, as did the RPP input model simulations. The IF mechanism had a
reset potential of zero and a threshold of unity, and inputs caused instantaneous increases or decreases of 1/40 in the integration potential. The rate of each input was calculated according to equation 4.8, yielding a nominal output rate of 2.5 spikes per second. Each input of this model is specified by its stochastic rate process

Λ_i(t) = max(0, λ + σ G_{H,i}(t/τ)),

under the assumption that they are all statistically identical. The parameter λ, the mean rate of the fGnDP if the effect of truncation is neglected, was specified by the rate of the inputs that produced a nominal output rate of 2.5 spikes per second. In previous studies (unpublished), we found that the value σ = 30 worked well for modeling neural spike trains when the rate was λ = 100. Since the variance of the counts should be additive with respect to the rate, this suggested that σ = 3√λ. Finally, the sample time τ, that is, the length of the intervals over which the rate of the Poisson processes remained constant, was set to 0.1 second. This value was chosen by balancing the cost of simulation time against the condition that it not have a significant effect on the results. Therefore, the fGnDP input model had only two free parameters: the inhibition-excitation ratio r and the Hurst index H of the inputs. For r, we used the same set of values as for the Pareto input model in section 4: 0.0, 0.5, 0.7, 0.8, 0.9, and 0.95. We desired a set of values for H that spanned the range 0.5 ≤ H < 1.0, which includes all values for which fGn is uncorrelated or LRD, but not degenerate. Thus, we chose the values 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, and 0.95. For the simulations of the fGnDP-input model, we used the result from section 5.2 that the superposition of independent fGnDPs may be approximated by a single fGnDP.
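The fGn samples needed for these rate processes can be synthesized exactly by the Davies and Harte (1987) circulant-embedding FFT method cited later in this section. A minimal sketch of that construction (the function name is my own; this is an illustration, not the simulation code used in the article):

```python
import numpy as np

def fgn_davies_harte(n, H, rng):
    """Exact synthesis of n samples of standard fractional gaussian noise
    with Hurst index H by circulant embedding (Davies & Harte, 1987)."""
    # Autocovariance of standard fGn at lags 0..n.
    k = np.arange(n + 1, dtype=float)
    gamma = 0.5 * ((k + 1)**(2 * H) - 2 * k**(2 * H) + np.abs(k - 1)**(2 * H))
    # First row of the length-2n circulant embedding and its eigenvalues.
    row = np.concatenate([gamma, gamma[-2:0:-1]])
    lam = np.fft.fft(row).real
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative round-off
    m = row.size  # = 2n
    # Shape complex white noise by the square root of the spectrum; the
    # real part of the inverse FFT then has exactly the target covariance.
    z = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    x = np.fft.ifft(np.sqrt(lam) * z)
    return np.sqrt(m) * x.real[:n]

rng = np.random.default_rng(1)
x = fgn_davies_harte(16_384, 0.8, rng)  # LRD sample
w = fgn_davies_harte(16_384, 0.5, rng)  # H = 0.5: ordinary white noise
```

For H = 0.8 the sample autocovariance at lag 1 should be near the theoretical value (2^{1.6} − 2)/2 ≈ 0.52, while the H = 0.5 sample is uncorrelated.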
Thus, for each simulation run, we needed to produce only two fGnDPs: one for the excitatory inputs and one for the inhibitory inputs. Both of these fGnDPs had Hurst indices of H, and the rate processes were

Λ_E(t) = max(0, 100λ + 3√(100λ) G_{H,E}(10t))

for the excitatory process and

Λ_I(t) = max(0, 100rλ + 3√(100rλ) G_{H,I}(10t))

for the inhibitory process. Thus, changes in the number of excitatory inputs (or, equivalently, in the number of inhibitory inputs for a fixed value of r) are equivalent to changes in the spike rate of these inputs. Hence, given that we chose the spike rates of the inputs so that the nominal output rate would be 2.5 spikes per second, the number of inputs that we used is actually inconsequential to our results.
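The simulation loop itself can be sketched as follows. This is a simplified illustration with a hypothetical function name and illustrative parameter values: white gaussian noise stands in for the fGn (the H = 0.5 case), and event times within a bin are not resolved.

```python
import numpy as np

def simulate_if_fgndp(T=500.0, r=0.8, lam_e=400.0, tau=0.1,
                      jump=1.0 / 40.0, theta=1.0, seed=0):
    """Integrate-and-fire neuron driven by one aggregate excitatory and
    one aggregate inhibitory fGnDP.  White noise replaces the fGn here
    (H = 0.5); lam_e is the aggregate excitatory rate in events/s."""
    rng = np.random.default_rng(seed)
    n_bins = int(T / tau)
    sig_e, sig_i = 3 * np.sqrt(lam_e), 3 * np.sqrt(r * lam_e)
    # Truncated rate processes, constant within each bin of width tau.
    rate_e = np.maximum(0.0, lam_e + sig_e * rng.standard_normal(n_bins))
    rate_i = np.maximum(0.0, r * lam_e + sig_i * rng.standard_normal(n_bins))
    v, spikes = 0.0, []
    for b in range(n_bins):
        # Poisson event counts for this bin, then a random ordering of
        # +jump (excitatory) and -jump (inhibitory) potential steps.
        steps = np.concatenate([
            np.full(rng.poisson(rate_e[b] * tau), jump),
            np.full(rng.poisson(rate_i[b] * tau), -jump)])
        rng.shuffle(steps)
        for s in steps:
            v += s
            if v >= theta:  # threshold crossed: spike and reset to zero
                spikes.append(b * tau)
                v = 0.0
    return np.array(spikes)

spikes = simulate_if_fgndp()
isi = np.diff(spikes)
```

With these illustrative rates the net excitatory drift is about two threshold crossings per second, so a few hundred seconds of simulated time yields enough interspike intervals for a stable CVISI estimate.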
Since each of the 1200 simulated fGns (60 sets of parameters × 10 simulation runs per set × 2 fGns per run) requires 10^6 sample points (10^5 seconds × 10 sample points per second), and long fGn simulations are computationally demanding, samples of fGn were produced using a fast Fourier transform (FFT) method (Davies & Harte, 1987; Beran, 1994; Bardet, Lang, Oppenheim, Philippe, & Taqqu, 2002). However, unlike most popular FFT methods (e.g., Paxson, 1997; Ledesma & Derong, 2000), which are approximate (and often quite inexact), this method is exact. The main drawback of any FFT method, however, is that it requires large amounts of memory for such long samples. Further discussion of this and other fGn algorithms, as well as references, can be found in Jackson (2003, sect. 3.6.3). After simulating the fGnDP-input IF model with all combinations of the values of the two parameters r and H, we calculated estimates of the CVISI, the FFC, and the IDC for each simulation run in the same manner as for the renewal input model. This included estimating the FFC and IDC of the surrogate data, produced by shuffling the interspike intervals, for each simulation run.

5.4 Simulation Results for the IF Model with fGnDP Inputs. Figure 8 shows the estimated values of CVISI from all of our simulations of the fGnDP-input model plotted versus the inhibition-excitation ratio r. Column a contains graphs of the mean CVISI values calculated across the 10 simulations at each combination of parameter values for r and H. Means for the same value of H are connected by lines. As with the gaussian input model and the Pareto input model in section 4, the graph of CVISI as a function of r for any particular value of H has an increasing, convex shape. More specifically, these curves are well approximated by the square root of a rational function with a pole at r = 1. Furthermore, the values of CVISI all lie above the curve for the Poisson input model (dashed line), and they increase with H.
Since, as in the Pareto input model, the intervals in the superpositions of the inputs to this model are positively correlated, the fact that CVISI is always greater (at least for 0.5 ≤ H < 1) than in the Poisson input model is to be expected. The positive correlation between CVISI and H is also not surprising, given the results for the Pareto input model. In the Pareto input model, decreasing the value of α strengthened the dependence between intervals in the output of the model and, we suspect, in the input superpositions; this was also associated with increases in the value of CVISI. In the present model, increasing the value of H will certainly strengthen the dependence between intervals in the input superpositions. Thus, we should expect such increases in H to lead to stronger dependence in the output as well, and that this is associated with increases in the value of CVISI. Moreover, we will see in the IDCs that increases in H do indeed strengthen the dependence between the output intervals. These results thus indicate that for higher values of H, the fGnDP input model requires less excitation-inhibition balance to achieve any particular value of CVISI. This means that for high values of H, the CVISI of the
output of the fGnDP input model is substantially larger than physiological estimates at even moderate degrees of balance, as was also the case with the Pareto input model. Column b of Figure 8 shows, for a subset of the values of H, the individual estimates of CVISI for each simulation. The lines connect the mean values of these estimates, and there are 10 symbols, one for each simulation run, for each of the different combinations of the two parameters r and H. Here again, the symbols are tightly grouped for low values of r and weaker LRD (i.e., low values of H). Also, as in the other two sets of simulations, the variance of the estimates of CVISI increases as either r or H is increased. Again, this is a result of the fact that the model approaches a nonstationary state as r or H approaches one. Figures 9 through 11 show the FFCs and IDCs for simulations of the IF model with fGnDP inputs for several representative combinations of the values of the parameters H and r. The curves for Hurst indices of H = 0.5, H = 0.7, and H = 0.9 are shown in Figures 9, 10, and 11, respectively. For each of these values of the Hurst index, FFC and IDC sets are shown for the inhibition-excitation ratios r = 0.0, r = 0.5, r = 0.8, and r = 0.95. The entire set of FFCs and IDCs from all of our simulations may be found in Jackson (2003, appendix B). For these curves, as should always be the case, the FFCs approach a value of one at very small counting interval lengths, and the mean surrogate data IDCs are horizontal. Due to the lack of very short intervals in the output of this model, caused by the integration mechanism, the initial trend of the FFCs is downward. These initial downward trends are comparable to those seen in the FFCs of the gaussian input and Pareto input models. In contrast, the original data IDCs for the fGnDP input model, when H > 0.5, have an initial, and indeed enduring, upward trend.
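This enduring upward trend is what the fGn autocovariance leads one to expect: for H > 0.5 the lagged correlations are positive and decay so slowly that their partial sums diverge (the defining feature of LRD), while at H = 0.5 they vanish. A quick check of the standard formula (function name is my own):

```python
import numpy as np

def fgn_autocov(k, H):
    """Autocovariance of standard fGn at integer lags k:
    gamma(k) = 0.5 * (|k+1|^(2H) - 2|k|^(2H) + |k-1|^(2H))."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1)**(2 * H) - 2 * k**(2 * H) + np.abs(k - 1)**(2 * H))

lags = np.arange(1, 100_001)
g05 = fgn_autocov(lags, 0.5)  # identically zero: white noise
g09 = fgn_autocov(lags, 0.9)  # positive and slowly decaying
partial_1e3 = fgn_autocov(np.arange(1, 1_001), 0.9).sum()
partial_1e5 = g09.sum()       # keeps growing as more lags are included
```

The divergent partial sums for H > 0.5 are what produce IDCs that rise, power-law fashion, without leveling off.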
In fact, these IDCs all closely resemble a power law function, appearing very linear on a double logarithmic plot. As in the Pareto input model, but in contrast to the gaussian input model, a superposition of the inputs to the fGnDP model has positively correlated interspike intervals. Not only does this positive correlation produce the positive slopes of the IDCs, but it also produces positive slopes in the original data FFCs for medium and large counting intervals. Thus, except when H equals 0.5, or when H is close to 0.5 and r is small, the original data FFCs have an asymptotic value larger than one. Theoretically, the output of the fGnDP input model with H = 0.5 is a renewal process: fGn is simply the common "white" gaussian noise when H = 0.5, so, since this fGn has no memory and Poisson processes have no memory, the fGnDPs used as inputs to the IF model have no memory when H = 0.5. In this sense, the model is very similar to the Poisson input IF model, with the integration mechanism, which is reset at the occurrence of each output spike, being the only component possessing memory. The results of our simulations agree with this prediction. As is evident in Figure 9, the original data FFCs and IDCs are nearly identical to the
[Figure 8 appears here. Panels plot Coefficient of Variation against Inhibition-Excitation Ratio, r; the legend lists H = 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, and Poisson.]
Figure 8: The coefficient of variation of interpoint intervals as a function of the inhibition-excitation ratio r for the output of an IF model with inputs that are fractional-gaussian-noise-driven Poisson processes. For each combination of the parameters, r and the Hurst index H of the fractional Gaussian noise, 10 separate simulation runs were conducted. (Column a) The mean values of the estimates are plotted for each combination of parameters. Each solid line connects the means for a single value of H, and the dashed line is the theoretical curve for the model with Poisson process inputs. All three sets of axes contain the same data, but each displays them on a different vertical scale. (Column b) The estimates calculated from each individual simulation run are plotted for a subset of the values of H. Solid and dashed lines are the same as in column a, and all three large sets of axes contain the same data. The inset on the bottom set of axes contains only the data for H = 0.9, revealing two additional data points that are out of range of the larger set of axes.
[Figure 9 appears here: H = 0.50, rows r = 0.00, 0.50, 0.80, 0.95. Left axes: Fano Factor versus Counting Interval (sec); right axes: Index of Dispersion of Intervals versus Number of Aggregated Intervals.]
Figure 9: FFCs and IDCs estimated from simulations of the IF model with fGnDP inputs with Hurst index H = 0.5. Each set of axes contains 10 curves calculated from original data (black) and 10 curves calculated from the corresponding shuffled surrogate data (gray). For each value of the inhibition-excitation ratio r, each individual FFC in the left set of axes was calculated from the same data as one of the IDCs in the right set of axes. The spike rate of the inputs was set in order to produce a nominal output rate of 2.5 spikes per second for each simulation.
[Figure 10 appears here: H = 0.70, rows r = 0.00, 0.50, 0.80, 0.95. Left axes: Fano Factor versus Counting Interval (sec); right axes: Index of Dispersion of Intervals versus Number of Aggregated Intervals.]
Figure 10: FFCs and IDCs estimated from simulations of the IF model with fGnDP inputs with Hurst index H = 0.7. Curves for original data are shown in black, and those for surrogate data are shown in gray. The spike rate of the inputs was set in order to produce a nominal output rate of 2.5 spikes per second for each simulation.
[Figure 11 appears here: H = 0.90, rows r = 0.00, 0.50, 0.80, 0.95. Left axes: Fano Factor versus Counting Interval (sec); right axes: Index of Dispersion of Intervals versus Number of Aggregated Intervals.]
Figure 11: FFCs and IDCs estimated from simulations of the IF model with fGnDP inputs with Hurst index H = 0.9. Curves for original data are shown in black, and those for surrogate data are shown in gray. The spike rate of the inputs was set in order to produce a nominal output rate of 2.5 spikes per second for each simulation.
surrogate data ones. This is the behavior that we expect from a renewal process. Due to the presence of the fGn, each input to this model should, however, be more variable than a Poisson process, which should manifest itself in higher variability at the output of the model. Indeed, as we saw in Figure 8, the variance of the output of the fGnDP input model for H ≥ 0.5 is always greater than that for the Poisson input model having the same r-value. When H > 0.5, the inputs of the fGnDP model, as well as the superpositions of the inputs, are LRD. This was also true of the Pareto input model when α ≤ 2.0. The LRD in the Pareto inputs and superpositions, however, came in the form of LRcD with infinite interval variance and no LRiD, whereas in the fGnDP inputs and superpositions, it is in the form of LRcD with LRiD and finite interval variance. But this distinction does not seem to be important for the IF model, at least with regard to the statistical procedures that we have used. This is demonstrated in the striking similarity between the results from the Pareto model with α = 1.75 in Figure 6 and those from the fGnDP model with H = 0.7 in Figure 10. Thus, disregarding the effect of the interaction between excitation and inhibition, in the fGnDP input model, the LRiD at the input propagates through the model, while in the Pareto input model, the infinite interval variance of the inputs is converted into LRiD by the model. In either case, the result at the output seems to be just about the same. The effect on the output of the fGnDP input model of balancing the amounts of excitation and inhibition is comparable to its effect on the other models that we have considered. As r increases toward one, the effect of the correlation in the inputs is gradually overwhelmed by the high variability of the excitation-inhibition interaction. The output will continue to exhibit LRcD, but this LRcD will progressively become more a result of high interval variability than of LRiD. 
However, the surrogate data FFCs all asymptote to a finite constant (except perhaps when H and r are both very close to one, where the approach to nonstationarity creates unusually large variance in our estimates), consistent with the intervals in the output having finite variance. The output interval variance does increase with r, causing the asymptotic values of the surrogate data FFCs to move upward, but these FFCs always remain below their original data counterparts because of the presence of LRiD. Also, in a manner analogous to the Pareto input model, the difference between the original data FFCs and the surrogate data FFCs at long counting intervals increases as H increases. But this trend again seems to break down, in this case as H approaches one. Pareto RPPs are non-LRD for any α greater than two, but fGnDPs are non-LRD only when H = 0.5. In addition, the Pareto RPPs will produce positive correlation between the intervals of the output for any α, including α > 2, while intervals in the output of the fGnDP model with H = 0.5 are independent. Thus, these two models behave quite differently in their non-LRD state. The non-LRD Pareto model is more flexible, allowing a
range of different FFC shapes and differences between the original data curves and the surrogate data curves. For the non-LRD fGnDP model, only the asymptotic value of the FFCs can be adjusted, by changing the value of r, and the original data and surrogate data FFCs are always identical. Therefore, with regard to these measures, the fGnDP with H = 0.5 is not an improvement over the Poisson input model, even for modeling non-LRD data. The fGnDP model is, however, more flexible than the Pareto model in producing weaker short-term dependence when LRD is present in the output. Due to the positive correlation that is present in the non-LRD Pareto model, the original data FFCs and IDCs are already significantly different from their surrogate data counterparts on shorter timescales when α = 2.0, the first value at which the model is LRD. In contrast, the original data curves for the fGnDP gradually differentiate themselves, at all timescales, from the surrogate data curves as LRD is introduced into the output. Thus, although these two models can produce outputs with similar statistical features, the statistical characteristics of their outputs are adjustable in different ways. In short, our simulations show that with proper adjustment of the parameters r and H, the fGnDP input model can produce outputs that are similar to cortical spike trains in that they possess both LRD and intervals with finite variance. The typical range of values for CVISI estimates in cortical spike trains is 0.5 to 1.5. Table 3 shows the approximate range of the inhibition-excitation ratio r, for each value of H in our simulations, that produces values of CVISI in this range. Using these data, and comparing the FFCs from our simulations (some of which are shown in Figures 9 through 11) with those shown in Teich et al. (1996), the ability of the fGnDP input IF model to match statistical properties of cortical spike trains can be directly evaluated.
For low values of H, the value of CVISI for the fGnDP model is within the physiological range for r between about 0.6 and 0.9. This is very similar to the Pareto model at α-values below, but close to, 2.0. For r values near 0.9, the FFCs for H ≈ 0.6 resemble those of cells 4 and 7 in Teich et al. (1996). For the Pareto model, however, such FFCs were created when α was greater than 2.0. The critical difference is that the H ≈ 0.6 fGnDP model is LRD, while the α > 2.0 Pareto model is not. However, since the slope of the FFCs is so shallow for the H ≈ 0.6 fGnDP model, the LRD in the output is difficult to distinguish using an FFC calculated from any reasonable-length sample of the process. For H in the neighborhood of 0.65 to 0.85, the upper limit of the r-values needed to produce physiological CVISI values is 0.8 to 0.9. For this combination of parameters, the FFCs of the fGnDP model resemble those of cells 3 and 19 of Teich et al. (1996). These FFCs suggest the presence of LRcD, but the surrogate data FFCs asymptote to a finite value above one. If r is reduced, then this asymptotic value drops below one, which does not match the results in Teich et al. (1996), although the LRcD is still present. This case may, however, match the FFCs of neurons with CVISI < 1.0 if they were
Table 3: Range of the Inhibition-Excitation Ratio r, for Each Value of H, That Leads to Values of CVISI Between 0.5 and 1.5 for the IF Model with fGnDP Inputs.

H       Range of r
0.50    0.69 < r < 0.96
0.55    0.66 < r < 0.95
0.60    0.63 < r < 0.94
0.65    0.59 < r < 0.92
0.70    0.55 < r < 0.90
0.75    0.50 < r < 0.87
0.80    0.45 < r < 0.81
0.85    0.32 < r < 0.77
0.90    0.25 < r < 0.64
0.95    0.06 < r < 0.28

Note: The ranges of r were approximated from plots of CVISI versus r (see Figure 8) obtained from simulations of the model.
available. As with the Pareto model, when r is increased, the FFCs still resemble the physiologically measured FFCs, but the values of CVISI become too large. As H increases to about 0.9 and above, the FFCs and CVISI values can no longer be matched to those measured physiologically. When these models have physiological values of CVISI, they are still LRcD, but the surrogate data FFCs asymptote at too low a value. As with the Pareto model at α ≈ 1.0, moderately large values of r tend to result in FFCs that suggest that the output of the model is renewal-like. Here again, this is most likely due to the model's approach to nonstationarity, which occurs at H = 1, coupled with the effect of tightly balanced excitation and inhibition. Also, much greater variability in the estimates of the Fano factor and the index of dispersion of intervals is evident in this parameter range. Therefore, like the Pareto input model, the fGnDP input model can produce outputs that share common statistical features with the spike trains of cortical neurons. However, the fGnDP input model has two significant advantages over the Pareto input model. First, whereas the inputs to the Pareto model that were required in order to produce LRD do not seem to be justified physiologically, the inputs to the fGnDP model are known to be statistically similar to the spike trains of neurons that project into cortex. Furthermore, while increasing the number of inputs to the Pareto input model makes it less likely to produce appropriate outputs, the success of the fGnDP input model is not affected by this change. This is due to the fortunate result that the superposition of a number of fGnDPs is also an fGnDP and does not therefore asymptotically approach a Poisson process.
6 Discussion A number of different types of IF models have been created in order to explain how cortical neurons can integrate over large numbers of inputs while still producing highly variable outputs. Although these models can produce values for the coefficient of variation of interspike intervals (CVISI ) similar to those calculated from in vivo cortical spike trains, we considered whether such models can also produce LRD in their outputs. Based on the observation that a large class of these models produces outputs that are renewal point processes, we were able to prove analytically that the output of these models cannot simultaneously have both LRD and a finite CVISI . Therefore, since the spike trains of in vivo cortical neurons are long-range dependent (see section 2.7), none of these renewal models produces highly variable outputs in a way that is consistent with the properties of cortical spike trains. Thus, their success in representing the cortical processing that leads to highly variable spike trains is doubtful. By considering LRD, in addition to the CVISI , we were able to show that a large portion of the single-neuron models of cortical high variability are incompatible with cortical spike trains. Shinomoto and his colleagues (Shinomoto & Sakai, 1998; Shinomoto, Sakai, & Funahashi, 1999) have previously demonstrated another statistical difference between cortical spike trains and the output of one of these models, the standard leaky IF model with Poisson inputs. They argued that the leaky IF model could not match both the coefficient of variation and the skewness coefficient of the interspike-interval density of cortical spike trains. The skewness coefficient, or just skewness, is the third central moment divided by the standard deviation cubed and is a measure of the symmetry of a distribution. Our result, however, has several advantages over the arguments in these articles. First, our result is more generally applicable. 
The approach that we have taken has allowed us to handle a large number of models analytically without having to deal with the details of each specific model. Second, their arguments were based on an approximation to the leaky IF model, whereas ours are directly applicable to the models themselves. Most of these models are analytically intractable in detail, but by considering general principles, we have been able to deal with them analytically without approximation. Third, in order to discount the models, Shinomoto and his colleagues placed restrictions on the parameter ranges of the models, something that was unnecessary in our arguments. Furthermore, the consideration, in this study, of a general statistical description that is applicable to many models adds to our intuition of cortical variability and narrows the range of feasible models for future studies. This general analytical approach does not, however, apply to one important type of single-neuron IF model that has previously been proposed as a solution to the high-variability problem. These models consist of an IF mechanism with inputs that are renewal point processes, but not Poisson point
processes as in the other models. Therefore, we examined these models in more detail. In particular, we investigated models with inputs that have either positive gaussian-distributed intervals or Pareto-distributed intervals, which have distribution tails that are shorter and longer, respectively, than those of a Poisson process. IF models with non-Poissonian renewal point process inputs are capable of producing correlation between the intervals in the superpositions of inputs that causes the intervals in the output of the model to be similarly correlated. If the intervals of the inputs are distributed according to a positive gaussian distribution, this correlation is negative. We expect this result to be true of any interval distribution that possesses a shorter tail than the exponential distribution, the interval distribution of a Poisson process. Although the gaussian input model can produce any value of CVISI if the inhibition-excitation ratio r is properly adjusted, the negative correlation created by these inputs hinders the production of LRD. Thus, in this sense, the gaussian input model is inferior to the standard Poisson input model with respect to modeling cortical spike trains. If the intervals of the RPP inputs are distributed according to a Pareto distribution, then positive correlation is produced in the IF model. We expect this to also be true of any interval distribution with a tail that is longer than the exponential distribution. Furthermore, if the tail of the distribution is long enough that the variance is infinite, then both the inputs and the output of the model are LRD. We found that by proper adjustment of α, a parameter of the Pareto distribution that affects the length of its tail, and the inhibition-excitation ratio r, we could produce spike trains with the Pareto input IF model that had interval variance and LRD similar to spike trains recorded from cortical neurons. 
Therefore, this model succeeds where the renewal IF models (i.e., those with renewal-point-process outputs) fail. Although the output of a properly adjusted Pareto input IF neuron is both long-range count dependent (LRcD) and long-range interval dependent (LRiD) with finite interval variance, the inputs do not possess LRiD, but instead are LRcD with infinite interval variance. This “conversion,” so to speak, is produced by the integration process in the model and has precedent in other models that aggregate processes with infinite variance (Mandelbrot, 1969; Taqqu & Levy, 1986; Lowen & Teich, 1993b, 1993c; Willinger, Taqqu, Sherman, & Wilson, 1995, 1997; Taqqu, Willinger, & Sherman, 1997; Heath, Resnick, & Samorodnitsky, 1998). However, although the output of this model is consistent with our current knowledge of the statistical structure of cortical spike trains, the inputs are not consistent with the statistical structure of either cortical or subcortical spike trains, including those in neurons that project into cortex. This inconsistency seriously undermines the credibility of the renewal input IF models. In addition, as the number of inputs increases, this model approximates a Poisson input model, which cannot successfully produce realistic CVISI values and LRD. Hence, it may
B. Jackson
not be able to explain these properties in neurons that receive an extremely large number of inputs. We therefore suggested a model with inputs that are LRD and statistically similar to the spike trains of actual neurons. The fractional-gaussian-noise-driven Poisson process (fGnDP) has elsewhere been used as a model for sensory neurons in the periphery, such as primary auditory neurons, in order to model the LRD present in their spike trains. The same statistical characteristics present in these neurons are known to exist in many subcortical sensory pathways, including certain neurons that project into the cortex. Thus, this process seems to be a reasonable choice for the inputs to the IF model of cortical neurons. We found that, as with the Pareto inputs, the fGnDP inputs did indeed produce LRD in the output of the IF model. However, in the fGnDP input model, both the inputs and the outputs had LRcD and LRiD, and neither had intervals with infinite variance. By proper adjustment of the Hurst index H, a parameter that modulates the strength of the LRD present in an fGnDP, and the inhibition-excitation ratio r, we were able to produce output spike trains with the fGnDP input IF model that approximated the LRD and variability of cortical spike trains. Furthermore, these results show that a tight balance between the amounts of excitation and inhibition at the inputs is not necessary for high interspike-interval variability at the output. Although the output of this model is not necessarily more similar to that of cortical neurons than the Pareto input IF model, the physiological justification of its fGnDP inputs is reason enough to favor it over the Pareto model. Furthermore, unlike the Pareto input model, the fGnDP input model has the advantageous property that it does not lose its ability to produce output spike trains that are statistically similar to cortical neurons as the number of inputs increases.
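A minimal sketch of an fGnDP along these lines, assuming a discrete-time approximation in which fractional gaussian noise (generated exactly via Cholesky factorization of its autocovariance) modulates the rate of a binwise Poisson process; the Hurst index, baseline rate, and gain below are illustrative choices, not values from this study:

```python
import numpy as np

rng = np.random.default_rng(1)

def fgn(n, hurst, sigma=1.0, rng=rng):
    # Exact fractional gaussian noise via Cholesky factorization of its
    # autocovariance gamma(k) = (sigma^2/2)(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H}).
    k = np.arange(n)
    gamma = 0.5 * sigma**2 * (np.abs(k + 1) ** (2 * hurst)
                              - 2 * np.abs(k) ** (2 * hurst)
                              + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]  # stationary: cov[i,j] = gamma(|i-j|)
    L = np.linalg.cholesky(cov)
    return L @ rng.standard_normal(n)

def fgndp_counts(n_bins=2048, hurst=0.8, base_rate=5.0, gain=2.0, rng=rng):
    # fGn-driven Poisson process: the rate in each time bin is a baseline plus
    # scaled fGn, clipped at zero; counts are Poisson given that rate.
    rate = np.clip(base_rate + gain * fgn(n_bins, hurst, rng=rng), 0.0, None)
    return rng.poisson(rate)

counts = fgndp_counts()
```

Cholesky synthesis is O(n³) and only practical for short records; for long records, a circulant-embedding (Davies-Harte-style) generator would be the usual substitute.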
This is a consequence of the fact that the superposition of many fGnDPs is also an fGnDP and does not approximate a Poisson process. Due to the statistical similarity between fGnDPs and neural spike trains, this fact also has serious implications for the common assumption that the superposition of a large number of neural spike trains approximates a Poisson process. Other recent studies have also had success using an IF model with temporally correlated inputs to match the statistics of cortical spike trains. Salinas and Sejnowski (2002) have derived analytical results for the means and variances of interspike intervals at the outputs of both nonleaky and leaky IF neurons driven by temporally correlated noise. They concluded from their results that the temporal correlations in the inputs to a neuron are an important factor affecting the variability at the output, which is what we have found as well. However, their study did not look at temporal correlations, much less long-range correlations, in the outputs. Sakai et al. (1999) and Shinomoto and Tsubo (2001) have shown that temporally correlated inputs to leaky IF models can produce not only coefficients of variation and skewness coefficients of interspike intervals but also correlations between consecutive interspike intervals that match those estimated from cortical
spike trains. Although their study of the correlation between consecutive interspike intervals would not reveal long-term correlations, their result is consistent with the production of long-term correlations in the output of IF models by long-term correlations in the inputs, as we found for the fGnDP input model. Although some of the input correlations in these studies extended into the infinite past, they were short term. In other words, they decreased fast enough to result in the convergence of their infinite sum (see section 2.1). Based on our results, for which LRD was necessary at the input, this suggests that these models would not produce LRD. However, this is a relatively weak argument, and a resolution to this question would require further analysis of these models.

Although the fGnDP input IF model seems to be the best "single-neuron" model heretofore suggested for matching both the interspike-interval variability and LRD of cortical spike trains, network models, like those suggested by Usher and his colleagues (1994, 1995), may also succeed at this task. In the network model of Usher and his coworkers, the output of each individual neuron is highly variable due to the complex network dynamics. The spike trains of individual units in their model did exhibit both high CVISI values and "long-term fluctuations," but these "long-term fluctuations" extended only to time frames on the order of 1 second, much shorter than the dependence in our fGnDP input model and in biological neurons. If network models are, however, capable of producing LRD, then many of the inputs to each model cortical neuron will be statistically similar to the fGnDPs used as inputs in this study, since many of the inputs to each individual neuron will be outputs from other neurons within the network. The main statistical difference will be that the variability of the interspike intervals in these interneuron connections will need to be higher than that in fGnDPs.
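One measure that is sensitive to such long-term structure is the Fano factor curve F(T) = Var{N(T)}/E{N(T)} over counting windows of increasing length T: for a Poisson process it stays near 1, whereas unbounded growth with T indicates long-range count dependence. A hedged sketch (the window sizes and the Poisson control train are illustrative):

```python
import numpy as np

def fano_curve(spike_times, windows):
    # Fano factor F(T) = Var{N(T)}/E{N(T)} of spike counts in disjoint
    # windows of length T. F(T) ~ 1 for a Poisson process at every T;
    # unbounded growth with T suggests long-range count dependence.
    spike_times = np.asarray(spike_times)
    duration = spike_times[-1]
    out = []
    for T in windows:
        edges = np.arange(0.0, duration, T)
        counts = np.histogram(spike_times, bins=edges)[0]
        out.append(counts.var() / counts.mean())
    return np.array(out)

rng = np.random.default_rng(5)
# Poisson control: the Fano curve should remain near 1 at all window sizes.
poisson_train = np.cumsum(rng.exponential(1.0, 200_000))
fano = fano_curve(poisson_train, windows=[1, 10, 100])
```

Applied to an LRD train (e.g., the output of an fGnDP-driven model), the same estimator would show F(T) rising as a power of T instead of flattening.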
The results presented in this study highlight the need for more thorough study of the variability of cortical spike trains before judging the validity of models of this variability and drawing general conclusions from them. In particular, they demonstrate the insufficiency of the CVISI estimator alone as a measure of this variability and the necessity of also using a measure that is sensitive to the correlational structure of the spike trains. We have assumed, for the sake of argument, that empirical measurements of the CVISI of cortical spike trains, which have resulted in values typically in the range from 0.5 to 1.2, are valid. However, another possible scenario exists. If the "real" CVISI of cortical spike trains were actually very large, such that for all practical purposes it could be assumed infinite, estimates of CVISI from short-duration recordings of cortical spike trains might still be small. Under such conditions, it is theoretically possible that cortical spike trains could be long-range dependent, but that empirical estimates of CVISI could be on the order of one. However, since CVISI estimates in this case would increase with increasing recording duration, and since such estimates have been measured under numerous recording conditions by many different
researchers, this potentiality seems improbable. More likely, under such conditions, empirical estimates of CVISI would be widely variable and span a much larger range than 0.5 to 1.2. However, to rule out this possibility, the asymptotic behavior of the CVISI estimator applied to cortical spike trains should be investigated. Practically, this may be accomplished for a particular spike train recording by studying the trend of the CVISI estimates as more and more of the record length is included in the calculation. If the total record length is long enough, this analysis should reveal whether the CVISI estimate converges or diverges.

Appendix: Proof of Theorem 1

Daley (1999) contains only a terse argument for theorem 1. Furthermore, there is a mismatch between the text and the equation in his argument. The text specifies when equation 1.2 in that article diverges, but equation 1.2 is an equation for Var{N(0, t]}, not Var{N(0, t]}/t. The former, which diverges for most interesting stochastic point processes, is only a necessary condition for long-range dependence, whereas the latter is the definition of long-range dependence. For these reasons, we have included the following proof (following Daley's argument) of theorem 1.

Proof. Since the interpoint interval random variable X has a finite mean, the moments E{N(0, 1]} = 1/µ and E{N²(0, 1]} are finite, where N(·) is the counting measure of the point process. Therefore, by theorem 3.5.III of Daley and Vere-Jones (1988),

E{N²(0, t]} = ∫₀ᵗ [2U(s) − 1]/µ ds,

where

U(t) = lim_{h↓0} Σ_{j=0}^∞ Pr{N(0, t] ≥ j | N(−h, 0] > 0}

is called the expectation function. For a general stationary ergodic point process with finite second moments, U(t) is the analog of the renewal function (and, for a renewal point process, is the renewal function). The variance function can now be written as

Var{N(0, t]} = E{N²(0, t]} − (E{N(0, t]})²
             = ∫₀ᵗ [2U(s) − 1]/µ ds − (t/µ)²
             = ∫₀ᵗ { [2U(s) − 1]/µ − 2s/µ² } ds
             = (2/µ) ∫₀ᵗ [U(s) − s/µ] ds − t/µ.

So,

Var{N(0, t]}/t = (2/µ) (1/t) ∫₀ᵗ [U(s) − s/µ] ds − 1/µ,

which goes to infinity as t → ∞ if and only if the integrand goes to infinity. In other words,

lim_{t→∞} Var{N(0, t]}/t = ∞   if and only if   lim_{t→∞} [U(t) − t/µ] = ∞,   (A.1)

where the first limit is the definition (see definition 3) of LRcD. Now, if σ² = Var{X}, then for a renewal point process (Feller, 1971),

0 ≤ U(t) − t/µ → (σ² + µ²)/(2µ²)   as t → ∞,

where the right side is replaced by ∞ if Var{X} does not exist. Therefore,

lim_{t→∞} [U(t) − t/µ] = ∞   if and only if   E{X²} = ∞.   (A.2)

Putting statements A.1 and A.2 together yields the desired result.

It may prove instructive to consider the following simpler alternate proof of the fact that an LRcD renewal point process necessarily has infinitely variable interpoint intervals (i.e., the "only if" part of the "if and only if" of theorem 1). This proof requires only a well-known application of the central limit theorem to renewal point processes (e.g., Cox, 1967; Feller, 1971). If the variance of the generic interpoint interval random variable X is finite, then for large t, N(0, t] is asymptotically normally distributed with mean t/µ and variance σ²t/µ³, where µ and σ² are the mean and variance of X, respectively. Therefore, if E{X²} < ∞, then

lim_{t→∞} Var{N(0, t]}/t = σ²/µ³ < ∞,

and the process is not LRcD. Thus, if the renewal point process is LRcD, then E{X²} = ∞.
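Theorem 1 can also be checked numerically. The sketch below (a Monte Carlo illustration, not part of the proof; the values of t, α, and the trial count are arbitrary choices) estimates Var{N(0, t]}/t for a renewal process with exponential intervals, where E{X²} < ∞ and the ratio converges to σ²/µ³ = 1 at unit rate, and with Pareto intervals (α = 1.5, x_m = 1), where E{X²} = ∞ and the ratio grows with t:

```python
import numpy as np

rng = np.random.default_rng(3)

def var_count_over_t(sample_intervals, t, n_trials=2000, rng=rng):
    # Monte Carlo estimate of Var{N(0, t]}/t for a renewal process:
    # draw interval sequences and count events in (0, t].
    counts = np.empty(n_trials)
    for i in range(n_trials):
        x = sample_intervals(rng)          # enough intervals to pass time t
        times = np.cumsum(x)
        counts[i] = np.searchsorted(times, t, side="right")
    return counts.var() / t

# Exponential intervals at unit rate: E{X^2} finite, Var{N(0,t]}/t -> 1.
v_exp = var_count_over_t(lambda r: r.exponential(1.0, 500), 200.0)

# Pareto intervals, alpha = 1.5 (E{X^2} infinite): Var{N(0,t]}/t grows with t.
pareto = lambda r: r.random(500) ** (-1.0 / 1.5)
v_par_small = var_count_over_t(pareto, 20.0)
v_par_large = var_count_over_t(pareto, 200.0)
```

For the exponential case the renewal process is Poisson, so the ratio should hover near 1; for the Pareto case the ratio should be visibly larger at t = 200 than at t = 20, consistent with the divergence asserted by the theorem.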
Acknowledgments

I am extremely grateful to Laurel Carney for many helpful discussions, suggestions, and editorial readings of the manuscript for this article and for the many other ways that she supported this study. I also thank Hyune-Ju Kim and Phillip Griffin for comments on an earlier version. This research was supported by the Lewis DiCarlo Endowment, the Jerome R. & Arlene L. Gerber Fund, and NIH R01DC01641.

References

Bardet, J.-M., Lang, G., Oppenheim, G., Philippe, A., & Taqqu, M. S. (2002). Generators of long-range dependent processes: A survey. In P. Doukhan, G. Oppenheim, & M. S. Taqqu (Eds.), Long-range dependence: Theory and applications. Boston: Birkhäuser. Beran, J. (1994). Statistics for long-memory processes. Boca Raton, FL: Chapman & Hall/CRC Press. Brown, D., & Feng, J. (1999). Is there a problem matching real and model CV(ISI)? Neurocomputing, 26–27, 87–91. Bugmann, G., Christodoulou, C., & Taylor, J. G. (1997). Role of temporal integration and fluctuation detection in the highly irregular firing of a leaky integrator neuron model with partial reset. Neural Computation, 9(5), 985–1000. Burkitt, A. N. (2000). Interspike interval variability for balanced networks with reversal potentials for large numbers of inputs. Neurocomputing, 32–33, 313–321. Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biological Cybernetics, 85(4), 247–255. Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society, Series B, 17(2), 129–157. Cox, D. R. (1967). Renewal theory. London: Chapman and Hall. Cox, D. R. (1984). Long-range dependence: A review. In H. A. David & H. T. David (Eds.), Statistics: An appraisal (pp. 55–74). Ames, IA: Iowa State University Press. Cox, D. R., & Isham, V. (1980). Point processes. London: Chapman and Hall. Cox, D. R., & Smith, W. L. (1954). On the superposition of renewal processes. Biometrika, 41(1/2), 91–99.
Daley, D. J. (1999). The Hurst index of long-range dependent renewal processes. Annals of Probability, 27(4), 2035–2041. Daley, D. J., Rolski, T., & Vesilo, R. (2000). Long-range dependent point processes and their Palm-Khinchin distributions. Advances in Applied Probability, 32(4), 1051–1063. Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag. Daley, D. J., & Vesilo, R. (1997). Long range dependence of point processes, with queueing examples. Stochastic Processes and Their Applications, 70(2), 265–282.
Davies, R. B., & Harte, D. S. (1987). Tests for Hurst effect. Biometrika, 74(1), 95–101. Davis, R., & Resnick, S. (1986). Limit theory for the sample covariance and correlation functions of moving averages. Annals of Statistics, 14(2), 533–558. Destexhe, A., & Pare, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. Journal of Neurophysiology, 81(4), 1531–1547. Destexhe, A., & Pare, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32–33, 113–119. Enns, E. G. (1970). A stochastic superposition process and an integral inequality for distributions with monotone hazard rates. Australian Journal of Statistics, 12, 44–49. Fano, U. (1947). Ionization yield of radiations. II. The fluctuations of the number of ions. Physical Review, 72, 26–29. Feller, W. (1971). An introduction to probability theory and its applications (2nd ed.). New York: Wiley. Feng, J. (1997). Behaviours of spike output jitter in the integrate-and-fire model. Physical Review Letters, 79(22), 4505–4508. Feng, J. (1999). Origin of firing variability of the integrate-and-fire model. Neurocomputing, 26–27, 117–122. Feng, J. (2001). Is the integrate-and-fire model good enough? A review. Neural Networks, 14(6–7), 955–975. Feng, J., & Brown, D. (1998a). Impact of temporal variation and the balance between excitation and inhibition on the output of the perfect integrate-andfire model. Biological Cybernetics, 78(5), 369–376. Feng, J., & Brown, D. (1998b). Spike output jitter, mean firing time and coefficient of variation. Journal of Physics A, 31(4), 1239–1252. Feng, J., & Brown, D. (1999). Coefficient of variation of interspike intervals greater than 0.5. How and when? Biological Cybernetics, 80(5), 291–297. Feng, J., & Brown, D. (2000a). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12(3), 671–692. 
Feng, J., & Brown, D. (2000b). Integrate-and-fire models with nonlinear leakage. Bulletin of Mathematical Biology, 62(3), 467–481. Feng, J., Tirozzi, B., & Brown, D. (1998). Output jitter diverges to infinity, converges to zero or remains constant. In M. Verleysen (Ed.), 6th European Symposium on Artificial Neural Networks. ESANN’98 (pp. 39–45). Brussels, Belgium: D-Facto. Feng, J., & Zhang, P. (2001). Behavior of integrate-and-fire and Hodgkin-Huxley models with correlated inputs. Physical Review E, 63(5, pt. 1–2), 051902/1–11. Gerstein, G., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68. Gruneis, F., Nakao, M., Mizutani, Y., Yamamoto, M., Meesmann, M., & Musha, T. (1993). Further study on 1/f fluctuations observed in central single neurons during REM sleep. Biological Cybernetics, 68(3), 193–198. Gruneis, F., Nakao, M., & Yamamoto, M. (1990). Counting statistics of 1/f fluctuations in neuronal spike trains. Biological Cybernetics, 62(5), 407–413.
Gruneis, F., Nakao, M., Yamamoto, M., Musha, T., & Nakahama, H. (1989). An interpretation of 1/f fluctuations in neuronal spike trains during dream sleep. Biological Cybernetics, 60(3), 161–169. Heath, D., Resnick, S., & Samorodnitsky, G. (1998). Heavy tails and long range dependence in on/off processes and associated fluid models. Mathematics of Operations Research, 23(1), 145–165. Hurst, H. E. (1951). Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers, 116, 770–799. Jackson, B. S. (2003). Consequences of long-range temporal dependence in neural spiking activity for theories of processing and coding. Unpublished doctoral dissertation, Syracuse University. Khintchine, A. Y. (1960). Mathematical methods in the theory of queueing. New York: Hafner Publishing Co. Kodama, T., Mushiake, H., Shima, K., Nakahama, H., & Yamamoto, M. (1989). Slow fluctuations of single unit activities of hippocampal and thalamic neurons in cats. I. Relation to natural sleep and alert states. Brain Research, 487(1), 26–34. Kulik, R., & Szekli, R. (2001). Sufficient conditions for long-range count dependence of stationary point processes on the real line. Journal of Applied Probability, 38(2), 570–581. Kumar, A. R., & Johnson, D. H. (1993). Analyzing and modeling fractal intensity point processes. Journal of the Acoustical Society of America, 93(6), 3365–3373. Lansky, P., & Smith, C. E. (1989). The effect of a random initial value in neural first-passage-time models. Mathematical Biosciences, 93(2), 191–215. Lawrance, A. J. (1973). Dependency of intervals between events in superposition processes. Journal of the Royal Statistical Society. Series B, 35(2), 306–315. Ledesma, S., & Derong, L. (2000). Synthesis of fractional gaussian noise using linear approximation for generating self-similar network traffic. Computer Communication Review, 30(2), 4–17. Lowen, S. B., Ozaki, T., Kaplan, E., Saleh, B. E. A., & Teich, M. C. (2001). 
Fractal features of dark, maintained, and driven neural discharges in the cat visual system. Methods, 24(4), 377–394. Lowen, S. B., & Teich, M. C. (1991). Doubly stochastic Poisson point process driven by fractal shot noise. Physical Review A, 43(8), 4192–4215. Lowen, S. B., & Teich, M. C. (1992). Auditory-nerve action potentials form a nonrenewal point process over short as well as long time scales. Journal of the Acoustical Society of America, 92(2 pt. 1), 803–806. Lowen, S. B., & Teich, M. C. (1993a). Estimating the dimension of a fractal point process. Proceedings of the SPIE, 2036, 64–76. Lowen, S. B., & Teich, M. C. (1993b). Fractal renewal processes. IEEE Transactions on Information Theory, 39(5), 1669–1671. Lowen, S. B., & Teich, M. C. (1993c). Fractal renewal processes generate 1/f noise. Physical Review E, 47(2), 992–1001. Lowen, S. B., & Teich, M. C. (1995). Estimation and simulation of fractal stochastic point processes. Fractals, 3(1), 183–210. Lowen, S. B., & Teich, M. C. (1996a). Refractoriness-modified fractal stochastic point processes for modeling sensory-system spike trains. In J. M. Bower (Ed.), Computational neuroscience (pp. 447–452). San Diego: Academic Press.
Lowen, S. B., & Teich, M. C. (1996b). The periodogram and Allan variance reveal fractal exponents greater than unity in auditory-nerve spike trains. Journal of the Acoustical Society of America, 99(6), 3585–3591. Lowen, S. B., & Teich, M. C. (1997). Estimating scaling exponents in auditory-nerve spike trains using fractal models incorporating refractoriness. In E. R. Lewis, G. R. Long, R. F. Lyon, P. M. Narins, C. R. Steele, & E. Hecht-Pointar (Eds.), Diversity in auditory mechanics (pp. 197–204). Singapore: World Scientific. Mandelbrot, B. B. (1965). Une classe de processus stochastiques homothétiques à soi: Application à la loi climatologique de H. E. Hurst. Comptes Rendus de l'Académie des Sciences de Paris, 240, 3274–3277. Mandelbrot, B. B. (1969). Long-run linearity, locally gaussian process, H-spectra and infinite variances. International Economic Review, 10(1), 82–111. Mandelbrot, B. B., & van Ness, J. W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4), 422–437. Mandelbrot, B. B., & Wallis, J. R. (1968). Noah, Joseph and operational hydrology. Water Resources Research, 4(5), 909–918. Mandelbrot, B. B., & Wallis, J. R. (1969a). Computer experiments with fractional gaussian noises. Water Resources Research, 5(1), 228–267. Mandelbrot, B. B., & Wallis, J. R. (1969b). Robustness of the rescaled range R/S in the measurement of noncyclic long run statistical dependence. Water Resources Research, 5, 967–988. Mandelbrot, B. B., & Wallis, J. R. (1969c). Some long-run properties of geophysical records. Water Resources Research, 5, 321–340. Mushiake, H., Kodama, T., Shima, K., Yamamoto, M., & Nakahama, H. (1988). Fluctuations in spontaneous discharge of hippocampal theta cells during sleep-waking states and PCPA-induced insomnia. Journal of Neurophysiology, 60(3), 925–939. Paxson, V. (1997). Fast, approximate synthesis of fractional gaussian noise for generating self-similar network traffic.
Computer Communication Review, 27(5), 5–18. Sakai, Y., Funahashi, S., & Shinomoto, S. (1999). Temporally correlated inputs to leaky integrate-and-fire models can reproduce spiking statistics of cortical neurons. Neural Networks, 12(7–8), 1181–1190. Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20(16), 6193–6209. Salinas, E., & Sejnowski, T. J. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14(9), 2111–2155. Samorodnitsky, G., & Taqqu, M. S. (1994). Stable non-gaussian random processes: Stochastic models with infinite variance. Boca Raton, FL: Chapman & Hall/CRC. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4(4), 569–579. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870–3896. Shinomoto, S., & Sakai, Y. (1998). Spiking mechanisms of cortical neurons. Philosophical Magazine B, 77(5), 1549–1555.
Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Computation, 11(4), 935–951. Shinomoto, S., & Tsubo, Y. (2001). Modeling spiking behavior of neurons with time-dependent Poisson processes. Physical Review E, 64(4–1), 041910. Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58(1), 13–41. Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5(2), 239–247. Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4(5), 643–646. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13(1), 334–350. Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1(3), 210–217. Taqqu, M. S., & Levy, J. B. (1986). Using renewal processes to generate longrange dependence and high variability. In E. Eberlein & M. S. Taqqu (Eds.), Dependence in probability and statistics: A survey of recent results (Vol. 11, pp. 73–89). Boston: Birkhauser. Taqqu, M. S., Willinger, W., & Sherman, R. (1997). Proof of a fundamental result in self-similar traffic modeling. Computer Communication Review, 27(2), 5– 23. Teich, M. C. (1989). Fractal character of the auditory neural spike train. IEEE Transactions on Biomedical Engineering, 36(1), 150–160. Teich, M. C. (1992). Fractal neuronal firing patterns. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 589–625). Boston: Academic Press. Teich, M. C. (1996). Self-similarity in sensory neural signals. In J. K. J. Li & S. S. Reisman (Eds.), Proceedings of the IEEE 22nd Annual Northeast Bioengineering Conference (pp. 36–37). New Brunswick, NJ: IEEE. Teich, M. C., Heneghan, C., Lowen, S. 
B., Ozaki, T., & Kaplan, E. (1997). Fractal character of the neural spike train in the visual system of the cat. Journal of the Optical Society of America A, 14(3), 529–546. Teich, M. C., Johnson, D. H., Kumar, A. R., & Turcott, R. G. (1990). Rate fluctuations and fractional power-law noise recorded from cells in the lower auditory pathway of the cat. Hearing Research, 46(1–2), 41–52. Teich, M. C., & Lowen, S. B. (1994). Fractal patterns in auditory nerve-spike trains. IEEE Engineering in Medicine and Biology Magazine, 13(2), 197–202. Teich, M. C., Turcott, R. G., & Lowen, S. B. (1990). The fractal doubly stochastic Poisson point process as a model for the cochlear neural spike train. In P. Dallos, C. D. Geisler, J. W. Matthews, M. A. Ruggero, & C. R. Steele (Eds.), The mechanics and biophysics of hearing (Vol. 87, pp. 354–361). New York: Springer-Verlag. Teich, M. C., Turcott, R. G., & Siegel, R. M. (1996). Temporal correlation in cat striate-cortex neural spike trains. IEEE Engineering in Medicine and Biology Magazine, 15(5), 79–87.
Thurner, S., Lowen, S. B., Feurstein, M. C., Heneghan, C., Feichtinger, H. G., & Teich, M. C. (1997). Analysis, synthesis, and estimation of fractal-rate stochastic point processes. Fractals, 5, 565–596. Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9(5), 971–983. Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6(2), 111–124. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press. Turcott, R. G., Barker, D. R., & Teich, M. C. (1995). Long-duration correlation in the sequence of action potentials in an insect visual interneuron. Journal of Statistical Computation and Simulation, 52(3), 253–271. Turcott, R. G., Lowen, S. B., Li, E., Johnson, D. H., Tsuchitani, C., & Teich, M. C. (1994). A nonstationary Poisson point process describes the sequence of action potentials over long time scales in lateral-superior-olive auditory neurons. Biological Cybernetics, 70(3), 209–217. Turcott, R. G., & Teich, M. C. (1996). Fractal character of the electrocardiogram: Distinguishing heart-failure and normal patients. Annals of Biomedical Engineering, 24(2), 269–293. Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Computation, 6(5), 795–836. Usher, M., Stemmler, M., & Olami, Z. (1995). Dynamic pattern formation leads to 1/f noise in neural populations. Physical Review Letters, 74(2), 326–329. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10(6), 1321–1371. Wilbur, W. 
J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. Journal of Theoretical Biology, 105(2), 345–368. Willinger, W., Taqqu, M. S., Sherman, R., & Wilson, D. V. (1995). Self-similarity through high-variability: Statistical analysis of ethernet LAN traffic at the source level. Computer Communication Review, 25(4), 100–113. Willinger, W., Taqqu, M. S., Sherman, R., & Wilson, D. V. (1997). Self-similarity through high-variability: Statistical analysis of ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 5(1), 71–86. Wise, M. E. (1981). Spike interval distributions for neurons and random walks with drift to a fluctuating threshold. In C. Taillie, G. Patil, & B. Baldessari (Eds.), Statistical distributions in scientific work (Vol. 6, pp. 211–231). Boston: Reidel. Yamamoto, M., Nakahama, H., Shima, K., Kodama, T., & Mushiake, H. (1986). Markov-dependency and spectral analyses on spike-counts in mesencephalic reticular neurons during sleep and attentive states. Brain Research, 366(1–2), 279–289. Received August 15, 2003; accepted April 4, 2004.
LETTER
Communicated by Sam Roweis
Learning Eigenfunctions Links Spectral Embedding and Kernel PCA Yoshua Bengio [email protected]
Olivier Delalleau [email protected]
Nicolas Le Roux [email protected]
Jean-François Paiement [email protected]
Pascal Vincent [email protected]
Marie Ouimet [email protected] Département d'Informatique et Recherche Opérationnelle, Centre de Recherches Mathématiques, Université de Montréal, Montréal, Québec, Canada, H3C 3J7
In this letter, we show a direct relation between spectral embedding methods and kernel principal components analysis and how both are special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods provided only coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for multidimensional scaling (MDS), spectral clustering, Laplacian eigenmaps, locally linear embedding (LLE), and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering, and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.

Neural Computation 16, 2197–2219 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

In the past few years, many unsupervised learning algorithms have been proposed that share the use of an eigendecomposition for obtaining a lower-dimensional embedding of the data that characterize a nonlinear manifold near which the data would lie: locally linear embedding (LLE) (Roweis &
Saul, 2000), Isomap (Tenenbaum, de Silva, & Langford, 2000), and Laplacian eigenmaps (Belkin & Niyogi, 2003). There are also many variants of spectral clustering (Weiss, 1999; Ng, Jordan, & Weiss, 2002) in which such an embedding is an intermediate step before obtaining a clustering of the data that can capture flat, elongated, and even curved clusters. The two tasks (manifold learning and clustering) are linked because the clusters that spectral clustering manages to capture can be arbitrarily curved manifolds (as long as there are enough data to locally capture the curvature of the manifold). Clustering and manifold learning are intimately related: both clusters and manifolds are zones of high density. All of these unsupervised learning methods are thus capturing salient features of the data distribution. As shown here, spectral clustering is in fact working in a way that is very similar to manifold learning algorithms. The end result of most inductive machine learning algorithms is a function that minimizes the empirical average of a loss criterion (possibly plus regularization). The function can be applied to new points, and for such learning algorithms, it is clear that the ideal solution is a function that minimizes the expected loss under the unknown true distribution from which the data were sampled, also known as the generalization error. However, such a characterization was missing for spectral embedding algorithms such as metric multidimensional scaling (MDS) (Cox & Cox, 1994), spectral clustering (see Weiss, 1999, for a review), Laplacian eigenmaps, LLE, and Isomap, which are used for either dimensionality reduction or clustering. They do not provide a function that can be applied to new points, and the notion of generalization error is not clearly defined. This article seeks to provide an answer to these questions. A loss criterion for spectral embedding algorithms can be defined. It is a reconstruction error that depends on pairs of examples.
Minimizing its average value yields the eigenvectors that provide the classical output of these algorithms, that is, the embeddings. Minimizing its expected value over the true underlying distribution yields the eigenfunctions of a linear operator that is defined by a kernel (which is not necessarily positive semidefinite) and the data-generating density. When the kernel is positive semidefinite and we work with the empirical density, there is a direct correspondence between these algorithms and kernel principal components analysis (PCA) (Schölkopf, Smola, & Müller, 1998). Our work is therefore a direct continuation of previous work (Williams & Seeger, 2000) noting that the Nyström formula and the kernel PCA projection (which are equivalent) represent an approximation of the eigenfunctions of the above linear operator (called G here). Previous analysis of the convergence of the generalization error of kernel PCA (Shawe-Taylor, Cristianini, & Kandola, 2002; Shawe-Taylor & Williams, 2003) also helps to justify the view that these methods are estimating the convergent limit of some eigenvectors (at least when the kernel is positive semidefinite). The eigenvectors can thus be turned into estimators of eigenfunctions, which can therefore be applied to new points, turning the spectral embedding algorithms into function induction algorithms.

Learning Eigenfunctions Links Spectral Embedding and Kernel PCA

The Nyström formula obtained this way is well known (Baker, 1977) and will be given in equation 1.1. This formula has been used previously for estimating extensions of eigenvectors in gaussian process regression (Williams & Seeger, 2001), and Williams and Seeger (2000) noted that it corresponds to the projection of a test point computed with kernel PCA. In order to extend spectral embedding algorithms such as LLE and Isomap to out-of-sample examples, this article defines for these spectral embedding algorithms data-dependent kernels K_n that can be applied outside the training set. See also the independent work of Ham, Lee, Mika, and Schölkopf (2003) for a kernel view of LLE and Isomap, where, however, the kernels are applied only on the training set. Additional contributions of this article include a characterization of the empirically estimated eigenfunctions in terms of eigenvectors in the case where the kernel is not positive semidefinite (often the case for MDS and Isomap), a convergence theorem linking the Nyström formula to the eigenfunctions of G, and experiments on MDS, Isomap, LLE, spectral clustering, and Laplacian eigenmaps showing that the Nyström formula for out-of-sample examples is accurate.

All of the algorithms described in this article start from a data set D = (x_1, ..., x_n) with x_i in R^d sampled independently and identically distributed (i.i.d.) from an unknown distribution with density p. Below we will use the notation

$$E_x[f(x)] = \int f(x)\, p(x)\, dx$$

for averaging over p(x) and

$$\hat E_x[f(x)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i)$$
for averaging over the data in D, that is, over the empirical distribution, denoted p̂(x). We will denote kernels with K_n(x, y) or K̃(x, y): symmetric functions, not always positive semidefinite, that may depend not only on x and y but also on the data D. The spectral embedding algorithms construct an affinity matrix M, either explicitly through M_ij = K_n(x_i, x_j) or implicitly through a procedure that takes the data D and computes M. We denote by v_ik the ith coordinate of the kth eigenvector of M, associated with the eigenvalue ℓ_k. With these notations, the Nyström formula discussed above can be written

$$f_{k,n}(x) = \frac{\sqrt{n}}{\ell_k} \sum_{i=1}^{n} v_{ik}\, K_n(x, x_i), \qquad (1.1)$$

where f_{k,n} is the kth Nyström estimator with n samples. We will show in section 4 that it estimates the kth eigenfunction of a linear operator.
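As a concrete illustration, the Nyström estimator of equation 1.1 can be evaluated directly from the eigendecomposition of the Gram matrix. The following numpy sketch uses a gaussian kernel and random data purely as illustrative assumptions (neither is prescribed by the article); it checks that on a training point the estimator reproduces the corresponding eigenvector entry scaled by sqrt(n), a fact proved in section 4:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))        # training set D with n = 60 points in R^3
n = len(X)

def K(a, b, sigma=2.0):
    """An illustrative data-independent kernel (gaussian); K_n in general
    may also depend on the whole data set D."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

M = np.array([[K(a, b) for b in X] for a in X])   # Gram matrix M_ij
ell, V = np.linalg.eigh(M)                        # eigenvalues and eigenvectors
ell, V = ell[::-1], V[:, ::-1]                    # decreasing eigenvalue order

def f(k, x):
    """Nystrom estimator f_{k,n}(x) = (sqrt(n)/ell_k) sum_i v_ik K(x, x_i)."""
    return np.sqrt(n) / ell[k] * sum(V[i, k] * K(x, X[i]) for i in range(n))

# On a training point, the formula recovers the eigenvector entry times sqrt(n).
assert np.allclose(f(0, X[7]), np.sqrt(n) * V[7, 0])
# It also extends the embedding smoothly to a point outside D.
x_new = rng.normal(size=3)
print(f(0, x_new))
```

The same function f applies unchanged to out-of-sample points, which is exactly the function induction property discussed above.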
2 Kernel Principal Component Analysis

Kernel PCA is an unsupervised manifold learning technique that maps data points to a generally lower-dimensional space. It generalizes the principal component analysis approach to nonlinear transformations using the kernel trick (Schölkopf, Smola, & Müller, 1996; Schölkopf et al., 1998; Schölkopf, Burges, & Smola, 1999). The algorithm implicitly finds the leading eigenvectors and eigenvalues of the covariance of the projection φ(x) of the data in feature space, where φ(x) is such that the kernel K_n(x, y) = φ(x) · φ(y) (i.e., K_n must not have negative eigenvalues). If the data are centered in feature space (Ê_x[φ(x)] = 0), the feature space covariance matrix is

$$C = \hat E_x[\phi(x)\phi(x)^\top], \qquad (2.1)$$
with eigenvectors w_k and eigenvalues λ_k. To obtain a centered feature space, a kernel K̃ (e.g., the gaussian kernel) is first normalized into a data-dependent kernel K_n via

$$K_n(x, y) = \tilde K(x, y) - \hat E_{x'}[\tilde K(x', y)] - \hat E_{y'}[\tilde K(x, y')] + \hat E_{x'}\!\left[\hat E_{y'}[\tilde K(x', y')]\right]. \qquad (2.2)$$

The eigendecomposition of the corresponding Gram matrix M is performed, solving M v_k = ℓ_k v_k, as with the other spectral embedding methods (Laplacian eigenmaps, LLE, Isomap, MDS). However, in this case, one can obtain a test point projection P_k(x), the inner product of φ(x) with the eigenvector w_k of C; using the kernel trick, it is written as the expansion

$$P_k(x) = w_k \cdot \phi(x) = \frac{1}{\sqrt{\ell_k}} \sum_{i=1}^{n} v_{ik}\, K_n(x_i, x). \qquad (2.3)$$

Note that the eigenvectors of C are related to the eigenvectors of M through λ_k = ℓ_k / n and

$$w_k = \frac{1}{\sqrt{\ell_k}} \sum_{i=1}^{n} v_{ik}\, \phi(x_i),$$
as shown in Schölkopf et al. (1998). Ng et al. (2002) already noted the link between kernel PCA and spectral clustering. Here we take advantage of that link to propose and analyze out-of-sample extensions for spectral clustering and other spectral embedding algorithms. Recently, Ham et al. (2003) showed how Isomap, LLE, and Laplacian eigenmaps can be interpreted as performing a form of kernel PCA, although they do not propose to use that link in order to perform function induction (i.e., obtain an embedding for out-of-sample points).

Recent work has shown convergence properties of kernel PCA that are particularly important here. Shawe-Taylor et al. (2002) and Shawe-Taylor and Williams (2003) give bounds on the kernel PCA convergence error (in the sense of the projection error with respect to the subspace spanned by the eigenvectors), using concentration inequalities.

3 Data-Dependent Kernels for Spectral Embedding Algorithms

The spectral embedding algorithms can be seen to build an n × n similarity matrix M, compute its principal eigenvectors v_k = (v_1k, ..., v_nk) with eigenvalues ℓ_k, and associate with the ith training example the embedding¹ with coordinates (v_i1, v_i2, ...) (for Laplacian eigenmaps and LLE) or (√ℓ_1 v_i1, √ℓ_2 v_i2, ...) (for Isomap and MDS). In general, we will see that M_ij depends not only on (x_i, x_j) but also on the other training examples. Nonetheless, as we show below, it can always be written M_ij = K_n(x_i, x_j), where K_n is a data-dependent kernel. In many algorithms, a matrix M̃ is first formed from a simpler, often data-independent kernel (such as the gaussian kernel) and then transformed into M. By defining a kernel K_n for each of these methods, we will be able to generalize the embedding to new points x outside the training set via the Nyström formula. This requires only computations of the form K_n(x, x_i) with x_i a training point.

3.1 Multidimensional Scaling. Metric MDS (Cox & Cox, 1994) starts from a notion of distance d(x, y) that is computed between each pair of training examples to fill a matrix M̃_ij = d²(x_i, x_j). These distances are then converted to equivalent dot products using the double-centering formula, which makes M_ij depend not only on (x_i, x_j) but also on all the other examples:

$$M_{ij} = -\frac{1}{2}\left(\tilde M_{ij} - \frac{1}{n} S_i - \frac{1}{n} S_j + \frac{1}{n^2} \sum_k S_k\right), \qquad (3.1)$$

where S_i = Σ_{j=1}^n M̃_ij. The embedding of the example x_i is given by √ℓ_k v_ik, where v_·k is the kth eigenvector of M. To generalize MDS, we define a corresponding data-dependent kernel that generates the matrix M:

$$K_n(a, b) = -\frac{1}{2}\left(d^2(a, b) - \hat E_x[d^2(x, b)] - \hat E_{x'}[d^2(a, x')] + \hat E_{x, x'}[d^2(x, x')]\right), \qquad (3.2)$$

where the expectations are taken over the training set D. An extension of metric MDS to new points has already been proposed in Gower (1968), in which

¹ For Laplacian eigenmaps and LLE, the matrix M discussed here is not the one defined in the original algorithms, but a transformation of it to reverse the order of eigenvalues, as we see below.
one solves exactly for the coordinates of the new point that are consistent with its distances to the training points, which in general requires adding a new dimension. Note also that Williams (2001) makes a connection between kernel PCA and metric MDS, remarking that kernel PCA is a form of MDS when the kernel is isotropic. Here we pursue this connection in order to obtain out-of-sample embeddings.

3.2 Spectral Clustering. Several variants of spectral clustering have been proposed (Weiss, 1999). They can yield impressively good results where traditional clustering looking for "round blobs" in the data, such as K-means, would fail miserably (see Figure 1). The method is based on two main steps: first, embedding the data points in a space in which clusters are more "obvious" (using the eigenvectors of a Gram matrix), and then applying a classical clustering algorithm such as K-means, as in Ng et al. (2002). To construct the spectral clustering affinity matrix M, we first apply a data-independent kernel K̃, such as the gaussian kernel, to each pair of examples: M̃_ij = K̃(x_i, x_j). The matrix M̃ is then normalized, for example, using divisive normalization (Weiss, 1999; Ng et al., 2002):²

$$M_{ij} = \frac{\tilde M_{ij}}{\sqrt{S_i S_j}}. \qquad (3.3)$$
To obtain m clusters, the first m principal eigenvectors of M are computed, and K-means is applied on the unit-norm coordinates obtained from the embedding v_ik of each training point x_i. To generalize spectral clustering to out-of-sample points, we define a kernel that could have generated that matrix:

$$K_n(a, b) = \frac{1}{n} \frac{\tilde K(a, b)}{\sqrt{\hat E_x[\tilde K(a, x)]\; \hat E_{x'}[\tilde K(x', b)]}}. \qquad (3.4)$$

This normalization comes out of the justification of spectral clustering as a relaxed statement of the min-cut problem (Chung, 1997; Spielman & Teng, 1996): divide the examples into two groups so as to minimize the sum of the "similarities" between pairs of points straddling the two groups. The additive normalization performed with kernel PCA (see equation 2.2) makes sense geometrically as a centering in feature space. Both normalization procedures make use of a kernel row and column average. It would be interesting to find a similarly pleasing geometric interpretation to the divisive normalization.
² Better embeddings are usually obtained if we define S_i = Σ_{j≠i} M̃_ij. This alternative normalization can also be cast into the general framework developed here, with a slightly different kernel.
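A minimal numpy sketch of equations 3.3 and 3.4 (the gaussian kernel, its width, and the random data are illustrative assumptions, not choices made in the article): it verifies that on training pairs the data-dependent kernel K_n regenerates the normalized affinity matrix M, while remaining defined at out-of-sample points:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))        # training set D
n = len(X)

def Kt(a, b, sigma=1.5):            # data-independent kernel K~ (illustrative)
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

Mt = np.array([[Kt(a, b) for b in X] for a in X])   # M~_ij = K~(x_i, x_j)
S = Mt.sum(axis=1)                                  # row sums S_i
M = Mt / np.sqrt(np.outer(S, S))                    # divisive normalization, eq. 3.3

def Kn(a, b):
    """Data-dependent kernel of eq. 3.4; a and b need not be training points."""
    Ea = np.mean([Kt(a, x) for x in X])             # E^_x[K~(a, x)]
    Eb = np.mean([Kt(x, b) for x in X])             # E^_x'[K~(x', b)]
    return Kt(a, b) / (n * np.sqrt(Ea * Eb))

# On training pairs, Kn regenerates the normalized affinity matrix:
# E^_x[K~(x_i, x)] = S_i / n, so Kn(x_i, x_j) = M~_ij / sqrt(S_i S_j) = M_ij.
assert np.allclose(Kn(X[3], X[11]), M[3, 11])
# ...and it can be evaluated at a new point, as the Nystrom formula requires.
print(Kn(rng.normal(size=2), X[0]))
```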
Figure 1: Example of the transformation learned as part of spectral clustering. Input data are on the left and transformed data on the right. Gray level and cross and circle drawing are used only to show which points get mapped where: the mapping reveals both the clusters and the internal structure of the two manifolds.
3.3 Laplacian Eigenmaps. The Laplacian eigenmaps method is a recently proposed dimensionality-reduction procedure (Belkin & Niyogi, 2003) that has also proved very successful for semisupervised learning. The authors use an approximation of the Laplacian operator such as the gaussian kernel or the k-nearest-neighbor graph: the symmetric matrix whose element (i, j) is 1 if x_i and x_j are k-nearest-neighbors (x_i is among the k nearest neighbors of x_j, or vice versa) and 0 otherwise. Instead of an ordinary eigenproblem, the following generalized eigenproblem is solved:

$$(S - \tilde M)\, y_k = \sigma_k S\, y_k, \qquad (3.5)$$

with eigenvalues σ_k, eigenvectors y_k, and S the diagonal matrix with the elements S_i defined previously (row sums of M̃). The smallest eigenvalue is left out, and the eigenvectors corresponding to the other small eigenvalues are used for the embedding. This is the same embedding that is computed with the spectral clustering algorithm of Shi and Malik (1997). As noted in Weiss (1999) (normalization lemma 1), an equivalent result (up to a componentwise scaling of the embedding) can be obtained by considering the principal eigenvectors v_k of the normalized matrix M defined in equation 3.3. To fit the common framework for spectral embedding in this article, we have used the latter formulation. Therefore, the same data-dependent kernel can be defined as for spectral clustering, equation 3.4, to generate the matrix M; that is, spectral clustering simply adds a clustering step after a Laplacian eigenmaps dimensionality reduction.

3.4 Isomap. Isomap (Tenenbaum et al., 2000) generalizes MDS to nonlinear manifolds. It is based on replacing the Euclidean distance by an empirical approximation of the geodesic distance on the manifold. We define
the geodesic distance D(·, ·) with respect to a data set D, a distance d(·, ·), and a neighborhood k as follows:
$$D(a, b) = \min_{\pi} \sum_{i=1}^{|\pi| - 1} d(\pi_i, \pi_{i+1}), \qquad (3.6)$$
where π is a sequence of points of length |π| = l ≥ 2 with π_1 = a, π_l = b, π_i ∈ D for all i ∈ {2, ..., l−1}, and (π_i, π_{i+1}) are k-nearest-neighbors of each other. The length |π| = l is free in the minimization. The Isomap algorithm obtains the normalized matrix M from which the embedding is derived by transforming the raw pairwise distance matrix as follows: (1) compute the matrix M̃_ij = D²(x_i, x_j) of squared geodesic distances with respect to the data D, and (2) apply to this matrix the distance-to-dot-product transformation, equation 3.1, as for MDS. As in MDS, the embedding of x_i is √ℓ_k v_ik rather than v_ik.

There are several ways to define a kernel that generates M and also generalizes out-of-sample. The solution we have chosen simply computes the geodesic distances without involving the out-of-sample point(s) along the geodesic path (except possibly at the extremes). This is automatically achieved with the above definition of the geodesic distance D, which uses only the training points to find the shortest path between a and b. The double-centering kernel transformation of equation 3.2 can then be applied, using the geodesic distance D instead of the MDS distance d.

A formula has been proposed (de Silva & Tenenbaum, 2003) to approximate Isomap using only a subset of the examples (the "landmark" points) to compute the eigenvectors. Using the notation of this article, that formula is

$$e_k(x) = \frac{1}{2\sqrt{\ell_k}} \sum_i v_{ik} \left(\hat E_{x'}[D^2(x', x_i)] - D^2(x_i, x)\right). \qquad (3.7)$$
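The two-step Isomap construction described above (graph geodesics as in equation 3.6, then the double-centering of equation 3.2 with D in place of d) can be sketched with numpy and scipy. The helix data, the neighborhood size, and the choice of two embedding dimensions are illustrative assumptions:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# 40 points along a helix: a 1-D manifold embedded in R^3 (illustrative data).
t = np.linspace(0, 3 * np.pi, 40)
X = np.c_[np.cos(t), np.sin(t), 0.3 * t]
n, k = len(X), 4

d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
W = np.zeros_like(d)
for i in range(n):
    for j in np.argsort(d[i])[1:k + 1]:
        W[i, j] = W[j, i] = d[i, j]            # symmetrized kNN edges (0 = no edge)

D = shortest_path(W, directed=False)           # graph geodesic distances, eq. 3.6
assert np.isfinite(D).all()                    # the kNN graph is connected here
assert D[0, -1] >= d[0, -1]                    # path length >= straight-line distance

# Double centering (eq. 3.2 with D instead of d) yields the Isomap Gram matrix.
D2 = D ** 2
M = -0.5 * (D2 - D2.mean(1)[:, None] - D2.mean(0)[None, :] + D2.mean())
ell, V = np.linalg.eigh(M)
ell, V = ell[::-1], V[:, ::-1]
embedding = V[:, :2] * np.sqrt(np.abs(ell[:2]))   # sqrt(ell_k) v_ik coordinates
print(embedding[:3])
```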
The formula is applied to obtain an embedding for the nonlandmark examples. One can show (Bengio et al., 2004) that e_k(x) is the Nyström formula when K_n(x, y) is defined as above. Landmark Isomap is thus equivalent to performing Isomap on the landmark points only and then predicting the embedding of the other points using the Nyström formula. It is interesting to note a recent descendant of Isomap and LLE, Hessian eigenmaps (Donoho & Grimes, 2003), which considers the limit case of the continuum of the manifold and replaces the Laplacian in Laplacian eigenmaps by a Hessian.

3.5 Locally Linear Embedding. The LLE algorithm (Roweis & Saul, 2000) looks for an embedding that preserves the local geometry in the neighborhood of each data point. First, a sparse matrix of local predictive weights W_ij is computed, such that Σ_j W_ij = 1, W_ij = 0 if x_j is not a k-nearest-neighbor of x_i, and ||(Σ_j W_ij x_j) − x_i||² is minimized. Then the matrix M̃ = (I − W)ᵀ(I − W) is formed. The embedding is obtained from the lowest eigenvectors of M̃,
Learning Eigenfunctions Links Spectral Embedding and Kernel PCA
2205
except for the eigenvector with the smallest eigenvalue, which is uninteresting because it is proportional to (1, 1, ..., 1) (and its eigenvalue is 0). To select the principal eigenvectors, we define our normalized matrix here by M = cI − M̃ and ignore the top eigenvector (although one could apply an additive normalization to remove the components along the (1, 1, ..., 1) direction). The LLE embedding for x_i is thus given by the v_ik, starting at the second eigenvector (since the principal one is constant). If one insists on having a positive semidefinite matrix M, one can take for c the largest eigenvalue of M̃ (note that c changes the eigenvalues only additively and has no influence on the embedding of the training set).

In order to find our kernel K_n, we first denote by w(x, x_i) the weight of x_i in the reconstruction of any point x ∈ R^d by its k nearest neighbors in the training set (if x = x_j ∈ D, w(x, x_i) = δ_ij). Let us first define a kernel K'_n by K'_n(x_i, x) = K'_n(x, x_i) = w(x, x_i) and K'_n(x, y) = 0 when neither x nor y is in the training set D. Let K''_n be such that K''_n(x_i, x_j) = W_ij + W_ji − Σ_k W_ki W_kj and K''_n(x, y) = 0 when either x or y is not in D. It is then easy to verify that the kernel K_n = (c − 1) K'_n + K''_n is such that K_n(x_i, x_j) = M_ij (again, there could be other ways to obtain a data-dependent kernel for LLE that can be applied out-of-sample). Something interesting about this kernel is that when c → ∞, the embedding obtained for a new point x converges to the extension of LLE proposed in Saul and Roweis (2002), as shown in Bengio et al. (2004); this is the kernel we actually used in the experiments reported here. As noted independently in Ham et al. (2003), LLE can be seen as performing kernel PCA with a particular kernel matrix.
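The first step of LLE described above can be sketched as follows. The data, the neighborhood size k, and the small regularization added to the local Gram matrices are illustrative assumptions (the regularization is a common practical device, not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))        # training set D
n, k = len(X), 5

# Local predictive weights: rows sum to 1, nonzero only on k nearest neighbors.
W = np.zeros((n, n))
for i in range(n):
    idx = np.argsort(((X - X[i]) ** 2).sum(1))[1:k + 1]   # k nearest neighbors
    Z = X[idx] - X[i]                                     # centered neighbors
    G = Z @ Z.T + 1e-3 * np.eye(k)                        # regularized local Gram
    w = np.linalg.solve(G, np.ones(k))
    W[i, idx] = w / w.sum()                               # enforce sum_j W_ij = 1

M_tilde = (np.eye(n) - W).T @ (np.eye(n) - W)             # M~ = (I - W)'(I - W)

# The constant vector is an eigenvector of M~ with eigenvalue 0, as stated above.
assert np.allclose(M_tilde @ np.ones(n), 0, atol=1e-8)

evals, evecs = np.linalg.eigh(M_tilde)                    # ascending eigenvalues
embedding = evecs[:, 1:3]     # lowest eigenvectors, skipping the constant one
print(embedding[:3])
```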
The identification becomes more accurate when one notes that getting rid of the constant eigenvector (the principal eigenvector of M) is equivalent to the centering operation in feature space required for kernel PCA (Ham et al., 2003).

4 Similarity Kernel Eigenfunctions

As noted in Williams and Seeger (2000), the kernel PCA projection formula (equation 2.3) is proportional to the so-called Nyström formula (Baker, 1977; Williams & Seeger, 2000), equation 1.1, which has been used successfully to "predict" the value of an eigenvector on a new data point, in order to speed up kernel methods computations by focusing the heavier computations (the eigendecomposition) on a subset of examples (Williams & Seeger, 2001). The use of this formula can be justified by considering the convergence of eigenvectors and eigenvalues as the number of examples increases (Baker, 1977; Koltchinskii & Giné, 2000; Williams & Seeger, 2000; Shawe-Taylor & Williams, 2003). Here we elaborate on this relation in order to better understand what all these spectral embedding algorithms are actually estimating. If we start from a data set D, obtain an embedding for its elements, and add more and more data, the embedding for the points in D converges (for eigenvalues that are unique): Shawe-Taylor and Williams (2003) give bounds
on the convergence error (in the case of kernel PCA). Based on these kernel PCA convergence results, we conjecture that in the limit, each eigenvector would converge to an eigenfunction of the linear operator defined below, in the sense that the ith element of the kth eigenvector converges to the application of the kth eigenfunction to x_i. In the following, we assume that the (possibly data-dependent) kernel K_n is bounded (i.e., |K_n(x, y)| < c for all x, y in R^d) and has a discrete spectrum; it can be written as a discrete expansion

$$K_n(x, y) = \sum_{k=1}^{\infty} \alpha_k\, \phi_k(x)\, \phi_k(y).$$
Consider the space H_p of continuous functions f that are square integrable, ∫ f²(x) p(x) dx < ∞, with respect to the data-generating density function p(x). Note that we actually work not on functions but on equivalence classes: we say that two functions f and g belong to the same equivalence class (with respect to p) if and only if ∫ (f(x) − g(x))² p(x) dx = 0 (if p is strictly positive, then each equivalence class contains only one function). We will assume that K_n converges uniformly in its arguments and in probability to its limit K as n → ∞. This means that for all ε > 0,

$$\lim_{n \to \infty} P\left(\sup_{x, y \in \mathbb{R}^d} |K_n(x, y) - K(x, y)| \geq \epsilon\right) = 0.$$
We will associate with each K_n a linear operator G_n and with K a linear operator G, such that for any f ∈ H_p,

$$G_n f = \frac{1}{n} \sum_{i=1}^{n} K_n(\cdot, x_i)\, f(x_i) \qquad (4.1)$$

and

$$G f = \int K(\cdot, y)\, f(y)\, p(y)\, dy, \qquad (4.2)$$

which makes sense because we work in a space of functions defined everywhere. Furthermore, as K_n(·, y) and K(·, y) are square integrable in the sense defined above, for each f and each n, G_n(f) and G(f) are also square integrable in that sense. We will show that the Nyström formula, equation 1.1, gives the eigenfunctions of G_n (see proposition 1), that their values on the training examples correspond to the spectral embedding, and that they converge to the eigenfunctions of G (see proposition 2) if they converge at all. These results hold even if K_n has negative eigenvalues.
The eigensystems of interest are thus the following:

$$G f_k = \lambda_k f_k \qquad (4.3)$$

and

$$G_n f_{k,n} = \lambda_{k,n} f_{k,n}, \qquad (4.4)$$
where (λ_k, f_k) and (λ_{k,n}, f_{k,n}) are the corresponding eigenvalues and eigenfunctions. Note that when equation 4.4 is evaluated only at the x_i ∈ D, the set of equations reduces to the eigensystem M v_k = n λ_{k,n} v_k. The following proposition gives a more complete characterization of the eigenfunctions of G_n, even in the case where eigenvalues may be negative. The next two propositions formalize the link, already made in Williams and Seeger (2000), between the Nyström formula and the eigenfunctions of G.

Proposition 1. G_n has in its image m ≤ n eigenfunctions of the form

$$f_{k,n}(x) = \frac{\sqrt{n}}{\ell_k} \sum_{i=1}^{n} v_{ik}\, K_n(x, x_i), \qquad (4.5)$$

with corresponding nonzero eigenvalues λ_{k,n} = ℓ_k / n, where v_k = (v_1k, ..., v_nk) is the kth eigenvector of the Gram matrix M, associated with the eigenvalue ℓ_k. For x_i ∈ D, these functions coincide with the corresponding eigenvectors, in the sense that f_{k,n}(x_i) = √n v_ik.

Proof. First, we show that the f_{k,n} coincide with the eigenvectors of M at x_i ∈ D. For f_{k,n} associated with a nonzero eigenvalue,

$$f_{k,n}(x_i) = \frac{\sqrt{n}}{\ell_k} \sum_{j=1}^{n} v_{jk}\, K_n(x_i, x_j) = \frac{\sqrt{n}}{\ell_k}\, \ell_k v_{ik} = \sqrt{n}\, v_{ik}. \qquad (4.6)$$

The v_k being orthonormal, the f_{k,n} (for different values of k) are therefore different from each other. Then for any x,

$$(G_n f_{k,n})(x) = \frac{1}{n} \sum_{i=1}^{n} K_n(x, x_i)\, f_{k,n}(x_i) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} K_n(x, x_i)\, v_{ik} = \frac{\ell_k}{n} f_{k,n}(x), \qquad (4.7)$$

which shows that f_{k,n} is an eigenfunction of G_n with eigenvalue λ_{k,n} = ℓ_k / n.
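Proposition 1 is easy to check numerically. In the following sketch (the gaussian kernel and random data are illustrative assumptions), G_n is implemented as the empirical average of equation 4.1, and both claims, the eigenfunction property (equation 4.7) and the agreement with the eigenvectors on the training set (equation 4.6), are verified:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 2))
n = len(X)

def Kn(a, b):                       # an illustrative bounded kernel
    return np.exp(-np.sum((a - b) ** 2))

M = np.array([[Kn(a, b) for b in X] for a in X])
ell, V = np.linalg.eigh(M)
ell, V = ell[::-1], V[:, ::-1]      # decreasing eigenvalues ell_k

def f(k, x):                        # Nystrom eigenfunction, eq. 4.5
    return np.sqrt(n) / ell[k] * sum(V[i, k] * Kn(x, X[i]) for i in range(n))

def Gn(g, x):                       # empirical operator G_n of eq. 4.1, at x
    return np.mean([Kn(x, xi) * g(xi) for xi in X])

x = rng.normal(size=2)
k = 0
# Eigenfunction property (eq. 4.7): (G_n f_{k,n})(x) = (ell_k / n) f_{k,n}(x).
assert np.allclose(Gn(lambda y: f(k, y), x), ell[k] / n * f(k, x))
# Agreement with the eigenvectors on D (eq. 4.6): f_{k,n}(x_i) = sqrt(n) v_ik.
assert np.allclose(f(k, X[5]), np.sqrt(n) * V[5, k])
```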
The previous result shows that the Nyström formula generalizes the spectral embedding outside the training set. There could, however, be many possible generalizations. To justify the use of this particular one, the following proposition helps in understanding the convergence of these functions as n increases. We would like the out-of-sample embedding predictions obtained with the Nyström formula to be close to the asymptotic embedding (the embedding one would obtain as n → ∞). Note also that the convergence of eigenvectors to eigenfunctions shown in Baker (1977) applies to data x_i that are deterministically chosen to span a domain, whereas here the x_i form a random sample from an unknown distribution.

Proposition 2. If the data-dependent bounded kernel K_n (|K_n(x, y)| ≤ c) converges uniformly in its arguments and in probability, with the eigendecomposition of the Gram matrix converging, and if the eigenfunctions f_{k,n} of G_n associated with nonzero eigenvalues converge uniformly in probability, then their limits are the corresponding eigenfunctions of G.

Proof. Denote by f_{k,∞} the nonrandom function such that

$$\sup_x |f_{k,n}(x) - f_{k,\infty}(x)| \to 0 \qquad (4.8)$$

in probability. Similarly, let K be the nonrandom kernel such that

$$\sup_{x,y} |K_n(x, y) - K(x, y)| \to 0 \qquad (4.9)$$
in probability. Let us start from the Nyström formula and insert f_{k,∞}, taking advantage of theorem 3.1 of Koltchinskii and Giné (2000), which shows that λ_{k,n} → λ_k almost surely, where the λ_k are the eigenvalues of G:

$$f_{k,n}(x) = \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,n}(x_i)\, K_n(x, x_i) \qquad (4.10)$$
$$= \frac{1}{n \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i)\, K(x, x_i) + \frac{\lambda_k - \lambda_{k,n}}{n \lambda_{k,n} \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i)\, K(x, x_i)$$
$$+ \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,\infty}(x_i)\, [K_n(x, x_i) - K(x, x_i)] + \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} K_n(x, x_i)\, [f_{k,n}(x_i) - f_{k,\infty}(x_i)]. \qquad (4.11)$$
Below, we will need f_{k,∞}(x) to be bounded. For this, we use the assumption that K_n is bounded independently of n: |K_n(x, y)| < c. Then

$$|f_{k,n}(x)| = \left|\frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,n}(x_i)\, K_n(x, x_i)\right| \leq \frac{1}{n |\lambda_{k,n}|} \sum_{i=1}^{n} |f_{k,n}(x_i)|\, |K_n(x, x_i)| \leq \frac{c}{n |\lambda_{k,n}|} \sum_{i=1}^{n} |f_{k,n}(x_i)| = \frac{c}{n |\lambda_{k,n}|} \sum_{i=1}^{n} \sqrt{n}\, |v_{ik}| \leq \frac{c}{|\lambda_{k,n}|},$$

where we used the bound on K_n in the second step, equation 4.6 in the third, and in the last step the fact that the maximum of Σ_{i=1}^n a_i subject to a_i ≥ 0 and Σ_{i=1}^n a_i² = 1 is achieved when a_i = 1/√n. Finally, using equation 4.8 and the convergence of λ_{k,n}, we can deduce that |f_{k,∞}| ≤ c/|λ_k| and is thus bounded.

We now insert (1/λ_k) ∫ f_{k,∞}(y) K(x, y) p(y) dy into equation 4.11 and obtain the inequality

$$\left|f_{k,n}(x) - \frac{1}{\lambda_k} \int f_{k,\infty}(y)\, K(x, y)\, p(y)\, dy\right|$$
$$\leq \left|\frac{1}{n \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i)\, K(x, x_i) - \frac{1}{\lambda_k} \int f_{k,\infty}(y)\, K(x, y)\, p(y)\, dy\right|$$
$$+ \left|\frac{\lambda_k - \lambda_{k,n}}{n \lambda_{k,n} \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i)\, K(x, x_i)\right| + \left|\frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,\infty}(x_i)\, [K_n(x, x_i) - K(x, x_i)]\right|$$
$$+ \left|\frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} K_n(x, x_i)\, [f_{k,n}(x_i) - f_{k,\infty}(x_i)]\right|$$
$$\leq A_n + B_n + C_n + D_n. \qquad (4.12)$$

From our convergence hypotheses (equations 4.8 and 4.9), the convergence of λ_{k,n} to λ_k almost surely, and the fact that f_{k,∞}, K, and K_n are bounded, it is
clear that the last three terms B_n, C_n, and D_n converge to 0 in probability. In addition, by the law of large numbers, the first term A_n also converges to 0 in probability. Therefore,

$$f_{k,n}(x) \to \frac{1}{\lambda_k} \int f_{k,\infty}(y)\, K(x, y)\, p(y)\, dy$$

in probability for all x. Since we also have f_{k,n}(x) → f_{k,∞}(x), we obtain

$$\lambda_k f_{k,\infty}(x) = \int f_{k,\infty}(y)\, K(x, y)\, p(y)\, dy,$$

which shows that f_{k,∞} is an eigenfunction of G, with eigenvalue λ_k; therefore, f_{k,∞} = f_k: the limit of the Nyström estimator, if it exists, is an eigenfunction of G.
Kernel PCA has already been shown to be a stable and convergent algorithm (Shawe-Taylor et al., 2002; Shawe-Taylor & Williams, 2003). These articles characterize the rate of convergence of the projection error on the subspace spanned by the first m eigenvectors of the feature space covariance matrix. When we perform the PCA or kernel PCA projection on an out-of-sample point, we are taking advantage of these convergence and stability properties in order to trust that a principal eigenvector of the empirical covariance matrix estimates well a corresponding eigenvector of the true covariance matrix. Another justification for applying the Nyström formula outside the training examples is therefore, as already noted earlier and in Williams and Seeger (2000), that in the case where K_n is positive semidefinite, it corresponds to the kernel PCA projection (on a corresponding eigenvector of the feature space correlation matrix C).

Clearly, with the Nyström formula we have a method to generalize spectral embedding algorithms to out-of-sample examples, whereas the original spectral embedding methods provide only the transformed coordinates of training points (i.e., an embedding of the training points). The experiments described below show empirically the good generalization of this out-of-sample embedding.

An interesting justification for estimating the eigenfunctions of G has been shown in Williams and Seeger (2000). When an unknown function f is to be estimated with an approximation g that is a finite linear combination of basis functions, if f is assumed to come from a zero-mean gaussian process prior with covariance E_f[f(x) f(y)] = K(x, y), then the best choices of basis functions, in terms of expected squared error, are (up to rotation and scaling) the leading eigenfunctions of the linear operator G as defined above.
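The proportionality between the kernel PCA projection (equation 2.3) and the Nyström formula (equation 1.1) can be checked directly: comparing the two expressions gives a factor of √(n/ℓ_k). In the sketch below, the gaussian base kernel and the random data are illustrative assumptions; the kernel is centered as in equation 2.2 so that the kernel PCA interpretation applies:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
n = len(X)

def Kt(a, b):                       # illustrative base kernel K~
    return np.exp(-np.sum((a - b) ** 2) / 2)

Mt = np.array([[Kt(a, b) for b in X] for a in X])
tot = Mt.mean()                     # E^_x'[E^_y'[K~(x', y')]]

def Kn(a, b):
    """Additively normalized (centered) kernel of eq. 2.2."""
    Ea = np.mean([Kt(x, b) for x in X])             # E^_x'[K~(x', b)]
    Eb = np.mean([Kt(a, y) for y in X])             # E^_y'[K~(a, y')]
    return Kt(a, b) - Ea - Eb + tot

M = np.array([[Kn(a, b) for b in X] for a in X])    # centered Gram matrix
ell, V = np.linalg.eigh(M)
ell, V = ell[::-1], V[:, ::-1]

x, k = rng.normal(size=2), 0
proj = sum(V[i, k] * Kn(X[i], x) for i in range(n)) / np.sqrt(ell[k])         # eq. 2.3
nystrom = np.sqrt(n) / ell[k] * sum(V[i, k] * Kn(x, X[i]) for i in range(n))  # eq. 1.1
assert np.allclose(nystrom, np.sqrt(n / ell[k]) * proj)  # proportional, as claimed
```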
5 Learning Criterion for the Leading Eigenfunctions

Using an expansion into orthonormal bases (e.g., generalized Fourier decomposition in the case where p is continuous), the best approximation of K(x, y) (in the sense of minimizing expected squared error) using only m terms is the expansion that uses the first m eigenfunctions of G (in order of decreasing eigenvalues):

$$\sum_{k=1}^{m} \lambda_{k,n}\, f_{k,n}(x)\, f_{k,n}(y) \approx K_n(x, y).$$

This simple observation allows us to define a loss criterion for spectral embedding algorithms, something that was lacking up to now for such algorithms. The limit of this loss converges toward an expected loss whose minimization gives rise to the eigenfunctions of G. One could thus conceivably estimate this generalization error using the average of the loss on a test set. That criterion is simply the reconstruction error

$$L(x_i, x_j) = \left(K_n(x_i, x_j) - \sum_{k=1}^{m} \lambda_{k,n}\, f_{k,n}(x_i)\, f_{k,n}(x_j)\right)^2.$$
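On the training set, this loss is easy to compute: by equation 4.6, λ_{k,n} f_{k,n}(x_i) f_{k,n}(x_j) = ℓ_k v_ik v_jk, so keeping m terms amounts to a rank-m reconstruction of the Gram matrix. A small numpy sketch (the gaussian kernel and random data are illustrative assumptions) shows the average loss decreasing in m and vanishing at m = n:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 2))
n = len(X)

def Kn(a, b):                       # an illustrative positive semidefinite kernel
    return np.exp(-np.sum((a - b) ** 2))

M = np.array([[Kn(a, b) for b in X] for a in X])
ell, V = np.linalg.eigh(M)
ell, V = ell[::-1], V[:, ::-1]      # decreasing eigenvalues

def avg_loss(m):
    """Average of L(x_i, x_j) over all training pairs with m eigenfunctions.

    On training points, lambda_{k,n} f_{k,n}(x_i) f_{k,n}(x_j) = ell_k v_ik v_jk,
    so the truncated expansion is a rank-m reconstruction of the Gram matrix.
    """
    Mm = (V[:, :m] * ell[:m]) @ V[:, :m].T
    return np.mean((M - Mm) ** 2)

losses = [avg_loss(m) for m in (1, 5, 10, n)]
assert all(a >= b - 1e-12 for a, b in zip(losses, losses[1:]))  # decreasing in m
assert np.isclose(losses[-1], 0)    # with m = n, the reconstruction is exact
print(losses)
```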
Proposition 3. Asymptotically, the spectral embedding for a continuous kernel K with discrete spectrum is the solution of a sequential minimization problem, iteratively minimizing the expected value of the loss criterion L(x_i, x_j). First, with {(f_k, λ_k)}_{k=1}^{m−1} already obtained, one can recursively obtain (λ_m, f_m) by minimizing

$$J_m(\lambda', f') = \int \left(K(x, y) - \lambda' f'(x) f'(y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y)\right)^2 p(x)\, p(y)\, dx\, dy, \qquad (5.1)$$

where by convention we scale f' such that ∫ f'(x)² p(x) dx = 1 (any other scaling can be transferred into λ'). Second, if the K_n are bounded (independently of n) and the f_{k,n} converge uniformly in probability, with the eigendecomposition of the Gram matrix converging, the Monte Carlo average of this criterion,

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(K_n(x_i, x_j) - \sum_{k=1}^{m} \lambda_{k,n}\, f_{k,n}(x_i)\, f_{k,n}(x_j)\right)^2,$$

converges in probability to the above asymptotic expectation.
Proof. We prove the first part of the proposition, concerning the sequential minimization of the loss criterion, using classical linear algebra (Strang, 1980; Kreyszig, 1990). We proceed by induction, assuming that we have already obtained orthogonal eigenfunctions f_1, ..., f_{m−1} in order of decreasing absolute value of λ_i. We want to prove that the (λ', f') that minimizes J_m is (λ_m, f_m). Setting ∂J_m/∂λ' = 0 yields

$$\lambda' = \langle f', K f' \rangle - \int \left(\sum_{i=1}^{m-1} \lambda_i f_i(x) f_i(y)\right) f'(x) f'(y)\, p(x)\, p(y)\, dx\, dy. \qquad (5.2)$$

Thus, we have

$$J_m = J_{m-1} - 2 \int \lambda' f'(x) f'(y) \left(K(x, y) - \sum_{i=1}^{m-1} \lambda_i f_i(x) f_i(y)\right) p(x)\, p(y)\, dx\, dy + \int (\lambda' f'(x) f'(y))^2\, p(x)\, p(y)\, dx\, dy,$$

which gives J_m = J_{m−1} − λ'², so that λ'² should be maximized in order to minimize J_m. Taking the derivative of J_m with respect to the value of f' at a particular point z (under some regularity conditions that allow bringing the derivative inside the integral) and setting it equal to zero yields the equation

$$\int K(z, y)\, f'(y)\, p(y)\, dy = \lambda' f'(z) \int f'(y)^2\, p(y)\, dy + \sum_{i=1}^{m-1} \lambda_i f_i(z) \int f_i(y)\, f'(y)\, p(y)\, dy.$$

Using the constraint ||f'||² = ⟨f', f'⟩ = ∫ f'(y)² p(y) dy = 1, we obtain

$$(K f')(z) = \lambda' f'(z) + \sum_{i=1}^{m-1} \lambda_i f_i(z)\, \langle f', f_i \rangle, \qquad (5.3)$$

which rewrites into K f' = λ' f' + Σ_{i=1}^{m−1} λ_i f_i ⟨f', f_i⟩. Writing K f' in the basis of all the eigenfunctions, K f' = Σ_{i=1}^{∞} λ_i f_i ⟨f', f_i⟩, we obtain

$$\lambda' f' = \lambda_m f_m\, \langle f', f_m \rangle + \sum_{i=m+1}^{\infty} \lambda_i f_i\, \langle f', f_i \rangle.$$
Since the f_i are orthogonal, taking the norm and applying Parseval's theorem gives

$$\lambda'^2 = \lambda_m^2\, \langle f', f_m \rangle^2 + \sum_{i=m+1}^{\infty} \lambda_i^2\, \langle f', f_i \rangle^2.$$
If the eigenvalues are distinct, we have |λ_m| > |λ_i| for i > m, and the last expression is maximized when ⟨f', f_m⟩ = 1 and ⟨f', f_i⟩ = 0 for i > m, which proves that f' = f_m is in fact the mth eigenfunction of the kernel K and thereby λ' = λ_m. If the eigenvalues are not distinct, the result generalizes in the sense that the choice of eigenfunctions is no longer unique, and the eigenfunctions sharing the same eigenvalue form an orthogonal basis for a subspace. This concludes the proof of the first statement.

To prove the second part (the convergence statement), we want to show that the difference between the average cost and the expected asymptotic cost tends toward 0. If we write K̂_n(x, y) = Σ_{k=1}^m λ_{k,n} f_{k,n}(x) f_{k,n}(y) and K̂(x, y) = Σ_{k=1}^m λ_k f_k(x) f_k(y), that difference is

$$\left|\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(K_n(x_i, x_j) - \hat K_n(x_i, x_j)\right)^2 - E_{x,y}\!\left[\left(K(x, y) - \hat K(x, y)\right)^2\right]\right|$$
$$\leq \left|\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(K(x_i, x_j) - \hat K(x_i, x_j)\right)^2 - E_{x,y}\!\left[\left(K(x, y) - \hat K(x, y)\right)^2\right]\right|$$
$$+ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left|K_n(x_i, x_j) - \hat K_n(x_i, x_j) - K(x_i, x_j) + \hat K(x_i, x_j)\right| \cdot \left|K_n(x_i, x_j) - \hat K_n(x_i, x_j) + K(x_i, x_j) - \hat K(x_i, x_j)\right|.$$
The eigenfunctions and the kernel being bounded, the second factor in the product (in the second term of the inequality) is bounded by a constant B with probability 1 (because the λ_{k,n} converge almost surely). Thus, we have with probability 1:

\[
\left| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \big( K(x_i,x_j) - \hat{K}_n(x_i,x_j) \big)^2 - E_{x,y}\Big[ \big( K(x,y) - \hat{K}(x,y) \big)^2 \Big] \right|
\]
\[
\le \left| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \big( K(x_i,x_j) - \hat{K}(x_i,x_j) \big)^2 - E_{x,y}\Big[ \big( K(x,y) - \hat{K}(x,y) \big)^2 \Big] \right|
+ \frac{B}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \big| \hat{K}_n(x_i,x_j) - \hat{K}(x_i,x_j) \big|.
\]

But then, with our convergence and bounding assumptions, the second term in the inequality converges to 0 in probability. Furthermore, by the law of large numbers, the first term also tends toward 0 (in probability) as n goes to ∞. We have therefore proved the convergence in probability of the average loss to its asymptotic expectation.

Note that the empirical criterion is indifferent to the value of the solutions f_{k,n} outside the training set. Therefore, although the Nyström formula gives a possible solution to the empirical criterion, there may be other solutions. Remember that the task we consider is that of estimating the eigenfunctions of G, that is, approximating a similarity function K where it matters according to the unknown density p. Solutions other than the Nyström formula might also converge to the eigenfunctions of G. For example, one could use a nonparametric estimator (such as a neural network) to estimate the eigenfunctions. Even if such a solution does not yield the exact eigenvectors on the training examples (i.e., does not yield the lowest possible error on the training set), it might still be a good solution in terms of generalization, in the sense of a good approximation of the eigenfunctions of G. It would be interesting to investigate whether the Nyström formula achieves the fastest possible rate of convergence to the eigenfunctions of G.

6 Experiments

We want to evaluate whether the precision of the generalizations suggested in the previous sections is comparable to the intrinsic perturbations of the embedding algorithms. The perturbation analysis is carried out by replacing some training examples with others drawn from the same distribution. For this purpose, we consider splits of the data into three sets, D = F ∪ R1 ∪ R2, training with either F ∪ R1 or F ∪ R2 and comparing the embeddings on F. For each algorithm described in section 3, we apply the following procedure:

1.
We choose F ⊂ D with m = |F| samples. The remaining n − m samples in D/F are split into two equal-size subsets, R1 and R2. We train (i.e., obtain the eigenvectors) over F ∪ R1 and over F ∪ R2, and we calculate the Euclidean distance between the aligned embeddings obtained for each xi ∈ F. When eigenvalues are close, the estimated eigenvectors are unstable and can rotate in the subspace they span; we therefore first estimate an alignment (by linear regression) between the two embeddings, using the points in F.

2. For each sample xi ∈ F, we also train over {F ∪ R1}/{xi}. We apply the Nyström formula to out-of-sample points to find the predicted
embedding of xi and calculate the Euclidean distance between this embedding and the one obtained when training with F ∪ R1, that is, with xi in the training set (in this case, no alignment is done, since the influence of adding a single point is very limited).

3. We calculate the mean difference (and its standard error, shown in Figure 2) between the distance obtained in step 1 and the one obtained in step 2 for each sample xi ∈ F, and we repeat this experiment for various sizes of F.

The results obtained for MDS, Isomap, spectral clustering, and LLE are shown in Figure 2 for different values of |R1|/n (i.e., the fraction of points exchanged). Experiments are done on a database of 698 synthetic face images, each described by 4096 components, which is available online at http://isomap.stanford.edu. Similar results have been obtained on other databases, such as Ionosphere (http://www.ics.uci.edu/~mlearn/MLSummary.html) and
Figure 2: δ (training set variability minus out-of-sample error), with respect to ρ (proportion of substituted training samples) on the Faces data set (n = 698), obtained with a two-dimensional embedding. (Top left) MDS. (Top right) Spectral clustering or Laplacian eigenmaps. (Bottom left) Isomap. (Bottom right) LLE. Error bars are 95% confidence intervals. Exchanging about 2% of the training examples has an effect comparable to using the Nyström formula.
swissroll (http://www.cs.toronto.edu/~roweis/lle/). Each algorithm generates a two-dimensional embedding of the images, following the experiments reported for Isomap. The number of neighbors is 10 for Isomap and LLE, and a gaussian kernel with a standard deviation of 0.01 is used for spectral clustering and Laplacian eigenmaps. The 95% confidence intervals are drawn beside each mean difference of error in the figure. As expected, the mean difference between the two distances increases almost monotonically as the number |R1| of substituted training samples grows, mostly because the variability of the training set embedding increases. Furthermore, we find that in most cases, the out-of-sample error is less than or comparable to the training set embedding instability when around 2% of the training examples are substituted randomly.

7 Conclusion

Spectral embedding algorithms such as spectral clustering, Isomap, LLE, metric MDS, and Laplacian eigenmaps are very interesting dimensionality-reduction or clustering methods. However, up to now they lacked a notion of generalization that would allow easily extending the embedding out-of-sample without solving an eigensystem again. This article has shown with various arguments that the well-known Nyström formula can be used for this purpose and that it thus represents the result of a function induction process. These arguments also help us to understand that these methods do essentially the same thing, but with respect to different kernels: they estimate the eigenfunctions of a linear operator associated with a kernel and with the underlying distribution of the data. This analysis also shows that these methods minimize an empirical loss and that the solutions toward which they converge are the minimizers of a corresponding expected loss, which thus defines what good generalization should mean for these methods. It shows that these unsupervised learning algorithms can be extended into function induction algorithms.
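To make the function induction process concrete, the Nyström-style out-of-sample extension can be sketched for the simplest case, kernel PCA with an uncentered gaussian kernel. This is a minimal illustration in our own notation, not the exact pipeline used in the experiments above: for a training eigenpair (λk, vk) of the Gram matrix, the extension is e_k(x) = (1/√λk) Σi v_ik K(x, xi), which reproduces the training embedding √λk v_ik at training points.

```python
import numpy as np

# Minimal sketch of the Nystrom out-of-sample extension for kernel PCA.
# The gaussian kernel, sigma, and variable names are illustrative choices;
# the Gram matrix is left uncentered for simplicity.

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise gaussian kernel values between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # training set
K = gaussian_kernel(X, X)           # Gram matrix on the training set

# Eigendecomposition; keep the top two components.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1][:2]
lam, V = eigvals[order], eigvecs[:, order]

# Training embedding: e_k(x_i) = sqrt(lambda_k) * v_ik.
train_embedding = V * np.sqrt(lam)

def nystrom_embed(x_new):
    # Out-of-sample: e_k(x) = (1 / sqrt(lambda_k)) * sum_i v_ik K(x, x_i).
    k = gaussian_kernel(x_new[None, :], X)[0]
    return (k @ V) / np.sqrt(lam)

# At a training point, the formula reproduces the training embedding exactly,
# since K(x_j, .) applied to v_k gives lambda_k v_jk.
print(np.allclose(nystrom_embed(X[0]), train_embedding[0]))
```

The same scheme applies to the other algorithms once their data-dependent kernel is substituted for the gaussian kernel.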
The Nyström formula is a possible extension, but it does not exclude other extensions that might be better or worse estimators of the eigenfunctions of the asymptotic linear operator G. When the kernels are positive semidefinite, these methods can also be immediately seen as performing kernel PCA. Note that Isomap generally yields a Gram matrix with negative eigenvalues, and users of MDS, spectral clustering, or Laplacian eigenmaps may want to use a kernel that is not guaranteed to be positive semidefinite. The analysis in this article can still be applied in that case, even though the kernel PCA analogy no longer holds. The experiments performed here have shown empirically, on several data sets, that the predicted out-of-sample embedding is generally not far from the one that would be obtained by including the test point in the training set and that the difference is of the same order as the effect of small perturbations of the training set. An interesting parallel can be drawn between the spectral embedding
algorithms and the view of PCA as finding the principal eigenvectors of a matrix obtained from the data. This article parallels for spectral embedding the view of PCA as an estimator of the principal directions of the covariance matrix of the underlying unknown distribution, thus introducing a convenient notion of generalization relating to an unknown distribution. Finally, a better understanding of these methods opens the door to new and potentially much more powerful unsupervised learning algorithms. Several directions remain to be explored:

• Using a smoother distribution than the empirical distribution to define the linear operator Gn. Intuitively, a distribution that is closer to the true underlying distribution would have a greater chance of yielding better generalization, in the sense of better estimating the eigenfunctions of G. This relates to putting priors on certain parameters of the density, as in Rosales and Frey (2003).

• All of these methods capture salient features of the unknown underlying density. Can one use the representation learned through the estimated eigenfunctions in order to construct a good density estimator? Looking at Figure 1 suggests that modeling the density in the transformed space (right-hand side) should be much easier (e.g., it would require many fewer gaussians in a gaussian mixture) than in the original space.

• Learning higher-level abstractions on top of lower-level abstractions by iterating the unsupervised learning process in multiple layers. These transformations discover abstract structures such as clusters and manifolds. It might be possible to learn even more abstract (and less local) structures, starting from these representations.

Acknowledgments

We thank Léon Bottou, Christian Léger, Sam Roweis, Yann Le Cun, and Yves Grandvalet for helpful discussions, the anonymous reviewers for their comments, and the following funding organizations: NSERC, MITACS, IRIS, and the Canada Research Chairs.

References

Baker, C.
(1977). The numerical treatment of integral equations. Oxford: Clarendon Press.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
Bengio, Y., Paiement, J., Vincent, P., Delalleau, O., Le Roux, N., & Ouimet, M. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Chung, F. (1997). Spectral graph theory. Providence, RI: American Mathematical Society.
Cox, T., & Cox, M. (1994). Multidimensional scaling. London: Chapman & Hall.
de Silva, V., & Tenenbaum, J. (2003). Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 705–712). Cambridge, MA: MIT Press.
Donoho, D., & Grimes, C. (2003). Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data (Tech. Rep. No. 2003-08). Stanford, CA: Department of Statistics, Stanford University.
Gower, J. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55(3), 582–585.
Ham, J., Lee, D., Mika, S., & Schölkopf, B. (2003). A kernel view of the dimensionality reduction of manifolds (Tech. Rep. No. TR-110). Tübingen, Germany: Max Planck Institute for Biological Cybernetics.
Koltchinskii, V., & Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6(1), 113–167.
Kreyszig, E. (1990). Introductory functional analysis with applications. New York: Wiley.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Rosales, R., & Frey, B. (2003). Learning generative models of affinity matrices. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (pp. 485–492). San Francisco: Morgan Kaufmann.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Saul, L., & Roweis, S. (2002). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem (Tech. Rep. No. 44). Tübingen, Germany: Max Planck Institute for Biological Cybernetics.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Shawe-Taylor, J., Cristianini, N., & Kandola, J. (2002). On the concentration of spectral properties. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Shawe-Taylor, J., & Williams, C. (2003). The stability of kernel principal components analysis and its relation to the process eigenspectrum. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Shi, J., & Malik, J. (1997). Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (pp. 731–737). New York: IEEE.
Spielman, D., & Teng, S. (1996). Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science. New York: IEEE.
Strang, G. (1980). Linear algebra and its applications. New York: Academic Press.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. In Proceedings IEEE International Conference on Computer Vision (pp. 975–982). New York: IEEE.
Williams, C. (2001). On a connection between kernel PCA and metric multidimensional scaling. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 675–681). Cambridge, MA: MIT Press.
Williams, C., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.

Received March 24, 2003; accepted March 28, 2004.
LETTER
Communicated by Dmitri Chklovskii
Predicting Axonal Response to Molecular Gradients with a Computational Model of Filopodial Dynamics Geoffrey J. Goodhill [email protected] Department of Neuroscience, Georgetown University Medical Center, Washington, D.C. 20007, U.S.A.
Ming Gu [email protected] Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Jeffrey S. Urbach [email protected] Department of Physics, Georgetown University, Washington, D.C. 20057, U.S.A.
Neural Computation 16, 2221–2243 (2004)  © 2004 Massachusetts Institute of Technology

Axons are often guided to their targets in the developing nervous system by attractive or repulsive molecular concentration gradients. We propose a computational model for gradient sensing and directed movement of the growth cone mediated by filopodia. We show that relatively simple mechanisms are sufficient to generate realistic trajectories for both the short-term response of axons to steep gradients and the long-term response of axons to shallow gradients. The model makes testable predictions for axonal response to attractive and repulsive gradients of different concentrations and steepness, the size of the intracellular amplification of the gradient signal, and the differences in intracellular signaling required for repulsive versus attractive turning.

1 Introduction

A key method used by developing axons to navigate to appropriate targets in the embryonic nervous system is guidance by molecular gradients (Mueller, 1999; Song & Poo, 2001; Yu & Bargmann, 2001; Dickson, 2002; Huber, Kolodkin, Ginty, & Cloutier, 2003; Guan & Rao, 2003). Both gradient sensing and directed movement are primarily mediated by the growth cone, a highly specialized structure at the axonal tip (Gordon-Weeks, 2000). Growth cones typically consist of a central zone surrounded by web-like veils, the lamellipodia, and highly dynamic spike-like protrusions, the filopodia, which usually have an average lifetime of just a few minutes (Rehder
& Kater, 1996). Growth cones can display turning responses within a few minutes when exposed to a steep gradient in a two-dimensional culture system (e.g., Gundersen & Barrett, 1979; Zheng, Felder, Conner, & Poo, 1994; Zheng, Wan, & Poo, 1996; Ming et al., 1997; Song, Ming, & Poo, 1997; de la Torre, Höpker, Ming, & Poo, 1997; Song et al., 1998; Hong, Hinck, Nishiyama, Poo, & Tessier-Lavigne, 1999; Hong, Nishiyama, Henley, Tessier-Lavigne, & Poo, 2000; Zheng, 2000; Ming et al., 2002; Nishiyama et al., 2003). When exposed to a shallow gradient generated by diffusion from nearby tissue in a three-dimensional culture system, growth cones display quite variable trajectories, but with a tendency toward smooth turning up (or, for repulsive factors, down) the gradient on a timescale of several hours (e.g., Lumsden & Davies, 1983, 1986; Tessier-Lavigne, Placzek, Lumsden, Dodd, & Jessell, 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards, Koester, Tuttle, & O'Leary, 1997; Varela-Echavarria, Tucker, Püschel, & Guthrie, 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003; Charron, Stein, Jeong, McMahon, & Tessier-Lavigne, 2003). Attraction can sometimes be converted to repulsion, and vice versa, by manipulating the levels of cyclic nucleotides in the growth cone (Song et al., 1997, 1998; Song & Poo, 2001; Nishiyama et al., 2003), by growing on alternative substrates (Höpker, Shewan, Tessier-Lavigne, Poo, & Holt, 1999), and by electrical activity (Ming, Henley, Tessier-Lavigne, Song, & Poo, 2001). The axonal trajectories generated by repulsion roughly mirror those generated by attraction.
Several lines of evidence suggest that filopodia often make a critical contribution to the sensing and movement capabilities of the growth cone, both in vitro and in vivo (Argiro, Bunge, & Johnson, 1985; O'Connor, Duerr, & Bentley, 1990; Myers & Bastiani, 1993; Davenport, Dou, Rehder, & Kater, 1993; Zheng et al., 1996; Steketee & Tosney, 1999; but see also Wang, Liu, Diefenbach, & Jay, 2003). When the filopodia are eliminated, growth cones cannot navigate their environment and do not respond to either substrate-bound or diffusible guidance cues (Bentley & Toroian-Raymond, 1986; Chien, Rosenthal, Harris, & Holt, 1993). In a steep gradient in vitro, filopodia become asymmetrically distributed toward the source of an attractive factor and away from the source of a repulsive factor on the timescale of a few minutes (Zheng et al., 1996; Zheng, 2000). The general issue of chemotaxis by small sensing devices has been extensively studied experimentally in systems such as bacteria, leukocytes, and slime molds (e.g., Devreotes & Zigmond, 1988; Eisenbach, 1996; Parent & Devreotes, 1999). In addition, chemotaxis in these systems has been subjected to a variety of theoretical analyses (Tranquillo, 1990), including predictions of the minimum detectable gradient steepness (Berg & Purcell, 1977), models of the trajectories generated (Tranquillo & Lauffenburger, 1987), and models of the intracellular transduction pathways that mediate response to gradients (Moghe & Tranquillo, 1995; Barkai & Leibler, 1997). However, such quantitative analyses are much less well developed for growth cones
(Buettner, 1995; Robert & Sweeney, 1997; Hely & Willshaw, 1998; Goodhill, 1998; Goodhill & Urbach, 1999; Meinhardt, 1999). In particular, the theoretical consequences for growth cone gradient sensing of the unique dynamical properties of filopodia remain unexplored, and the precise requirements for generating trajectories similar to those seen experimentally in both attractive and repulsive gradients have not been investigated. Here, we present the first quantitative model of how filopodia mediate gradient sensing and directed movement by growth cones, for both attractive and repulsive gradients. Our basic hypothesis is that new filopodia tend to be generated in the direction where ligand binding is highest, in the case of attraction, and lowest, in the case of repulsion, and that the growth cone turns toward the average direction of the filopodia. We show that this hypothesis is sufficient to account for the trajectories generated by growth cones in both attractive and repulsive gradients on both short and long timescales. The model is expressed at the level of filopodial dynamics and does not address in detail the sequence of intracellular signaling events leading from receptor binding to the spatially anisotropic polymerization and bundling of actin that actually causes filopodial extension (Song & Poo, 2001; Tanaka & Sabry, 1995; Suter & Forscher, 2000; Steketee, Balazovich, & Tosney, 2001; Zhou, Waterman-Storer, & Cohan, 2002). However, the model makes a number of predictions about the mathematical constraints that these intracellular transformations should satisfy in order to reproduce the trajectories seen experimentally. In particular, it predicts the degree of amplification required from the binding signal to the turning signal and how the signal transduction pathways mediating attraction may differ from those mediating repulsion. The model also suggests that the response to attractive and repulsive gradients may not be entirely symmetric.
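The turning part of this hypothesis, in the simple "angle" version described in section 2, amounts to relaxing the growth direction toward the mean filopodial angle at each time step. A minimal sketch (the 0.97 inertia value is the one given in section 2.1; the filopodial angles here are arbitrary illustrative numbers, and linear angle averaging is adequate only for the small angles used):

```python
import numpy as np

# Sketch of the turning rule: the growth cone orientation relaxes toward the
# mean angle of its current filopodia at every time step.

def update_orientation(orientation, filopodia_angles, inertia=0.97):
    # new orientation = inertia * old orientation
    #                   + (1 - inertia) * mean filopodial angle
    return inertia * orientation + (1 - inertia) * np.mean(filopodia_angles)

orientation = 0.0                           # growing straight ahead (radians)
filopodia = np.array([0.3, 0.1, 0.4, 0.2])  # biased to one side of the axis
for _ in range(200):                        # repeated steps turn the axon
    orientation = update_orientation(orientation, filopodia)
print(round(orientation, 3))
```

With a fixed filopodial bias, the orientation converges geometrically toward the mean filopodial angle (here 0.25 rad), so the inertia parameter sets the turning timescale.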
2 The Model

We consider an idealized growth cone consisting of a two-dimensional semicircular body from which several one-dimensional filopodia extend (see Figure 1A). Both the (one-dimensional) surface of the growth cone body and the filopodia are covered with receptors at random locations. The probability pi for receptor i to be bound is given by pi = Ci/(Ci + KD), where Ci is the external ligand concentration at the position of receptor i and KD is the dissociation constant for the receptor-ligand complex. At each time step, the state of each receptor (bound or unbound) is updated by random assignment with this probability. Receptor binding is averaged within a local region (hereafter referred to as a bin), with bins equally spaced around the growth cone. The number of bound receptors is then divided by the total number of receptors in that bin to give the fraction of bound receptors per bin, b(θ). This binding signal is assumed to establish a shallow internal gradient of an intracellular signaling molecule. This shallow gradient by itself is insufficient to generate the degree of turning shown by real axons in gradients (see
[Figure 1 appears here; panel B plots P(filopodium generation) against the angle around the growth cone.]
Figure 1: The model. (A) Model growth cone. The small circles represent receptors, which are distributed randomly on the surface of both the growth cone and the filopodia. (B) Schematic example of the internal signaling gradient generated after amplification. An attractive external ligand gradient points to the right of the growth cone, and the concentration at the center of the growth cone is KD. Solid line: no amplification (i.e., b(θ)). Dashed line: amplification = 5 (i.e., b^5(θ)). Dash-dotted line: amplification = 10 (i.e., b^10(θ)). Note that the larger the amplification, the higher the probability that a new filopodium will be generated on the right rather than the left of the growth cone (though there is always a nonzero probability of a filopodium appearing at any angle). For clarity, a fractional change in concentration of 50% across the growth cone is shown; the gradients actually used in the model are much shallower than this.
section 3). We therefore further assume that there is an amplification step between this gradient and the probability of generating a new filopodium. In reality, this amplification arises from a complex network of intracellular signaling (Parent & Devreotes, 1999; Song & Poo, 2001; Guan & Rao, 2003). The precise details of this network are not yet well understood for growth cones; we simply assume that its ultimate effect is to raise the shallow internal gradient to a power n (see Figure 1B; this is somewhat analogous to the standard assumption in many neural network models that the output of a neuron is simply a sigmoid function of its input; Haykin, 1999). This amplified gradient then gives the probability P(θ) of generating a new filopodium in each bin, that is, P(θ) ∝ b^n(θ) in the attractive case. For repulsive ligand gradients, an inversion step is required. We consider three forms: P(θ) ∝ (1 − b(θ))^n, P(θ) ∝ 1 − b^n(θ), and P(θ) ∝ b^{−n}(θ) (see section 5). The bin from which a new filopodium will extend is then chosen from this probability distribution, and a new filopodium is extended from a position chosen from a uniform distribution within this bin. At the same time, the oldest filopodium is retracted, maintaining a constant number of filopodia on the growth cone. The growth cone then moves a constant small distance, mostly in the forward direction but with a slight deviation to one side or the other (a rate of growth that depends on the concentration of factor could
be added as appropriate). This deviation is in the direction defined by the net pull of the filopodia. The entire cycle of receptor binding, internal gradient establishment, amplification, bin selection, filopodium generation, and directed movement is then repeated. We have investigated two versions of the model. In the angle version, the distance the growth cone moves forward at each time step is constant, and the deviation is toward the mean angle of the filopodia. In the force version, the distance moved and the deviation are proportional to the force exerted by the filopodia, assuming each pulls equally along its length. The results generated by these two versions of the model are very similar, and we therefore present results only for the simpler angle model (see also section 4).

2.1 Parameters. Unless otherwise stated, the parameter values used to generate the data shown in section 3 were those listed below and in Table 1, which also lists some typical values from the experimental literature. This list represents a small fraction of the quantitative data in the literature, and in some axon guidance contexts, different parameter values would be appropriate. Additional parameter values were as follows. The number of receptors on the growth cone was 3000 in total, with 100 per filopodium (1500 total on filopodia) and the remaining 1500 distributed on the body of
Table 1: Parameters Used in the Model.

Growth cone width (microns) — model value: 20
  18 — DRG; glass, PLL (Bray & Chapman, 1985)
  10 — Xenopus spinal; glass (Zheng et al., 1996)
  20±2, 18±4 — PC12, SCG; collagen (Aletta & Greene, 1988)
  18–24 — Xenopus spinal; glass (Zheng, 2000)
  35–45 — zebrafish retinal; in vivo (Hutson & Chien, 2002)

Growth cone speed (microns/hour) — model value: 20
  11–18 (turning), 40 (not turning) — Xenopus spinal; glass (Zheng et al., 1996)

Number of filopodia per growth cone — model value: 15
  4–6 — zebrafish retinal; in vivo (Hutson & Chien, 2002)
  6 — LBD, SND; in vivo (Kim, Kolodziej, & Chiba, 2002)
  8–20 — Xenopus spinal; glass (Zheng et al., 1996)
  10–25 — PC12, SCG; collagen (Aletta & Greene, 1988)

Filopodium length (microns) — model value: 10
  6.9±0.4 — DRG; glass, PLL (Bray & Chapman, 1985)
  5–15 — Aplysia; PLL (Goldberg & Burmeister, 1986)
  3–4 — zebrafish retinal; in vivo (Hutson & Chien, 2002)
  4–7 — LBD, SND; in vivo (Kim et al., 2002)

Filopodium lifetime (min) — model value: 7.5
  9 — Xenopus spinal; glass (Zheng et al., 1996)
  6 — LBD, SND; in vivo (Kim et al., 2002)
  7.5 — Xenopus spinal; glass (Gomez, Robles, Poo, & Spitzer, 2001)

Notes: Parameter values used in the simulations reported here and some representative values from the literature. Note that since the number of filopodia is assumed fixed, the filopodium lifetime determines the size of each time step: for the parameters here, the time step is 30 seconds. DRG, dorsal root ganglion; PLL, poly-L-lysine; SCG, superior cervical ganglion.
the growth cone. The number of angle bins was 25 (an average of 120 receptors per bin). The ligand concentration was equal to KD at the starting point of each growth cone trajectory. The concentrations of diffusible attractants encountered by axons in vivo are generally unknown, but KD is the concentration at which growth cones are expected to be maximally sensitive to gradients (Tranquillo, 1990). The gradient was exponential, with a steepness (expressed as the percentage change in ligand concentration across 10 µm) varying from 0.1% to 10%. The new orientation of the growth cone is equal to 0.97 times the previous orientation plus 0.03 times the average angle of the filopodia. This value controls how quickly the axon reorients in response to changes in the filopodia distribution. Values much lower than 0.97 produced trajectories for long-term responses that were more convoluted than those observed experimentally. The length of each time step was 30 seconds, and the distance moved forward in this time was 1/6 µm (equivalent to 20 µm/hour). At successive time steps, the growth cone makes a statistically independent measurement of the concentration in each bin. For diffusion-limited receptor-binding kinetics, the minimum time between statistically independent concentration measurements can be estimated as the radius of the sensing device squared divided by the diffusion constant (Berg & Purcell, 1977). For diffusion constants in the range 10⁻⁶ to 10⁻⁷ cm²/sec, this time is between 1 and 10 seconds, which, as required, is less than the time step used in the model. Each individual axon was initialized with a different random seed and a different initial distribution of filopodia. The total time simulated was 2 hours for "short" trajectories and 40 hours for "long" trajectories.

3 Results

Our model represents the main steps that may occur inside the growth cone to convert a shallow external ligand gradient into directed movement (see Figure 1).
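One update cycle of the model, with the parameter values above, can be sketched as follows. This is our own simplified discretization for the attractive case: the bin count, receptors per bin, KD scaling, filopodia number, and 0.97/0.03 orientation blend follow section 2.1, while the amplification exponent and gradient steepness are illustrative choices within the ranges discussed in the text.

```python
import numpy as np

# One update cycle: receptor binding -> binned binding fractions b(theta)
# -> amplification -> sample a bin for the new filopodium -> turn.

rng = np.random.default_rng(1)
n_bins, receptors_per_bin = 25, 120
KD = 1.0
steepness = 0.10       # 10% change in ligand concentration per 10 microns
width = 20.0           # growth cone width (microns)
n_amp = 10             # hypothetical amplification exponent

# Ligand concentration at each bin for an exponential gradient, with
# concentration KD at the center of the growth cone.
theta = np.linspace(-np.pi / 2, np.pi / 2, n_bins)  # angle around the cone
x = (width / 2) * np.sin(theta)                     # position along gradient
C = KD * (1 + steepness) ** (x / 10.0)

# Stochastic receptor binding: each receptor is bound with probability
# C / (C + KD); b is the fraction of bound receptors per bin.
p = C / (C + KD)
bound = rng.binomial(receptors_per_bin, p)
b = bound / receptors_per_bin

# Amplification and (attractive case) filopodium-generation probabilities.
P = b ** n_amp
P = P / P.sum()
new_filopodium_bin = rng.choice(n_bins, p=P)

# Orientation update: blend the previous orientation (here 0) toward the
# mean angle of the 15 filopodia, drawn here from the same distribution.
filopodia_angles = theta[rng.choice(n_bins, size=15, p=P)]
orientation = 0.97 * 0.0 + 0.03 * filopodia_angles.mean()
```

Iterating this cycle while advancing the position by 1/6 µm per 30-second step produces a trajectory; for the repulsive case, P would be computed from one of the inversion forms given in section 2.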
The gradient of receptor binding over the surface of the growth cone produces a difference in the likelihood that filopodia will be generated on one side of the growth cone compared to the other. This difference could arise, for example, from an internal gradient in actin polymerization or bundling triggered by a gradient of an intracellular signaling molecule. In each time step, the growth cone moves forward and turns slightly toward the net direction of the filopodia, due either to the forces generated by filopodial adhesions or to the asymmetric effects of the filopodia on microtubule dynamics. Two crucial components in the conversion from the initial shallow gradient to the gradient of probability of filopodium generation are amplification and inversion. Amplification is the process that steepens the initial gradient so that it is effective for guidance (illustrated in Figure 1B). Inversion is required for repulsive ligand gradients: besides being amplified, the direction of the initial shallow gradient must now be reversed so
Axonal Response to Gradients
2227
that filopodia are more likely to be generated on the down-gradient side of the growth cone than the up-gradient side. Growth cones can display a wide variety of behaviors in different contexts. For instance, a single filopodial contact can sometimes cause a complete redirection of the growth cone (O'Connor et al., 1990). Our concern was to model two specific well-characterized behaviors of axons growing in or on a uniform substrate: a rapid turn in response to a steep external ligand gradient, and a slow turn in response to a shallow external ligand gradient. Experimental data for the first case come primarily from Poo and colleagues (Zheng et al., 1994, 1996; Ming et al., 1997, 2002; de la Torre et al., 1997; Song et al., 1997, 1998; Hong et al., 1999, 2000; Zheng, 2000; Nishiyama et al., 2003). By slowly ejecting a chemotropic factor from a pipette near a growth cone growing on a two-dimensional substrate covered in fluid medium, gradients of steepness 5% to 10% over 10 µm are established at the growth cone (Zheng et al., 1994). Under appropriate conditions, the axons turn toward (or away from) the pipette on a timescale of the order of 1 hour. Experimental data regarding the long-term response of axons to shallow gradients come from the 3D collagen gel assay (e.g., Lumsden & Davies, 1983, 1986; Tessier-Lavigne et al., 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards et al., 1997; Varela-Echavarria et al., 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003; Charron et al., 2003). Here, a target explant of factor-secreting tissue, or a block of, for example, 293T cells transfected with the factor, is placed close to an explant containing the neurons under investigation. Under appropriate conditions, axons emerge from the latter explant and grow preferentially toward (or away from) the target on a timescale of the order of 1 day.
This is generally interpreted to mean that the factor has diffused from the target into the collagen, creating a gradient (Goodhill, 1997, 1998), though such gradients have not been directly measured. To investigate the behavior of the model for short trajectories in a steep attractive gradient, we simulated 100 axons, each starting from the origin and initially moving in the positive y direction (see Figure 2). An exponential gradient with a fractional change of either 0% or 5% per 10 µm was present, directed along the x-axis at an angle of 90 degrees to the initial trajectory. Parameter values are given in section 2, and the relationship between filopodial initiation probability P and receptor binding b was given by P ∝ b^n. Figures 2A and 2B show a typical set of trajectories for an amplification factor n = 10 for zero gradient and a 5% gradient, respectively (Zheng et al., 1994, 1996; Ming et al., 1997, 2002; de la Torre et al., 1997; Song et al., 1997, 1998; Hong et al., 1999, 2000; Zheng, 2000). Analysis of the model below shows that any combination of amplification and steepness with the same product (e.g., n = 5, steepness = 10%) would produce approximately the same effect. Figure 2C shows the cumulative distribution of final turning angles for zero gradient, amplification = 10, and for a 5%
[Figure 2 appears here. Axes: distance in microns (A, B, D), cumulative angle distribution (%) versus turning angle in degrees (C), time in minutes (D).]
Figure 2: Short axon trajectories generated by the model. An attractive gradient of steepness 0% (A) or 5% (B) points to the right. One hundred simulated axons are shown, each starting from a different random seed. Amplification = 10 in both cases. Growth cones themselves are not drawn. (C) Cumulative distribution of final axon turning angles. The left-most curve is for zero gradient; the other three curves moving from left to right are for a 5% gradient with amplification = 1,5,10. (D) Mean x displacement of each population of 100 axons as a function of time. The four curves from bottom to top correspond to the four curves from left to right in C.
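The simulation scheme described above can be sketched in a few dozen lines. This is a simplified illustration, not the authors' code: the growth cone radius, the gradient constant, and the weighted-sampling scheme for filopodial angles (drawing candidate angles with probability proportional to b^n) are our assumptions; the 0.97/0.03 reorientation rule, the 30 s time step, and the 1/6 µm step length follow section 2.

```python
import math
import random

KD = 1.0             # concentration measured in units of KD
ALPHA = 0.05 / 10    # ~5% change per 10 microns (gradient along +x), an approximation
R = 10.0             # assumed growth cone radius, microns
N_AMP = 10           # amplification exponent n
STEP = 1.0 / 6       # microns advanced per 30 s time step

def binding(c):
    """Fraction of receptors bound at concentration c (units of KD)."""
    return c / (KD + c)

def simulate(steps=240, n_filopodia=25, seed=0):
    """One axon trajectory: filopodial angles biased by b(theta)^n,
    inertial reorientation, constant forward speed."""
    rng = random.Random(seed)
    x = y = 0.0
    phi = 0.0  # growth direction relative to the +y axis
    for _ in range(steps):
        # candidate angles over the leading edge, weighted by b^n
        angles = [rng.uniform(-math.pi / 2, math.pi / 2) for _ in range(200)]
        weights = [binding(math.exp(ALPHA * (x + R * math.sin(t + phi)))) ** N_AMP
                   for t in angles]
        filo = rng.choices(angles, weights=weights, k=n_filopodia)
        mean_angle = sum(filo) / len(filo)
        # new orientation = 0.97*old + 0.03*(mean absolute filopodial direction)
        phi = 0.97 * phi + 0.03 * (phi + mean_angle)
        x += STEP * math.sin(phi)
        y += STEP * math.cos(phi)
    return x, y

x, y = simulate()  # 240 steps = 2 simulated hours
```

Running `simulate` over many seeds and averaging the final x values gives a population measure analogous to Figure 2D.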
gradient, amplification = 1, 5, 10. An alternative to the cumulative distribution as a measure of the response is the mean distance each population of axons has moved in the x direction as a function of time (see Figure 2D). For amplification = 1, only a modest degree of turning up the gradient is seen, and there is a wide spread in trajectories, while for amplification = 10, there is more robust turning and a smaller spread of trajectories. It is thus clear that amplification in the range 5 to 10 in the model is necessary to reproduce the behavior of axons seen experimentally. We also investigated the sensitivity of the model to the number of filopodia used in the range 5 to 25 (data not shown); there were no significant changes in the observed behavior. The model is somewhat sensitive to the total number of receptors on the growth cone. The mean x displacement after 2 hours increased slightly as receptor numbers increased above 3000 and decreased significantly as receptor numbers decreased below 1000 (data not shown).

[Figure 3 appears here.]

Figure 3: Statistics of short axon trajectories. (A) Mean x displacement after 2 hours as a function of gradient steepness. Error bars are standard errors in the mean (SEMs) averaged over 100 axons per run. The three curves are amplification = 1, 5, 10 from bottom to top. (B) Mean x displacement as a function of concentration at the growth cone, for amplification = 1, 5, 10. Error bars are SEMs averaged over five runs of 100 axons per run. (C) Gradient steepness versus turning time. Solid lines are amplification = 1 to 10 moving from right to left. Dashed line: amplification = 100. Dotted line: theoretical prediction based on thermal noise from Berg and Purcell (1977). Dot-dash line: theoretical prediction based on receptor noise from Tranquillo (1990).

The model produces specific predictions for how the axonal response to gradients varies with gradient steepness and with absolute concentration (Goodhill, 1998; Goodhill & Urbach, 1999). Figure 3A plots the mean x displacement after 2 hours as a function of (attractive) gradient steepness for amplification = 1, 5, 10. As expected, the mean x displacement increases with increasing gradient steepness. Figure 3B plots x displacement as a function of the concentration at the starting position of the growth cones for amplification = 1, 5, 10. It can be seen that the response drops off
away from C = Kd. For high concentrations, most receptors are bound most of the time, whereas for low concentrations, hardly any receptors are bound, and in neither case is it possible to detect a small difference in average binding between the two sides of the growth cone. This roughly matches data for leukocyte chemotaxis (Zigmond, 1977), and for growth cones using a new experimental assay we have recently developed (Rosoff et al., 2004). In order to compare the behavior of the model with theoretical predictions regarding gradient detection by small sensing devices, we simulated gradient steepness versus turning time, defined as the first time at which the mean x displacement exceeds the standard deviation of x displacement (see Figure 3C). As expected, the greater the amplification, the shorter is the turning time for a given gradient steepness. Also shown in Figure 3C is the theoretical prediction of Berg and Purcell (1977), as applied to growth cones (Goodhill & Urbach, 1999). This model calculates the unavoidable thermal fluctuations in the local concentration of chemoattractant, and thus the minimum possible detectable gradient steepness. Here, turning time is taken to be the time over which the sensing device averages concentration measurements before making a decision about the direction of the gradient. There are at least three effects that reduce the sensitivity of our model below the theoretical maximum. One effect is the randomness arising from the generation of new filopodia (see section 2), which is included to model intracellular noise. This effect can be effectively eliminated by using an extremely high value of the amplification, such as that used to generate the dashed line (amplification = 100). A second effect is the noise arising from the inherently stochastic receptor binding. Another curve in Figure 3C shows the maximum possible turning time after allowing for receptor noise (Tranquillo, 1990).
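The turning-time criterion just defined (first time at which the population's mean x displacement exceeds its standard deviation) is easy to compute from simulated trajectories. A minimal sketch, with hypothetical toy data rather than the paper's simulation output:

```python
import statistics

def turning_time(x_trajectories, dt=0.5):
    """First time (in minutes) at which the mean x displacement across
    the population exceeds the standard deviation of x displacement.
    x_trajectories: one list of x positions per axon, one entry per step.
    dt: minutes per time step (0.5 min = the model's 30 s step)."""
    n_steps = len(x_trajectories[0])
    for step in range(1, n_steps):
        xs = [traj[step] for traj in x_trajectories]
        if statistics.mean(xs) > statistics.stdev(xs):
            return step * dt
    return None  # the population never turned detectably

# Toy population: a common drift of 0.1 per step plus a fixed spread.
trajs = [[0.1 * s + 0.01 * i for s in range(10)] for i in range(-5, 6)]
```

With steeper gradients (larger per-step drift) the criterion is met earlier, which is the qualitative content of Figure 3C.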
Finally, at short times, the turning time is limited by the fact that the growth cone can reorient only a relatively small amount in each time step (see section 2; without this inertia, the trajectories are highly irregular and do not match those seen in experiments). Besides being attracted by molecular gradients, axons can often also be repelled by these gradients (Song et al., 1997, 1998; Song & Poo, 2001; Nishiyama et al., 2003). We used the same model parameters for repulsion as for attraction, except that now filopodia are preferentially generated on the side of the growth cone facing down the gradient. Figure 4A shows repulsive turning when the probability of generating a new filopodium is given by P(θ) ∝ (1 − b(θ))^n (5% gradient to the right, amplification = 10). Figure 4B shows the response as a function of concentration. This curve is similar to that for attractive turning shown in Figure 3B, except reversed: in the repulsive case, sensitivity drops off faster at low concentration than at high concentration, whereas the opposite is true for attraction. We also investigated P(θ) ∝ b(θ)^{-n} and P(θ) ∝ 1 − b^n(θ) (data not shown). The former case produced a response as a function of concentration very similar to the attractive case, while the latter case produced very little turning. The biological implementation of these different forms for P(θ) is considered in section 5, and their mathematical properties are analyzed below.

[Figure 4 appears here.]

Figure 4: Short trajectories in a repulsive gradient. (A) As before, the gradient increases to the right (steepness = 5%, amplification = 10). (B) The repulsive response as a function of concentration for amplification = 1, 5, 10. Note the mirror symmetry between B and Figure 3B.

Although simulating the short-term response of axons in a steep gradient allows comparison with experiments using the pipette assay, more relevant to understanding the behavior of axons in vivo are simulations of the long-term response of axons to shallow gradients, as in 3D collagen gels (Lumsden & Davies, 1983, 1986; Tessier-Lavigne et al., 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards et al., 1997; Varela-Echavarria et al., 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003). Figure 5 shows results of the model after 40 hours of simulated growth from an "explant," giving axons of length 800 µm. One hundred axons were initially distributed uniformly in a disk (explant) of radius 300 µm, with random initial directions. The disks drawn in Figure 5 are added simply to show the extent of the explant; as in the experimental case, the explant obscures the initial trajectories of the axons until they leave it. The trajectories are generated with the gradient increasing in the positive y direction (upward in the figure; the actual gradient steepness present in standard 3D collagen gel experiments is unknown). All other parameters were as for the short trajectories above, with amplification = 10. The explants generated by the model in this case resemble (at least qualitatively) explants seen experimentally (Lumsden & Davies, 1983, 1986; Tessier-Lavigne et al., 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards et al., 1997; Varela-Echavarria et al., 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003).
Figure 6A shows the mean y displacement as a function of the concentration at the center of the explant.
Figure 5: Long axon trajectories from an explant. Amplification = 10, time simulated = 40 hours. Gradient points upward, with the steepness indicated for each row. Each column represents a different random seed (the same within each column).
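The explant initialization described above (100 axons placed uniformly in a 300 µm disk, each with a random initial direction) can be sketched as follows. The rejection-sampling approach to uniform placement is our choice, not necessarily the authors':

```python
import math
import random

def init_explant(n_axons=100, radius=300.0, seed=0):
    """Place axon starting points uniformly in a disk ('explant') of the
    given radius (microns), each with a random initial direction."""
    rng = random.Random(seed)
    axons = []
    while len(axons) < n_axons:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:  # rejection sampling: uniform in the disk
            direction = rng.uniform(0.0, 2.0 * math.pi)
            axons.append((x, y, direction))
    return axons

starts = init_explant()
```

Each tuple would then seed one long (40 hour) trajectory under the same stepping rules as the short simulations.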
Again, sensitivity decays faster at high concentrations than at low concentrations. Figure 6B shows the analogous result in a repulsive gradient with the same parameters, using the P(θ) ∝ (1 − b(θ))^n form. Again, sensitivity decays faster at low concentrations than at high concentrations. Figure 6C shows mean y displacement as a function of gradient steepness.

4 Analysis of the Model

Here we analyze certain aspects of the behavior of our model mathematically: the trajectory of the initial turn up the gradient and the concentration dependence of the sensitivity.

4.1 Calculation of Trajectories. The model is fundamentally stochastic: receptor binding is probabilistic, as is the position at which a new filopodium is generated (biased by the direction of maximum binding). However, for a simplified and deterministic version, a closed-form solution for the initial turn can be derived. Consider the "bare" (no filopodia) growth cone shown in Figure 7A, being attracted by an exponential gradient along the x axis of the form C = C_0 e^{αx}. (Refer to Table 2 for terminology.) The average binding b(θ) at each position on the growth cone is given by
\[
b(\theta) = \frac{C_0 e^{\alpha(x + r\sin(\theta+\phi))}}{K_D + C_0 e^{\alpha(x + r\sin(\theta+\phi))}}.
\]
[Figure 6 appears here. Panels A and B: mean y displacement (microns) versus concentration/Kd; panel C: mean y displacement (microns) versus gradient steepness (%).]
Figure 6: Analysis of long axon trajectories from an explant. Amplification = 10, time simulated = 40 hours. (A,B) Mean y displacement as a function of concentration at the center of the explant (A = attraction, B = repulsion). The three curves in each case are (from top to bottom) for steepness = 1%, 0.5%, and 0.1%. (C) Mean y displacement as a function of gradient steepness (top line, amplification = 10; bottom line, amplification = 5).
For shallow gradients (small α), b^n(θ) is approximately
\[
b^n(\theta) = \left(\frac{\bar{C}e^{\alpha x}}{1+\bar{C}e^{\alpha x}}\right)^{n} \left(1+\frac{n\alpha r}{1+\bar{C}e^{\alpha x}}\sin(\theta+\phi)\right).
\]
Define a normalized and amplified probability distribution:
\[
B(\theta) = \frac{b^n(\theta)}{\int_{-\pi/2}^{+\pi/2} b^n(\psi)\,d\psi}.
\]
Then
\[
B(\theta) = \frac{1+\dfrac{n\alpha r}{1+\bar{C}e^{\alpha x}}\sin(\theta+\phi)}{\pi+\dfrac{2n\alpha r}{1+\bar{C}e^{\alpha x}}\sin\phi}
\approx \frac{1}{\pi}\left(1-\frac{2n\alpha r\sin\phi}{\pi(\bar{C}e^{\alpha x}+1)}+\frac{n\alpha r\sin(\theta+\phi)}{\bar{C}e^{\alpha x}+1}\right).
\]
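The quality of this first-order approximation is easy to check numerically. The sketch below (all parameter values are illustrative choices, not the paper's) computes the exact normalized density by quadrature and compares it with the approximate form:

```python
import math

# Illustrative parameters: amplification n, gradient alpha, radius r,
# normalized concentration cbar, position x, growth cone angle phi.
n, alpha, r, cbar, x, phi = 10, 0.001, 10.0, 1.0, 0.0, 0.3

def bn(t):
    """Amplified binding b(theta)^n, with concentration in units of K_D."""
    c = cbar * math.exp(alpha * (x + r * math.sin(t + phi)))
    return (c / (1.0 + c)) ** n

# Exact normalized density via midpoint-rule integration over [-pi/2, pi/2].
m = 2000
ts = [-math.pi / 2 + math.pi * (i + 0.5) / m for i in range(m)]
Z = sum(bn(t) for t in ts) * (math.pi / m)

def B_exact(theta):
    return bn(theta) / Z

def B_approx(theta):
    """First-order form (1/pi)(1 - 2k sin(phi)/pi + k sin(theta+phi))."""
    k = n * alpha * r / (1.0 + cbar * math.exp(alpha * x))
    return (1.0 / math.pi) * (1.0 - 2 * k * math.sin(phi) / math.pi
                              + k * math.sin(theta + phi))
```

For these parameters the composite quantity k = nαr/(1 + C̄) is 0.05, and the two forms agree to well within a percent of 1/π.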
Table 2: Terminology for the Analysis of the Model.

r: Growth cone radius
φ: Angle of growth cone relative to coordinate frame
θ: Angle over the growth cone surface, ∈ [−π/2, π/2]
C̄: Ligand concentration normalized by K_D (dimensionless)
C_0: Ligand concentration at the origin
α: Gradient steepness, where concentration = C_0 e^{αx}
b(θ): Receptor binding as a function of θ
B(θ): Normalized and amplified binding density
n: Amplification parameter
η: Time step in the differential equation for φ
ε: Time step in the differential equations for x and y
k: 2ηnαr/(π(1 + C̄))
The binding-weighted direction ⟨θ⟩ specified by this distribution is, to first order in α and assuming nα is still small,
\[
\langle\theta\rangle = \int_{-\pi/2}^{+\pi/2} \psi B(\psi)\,d\psi \approx \frac{2n\alpha r}{\pi(\bar{C}e^{\alpha x}+1)}\cos\phi. \tag{4.1}
\]
We now regard this as a small increment in the angle φ of the growth cone at each time step, that is, φ̇ ∝ ⟨θ⟩. This leads to the set of coupled differential equations
\[
\dot{\phi} = \eta\,\frac{2n\alpha r}{\pi(\bar{C}e^{\alpha x}+1)}\cos\phi, \qquad \dot{x} = \epsilon\sin\phi, \qquad \dot{y} = \epsilon\cos\phi,
\]
where η and ε are small. These equations are analytically intractable. However, for this deterministic version of the model, it is reasonable to assume that the background concentration does not change significantly over the length of the initial turn; that is, we can assume e^{αx} ≈ 1 for the initial turn. The equation for φ̇ then becomes simply
\[
\dot{\phi} = \eta\,\frac{2n\alpha r}{\pi(\bar{C}+1)}\cos\phi.
\]
Assuming φ(0) = 0, that is, the initial direction of the growth cone is perpendicular to the gradient as in our simulations, this has the solution
\[
\phi = \arcsin[\tanh(kt)],
\]
where k = 2ηnαr/(π(1 + C̄)). Solving the equations for ẋ and ẏ using this expression yields the following parametric trajectory:
\[
x(t) = \frac{\epsilon}{k}\ln(\cosh(kt)), \qquad y(t) = \frac{2\epsilon}{k}\arctan(\tanh(kt/2)).
\]
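The closed-form turn can be checked against direct numerical integration of the differential equations. The sketch below uses illustrative values for the composite constant k and the speed parameter ε (these are not the paper's parameters), integrates φ̇ = k cos φ with a simple Euler scheme, and compares the result with the closed-form expressions:

```python
import math

k = 0.02    # assumed composite turning constant 2*eta*n*alpha*r/(pi*(1+Cbar))
eps = 1.0   # assumed speed parameter epsilon
dt = 0.01
phi = x = y = 0.0
for _ in range(int(60 / dt)):          # integrate up to t = 60 time units
    phi += dt * k * math.cos(phi)      # phi_dot = k cos(phi), taking e^(alpha x) ~ 1
    x += dt * eps * math.sin(phi)      # x_dot = eps sin(phi)
    y += dt * eps * math.cos(phi)      # y_dot = eps cos(phi)

t = 60.0
phi_closed = math.asin(math.tanh(k * t))
x_closed = (eps / k) * math.log(math.cosh(k * t))
y_closed = (2 * eps / k) * math.atan(math.tanh(k * t / 2))
```

The Euler solution tracks the closed form closely at this step size, confirming that φ = arcsin[tanh(kt)] solves φ̇ = k cos φ (since cos φ = sech(kt) along the solution).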
[Figure 7 appears here. Panel A: schematic showing φ, θ, r, (x, y), and the gradient direction; panel B: distance in microns.]
Figure 7: Analysis of the model. (A) Schematic of the terminology used in the analysis. (B) Comparison of simulated trajectories (black lines; mean trajectory is solid white line) with the prediction of the analysis (dashed white line).
Figure 7 compares this predicted trajectory for typical parameters with the actual trajectories produced by simulations of the full stochastic model. The solid white line in Figure 7B is the mean of the set of trajectories simulated with the full model, while the dashed white line is the trajectory generated by the closed-form solution above: there is a good match.

This analysis reveals that k ∝ nαr/(1 + C̄) is a key determinant of the rate of turning. Any combination of n, α, and r that has the same product will have the same rate of turning. Thus, doubling the gradient steepness is equivalent to doubling the amplification, or doubling the width of the growth cone, and so on.

In the force version of the model, the total pull is the vector sum of the forces exerted by the filopodia. The total filopodial force can then be decomposed into a forward (in the current direction of the growth cone) and a turning (orthogonal to the current direction) force. The expected values of these forces are given by ⟨B(θ) cos θ⟩ and ⟨B(θ) sin θ⟩, respectively. For the turning force, we have
\[
\langle B(\theta)\sin\theta\rangle = \int_{-\pi/2}^{\pi/2} B(\psi)\sin\psi\,d\psi = \frac{n\alpha r}{2(\bar{C}e^{\alpha x}+1)}\cos\phi.
\]
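The turning-force integral can likewise be verified numerically. Using the same kind of illustrative parameters as before (again our choices, not the paper's), the expectation of B(θ) sin θ computed by quadrature should match the small-α prediction:

```python
import math

# Illustrative parameters (not from the paper's simulations).
n, alpha, r, cbar, x, phi = 10, 0.001, 10.0, 1.0, 0.0, 0.3

def bn(u):
    """Amplified binding b(theta)^n, concentration in units of K_D."""
    c = cbar * math.exp(alpha * (x + r * math.sin(u + phi)))
    return (c / (1.0 + c)) ** n

# Midpoint-rule quadrature over [-pi/2, pi/2].
m = 4000
us = [-math.pi / 2 + math.pi * (i + 0.5) / m for i in range(m)]
vals = [bn(u) for u in us]
Z = sum(vals) * (math.pi / m)

# <B(theta) sin(theta)> by numerical integration of the normalized density.
turning = sum(v * math.sin(u) for v, u in zip(vals, us)) * (math.pi / m) / Z

# Small-alpha prediction: n*alpha*r / (2*(Cbar*e^{alpha x} + 1)) * cos(phi).
predicted = n * alpha * r / (2.0 * (cbar * math.exp(alpha * x) + 1.0)) * math.cos(phi)
```

Replacing the factor 1/2 by 2/π in `predicted` reproduces the angle-model coefficient of equation 4.1, which is the analytical point being made here.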
This is exactly the same as the equivalent equation for the "angle" version of the model (see equation 4.1), except for the replacement of 2/π by 1/2. This demonstrates analytically that the force and angle models generate very similar trajectories.

4.2 Gradient Sensitivity. The concentration dependence of the gradient sensitivity can be understood from a first-order approximation of the concentration dependence of the filopodia initiation probability, B(C), near
the average concentration at the growth cone, C_o. By Taylor expansion,
\[
B(C) \approx B(C_o) + (C-C_o)\left.\frac{dB(C)}{dC}\right|_{C=C_o}
= B(C_o)\left[1+\frac{(C-C_o)}{C_o}\,\frac{C_o}{B(C_o)}\left.\frac{dB(C)}{dC}\right|_{C=C_o}\right].
\]
We have studied the concentration dependence of the sensitivity for gradients with constant fractional change across the growth cone, so the variation in the quantity (C − C_o)/C_o across the growth cone is comparable at all concentrations. In addition, the normalization condition requires that \(\int_{\theta_{min}}^{\theta_{max}} B(C_o) = 1\), so the dependence on concentration and amplification of the effective gain G_eff is determined by
\[
G_{eff} \equiv \frac{C_o}{B(C_o)}\left.\frac{dB(C)}{dC}\right|_{C=C_o}.
\]
By the chain rule,
\[
G_{eff} = \frac{C_o}{B(C_o)}\left.\frac{dB}{db}\right|_{b(C_o)}\left.\frac{db}{dC}\right|_{C_o}.
\]
For brevity, we now drop the subscript from C_o. The last term in the expression above comes from the variation of the receptor binding with concentration and is independent of the form of the amplification:
\[
\frac{db}{dC} = \frac{K_d}{(C+K_d)^2} = b^2\frac{K_d}{C^2}.
\]
Note that this function is approximately constant for C ≪ K_d and falls off rapidly when C is larger than K_d, as a consequence of the receptor saturation. For the first form of amplification considered, B(b) = A b^n, where A is the normalization constant,
\[
G_{eff} = nAb^{n-1}\,\frac{C}{Ab^n}\,b^2\frac{K_d}{C^2} = \frac{nK_d}{C+K_d}. \tag{4.2}
\]
For C ≪ K_d, G_eff ≈ n, while for C ≫ K_d, G_eff ≈ nK_d/C. Thus, the amplification is effective at low concentrations, although the overall sensitivity drops, presumably because of statistical noise. At high concentrations, the amplification is ineffective because of the receptor saturation mentioned above. For the first type of repulsion, B(b) = A(1 − b)^n, we find
\[
G_{eff} = -n\,\frac{K_d}{C}\,\frac{b^2}{1-b} = -\frac{nC}{C+K_d}.
\]
This looks just like equation 4.2, with C/K_d substituted for K_d/C and a change of sign. This relationship is a consequence of the fact that changing from b to 1 − b is formally equivalent to switching from amplifying bound receptors to amplifying unbound receptors. Finally, for B(b) = A(1 − b^n),
\[
G_{eff} = -n\,\frac{K_d}{C}\,\frac{b^{n+1}}{1-b^n} = -\frac{nK_dC^n}{(C+K_d)\left[(C+K_d)^n-C^n\right]}.
\]
For C ≪ K_d, G_eff ≈ −n(C/K_d)^n, while for C ≫ K_d, G_eff ≈ −1.

5 Discussion

Our model demonstrates that both the short-term response of axons to steep gradients and the long-term response of axons to shallow gradients can be reproduced from a set of relatively simple assumptions. The model makes specific and quantitative predictions for the amount and type of amplification of the binding signal required, the way the response varies with gradient steepness and with ligand concentration, how turning in attractive gradients may differ from turning in repulsive gradients, and how intracellular signaling may vary between these two cases. The model also captures surprisingly well the degree of variability of axonal responses seen experimentally. While the majority of axons appear to be influenced by the gradient, some do not, even in the case of long trajectories. In experimental assays, a heterogeneous response is usually attributed to the presence of a heterogeneous population of axons. An alternative explanation suggested by our simulation data is that, at least in some cases, the stochastic nature of filopodium generation sometimes directs axons on an apparently unresponsive trajectory.

How might the sequence of events that leads to growth cone turning in the model be implemented biologically? As long as the products of the signaling cascade (Mueller, 1999; Song & Poo, 2001; Yu & Bargmann, 2001; Guan & Rao, 2003) do not diffuse too rapidly, a shallow internal gradient (e.g., of G-protein signaling) will follow directly from the gradient in receptor binding. Amplification of this gradient can come from the cooperativity of receptor binding, from autocatalytic behavior, for instance involving Ca2+ (Zheng, 2000; Gomez & Spitzer, 2000; Hong et al., 2000; Gomez, Robles, Poo, & Spitzer, 2001), and from activator-inhibitor dynamics (Meinhardt, 1999; Song & Poo, 2001).
For C Kd , Gef f ≈ −n(C/Kd )n , while for C Kd , Gef f ≈ −1. 5 Discussion Our model demonstrates that both the short-term response of axons to steep gradients and the long-term response of axons to shallow gradients can be reproduced from a set of relatively simple assumptions. The model makes specific and quantitative predictions for the amount and type of amplification of the binding signal required, the way the response varies with gradient steepness and with ligand concentration, how the turning in attractive gradients may differ from turning in repulsive gradients, and how intracellular signaling may vary between these two cases. The model also captures surprisingly well the degree of variability of axonal responses seen experimentally. While the majority of axons appear to be influenced by the gradient, some do not, even in the case of long trajectories. In experimental assays, a heterogeneous response is usually attributed to the presence of a heterogeneous population of axons. An alternative explanation suggested by our simulation data is that, at least in some cases, the stochastic nature of filopodium generation sometimes directs axons on an apparently unresponsive trajectory. How might the sequence of events that leads to growth cone turning in the model be implemented biologically? As long as the products of the signaling cascade (Mueller, 1999; Song & Poo, 2001; Yu & Bargmann, 2001; Guan & Rao, 2003) do not diffuse too rapidly, a shallow internal gradient (e.g., of G-protein signaling) will follow directly from the gradient in receptor binding. Amplification of this gradient can come from the cooperativity of receptor binding and from autocatalytic behavior, for instance, involving Ca2+ (Zheng, 2000; Gomez & Spitzer, 2000; Hong et al., 2000; Gomez, Robles, Poo, & Spitzer, 2001), and activator-inhibitor dynamics (Meinhardt, 1999; Song & Poo, 2001). 
A gradient of activation of small GTPases of the Rho family (Hall, 1998), perhaps acting via Cdc42, could then produce a gradient of actin polymerization, with new filopodia more likely to sprout where the polymerization is enhanced. The positions of the filopodia are coupled to the direction of growth cone advance through adhesions between the filopodia and the substrate, and coupling between retrograde F-actin flow and microtubule extension (Suter & Forscher, 2000; Zhou et al., 2002). Our model assumes that the overall rate at which new filopodia form is independent of the receptor binding (see Zheng, 2000). Since the outputs
of the signaling cascade enhance or depress actin polymerization locally, some other process must be utilized to keep the overall level of polymerization constant. Such global effects could come from competition for limited resources for actin polymerization or from activator-inhibitor dynamics. Experimental data suggest that the same pathways may be involved in both attractive and repulsive responses (Song & Poo, 2001; Yu & Bargmann, 2001). The most straightforward way to generate a repulsive response in our model is to effectively switch the roles of bound and unbound receptors by replacing P(b) ∝ b^n with P(b) ∝ (1 − b)^n. A possibly more realistic way of implementing this involves the silencing of an attractive response by heteromeric receptor complexes, such as the interaction of DCC and UNC-5 in netrin signaling (Yu & Bargmann, 2001), where the attractive function of DCC is silenced by interaction with the UNC-5/netrin complex. If an attractive signaling complex (e.g., netrin/DCC) is saturated, the amount of intracellular signaling will decrease as the proportion of silencing receptor-ligand complexes (UNC-5/netrin) increases. Alternatively, the products of the amplified cascade could depress rather than enhance polymerization activity. As shown in section 4.2, the chemotactic response is primarily determined by the first derivative of the amplification function, and thus the form P(b) ∝ b^{-n} produces a repulsive response approximately equal and opposite to the attractive response produced by b^n. This could arise from a Ca2+ set point (Gomez & Spitzer, 2000; Petersen & Cancela, 2000) for optimum axon outgrowth: the polymerization gradient is in the same direction as the Ca2+ gradient when the Ca2+ level is below the set point, and in the opposite direction when the level is above the set point (see Zheng, 2000; Hong et al., 2000).
Experimental measurements of the concentration dependence of attractive and repulsive chemotactic sensitivities should be able to distinguish between the different types of attractive and repulsive amplification considered here. This could be done with both the pipette assay and the collagen gel assay we have recently developed (Rosoff et al., 2004). Given the number of systems where the switch has been observed (Song & Poo, 2001; Yu & Bargmann, 2001; Guan & Rao, 2003), it is quite possible that more than one type of behavior will be observed. We have chosen mathematical models for intracellular amplification that are relatively simple and realistic. As the signal transduction pathways become more fully elucidated, it will be possible to directly calculate from chemical kinetics the transformation between receptor binding and filopodia initiation probability. This transformation will undoubtedly depend on a variety of factors, but is likely to share the basic characteristics of one or more of the general forms we have considered here. Acknowledgments This work was funded by grants from the NIH, NSF, and Whitaker Foundation.
References

Aletta, J. M., & Greene, L. A. (1988). Growth cone configuration and advance: A time-lapse study using video-enhanced differential interference contrast microscopy. J. Neurosci., 8, 1425–1435.
Anderson, C. N., Ohta, K., Quick, M. M., Fleming, A., Keynes, R., & Tannahill, D. (2003). Molecular analysis of axon repulsion by the notochord. Development, 130, 1123–1133.
Argiro, V., Bunge, M. B., & Johnson, M. I. (1985). A quantitative study of growth cone filopodial extension. Journal of Neuroscience Research, 13, 149–162.
Barkai, N., & Leibler, S. (1997). Robustness in simple biochemical networks. Nature, 387, 913–917.
Bentley, D., & Toroian-Raymond, A. (1986). Disoriented pathfinding by pioneer neuron growth cones deprived of filopodia by cytochalasin treatment. Nature, 323, 712–715.
Berg, H. C., & Purcell, E. M. (1977). Physics of chemoreception. Biophysical Journal, 20, 193–219.
Braisted, J. E., Catalano, S. M., Stimac, R., Kennedy, T. E., Tessier-Lavigne, M., Shatz, C. J., & O'Leary, D. D. M. (2000). Netrin-1 promotes thalamic axon growth and is required for proper development of the thalamocortical projection. J. Neurosci., 20, 5792–5801.
Bray, D. B., & Chapman, K. (1985). Analysis of microspike movements on the neuronal growth cone. J. Neurosci., 5, 3204–3213.
Brose, K., Bland, K. S., Wang, K. H., Arnott, D., Henzel, W., Goodman, C. S., Tessier-Lavigne, M., & Kidd, T. (1999). Slit proteins bind robo receptors and have an evolutionarily conserved role in repulsive axon guidance. Cell, 96, 795–806.
Buettner, H. M. (1995). Computer simulation of nerve growth cone filopodial dynamics for visualization and analysis. Cell Motil. Cytoskeleton, 32, 187–204.
Caton, A., Hacker, A., Naeem, A., Livet, J., Maina, F., Bladt, F., Klein, R., Birchmeier, C., & Guthrie, S. (2000). The branchial arches and HGF are growth-promoting and chemoattractant for cranial motor axons. Development, 127, 1751–1766.
Charron, F., Stein, E., Jeong, J., McMahon, A. P., & Tessier-Lavigne, M.
(2003). The morphogen sonic hedgehog is an axonal chemoattractant that collaborates with netrin-1 in midline axon guidance. Cell, 113, 11–23.
Chien, C. B., Rosenthal, D. E., Harris, W. A., & Holt, C. E. (1993). Navigational errors made by growth cones without filopodia in the embryonic Xenopus brain. Neuron, 11, 237–251.
Colamarino, S. A., & Tessier-Lavigne, M. (1995). The axonal chemoattractant netrin-1 is also a chemorepellent for trochlear motor axons. Cell, 81, 621–629.
Davenport, R. W., Dou, P., Rehder, V., & Kater, S. B. (1993). A sensory role for neuronal growth cone filopodia. Nature, 361, 721–724.
de la Torre, J. R., Höpker, V. H., Ming, G-l., & Poo, M-M. (1997). Turning of retinal growth cones in a netrin-1 gradient mediated by the netrin receptor DCC. Neuron, 19, 1211–1224.
Devreotes, P. N., & Zigmond, S. H. (1988). Chemotaxis in eukaryotic cells: A focus on leukocytes and Dictyostelium. Ann. Rev. Cell. Biol., 4, 649–686.
Dickson, B. J. (2002). Molecular mechanisms of axon guidance. Science, 298, 1959–1964.
Eisenbach, M. (1996). Control of bacterial chemotaxis. Molecular Microbiology, 20, 903–910.
Goldberg, D. J., & Burmeister, D. W. (1986). Stages in axon formation: Observations of growth of Aplysia axons in culture using video-enhanced contrast-differential interference contrast microscopy. J. Cell. Biol., 103, 1921–1931.
Gomez, T. M., Robles, E., Poo, M-M., & Spitzer, N. C. (2001). Filopodial calcium transients promote substrate-dependent growth cone turning. Science, 291, 1983–1987.
Gomez, T. M., & Spitzer, N. C. (2000). Regulation of growth cone behavior by calcium: New dynamics to earlier perspectives. J. Neurobiol., 44, 174–183.
Goodhill, G. J. (1997). Diffusion in axon guidance. European Journal of Neuroscience, 9, 1414–1421.
Goodhill, G. J. (1998). Mathematical guidance for axons. Trends in Neurosciences, 21, 226–231.
Goodhill, G. J., & Urbach, J. S. (1999). Theoretical analysis of gradient detection by growth cones. Journal of Neurobiology, 41, 230–241.
Gordon-Weeks, P. R. (2000). Neuronal growth cones. Cambridge: Cambridge University Press.
Guan, K. L., & Rao, Y. (2003). Signalling mechanisms mediating neuronal responses to guidance cues. Nature Rev. Neurosci., 4, 941–956.
Gundersen, R. W., & Barrett, J. N. (1979). Neuronal chemotaxis: Chick dorsal root axons turn toward high concentrations of nerve growth factor. Science, 206, 1079–1080.
Hall, A. (1998). Rho GTPases and the actin cytoskeleton. Science, 279, 509–514.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Hely, T. A., & Willshaw, D. J. (1998). Short term interactions between microtubules and actin filaments underlie long term behaviour in neuronal growth cones. Proc. Roy. Soc. B, 265, 1801–1807.
Hong, K., Hinck, L., Nishiyama, M., Poo, M-M., & Tessier-Lavigne, M. (1999). A ligand-gated association between cytoplasmic domains of UNC5 and DCC family receptors converts netrin-induced growth cone attraction to repulsion. Cell, 92, 205–215.
Hong, K., Nishiyama, M., Henley, J., Tessier-Lavigne, M., & Poo, M-M. (2000). Calcium signalling in the guidance of nerve growth by netrin-1. Nature, 403, 93–98.
Höpker, V. H., Shewan, D., Tessier-Lavigne, M., Poo, M-M., & Holt, C. (1999). Growth-cone attraction to netrin-1 is converted to repulsion by laminin-1. Nature, 401, 69–73.
Huber, A. B., Kolodkin, A. L., Ginty, D. D., & Cloutier, J. F. (2003). Signaling at the growth cone: Ligand-receptor complexes and the control of axon growth and guidance. Annu. Rev. Neurosci., 26, 509–563.
Hutson, L. D., & Chien, C.-B. (2002). Pathfinding and error correction by retinal axons: The role of astray/robo2. Neuron, 33, 205–217.
Axonal Response to Gradients
2241
Keynes, R., Tannahill, D., Morgenstern, D. A., Johnson, A. R., Cook, G. M. W., & Pini, A. (1997). Surround repulsion of spinal sensory axons in higher vertebrate embryos. Neuron, 18, 889–897. Kim, M. D., Kolodziej, P., & Chiba, A. (2002). Growth cone pathfinding and filopodial dynamics are mediated separately by Cdc42 activation. J. Neurosci., 22, 1794–1806. Lumsden, A. G. S., & Davies, A. M. (1983). Earliest sensory nerve fibres are guided to peripheral targets by attractants other than nerve growth factor. Nature, 306, 786–788. Lumsden, A. G. S. and Davies, A. M. (1986). Chemotropic effect of specific target epithelium in the developing mammalian nervous system. Nature, 323, 538– 539. Meinhardt, H. (1999). Orientation of chemotactic cells and growth cones: Models and mechanisms. Journal of Cell Science, 112, 2867–2874. Ming, G., Henley, J., Tessier-Lavigne, M., Song, H., & Poo, M. (2001). Electrical activity modulates growth cone guidance by diffusible factors. Neuron, 29, 441–452. Ming, G-I., Song, H-J., Berninger, B., Holt, C. E., Tessier-Lavigne, M., & Poo, M-M. (1997). cAMP-dependent growth cone guidance by netrin-1. Neuron, 19, 1225–1235. Ming, G., Wong, S. T., Henley, J., Yuan, X., Song, H., Spitzer, N., & Poo, M. (2002). Adaptation in the chemotactic guidance of nerve growth cones. Nature, 417, 411–418. Moghe, P. V., & Tranquillo, R. T. (1995). Stochasticity in membrane-localized ligand-receptor G-protein binding—consequences for leukocyte movement behavior. Ann. Biomed. Eng., 23, 257–267. Mueller, B. K. (1999). Growth cone guidance: First steps towards a deeper understanding. Annu. Rev. Neurosci., 22, 351–388. Myers, P. Z., & Bastiani, M. J. (1993). Growth cone dynamics during the migration of an identified commissural growth cone. J. Neurosci., 13, 127–143. Nguyen Ba-Charvet, K. T., Brose, K., Ma, L., Wang, K. H., Marillat, V., Sotelo, C., Tessier-Lavigne, M., & Chedotal, A. (2001). 
Diversity and specificity of actions of Slit2 proteolytic fragments in axon guidance. J. Neurosci., 21, 4281– 4289. Nishiyama, M., Hoshino, A., Tsai, L., Henley, J. R., Goshima, Y., Tessier-Lavigne, M., Poo, M. M., & Hong K. (2003). Cyclic AMP/GMP-dependent modulation of Ca2+ channels sets the polarity of nerve growth-cone turning. Nature, 424. 990–995. O’Connor, T. P., Duerr, J. S., & Bentley, D. (1990). Pioneer growth cone steering decisions mediated by single filopodial contacts in situ. J. Neurosci., 10, 3935– 3946. Parent, C. A., & Devreotes, P. N. (1999). A cell’s sense of direction. Science, 284, 765–770. Patel, K., Nash, J. A., Itoh, A., Liu, Z., Sundaresan, V., & Pini, A. (2001). Slit proteins are not dominant chemorepellents for olfactory tract and spinal motor axons. Development, 128, 5031–5037. Petersen, O. H., & Cancela, J. M. (2000). Attraction or repulsion by local Ca(2+) signals. Curr. Biol., 10, R311–314.
2242
G. Goodhill, M. Gu, and J. Urbach
Rehder, V., & Kater, S. B. (1996). Filopodia on neuronal growth cones: Multifunctional structures with sensory and motor capabilities. Seminars in the Neurosciences, 8, 81–88. Richards, L. J., Koester, S. E., Tuttle, R., & O’Leary, D. D. M. (1997). Directed growth of early cortical axons is influenced by a chemoattractant released from an intermediate target. J. Neurosci., 17, 2445–2458. Robert, M. E., & Sweeney, J. D. (1997). Computer model: Investigating role of filopodia-based steering in experimental neurite galvanotropism. J. Theor. Biol., 188, 277–288. Rosoff, W. J., Urbach, J. S., Esrick, M., McAllister, R. G., Richards, L. J., & Goodhill, G. J. (2004). A novel chemotaxis assay reveals the extreme sensitivity of axons to molecular gradients. Nat. Neurosci., 7, 678–682. Shu, T., & Richards, L. J. (2001) Cortical axon guidance by the glial wedge during the development of the corpus callosum. J. Neurosci., 21, 2749–2758. Song, H., Ming, G., He, Z., Lehmann, M., Tessier-Lavigne, M., & Poo, M-M. (1998). Conversion of neuronal growth cone responses from repulsion to attraction by cyclic nucleotides. Science, 281, 1515–1518. Song, H. J., Ming, G. L., & Poo, M-M. (1997). cAMP-induced switching in turning direction of nerve growth cones Nature, 388, 275–279. Song, H., & Poo, M-M. (2001). The cell biology of neuronal navigation. Nat. Cell. Biol., 3, E81–88. Steketee, M., Balazovich, K., & Tosney, K. W. (2001). Filopodial initiation and a novel filament-organizing center, the focal ring. Mol. Biol. Cell., 12, 2378–2395. Steketee, M. B., & Tosney, K. W. (1999). Contact with isolated sclerotome cells steers sensory growth cones by altering distinct elements of extension. J. Neurosci., 19, 3495–3506. Suter, D. M., & Forscher, P. (2000). Substrate-cytoskeletal coupling as a mechanism for the regulation of growth cone motility and guidance. J. Neurobiol., 44, 97–113. Tanaka, E., & Sabry, J. (1995). 
Making the connection: Cytoskeletal rearrangements during growth cone guidance. Cell, 83, 171–176. Tessier-Lavigne, M., Placzek, M., Lumsden, A. G. S., Dodd, J., & Jessell, T. M. (1988). Chemotropic guidance of developing axons in the mammalian central nervous system. Nature, 336, 775–778. Tranquillo, R. T. (1990). Models of chemical gradient sensing by cells. In W. Alt & G. Hoffman (Eds.), Biological motion, New York: Springer-Verlag. (pp. 415– 441). Tranquillo, R. T., & Lauffenburger, D. A. (1987). Stochastic-model of leukocyte chemosensory movement. J. Math. Bio., 25, 229–262. Varela-Echavarria, A., Tucker, A., Puschel, ¨ A. W., & Guthrie, S. (1997). Motor axon subpopulations respond differentially to the chemorepellents netrin-1 and semaphorin D. Neuron, 18, 193–207. Wang, F. S., Liu, C. W., Diefenbach, T. J., & Jay, D. G. (2003). Modeling the role of myosin 1c in neuronal growth cone turning. Biophys. J., 85, 3319–3328. Yu, T. W., & Bargmann, C. I. (2001). Dynamic regulation of axon guidance. Nat. Neurosci., 4 (Suppl). 1169–1176. Zheng, J. Q. (2000). Turning of nerve growth cones induced by localized increases in intracellular calcium ions. Nature, 403, 89–93.
Axonal Response to Gradients
2243
Zheng, J. Q., Felder, M., Conner, J. A., & Poo, M-M. (1994). Turning of growth cones induced by neurotransmitters. Nature, 368, 140–144. Zheng, J. Q., Wan, J-J., & Poo, M-M. (1996). Essential role of filopodia in chemotropic turning of nerve growth cone induced by a glutamate gradient. J. Neurosci., 16, 1140–1149. Zhou, F. Q., Waterman-Storer, C. M., & Cohan, C. S. (2002). Focal loss of actin bundles causes microtubule redistribution and growth cone turning. J. Cell Biol., 157, 839–849. Zigmond, S. H. (1977). Ability of polymorphonuclear leukocytes to orient in gradients of chemotactic factors. J. Cell. Biol., 75, 606–616. Received February 4, 2004; accepted April 29, 2004.
LETTER
Communicated by Michael Dickinson
Insect-Inspired Estimation of Egomotion

Matthias O. Franz [email protected] Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany
Javaan S. Chahl [email protected] Center of Visual Sciences, Research School of Biological Sciences, Australian National University, Canberra, Australia
Holger G. Krapp [email protected] Department of Zoology, University of Cambridge, Cambridge, U.K.
Tangential neurons in the fly brain are sensitive to the typical optic flow patterns generated during egomotion. In this study, we examine whether a simplified linear model based on the organization principles in tangential neurons can be used to estimate egomotion from the optic flow. We present a theory for the construction of an estimator consisting of a linear combination of optic flow vectors that incorporates prior knowledge about the distance distribution of the environment and about the noise and egomotion statistics of the sensor. The estimator is tested on a gantry carrying an omnidirectional vision sensor. The experiments show that the proposed approach leads to accurate and robust estimates of rotation rates, whereas translation estimates are of reasonable quality, albeit less reliable.

1 Introduction

A moving visual system generates a characteristic pattern of image motion on its sensors. The resulting optic flow field is an important source of information about the egomotion of the visual system (Gibson, 1950). In the fly brain, part of this information is analyzed by a group of wide-field, motion-sensitive neurons, the tangential neurons in the lobula plate (Hausen, 1993; Egelhaaf et al., 2002). A detailed mapping of their local preferred directions and motion sensitivities (Krapp & Hengstenberg, 1996) reveals a striking similarity to certain egomotion-induced optic flow fields (see Figure 1). This suggests that each tangential neuron extracts a specific egomotion component from the optic flow that may be useful for gaze stabilization and flight steering.

Neural Computation 16, 2245–2260 (2004) © 2004 Massachusetts Institute of Technology
Figure 1: Mercator map of the response field of the neuron VS10. The orientation of each arrow gives the local preferred direction, and its length denotes the relative local motion sensitivity. The results suggest that VS10 responds maximally to rotation around an axis at an azimuth of about 50 degrees and an elevation of about 0 degrees (after Krapp, Hengstenberg, & Hengstenberg, 1998).
A recent study (Franz & Krapp, 2000) has shown that a simplified computational model of the tangential neurons as a weighted sum of flow measurements was able to explain certain properties of the observed response fields. The weights were chosen according to an optimality principle that minimizes the output variance of the model caused by noise and distance variability between different scenes. In that study, however, we mainly focused on a comparison between the sensitivity distribution in tangential neurons and the weight distribution of such optic flow processing units. Here, we present a classical linear estimation approach that extends the previous model to the complete egomotion problem. We again use linear combinations of local flow measurements, but instead of prescribing a fixed motion axis and minimizing the output variance, we minimize the quadratic error in the estimated egomotion parameters. The derived weight sets for the single-model neurons are identical to those obtained from one of the model variants discussed in Franz and Krapp (2000). Of primary interest for this article is that the new approach yields a novel, extremely simple estimator for egomotion that consists of a linear combination of model neurons. Our experiments indicate that this insect-inspired estimator shows—in spite of its simplicity—an astonishing performance that often comes close to the more elaborate approaches of classical computer vision.
Figure 2: (A) Sensor model. At each viewing direction di , there are two measurements xi and yi of the optic flow pi along two directions ui and vi on the unit sphere. (B) Simplified model of a tangential neuron. The optic flow and the local noise signal are projected onto a unit vector field of local preferred directions (LPDs). The projections are weighted (local motion sensitivities, LMSs) and linearly integrated. The model assumes that the integrated output encodes an egomotion component defined by either a translation or a rotation axis.
This article is structured as follows. In section 2, we describe the derivation of the egomotion estimator from a least-squares principle. In section 3, we subject the obtained model to a rigorous real-world test on a gantry carrying an omnidirectional vision sensor. The evidence and the properties of such a neural representation of egomotion are discussed in section 4. A preliminary account of our work has appeared in Franz and Chahl (2003).

2 Optimal Linear Estimators for Egomotion

2.1 Egomotion Sensor and Neural Model. In order to simplify the mathematical treatment, we assume that the N motion detectors of our egomotion sensor are arranged on the unit sphere. The viewing direction of the inputs to a particular motion detector with index i is denoted by the radial unit vector d_i. At each viewing direction, we define a local two-dimensional coordinate system on the sphere consisting of two orthogonal tangential unit vectors u_i and v_i (see Figure 2A).[1] We assume that we measure the local flow component along both unit vectors subject to additive noise. Formally, this means that we obtain at each viewing direction two measurements x_i and y_i along u_i and v_i, respectively, given by

    x_i = p_i · u_i + n_{x,i}   and   y_i = p_i · v_i + n_{y,i},    (2.1)
where n_{x,i} and n_{y,i} denote additive noise components with a given covariance C_n, and p_i the local optic flow vector. When the spherical sensor translates with T while rotating with R about an axis through the origin, the egomotion-induced image flow p_i at d_i is

    p_i = −µ_i (T − (T · d_i) d_i) − R × d_i.    (2.2)

[1] For mathematical convenience, we do not take into account the hexagonal arrangement of the optical axes of the photoreceptors within the fly compound eye.
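As a quick illustration (a sketch of ours, not the authors' code), the flow model of equation 2.2 is a one-liner, and evaluating it at the motion axes shows why image regions near the translation or rotation axis carry little signal, a point that recurs in section 3.1:

```python
import numpy as np

def egomotion_flow(d, T, R, mu):
    """Optic flow p at unit viewing direction d induced by sensor
    translation T and rotation R, for nearness mu = 1/distance (eq. 2.2)."""
    return -mu * (T - np.dot(T, d) * d) - np.cross(R, d)

# Translation along the viewing direction produces no flow there,
# and neither does rotation about the viewing direction:
d = np.array([0.0, 0.0, 1.0])
assert np.allclose(egomotion_flow(d, np.array([0, 0, 2.0]), np.zeros(3), 0.5), 0)
assert np.allclose(egomotion_flow(d, np.zeros(3), np.array([0, 0, 1.0]), 0.5), 0)
```

Projecting this flow onto u_i and v_i and adding noise yields the measurements of equation 2.1.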
µ_i is the inverse distance between the origin and the object seen in direction d_i, the so-called nearness (Koenderink & van Doorn, 1987). The entire collection of flow measurements x_i and y_i comprises the input to a simplified neural model consisting of a weighted sum of all local measurements (see Figure 2B),

    θ̂ = Σ_{i=1}^{N} w_{x,i} x_i + Σ_{i=1}^{N} w_{y,i} y_i,    (2.3)
with local weights w_{x,i} and w_{y,i}. In this model, the local motion sensitivity is defined as w_i = (w_{x,i}, w_{y,i})ᵀ, and the local preferred direction is parallel to the unit vector (1/‖w_i‖)(w_{x,i}, w_{y,i})ᵀ. The resulting local motion sensitivities and local preferred directions can be compared to measurements on real tangential neurons (Franz & Krapp, 2000). As our basic hypothesis, we assume that the output of such neural models is used to estimate an egomotion component of the sensor. Since the output is a scalar, we need in the simplest case an ensemble of six neural models to encode all six rotational and translational degrees of freedom. To keep the mathematical treatment simple, we assume that the motion axes of interest are aligned with the global coordinate system. In principle, any set of linearly independent axes could be used. The local weights of each unit are chosen to yield an optimal linear estimator for the respective egomotion component. In addition, we allow the neural models to interact linearly, such that the whole ensemble output is a linear combination of the individual neural outputs. This last step is necessary since the neural models do not react specifically to their own egomotion component, due to the broad tuning of the motion detector model (cf. equation 2.1). The response of a neural model can be made more specific by using the output of the other neurons to suppress the signal caused by the other egomotion components (Krapp, Hengstenberg, & Egelhaaf, 2001; Haag & Borst, 2003).

2.2 Prior Knowledge. An estimator for egomotion consisting of a linear combination of flow measurements necessarily has to neglect the dependence of the optic flow on object distances. As a consequence, the estimator output will be different from scene to scene, depending on the current distance and noise characteristics.
The best the estimator can do is to add up as many flow measurements as possible, hoping that the individual distance deviations of the current scene from the average over all scenes will cancel
each other. Clearly, viewing directions with low distance variability and small noise content should receive a higher weight in this process. In this way, prior knowledge about the distance and noise statistics of the sensor and its environment can improve the reliability of the estimate. If the current nearness at viewing direction d_i differs from the average nearness µ̄_i over all scenes by Δµ_i, the measurement x_i (or y_i, respectively) can be written as (see equations 2.1 and 2.2)

    x_i = −(µ̄_i u_iᵀ, (u_i × d_i)ᵀ) (Tᵀ, Rᵀ)ᵀ + n_{x,i} − Δµ_i u_iᵀ T,    (2.4)
where the last two terms vary from scene to scene, even when the sensor undergoes exactly the same egomotion. To simplify the notation, we stack all 2N measurements over the entire motion detector array in the vector x = (x_1, y_1, x_2, y_2, ..., x_N, y_N)ᵀ. Similarly, the egomotion components along the x-, y-, and z-directions of the global coordinate system are combined in the vector θ = (T_x, T_y, T_z, R_x, R_y, R_z)ᵀ, the scene-dependent terms of equation 2.4 in the 2N-vector n = (n_{x,1} − Δµ_1 u_1ᵀ T, n_{y,1} − Δµ_1 v_1ᵀ T, ...)ᵀ, and the scene-independent terms in the 2N × 6 matrix F with rows (−µ̄_1 u_1ᵀ, −(u_1 × d_1)ᵀ), (−µ̄_1 v_1ᵀ, −(v_1 × d_1)ᵀ), and so on. The entire ensemble of measurements over the sensor thus becomes

    x = F θ + n.    (2.5)
Assuming that T, n_{x,i}, n_{y,i}, and Δµ_i are uncorrelated, the covariance matrix C of the scene-dependent measurement component n is given by

    C_ij = C_{n,ij} + C_{µ,ij} u_iᵀ C_T u_j,    (2.6)

with C_n being the covariance of the detector noise, C_µ of Δµ, and C_T of T. These three covariance matrices, together with the average nearness µ̄_i, constitute the prior knowledge required for deriving the optimal estimator.

2.3 Optimized Linear Estimator. Using the notation of equation 2.5, we write the output of the whole ensemble as a linear estimator

    θ̂ = W x.    (2.7)
W denotes an M × 2N weight matrix where each of the M rows consists of a linear combination of the weight sets of the neural models (see equation 2.3). The optimal weight matrix is chosen to minimize the mean square error e of the estimator, given by

    e = E(‖θ − θ̂‖²) = tr[W C Wᵀ],    (2.8)
where E denotes the expectation. We additionally impose the constraint that the estimator should be unbiased for n = 0, that is, θ̂ = θ. From equations 2.5 and 2.7, we obtain the constraint equation

    W F = 1_{M×M}.    (2.9)
The solution minimizing the associated Euler-Lagrange functional (Λ is an M × M matrix of Lagrange multipliers),

    J = tr[W C Wᵀ] + tr[Λ (1_{M×M} − W F)],    (2.10)

can be found analytically and is given by

    W = (1/2) Λ Fᵀ C⁻¹,    (2.11)

with Λ = 2 (Fᵀ C⁻¹ F)⁻¹. The rows of Fᵀ C⁻¹ correspond to the neural models of equation 2.3.[2] Λ acts as a correction matrix that suppresses the part of the neural signal caused by the egomotion components to which the neuron is not tuned. When computed for the typical interscene covariances of a flying animal, the resulting weight sets are able to explain many of the receptive field characteristics of the tangential neurons (Franz & Krapp, 2000). However, the question remains whether the output of such an ensemble of neural models can be used for some real-world task. This is by no means evident given the fact that—in contrast to most approaches in computer vision—the distance distribution of the current scene is completely ignored by the linear estimator.

[2] The resulting local motion sensitivities correspond exactly to those obtained from the linear range model in Franz and Krapp (2000) if one assumes a diagonal C.

3 Experiments

3.1 Linear Estimator for an Office Robot. As our test scenario, we consider the situation of a mobile robot in an office environment. This scenario allows for measuring the typical motion patterns and the associated distance statistics that otherwise would be difficult to obtain for a flying agent. The distance statistics were recorded using a rotating laser scanner. The 26 measurement points were chosen along typical trajectories of a mobile robot while wandering around and avoiding obstacles in an office environment. The recorded distance statistics therefore reflect properties of both the environment and the specific movement patterns of the robot. From these measurements, the average nearness µ̄_i and its covariance C_µ were
Figure 3: Distance statistics of an indoor robot (0 azimuth corresponds to forward direction; the distances on the contour lines are given in m). (A) Average distances from the origin in the visual field (N = 26). Darker areas represent larger distances. (B) Distance standard deviation in the visual field (N = 26). Darker areas represent stronger deviations.
computed (cf. Figure 3; we used distance instead of nearness for easier interpretation). The distance statistics show a pronounced anisotropy, which can be attributed to three main factors. (1) Since the robot tries to turn away from the obstacles, the distance in front and behind the robot tends to be larger than on its sides (see Figure 3A). (2) The camera on the robot usually moves at a fixed height above ground (here, 0.62 m) on a flat surface. As a consequence, distance variation is particularly small at very low elevations (see Figure 3B). (3) The office environment also contains corridors. When the robot follows the corridor while avoiding obstacles, distance variations in the frontal region of the visual field are very large (see Figure 3B).
2252
M. Franz, J. Chahl, and H. Krapp
The estimation of the translation covariance C_T is straightforward since our robot can translate only in the forward direction, along the z-axis. C_T is therefore 0 everywhere except the lower-right diagonal entry, which corresponds to the square of the average forward speed of the robot (here, 0.3 m/s). The motion detector noise was assumed to be zero mean, uncorrelated, and uniform over the image, which results in a diagonal C_n with identical entries. The noise standard deviation of 0.34 deg/s was determined by presenting a series of artificially translated images of the laboratory (moving at 1.1 deg/s) to the flow algorithm used in the implementation of the estimator (see section 3.2). µ̄, C_µ, C_T, and C_n constitute the prior knowledge necessary for computing the estimator (see equations 2.6 and 2.11).

The optimal weight sets for the neural models for the six degrees of freedom (each of which corresponds to a row of Fᵀ C⁻¹) are shown in Figure 4. All neural models have in common that image regions near the rotation or translation axis receive less weight. In these regions, the egomotion components to be estimated generate only small flow vectors, which are easily corrupted by noise. Equation 2.11 predicts that the estimator will preferentially sample image regions with smaller distance variations. In our measurements, this is mainly the case on the ground around the robot (see Figure 3). The rotation-selective neural models assign higher weights to distant image regions, since distance variations at large distances have a smaller effect. In our example, distances are largest in front of and behind the robot, so the neural model for yaw assigns the highest weights to these regions (see Figure 4F). This effect is less pronounced in the other rotational neural models because the translational flow is almost orthogonal to their local directions and thus interferes to a much lesser degree.
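To make the pipeline concrete, the sketch below assembles the estimator of equation 2.11 for invented priors: random viewing directions, a constant average nearness, and a diagonal C_µ are our own simplifying assumptions, and only the forward-only C_T with a 0.3 m/s speed mirrors the text (the noise level is arbitrary). F is built column by column from the flow model itself, using linearity, which avoids hand-transcribing the cross-product terms; the final assertions check the unbiasedness constraint of equation 2.9 and the recovery of a known egomotion from one simulated scene.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                   # number of motion detectors

# Random unit viewing directions d_i with tangential frames (u_i, v_i).
d = rng.standard_normal((N, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
ref = np.where(np.abs(d[:, 2:3]) < 0.9, [[0.0, 0.0, 1.0]], [[1.0, 0.0, 0.0]])
u = np.cross(d, ref)
u /= np.linalg.norm(u, axis=1, keepdims=True)
v = np.cross(d, u)

def measurements(T, R, mu):
    """Noise-free flow components along all 2N tangential directions
    (equations 2.1 and 2.2), interleaved as (x_1, y_1, x_2, y_2, ...)."""
    p = -mu[:, None] * (T - (d @ T)[:, None] * d) - np.cross(R, d)
    out = np.empty(2 * N)
    out[0::2] = np.einsum('ij,ij->i', p, u)
    out[1::2] = np.einsum('ij,ij->i', p, v)
    return out

# Prior knowledge (illustrative numbers, not the paper's laser-scan data).
mu_bar = np.full(N, 0.5)                  # average nearness per direction
sig_mu, sig_n = 0.1, 0.05                 # nearness / detector-noise std
C_T = np.diag([0.0, 0.0, 0.3 ** 2])       # forward-only translation prior

# F (2N x 6) built column by column from the flow model, using linearity.
F = np.column_stack([measurements(e[:3], e[3:], mu_bar) for e in np.eye(6)])

# Covariance of the scene-dependent terms (equation 2.6, diagonal C_mu).
t = np.empty((2 * N, 3)); t[0::2], t[1::2] = u, v
C_diag = sig_n ** 2 + sig_mu ** 2 * np.einsum('ij,jk,ik->i', t, C_T, t)
Ci = np.diag(1.0 / C_diag)

# Optimal unbiased linear estimator (equation 2.11).
Lam = 2.0 * np.linalg.inv(F.T @ Ci @ F)
W = 0.5 * Lam @ F.T @ Ci
assert np.allclose(W @ F, np.eye(6), atol=1e-8)    # constraint, eq. 2.9

# One simulated scene: a known theta = (T, R) is recovered from noisy flow.
theta = np.array([0.0, 0.0, 0.3, 0.1, -0.2, 0.3])
mu = mu_bar + sig_mu * rng.standard_normal(N)      # scene-specific nearness
x = measurements(theta[:3], theta[3:], mu) + sig_n * rng.standard_normal(2 * N)
theta_hat = W @ x
```

Each row of W is the weight set of one model neuron corrected by Λ, so `theta_hat` is exactly the ensemble output of equation 2.7.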
Although the small weights near the motion axes and the overall distribution of local directions are similar to those found in tangential neurons, our neural models show specific adaptations to the indoor robot scenario: the highly weighted ground regions are exactly the opposite of our model predictions for a flying animal, where the ground region shows a stronger distance variability than regions near and above the horizon (Franz & Krapp, 2000). The predicted dorsoventral asymmetry with small weights in the ground region is indeed observed in the tangential neurons (see Figure 1 and Krapp et al., 1998). The strong weighting of the frontal region in the yaw neural model (see Figure 4F) is also corridor specific, so it is not surprising that this feature is not present in an animal that evolved in an open outdoor environment.
3.2 Gantry Experiments. The egomotion estimates from the ensemble of neural models were tested on a gantry with three translational and one rotational (yaw) degree of freedom. Since the gantry had a position accuracy below 1 mm, the programmed position values were taken as ground truth for evaluating the estimator’s accuracy.
Figure 4: Neural models computed as part of the linear estimator. Notation is identical to Figure 1. The depicted region of the visual field extends from −15◦ to 180◦ azimuth and from −75◦ to 75◦ elevation. The model neurons are tuned to (A) forward translation, (B) translations to the right, (C) downward translation, (D) roll rotation, (E) pitch rotation, and (F) yaw rotation.
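Since the gantry positions serve as ground truth, accuracy can be scored by comparing the estimated and programmed egomotion vectors; the axis-direction errors reported for these experiments reduce to the angle between two axes. A small sketch (function name ours, not from the paper):

```python
import numpy as np

def axis_error_deg(estimated, true):
    """Angle in degrees between an estimated and a true motion axis."""
    c = np.dot(estimated, true) / (np.linalg.norm(estimated) * np.linalg.norm(true))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

# Two orthogonal unit axes disagree by 90 degrees:
assert np.isclose(axis_error_deg(np.array([1.0, 0, 0]), np.array([0, 1.0, 0])), 90.0)
```

The clipping guards against round-off pushing the cosine marginally outside [-1, 1] for nearly identical axes.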
As vision sensor, we used a camera mounted above a mirror with a circularly symmetric hyperbolic profile. This setup allowed for a 360 degree horizontal field of view extending from 90 degrees below to 45 degrees above the horizon. Such a large field of view considerably improves the estimator’s performance since the individual distance deviations in the scene
Figure 5: Gantry experiments. Results are given in arbitrary units, true rotation values are denoted by a dashed line, and translation by a dash-dot line. Gray bars denote translation estimates, and white bars rotation estimates. (A) Estimated versus real egomotion. (B) Estimates of the same egomotion at different locations. (C) Estimates for constant rotation and varying translation. (D) Estimates for constant translation and varying rotation.
are more likely to be averaged out. More details about the omnidirectional camera can be found in Chahl and Srinivasan (1997). In each experiment, the camera was moved to 10 different start positions in the lab at the same height above ground (0.62 m) as the robot camera,[3] but with largely varying distance distributions near and above the horizon. After recording an image of the scene at the start position, the gantry translated and rotated at various speeds and directions and took a second image. After the recorded image pairs (10 for each type of movement) were unwarped, we computed the optic flow input for the neural models using a standard gradient-based scheme (Srinivasan, 1994).

[3] The translational neurons in Figure 4 for the mobile robot case assign a high weight to the ground region. As a consequence, the translation estimates strongly depend on the correct height above ground, whereas the rotational neurons are only indirectly affected.

The average error of the rotation rate estimates over all trials (N = 450) was 0.7°/s (5.7% relative error; see Figure 5A), and the error in the estimated translation speeds (N = 420) was 8.5 mm/s (7.5% relative error). The estimated rotation axis had an average error of 1.7 degrees, and the estimated translation direction an average error of 4.5 degrees. The larger error of the translation estimates is mainly caused by the direct dependence of the translational flow on distance (see equation 2.2), whereas the rotation estimates are only indirectly affected by distance errors via the current translational flow component, which is largely filtered out by the local direction template.

The larger sensitivity of the translation estimates to distance variations can be seen by moving the sensor at the same translation and rotation speeds in various locations. The rotation estimates remain consistent over all locations, whereas the translation estimates show a higher variance and also a location-dependent bias, for example, very close to laboratory walls (see Figure 5B). A second problem for translation estimation comes from the different properties of rotational and translational flow fields. Due to its distance dependence, the translational flow field shows a much wider range of local image velocities than a rotational flow field. The smaller translational flow vectors are often swamped by simultaneous rotation or noise, and the larger ones tend to be in the upper saturation range of the optic flow algorithm used. This can be demonstrated by simultaneously translating and rotating the sensor. Again, rotation estimates remain consistent at different translation speeds, while translation estimates are strongly affected by rotation (see Figures 5C and 5D).

4 Discussion

4.1 Egomotion Estimation.
Our experiments show that it is possible to obtain useful egomotion estimates from an ensemble of linear neural models in a real-world task. Although a linear approach necessarily has to ignore the distances of the currently perceived scene, an appropriate choice of local weights and a large field of view are capable of reducing the influence of noise and distance variability on the estimates. In particular, rotation estimates were highly accurate and consistent across different scenes and different simultaneous translations. Translation estimates were not as accurate and were less robust against changing scenes and simultaneous rotation. The performance difference was to be expected because of the direct distance dependence of the translational optic flow, which leads to a larger variance of the estimator output. This problem can be resolved only by also estimating the distances in the current scene (e.g., in the iterative schemes in Koenderink & van Doorn, 1987; Heeger & Jepson, 1992). This, however, requires significantly more complex computations. Another reason is the limited dynamic range of the flow algorithm used in the experiments, as
discussed in the previous section. One way to overcome this problem would be to use an optic flow algorithm that estimates image motion on different temporal or spatial scales, which is computationally more expensive. Our results show that the linear estimator accurately estimates rotation under general egomotion conditions and without any knowledge of the object distances of the current scene. The estimator may be used in technical applications such as image stabilization of a moving camera or the removal of the rotational component from the currently measured optic flow. Both measures considerably simplify the estimation of distances from the remaining optic flow and the detection of independently moving objects. In addition, the simple architecture of the estimator allows for an efficient implementation at low computational cost, several orders of magnitude smaller than the cost of computing the entire optic flow input. The components of the estimator are simplified neural models which, when computed for a flying animal, are able to reproduce characteristics of the tangential neuron receptive field organization, that is, the distribution of local motion sensitivities and local preferred directions (Franz & Krapp, 2000). Our study suggests that tangential neurons may be used for self-motion estimation by linearly combining their outputs at the level of the lobula plate (e.g., Krapp et al., 2001) or at later integration stages. Evidence for the latter possibility comes from recent electrophysiological studies on motor neurons, which innervate the fly neck motor system and mediate gaze stabilization behavior (Huston & Krapp, 2003). The possible behavioral role of such egomotion estimates, however, will critically depend on the dynamic properties of the whole sensorimotor loop, as well as on specific tuning of the motion processing stage providing the input.
An example of using integrated optic flow for controlling a robotic system is described in Reiser and Dickinson (2003).

4.2 Neural Computation of Egomotion.

The description of any egomotion requires at most six degrees of freedom. Therefore, an ensemble of six neurons, as in our gantry experiments, would be sufficient to encode the entire egomotion of the fly. There are, however, at least 13 tangential neurons (3 HS and 10 VS neurons) on either side of the fly lobula plate, and these do not cover all degrees of freedom; for example, none of the currently known receptive fields represents lift translation (reviews in Hausen, 1984, 1993; Krapp, 2000). A plausible explanation might be that the axes covered by tangential neurons—thus constituting the sensory coordinate system—are aligned with the axes used by the motor coordinate system. Recent studies on gaze stabilization in Calliphora suggest that in some cases, the output of individual tangential neurons is connected to individual motor neurons driving certain head movements (Huston & Krapp, 2003). Another hint comes from an interesting property of our linear model. The linearly combined output of two model neurons corresponds to the linear combination of their respective weight sets. For the tangential neurons, this
Insect-Inspired Estimation of Egomotion

[Figure 6: Hypothetical neurons constructed for (A) roll and (B) pitch rotation. Both panels plot elevation [deg.] (−75 to 75) against azimuth [deg.] (0 to 180).]
would mean that the summed output of several neurons may be treated as a superposition of their individual local response properties. The receptive fields of many tangential neurons often cover only a small part of the visual field, perhaps due to anatomical or developmental constraints. By summing the output of several neurons, one could build estimators with extended receptive fields covering more than one visual hemisphere. This is demonstrated in Figure 6, where we construct a hypothetical pitch neuron from the inverted output of VS1-3 added to the output of VS8-10. Also shown in Figure 6 is the response field of VS6, which was shown to be ideally suited to sense roll rotations (Franz & Krapp, 2000).

4.3 Linearity.

Finally, we have to point out a basic difference between the proposed theory and optic flow processing in the fly. The theory assumes that the motion detector signals the tangential neurons integrate depend linearly on velocity (see equation 2.1; Reichardt, 1987). The output of fly motion detectors, however, is linear only within a limited velocity range. The motion detector output also depends on the spatial pattern properties of the visual surroundings (Borst & Egelhaaf, 1993). These properties are reflected in the tangential neurons' response properties. Beyond certain image velocities, for instance, their response stays at a plateau when the velocity is increased. Even higher velocities result in a decrease of the neuron's response (Hausen, 1982). Within the plateau range, tangential neurons can indicate only the presence and the sign of a particular egomotion, not the actual velocity. A detailed comparison between linear model neurons and tangential neurons shows characteristic differences. Under the conditions of the neurophysiological experiments reported in Franz and Krapp (2000), tangential neurons seem to operate in the plateau range rather than in the linear range.
Under such response regimes, a linear combination of tangential neuron outputs would not indicate the true egomotion.
Physiological mechanisms have been described that may help to overcome these limitations to a certain degree. A nonlinear integration of local motion detector signals, known as dendritic gain control (Borst, Egelhaaf, & Haag, 1995), prevents the output of a tangential neuron from saturating when its entire receptive field is stimulated. This mechanism results in a size-invariant response, which still depends on velocity. Harris, O'Carroll, and Laughlin (2000) show that contrast gain control is of similar significance. It contributes to the neuron's adaptation to visual motion; that is, it prevents the tangential neurons from saturating at high visual contrasts and image velocities. Although these mechanisms may not establish a linear dependence over the entire velocity range, they may considerably extend it. Evidence supporting this idea comes from a study by Lewen, Bialek, and de Ruyter van Steveninck (2001). The authors performed electrophysiological experiments on the H1 tangential neuron in a natural outdoor environment in bright daylight. They show that the linear dynamic range of H1 under these conditions is significantly extended compared to stimulation with a periodic grating within the same range of velocities but applied in the laboratory. Despite these results, it is still not entirely clear whether an extended linear dynamic range of the tangential neurons is sufficient to cover all needs in gaze stabilization and flight steering. Within the linear range, however, the fly might take advantage of all the beneficial properties the linear model offers. For instance, it may combine the outputs of several tangential neurons to form matched filters for particular egomotions. In the case of the intrinsic tangential neuron VCH, thought to be involved in figure-ground discrimination (Warzecha, Borst, & Egelhaaf, 1992), this seems to hold true. VCH receives input from several other tangential neurons, the response fields of which are well characterized.
By combining the response fields of the inputting tangential neurons, the VCH response field is readily explained (Krapp et al., 2001). This suggests that linear combination of tangential neuron response fields may well be an option for the fly visual system to estimate egomotion.

Acknowledgments

The authors wish to thank J. Hill, M. Hofmann, and M. V. Srinivasan for their help. Financial support was provided by the Human Frontier Science Program and the Max-Planck-Gesellschaft.

References

Borst, A., & Egelhaaf, M. (1993). Detecting visual motion: Theory and models. In F. Miles & J. Wallman (Eds.), Visual motion and its role in the stabilization of gaze (pp. 3–27). Amsterdam: Elsevier.
Borst, A., Egelhaaf, M., & Haag, J. (1995). Mechanisms of dendritic integration underlying gain control in fly motion-sensitive interneurons. J. Comput. Neurosci., 2, 5–18.
Chahl, J. S., & Srinivasan, M. V. (1997). Reflective surfaces for panoramic imaging. Applied Optics, 36(31), 8275–8285.
Egelhaaf, M., Kern, R., Krapp, H. G., Kretzberg, J., Kurtz, R., & Warzecha, A.-K. (2002). Neural encoding of behaviourally relevant visual-motion information in the fly. TINS, 25(2), 96–102.
Franz, M. O., & Chahl, J. S. (2003). Linear combinations of optic flow vectors for estimating self-motion: A real-world test of a neural model. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Franz, M. O., & Krapp, H. G. (2000). Wide-field, motion-sensitive neurons and matched filters for optic flow fields. Biol. Cybern., 83, 185–197.
Gibson, J. J. (1950). The perception of the visual world. Boston: Houghton Mifflin.
Haag, J., & Borst, A. (2003). Orientation tuning of motion-sensitive neurons shaped by vertical-horizontal network interactions. J. Comp. Physiol. A, 189, 363–370.
Harris, R. A., O'Carroll, D. C., & Laughlin, S. B. (2000). Contrast gain reduction in fly motion adaptation. Neuron, 28, 595–606.
Hausen, K. (1982). Motion sensitive interneurons in the optomotor system of the fly. II. The horizontal cells: Receptive field organization and response characteristics. Biol. Cybern., 46, 67–79.
Hausen, K. (1984). The lobula-complex of the fly: Structure, function and significance in visual behaviour. In M. A. Ali (Ed.), Photoreception and vision in invertebrates (pp. 523–559). New York: Plenum Press.
Hausen, K. (1993). The decoding of retinal image flow in insects. In F. A. Miles & J. Wallman (Eds.), Visual motion and its role in the stabilization of gaze (pp. 203–235). Amsterdam: Elsevier.
Heeger, D. J., & Jepson, A. D. (1992). Subspace methods for recovering rigid motion. I: Algorithm and implementation. Intl. J. Computer Vision, 7, 95–117.
Huston, S. J., & Krapp, H. G. (2003). Visual receptive field of a fly neck motor neuron. In N. Elsner & H.-U. Schnitzler (Eds.), Göttingen neurobiology report 2003. Stuttgart: Thieme.
Koenderink, J. J., & van Doorn, A. J. (1987). Facts on optic flow. Biol. Cybern., 56, 247–254.
Krapp, H. G. (2000). Neuronal matched filters for optic flow processing in flying insects. In M. Lappe (Ed.), Neuronal processing of optic flow (pp. 93–120). San Diego, CA: Academic Press.
Krapp, H. G., & Hengstenberg, R. (1996). Estimation of self-motion by optic flow processing in single visual interneurons. Nature, 384, 463–466.
Krapp, H. G., Hengstenberg, R., & Egelhaaf, M. (2001). Binocular input organization of optic flow processing interneurons in the fly visual system. J. Neurophysiol., 85, 724–734.
Krapp, H. G., Hengstenberg, B., & Hengstenberg, R. (1998). Dendritic structure and receptive field organization of optic flow processing interneurons in the fly. J. Neurophysiol., 79, 1902–1917.
Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network: Comput. Neural Syst., 12, 317–329.
Reichardt, W. (1987). Evaluation of optical motion information by movement detectors. J. Comp. Physiol. A, 161, 533–547.
Reiser, M. B., & Dickinson, M. H. (2003). A test bed for insect-inspired robotic control. Phil. Trans. Royal Soc. Lond. A, 361, 2267–2285.
Srinivasan, M. V. (1994). An image-interpolation technique for the computation of optic flow and egomotion. Biol. Cybern., 71, 401–415.
Warzecha, A.-K., Borst, A., & Egelhaaf, M. (1992). Photo-ablation of single neurons in the fly visual system reveals neuronal circuit for detection of small objects. Neurosci. Letters, 141, 119–122.

Received October 31, 2003; accepted May 3, 2004.
LETTER
Communicated by Emilio Salinas and Peter Thomas
Correlated Firing Improves Stimulus Discrimination in a Retinal Model

Garrett T. Kenyon
[email protected]
Physics Division, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

James Theiler
[email protected]
Non-Proliferation and International Security, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

John S. George
[email protected]
Physics Division, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

Bryan J. Travis
[email protected]
Environmental and Earth Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

David W. Marshak
[email protected]
Department of Neurobiology and Anatomy, University of Texas Medical School, Houston, TX 77030, U.S.A.
Synchronous firing limits the amount of information that can be extracted by averaging the firing rates of similarly tuned neurons. Here, we show that the loss of such rate-coded information due to synchronous oscillations between retinal ganglion cells can be overcome by exploiting the information encoded by the correlations themselves. Two very different models, one based on axon-mediated inhibitory feedback and the other on oscillatory common input, were used to generate artificial spike trains whose synchronous oscillations were similar to those measured experimentally. Pooled spike trains were summed into a threshold detector whose output was classified using Bayesian discrimination. For a threshold detector with short summation times, realistic oscillatory input yielded superior discrimination of stimulus intensity compared to rate-matched Poisson controls. Even for summation times too long to resolve synchronous inputs, gamma band oscillations still contributed to improved discrimination by reducing the total spike count variability, or Fano factor. In separate experiments in which neurons were synchronized in a stimulus-dependent manner without attendant oscillations, the Fano factor increased markedly with stimulus intensity, implying that stimulus-dependent oscillations can offset the increased variability due to synchrony alone.

Neural Computation 16, 2261–2291 (2004). © 2004 Massachusetts Institute of Technology

1 Introduction

Neurons are thought to represent sensory information primarily as changes in their individual firing rates. In order to obtain reliable estimates of rate-coded information on physiological timescales, from tens to hundreds of msec, it may be necessary to estimate the average firing rate over an ensemble of similarly activated cells. Averaging yields a more accurate estimate of the ensemble firing rate only to the extent that the input spike trains are statistically independent (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). When the input spike trains are instead partially synchronized, averaging their firing rates no longer produces the same improvement in signal to noise, as fluctuations in the number of spikes arising from individual cells will no longer tend to cancel out across the population of input fibers. Nonetheless, firing correlations are ubiquitous in the nervous system, often associated with coherent oscillations in the gamma frequency band (40–120 Hz) that synchronize activity both within and between brain areas (Fries, Schroder, Roelfsma, Singer, & Engel, 2002; Singer & Gray, 1995). It is thus important to understand what consequences such synchronous oscillations might have for how information is represented by central neurons. One possibility is that firing correlations simply impose an upper limit on the amount of rate-coded information a population of neurons can represent in its pooled activity over a given unit of time (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998).
From this perspective, firing correlations are an inevitable but undesirable consequence of the massive interconnectivity of neural circuits but otherwise serve no significant information processing function. A second possibility is that the correlations themselves may encode relevant information, a view supported by studies showing that coherent oscillations may be involved in a variety of cognitive processes, including attention (Fries, Reynolds, Rorie, & Desimone, 2001), perception (Srinivasan, Russell, Edelman, & Tononi, 1999), top-down priming (Engel, Fries, & Singer, 2001), and feature integration (Fries et al., 2002; Singer & Gray, 1995). Theoretical arguments indicate that correlations are not always detrimental to a population code and can actually increase the overall information content depending on the specific nature of the correlations, tuning differences between the individual neurons, and how the population-coded signals are extracted (Abbott & Dayan, 1999). Here, we adopt a more empirical approach. Specifically, we employ artificial spike trains exhibiting stimulus-dependent synchronous oscillations similar to those observed between ganglion cells in the cat retina, which we use to explore how realistic
firing correlations affect the extraction of visual information by a general threshold detection process. Information extraction was assessed by quantifying the performance of a Bayesian discriminator in classifying different stimulus intensities based on the output of a threshold detector. When driven by correlated inputs in which the stimulus intensity was encoded by both the firing rates of individual cells and the strength of their synchronous oscillations, threshold elements mediated stimulus discrimination equal or superior to that obtained when driven by statistically independent Poisson generators that produced, on average, the same number of spikes. Threshold detectors with short integration times and low background event rates (i.e., coincidence detectors) always extracted more information from spike trains with realistic correlations than from Poisson controls, as the stimulus information encoded by the level of synchrony could be extracted directly. Neurons in the lateral geniculate nucleus (LGN) and visual cortex are preferentially sensitive to synchronous inputs (Alonso, Usrey, & Reid, 1996; Usrey, Alonso, & Reid, 2000) and may respond differentially to coincidences in their primary afferents. Somewhat surprisingly, even threshold detectors with long integration times, which were thus unable to resolve synchronous inputs, still mediated equal or superior stimulus discrimination when driven by correlated inputs compared to Poisson controls, consistent with previous theoretical analysis (Abbott & Dayan, 1999). Because the stimulus-evoked oscillations became stronger as a function of the stimulus intensity, the loss of rate-coded information due to the increased firing synchrony was effectively countered by a reduction in the variability of the total input spike count.

2 Methods

2.1 Artificial Spike Trains.
Spike trains were generated by one of two computer models, whose parameters were adjusted to qualitatively match the stimulus-dependent, synchronous oscillations recorded from cat ganglion cells (Mastronarde, 1989; Neuenschwander, Castelo-Branco, & Singer, 1999; Neuenschwander & Singer, 1996). Some theoretical studies of correlated activity have employed mathematically generated spike trains in which the degree of firing synchrony was independent of, and thus conveyed no information about, the applied stimulus (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). However, the synchronous oscillations between retinal neurons depend strongly on stimulus parameters such as size (Ariel, Daw, & Rader, 1983; Ishikane, Kawana, & Tachibana, 1999; Neuenschwander et al., 1999), contrast (Neuenschwander et al., 1999), connectedness (Ishikane et al., 1999; Neuenschwander & Singer, 1996), and velocity (Neuenschwander et al., 1999). To produce realistic firing correlations, we adopted two very different yet complementary strategies for generating artificial spike trains. In the first case, we used a detailed model of axon-mediated inhibitory feedback in the inner retina in order to generate synchronous oscillations via a complex, but biologically plausible, dynamical process. In the second case, synchronous oscillations were generated using a minimalist approach in which each ganglion cell generated Poisson-distributed spikes at a rate modulated by a common source of oscillatory input. Although both models exhibited similar firing correlations, as assessed by comparing their multiunit cross-correlation histograms (multiunit CCHs), defined below, the two models were based on very different underlying mechanisms. Thus, we were able to investigate whether our conclusions regarding the information content of spike trains with realistic correlations were robust with respect to their underlying mode of generation. To guard against the possibility that our mathematically generated spike trains exaggerated the information conveyed by correlations, both models were subject to the following constraints. First, when corroborating information was available, the largest stimulus-evoked correlations between the model-generated spike trains were comparable to the correlations measured experimentally under similar conditions. Thus, under these circumstances, we are confident that none of the correlations generated by the model were substantially in excess of the correlations present physiologically. Second, the level of correlations in the absence of stimulation was comparable to the spontaneous correlations measured experimentally, ensuring that a correlation code would not possess an unfair advantage by starting from an unphysiologically low baseline level. Third, because the retinal model did not incorporate several known adaptation mechanisms, particularly those occurring in the outer retina, we ensured that the plateau firing rates evoked by stimuli of various intensities were probably larger than those that would occur physiologically. Thus, the model was probably conservative and likely biased in favor of a rate code.

2.2 Retinal Model.
Artificial spike trains with realistic firing correlations were generated by a retinal model (see Figure 1), details of which have been published previously (Kenyon et al., 2003). The model retina was organized as a 32 × 32 array with wraparound boundary conditions containing five distinct cell types: bipolar cells (BP), small amacrine cells (SA), large amacrine cells (LA), polyaxonal amacrine cells (PA), and alpha ganglion cells (GC). All cell types were modeled as single-compartment, RC circuit elements obeying a first-order differential equation of the following form:

\dot{V}_{ij}^{(k)} = -\frac{1}{\tau^{(k)}} V_{ij}^{(k)} + \sum_{k'} \sum_{i',j'} W_{ii'}^{(k,k')} \, f^{(k,k')}\big(V_{i'j'}^{(k')}\big) \, W_{jj'}^{(k,k')} + b^{(k)} + L_{ij}^{(k)}, \qquad (2.1)

where V_ij^(k) is a 2D array denoting the normalized membrane potentials of all cells of type k, (1 ≤ k ≤ 5), with i denoting the row and j the column of the corresponding cell, τ^(k) is the time constant, b^(k) is a bias current for setting
Figure 1: The retinal model contained five cell types—bipolar (BP) cells, small (SA), large (LA), and polyaxonal (PA) amacrine cells, and alpha ganglion (GC) cells—arranged as a 32 × 32 square mosaic with wraparound boundary conditions. Conceptually, connections could be organized into three categories. (A) Feedforward and feedback inhibition. Excitatory synapses from BPs were balanced by a combination of reciprocal synapses and direct inhibition of the GCs, primarily mediated by the nonspiking amacrine cell types. (B) Serial inhibition. The three amacrine cell types regulated each other through a negative feedback loop. (C) Resonance circuit. The PAs were excited locally via electrical synapses with GCs, and their axons mediated wide-field inhibition to all cell types, but most strongly onto the GCs. Not all connections are shown. Symbols: excitation (triangles), inhibition (circles), gap junctions (resistors).
the resting potential, L_ij^(k) is an external input representing light stimulation, W_ii'^(k,k') gives the connection strengths between presynaptic, k', and postsynaptic, k, cell types as a function of their row (vertical) separation, W_jj'^(k,k') gives the same information as a function of the column (horizontal) separation, and the functions f^(k,k') give the associated input-output relations for the indicated pre- and postsynaptic cell types, detailed below. The output of the axon-mediated inhibition was delayed by 2 msec, except for the axonal connections onto the axon-bearing amacrine cells, which were delayed by 1 msec. All other synaptic interactions were delayed by one time step, which equaled 1 msec. Although the ratio of ganglion cells to bipolar cells
reflects known density differences, their true anatomical ratio is higher; computational constraints limited the total number of neurons that could be efficiently simulated. However, for the relatively large stimuli used in our experiments, fine details of the ganglion cell receptive field structure should not have been critical. Numerical ratios were always powers of 2 to ensure that the underlying connectivity was translationally invariant between processing modules. All equations were integrated in Matlab using a direct Euler method. Previous analysis has verified that the synchronous oscillations predicted by the model are robust with respect to individual parameter variation and integration step size (Kenyon et al., 2003). The input-output function for gap junctions was given by the identity
f^{(k,k')}\big(V_{ij}^{(k')}\big) = V_{ij}^{(k')}, \qquad (2.2)
where the dependence on the presynaptic potential has been absorbed into the definition of τ^(k). This is possible because both the decay term in equation 2.1 and the omitted dependence on the presynaptic potential in equation 2.2 depend linearly on V_ij^(k), allowing the coefficients to be combined. Many retinal neurons, including bipolar cells and most amacrine cells, do not fire action potentials but rather generate stochastically distributed postsynaptic potentials at a rate proportional to their membrane voltage (Freed, 2000). Such stochastic inputs are likely to contribute significantly to ganglion cell membrane noise (van Rossum, O'Brien, & Smith, 2003). The input-output function for graded stochastic synapses was constructed by comparing, on each time step, a random number with a Fermi function:
f^{(k,k')}\big(V_{ij}^{(k')}\big) = \theta\!\left( \frac{1}{1 + \exp\big(-\alpha V_{ij}^{(k')}\big)} - r \right), \qquad (2.3)
where α sets the gain (equal to 4 for all nonspiking synapses), r is a uniform random deviate equally likely to take any real value between 0 and 1, and θ is a step function, θ(x) = 1, x ≥ 0; θ(x) = 0, x < 0. Finally, the input-output relation used for spiking synapses was
f^{(k,k')}\big(V_{ij}^{(k')}\big) = \theta\big(V_{ij}^{(k')}\big). \qquad (2.4)
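The three input-output relations (equations 2.2–2.4) can be sketched in Python as follows; the array handling, random generator, and overflow guard are implementation choices for illustration, not part of the published Matlab model:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative random source

def f_gap(v):
    # Gap junction (equation 2.2): identity on the presynaptic potential.
    return v

def f_graded(v, alpha=4.0):
    # Graded stochastic synapse (equation 2.3): release occurs whenever a
    # Fermi function of the presynaptic potential exceeds a fresh uniform
    # deviate, so the per-step release probability equals the Fermi value.
    z = np.clip(alpha * np.asarray(v, dtype=float), -60.0, 60.0)  # avoid overflow
    r = rng.uniform(size=z.shape)
    return np.where(1.0 / (1.0 + np.exp(-z)) - r >= 0, 1.0, 0.0)

def f_spiking(v):
    # Spiking synapse (equation 2.4): step function of the potential.
    return np.where(np.asarray(v) >= 0, 1.0, 0.0)
```

At strongly depolarized presynaptic potentials the graded synapse releases on every step, and at strongly hyperpolarized potentials it is silent, matching the sigmoidal release probability.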
A modified integrate-and-fire mechanism was used to model spike generation. A positive pulse (amplitude = 10.0) was delivered to the cell on the time step after the membrane potential crossed threshold, followed by a negative pulse (amplitude = −10.0) on the subsequent time step. This resulted in a 1 msec action potential followed by an afterhyperpolarization that decayed with the time constant of the cell. Action potentials produced impulse responses in electrically coupled cells as well, an important element
of the oscillatory feedback dynamics. The bias current, b, was decremented by 0.5 following each spike and then returned to the resting value with the time constant of the cell, adding to the relative refractory period. There was, in addition, an absolute refractory period of 1 msec. Along both the horizontal and vertical directions, synaptic strengths fell off as gaussian functions of the distance between the pre- and postsynaptic cells. For a given vertical separation, the weight factor was determined by a gaussian function of the following form,
W_{i^{(k)}, i^{(k')}}^{(k,k')} = \alpha\, W^{(k,k')} \exp\!\left( -\frac{\big(i^{(k)} - i^{(k')}\big)^2}{2\,\big(\sigma^{(k')}\big)^2} \right), \qquad (2.5)
where W_{i^(k), i^(k')}^(k,k') is the weight factor from presynaptic cells of type k' at row index i^(k') to the postsynaptic cells of type k at row index i^(k), where the superscripts k and k' are necessary because the number of rows may be different for the two cell types, α is a numerically determined normalization factor that ensured the total synaptic input integrated over all presynaptic cells of type k' to every postsynaptic cell of type k equaled W^(k,k'), σ^(k') is the gaussian radius of the interaction, which depended only on the presynaptic cell type, and the quantity i^(k) − i^(k') denotes the vertical distance between the pre- and postsynaptic cells, taking into account the wraparound boundary conditions employed to mitigate edge effects and assuming all cell types are distributed uniformly over rectilinear grids of equal total area and whose center points coincide. An analogous weight factor describes the dependence on horizontal separation. Equation 2.5 was augmented by a cutoff condition that prevented synaptic interactions beyond a specified distance, determined by the radius of influence of the presynaptic outputs and the postsynaptic inputs, roughly corresponding to the axonal and dendritic fields, respectively. A synaptic connection was possible only if the output radius of the presynaptic cell overlapped the input radius of the postsynaptic cell. Except for axonal connections, the input and output radii were the same for all cell types. For the large amacrine cells and the ganglion cells, the radius of influence extended out to the centers of the nearest neighboring cells of the same type, producing a coverage factor greater than one (Vaney, 1990). The radii of the bipolar, small, and axon-bearing amacrine cells (nonaxonal connections only) extended only halfway to the nearest cell of the same type, giving a coverage factor of one (Cohen & Sterling, 1990). The radius of the axonal connections was equal to nine ganglion cell diameters.
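As a rough one-dimensional illustration of equation 2.5 with distances measured in cell units rather than the model's grid fractions (a simplifying assumption), the gaussian fall-off, normalization, and cutoff might be computed as:

```python
import numpy as np

def ring_distance(i, j, n):
    # Separation of grid indices i and j on an n-cell ring, implementing
    # the wraparound boundary conditions used to mitigate edge effects.
    d = abs(i - j)
    return min(d, n - d)

def weight_row(n, total_weight, sigma, cutoff):
    # Gaussian fall-off with distance (cf. equation 2.5) for a single
    # postsynaptic cell at index 0 on an n-cell ring. The normalization
    # factor scales the summed input to total_weight, and the cutoff
    # zeroes connections beyond the radius of influence.
    d = np.array([ring_distance(0, j, n) for j in range(n)], dtype=float)
    w = np.exp(-d**2 / (2.0 * sigma**2))
    w[d > cutoff] = 0.0
    return total_weight / w.sum() * w
```

The returned row sums to the requested total weight, and cells beyond the cutoff receive no connection, mirroring the overlap condition described above.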
The external input was multiplied by a gain factor of 3, chosen so that a stimulus intensity of 1 would produce an approximately saturating response in the model bipolar cells. Values for model parameters are listed in Tables 1 and 2.
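A single Euler step of equation 2.1 for one pre/postsynaptic cell-type pair can be sketched as follows; the array sizes, identity weight factors, and the sign convention (−V/τ plus summed input, bias, and light) are illustrative assumptions, not the published implementation:

```python
import numpy as np

# One Euler step of equation 2.1 for a single presynaptic population.
dt = 1.0                      # integration step, msec (one model time step)
tau, bias = 5.0, -0.025       # GC-like values from Table 1
n = 32
rng = np.random.default_rng(2)

V = np.zeros((n, n))          # postsynaptic potentials V^(k)
V_pre = rng.normal(size=(n, n))   # placeholder presynaptic potentials
W_row = np.eye(n)             # placeholder row/column weight factors
W_col = np.eye(n)
L = np.zeros((n, n))          # light input

def f(v):
    # Spiking input-output relation (equation 2.4).
    return np.where(v >= 0, 1.0, 0.0)

# dV/dt = -V/tau + W_row f(V_pre) W_col + bias + L  (sign convention assumed)
V = V + dt * (-V / tau + W_row @ f(V_pre) @ W_col + bias + L)
```

With identity weight factors, each postsynaptic cell simply receives the thresholded output of the corresponding presynaptic cell plus the bias.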
Table 1: Cellular Parameters.

        τ       b        n × n      d            σ
BP      10.0    −0.0     64 × 64    0.25         0.25
SA      25.0    −0.5     64 × 64    0.25         0.25
LA      20.0    −0.25    32 × 32    1.0          0.5
PA       5.0    −0.025   64 × 64    0.25/9.0^a   0.25/3.0^a
GC       5.0    −0.025   32 × 32    1.0          0.5

Notes: τ: time constant (msec); b: bias; n × n: array size; d: cutoff radius; σ: gaussian radius (see equation 2.5). ^a Inner radius/outer radius.
Table 2: Synaptic Weights.

Post \ Pre   BP        SA         LA        PA                   GC
BP           *         −0.375^b   −3.0^b    −3.0^b/−15.0^c       *
SA           3.0^b     *          −3.0^b    0.0^b/−15.0^c        *
LA           3.0^b     *          0.25^a    −3.0^a/−15.0^c       *
PA           0.75^b    −0.75^b    0.25^a    0.25^a/−45.0^c       0.25^a,d
GC           9.0^b     −4.5^b     −4.5^b    0.25^a/−270.0^c      *

Notes: Each term represents the total integrated weight from all synapses arising from the corresponding presynaptic type (columns) to each cell of the corresponding postsynaptic type (rows) (the quantity W^(k,k') in equation 2.5). Asterisk = absence of corresponding connection. Synapse type indicated by superscript: ^a gap junction, ^b nonspiking synapse, ^c spiking synapse. ^d Maximum coupling efficiency (ratio of post- to presynaptic depolarization) for this gap junction synapse: DC = 11.3%, action potential = 2.7%.
2.3 Rate-Modulated Poisson Process.

A second method of generating artificial spike trains was based on a minimalist assumption: that firing correlations between cat ganglion cells are due to a common oscillatory input. An oscillatory time series of duration T and temporal resolution Δt, possessing realistic temporal correlations, was constructed as follows. Starting with the discrete frequencies, f_k,

f_k = \frac{k}{T}, \quad 0 \le k < \frac{T}{\Delta t}, \qquad (2.6)

the discrete Fourier coefficients were defined as follows:
the discrete Fourier coefficients were defined as follows:
( f k − f0 ) 2 Ck = e2πir exp , 2σ 2
(2.6)
(2.7)
where f0 is the central oscillation frequency, σ is the width of the spectral peak in the associated power spectrum, and r is a uniform random deviate between 0 and 1 that randomized the phases of the individual Fourier
components (generated by the Matlab intrinsic function RAND). These coefficients were used to convert back to the time domain using the discrete inverse Fourier transform,

R_n = \frac{A}{N} \sum_{k=1}^{N-1} C_k\, e^{-2\pi i f_k t_n} + R_0, \qquad (2.8)
where the real part of R_n denotes the value of the time-dependent firing rate at the discrete times t_n = n · Δt, N = T/Δt, A is an empirically determined scale factor, and we have added a constant offset, R_0, which sets the mean firing rate. The quantity A was determined by the formula A = 0.25 + (1.1 − 0.25)(6 + I)/5, where I is the stimulus intensity in log_2 units, with values ranging from −6 to −1, and the other coefficients were determined empirically to produce a reasonable match to the retinal model. The quantity A was then normalized relative to the standard deviation of R_n over all time steps with A = 1. Thus, the highest stimulus intensities produced rate fluctuations on the order of the standard deviation due to noise. The width of the frequency spectrum, σ, was given by an analogous formula: σ = 10 − (10 − 6.25)(6 + I)/5. Negative values of R_n were truncated at zero and the resulting time series rescaled so that its average value remained equal to R_0. The time series defined by R_n was used to generate oscillatory spike trains via a pseudorandom process,

S_n = \theta(R_n \Delta t - r), \qquad (2.9)
where Rn · t is the probability of a spike in the nth time bin, θ is a step function, θ (x < 0) = 0, θ(x > 0) = 1, and r is again a uniform random deviate. In the limit that Rn · t 1, the above procedure reduces to a rate-modulated Poisson process. The same time series, Rn , was used to modulate the firing rate of each element contributing to the artificially generated multiunit spike train, thus producing temporal correlations due to shared input. 2.4 Correlated Poisson Process. To generate artificial spike trains that possess spatial but not temporal correlations, we employed a pseudo-Poisson process to generate a template spike train from which mutually correlated spike trains could be constructed. To produce a random template spike train of duration T, temporal resolution t, and mean spike rate R0 , we simply used equation 2.9 with the replacement Rn → R0 : S(0) n = θ(R0 t − r).
(2.10)
To produce spatially correlated spike trains, the template time series, S(0) n , was used to construct new spike trains according to the formula
(R0 t) S(k) = θ S · C t + (1 − S ) · (1 − C t) − r , (2.11) n n n 1 − R0 t
G. Kenyon, J. Theiler, J. George, G. Travis, and D. Marshak
where Sn(k) denotes the spike train of the kth cell, C is the conditional firing rate given that a spike occurred in the corresponding time bin of the template train, and r is again a uniform random deviate. The maximum value of C·Δt was (1 − R0·Δt) · 0.5.

2.5 Data Analysis. Reported correlations are expressed as a fraction of the expected synchrony due to chance, measured during either baseline activity or the plateau portion of the response to a sustained stimulus (200–600 msec). With this normalization, a correlation amplitude of one at zero delay corresponded to a doubling in the number of synchronous events over the expected rate. Correlations were plotted as a function of the delay after averaging over all events occurring during the plateau portion of the response. For each delay value, this average was compensated for edge effects arising from the finite length of the two spike trains. To increase the signal-to-noise ratio, the firing rates or correlations were averaged over all cells, or distinct cell pairs, responding to the same stimulus, producing a multiunit peristimulus time histogram (PSTH) or multiunit cross-correlation histogram (CCH), respectively. Autocorrelation functions were not included in the multiunit CCH, which thus included only cross-correlations between distinct cell pairs. Error bars were estimated by assuming Poisson statistics for the count in each histogram bin. All correlations were obtained by averaging over 200 stimulus trials, using a bin width of 1 msec.

2.6 Bayes Discriminator. A Bayes discriminator was used to distinguish between different intensities based on the distribution of threshold events across independent stimulus trials. For each intensity, the number of suprathreshold events was determined on successive trials and the results normalized as a probability distribution.
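Equations 2.6 through 2.11 can be collected into a short numerical sketch (written in Python rather than the original Matlab; all parameter values used below are illustrative choices, not the values from the study):

```python
import numpy as np

rng = np.random.default_rng(1)

def oscillatory_rate(T, dt, f0, sigma, R0, A):
    """Oscillatory firing rate of section 2.3 (eqs. 2.6-2.8): Fourier
    coefficients with a Gaussian spectral peak at f0 and randomized
    phases, transformed back to the time domain."""
    N = int(T / dt)
    k = np.arange(N)
    fk = k / T                                   # eq. 2.6
    r = rng.random(N)                            # one uniform deviate per component
    Ck = np.exp(2j * np.pi * r) * np.exp(-((fk - f0) ** 2) / (2.0 * sigma ** 2))  # eq. 2.7
    tn = k * dt
    # eq. 2.8: real part of the inverse transform (k = 1 .. N-1) plus the offset R0
    Rn = (A / N) * np.real(
        np.sum(Ck[1:, None] * np.exp(-2j * np.pi * fk[1:, None] * tn[None, :]), axis=0)
    ) + R0
    return np.maximum(Rn, 0.0)                   # truncate negative rates at zero

def spikes_from_rate(Rn, dt):
    """Eq. 2.9: S_n = theta(R_n * dt - r); a rate-modulated pseudorandom
    process (Poisson in the limit R_n * dt << 1)."""
    return (Rn * dt > rng.random(len(Rn))).astype(int)

def correlated_trains(T, dt, R0, C, n_cells):
    """Correlated Poisson process of section 2.4 (eqs. 2.10-2.11):
    spike trains conditionally correlated with a common template."""
    N = int(T / dt)
    template = (R0 * dt > rng.random(N)).astype(int)      # eq. 2.10
    # eq. 2.11: spike probability C*dt in template-spike bins, and
    # R0*dt / (1 - R0*dt) * (1 - C*dt) otherwise, preserving mean rate R0
    prob = np.where(template == 1,
                    C * dt,
                    (R0 * dt) / (1.0 - R0 * dt) * (1.0 - C * dt))
    trains = [(prob > rng.random(N)).astype(int) for _ in range(n_cells)]
    return template, trains
```

Note that the conditional probabilities in equation 2.11 are chosen so that each derived train retains the template's mean rate R0 exactly, while sharing synchronous bins with the template.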
For any given pair of intensities, the percentage of correct classifications made by an ideal observer was inversely related to the degree of overlap between the two distributions. Total overlap corresponded to performance at chance (50% correct), while zero overlap implied perfect discrimination (100% correct). For the threshold detection process, all spikes occurring within a given time bin, whose width varied from 1 to 20 msec depending on the experiment, were summed together and the result compared to a threshold. There was no overlap between successive time bins. The duration of the discrimination interval, which varied from 50 to 400 msec, was adjusted so as to approximately normalize the task difficulty as the number of inputs was varied. When the discrimination interval contained the transient portion of the response (0–50 msec), constructing an appropriate Poisson control was less straightforward than during the plateau period, during which the firing rate could be treated as constant. Moreover, because high-frequency oscillations during the response peak were strongly stimulus locked, it was not possible to use the multiunit PSTH to estimate the time-dependent event rates of the equivalent Poisson generators, as these would have contained
the high-frequency stimulus-locked oscillations that the control is intended to eliminate. By using a boxcar filter with a width of 9 msec, however, we were able to modify the multiunit PSTH so as to remove high-frequency components without eliminating the central response peak. The average number of spikes was always the same for both the model ganglion cells and the Poisson controls.

3 Results

The principal characteristics of the retinal model, used to generate most of the artificial spike trains used in this study, are best illustrated by examining the multiunit PSTHs and multiunit CCHs computed from the simulated responses to a narrow bar stimulus of maximum intensity (see Figure 2A). The multiunit PSTH, which combines the responses of all 12 model ganglion cells directly beneath the stimulus, consisted of a sharp peak, about 50 msec wide, followed by a plateau period of sustained elevated activity (see Figure 2B). Relative to baseline, the sustained increase in spike activity produced by the bar stimulus was comparable to, or larger than, the sustained increase in firing exhibited by cat ganglion cells in response to high-contrast features (Creutzfeldt, Sakmann, Scheich, & Korn, 1970; Enroth-Cugell & Jakiela, 1980). Likewise, the pronounced downward notch separating the peak and plateau regions is characteristic of ganglion cell responses to large, flashed stimuli (Cox & Rowe, 1996). Despite the absence of periodic structure in the plateau portion of the multiunit PSTH, there were nonetheless very prominent high-frequency oscillations in the multiunit CCH recorded during the plateau portion of the response (see Figure 2C, solid black line). In the cat retina, high-contrast stimuli produce an approximate doubling in the number of synchronous events relative to the expected level due to chance (Neuenschwander et al., 1999).
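The CCH normalization described in section 2.5 — correlations expressed as a fractional change from the synchrony expected by chance, with edge-effect compensation at each delay — can be sketched as follows (a minimal single-pair illustration, not the full multiunit averaging pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

def cch_normalized(spikes_a, spikes_b, max_lag):
    """Cross-correlogram between two binary spike trains, expressed as a
    fractional change from the synchrony expected by chance: 0 is the
    chance level and 1 corresponds to a doubling of synchronous events.
    The pair count at each delay is compensated for edge effects arising
    from the finite length of the two trains."""
    n = len(spikes_a)
    ra, rb = spikes_a.mean(), spikes_b.mean()
    cch = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            raw = np.dot(spikes_a[:n - lag], spikes_b[lag:])
            n_pairs = n - lag              # edge-effect compensation
        else:
            raw = np.dot(spikes_a[-lag:], spikes_b[:n + lag])
            n_pairs = n + lag
        expected = ra * rb * n_pairs       # coincidences expected by chance
        cch[lag] = raw / expected - 1.0
    return cch
```

For two independent Poisson-like trains the normalized value at zero delay is near 0, while for strongly synchronized trains it is large and positive.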
In the spike trains generated by the retinal model, the synchronous oscillations evoked by a maximum intensity bar stimulus resulted in a qualitatively similar increase in the number of synchronous events relative to the expected level. Thus, the artificially generated spike trains used in this study reflect physiologically reasonable levels of correlated activity. During the plateau portion of the response, the high-frequency oscillations were not strongly phase-locked to the stimulus onset, but rather tended to become phase randomized over time, as revealed by the decline in correlation strength with increasing delay. Consistent with the absence of stimulus-locked oscillations during the plateau period, the shift predictor was negligible (see Figure 2C, dashed gray line). Correlations between cat ganglion cells recorded during sustained activity decline in a similar manner with increasing delay and likewise possess negligible shift predictors (Neuenschwander et al., 1999). We also computed the multiunit CCH for the 50 msec period following stimulus onset encompassing the response peak (see Figure 2D, solid black
Figure 2: Artificial spike trains generated by the retinal model. (A) A column of 12 model ganglion cells was stimulated by a narrow bar (white rectangle, intensity = 1/2, stimulus dimensions 2 × 12 GC receptive field diameters). Circles indicate GC receptive field diameter. (B) Multiunit peristimulus time histogram (PSTH) obtained by averaging the individual PSTHs over all ganglion cells activated by the stimulus (bin width, 1 msec). The solid line at the bottom of the panel indicates the stimulus duration (600 msec). Vertical ticks denote the peak and plateau portions of the response, 0–50 msec and 200–600 msec, respectively. (C) Multiunit cross-correlation histogram (CCH), obtained by combining individual CCHs from all distinct pairs of ganglion cells activated by the stimulus, measured during the plateau portion of the response (solid black lines). Correlations are expressed as a fractional change from the expected synchrony due to chance (dimensionless units). Shift predictors (dashed gray lines) were obtained by recomputing the multiunit CCH using spike trains from different stimulus trials. (D) Multiunit CCH measured during the response peak (same organization as C). Correlations were larger during the response peak, as was the shift predictor, due to the strong stimulus locking of the high-frequency oscillations.
line), during which the high-frequency oscillations were strongly stimulus-locked, as revealed by the periodic structure in the multiunit PSTH as well as by the large shift predictor (see Figure 2D, dashed gray line). A tendency for high-frequency oscillations to become less stimulus locked as a function of time from stimulus onset is also seen in experimental data (Neuenschwander et al., 1999).

3.1 Firing Correlations and Firing Rate Are Both Modulated by Stimulus Intensity. To investigate the relationship between correlation strength and stimulus contrast, the same narrow bar covering 12 ganglion cells was varied over a 32-fold range of intensities (see Figure 3). Both the peak and plateau firing rates of the model ganglion cells activated by the stimulus, as measured by the multiunit PSTH, increased in a graded manner as the stimulus intensity was raised (see Figure 3B). Similarly, the degree of synchrony during both the plateau and peak portions of the response, assessed by the amplitude of the multiunit CCH at zero delay, also increased with stimulus intensity (see Figures 3C and 3D, respectively). The amplitude of stimulus-evoked high-frequency oscillations between cat ganglion cells depends similarly on luminance contrast (Neuenschwander et al., 1999). As a function of stimulus intensity, synchrony could be modulated over a greater dynamic range, measured relative to baseline activity, than could the fractional change in the multiunit firing rate, during both the response plateau (see Figure 3D) and the response peak (see Figure 3E). Overall, firing correlations were very sensitive to stimulus intensity and thus might convey information about luminance contrast in addition to that represented by the mean firing rate across the ensemble of activated cells.

3.2 Correlated Inputs Transmit More Information Through Coincidence Detectors than Independent Rate-Matched Controls.
A simple threshold detector with a short integration time window and a low rate of suprathreshold background events was able to extract the stimulus intensity more reliably from a hybrid rate/correlation code than from an equal number of statistically independent, rate-matched Poisson event generators. Spikes from a column of 12 ganglion cells activated by a narrow bar were summed into a simple threshold detector (see Figure 4). The event rate of the detector was determined by the total number of time bins on which the input crossed threshold during a 200 msec epoch within the plateau portion of the response. For these experiments, the detection threshold was set to a level that required three or more spikes to arrive within the same 2 msec time bin in order to produce a detector event. To assess the extent to which the detector was able to utilize firing correlations between its inputs, the 12 stimulated ganglion cells were replaced by independent Poisson generators that, on average, produced the same number of spikes per unit time. For both correlated and Poisson input, the baseline detector event rate was very low, around 1 Hz (see Figure 4A). A high-intensity stimulus produced
a greater increase in the detector event rate when the inputs were correlated as compared to the case when the inputs were independent. This extra sensitivity reflected the fact that a threshold process with a short integration time is well suited for detecting rare synchronous events (Kenyon, Fetz, & Puff, 1990; Kenyon, Puff, & Fetz, 1992), a property also exhibited by cortical and subcortical neurons (Alonso et al., 1996; Usrey et al., 2000). When driven by correlated input from the model ganglion cells, the output of the threshold detector allowed for better discrimination between different stimulus intensities than when driven by independent Poisson generators. The probability of observing a given number of detection events was plotted for a range of stimulus intensities (see Figure 4B). As the stimulus intensity increased, the distributions shifted to the right, reflecting the greater number of suprathreshold inputs. When the threshold detector was driven by correlated input from the retinal model, the event distributions for different intensities were more separable than when driven by the rate-matched controls. To quantify the degree to which firing correlations caused detector output to be more discriminable, we used a Bayes discriminator to estimate the maximum percentage of intensity comparisons that could be correctly classified (Duda, Hart, & Stork, 2001). Starting from several different baselines, firing correlations allowed for a higher percentage of correct intensity classifications over a broad range of intensity increments (see Figure 4C). The abscissa of each point gives the final stimulus intensity, while the x-intercept of the line passing through that point yields the corresponding baseline intensity. The ordinate of each point gives the percentage of trials on which the given pair of stimulus intensities could be correctly classified, using only the single-trial output of the detector.
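The detection-plus-classification analysis can be sketched end to end as follows (the 2 msec summation window and threshold of three spikes follow the text; the helper names and example values are ours, not the authors' code):

```python
import numpy as np

def detector_events(spikes, bin_width=2, threshold=3):
    """Threshold coincidence detector: spikes from all input cells are
    summed within non-overlapping time bins (bin_width in msec at 1 msec
    resolution), and an event is registered whenever the summed input
    reaches threshold (here, three or more spikes in a 2 msec bin)."""
    pooled = spikes.sum(axis=0)                        # pool across cells
    n = (len(pooled) // bin_width) * bin_width
    binned = pooled[:n].reshape(-1, bin_width).sum(axis=1)
    return int(np.sum(binned >= threshold))

def percent_correct(events_a, events_b):
    """Ideal-observer accuracy for two equally likely intensities from
    the overlap of their event-count distributions: total overlap gives
    50% (chance), zero overlap gives 100%."""
    hi = max(events_a.max(), events_b.max()) + 1
    pa = np.bincount(events_a, minlength=hi) / len(events_a)
    pb = np.bincount(events_b, minlength=hi) / len(events_b)
    return 100.0 * (1.0 - 0.5 * np.minimum(pa, pb).sum())
```

Here `events_a` and `events_b` are the per-trial event counts of the detector at two intensities; the Bayes error for equal priors is half the shared probability mass, so accuracy is one minus that quantity.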
For many pairwise intensity discriminations, synchronous oscillations allowed approximately 10 additional trials out of 100 to be correctly classified (solid black lines) compared to the Poisson control (dashed gray lines). Averaged
Figure 3: Facing page. Firing correlations are modulated over a greater dynamic range than firing rate as a function of stimulus intensity. The stimulus was again a narrow bar covering 12 ganglion cells. (A) Intensity series formed by the multiunit PSTHs of ganglion cells activated by the stimulus. The intensity, in log2 units, is indicated at the upper right of each histogram (bin width = 10 msec). (B) Intensity series formed by the multiunit CCHs between all pairs of ganglion cells activated by the stimulus, relative to the baseline level of synchrony. Firing correlations during the plateau response are strongly modulated by stimulus intensity. (C) Intensity series formed by the multiunit CCHs recorded during the response peak (same organization as B). (D,E) Fractional change from baseline in synchrony (black lines, circles) as a function of stimulus intensity, compared to the fractional change in firing rate (gray lines, squares), recorded during the (D) plateau and (E) peak portions of the response. In both cases, synchrony was modulated over a greater dynamic range than the firing rate.
over all intensity increments, the extra number of correct classifications was approximately 5 in 100, but this value is likely conservative due to saturation. Our results suggest that when retinal output is decoded by a threshold detection process with a short integration time window, synchronous oscillations allow information to be transmitted more reliably than would be the case with independent firing rates. During the peak portion of the response, firing correlations also permitted better discrimination of intensity increments, as represented by the output of the threshold detector applied to the first 50 msec of the response (see Figure 5). However, constructing an appropriate Poisson control was less straightforward than during the plateau period, as the high-frequency oscillations occurring during the response peak were strongly stimulus locked (see Figure 5A, bottom left panel, smooth trace). In order to produce control spike trains that contained only firing correlations due to the transient itself, we used a boxcar filter with a width of 9 msec to remove high-frequency oscillations from the multiunit PSTH without eliminating the central response peak. The average number of input spikes arriving during the response peak was the same for both the model ganglion cells and the time-dependent Poisson control. The distribution of detector events was more separable when input was correlated by high-frequency oscillations than when input was correlated only by the transient response peak (see Figure 5B). A Bayesian

Figure 4: Facing page. Firing correlations during the plateau portion of the response allow improved discrimination of stimulus intensity compared to independent Poisson input. (A) Example of the threshold detection process. Stimuli consisted of a narrow bar presented at various intensities (same stimulus as in Figure 3). Ganglion cell input to the threshold detector during the plateau portion of the response is shown on the left and equivalent Poisson input on the right. The baseline activity of the detector in the absence of stimulation (top row) is very low. A stimulus with intensity = −1 in log2 units (bottom row) produced strong correlations between ganglion cells, resulting in a relatively large number of suprathreshold events. Dashed line: detector threshold. Dotted lines: average input ± S.D. Summation window, 2 msec. (B) Normalized probability distributions giving, for each stimulus intensity, the relative number of suprathreshold events during the 200 msec analysis interval. (Left) Retinal model. (Right) Poisson control. The distributions of suprathreshold events produced by the retinal model were more separable. (C) Percentage of correct intensity classifications by an ideal observer. The ideal observer chose between two equally likely stimulus intensities based on the number of detector events on each trial. Input to the detector came from either the retinal model (solid black lines) or Poisson controls (dashed gray lines). Each point represents the fraction of trials on which the intensity indicated by the abscissa was correctly distinguished from a lower intensity, denoted by the intersection of each line with the x-axis. Error bars computed assuming binary statistics for the overlap between each pair of distributions (omitted from Poisson controls for clarity).
Figure 5: Firing correlations during the response peak allow improved discrimination of stimulus intensity. Same organization as in Figure 4. To remove the high-frequency stimulus-locked oscillations present in the response peak (A, bottom left), the multiunit PSTH was smoothed by replacing the measured firing rate in each bin by the average of the surrounding nine bins (A, bottom right). Firing correlations during the response peak allowed for improved discrimination of stimulus intensity compared to the case where correlations were due solely to the response transient itself (B,C).
discriminator analysis again confirmed that, on average, 5 additional trials out of 100 could be correctly classified when input to the threshold detector was correlated by stimulus-dependent synchronous oscillations as compared to the nonoscillatory control (see Figure 5C). These results show that gamma-band oscillations can convey relevant stimulus information even when superimposed with other sources of stimulus-dependent firing correlations. The above results presumably depend on the detailed nature of the firing correlations present in the input spike trains. In particular, two sets of input spike trains might possess similar correlations as assessed by the multiunit CCH, an average over many independent trials, but still differ significantly with respect to the correlations present on individual trials. Our retinal model provides a biologically plausible method for generating realistic correlations, but as with any other model of complex neural circuitry, it is impossible to know to what extent the proposed physiological mechanisms are correct. To control for the possibility that the correlations produced by our retinal model might have inadvertently biased our results, a second set of input spike trains was generated using a very different mechanism. In contrast to the inhibitory feedback dynamics employed in our retinal model, the second model required only a source of common input to modulate, in phase, the firing rates of the stimulated cells. The parameters of the common input model were adjusted so that both approaches yielded similar multiunit CCHs over a range of stimulus intensities (see Figure 6A). Despite employing very different underlying mechanisms, both models supported very similar levels of stimulus discrimination, with the difference in total performance being less than 1 trial in 100 (see Figure 6B). These results indicate that the information conveyed by spatiotemporal correlations is unlikely to be an artifact of a particular mode of generation, but rather reflects a general property of neural populations exhibiting synchronous gamma-band oscillations.

3.3 Correlations Mediate Greater or Equivalent Stimulus Discrimination for a Wide Class of Threshold Detectors. The hybrid rate/correlation code continued to mediate equal or superior stimulus discrimination even when the integration time of the threshold detector was increased so as to diminish the importance of synchronous inputs. To quantify the performance of the ideal observer for a given threshold detector, we defined the quantity Δ, which gives the difference in the percentage of correctly classified trials using correlated input compared to the Poisson control, averaged over all intensity pairs. Our results show that correlated input mediates greater or equivalent stimulus discrimination over a wide range of integration times and background activity levels (see Figure 7). Plotted as a function of integration time, Δ was largest for small summation intervals well suited to resolve synchronous inputs (see Figure 7A). When plotted as a function of the detection threshold, Δ generally increased for integration times less
Figure 6: Different mechanisms for generating synchronous oscillations yield similar stimulus discrimination. (A, left) Multiunit CCHs generated by the retinal model in response to stimuli of varying intensity, starting with no stimulus (bottom row) and ranging from −6 to −1 (log2 units, top row). Multiunit CCHs are expressed as a fraction of the asymptotic baseline level. (A, right) Multiunit CCHs produced by an equal number of Poisson event generators whose rates were modulated by a source of common oscillatory input (same organization as in the left column). Both sets of artificial spike trains exhibited similar correlations as measured by the multiunit CCH, but their correlations may have differed at the single-trial level. (B) Despite employing very different mechanisms, stimulus discrimination was similar for the two models (same parameters as used in Figure 4).
than approximately 10 msec, since larger thresholds produced lower levels of background activity and thus made the detection process more sensitive to synchrony (see Figure 7B). A somewhat surprising aspect of our results was that as the integration time became large enough to effectively discard intensity information encoded by the degree of synchrony, Δ did not become strongly negative, but rather approached an asymptotic level near zero. Since synchrony adversely affects the amount of information that can be extracted from the average firing rate over a population of similarly activated neurons (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998), it might have been anticipated that once the integration time became sufficiently long to discount any information encoded by the degree of synchrony, independent Poisson inputs would have mediated greater stimulus discrimination than correlated inputs. However, such reasoning fails to consider the effects of stimulus-evoked oscillations, which, like synchrony, also increased as a function of stimulus intensity, as indicated by the increased persistence of periodic structure in the multiunit CCH as the stimulus intensity was increased (see Figure 3). As a result of the stronger oscillations evoked by higher stimulus intensities, the spike trains became more regular, and thus the total number of inputs over the course of each 200 msec discrimination trial became a more reliable predictor of the stimulus intensity. To quantify the reliability of the afferent spike trains, we computed the Fano factor of the multiunit input as a function of stimulus intensity (see Figure 7C). The Fano factor is defined as the variance in the number of spikes divided by the mean and is equal to one for a Poisson process (Teich, 1989).
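A minimal numerical illustration of this definition (the count distributions below are synthetic stand-ins for the model output, chosen only to contrast Poisson with sub-Poisson variability):

```python
import numpy as np

rng = np.random.default_rng(7)

def fano(counts):
    """Fano factor: variance of the spike count across trials divided by
    its mean (equal to one for a Poisson process)."""
    return counts.var() / counts.mean()

# Poisson control: independent counts over repeated trials
poisson_counts = rng.poisson(lam=10.0, size=5000)

# A more regular, oscillation-like process yields sub-Poisson counts;
# here mimicked by a narrow count distribution (illustration only)
regular_counts = np.round(rng.normal(10.0, 1.0, size=5000)).astype(int)
```

With these distributions, the Poisson counts give a Fano factor near one, while the regularized counts give a value well below one, mirroring the sub-Poisson variability attributed above to stimulus-evoked oscillations.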
At all intensities, the Fano factor of the combined correlated input (solid line) was less than that of the Poisson control (dashed line), indicating that the stimulus-evoked oscillations caused the total number of spikes to be less variable, and therefore more reliable, than for independent Poisson generators. In the absence of oscillations, the Fano factor would have increased markedly with stimulus intensity as a consequence of the increased synchrony. To illustrate this fact, a second control was employed in which stimulus-dependent synchrony was introduced in the absence of oscillations. The maximum synchrony was set approximately equal to the maximum synchrony present in the hybrid rate/correlation code, while the minimum synchrony was set to zero, and a linear interpolation was used for intermediate intensities. By allowing no more than one spike in each 10 msec time bin, thereby introducing a refractory period, the Fano factor in the absence of spatial correlations could be reduced to approximately the same level exhibited by the model during background activity. In the absence of oscillations, strong synchrony produced a large rise in the Fano factor of the combined input as the stimulus intensity was increased (dotted line). Our results demonstrate that a refractory period by itself cannot account for the reliability of the total integrated input in the presence of strong synchrony, but that such reliability results naturally from the associated high-frequency oscillations. Thus, stimulus-dependent synchronous oscillations do not necessarily result in less information transfer, even when only the total spike count over a relatively long interval is considered.

3.4 Hybrid Rate/Correlation Codes Work Best over Limited Numbers of Cells. Up to this point, we have considered only the effects of correlations between relatively small numbers of inputs. While this is consistent with the convergence ratios of sensory input onto neurons in the lateral geniculate nucleus and V1 (Reid & Alonso, 1995; Usrey, Reppas, & Reid, 1999), hybrid rate/correlation codes might cease to be advantageous as the number of inputs is increased. To explore the behavior of a hybrid rate/correlation code as a function of the number of input spike trains, we used a much larger stimulus that activated a 12 × 12 array of neurons. Oscillations between retinal ganglion cells increase markedly in response to larger stimuli (Ariel et al., 1983; Ishikane et al., 1999; Neuenschwander et al., 1999), and this was also true in our retinal model. However, because the maximum correlation strength between cat alpha cells has not been precisely determined for such stimuli, our results should be interpreted only as a rough estimate of the information that might be conveyed by a hybrid rate/correlation code as the number of inputs increases. As a function of the number of input spike trains, Δ reached a maximum for a relatively small number of correlated inputs, between 10 and 50 (see Figure 8A). For integration times less than 5 msec, Δ remained greater than or equal to zero regardless of the number of inputs. For a threshold process employing longer integration times, correlations produced progressively poorer stimulus discrimination as the number of inputs increased.

Figure 7: Facing page. Hybrid rate/correlation codes yield superior or equivalent stimulus discrimination for a broad class of threshold detectors. (A) Δ, the difference in the percentage of successfully classified trials using a hybrid code as opposed to independent Poisson controls, plotted as a function of the temporal integration window. Δ declined with increasing integration time but did not become strongly negative. Individual points are for different thresholds. (B) Δ versus detection threshold for different integration times. Same data as in the previous panel. The increase in Δ with threshold declines progressively at longer integration times. (C) Fano factors (variance/mean of the total input spike count) plotted versus stimulus intensity. Solid line + circles: retinal model. Dashed line + squares: Poisson control. Dotted line + triangles: modified Poisson process in which the separate generators were synchronized by an amount proportional to the stimulus intensity, but in which no oscillations were present. To account for the effects of a refractory period on the Fano factor, no more than one spike was allowed to occur in any 10 msec time bin. The Fano factor for the retinal model remained less than one due to the presence of high-frequency oscillations, while synchrony alone produced a large increase in variability relative to independent Poisson inputs.
Figure 8: Hybrid rate/correlation codes work optimally for a limited number of input neurons. These experiments used a 12 × 12 uniform spot, which produced larger synchronous oscillations than did a narrow bar. (A) Δ versus the number of input cells, plotted for several different integration times. Δ reached a peak at between 10 and 50 inputs and remained positive as the number of inputs increased as long as the integration time was small, but became negative for the same number of inputs if the integration time was too long to resolve synchronous inputs. (B) Fano factor versus number of inputs, plotted for several different intensities (log2 units). At high intensities, which produce strong synchronous oscillations, the Fano factor increased sharply as a function of the number of inputs, thus accounting for the poorer performance of the hybrid code in this regime.
Correlated Firing Improves Discrimination
2285
As the number of inputs was raised, we increased the threshold so as to maintain the background detection rate as close to 1 Hz as possible, but without falling below 0.1 Hz. Due to the small oscillations present in the background activity, it was necessary to use a lower threshold for the Poisson control than for the correlated input in order to keep the level of background suprathreshold events close to the target level of 1 Hz. We obtained similar results when the target level of background suprathreshold events was allowed to increase, but at smaller values of Δ on average. To maintain task difficulty as the number of inputs was increased, the duration of each discrimination trial was progressively lowered from 400 to 100 msec. The dependence of the hybrid code on the number of cells feeding into the detector is paralleled by the behavior of the Fano factor of the combined input (see Figure 8B). When the number of inputs was small, the Fano factor was always less than one, regardless of the stimulus intensity. As the number of inputs increased, the Fano factor became much greater than one at all but the smallest stimulus intensities. This result is consistent with previous studies showing that synchrony becomes progressively more destructive of information encoded by the average firing rate as the number of neurons contributing to the average increases (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). When the number of inputs is very large, therefore, a hybrid rate/correlation code is likely to be effective only when the integration time is small enough to resolve the stimulus information encoded directly by the degree of synchrony itself.
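The scaling behavior described above can be illustrated with a toy calculation. The sketch below is not the retinal model: it simply draws each cell's spike count from a private Poisson source plus a common Poisson source whose weight `shared_frac` (a hypothetical parameter) sets the pairwise correlation across the pool.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_fano(n_cells, rate_hz, shared_frac, window_s=0.2, n_trials=4000):
    """Fano factor (variance/mean) of the spike count summed over a pool.

    Each cell's count is a private Poisson draw plus one common Poisson
    draw shared by every cell, so shared_frac = 0 gives independent cells
    and larger values correlate the whole pool.
    """
    lam = rate_hz * window_s                      # expected spikes per cell
    shared = rng.poisson(lam * shared_frac, size=n_trials)
    counts = n_cells * shared.astype(float)       # common spikes hit every cell
    for _ in range(n_cells):
        counts += rng.poisson(lam * (1.0 - shared_frac), size=n_trials)
    return counts.var() / counts.mean()
```

Analytically, the pooled Fano factor of this toy model is (1 − f) + n·f for n cells and shared fraction f, so even f = 0.2 pushes a 100-cell pool far above Poisson variability, in line with the trend in Figure 8B; the intensity-dependent oscillations that keep the model retina's Fano factor below one are deliberately omitted here.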
4 Discussion

Several influential studies of how firing correlations affect the representation of information in neural ensembles focused primarily on how synchrony impedes accurate estimation of the average firing rate of the population (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). In particular, these studies emphasized that averaging over ensembles of similarly activated neurons reduces the trial-to-trial variability in the estimated firing rate only to the extent that the individual cells are uncorrelated. From this point of view, synchrony is an inevitable feature of densely interconnected networks and, as such, limits the effective size of neural ensembles to several tens of strongly correlated cells. However, these studies generally ignored the possibility that synchronous oscillations could themselves directly encode stimulus information and thus compensate for the attendant loss of the information encoded by the ensemble-averaged firing rate. Here, we have demonstrated that realistic synchronous oscillations can lead to improved information transmission under fairly general assumptions. In particular, synchrony, when evoked in an intensity-dependent manner, can significantly improve information transfer through threshold neurons functioning as coincidence detectors. In a complementary fashion, oscillations,
when also proportional to intensity, mediate greater stimulus discrimination by making the total number of input spikes more reliable regardless of the integration time window employed. General principles of population encoding and decoding, especially those involving correlations between multiple neurons, are still very difficult to investigate experimentally. To reproduce the results of the present model, it would be necessary to record simultaneously from multiple input neurons and at least one common postsynaptic neuron and to manipulate the firing correlations between the input neurons through direct multielectrode stimulation or via pharmacological techniques. While it is not clear that such biological experiments are currently feasible, there does exist sufficient information to construct artificial spike trains possessing physiologically realistic correlations. Moreover, it is not necessary to possess a complete understanding of the physiological mechanisms that give rise to strong correlations in order to analyze how their information content might be extracted using a biologically plausible mechanism. Here, we have focused on signal detection using a threshold process that captures certain essential aspects of single-neuron dynamics. Moreover, we were able to compare the stimulus discrimination accomplished by a hybrid rate/correlation code with that mediated by a pure rate code that produced, on average, the same number of spikes. Finally, by using a computational model, we were able to examine the effects of synchronous oscillations over a wide class of threshold detection processes, and in this way we obtained insights into the physiological conditions necessary to use a hybrid rate/correlation code. The main drawback of using a computational model is the need to ascertain whether the results are physiologically relevant. 
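As a concrete illustration of such artificially generated spike trains, the sketch below (parameter values are illustrative, not taken from the model retina) produces an ensemble whose only source of correlation is a shared gamma-band modulation of the firing rate:

```python
import numpy as np

rng = np.random.default_rng(1)

def oscillatory_spike_trains(n_cells=10, rate_hz=40.0, mod_depth=0.5,
                             freq_hz=80.0, duration_s=1.0, dt=0.001):
    """Boolean spike raster (n_cells x time bins) with a common oscillatory drive.

    Each cell fires as a Bernoulli approximation to a Poisson process whose
    rate is modulated by one shared sinusoid, which induces gamma-band
    correlations between the cells without any direct coupling.
    """
    t = np.arange(0.0, duration_s, dt)
    common = 1.0 + mod_depth * np.sin(2.0 * np.pi * freq_hz * t)
    p = np.clip(rate_hz * dt * common, 0.0, 1.0)   # per-bin spike probability
    return rng.random((n_cells, t.size)) < p
```

Cross-correlating any two rows of such a raster yields a central peak flanked by satellite peaks at multiples of 1/freq_hz, the signature of oscillatory synchrony; scaling mod_depth with stimulus intensity would mimic the intensity-dependent synchrony discussed in the text.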
Fortunately, synchronous oscillations between retinal ganglion cells have been sufficiently well characterized to impose tight constraints on artificially generated spike trains. Where verification was possible, the model retina appeared to favor a rate code over a correlation code. This was the case for the narrow bar stimulus employed in our first several experiments. The maximum synchrony between the model spike trains, measured relative to chance and averaged over all cells responding to the bar, was somewhat less than the levels of synchrony between widely separated cells recorded in the cat retina in response to similar stimuli (Neuenschwander & Singer, 1996). Likewise, the maximum sustained increase in firing rate in our artificial spike trains was probably exaggerated due to the lack of several known adaptation mechanisms in the model, particularly those present in the outer retina, although our model did account to some degree for adaptation effects mediated by synaptic inhibition. Our results were also useful for illustrating general phenomena, such as how hybrid rate/correlation codes might scale as the number of cells conveying redundant information increased. The effectiveness of a hybrid rate/correlation code was found to involve a trade-off between the loss of information in the ensemble-averaged firing rates due to synchrony
and a corresponding gain in information due to the information encoded by the synchronous oscillations themselves. At longer integration times that were insensitive to synchronous input, stimulus-dependent oscillations still contributed to improved performance on intensity discrimination tasks by causing spike counts to become more regular. As the number of neurons increased, the loss of rate-coded information due to synchrony became more severe. Our results are likely to be helpful for interpreting synchronous oscillations in the retina and elsewhere in the central nervous system. Gamma band oscillations are ubiquitous in the vertebrate retina, having been measured extracellularly in cats (Laufer & Verzeano, 1967; Neuenschwander et al., 1999; Neuenschwander & Singer, 1996; Steinberg, 1966), rabbits (Ariel et al., 1983), frogs (Ishikane et al., 1999), and mudpuppy (Wachtmeister & Dowling, 1978), as well as in the ERGs of humans (De Carli et al., 2001; Wachtmeister, 1998) and primates (Frishman et al., 2000). The conservation of retinal oscillations across such a broad range of vertebrate species suggests they may be important for visual function. Synchronous oscillations have also been recorded elsewhere in the mammalian nervous system, including visual (Gray, König, Engel, & Singer, 1989; Gray & Singer, 1989; Kreiter & Singer, 1996; Livingstone, 1996) and sensorimotor (Murthy & Fetz, 1996a, 1996b) cortex and the hippocampus (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996). Numerous interpretations of the information-processing function accomplished by synchronous oscillations have been suggested (Engel et al., 2001; Fries et al., 2001, 2002; Singer & Gray, 1995; Srinivasan et al., 1999). Here, we note that since synchronous oscillations are widely present throughout the brain, the nervous system might as well make use of them.
If correlations are indeed an unavoidable consequence of neural connectivity, our results suggest that the loss of information encoded by the average firing rate could be mitigated by employing a hybrid rate/correlation code. Correlations may also convey information that is not well represented by the ensemble-averaged firing rate. Our results suggest that the brain might exploit synchronous oscillations to represent additional types of information while compensating for the loss of information encoded by the ensemble-averaged firing rate. Several of our results were predicated on an analysis of the correlations present during the plateau portion of the response, using a discrimination window of 200 msec. While the majority of stimulus information is probably transmitted during the first 50 msec encompassing the response peak, additional information is likely transmitted by later components as well. The mean intersaccade interval for primates during a free viewing visual search task is approximately 200 msec (Mazer, Gallant, & Gustavsen, 2003), while optimal temporal frequencies for neurons in areas 17/18 of the cat visual cortex are typically in the range of a few Hz (Movshon, Thompson, & Tolhurst, 1978), implying sustained activations on the order of several hundred msec. Moreover, stimulus-related increases in gamma band power have recently been shown to persist for at least 1 sec during free viewing conditions (Salazar, Kayser, & König, 2004). Thus, the 200 msec time window used in several of our experiments is broadly consistent with relevant physiological timescales. In addition, we found that stimulus-dependent synchronous oscillations mediated improved discrimination even for analysis windows confined to the first 50 msec following stimulus onset, despite the presence of strong additional correlations due to the response peak itself in both the model and control data. It has been argued that information transfer is maximized when retinal outputs are uncorrelated (Srinivasan, Laughlin, & Dubs, 1982), and recent studies indicate that the relatively strong firing correlations sometimes observed between neighboring ganglion cells convey little additional information about natural scenes (Nirenberg, Carcieri, Jacobs, & Latham, 2001). Our results are not in conflict with these findings. In our analysis, retinal inputs were pooled into a single variable, roughly corresponding to a postsynaptic membrane potential. The only benefit of statistical independence was therefore a possible improvement in the signal-to-noise ratio of the pooled input. Moreover, information theory by itself does not address how stimulus properties are extracted by target neurons. While formal mathematical measures predict that ganglion cells convey more information when their activity is uncorrelated, experimental and theoretical evidence suggests that synchronous inputs are particularly salient (Alonso et al., 1996; Kenyon et al., 1990; Usrey et al., 2000), and our current results demonstrate that information encoded by firing correlations between model ganglion cells can be efficiently extracted by threshold neurons under fairly general assumptions.
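The claim that threshold neurons can read out synchrony from a pooled input can be made concrete with a minimal caricature (made-up rasters and parameter values, not the model's ganglion cell output): input spikes are summed into a single variable, integrated over a sliding window, and compared against a threshold.

```python
import numpy as np

def threshold_events(raster, window_bins, theta):
    """Count suprathreshold events produced by a pooled spike input.

    raster is a boolean (cells x time bins) array. Spikes are summed over
    cells and over a sliding integration window of window_bins bins; an
    event is scored wherever the integrated count reaches theta. A short
    window makes the unit a coincidence detector; a long window makes it
    an effective rate integrator.
    """
    pooled = raster.sum(axis=0).astype(float)               # postsynaptic drive
    integrated = np.convolve(pooled, np.ones(window_bins), mode="valid")
    return int((integrated >= theta).sum())
```

With a 2 ms window and a high threshold, a raster of synchronous volleys drives the detector while a rate-matched raster of scattered spikes does not; once the window grows past the oscillation period, the two rasters deliver similar integrated counts and the synchrony information is lost, as in the long-integration-time results above.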
Simultaneous recordings in cat from the retina, the LGN, and area 18 of the visual cortex indicate that synchrony between retinal ganglion cells can be propagated to higher levels in the visual system (Castelo-Branco, Neuenschwander, & Singer, 1998). It may therefore be necessary to consider the impact of correlated activity in order to fully account for its role in information processing. Finally, we note that using firing correlations to encode local stimulus properties does not preclude additional encoding functions that have been suggested (Meister & Berry, 1999).
Acknowledgments

We acknowledge useful discussions with Rob Smith and Greg Stephens. This work was supported by the Lab Directed Research and Development Program at the Los Alamos National Laboratory, the Department of Energy Office of Nonproliferation Research and Engineering, and the MIND Institute for Functional Brain Imaging. D.W.M. was supported by National Eye Institute grant EY06472 and National Institute of Neurological Disorders and Stroke grant NS38310.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11, 91–101.
Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Ariel, M., Daw, N. W., & Rader, R. K. (1983). Rhythmicity in rabbit retinal ganglion cell responses. Vision Research, 23, 1485–1493.
Castelo-Branco, M., Neuenschwander, S., & Singer, W. (1998). Synchronization of visual responses between the cortex, lateral geniculate nucleus, and retina in the anesthetized cat. J. Neurosci., 18, 6395–6410.
Cohen, E., & Sterling, P. (1990). Convergence and divergence of cones onto bipolar cells in the central area of cat retina. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 330, 323–328.
Cox, J. F., & Rowe, M. H. (1996). Linear and nonlinear contributions to step responses in cat retinal ganglion cells. Vision Research, 36, 2047–2060.
Creutzfeldt, O. D., Sakmann, B., Scheich, H., & Korn, A. (1970). Sensitivity distribution and spatial summation within receptive-field center of retinal on-center ganglion cells and transfer function of the retina. J. Neurophysiol., 33, 654–671.
De Carli, F., Narici, L., Canovaro, P., Carozzo, S., Agazzi, E., & Sannita, W. G. (2001). Stimulus- and frequency-specific oscillatory mass responses to visual stimulation in man. Clin. Electroencephalogr., 32, 145–151.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Engel, A. K., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in top-down processing. Nat. Rev. Neurosci., 2, 704–716.
Enroth-Cugell, C., & Jakiela, H. G. (1980). Suppression of cat retinal ganglion cell responses by moving patterns. J. Physiol., 302, 49–72.
Freed, M. A. (2000). Rate of quantal excitation to a retinal ganglion cell evoked by sensory input. J. Neurophysiol., 83, 2956–2966.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Fries, P., Schröder, J. H., Roelfsema, P. R., Singer, W., & Engel, A. K. (2002). Oscillatory neuronal synchronization in primary visual cortex as a correlate of stimulus selection. J. Neurosci., 22, 3739–3754.
Frishman, L. J., Saszik, S., Harwerth, R. S., Viswanathan, S., Li, Y., Smith, E. L., III, Robson, J. G., & Barnes, G. (2000). Effects of experimental glaucoma in macaques on the multifocal ERG. Multifocal ERG in laser-induced glaucoma. Doc. Ophthalmol., 100, 231–251.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A., 86, 1698–1702.
Ishikane, H., Kawana, A., & Tachibana, M. (1999). Short- and long-range synchronous activities in dimming detectors of the frog retina. Vis. Neurosci., 16, 1001–1014.
Kenyon, G. T., Fetz, E. E., & Puff, R. D. (1990). Effects of firing synchrony on signal propagation in layered networks. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 141–148). San Mateo, CA: Morgan Kaufmann.
Kenyon, G. T., Moore, B., Jeffs, J., Denning, K. S., Stephens, G. J., Travis, B. J., George, J. S., Theiler, J., & Marshak, D. W. (2003). A model of high-frequency oscillatory potentials in retinal ganglion cells. Vis. Neurosci., 20, 465–480.
Kenyon, G. T., Puff, R. D., & Fetz, E. E. (1992). A general diffusion model for analyzing the efficacy of synaptic input to threshold neurons. Biol. Cybern., 67, 133–141.
Kreiter, A. K., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of the awake macaque monkey. J. Neurosci., 16, 2381–2396.
Laufer, M., & Verzeano, M. (1967). Periodic activity in the visual system of the cat. Vision Res., 7, 215–229.
Livingstone, M. S. (1996). Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex. J. Neurophysiol., 75, 2467–2485.
Mastronarde, D. N. (1989). Correlated firing of retinal ganglion cells. Trends in Neurosciences, 12, 75–80.
Mazer, J. A., Gallant, J. L., & Gustavsen, K. (2003). Goal-related activity in V4 during free viewing visual search: Evidence for a ventral stream visual salience map. Neuron, 18, 1241–1250.
Mazurek, M. E., & Shadlen, M. N. (2002). Limits to the temporal fidelity of cortical spike rate signals. Nat. Neurosci., 5, 463–471.
Meister, M., & Berry, M. J., II. (1999). The neural code of the retina. Neuron, 22, 435–450.
Movshon, J. A., Thompson, I. D., & Tolhurst, D. J. (1978). Spatial and temporal contrast sensitivity of neurones in areas 17 and 18 of the cat’s visual cortex. J. Physiol., 283, 101–120.
Murthy, V. N., & Fetz, E. E. (1996a). Oscillatory activity in sensorimotor cortex of awake monkeys: Synchronization of local field potentials and relation to behavior. J. Neurophysiol., 76, 3949–3967.
Murthy, V. N., & Fetz, E. E. (1996b). Synchronization of neurons during local field potential oscillations in sensorimotor cortex of awake monkeys. J. Neurophysiol., 76, 3968–3982.
Neuenschwander, S., Castelo-Branco, M., & Singer, W. (1999). Synchronous oscillations in the cat retina. Vision Res., 39, 2485–2497.
Neuenschwander, S., & Singer, W. (1996). Long-range synchronization of oscillatory light responses in the cat retina and lateral geniculate nucleus. Nature, 379, 728–732.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Reid, R. C., & Alonso, J. M. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284.
Salazar, R. F., Kayser, C., & König, P. (2004). Effects of training on neuronal activity and interactions in primary and higher visual cortices in the alert cat. J. Neurosci., 24, 1627–1636.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216, 427–459.
Srinivasan, R., Russell, D. P., Edelman, G. M., & Tononi, G. (1999). Increased synchronization of neuromagnetic responses during conscious perception. J. Neurosci., 19, 5435–5448.
Steinberg, R. H. (1966). Oscillatory activity in the optic tract of cat and light adaptation. J. Neurophysiol., 29, 139–156.
Teich, M. C. (1989). Fractal character of the auditory neural spike train. IEEE Trans. Biomed. Eng., 36, 150–160.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol. (Lond.), 493, 471–484.
Usrey, W. M., Alonso, J. M., & Reid, R. C. (2000). Synaptic interactions between thalamic inputs to simple cells in cat visual cortex. J. Neurosci., 20, 5461–5467.
Usrey, W. M., Reppas, J. B., & Reid, R. C. (1999). Specificity and strength of retinogeniculate connections. J. Neurophysiol., 82, 3527–3540.
van Rossum, M. C., O’Brien, B. J., & Smith, R. G. (2003). Effects of noise on the spike timing precision of retinal ganglion cells. J. Neurophysiol., 89, 2406–2419.
Vaney, D. I. (1990). The mosaic of amacrine cells in the mammalian retina. In N. N. Osborne & G. J. Chader (Eds.), Progress in retinal research (pp. 49–100). Oxford: Pergamon Press.
Wachtmeister, L., & Dowling, J. E. (1978). The oscillatory potentials of the mudpuppy retina. Invest. Ophthalmol. Vis. Sci., 17, 1176–1188.

Received April 7, 2003; accepted May 3, 2004.
LETTER
Communicated by Heinrich Bülthoff
A Temporal Stability Approach to Position and Attention-Shift-Invariant Recognition

Muhua Li
[email protected]

James J. Clark
[email protected]

Centre for Intelligent Machines, McGill University, Montréal, Québec, Canada H3A 2A7
Incorporation of visual-related self-action signals can help neural networks learn invariance. We describe a method that can produce a network with invariance to changes in visual input caused by eye movements and covert attention shifts. Training of the network is controlled by signals associated with eye movements and covert attention shifting. A temporal perceptual stability constraint is used to drive the output of the network toward remaining constant across temporal sequences of saccadic motions and covert attention shifts. We use a four-layer neural network model to perform the position-invariant extraction of local features and temporal integration of invariant representations of local features in a bottom-up structure. We present results on both simulated data and real images to demonstrate that our network can acquire both position and attention shift invariance.
1 Introduction

Humans are adept at visually recognizing objects or patterns under different viewing conditions. They are tolerant of position shifts, rotations, and deformations in the visual images. Psychological evidence (Bridgeman, Van der Heijden, & Velichkovsky, 1994; Deubel, Bridgeman, & Schneider, 1998; Leopold & Logothetis, 1998; Norman, 2002; Walsh & Kulikowski, 1998) shows that there exist mechanisms along the visual pathway that maintain perceptual stability in the face of these variations in visual input. Hubel and Wiesel (1962) found that simple neurons in the primary visual cortex respond selectively to stimuli with specific orientation, while complex neurons present certain position-invariant properties. Neurons in higher visual areas, such as the inferotemporal cortex (IT), have larger receptive fields and show more complex forms of invariance. They respond consistently to scaled and shifted versions of the preferred stimuli (Gross & Mishkin, 1977; Ito, Tamura, Fujita, & Tanaka, 1995; Perrett, Rolls, & Caan, 1982; Rolls, 2000).

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2293–2321 (2004)
2294
M. Li and J. Clark
Maintaining perceptual stability is also an emerging issue in computer vision systems. An important consideration in the design of robotic vision systems is the ability to recognize the external world from the video stream acquired as the robot wanders about. The video input is often erratic and unstable because the robot moves its eyes, head, and body to perceive the surroundings and avoid obstacles when it performs tasks. To perform well in recognition tasks, a robot should be able to maintain a constant perception of the structure of an object when changing views of the object during its motor activities.

A number of models have been proposed to describe the mechanisms underlying perceptual stability, such as spatial-phase invariance, translation invariance, and scale invariance (Chance, Nelson, & Abbott, 2000; Fukushima, 1980; Riesenhuber & Poggio, 1999; Salinas & Sejnowski, 2001). In particular, temporal association is deemed an important factor in the development of transformation invariance (Miyashita, 1988; Rolls, 1995). Temporal continuity was first employed by Földiák (1991) to capture the temporal relationship of input patterns. It has been demonstrated that transformation invariances such as translation or position invariance and viewpoint invariance can be learned by imposing temporal continuity on the response of a network to temporal sequences of patterns (Bartlett & Sejnowski, 1998; Becker, 1993, 1999; Einhäuser, Kayser, König, & Körding, 2002; Földiák, 1991; Körding & König, 2001).

The human visual system as a whole seamlessly combines retinal images and visual-related motor commands to give a complete representation of the observed external environment. However, most research work done so far has focused on achieving different degrees of invariance based only on the sensory input, while ignoring the important role of visual-related motor signals.
In our opinion, visual-related self-action signals are crucial in learning spatial invariance, as they provide information as to the nature of changes in the visual input. A critical issue that must be considered in modeling human vision is that the visual system has to deal with an overwhelming amount of information. It is well known that selective attention plays an important role in the human visual system by permitting the focusing on a small fraction of the total input visual information (Koch & Ullman, 1985; Maunsell & Cook, 2002). Shifting of attention enables the visual system to actively, and efficiently, acquire useful information from the external environment for further processing. Our goal is to develop object recognition systems that use covert and overt shifts in attention for feature selection. Covert attention shifts result from a change in feature selection processes occurring with the eye held fixed. Overt attention shifts refer to the change in the image data being attended to that results from a large, saccadic eye movement. Both overt and covert attention shifts cause changes to the visual input that the object recognition system works on. It is important that the functioning of the object recognition system be invariant to the effects of these attention shifts.
Learning of Position and Attention Shift Invariance
2295
With respect to changes induced by eye movements, or overt attention shifts, this invariance is specifically position invariance: the recognition process should provide the same answer regardless of the location on the retina to which the image of the object is projected. Most object recognition techniques that employ attention shifts consider either covert or overt shifts, rarely both. The bulk of these methods consider only covert shifts, where the retinal input to the system remains unchanged during the learning and recognition process (Kikuchi & Fukushima, 2001; Olshausen, Anderson, & Van Essen, 1993). If we directly apply such methods to overt attention shifts, the distortions due to the nonuniformity of the retina and the nonlinearity of projection onto the hemispherical retina may cause problems when foveating eye movements take place. For example, even though Kikuchi and Fukushima’s model of invariant pattern recognition (2001) employs a “scan path” of eye saccades, it does not model any associated relative distortions. In their approach, the only effect of eye movement is a spatial displacement of the imaged features. Their model achieves shift invariance and scale invariance based on extracted spatial relations, which are internally encoded by the visual system as a chain of saccadic vectors and fixated local features. This model is too simplistic, however; a true model of recognition with eye movements must take into account the image distortions resulting from eye movements. In this letter, we propose a new approach to attaining position invariance in which the processes of covert and overt attention shifts play a central role. We implicitly assume that the variation of feature positions on various cortical feature maps arises entirely from the action of covert and overt attention shifts.
Motion of scene features in the external world is irrelevant, as it is the action of the attention systems that determines the location of the scene features in the internal representations. In this way of thinking, position invariance is really invariance to attention shifts, whether they be covert or overt. Desimone (1990) points out that the effects on the visual cortex of covert and overt attention shifts are very similar. It is conceivable, therefore, that we could develop a unified approach in which covert and overt shifts are not distinguished. We employ a temporal difference learning scheme where knowledge of the attention shift command is used to gate the learning process, permitting temporal correlation to take place between visual inputs across attention shifts. We implement a four-layer neural network model and test it on simulated data consisting of various geometrical shapes undergoing transformations.

2 A Neural Network Model of Attention Shift Invariance

The overall model being proposed is composed of two submodules, as illustrated in Figure 1. One is the attention control module, which generates attention-shift signals according to a saliency map. This module also generates saccadic motor command signals (or overt attention-shift signals), which are used to determine the timing for learning. The module obtains as input local feature images from the raw retinal images via a dynamically position-changing attention window. The second submodule is the executive module, which performs the learning of invariant neural representations across attention shifts in temporal sequences. Two forms of learning, the position-invariant extraction of local features and the integration of a position-invariant object representation (a composition of a set of local features) across attention shifts, are triggered by the saccadic motor signals and attention-shift signals from the control module, respectively.

Figure 1: The proposed neural network model is composed of two modules: a control module and an executive module. The control module is an attention-shift mechanism that generates attention-shift signals and saccade motor signals to trigger the learning processes in the executive module. It also selects local features, which are part of the raw retinal image falling within the attention window, as input to the executive module. The executive module consists of a four-layer network, which accomplishes the extraction of position-invariant local features and the integration of attention-shift-invariant complex features from lower level to higher level, respectively.

2.1 Temporal Continuity Approaches to Development of Position Invariance. In this section, we detail how our network learns invariance to position changes that result from eye movements. Our approach is based on
the work of Földiák (1991) and Einhäuser et al. (2002). They proposed methods for developing position invariance that rely on a temporal continuity of the images of objects projected onto the retina. These methods have been shown to work well in learning position invariance in both simulated data and real-world video sequences. Einhäuser et al. proposed a three-layer feedforward network model capable of learning from natural stimuli that develops receptive field properties matching those of cortical simple and complex neurons. Hebbian learning in each output layer cell emphasizes the temporal structure of the input in the learning process. Experimental results show that the middle layer cells learn simple cell response properties that have strong selectivity to both orientation and position. The output layer cells learn complex cell response properties, which exhibit a level of position invariance while preserving orientation selectivity. However, their learning algorithm depends crucially on temporal smoothness in the input. The learning result is very sensitive to the timescale and the temporal structure in the input. When the time interval between successive input scenes becomes large, the temporal difference in the input data can also become large. Equivalently, when structures in the scene are moving rapidly, the temporal changes in the input stream can be large. Einhäuser et al. reported that the output layer cells lose the position-invariant property when the input lacks temporal smoothness.

In order to produce position-invariant recognition, the visual system must be presented with images of an object at different locations on the retina. In techniques such as those of Einhäuser et al. (2002) and Földiák (1991), it is mainly the motion of the objects in the external world that produces the required presentations of the object image across the retina. There are a number of problems with this.
Most important, the motion of objects in 3D space can change the appearance of objects significantly. Thus, the problem of developing position invariance is converted to the much more difficult problem of developing viewpoint invariance. The difference in the appearance of a moving object is generally greater as the displacement increases. This means that only local position invariance can be learned. In this letter, we propose a position-invariance learning method that is not overly affected by external object motion. The key aspect of our approach is the use of rapid attention shifts (overt or covert) to provide the necessary object image displacements. In this way, position invariance can be seen to arise from attention shift invariance. The short time between acquisition of images across an attention shift minimizes the change in the image due to motion of the object in space. Thus, in our approach, most of the change in the image is due to the attention shifts (and not to the motion of the object). Clearly, learning invariance with respect to some quantity requires exposure to data that varies only with respect to this quantity. Since we are learning invariance to attention shifts (covert or overt), we require a signal in which the variation is entirely due to attention shifts. This is accomplished by using only those images associated with attention shifts. At other times, the
images are changing due to extraneous factors, such as object motion, and would thus only interfere with the learning of the invariance. Our approach has the additional advantage that the attention shift command can be used as a signal to direct learning. We can, for example, arrange for learning to take place only during the period of time immediately before and after the attention shifts.

2.2 Extraction of Position-Invariant Local Features. The need for development of position invariance arises due to projective distortions and the nonuniform distribution of visual sensors on the retinal surface. These factors result in qualitatively different signals when object features are projected onto different positions of the retina. For example, when a linear object feature in space is projected onto a hemispherical surface, such as the retina, it is relatively undistorted when projected near the optical axis (i.e., near the fovea), whereas its image becomes curved when projected away from the optical axis (e.g., in the periphery). The problem of finding a position-invariant representation for such features can therefore be thought of as that of finding the underlying relationship between various distorted retinal images of the same physical feature at different retinal positions.

We propose that this problem can be simplified if a canonical representation of the feature can be specified. The foveating capability of the human visual system gives us such a canonical representation. It is the role of the foveating system to shift the image of a feature being attended to in the retinal periphery to become centered on the fovea. The foveal image of an object feature is a suitable candidate for the feature's canonical representation since, statistically, among all the retinal images of a feature, the foveal image is the most frequently observed.
Furthermore, the process of fixation and tracking ensures that the foveal image representation is very stable relative to the peripheral images. If we take the neural representation of a feature's foveal image as its canonical representation, then the problem of position-invariant representation of a feature can be interpreted as one of associating the neural representations of all of its peripheral images with its single canonical representation.

At a deeper level, the approach that we are proposing involves executing self-actions of the observer (in this case, saccadic eye movements) and observing the resulting changes in the retinal image. The idea that knowledge of self-action and the resulting sensory changes plays a role in perception is becoming popular. For example, O'Regan and Noë (2001) proposed that visual percepts are based on the sensorimotor contingencies that describe the relation between motor activities and visual sensory input. Our approach to the learning of position invariance is based on the proposal of Clark and O'Regan (2000) that position invariance could be achieved through learning of the sensorimotor contingencies associated with a given feature. They presented a prototype of an association mechanism using the temporal difference learning schema of Sutton and Barto
(1981) to learn the association between pre- and postmotor visual input data, leading to the desired position-invariance properties. A saccade is employed to foveate preattended features, so that associations between presaccadic peripheral stimuli and postsaccadic foveal stimuli (the canonical image) can be learned each time a saccade occurs. Given an input presaccadic neural response X, an association matrix V, and a reinforcement reward λ, their learning rule is as follows:

\Delta V_{ij} = \alpha \left( \lambda(t) + V_{ij}(t-1) \left[ \gamma X_j(t) - X_j(t-1) \right] \right) \bar{X}_j(t),   (2.1)

with

\Delta \bar{X}_j(t) = \delta \left( X_j(t-1) - \bar{X}_j(t-1) \right).   (2.2)
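Equations 2.1 and 2.2 can be sketched numerically as follows. This is our own minimal reading of the printed rule, not Clark and O'Regan's implementation; the function and variable names (`clark_oregan_update`, `trace`, `reward`) are ours, and the parameter values are illustrative.

```python
import numpy as np

def clark_oregan_update(V, X_prev, X_now, trace, reward,
                        alpha=0.1, gamma=0.9, delta=0.5):
    """One Clark-O'Regan-style association step (sketch of eqs. 2.1-2.2).

    V      : (n, n) association matrix
    X_prev : presaccadic input response at t-1
    X_now  : presaccadic input response at t
    trace  : short-term memory trace X_bar carried over from t-1
    reward : postsaccadic (foveal) response lambda(t), same shape as X
    """
    # Trace update (eq. 2.2): move the trace toward the previous input.
    trace = trace + delta * (X_prev - trace)
    # Weight update (eq. 2.1): reinforcement reward plus a temporal-difference
    # term, gated by the memory trace of the presaccadic input.
    td = gamma * X_now - X_prev
    dV = alpha * (reward[:, None] + V * td[None, :]) * trace[None, :]
    return V + dV, trace
```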
The reinforcement reward λ(t) here is the postsaccadic neural response to the foveal feature (the canonical image) and has the same dimension as X. Clark and O'Regan's model (2000) works well in handling geometric distortions of image features due to position variance. However, a limitation of their model is that the association process is very space- and time-consuming, with resource requirements growing exponentially with the number of input neurons.

In this letter, we provide a more efficient version of the Clark-O'Regan approach. Our aim is to reduce the computational requirements of their model while retaining the capability of learning position invariance of local features. We make a modification to their learning rule, using temporal differences over longer timescales rather than just over pairs of successive time steps. In addition, we use a sparse coding approach to reencode the simple neural responses, which reduces the size of the association weight matrix and therefore the computational complexity.

Our model includes an input layer, an encoded layer, and a hidden layer, as well as the connection matrices between the layers. We model the input layer neuronal receptive field profile with a Gabor-like function. We refer to the input layer units as simple neurons, as they have similar properties to the simple cells of the visual cortex. The response of each simple neuron to a retinal image is the convolution of its receptive field profile and the image. The simple neural responses are then encoded by a sparse coding approach (Hyvärinen & Hoyer, 2001; Olshausen & Field, 1996, 1997) to reduce the statistical redundancies in the input pattern. The learning of basis function sets and their sparsely distributed coefficients ensures that only a small number of active neurons in the encoded layer represent the original input pattern. The details of the encoding process are as follows. Let F denote the simple neuronal responses.
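The computation of the simple neuronal responses F can be sketched as a filter bank of Gabor receptive fields convolved with the retinal image. The following is a minimal numpy-only illustration; the grid size, wavelength, and other parameter values are ours, not those used in the paper.

```python
import numpy as np

def gabor(size=9, wavelength=4.0, sigma=2.0, theta=0.0):
    """Gabor receptive-field profile on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * xr / wavelength))

def conv2_same(image, kernel):
    """'Same'-size 2D convolution via zero padding (numpy only)."""
    kh, kw = kernel.shape
    pad = np.pad(image, ((kh // 2,) * 2, (kw // 2,) * 2))
    flipped = kernel[::-1, ::-1]
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * flipped)
    return out

def simple_responses(image, n_orient=8):
    """Responses of 'simple neurons': the image convolved with each Gabor."""
    thetas = [k * np.pi / n_orient for k in range(n_orient)]
    return np.stack([conv2_same(image, gabor(theta=t)) for t in thetas])
```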
A set of basis functions bf and a set of corresponding sparsely distributed coefficients a_i are learned to represent F:

F(j) = \sum_i a_i \, bf_i(j), \quad \text{i.e.,} \quad F = bf * a.   (2.3)
The basis function learning process is a solution to a regularization problem that finds the minimum of a functional E. This functional measures the difference between the original neural responses F and the reconstructed responses bf * a, subject to a sparseness constraint on the coefficients a:

E(bf, a) = \frac{1}{2} \sum_j \left( F_j - \sum_i a_i \, bf_{ij} \right)^2 + \alpha \sum_i \mathrm{Sparse}(a_i),   (2.4)
where

\mathrm{Sparse}(a) = \ln(1 + a^2).   (2.5)
Sparseness is enforced by the second term of equation 2.4, which drives the coefficients a toward small values. In our implementation, E is minimized over its two arguments bf and a, respectively. The minimization is first performed over a, with a fixed value of bf, and then performed over bf. The inner minimization loop over a is performed by iterating the nonlinear conjugate gradient method (Shewchuk, 1994) until the derivative of E(bf, a) with respect to a is zero:

\frac{\partial E(bf, a)}{\partial a_i} = -\sum_j bf_{ij} \left( F_j - \sum_k a_k \, bf_{kj} \right) + \alpha \, \frac{\partial \mathrm{Sparse}(a_i)}{\partial a_i}.   (2.6)
The outer minimization loop over bf is accomplished by simple gradient descent:

\Delta bf_{ij} = \eta \left\langle a_i \left( F_j - \sum_k a_k \, bf_{kj} \right) \right\rangle.   (2.7)
After each learning step, bf is normalized to ensure that ‖bf‖ = 1. The normalization prevents bf from growing without bound, which would otherwise lead to undesired zero values of a. The sparsely distributed coefficients a then become the output of the encoded layer, which we denote as S. A weight matrix between the encoded layer and the hidden layer serves to associate the encoded simple neuron responses related to the same physical stimulus at different retinal positions. Immediately after a saccade takes place, this weight matrix A is updated according to a temporal difference reinforcement learning rule, to strengthen
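The alternating minimization of equations 2.4 to 2.7 can be sketched as follows. For simplicity, this sketch substitutes plain gradient descent for the nonlinear conjugate gradient method used in the paper; the function names and parameter values are illustrative, not the authors'.

```python
import numpy as np

def sparse_encode(F, bf, alpha=0.1, lr=0.05, n_iter=200):
    """Inner loop: minimize E over coefficients a with bf fixed (cf. eq. 2.6).

    F  : (m,) simple neural responses
    bf : (m, n) basis functions, one per column
    Plain gradient descent stands in for the conjugate-gradient step.
    """
    a = np.zeros(bf.shape[1])
    for _ in range(n_iter):
        resid = F - bf @ a
        # dE/da_i = -sum_j bf_ij * resid_j + alpha * Sparse'(a_i),
        # with Sparse(a) = ln(1 + a^2), so Sparse'(a) = 2a / (1 + a^2).
        grad = -bf.T @ resid + alpha * (2 * a / (1 + a**2))
        a -= lr * grad
    return a

def update_basis(bf, F, a, eta=0.01):
    """Outer loop: gradient step on bf (cf. eq. 2.7), then normalize columns."""
    resid = F - bf @ a
    bf = bf + eta * np.outer(resid, a)
    return bf / np.linalg.norm(bf, axis=0, keepdims=True)
```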
the weight connections between the neuronal responses to the presaccadic feature and those to the postsaccadic feature. The neuronal response in the hidden layer H is represented by the following equation:

H = A * S.   (2.8)
The weight matrix A is updated only at those times when a saccade occurs. The updating is done with the following temporal reinforcement learning rule:

\Delta A(t) = \eta \left[ \left( (1 - \kappa) \, R(t) + \kappa \left( \gamma \, \tilde{H}(t) - \tilde{H}(t-1) \right) \right) \tilde{S}(t-1) \right],   (2.9)

where

\Delta \tilde{H}(t) = \alpha_1 \left( H(t) - \tilde{H}(t-1) \right)
\Delta \tilde{S}(t) = \alpha_2 \left( S(t) - \tilde{S}(t-1) \right).   (2.10)
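A minimal sketch of the saccade-triggered update of equations 2.9 and 2.10 follows. This reflects our reading of the printed equations (the placement of the trace tildes is ambiguous in the original); the names and parameter values are ours.

```python
import numpy as np

def saccade_update(A, R, H_tr, H_tr_prev, S_tr_prev,
                   eta=0.1, kappa=0.5, gamma=0.9):
    """One weight update at saccade time (sketch of eq. 2.9).

    R          : postsaccadic encoded response (reinforcement reward)
    H_tr       : hidden-layer trace at time t
    H_tr_prev  : hidden-layer trace at time t-1
    S_tr_prev  : encoded-layer trace at time t-1
    kappa balances the reward term against the temporal-difference term.
    """
    drive = (1 - kappa) * R + kappa * (gamma * H_tr - H_tr_prev)
    return A + eta * np.outer(drive, S_tr_prev)

def update_trace(trace, x, rate):
    """Low-pass memory-trace update (sketch of eq. 2.10)."""
    return trace + rate * (x - trace)
```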
The factor γ is adjusted to obtain desirable learning dynamics. The parameters η, α1, and α2 are learning rates with predefined constant values. In order to investigate the use of the temporal reinforcement in the learning of position invariance, we introduce a weighting parameter κ to balance the importance between the reinforcement reward and the temporal output difference between successive steps. The effect of a varying κ will be demonstrated in section 3.1.

The short-term memory traces H̃ and S̃ of the neural responses in the hidden layer and the encoded layer are maintained to emphasize the temporal influence of a response pattern at one time step on later time steps. These are temporally low-pass filtered traces of the activities of the hidden layer neurons and encoded layer neurons, respectively. Therefore, the learning rule incorporates a Hebbian term between the input trace and the output trace residuals (the difference between the current and the trace activity), as well as between the input trace and the reinforcement signal. This temporal reinforcement learning rule is not the same as traditional trace rules (Földiák, 1991; Wallis, Rolls, & Földiák, 1993; Wallis & Rolls, 1997), which emphasize the Hebbian connection between the input stimulus and the decaying trace of previous output stimuli. Equation 2.9 differs slightly from equation 2.1 in Clark and O'Regan (2000), in that we use the temporal difference of the output trace residuals (over longer timescales) instead of the pairwise temporal difference. This modification enables us to maintain a longer trace of the activities of hidden layer neurons over previous time steps, which helps to obtain more globally optimal solutions.

The reinforcement reward R(t) is the sparsely encoded simple neural response immediately after a saccade. The weight update rule correlates this reinforcement reward R(t) and (an estimate of) the temporal difference of the
Figure 2: Illustration of temporal difference learning under a temporal perceptual stability constraint. The short-term memory trace of neural response in the encoded layer emphasizes the temporal influence of previous neural responses on later training. The weights between the encoded layer neurons and the hidden layer neurons are enhanced when there is a significant temporal difference between the current hidden neural output and the previous output.
hidden layer neural responses with the memory trace of the encoded layer neural responses. The constraint of temporal perceptual stability requires that updating is necessary only when there is a difference between the current neural response and the previous neural responses kept in the short-term memory trace, as illustrated in Figure 2.

Our proposed position-invariance approach is able to eliminate the limitations of Einhäuser et al.'s model (2002) without imposing an overly strong constraint on the temporal smoothness of the scene images. For example, in the case of recognizing a rapidly moving object, a uniform temporal sampling results in the object appearing in significantly different positions on the retina. This produces a temporal discontinuity in the input that causes problems for the Einhäuser et al. model. Even worse, the appearance of the object may have changed due to a change in its pose as it moves. This means that the variation in the input data depends not only on the position of the object, but also on its orientation in space. Such object motion will not affect the learning result of our approach, however, because it employs a nonuniform temporal sampling, in which images are obtained only immediately before and after an attention shift (either overt or covert). As the attention shift takes little time, there is little effect of object motion on the input data. Most of the variation in the position of the object in the image is due to the attention shift.
Learning of Position and Attention Shift Invariance
2303
2.3 Temporal Integration of a Position-Invariant Representation of an Object Across Attention Shifts. The integration level of the executive submodule in our system is concerned with the invariant representation of an object across attention shifts. Position invariance is implicitly incorporated because the attention shift invariance is based on a temporal integration of position-invariant local features. Attention shift information is provided in our model by the control module. This module receives as input the retinal image of an object (a combination of simple features). It constructs a saliency map (Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985) that is used to select the most salient area as the next attention-shift target. The saliency map is a weighted sum of feature saliencies, such as edge orientation, color, and edge contrast. The selection of feature types and their corresponding weights depends on the tasks to be performed. Currently, our implementation uses gray-level images, and we use only orientation contrast and intensity contrast as saliency map features (refer to Itti et al., 1998, for implementation details).

Intensity features I(σ) are obtained from an eight-level gaussian pyramid computed from the raw input intensity, where the scale σ ranges over [0..8]. Local orientation information is obtained by convolution with oriented Gabor pyramids O(σ, θ), where σ ∈ [0..8] is the scale and θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation. Feature maps are calculated by a set of "center-surround" operations, denoted by ⊖, which are implemented as the difference between fine scales (at scale c ∈ {2, 3, 4}) and coarse scales (at scale s = c + δ, with δ ∈ {3, 4}):

I(c, s) = |I(c) \ominus I(s)|   (2.11)

O(c, s, θ) = |O(c, θ) \ominus O(s, θ)|.   (2.12)
In total, 30 feature maps (6 for intensity and 24 for orientation) are calculated and are combined into two conspicuity maps, at the scale (σ = 4) of the saliency map, through a cross-scale addition ⊕:

\bar{I} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(I(c, s))   (2.13)

\bar{O} = \sum_{\theta \in \{0°, 45°, 90°, 135°\}} N\left( \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(O(c, s, \theta)) \right),   (2.14)

where N(·) is a map normalization operator. The saliency map S is obtained by the weighted sum (here, we choose all weights to have the same value) of the two maps:

S = \frac{1}{2} \left( N(\bar{I}) + N(\bar{O}) \right).   (2.15)
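The center-surround and combination steps (equations 2.11 to 2.15) can be sketched as follows. This toy version omits the gaussian and Gabor pyramid construction and uses a crude min-max rescaling in place of Itti et al.'s iterative normalization operator N(·); all function names are ours.

```python
import numpy as np

def center_surround(fine, coarse):
    """|fine - coarse| after upsampling the coarse map (cf. eqs. 2.11-2.12)."""
    factor = fine.shape[0] // coarse.shape[0]
    up = np.kron(coarse, np.ones((factor, factor)))  # nearest-neighbor upsample
    return np.abs(fine - up)

def normalize(m):
    """Crude stand-in for the map normalization operator N(.)."""
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else m

def saliency(intensity_maps, orientation_maps):
    """Equal-weight combination of the two conspicuity maps (cf. eq. 2.15)."""
    I_bar = normalize(sum(normalize(m) for m in intensity_maps))
    O_bar = normalize(sum(normalize(m) for m in orientation_maps))
    return 0.5 * (I_bar + O_bar)
```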
A winner-take-all algorithm (Koch & Ullman, 1985) determines the location of the most salient feature in the calculated saliency map. This location then becomes the target of the next attention shift. In the case of an overt attention shift, the positional information of the target (including saccadic direction and amplitude) is sent to the executive submodule to command execution of a saccade. The target is foveated after the commanded motion, and a new retinal image is formed. The new image is fed into the module as input for the next learning iteration. A covert attention shift, on the other hand, will not foveate the attended target, and therefore the subsequent retinal image input remains unchanged. Since both overt and covert attention shifts play an important role in determining the timing of the learning process at this stage, we use an attention-shift signal instead of a saccade signal as the motor signal from the control module to trigger the integration learning.

In the implementation of our model, an inhibition-of-return (IOR) mechanism is added to prevent immediate attention shifts back to the current feature of interest, allowing other parts of the object to be explored.

The localized image features, which are obtained when part of an object falls in the attention window before and after attention shifts, are fed into the input layer of the four-layer network. Given that position-invariant representations of local features have already been learned, an integration of local features from an object can be learned in a temporal sequence as long as the attention window stays within the range of the object. Here we assume that attention always stays on the same object during the recognition of an object, even in the presence of multiple objects. In our experiments, this assumption is enforced by considering only scenes that contain a single object. In practice, of course, there will be attention shifts between different objects.
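The winner-take-all selection with inhibition of return can be sketched as follows, assuming a precomputed saliency map. The decay scheme and names here are our own illustration, not the paper's implementation.

```python
import numpy as np

def next_fixation(saliency, ior_mask, ior_decay=0.9, suppress=0.0):
    """Pick the most salient unsuppressed location, then inhibit it (IOR).

    saliency : 2D saliency map
    ior_mask : multiplicative suppression mask (1 = unsuppressed)
    """
    s = saliency * ior_mask                     # winner-take-all on masked map
    target = np.unravel_index(np.argmax(s), s.shape)
    ior_mask = 1 - (1 - ior_mask) * ior_decay   # let old inhibition recover
    ior_mask[target] = suppress                 # inhibit the chosen location
    return target, ior_mask
```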
Although we have not yet tested our method in such situations, it is expected that such interobject attention shifts will only slow learning. This is because a given object will typically be viewed in proximity to a wide range of different objects and backgrounds. Thus, there will be no persistent pairing of an object feature with a particular background feature, and no strong association will be made. The only persistent associations will be those of features within the same object.

The learning at this stage includes two further aspects: a winner-take-all interaction between the output layer neural activities and a fatigue effect on continuously active output layer neurons. The winner-take-all interaction ensures that only one neuron in the output layer wins the competition to respond actively to a certain input pattern. The fatigue process is a modified implementation of inhibition of return, which prevents one unit from winning all of the input patterns. The fatigue process gradually decreases the fixation of interest on the same object after several attention shifts. Although in our testing we restrict the scenes to contain only one object at a time, the model must learn several objects. Therefore, it is necessary that the currently active neuron be suppressed for a while
when the learning moves to the next object. The fatigue effect is controlled by a fixation-of-interest function FOI(u). A u value is kept for each output layer neuron in an activation counter initialized to zero. Each counter traces the recent neural activities of its corresponding output layer neuron. The counter automatically increases by 1 if the corresponding neuron is activated and decreases by 1 (down to 0) if not. If a neuron is continuously active over a certain period, the probability of its subsequent activation (i.e., its fixation of interest on the same stimulus) is gradually reduced, allowing other neurons to be activated. A gaussian function of u² is used for this purpose:

FOI(u) = e^{-u^4 / \sigma^2}.   (2.16)
The output layer neural response C₀ is obtained by multiplying the hidden layer neural responses H with the integration weight matrix W. C₀ is then adjusted by multiplying with FOI(u) and is biased by the local estimate C̃₀ of the maximum output layer neural responses (weighted by a factor κ < 1):

C = C_0 \cdot FOI(u) - \kappa \, \tilde{C}_0.   (2.17)
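The fatigue mechanism of equations 2.16 and 2.17 can be sketched as follows; the counter update and parameter values follow our reading of the text and are illustrative.

```python
import numpy as np

def foi(u, sigma=4.0):
    """Fixation-of-interest fatigue factor (cf. eq. 2.16)."""
    return np.exp(-u**4 / sigma**2)

def step_counters(u, active):
    """Activation counters: +1 when a neuron fires, -1 (floored at 0) if not."""
    return np.where(active, u + 1, np.maximum(u - 1, 0))

def output_response(C0, u, C0_est, kappa=0.5):
    """Fatigue-modulated, bias-corrected output response (cf. eq. 2.17)."""
    return C0 * foi(u) - kappa * C0_est
```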
If Cᵢ exceeds a threshold, the corresponding output layer neuron is activated (Cᵢ = 1). The temporal integration of local features is accomplished by dynamically tuning the connection weight matrix between the hidden layer and the output layer. Responses to local features of the same object can be correlated by applying the constraint that output layer neural responses remain constant over time. Given as input the hidden layer neural responses H from the output of the lower layers, and as output the output layer neural responses C, the weight matrix W is dynamically tuned in a Hebbian manner using the short-term memory trace Ĉ of the output layer neural responses C:

\Delta W(t) = \gamma \left[ \left( \hat{C}(t) - \eta \, C(t) \right) H(t) - C(t) \, W(t) \right],   (2.18)

with

\Delta \hat{C}(t) = \alpha \left( C(t) - \hat{C}(t-1) \right).   (2.19)
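A minimal sketch of the integration-stage update (equations 2.18 and 2.19), under our reading of the printed equations; the names and parameter values are ours.

```python
import numpy as np

def integrate_update(W, C, C_hat, H, gamma=0.1, eta=0.2, alpha=0.5):
    """Trace-based Hebbian update of the hidden-to-output weights.

    W     : (n_out, n_hidden) integration weight matrix
    C     : output-layer response at time t
    C_hat : short-term memory trace of C, carried over from t-1
    H     : hidden-layer response at time t
    """
    # Trace update (eq. 2.19): move the trace toward the current output.
    C_hat = C_hat + alpha * (C - C_hat)
    # Weight update (eq. 2.18): Hebbian term on the trace residual,
    # plus a local decay term that keeps each weight bounded.
    dW = gamma * (np.outer(C_hat - eta * C, H) - C[:, None] * W)
    return W + dW, C_hat
```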
The short-term memory trace Ĉ acts as an estimate of the neuron's recent responses. The second term of the learning rule emphasizes the importance of the temporal difference between successive steps in maintaining a stable state. The last term is a local operation that keeps each weight bounded.

2.4 Discussion. The development of the human visual system proceeds gradually from a very basic learning stage, much as a newborn baby learns to recognize the complicated external world by exploring simple
shapes and colors step by step. Similarly, in our model, the integration of responses to local features that belong to the same object is based on lower-level extraction of position-invariant local features that have already been learned to some extent. The integration becomes faster when position-invariant representations of local features are correlated in a temporal order rather than by correlating numerous different neural responses to local features at random positions. This is also a reason that we do not explicitly distinguish overt and covert attention shifts at this stage. In the case of covert attention shifts, although the attended local features are not brought into the fovea, the representations of these peripheral local features are position invariant based on the learning accomplished by the first stage. In the case of overt attention shifts, the attended local features are retargeted to the fovea, and therefore the representations of these local features are already identical to the learned position-invariant representations. Therefore, both types of attention shift can function under this integration.

Our approach is basically a description of a technique for encoding invariant neural responses to changes induced by attention shifts; therefore, attention shifts are an important part of encoding invariant representations for the input patterns, but not necessarily for recognition of an already encoded object. We only need to assume that these attention shifts do occur and that only a single object is being viewed.

For online learning, the two processes of feature extraction and integration are performed concurrently. Because the early learning process of integration is essentially random and has no effect on the later result, we can use a gradually increasing parameter to adjust the learning rate of integration. This parameter can be thought of as an evaluation of the experience gained at the basic learning stage.
The value of this parameter is set near 0 at the beginning of learning and near 1 after a certain amount of learning, at which point the extraction process is deemed to have gained sufficient experience in extracting position-invariant local features.

3 Simulation and Results

We designed two experiments to test our model's position-invariant and attention-shift-invariant properties, respectively. In our model, position invariance is achieved when a set of neurons can discriminate one stimulus from others across all positions. We refer to a set of neurons because our representation is in the form of a population code, in which more than one neuron may exhibit a strong response to a given set of stimuli. The sets of neurons may overlap, but the combinations of actively responding neurons are unique and can therefore be distinguished from each other. We consider attention-shift invariance to have been achieved when the position-invariant set of neurons retains its coherence across attention shifts, as long as such attention shifts stay on the same object.
We designed a third experiment to show that our model performs better than the models of Földiák (1991) and Einhäuser et al. (2002) when the input patterns lack temporal smoothness.

3.1 Demonstration of Position Invariance. To demonstrate the process of position-invariant local feature extraction, we focus on the extraction submodule. This module is composed of three layers: the input layer, the encoded layer, and the hidden layer. We use two different test sets of local features as training data at this stage: a set of computer-generated images of simple oriented linear features and a set of computer-modified images of real objects.

We first implemented a simplified model that has 648 input layer neurons, 25 encoded layer neurons, and 25 hidden layer neurons for testing with the first training data set. The receptive fields of the input layer neurons are generated by Gabor functions over a 9 × 9 grid of spatial displacements, each with eight different orientations evenly distributed from 0 to 180 degrees. The first training image set is obtained by projecting straight lines of four different orientations (0, 45, 90, and 135 degrees) through a pinhole eye model (as shown in Figure 3) onto seven different positions of a spherical retinal surface. The simulated retinal images each have a size of 25 × 25 pixels. The training data are shown in Figure 4A, along with a subset of the input layer receptive fields (see Figure 4B).

Figure 5 shows the 25 basis functions (which are the receptive fields of the encoded layer neurons), trained using Olshausen and Field's (1997) sparse coding approach on the simple neural responses. We found in our experiment that some neurons in the hidden layer responded more actively to one of the stimuli, regardless of its position on the retina, than to all other stimuli, as demonstrated in Figure 6.
For example, neuron 8 exhibits a higher firing rate to line 4 than to any of the other lines, while neuron 17 responds most actively to line 1. The other neurons remain inactive to these stimuli, leaving capacity to respond to other stimuli in the future.

It was next shown that the value of the weighting parameter κ in equation 2.9 had a significant influence on the performance of this submodule. To evaluate the performance, the standard deviation of the activities of the hidden layer neurons is calculated when the submodule is trained with different values of κ (0, 0.2, 0.5, 0.7, and 1). The standard deviation of the neural activities is calculated over a set of input stimuli. The value stays low when the neuron tends to maintain a constant response to the temporal sequence of a feature appearing at different positions. Figure 7 shows the standard deviation of the firing rate of the 25 hidden layer neurons for different values of κ. The standard deviation becomes larger as κ increases. This result shows that the reinforcement reward plays an important role in the learning of position invariance. When κ is near 1, which means the learning depends fully on the
Figure 3: Distorted retinal images obtained when features are projected through a pinhole eye model onto the hemispherical surface of the retina.
Figure 4: (A) Computer-simulated retinal images of lines with four orientations at seven positions used as training data set. (B) A random sample of 100 out of 648 Gabor receptive field profiles of the simple neurons.
Figure 5: Basis functions visualized as the receptive field profiles of the 25 encoded layer neurons. The basis functions were trained using Olshausen and Field’s sparse coding approach.
temporal difference between stimuli before and after a saccade, the hidden layer neurons are more likely to have nonconstant responses.

In our second simulation, we tested image sequences of real-world objects, such as a teapot and a bottle (see Figures 8B and 8C). The images of these objects were projected onto the simulated retina at nine different positions, following routes such as the one illustrated in Figure 8A. Each retinal image has a size of 64 × 48 pixels. The number of neurons in the encoded layer and the hidden layer was increased from the 25 used in the previous experiment to 64. This was required because the size of the basis function set needed to encode the sparse representations should increase as the complexity of the input images increases.
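The standard-deviation measure of position invariance used in these evaluations can be sketched as follows; the array layout and function name are our own choice.

```python
import numpy as np

def invariance_std(responses):
    """Per-neuron standard deviation across positions, averaged over stimuli.

    responses : array of shape (n_stimuli, n_positions, n_neurons).
    Low values indicate position-invariant responses (cf. section 3.1):
    the neuron keeps a constant response as a feature changes position.
    """
    return responses.std(axis=1).mean(axis=0)
```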
Figure 6: Neural activities of the four most active hidden layer neurons responding to computer-simulated data set at different positions. The neuron firing rates for each of the four stimuli (four lines with different orientations) at each of the seven retinal positions are shown. Each neuron has its preferred orientation selectivity across all positions.
Figure 9 shows the neural activities of the four most active neurons in the hidden layer when responding to the two image sequences of a teapot and a bottle, respectively. Neurons 3 and 54 exhibit relatively strong responses to the teapot across all nine positions, while neuron 27 mainly responds to the bottle. Neuron 25 has strong overlapping neural activities to both stimuli. The sets of neurons that have relatively strong activities are different from each other, satisfying our definition of position invariance.

3.2 Demonstration of Attention Shift Invariance. For simplicity in this experiment, we use binary images of basic geometrical shapes such as rectangles, triangles, and ovals. These geometrical shapes are, as in the previous experiment, projected onto the hemispherical retinal surface through a pinhole. Their positions relative to the fovea change as a result of saccadic movements.

Here we use a weighted combination of intensity contrast and orientation contrast to compute the saliency map, as these are the most important and distinct attributes of the geometrical shapes we use in the training. A winner-take-all mechanism is employed to select the most salient area as the next fixation target. After a saccade is performed to foveate the fixation target, the saliency map is updated based on the newly formed retinal image, and a new training iteration begins. Figure 10 shows a sequence of saliency maps calculated from retinal images of geometrical shapes for a sequence of saccades.

Figures 11B and 11D show a sequence of pre- and postsaccadic local features of the retinal images of a rectangular shape falling in a 25 × 25 pixel
Learning of Position and Attention Shift Invariance
2313
Figure 8: Training image sequences of two real objects (B and C). The images in the sequences were taken at nine positions following a path as indicated in A.
Figure 9: Neural activities of the four most active hidden layer neurons responding to two real objects at different positions. The neuron firing rates for each of the two stimuli (a teapot and a bottle) at each of the nine positions are shown.
attention window, respectively. The local features shown in Figure 11 after a saccade are not exactly the ideal canonical foveal images because of errors in calculating the position of the saccadic target. This situation also occurs in human vision, where saccadic eye movements are not always able to place the selected target exactly in the fovea; undershooting the target is in fact the usual situation. An undershot local feature is likely to be refoveated by a subsequent small, corrective saccade. An enhanced algorithm dealing with this undershooting was described in Li and Clark (2002). Even though correction of undershooting is not taken into account in this model, we can still obtain invariance, although the efficiency of the model will be somewhat impaired: the noncanonical foveal features exhibit greater variability than the ideal canonical features and therefore require a longer learning process. The temporal association mechanism is nonetheless still able to associate the various near-canonical
Figure 10: Dynamically changing saliency maps for three geometrical shapes. They are computed from the retinal images after the first six saccades following an overt attention shift. The small, bright rectangle indicates an attention window centered at the most salient point in the saliency map.
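The fixation loop described above, a weighted combination of the two contrast maps followed by winner-take-all selection of the next saccade target, can be sketched as follows. This is a minimal illustration only: the weight `w`, the 2-D list encoding, and the function names are our assumptions, since the letter does not specify this level of detail.

```python
def saliency_map(intensity_contrast, orientation_contrast, w=0.5):
    # Weighted combination of the two contrast maps (2-D lists of floats).
    # The weight w is a hypothetical parameter; the letter states only
    # that a weighted combination of the two contrasts is used.
    return [[w * i + (1.0 - w) * o for i, o in zip(irow, orow)]
            for irow, orow in zip(intensity_contrast, orientation_contrast)]

def next_fixation(smap):
    # Winner-take-all: the most salient location becomes the next
    # fixation (saccade) target.
    best_val, best_loc = float("-inf"), None
    for r, row in enumerate(smap):
        for c, v in enumerate(row):
            if v > best_val:
                best_val, best_loc = v, (r, c)
    return best_loc
```

After each simulated saccade, the saliency map would be recomputed from the new retinal image and `next_fixation` applied again, mirroring the training loop in the text.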
Figure 11: Local features of a rectangular shape before (B) and after (D) an overt attention shift. (A, C) Retinal images of the same rectangle at different positions due to overt attention shifts.
Figure 12: Neural activities of the two most active output layer neurons across attention shifts. The neuron firing rates for the two geometric shapes of a triangle and a rectangle are shown.
neural representations and produce a stable neural response to the same stimulus across transformations. We show in Figure 12 some of the output layer neural responses (neurons 2 and 5) to two geometrical shapes: a rectangle and a triangle. Neuron 2 responds to the rectangle more actively than to the triangle, while neuron 5 has a more active response to the triangle than to the rectangle.
3.3 Comparison with Other Temporal Approaches. In this section, we demonstrate that our approach performs well in situations where the input lacks smoothness in time. Position-invariant learning models that use temporal continuity, such as those of Földiák (1991) and Einhäuser et al. (2002), perform poorly in these situations. We use a digital camera mounted on a computer-controlled pan-tilt unit (PTU) to acquire images around a toy bear. Images of the object are acquired as the PTU randomly changes its pan and tilt positions. The PTU movements are constrained so as to keep the bulk of the object in view at all times. The action of the PTU simulates human eye and head movements, which result
in the displacement of the object features on the imaging surface. Pairs of images before and after each movement are obtained and converted into gray-level images. These image pairs are fed into the model as training data in random order; the resulting time sequence of images is not smooth at all. This is an extreme but nonetheless realistic test, and it clearly demonstrates the difference in performance between our method and methods based on temporal continuity.

We implement Földiák's trace rule (1991) and the learning rule for the position-invariant complex neuron of the top layer as given in Einhäuser et al. (2002). We use the training data to train with both of these rules as well as with our proposed model. The learning results are compared using the mean variance of the output neuron responses over the whole stimulus set. If a model exhibits position invariance, its output neuron responses should remain nearly constant and therefore have a low variance. We show the results produced by the three models in Figure 13. Each time unit in the plot represents 25 learning iterations. The figure shows that our model converges to a stable state very quickly, with a low mean variance. The mean variance in Einhäuser's model is larger than in our approach, and it decreases very slowly over the time interval. Földiák's model produces a response variance that increases with time, implying a complete failure of the learning process for such a nonsmooth input sequence.

4 Conclusions

In this letter, we have presented a neural network model that achieves position invariance. Our approach is based on the study of a more general problem: learning invariance to attention shifts. Attention shifts are the primary reason that images of object features are projected at various locations on the retina. Object motion in the world is rarely the cause of such variation, as pursuit tracking of object features cancels out this motion.
Following Desimone (1990), we treat covert and overt attention shifts as equivalent from the point of view of their effect on the visual cortex. For the task of learning position invariance, the advantage of treating image feature displacements as being due to attention shifts is that attention shifts are rapid and that a neural command signal is associated with them. The rapidity of the shift means that learning can be confined to the short time interval around the occurrence of the shift. This focusing of the learning solves the problems with time-varying scenery that plagued previous methods, such as those proposed by Földiák (1991), Becker (1993, 1999), Körding and König (2001), and Einhäuser et al. (2002). We used an extension of Clark and O'Regan's (2000) association model to learn position invariance across overt attention shifts via temporal difference learning on pairs of pre- and postsaccadic stimuli. The extension involves the use of a sparse coding approach, which reduces the size of the association weight matrix and therefore the computational complexity.
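The idea of a temporal-difference association step on a pre/post-saccadic pair can be sketched as follows for a single invariance unit. The update rule, the learning rate `alpha`, and the vector encoding are illustrative assumptions; this does not reproduce Clark and O'Regan's exact model.

```python
def td_step(w, pre, post, alpha=0.1):
    # w: weights of one invariance unit; pre, post: sparse feature
    # vectors just before and just after a saccade.
    # The TD error is the change in the unit's response across the
    # shift; nudging the presaccadic weights against it drives the
    # response toward invariance across the attention shift.
    r_pre = sum(wi * xi for wi, xi in zip(w, pre))
    r_post = sum(wi * xi for wi, xi in zip(w, post))
    delta = r_post - r_pre  # temporal-difference error
    return [wi + alpha * delta * xi for wi, xi in zip(w, pre)]
```

Iterating this step on a pre/post pair shrinks the response difference across the shift, so the unit comes to respond equally to the pre- and postsaccadic views of the same feature.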
Figure 13: Comparison of the performance of three models (our proposed model, Földiák's, and Einhäuser's, respectively) in learning position invariance with time-varying scenery. Performance is evaluated by the mean variance of the hidden layer neuron responses over the whole stimulus set along a time interval. Each time unit on the x-axis comprises 25 learning iterations.
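The evaluation measure used in Figure 13 can be sketched as follows. The grouping of responses by stimulus is our reading of the text, which specifies only "the mean variance of the neuron responses over the whole stimulus set."

```python
def mean_response_variance(responses_by_stimulus):
    # responses_by_stimulus[s] = a neuron's responses to stimulus s
    # at different positions.  A position-invariant neuron gives
    # nearly constant responses within each list, so its variance,
    # averaged over the stimulus set, is low.
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return (sum(variance(r) for r in responses_by_stimulus)
            / len(responses_by_stimulus))
```

A perfectly invariant neuron scores 0; a neuron whose response flips with position scores higher, matching the ranking of the three models in the figure.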
We apply the constraint of temporal stability across attention shifts and temporally integrate position-invariant neural response patterns of local features within attention windows to attain attention-shift-invariant object representations. We implemented a simplified version of our model and tested it with both computer-simulated data and computer-modified images of real objects. In these tests, local features were obtained from the retinal images falling in an attention window by an attention shift mechanism. The incorporation of the attention shift mechanism speeds up the learning process by actively acquiring useful information about the correlated relationship between different neural responses of the same local feature at various positions, and the
relationship between the partial and the whole (i.e., local features of an object and the object as a whole entity). The results show that our model works well in achieving both position invariance and attention-shift invariance, regardless of retinal distortions. We demonstrated that our method performs well in realistic situations in which the temporal sequence of input data is not smooth, situations in which earlier approaches have had difficulty.

Acknowledgments

We acknowledge funding support from the Institute for Robotics and Intelligent Systems. M.L. thanks Precarn for its financial support.

References

Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint invariant face representations from visual experience by temporal association. In H. Wechsler, P. J. Phillips, V. Bruce, S. Fogelman-Soulié, & T. Huang (Eds.), Face recognition: From theory to applications (pp. 381–390). New York: Springer-Verlag.
Becker, S. (1993). Learning to categorize objects using temporal coherence. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann.
Becker, S. (1999). Implicit learning in 3D object recognition: The importance of temporal context. Neural Computation, 11(2), 347–374.
Bridgeman, B., Van der Heijden, A. H. C., & Velichkovsky, B. M. (1994). A theory of visual stability across saccadic eye movements. Behavioral and Brain Sciences, 17(2), 247–292.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (2000). A recurrent network model for the phase invariance of complex cell responses. Neurocomputing, 32–33, 339–344.
Clark, J. J., & O'Regan, J. K. (2000). A temporal-difference learning model for perceptual stability in color vision. In Proceedings of the 15th International Conference on Pattern Recognition (Vol. 2, pp. 503–506). Los Alamitos, CA: IEEE Computer Society.
Desimone, R. (1990). Complexity at the neuronal level (commentary on "Vision and complexity," by J. K. Tsotsos). Behavioral and Brain Sciences, 13(3), 446.
Deubel, H., Bridgeman, B., & Schneider, W. X. (1998). Immediate post-saccadic information mediates space constancy. Vision Research, 38, 3147–3159.
Einhäuser, W., Kayser, C., König, P., & Körding, K. P. (2002). Learning the invariance properties of complex cells from their responses to natural stimuli. European Journal of Neuroscience, 15, 475–486.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Gross, C. G., & Mishkin, M. (1977). The neural basis of stimulus equivalence across retinal translation. In S. Harnad, R. Doty, J. Jaynes, L. Goldstein, & G. Krauthamer (Eds.), Lateralization in the nervous system (pp. 109–122). New York: Academic Press.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Hyvärinen, A., & Hoyer, P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41, 2413–2423.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. Journal of Neurophysiology, 73(1), 218–226.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
Kikuchi, M., & Fukushima, K. (2001). Invariant pattern recognition with eye movement: A neural network model. Neurocomputing, 38–40, 1359–1365.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Körding, K. P., & König, P. (2001). Neurons with two sites of synaptic integration learn invariant representations. Neural Computation, 13, 2823–2849.
Leopold, D. A., & Logothetis, N. K. (1998). Microsaccades differentially modulate neural activity in the striate and extrastriate visual cortex. Experimental Brain Research, 123, 341–345.
Li, M., & Clark, J. J. (2002). Sensorimotor learning and the development of position invariance. Poster session presented at the 2002 Neural Information and Coding Workshop, Les Houches, France.
Maunsell, J. H. R., & Cook, E. P. (2002). The role of attention in visual processing. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 357, 1063–1072.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335(27), 817–820.
Norman, J. (2002). Two visual systems and two theories of perception: An attempt to reconcile the constructivist and ecological approaches. Behavioral and Brain Sciences, 25(1), 73–96.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11), 4700–4719.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
O'Regan, J. K., & Noë, A. (2001). A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24(5), 939–973.
Perrett, D. I., Rolls, E. T., & Caan, W. (1982). Visual neurons responsive to faces in the monkey temporal cortex. Experimental Brain Research, 47, 329–342.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
Rolls, E. T. (1995). Learning mechanisms in the temporal lobe visual cortex. Behavioural Brain Research, 66, 177–185.
Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27(2), 205–218.
Salinas, E., & Sejnowski, T. J. (2001). Gain modulation in the central nervous system: Where behavior, neurophysiology, and computation meet. Neuroscientist, 7(5), 430–440.
Shewchuk, J. R. (1994). An introduction to the conjugate gradient method without the agonizing pain. Available online at: http://www-2.cs.cmu.edu/~jrs/jrspapers.html.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88(2), 135–170.
Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Wallis, G., Rolls, E. T., & Földiák, P. (1993). Learning invariant responses to the natural transformations of objects. International Joint Conference on Neural Networks, 2, 1087–1090.
Walsh, V., & Kulikowski, J. J. (Eds.). (1998). Perceptual constancy: Why things look as they do. Cambridge: Cambridge University Press.

Received April 23, 2003; accepted April 5, 2004.
LETTER
Communicated by Jonathan Victor
Testing for and Estimating Latency Effects for Poisson and Non-Poisson Spike Trains
Valérie Ventura
[email protected]
Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Determining the variations in response latency of one or several neurons to a stimulus is of interest in several contexts. Two common problems are correlating latency with a particular behavior, for example, the reaction time to a stimulus, and adjusting tools for detecting synchronization between two neurons. We use two such problems to illustrate the latency testing and estimation methods developed in this article. Our test for latencies is a formal statistical test that produces a p-value; it is applicable to Poisson and non-Poisson spike trains via the bootstrap. Our estimation method is model free, fast, and easy to implement, and its performance compares favorably to that of other currently available methods.

1 Introduction

It is often of interest to determine the response latencies of one or more neurons to repeated presentations of a stimulus. One may be interested in adjusting synchronization tools, most commonly the cross-correlogram (CC) and joint peristimulus time histogram (JPSTH), for trial-to-trial effects such as excitability and latency effects (Brody, 1999a, 1999b; Baker & Gerstein, 2001). Another application consists of correlating neural response latency with a particular behavioral variable such as reaction time, as, for example, in Everling, Dorris, and Munoz (1998), Everling, Dorris, Klein, and Munoz (1999), Everling and Munoz (2000), Hanes and Schall (1996), and Horwitz and Newsome (2001). We illustrate both applications in section 5.

Methods to estimate latencies are also varied. For example, Brody (1999b) estimates latencies by minimizing the peak of the CC shuffle corrector, although he cautions that his estimates are flawed when latencies are not, in fact, present. Baker and Gerstein (2001) rectify this problem by providing three alternative estimation methods, all based on detecting the time at which the firing rate increases from the baseline, which we refer to as change-point methods.
Neural Computation 16, 2323–2349 (2004)  © 2004 Massachusetts Institute of Technology
2324
V. Ventura

Two of their estimates possess good properties but are computationally expensive and depend on an assumed model for the spike trains. Baker and Gerstein (2001) also provide a diagnostic that indicates whether excitability and latency effects may be present, although no
measure of significance, for example, a p-value, is given for this diagnostic. Latency estimation methods have also been developed for continuous response waveforms such as electroencephalograms rather than for point processes, although they could perhaps be applied successfully to smoothed spike trains. For example, Woody (1967) takes the latency of a waveform on a particular trial to be the shift that maximizes the correlation between the shifted waveform and a template. In the statistical literature, Pham, Mocks, Kohler, and Gasser (1987) develop a change-point method for continuous signals based on a maximum likelihood paradigm that involves some modeling assumptions and substantial machinery.

In section 2, we describe a method to estimate latencies that, like the method of Woody (1967), uses the whole duration of the neuron's response rather than just the time at which the firing rate increases from baseline, as change-point methods do. It compares well to existing methods in terms of efficiency and computational simplicity. Specifically, our estimates require only calculations of sample means, so they are simple and very fast to obtain, do not require any model assumptions, and have smaller biases and variances than other available methods on a variety of simulated data, as shown in section 3.

Section 4 concerns statistical inferences about latencies. Because estimating nonexistent effects typically adds random noise to statistical procedures, we first propose a formal statistical test for latency effects; by "formal," we mean that we provide not only a diagnostic but also a p-value. We also explain how variances can be calculated for the latency estimates. The methods of this section are straightforward for Poisson spike trains but require a more careful treatment otherwise, which is why we defer them to the end of the article.
Finally, we validate our methods on two real applications in section 5 and conclude in section 6.

2 Latency Estimation

The methods developed in this letter rely on a simple result: the spike times of a Poisson process with firing rate λ(t) can be viewed as a random sample from a distribution with density proportional to λ(t). Although this result applies to Poisson spike trains only, we show that our estimation method applies generally.

Say that K trials of a neuron were recorded under identical repeated presentations of a particular stimulus or experiment. Let λ(t) denote the time-varying firing rate of the neuron. If τk ≥ 0 denotes the latency of trial k and if we assume that the only effect of a latency is to delay the onset of the response to the stimulus by τk, then the firing rate of trial k is λ(t − τk); hence, its spike times can be considered a random sample with distribution proportional to λ(t − τk). Therefore, assuming that there is no source of variability between trials other than latency effects and the random variability associated with the spike generation mechanism, the spike times combined
Testing and Estimating Latencies
2325
over all K trials can be viewed as a random sample with distribution proportional to

\sum_{k=1}^{K} \lambda(t - \tau_k).   (2.1)
The mixture distribution 2.1 has the same "shape" as the peristimulus time histogram (PSTH) of the observed trials. We take our estimates of τ1, . . . , τK to be the values that minimize the variance V(τ) of equation 2.1 with respect to τ = (τ1, . . . , τK), that is,

(\hat\tau_1, \ldots, \hat\tau_K) = \arg\min_{\tau} V(\tau), \qquad V(\tau) = \operatorname{var}\left[\sum_{k=1}^{K} \lambda(t - \tau_k)\right].   (2.2)
To make an analogy from probability densities back to spike trains, this criterion finds the set of latency shifts that make the PSTH the narrowest and therefore the highest. Note that adding the same arbitrary constant τ0 to all latencies shifts the PSTH of the spike trains by τ0 but does not change its shape, which suggests that equation 2.2 will produce latency estimates that are defined relative to one another. However, latency estimates are often used to calculate correlations, for example, with a behavioral variable, which are invariant to a constant shift in either or both variables. In particular, only the relative times between the spikes of two neurons are needed to calculate their CC.

We first select a time window [T1, T2] that includes the period of response of the neuron to the stimulus on all trials. We divide [T1, T2] into bins small enough that at most one spike falls into any bin on each trial; these intervals are centered at times tj. Let nkj = 0 or nkj = 1 record the absence or presence of a spike at time tj for spike train k. Letting τk denote the true latency for trial k, the variance to be minimized with respect to all τk is V(τ) = E2(τ) − E1^2(τ), where E1(τ) and E2(τ) are the first two moments of the distribution of all trials combined, given in equation 2.1. Specifically,

E_1(\tau) = \frac{1}{K} \sum_k \sum_j h_{kj} (t_j - \tau_k) = \frac{1}{K} \sum_k (\bar t_k - \tau_k),   (2.3)

and

E_2(\tau) = \frac{1}{K} \sum_k \sum_j h_{kj} (t_j - \tau_k)^2,   (2.4)
2326
V. Ventura
where h_{kj} = n_{kj}/n_{k·} are the n_{kj} standardized by the total number of spikes n_{k·} = \sum_j n_{kj} for trial k, and \bar t_k = \sum_j h_{kj} t_j. Setting ∂V(τ)/∂τ_i to zero gives, for i = 1, . . . , K,

\tau_i - K^{-1} \sum_k \tau_k = \bar t_i - K^{-1} \sum_k \bar t_k,   (2.5)

with solutions of the form

\hat\tau_k = \bar t_k - \tau_0, \quad \text{for all } k = 1, \ldots, K,   (2.6)
for an arbitrary τ0. That is, the latency estimate for trial k is the sample mean \bar t_k of the spike times that occurred between T1 and T2, shifted by an arbitrary amount τ0. The latencies are relative rather than absolute, since τ0 can take any value. Should absolute latencies be needed, all we need is the absolute time position T, so that T + \hat\tau_k become the absolute latencies; this is illustrated in section 2.1.

Because latency estimates are sample means, smoothing the spike trains first does not affect the latency estimates, but it reduces their variability (see section 3). The benefits of smoothing neural data are discussed more generally in Kass, Ventura, and Cai (2003). We used a kernel smoother, so that if n_{kj} = 1 or 0 denotes the presence or absence of a spike at time t_j, the smoothed spike train has values

n^*_{kj} = (T_2 - T_1)^{-1} h^{-1} \sum_i n_{ki} \, K\!\left(\frac{t_i - t_j}{h}\right),   (2.7)
where the summation is over all the time bins in [T1, T2], K(·) is the standard normal kernel, and h is the bandwidth, which controls the amount of smoothing. To estimate the latencies from smoothed spike trains, we apply our estimation procedure with h_{kj} in equations 2.3 and 2.4 replaced by h^*_{kj} = n^*_{kj}/n^*_{k·}, where n^*_{k·} = \sum_j n^*_{kj}.

Before we illustrate this procedure, recall that it is based on the fact that the spike times of a Poisson process can be considered a random sample with distribution proportional to the firing rate λ(t). However, because the resulting latency estimates are the sample means of the spike times, the method is also valid for non-Poisson spike trains, since the center of mass of a distribution can be estimated consistently from either a random (Poisson) or a correlated (non-Poisson) sample from λ(t); this is illustrated in sections 2.1 and 3. The distribution of the spike trains does, however, affect second-order properties such as variances of estimates, confidence intervals, and statistical tests of hypotheses (see section 4).

2.1 Implementation and Illustration. There are several options for implementing the estimation procedure. For example, one can slide a window of
length T2 − T1 along each trial until the sample spike times in the moving window match across trials, or one can fix the estimation window [T1, T2] and calculate the mean spike times for each trial in that window. We implemented the latter: we calculate equation 2.6, shift the trials, and iterate. Indeed, even though equation 2.5 is deterministic, V(τ) and its derivatives use the spikes in the fixed time window [T1, T2]; once the trials have been shifted, a partially different set of spikes contributes to the new V(τ) and thus yields new latency estimates, which we use to adjust the previous ones. Convergence is achieved when the latency adjustments are negligible or when the decrease in the variance V(τ) of equation 2.2 is smaller than some ε > 0. The second criterion is easier to implement, since it requires monitoring only one quantity; we declare that convergence is reached when the change in V(τ) is smaller than 1% of V(τ) for three successive iterations.

To illustrate the procedure, we simulated K = 500 gamma(8) spike trains with the firing rate shown in the middle top panel of Figure 2, with a 20 Hz baseline and a 60 Hz stimulus-induced rate, and then shifted the trials according to latencies sampled from a uniform distribution on [400, 1000] msec. (More details about the simulated samples are provided in section 3.) The top center panel of Figure 1 shows a raster plot of the first 30 spike trains, and the top left panel shows the PSTH of all K = 500 trials with the underlying firing rate superimposed; the effect of the latencies can clearly be seen. The top right panel plots the true versus the estimated latencies obtained from the first iteration of the algorithm; both sets of latencies are recentered around zero to compare relative rather than absolute latencies. The successive rows of panels show the results of the successive iterations, with raster plots and PSTHs based on the spike trains reshifted according to the current latency estimates.
The algorithm converged after seven iterations. We used τ0 = min(t¯k ) in equation 2.6 to ensure that [T1 , T2 ] contains the response of the neuron to the stimulus even after the trials are shifted. With that choice, the trial with the smallest estimated latency is not shifted, while the other trials get shifted to align with it, as seen in Figure 1. Hence, to transform relative latencies τˆk into absolute latencies T + τˆk , we take T to be the time of onset of response to stimulus based on the PSTH of the shifted trials; T can thus be determined by any change-point method or simply by the naked eye. The resulting estimate of T will be accurate since it is based on a PSTH rather than on a single spike train. In Figure 1, the smallest true latency was τ0 = 418.17 msec, while we obtained T = 418.54 msec by clicking on a computer plot of the PSTH of the shifted trials.
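A minimal pure-Python sketch of the estimator's core step (equation 2.6): each trial's relative latency is the mean of its spike times in the window, recentred with τ0 = min_k t̄_k so the earliest trial gets latency zero. The full procedure then shifts the trials and repeats until V(τ) stabilizes; only the single step is shown here, and the function name and window handling are illustrative.

```python
def relative_latencies(trials, t1, t2):
    # trials: list of spike-time lists, one per trial.
    # The mean spike time in [t1, t2] estimates each trial's latency
    # up to a common shift (equation 2.6); subtracting the minimum
    # mean (tau0) makes the earliest trial the reference, latency 0.
    means = []
    for spikes in trials:
        in_window = [s for s in spikes if t1 <= s <= t2]
        means.append(sum(in_window) / len(in_window))
    tau0 = min(means)
    return [m - tau0 for m in means]
```

For trials that are copies of one another shifted by 0, 50, and 100 msec, this step recovers the relative shifts exactly, which is why a single iteration already gives good estimates in Figure 1.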
2.2 Simultaneously Recorded Neurons. Assume that N neurons are recorded simultaneously and that the same latency applies to all neurons in a particular trial, up to a constant, so that the latency for neuron i in trial k is τk + βi , i = 1, . . . , N, k = 1, . . . , K. Our estimation procedure applies as
[Figure panels D, E, G, H, J, K: raster plots of spike trains; x-axis Time (ms), 0–1500.]
......................... ... ...... . .......
0
500 Time (ms)
1500
0
500 Time (ms)
1500
C
...... . ....................................... . . ... ... .. ..... . ... .................. . ....................................................... . . . . . . . ............................................. . .............. .. ......... . . . . . 701
−300
Estimated latencies −200 100
...... ... .. . .......................... .............. .................. ..................................................... ............ .......... . ...... ... .................. ........... ................... ....... . ...... ................ ......................... ....... ..... . ........ ....................... . .... . ..... .............. .................... ......................... ....... ... ....... ......... ................................................... ... . ........ ............ ............ ......... . ..... ................... . .... ..... ........................... ......................................... ......... ............. ........ ...... ..... ..... ... . . . . . . . . . . . . ... . . . . . . . ..... . . .. . . .. . ......... ...... ......................................................... ................. ...... ... ..... .................. ..... .............. ......... .......... ...... ... ..... ........... ..... ..................................... .................... ......................................... ............. .. . . . .. . ... .. ... ............................ ........................ ..... .......................... ............. ........ ................... ...... ........... ................... ........... .. .... .. . .... ........................................................ ........... .......... ................... .............. ......... .... .. .... .. ..... ............. .............................. ............................ ....... .... ............... ... . .. ... ............... ... .. ...... .......................... ....................... ........ .... ..... ............................ ...................................... ....... .. .. ...... . ..... . ... . .... . . ... . . .... . . ....... .............................. ............. ............. ................... ...... ........... . ....... ....... . .. . . . . ... . . .. ... . 
......................................... ............. ..... ............... .......................................... . ...
0 200 True latencies
F
. .. .......... ............................................ . . . .. . . . ................ .......... ..................................................... . . . . . . . . ..................................... .. . .. . . ......... . ......... . 602
−300
Estimated latencies −300 0 300
B
0 200 True latencies
I
.... .................... . . . . . . . . . . . . . . . . . ... ...... .. . . . ......................... .................................................. .. . . . .. .. .. . . . ....................... .. .............................. . 558
−300
Estimated latencies −200 400
A
Estimated latencies −150 0 150
2328
0 200 True latencies L 538 530 . 526 ....... 522 . ....................
..... ................... . . ....... ........... ................................................... ........ .. . ................... . .......................... . .
−300
0 200 True latencies
Figure 1: Iterations for latency estimation. Successive rows of panels correspond to successive iterations, until convergence. The data consist of K = 500 trials with rate shown in the right panels and latencies uniformly distributed on [400, 1000] msec. The first column shows the PSTHs of the trials shifted by the current latency estimates, and the second column the corresponding rasters; only the first 30 trials were plotted for visibility. The last column shows the centered estimated latencies versus the centered true latencies. The straight line is the first diagonal; the number in the upper left corner is the current value of V(τ ) in equation 2.2. We did not show iterations 4–6, but recorded the successive values of V(τ ) in the bottom right panel.
Testing and Estimating Latencies
2329
before, but with hkj in equations 2.3 and 2.4 replaced by N⁻¹ Σ_{l=1}^{N} h^l_kj, where h^l_kj = n^l_kj / n^l_k·, and n^l_kj = 1 or 0 indicates the presence or absence of a spike at time tj in trial k for neuron l. Basically, all this does is combine the spikes of all N neurons for each trial into a "combined" spike train, with rate

λ(t) = Σ_{l=1}^{N} λl(t − βl),

where λl(t) is the rate of neuron l. This is illustrated in section 5.2 to adjust a CC for latency effects.

3 Procedure Performance and Limitations

In this section, we investigate more fully the properties of our estimates. We assess the effects of the firing rate λ(t) and the number of trials K. We illustrate that gains in efficiency can be obtained from smoothing the spike trains and discuss briefly how to choose the estimation window [T1, T2]. We also show that the latency estimation procedure can readily be applied to spike trains with simple constant excitability covariations in addition to latency effects, but illustrate how more complex excitability effects can bias the estimates.

We measure the efficiency of the estimates using

MSE = { K⁻¹ Σ_{k=1}^{K} [(τ̂k − τ̂•) − (τk − τ•)]² }^{1/2},    (3.1)
where τk and τˆk are true and estimated latencies for trial k, and τ• and τˆ• are their respective means. True and estimated latencies are recentered at zero to compare relative rather than absolute latencies. Equation 3.1 is the square root of what is known in statistics as the mean squared error (MSE); it is a combined measure of bias and variance. To assess the effect of the firing rate, we use the three rates shown in Figure 2 for a variety of baseline and response to stimulus rates; they are referred to as the step(a, b), block(a, b), and transient(a, b) rates, where a and b are the values of the baseline and of the maximum firing rates. As in Baker and Gerstein (2001), a ranges from 10 to 50 Hz in steps of 10 Hz, and b is two to five times a, in steps of one. For block and transient rates, the duration of the response is 1000 msec, after which the rate returns to baseline. Finally, the latencies are uniformly distributed on [400, 1000] msec. Although Baker and Gerstein used only the step rate, the performances of their estimators also apply to the block rate because they are based on change-point methods. Baker and Gerstein did not consider a transient rate, for which the departure from baseline is perhaps harder to detect.
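As a minimal sketch, the recentered error measure (3.1) can be computed as follows; the function name and the example latencies are illustrative, not from the article.

```python
import numpy as np

def relative_latency_mse(tau_true, tau_hat):
    """Square-root MSE of recentered latencies, equation 3.1.

    Both arrays hold one latency per trial; recentering each at its
    mean compares relative rather than absolute latencies.
    """
    tau_true = np.asarray(tau_true, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    err = (tau_hat - tau_hat.mean()) - (tau_true - tau_true.mean())
    return np.sqrt(np.mean(err ** 2))

# Hypothetical latencies (msec): estimates that differ from the truth
# only by a constant shift have zero relative-latency error.
true_lat = np.array([400.0, 550.0, 700.0, 850.0, 1000.0])
print(relative_latency_mse(true_lat, true_lat + 30.0))  # -> 0.0
```

Because both sets of latencies are recentered at zero, a constant offset between estimated and true latencies does not count as error, matching the article's focus on relative latencies.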
2330
V. Ventura
[Figure 2 panels: the top row shows the three rate types, step(a, b), block(a, b), and transient(a, b), over 0–1500 msec. The remaining panels plot MSE or bias, with titles: Baker and Gerstein's variance, Bayesian, and rate change methods, step and block rates (A–C); proposed method, step, block, and transient rates (D–F); Poisson spike trains, step rate (G); number of trials, block(10, b) rate (H); excitability effects, transient rate (I); bias, step, block, and transient rates (J); estimation window [T1, T2], block(30,60) rate (K); and smoothing, transient(30, b) rate (L).]
Figure 2: MSE (3.1) of the latency estimates. (A–C) Methods in Baker and Gerstein (2001). (All other panels) Proposed method. (K) MSE (3.1) as a function of T1 and T2; lighter shades denote lower MSE. (J) Bias for all simulations in D–F. All panels but G and H use 500 gamma(8) spike trains with rate indicated in the title. G uses 500 Poisson spike trains. On most panels, the baseline a is on the x-axis, and the maximum rate is b = k · a, where k is the plotting symbol.
We use two types of spike generating mechanisms, Poisson and gamma spike trains of order q = 8, the latter to match the simulation results in Baker and Gerstein (2001). The former generates spikes independently of the past, while the latter can be used to model spike trains where refractory period effects are present. Recall, however, that our estimation method works generally.

Figures 2D through 2F show equation 3.1, based on K = 500 gamma(8) spike trains with step, block, and transient rates, from which we conclude that the type of firing rate does have some impact on efficiency. In particular, our method is more effective for firing rates that return to baseline than for sustained rates. Figures 2D and 2E are directly comparable to Figures 2F, 3G, and 4G in Baker and Gerstein (2001), which we reproduced for convenience in Figures 2A through 2C. Figure 2H shows that the efficiency does not depend on the number of trials.

Baker and Gerstein (2001) used the mean of the errors [τ̂k − τk] as a measure of the tendency to consistently over- or underestimate the latencies; their methods in Figures 2A through 2C produced biases as large as, in absolute value, 20, 30, and 70, respectively. The equivalent measure for our relative latencies is the mean of [(τ̂k − τ̂•) − (τk − τ•)], which is always zero. We therefore used the median of these errors as our measure of bias, which we plotted in Figure 2J. The bias is close to zero, which suggests that our estimates are randomly scattered around the true relative latencies, as could be seen in Figure 1. We discuss conditions under which estimates can be biased in section 3.1.

Figure 2G suggests that latencies based on Poisson spike trains are less accurate than latencies based on gamma(8) spike trains with the same rate in Figure 2D. This is not surprising, since there is more variability in Poisson than in gamma(8) spike times. We discuss further the variances of the estimates in section 4.
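The two spike-generating mechanisms can be sketched as constant-rate renewal processes (a simplification: the article uses time-varying rates). A gamma(8) train has interspike-interval coefficient of variation 1/√8 ≈ 0.35 versus 1 for Poisson, so its spike times are more regular; the function name and parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_train(rate_hz, duration_s, order=1, rng=rng):
    """Constant-rate renewal spike train on [0, duration_s].

    order=1 gives a Poisson process; order=q gives a gamma(q) process
    with the same mean rate (intervals ~ Gamma(q, 1/(q*rate))).
    """
    mean_isi = 1.0 / rate_hz
    n = int(3 * duration_s * rate_hz) + 20  # generous upper bound on spike count
    isis = rng.gamma(shape=order, scale=mean_isi / order, size=n)
    times = np.cumsum(isis)
    return times[times < duration_s]

# Interspike-interval coefficient of variation: close to 1 for Poisson,
# close to 1/sqrt(8) for gamma(8).
for q in (1, 8):
    isi = np.diff(spike_train(40.0, 200.0, order=q))
    print(q, isi.std() / isi.mean())
```

The lower interval variability of the gamma(8) process is what makes its mean spike times, and hence the latency estimates, more accurate than Poisson in Figures 2D and 2G.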
Figure 2L shows equation 3.1 as a function of the bandwidth h in equation 2.7 used to smooth the spike trains. It is clear that some efficiency can be gained from smoothing, but that the gain is fairly insensitive to h. However, our experience suggests that convergence is most easily assessed with smaller bandwidths. In the rest of this article, we used h = 200 msec.

The efficiencies in Figure 2 all used [T1, T2] = [200, 2200] msec. They can be improved by finer choices of T1 and T2. The gray scale in Figure 2K shows equation 3.1, averaged over 1000 simulated data sets, as a function of T1 ∈ [−100, 1000] msec and T2 ∈ [1300, 3600] msec. We used gamma(8) data with block(30,60) rate; results are qualitatively similar for other spike trains. True latencies are uniform on [400, 1000] msec, so that, based on the PSTH, the response to stimulus starts around 400 msec and reaches its peak around 1000 msec, which is indicated by white vertical lines. The response begins to decline around 1400 msec and returns to baseline around 2000 msec, which is indicated by horizontal white lines. It is clear that our procedure is sensitive to T1 and T2. A good choice for T1 is just before the
firing rate begins to depart from baseline (based on the PSTH). For sustained responses to stimuli (step rate, not shown), we found that the efficiency of our method is not sensitive to the particular choice of T2, whereas for nonsustained rates (block or transient), a good choice for T2 is such that the length of [T1, T2] is roughly equal to the duration of the response to the stimulus. Finally, even with optimal choices of T1 and T2, the efficiencies of the latency estimates are comparable for block and transient rates but not as good for the step or sustained rate, which we observed already in Figures 2D through 2F.

3.1 Limitations. Our latency estimation is based on the assumption that the firing rate is identical across trials except for a time shift. In many contexts, however, the conditions of the experiment or the subject may vary across repeated trials enough to produce discernible trial-to-trial spike train variation beyond that predicted by Poisson or other point processes. We illustrate the consequences of such effects on our latency estimates.

Assume that the state of the neuron varies slowly so that the firing rate of trial k is

αk λ(t − τk)    (3.2)

rather than λ(t − τk), where αk is some positive constant; that is, the firing rate is inflated or deflated by a multiplicative gain αk on each trial, as pictured in Figure 3A. Because the αk are constants, firing rates αk λ(t − τk) and λ(t − τk) are proportional to the same density; hence, they have the same means and thus produce the same estimates of latency. What differs is the number of spike times used for estimation, which affects the variance of the means rather than their average values (see section 4).

Figure 2I shows the efficiency of the latency estimates for 500 gamma(8) spike trains with rate given by equation 3.2, αk uniformly distributed on [0.5, 1.5], τk uniformly distributed on [400, 1000], and λ(t) the transient rate. Figure 2F uses spike trains that are in every way similar except for the excitability covariations. We picked the αk in equation 3.2 so that the total numbers of spikes across the K trials are comparable for Figures 2F and 2I. This explains why the efficiencies are also fairly comparable.

In practice, excitability effects will likely be more complex than equation 3.2, and depending on the degree of deviation from it, the latency estimates will be biased. A partial solution that sacrifices some efficiency is to reduce the estimation window [T1, T2] so that the excitability effects are approximately constant on that window. This is illustrated in Figure 3, based on gamma(8) spike trains with firing rate for trial k

gk(t − τk) λ(t − τk),
[Figure 3 panels: (A, B) firing rates (0–2500 msec) for trials with multiplicative and complex excitability effects; (C) raster plot (0–2500 msec); (D) estimated versus true latencies for T2 = 1300, 1500, 1750, and 2000 msec.]
Figure 3: Excitability effects. (A) Firing rates for four trials with multiplicative excitability effects (see equation 3.2). (B) Firing rates for five trials with complex excitability effects. (C) Raster plot for 50 trials with complex excitability effects. (D) True versus estimated latencies for several values of T2 . The first row of panels is for excitability effects in A, and the bottom row for excitability effects in B.
with λ(t) the transient(20,120) rate and τk uniformly distributed on [400, 600] msec. We generated excitability effects using

gk(t) = {a0k + a1k · h1(t, µ1k)}{1 + ck · a2k · h2(t, µ2k)},

where a0k, a1k, and a2k are gain coefficients that are normally distributed with respective means and variances 0.8 and 0.2², 0.4 and 0.1², and 1 and 0.5², and ck is a Bernoulli random variable with Pr(ck = 1) = Pr(ck = 0) = 0.5. The time component h2(t, µ2k) is a normal density with standard deviation 200 and mean µ2k that we take to be normally distributed with mean 1700 and variance 150²; h2 adds a later component to the neural response at a random time, for 50% of the trials. The other time component h1(t, µ1k) is a gamma density function that takes nonzero values when t > µ1k, where µ1k = 1800 − 2τk depends on the latency; it inflates and lengthens the first peak of the firing rate by random amounts.

Figure 3B shows five fairly extreme firing rates from this model, all with latencies τk = 400 msec, while Figure 3C shows 50 typical spike trains generated from the model with latencies uniformly distributed on [400, 600] msec. The estimated latencies from 500 such spike trains are plotted against the true latencies in the bottom panels of Figure 3D, where it is clear that the estimates become severely biased as T2 becomes larger. For comparison, the top panels in Figure 3D are the corresponding plots from the spike trains shown in Figure 3A, with multiplicative excitability effects (see equation 3.2). For T2 smaller than 1300 msec, latency estimates from trials with either type of excitability effect are comparable. If excitability effects are so extreme that the firing rate does not have a common component across trials, a change-point latency estimation method may be more efficient.

4 Inference for Latencies

Typical consequences of estimating nonexistent effects on subsequent statistical procedures are loss of statistical power and efficiency.
It is therefore important to test if response latencies are constant across trials. We illustrate this below for simulated data, and in section 5 for real data. We also provide standard deviations for the latencies. To carry out a statistical test, we must choose a test statistic T, determine its distribution under the null hypothesis H0 that the response latency is constant, which we refer to as the “null distribution of the test statistic,” and finally compare this distribution to the observed value tobs of the test statistic, typically via the p-value, to determine if the null hypothesis should be rejected. Choosing a test statistic is most easily done if we consider once more the analogy between firing rates and distributions. The spike times of trial k can be viewed as a sample from a distribution proportional to λ(t − τk ). Let µk denote its true mean, with estimate the sample mean tk in equation 2.6.
Then the null hypothesis H0 that all latencies are equal is equivalent to the assumption of equal means in different samples, H0 : µ1 = · · · = µk = · · · = µK, which is routinely dealt with by an ANOVA F-test. The ANOVA test statistic is

F = { Σ_{k=1}^{K} nk (t̄k − t̄·)² / (K − 1) } / { Σ_{k=1}^{K} Σ_{j=1}^{nk} (tkj − t̄k)² / (n· − K) },    (4.1)

where tkj, j = 1, . . . , nk, are the spike times of trial k, t̄k is their mean, t̄· is the sample mean of the spike times across all K trials, nk is the number of spikes in trial k, and n· is the total number of spikes across all K trials. Large values of F provide evidence against the null hypothesis of constant latency; thus, the p-value is

p = Pr(F ≥ fobs | H0 true),    (4.2)
where fobs denotes the observed value of equation 4.1 in the sample, and "| H0 true" means that the probability is calculated with respect to the null distribution of F. It is commonly assumed that under H0, the ANOVA F-statistic has a Fisher F-distribution with K − 1 and n· − K degrees of freedom, and it is common practice to reject H0 when equation 4.2 is smaller than 5% or 1%. This null distribution involves assumptions that we come back to later in this section.

Whatever the outcome of the global test above, it may be of interest to test the latency of a particular trial against a value t0 or to compare the latencies of any two trials. This can be dealt with in one-sample and two-sample t-tests, respectively; note that a two-sample t-test is equivalent to the F-test (see equation 4.1) applied to K = 2 trials. In this article, we consider only the tests of every pair of trials, since choosing a particular point t0 seems arbitrary; moreover, it seems more relevant to test whether particular trials have latencies that differ from those of the other trials, as illustrated in section 5. We report the p-values of all two-sample t-tests in a K × K matrix, where K is the number of trials.

An important point is that the simple excitability effects (see equation 3.2) do not affect either test, because an overall inflated or deflated rate αk λ(t − τk) normalizes to the same distribution as the rate λ(t − τk). What happens is that excitability becomes a sample size effect that is seamlessly taken care of by the test statistic F via the spike counts nk in equation 4.1. Of course, it is unlikely that excitability effects will be exactly of the simple form of equation 3.2, but they may be sufficiently well approximated by that equation on a small testing window [T1, T2]. Section 5.2 presents an application where excitability effects are present, yet our testing and estimation procedures appear to work well.

Global and pairwise tests are illustrated in Figure 4A.
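The K × K matrix of pairwise p-values described above can be sketched as follows, using SciPy's two-sample t-test; the spike-time model here (normal spike times around two latency groups) is an illustrative stand-in, not the article's simulation.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical spike times for K trials: two latency groups, spikes
# roughly normal around (latency + 500) msec within the testing window.
K = 10
latencies = np.r_[np.full(5, 400.0), np.full(5, 700.0)]
trials = [rng.normal(lat + 500.0, 150.0, size=30) for lat in latencies]

# K x K matrix of p-values for H0: trials k and k' share the same latency.
pmat = np.ones((K, K))
for i in range(K):
    for j in range(i + 1, K):
        pmat[i, j] = pmat[j, i] = ttest_ind(trials[i], trials[j]).pvalue

# Pairs within the same latency group should mostly have p > 0.05;
# pairs across the two groups should have very small p-values.
print(np.round(pmat, 3))
```

Thresholding this matrix at 1% and 5% reproduces the white/gray/black coding used in Figure 4A.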
We simulated 100 Poisson trials with block(20, 60) rate, latency 400 msec for the first 50 trials,
and 700 msec for the other 50 trials. This shows clearly on the raster plot. The p-value (see equation 4.2) for the global test is less than 0.00001, suggesting that latencies are variable across trials. The right-most panel shows the K × K matrix of the p-values of the pairwise tests, with white, gray, and black cells corresponding to p-values smaller than 1% (strong evidence that the latencies are different), between 1% and 5% (evidence), and above 5% (no evidence), respectively.

The two mostly black areas of the matrix correspond to pairs of trials that belong to the same latency group. These areas are not perfectly black because, if α denotes the significance level of the test, that is, the probability of erroneously rejecting a true null hypothesis, we should expect to reject approximately a proportion α of the tests that have H0 true. We indeed verified that the proportion of false-positive results (white and gray pixels in the two mostly black areas of the matrix) is just about 5%, and the proportion of white pixels about 1%.

The test appears to lack power, since many pairs of latencies that are different are not discovered, as indicated by the dark pixels (p > 5%) in the mostly white areas of the matrix. However, given a fixed testing window [T1, T2], the ANOVA F-test is in fact known to be the most powerful test; the power happens to be low in this example because the firing rate is low, which yields small sample sizes, and Poisson trials are quite variable. Figure 4B shows that the same test applied to gamma(8) spike trains with the same rate is more powerful (see section 4.2).

To determine the optimal testing window, we conducted a simulation study similar to that in section 2. The power for the global test was estimated by the proportion of significant tests applied to 1000 simulated samples that have latency effects.
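The power estimation just described can be sketched as follows. This is a simplified stand-in, with normally distributed "spike times" instead of Poisson or gamma spike trains and fewer simulations than the article's 1000; all names and parameters are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

def global_latency_test(trials):
    """ANOVA F-test of H0: all trials share the same latency.

    Each entry of `trials` is the array of spike times of one trial;
    under H0 the per-trial mean spike times are equal (equation 4.1).
    """
    return f_oneway(*trials).pvalue

def estimated_power(n_sims=200, K=20, n_spikes=25, shift=150.0, alpha=0.05):
    """Proportion of simulated data sets with latency effects rejected."""
    rejections = 0
    for _ in range(n_sims):
        # Hypothetical data: half the trials delayed by `shift` msec.
        lat = np.r_[np.zeros(K // 2), np.full(K - K // 2, shift)]
        trials = [rng.normal(l + 900.0, 300.0, size=n_spikes) for l in lat]
        if global_latency_test(trials) < alpha:
            rejections += 1
    return rejections / n_sims

print(estimated_power())
```

Repeating this for different choices of [T1, T2] (here, different spike-time windows) is the simulation used to pick the testing window.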
We found that the power was sensitive to the testing window and that, for all rate types, a good choice for T1 is just before the firing rate begins to depart from the baseline, and for T2 approximately when the response to the stimulus ends, based on the PSTH of the unshifted trials.
Figure 4A also shows 95% confidence intervals for τk, obtained as follows. Our estimates τ̂k are sample means, so the central limit theorem applies to give

τ̂k ∼ N(τk, σk²);  (4.3)

that is, τ̂k is approximately normally distributed with mean the true relative latency τk and variance σk². Hence, a 95% confidence interval for the true τk is

τ̂k ± 2σk.  (4.4)
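In code, this interval reduces to a mean and standard error of the spike times in the testing window, using the Poisson-case variance estimate σ̂k² = S²k/nk given in equation 4.5 below. A minimal sketch with simulated spike times (the window and rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
T1, T2 = 300.0, 1200.0                    # assumed testing window (msec)

def latency_ci(spike_times, T1, T2):
    """95% CI for the relative latency of one trial (Poisson case)."""
    t = spike_times[(spike_times >= T1) & (spike_times <= T2)]
    tau_hat = t.mean()                    # sample mean spike time
    s2 = t.var(ddof=1)                    # sample variance S_k^2
    sigma_hat = np.sqrt(s2 / t.size)      # equation 4.5
    return tau_hat - 2 * sigma_hat, tau_hat + 2 * sigma_hat   # equation 4.4

trial = 500.0 + rng.uniform(0.0, 400.0, size=30)   # one simulated trial
lo, hi = latency_ci(trial, T1, T2)
```

Here 2σ̂k stands in for the 1.96-standard-error half-width of an exact 95% normal interval.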
To estimate σk2 for each trial, we consider once more the spike times to be a random sample from a density proportional to the firing rate. Therefore,
Testing and Estimating Latencies
2337
with nk the number of spikes in [T1, T2] for trial k, the estimated variance of τ̂k is

σ̂k² = S²k / nk,   S²k = Σj (tkj − t̄k)² / (nk − 1),  (4.5)
where t̄k and S²k are the usual sample mean and sample variance of the spike times tkj.
The methods proposed in this section so far are fast and straightforward. However, the F-test, t-tests, and confidence interval results hold only if the spike times are a random sample from λ(t − τk) and if λ(t − τk) is proportional to a normal distribution. The second condition is not generally crucial provided the number of spikes in each trial is not too small; this follows from the central limit theorem. The first condition is the most worrisome, since the spike times are random only for Poisson spike trains. For non-Poisson spike trains, we know of no result that specifies the null distribution of F, and thus we will have to either derive it theoretically or obtain an approximation using a bootstrap simulation. The first option is too daunting, if at all possible, so we chose the second.
4.1 Bootstrap Inference. In this section, we provide general bootstrap simulation algorithms for tests and confidence intervals (see Davison and Hinkley, 1997, for a complete bootstrap treatment). Theoretical and bootstrap results are compared in the Poisson case, where we can get both.
Let T denote a test statistic and tobs its value in the sample. For the global test, T is the ANOVA F-statistic (see equation 4.1), and for a pairwise test, T is the two-sample t-statistic, or equivalently the F-statistic (see equation 4.1) evaluated for the two trials under consideration. A general bootstrap testing algorithm follows:

Bootstrap Testing
1. For r = 1, . . . , R:
   (a) Create bootstrap sample r so that it satisfies the null hypothesis H0. Options are described below.
   (b) Calculate t*r, the value of the test statistic T in bootstrap sample r.
2. The histogram of the R values of t*r approximates the null distribution of T. The bootstrap p-value that approximates the exact p-value in equation 4.2 is

pboot = (1 + #{t*r ≥ tobs}) / (R + 1).  (4.6)
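A minimal sketch of this algorithm, with step 1a implemented by nonparametric resampling of the pooled spike times (one of the Poisson options described below) and simulated spike trains standing in for data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def f_statistic(trials):
    return stats.f_oneway(*trials)[0]

def bootstrap_pvalue(trials, R=500):
    """Bootstrap p-value (equation 4.6) for H0: equal latencies."""
    t_obs = f_statistic(trials)
    sizes = [len(t) for t in trials]
    pooled = np.concatenate(trials)          # pooling enforces H0
    t_star = np.empty(R)
    for r in range(R):
        resampled = rng.choice(pooled, size=pooled.size, replace=True)
        # reallocate n_1 spikes to trial 1, n_2 to trial 2, ...
        parts = np.split(resampled, np.cumsum(sizes)[:-1])
        t_star[r] = f_statistic(parts)       # step 1b
    return (1 + np.sum(t_star >= t_obs)) / (R + 1)

# H0 true: every trial has the same latency.
trials_h0 = [400.0 + rng.uniform(0.0, 500.0, 20) for _ in range(10)]
p_h0 = bootstrap_pvalue(trials_h0)

# H0 false: two latency groups, 400 vs. 700 msec.
trials_h1 = [(400.0 if k < 5 else 700.0) + rng.uniform(0.0, 500.0, 20)
             for k in range(10)]
p_h1 = bootstrap_pvalue(trials_h1)
```

The histogram of `t_star` plays the role of the bootstrap null distributions shown in Figures 4C to 4F.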
2338
V. Ventura
[Figure 4 (panels A–F) appears here; see the caption below. Recoverable panel annotations: (C) bootstrap P = 0.922, F approximation P = 0.93; (D) bootstrap P = 0, F approximation P = 0; (E) P < 0.0001% (gamma bootstrap), P < 0.0001% (IMI bootstrap), P = 88.9% (F approximation); (F) gamma(2), gamma(1), gamma(1/2).]
For Poisson spike trains, bootstrap samples in step 1a can be obtained parametrically or nonparametrically as follows (see Cowling, Hall, & Phillips, 1996, for other sampling options):

Poisson Bootstrap Sample
• Nonparametric. (i) Combine the spike times of all K trials. (ii) Sample m· = Σk nk spike times from the set of combined spike times, and form the new spike trains by allocating the first n1 spikes to trial 1, the next n2 spikes to trial 2, . . . , and the last nK spikes to trial K.
• Parametric. (i) Estimate λ̂0(t), the firing rate of the neuron, based on the PSTH of the unshifted trials. (ii) For k = 1, . . . , K, form trial k by generating nk spikes from a Poisson process with mean λ̂0(t). Standard methods to estimate firing rates are gaussian filtering and spline smoothing (see Ventura, Carta, Kass, Olson, & Gettner, 2001).

Figures 4C and 4D show histograms of R = 1000 values of t*r for two samples of K = 100 Poisson spike trains, where T is the F-statistic (see equation 4.1) for the global latency test. Parametric and nonparametric bootstraps produced indistinguishable results. The data in Figure 4C have H0 true (equal latencies), whereas Figure 4D used data with true latencies uniformly distributed on [400, 1000] msec.

Figure 4: Facing page. (A) 95% confidence intervals for the latency estimates of the trials in the raster plot, and K × K matrix of p-values of all pairwise latency tests. Black: P ≥ 5%; gray: 1% < P < 5%; white: P ≤ 1%. The spike trains are Poisson with block(20,60) rate, latencies 400 msec for the first 50 trials, and 700 msec for the other 50 trials. (B) Same as A but for gamma(8) spike trains. (C, D) Asymptotic F-approximation (solid curves) and bootstrap null distributions (histograms) for the F-statistic (see equation 4.1), with observed values fobs indicated by vertical lines. The spike trains are Poisson with block(20,60) rate, with (C) no latencies and (D) latencies uniformly distributed on [400, 1000] msec. (E) Bootstrap gamma (histogram), bootstrap IMI (bold curve), and bootstrap Poisson (dotted curve) null distributions for the test F-statistic (see equation 4.1), with observed values fobs indicated by the vertical line, based on gamma(8) spike trains with block(20,60) rate and latencies uniformly distributed on [400, 600] msec. (We use latencies in [400, 600] msec rather than in [400, 1000] msec so that fobs is within the frame of the plot.) (F) Bootstrap gamma (histograms) and bootstrap IMI (curves) null distributions of the F-statistic (see equation 4.1) based on gamma(2), gamma(1) (that is, Poisson), and gamma(1/2) spike trains with block(20,60) rate and latencies uniformly distributed on [400, 600] msec. The dotted curve in the middle panel is the asymptotic F-approximation.

The Fisher F-distribution is overlaid; it matches the two histograms almost perfectly, as we would expect, since
the spike times are random, and the number of trials is large enough that the firing rate does not have to be bell shaped. The bootstrap and F-test p-values (equations 4.6 and 4.2) are very close, and for both data sets we take the appropriate decision: reject H0 for Figure 4D, with the conclusion that latencies vary across trials, and fail to reject H0 for Figure 4C.
The dashed curves in Figure 4E are the null distributions of equation 4.1 obtained by Poisson bootstrap and by F approximation, this time for gamma(8) rather than Poisson spike trains, with latencies uniformly distributed on [400, 600] msec. Once again, both distributions are practically identical, and thus indistinguishable in the plot, with corresponding p-values (equations 4.2 and 4.6) equal to 88.9%; we fail to reject H0, which is the wrong decision. Also plotted as a histogram is the parametric bootstrap null distribution from a gamma(8) rather than a Poisson model, with rate λ̂0(t) fitted to the unshifted spike trains via gaussian filtering. The resulting null distribution is approximately the "correct" one, since the correct model was used. The corresponding bootstrap p-value (equation 4.6) is zero, so that we now reject H0, the correct decision. Figure 4E illustrates that the validity of a test (bootstrap or not) requires an appropriate model for the data, from which bootstrap samples can be simulated. Model selection for spike trains is discussed briefly in section 4.2.
A few remarks about bootstrap testing are important. A bootstrap sample should be in every way similar to the observed sample; hence the need for an appropriate model. In the context of statistical testing, bootstrap samples should also conform to the hypothetical reality imposed by H0. Here, H0 forces us to assume that the trials have a common firing rate, even if we see with the naked eye that they do not.
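To make step 1a concrete for the parametric option, bootstrap samples under H0 can be simulated from a single common rate fitted to all trials. The sketch below assumes a hypothetical smooth rate in place of λ̂0(t) and generates inhomogeneous Poisson trials by thinning; a gamma model would instead be sampled by rescaling time:

```python
import numpy as np

rng = np.random.default_rng(3)

def lambda0(t):
    """Hypothetical fitted common rate (spikes/sec), t in seconds."""
    return 20.0 + 40.0 * np.exp(-0.5 * ((t - 0.6) / 0.1) ** 2)

def poisson_trial(rate, t_max, rate_max):
    """One inhomogeneous Poisson trial on [0, t_max] by thinning."""
    n = rng.poisson(rate_max * t_max)                 # candidate spikes
    cand = np.sort(rng.uniform(0.0, t_max, size=n))
    keep = rng.uniform(0.0, rate_max, size=n) < rate(cand)
    return cand[keep]

# One bootstrap sample of K = 100 trials sharing the common rate (H0).
boot_sample = [poisson_trial(lambda0, 1.5, 60.0) for _ in range(100)]
mean_count = np.mean([len(t) for t in boot_sample])
```

Because the same rate is used for every trial, any latency effects in the observed data are ignored, exactly as H0 requires.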
Both our parametric and nonparametric bootstraps did satisfy H0: the implicitly (combined spikes) and explicitly estimated common firing rate λ̂0(t) ignored any latency effects, since we fitted the same firing rate to all trials. If the data also contain excitability effects, the bootstrap samples must also contain that extra source of variability. (This issue is developed further in Cai, Kass, & Ventura, 2004.)
The confidence intervals of equation 4.4 hold for Poisson and non-Poisson spike trains, since they are based on the central limit theorem, but the estimate of σk² proposed in equation 4.5 is valid for Poisson data only. We provide a bootstrap alternative that is valid for any spike trains:

Bootstrap Standard Deviations
1. For r = 1, . . . , R:
   (a) Create bootstrap sample r.
   (b) For each spike train k = 1, . . . , K in bootstrap sample r, calculate the mean t̄*kr of the spike times.
2. For k = 1, . . . , K, the bootstrap estimate of σk² is the sample variance of the t̄*kr,

σ̂k² = Σr=1..R (t̄*kr − t̄*k·)² / (R − 1),  where  t̄*k· = R⁻¹ Σr t̄*kr.

Note that, unlike for testing, the model used to simulate bootstrap samples in step 1a should be fitted to the spike trains first shifted according to the latency estimates. Indeed, the final latency estimates are based on shifted trials.
We applied the standard deviation bootstrap algorithm to Poisson and gamma(8) spike trains. In the Poisson case, we obtained bootstrap standard deviations comparable to the analytic result in equation 4.5. The bootstrap standard deviations for gamma(8) spike trains were used to calculate the confidence intervals in Figure 4B.
In this section, we have shown that the bootstrap compares well to asymptotic results in the Poisson case, which gives it credibility when no such asymptotic results are available. For Poisson spike trains, it is safe to use the theoretical results, unless the numbers of trials or spikes are very small or the firing rate is very dissimilar in shape to a normal density; if in doubt, a bootstrap test is also easily done. For non-Poisson spike trains, all we need to perform bootstrap inference is an appropriate model fitted to the observed data, from which bootstrap samples can be simulated.
4.2 Model Selection for Spike Trains. The quality of any test, bootstrap or not, depends on how well the chosen model fits the data; by quality, we refer to how well the actual significance level of the test matches the nominal one. This is illustrated below.
Model selection, for spike trains or any other data, involves proposing competing models and determining which fits the data best. Reich, Victor, and Knight (1998) introduced the power ratio statistic to test whether spike trains can be completely characterized by an inhomogeneous firing rate.
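As a small illustration of such model comparisons, the sketch below fits nested exponential (Poisson) and gamma models to simulated interspike intervals by maximum likelihood and applies a likelihood ratio test of the kind discussed below; the homogeneous rate and all numbers are illustrative assumptions, not the analysis used for Figure 4:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical ISIs from a gamma(8) renewal process with unit mean rate.
isis = rng.gamma(shape=8.0, scale=1.0 / 8.0, size=2000)

# Poisson spiking implies exponential ISIs: fit both nested models by ML.
_, scale_e = stats.expon.fit(isis, floc=0)
ll_expon = stats.expon.logpdf(isis, loc=0, scale=scale_e).sum()

a_hat, _, scale_g = stats.gamma.fit(isis, floc=0)
ll_gamma = stats.gamma.logpdf(isis, a_hat, loc=0, scale=scale_g).sum()

# LR test: the gamma model reduces to the exponential when shape = 1.
lr = 2.0 * (ll_gamma - ll_expon)
p_value = stats.chi2.sf(lr, df=1)
```

A small `p_value` favors the gamma model over the Poisson (exponential-ISI) model, mirroring the comparison reported for the gamma(8) spike trains below.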
Brown, Barbieri, Ventura, Kass, and Frank (2002) proposed a goodness-of-fit test based on the time rescaling theorem; a goodness-of-fit test determines whether a particular model appears to fit, without requiring an alternative model. In the case where two competing models can be fit by maximum likelihood, it is standard to use the likelihood ratio (LR) test; common tests like t-tests and ANOVA F-tests are LR tests, and Pearson chi-squared tests are asymptotic approximations to LR tests.
The space of models is infinite, so an exhaustive model search is unrealistic, as is the search for the "true" model. A good start is to consider the point process models that have been found to fit some neural data well, for example, inhomogeneous renewal processes, Poisson processes with refractory periods, and integrate-and-fire models (Barbieri, Quirk, Frank, Wilson, & Brown, 2001; Johnson, 1996; Reich et al., 1998). We also like to consider the
inhomogeneous Markov interval (IMI) models of Kass and Ventura (2001) because they include as particular cases inhomogeneous Poisson and homogeneous renewal process models, and thus are well suited to fit data that do not deviate much from these two large classes of models.
To illustrate these ideas, consider once more the gamma(8) spike trains used in Figure 4E. Likelihood ratio tests were performed to compare Poisson, gamma(8), and IMI models; the (true) gamma(8) model was favored over the others (P < 10⁻³). For the sake of illustration, we still performed a parametric bootstrap based on the IMI model, with the resulting null distribution overlaid as a bold curve; it is much closer to, although not equal to, the gamma bootstrap null distribution than the bootstrap Poisson distribution was. The discrepancy between the IMI and the correct gamma(8) bootstrap null distributions arises because inhomogeneous renewal processes are not IMI models (only homogeneous renewal processes are). The test based on the IMI bootstrap is conservative, so that the actual significance level is much smaller than the nominal level.
Figure 4F is the same as Figure 4E, but based on simulated gamma(q) spike trains with q = 2, 1, and 0.5, which are, respectively, less, as, and more variable than Poisson spike trains; the F approximation is also plotted (dotted curve) in the middle panel. Although the IMI bootstrap null distributions do not quite match their gamma counterparts, these plots suggest that the IMI model yields reasonable inference for data that are close to Poisson or renewal processes.
5 Applications
This section illustrates our latency testing and estimation methods on two examples; it does not attempt to draw conclusions about the functional characteristics of the types of neurons we used.
5.1 Correlating Neural Activity and Behavior.
Individual neurons in the frontal eye field of a rhesus monkey were recorded while the animal performed a memory-guided saccade task or a delayed-response task, as described in Roesch and Olson (2003). Figure 5 shows the PSTH of 20 identical trials of a particular neuron in that experiment. The cue indicating the direction of eye movement is presented for the short period between the two vertical bars. After a waiting period, the central light is turned off at time t = 0, at which point the monkey is to execute the eye movement. For each trial, the onset of movement was recorded. We want to investigate whether the latency of the neural response predicts (or is correlated with) the onset of movement.
Before we apply the latency test, we check whether the spike trains are Poisson. The mean versus variance plot in Figure 5A suggests that the deviation from the Poisson assumption is minimal; this is confirmed by the goodness-of-fit test of Brown et al. (2002; not shown). The test of Cai et al. (2004) did
not detect any excitability effects. We thus used the simple latency test of section 4. The global test for latencies suggests strongly that latencies vary across trials, with a p-value smaller than 10⁻⁸. Figure 5C shows the overlaid smoothed PSTHs of the original and shifted trials; as expected, the latter is slightly higher. The latency estimates were robust to the choice of estimation window [T1, T2] within the guidelines of section 3. Figure 5B shows the latency estimates plotted against the time of onset of movement, with latencies and onsets, respectively, recentered to have sample mean zero. We conclude that for this particular neuron, there exists a strong correlation (0.72, with P < 0.00001) between the latency of the neural response to the stimulus and the onset of eye movement.
5.2 Adjusting the Cross-Correlogram for Latency Effects. The raw cross-correlogram (CC) displays the correlation between the spike trains of two neurons at a series of time lags. It is not used directly to assess synchrony, because other sources of correlation typically contribute to it. The most common such source is modulation of the firing rates by an experimental stimulus, although it is easily accounted for by subtracting the shuffle corrector (Perkel, Gerstein, & Moore, 1967). However, the remaining features in the shuffle-corrected CC merely indicate that sources of correlation other than stimulus-induced correlations exist between the two neurons. A potential source is synchrony, but other sources include variations in the amplitudes of the firing rates and in latencies, which Brody (1999a, 1999b) refers to as excitability and latency effects. Baker and Gerstein (2001) provide a list of references that report such effects. Therefore, for the CC to be a useful and reliable tool for assessing synchrony, it must be adjusted for all possible sources of correlation other than synchrony.
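To make the shuffle corrector concrete, here is a sketch on binned, simulated spike counts: the two neurons share only a stimulus-locked elevation of firing probability, so the raw CC shows a stimulus-induced peak that subtracting the corrector (here a simple trial-shift variant) removes. All shapes, rates, and the shift form of the corrector are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def cross_correlogram(x, y, max_lag):
    """x, y: (trials, bins) spike-count arrays; mean product at each lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(lags.size)
    for i, lag in enumerate(lags):
        if lag >= 0:
            cc[i] = np.mean(x[:, : x.shape[1] - lag] * y[:, lag:])
        else:
            cc[i] = np.mean(x[:, -lag:] * y[:, : y.shape[1] + lag])
    return lags, cc

# Two neurons, 50 trials of 200 bins, correlated only through a shared
# stimulus-locked elevation of firing probability in a few bins.
common = rng.random(200) < 0.10
x = ((rng.random((50, 200)) < 0.02) |
     ((rng.random((50, 200)) < 0.5) & common)).astype(float)
y = ((rng.random((50, 200)) < 0.02) |
     ((rng.random((50, 200)) < 0.5) & common)).astype(float)

lags, raw = cross_correlogram(x, y, 20)
_, shuffled = cross_correlogram(x, np.roll(y, 1, axis=0), 20)  # corrector
corrected = raw - shuffled
```

Since these simulated neurons have no synchrony, excitability, or latency effects, the corrected CC is flat near zero; with real data, residual structure would still need the latency adjustment described next.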
The customary way to account for latency effects is to estimate the latencies, realign the spike trains, and proceed with the shifted spike trains as one would with the original spike trains (Brody, 1999b; Baker & Gerstein, 2001). We illustrate our latency testing and estimation on two neurons recorded simultaneously in the primary visual cortex of an anesthetized macaque monkey (Aronov, Reich, Mechler, & Victor, 2003, Figure 1B; units 40106.st; 67.5 degrees spatial phase; other spatial phases produced similar results). The data consist of 64 trials, shown in Figure 6B. The stimulus, identical for all trials, consists of a standing sinusoidal grating that appears at time 0 and disappears at 237 msec.
The mean versus variance plots in Figure 6A lie somewhat above the diagonal, which suggests more variability than predicted under the Poisson assumption. The goodness-of-fit tests of Brown et al. (2002) in Figure 6A indeed confirm that the data deviate slightly from Poisson, since the empirical versus model quantile curves do not lie completely within the 99% joint confidence bands. The IMI model provides a better fit, although the improvement is only marginal. The slight deviation from Poisson could be
[Figure 5 appears here: (A) PSTH and mean versus variance plot; (B) onset of movement versus latency estimates (Corr = 0.72, p < 0.00001); (C) smoothed PSTHs and raster plots of the original and shifted trials. See the caption below.]
Figure 5: (A) PSTH for a neuron in Roesch and Olson (2003), based on 20 identical trials. The cue indicating the direction of the eye movement is presented between the two vertical bars; the cue for the movement is at time t = 0. Mean versus variance plot of the spike trains, based on time bins ranging from 2 to 20 msec. The solid line is the first diagonal. (B) Onset of movement versus neural response latency estimates. The solid line is the first diagonal, and the dotted line the fitted linear regression of onset on latency. Raster plot of the 20 spike trains with filled circles marking the latency estimate and open circles the onset of movement. (C) Overlaid smoothed PSTHs and raster plots of the original and shifted trials.
Testing and Estimating Latencies
partly due to excitability effects, which were found to be significant based on the test for excitability in Cai et al. (2004). Although the deviations from the Poisson model may be important from a functional standpoint, they are not large enough from a statistical standpoint to warrant the more complicated bootstrap procedures for non-Poisson spike trains, especially since the evidence for latency effects is overwhelming, as discussed below.

Figure 6B shows the raster plots for the two neurons, from which it appears that the first few and last trials have latencies that differ from those of the other trials. This is confirmed by the global tests for latencies, with a p-value of 0.045 for neuron 2 and p-values smaller than 10⁻⁸ for neuron 1 and for the two neurons combined. The lack of statistical power for neuron 2 is due to the sparseness of the spikes. The value of the test statistic for neuron 1 is t_obs = 3.3, which is extreme enough to leave no doubt about the presence of latency effects under Poisson, gamma, or IMI models. Figure 6B also shows the matrix of p-values for all pairs of trials for the two neurons combined, with white and gray pixels corresponding to p-values smaller than 1% and 5%, respectively. If all trials had equal latencies, we would expect about 5% white and gray pixels, whereas we have about 30%. This confirms the outcome of the global tests. Additionally, there is a clear pattern in the pairwise p-values: the mostly black area in the middle indicates that the latencies of all trials are similar, except for the first few and last trials, as could in fact be seen by the naked eye in the raster plots.

Next, we estimated the latencies for the two neurons separately. The presence of excitability effects is not of overwhelming concern because the estimation windows [T1, T2], shown in Figure 6B, are short enough that the effects on these intervals are presumably well approximated by equation 3.2.
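Under the Poisson assumption, the global test for equal latencies reduces to a one-way ANOVA on spike times grouped by trial (for non-Poisson trains, the same F-statistic is referred to a parametric bootstrap null distribution). The following is a minimal illustrative sketch of the Poisson case; the data and all names are ours, not the article's code.

```python
# Sketch of the global latency test under the Poisson assumption:
# a one-way ANOVA for equality of mean spike times across trials.
# Data and function names are illustrative, not from the original code.
import numpy as np
from scipy import stats

def latency_anova(trials):
    """trials: list of 1-D arrays of spike times (one array per trial)."""
    groups = [np.asarray(t, dtype=float) for t in trials if len(t) > 0]
    return stats.f_oneway(*groups)

rng = np.random.default_rng(0)
# 15 trials with a common response window and 5 trials delayed by 150 ms
base = [rng.uniform(0, 600, 30) for _ in range(15)]
late = [rng.uniform(150, 750, 30) for _ in range(5)]
F, p = latency_anova(base + late)
print(F, p)   # a small p-value flags unequal latencies across trials
```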
Figure 6B shows a plot of the latency estimates of one neuron versus the other, which shows that the two sets of latencies covary, with correlation 0.69. To shift the spike trains, we used the latency estimates based on the combined spikes of the two neurons (see section 2.2) because they are less variable.

Finally, Figure 6C shows the cross-correlograms (CC) for these data, corrected for the correlation induced by modulations of the firing rates, along with 95% confidence bands. The leftmost panel is for the observed spike trains, from which one might conclude that there is synchronous activity at small lags. However, these effects disappear entirely once the CC is adjusted for the correlations induced by the latencies. We also produced a CC based on the observed spike trains after removing the first few and last trials, which appear to have different latencies. This CC is qualitatively similar to the CC adjusted for latencies, although the confidence bands are wider because fewer trials are used.

6 Conclusion

We have developed statistical procedures for testing and estimating latency effects in spike trains obtained from identical repeats of an experiment. The
[Figure 6 appears here; its panels are described in the facing-page caption below. Panel A: mean versus variance plots and Poisson/IMI quantile plots for neurons 1 and 2. Panel B: raster plots before and after shifting, the matrix of pairwise p-values, and neuron 1 versus neuron 2 latency estimates (Corr = 0.69). Panel C: cross-correlograms for the observed, middle, and shifted spike trains.]
main attraction of our methods is their simplicity. Moreover, they appear to be efficient and powerful based on a large number of simulated spike trains. We also applied our methods to two real data sets and obtained reasonable results.

We proposed a formal statistical procedure to test for unequal latencies across trials; by "formal," we mean that we provide not only a diagnostic but also a p-value. For Poisson spike trains, this test is the usual analysis of variance (ANOVA) test for the equality of several means. For non-Poisson spike trains, we still use the ANOVA F-statistic, although we obtain its null distribution via a parametric bootstrap. We use an inhomogeneous Markov interval (IMI) model to fit non-Poisson spike trains when competing parametric models do not fit the data as well.

Our estimation method consists of finding the set of shifts that minimizes the spread of the resulting PSTH. We show that this minimization criterion is equivalent to finding the shifts such that the means of the shifted spike times are equal in all trials. Our estimates therefore require only calculations of sample means, so they are simple and very fast to obtain, and they do not require any model assumptions. We applied our method successfully to Poisson and non-Poisson spike trains with various rates, including spike trains that contain simple multiplicative excitability effects.

In situations where more complicated excitability effects are present, our estimates can be biased, so it may be preferable to use a change-point method as, for example, in Baker and Gerstein (2001). But our latency estimates can still be used as starting values in other latency estimation algorithms. Indeed, change-point methods are based on detecting the rate change from baseline, typically on each trial separately, so that the latency estimates thus obtained do not benefit from any information that may be contained in other trials.
Our method is based on the differences in firing rate from trial to trial. Including this extra information in change-point estimation procedures is likely to improve them.
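The moment-based estimator described above can be condensed into a few lines: because minimizing the spread of the PSTH is equivalent to equalizing the mean spike times, each trial's latency estimate is simply its mean spike time in the estimation window minus the grand mean. This is an illustrative sketch with synthetic data, not the article's code.

```python
# Moment-based latency estimation: shift each trial so that the mean
# spike time inside the estimation window is the same for all trials.
# Purely illustrative data; names are ours.
import numpy as np

def estimate_latencies(trials, t1, t2):
    means = np.array([t[(t >= t1) & (t <= t2)].mean() for t in trials])
    return means - means.mean()      # shifts sum to zero by construction

def shift_trials(trials, latencies):
    return [t - d for t, d in zip(trials, latencies)]

rng = np.random.default_rng(1)
true_lags = rng.normal(0.0, 10.0, size=30)                     # ms
trials = [np.sort(rng.uniform(0, 100, 50) + lag) for lag in true_lags]
est = estimate_latencies(trials, -50.0, 200.0)
shifted = shift_trials(trials, est)
print(np.corrcoef(true_lags, est)[0, 1])   # strongly positive here
```

Only sample means are involved, which is why the estimator is fast and model free.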
Figure 6: Facing page. (A) Mean versus variance plots and goodness-of-fit tests for Poisson and IMI models (Brown et al., 2002) for two simultaneously recorded neurons in the primary visual cortex of an anesthetized macaque monkey (Aronov et al., 2003). (B) Raster plots with estimation windows [T1, T2] (top) and after the trials are shifted (bottom). Matrix of p-values for all pairwise tests (see section 4). Estimates of latencies for neuron 1 plotted against those of neuron 2, with the (0,1) line (solid), the fitted linear regression (dotted line), and the sample Pearson correlation. (C) Cross-correlograms adjusted for firing-rate modulation (shuffle corrected) and 95% confidence bands for the observed and the shifted spike trains, as well as for the observed trials that appear to have constant latencies. We used bins of 1.3 msec. The central bin is not plotted because a recording artifact prevents testing of synchrony at lag 0.
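The firing-rate (shuffle) correction behind the cross-correlograms of Figure 6C can be sketched generically: subtract from the raw cross-correlogram a shift predictor computed by pairing each trial of neuron 1 with a different trial of neuron 2, which removes correlation attributable to shared rate modulation. This is a standard textbook sketch under our own naming, not the article's implementation (circular lags are used for brevity).

```python
# Shuffle-corrected (shift-predictor) cross-correlogram sketch.
# x, y: arrays of binned spike counts, shape (trials, bins).
import numpy as np

def cross_correlogram(x, y, max_lag):
    lags = np.arange(-max_lag, max_lag + 1)
    return np.array([(x * np.roll(y, -l, axis=1)).sum(axis=1).mean()
                     for l in lags])

def shuffle_corrected(x, y, max_lag):
    raw = cross_correlogram(x, y, max_lag)
    # Pair each trial of x with the *next* trial of y: correlation that
    # survives this pairing is due to rate modulation only.
    predictor = cross_correlogram(x, np.roll(y, 1, axis=0), max_lag)
    return raw - predictor

rng = np.random.default_rng(2)
rate = 0.05 * (1 + np.sin(np.linspace(0, 6 * np.pi, 400)))
x = (rng.random((60, 400)) < rate).astype(float)   # shared rate modulation,
y = (rng.random((60, 400)) < rate).astype(float)   # but no true synchrony
cc = shuffle_corrected(x, y, 20)                   # hovers around zero
```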
Our latency estimates are obtained thus far in a completely nonparametric way. We do not assume a particular model for the firing rate or for the spike generation mechanism; spike trains do not have to be Poisson or gamma. But the absence of a specific model does not preclude assumptions; in particular, our procedure produces meaningful estimates under the basic assumption that the firing rates for all trials are proportional to one another except for a time shift. This begs two extensions. First, if the basic assumption is met, can we improve our procedure by explicitly making use of the common element, that is, λ, between all trials? Second, if the basic assumption is not appropriate (for example, if more complicated excitability effects exist), can we modify the procedure to allow for this? Our answer to both questions involves a more statistical approach than we have used. A full treatment is beyond the scope of this article and is, in fact, the topic of a future article that treats the estimation, and adjustment, of latency and excitability effects (Cai et al., 2004).

Acknowledgments

This work was supported by grants N01-NS-2-2346 and 1R01 MH64537-01 from the National Institutes of Health.

References

Aronov, D., Reich, D. S., Mechler, F., & Victor, J. D. (2003). Neural coding of spatial phase in V1 of the macaque monkey. J. Neurophysiol., 89, 3304–3327.
Baker, S. N., & Gerstein, G. L. (2001). Determination of response latency and its application to normalization of cross-correlation measures. Neural Computation, 13, 1351–1377.
Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (2001). Construction and analysis of non-Poisson stimulus-response models of neural spike train activity. J. Neurosci. Methods, 105, 25–37.
Brody, C. D. (1999a). Correlations without synchrony. Neural Computation, 11, 1537–1551.
Brody, C. D. (1999b). Disambiguating different covariation. Neural Computation, 11, 1527–1535.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14(2), 325–346.
Cai, C., Kass, R. E., & Ventura, V. (2004). Trial-to-trial variability and its effect on time-varying dependence between two neurons. Manuscript submitted for publication.
Cowling, A., Hall, P., & Phillips, M. J. (1996). Bootstrap confidence regions for the intensity of a Poisson point process. Journal of the American Statistical Association, 91, 1516–1524.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Everling, S., Dorris, M. C., Klein, R. M., & Munoz, D. P. (1999). Role of primate superior colliculus in preparation and execution of anti-saccades and pro-saccades. J. Neurosci., 19, 2740–2754.
Everling, S., Dorris, M. C., & Munoz, D. P. (1998). Reflex suppression in the anti-saccade task is dependent on prestimulus neural processes. J. Neurophysiol., 80, 1584–1589.
Everling, S., & Munoz, D. P. (2000). Neuronal correlates for preparatory set associated with pro-saccades and anti-saccades in the primate frontal eye field. J. Neurosci., 20, 387–400.
Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.
Horwitz, G. D., & Newsome, W. T. (2001). Target selection for saccadic eye movements: Prelude activity in the superior colliculus during a direction-discrimination task. J. Neurophysiol., 86, 2543–2558.
Johnson, D. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–299.
Kass, R. E., & Ventura, V. (2001). A spike train probability model. Neural Computation, 13, 1713–1720.
Kass, R. E., Ventura, V., & Cai, C. (2003). Statistical smoothing of neuronal data. Network: Computation in Neural Systems (special issue on Information and Statistical Structure in Spike Trains), 14, 5–15.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Pham, D. T., Möcks, J., Köhler, W., & Gasser, T. (1987). Variable latencies of noisy signals: Estimation and testing in brain potential data. Biometrika, 74(3), 525–533.
Reich, D. S., Victor, J. D., & Knight, B. W. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. J. Neurosci., 18(23), 10090–10104.
Roesch, M. R., & Olson, C. R. (2003). Impact of expected reward on neuronal activity in prefrontal cortex, frontal and supplementary eye fields and premotor cortex. J. Neurophysiol., 90, 1766–1789.
Ventura, V., Carta, R., Kass, R. E., Olson, C. R., & Gettner, S. N. (2001). Statistical analysis of temporal evolution in single-neuron firing rates. Biostatistics, 1(3), 1–20.
Woody, C. D. (1967). Characterization of an adaptive filter for the analysis of variable latency neuroelectric signals. Med. Biol. Engng., 5, 539–553.

Received August 20, 2003; accepted April 8, 2004.
LETTER
Communicated by Charles J. Wilson
Two-State Membrane Potential Fluctuations Driven by Weak Pairwise Correlations

Andrea Benucci
[email protected]
Paul F.M.J. Verschure
[email protected]
Peter König
[email protected]
Institute of Neuroinformatics, University and ETH Zürich, 8057 Zürich, Switzerland
Physiological experiments demonstrate the existence of weak pairwise correlations of neuronal activity in mammalian cortex (Singer, 1993). The functional implications of this correlated activity are hotly debated (Roskies et al., 1999). Nevertheless, it is generally considered a widespread feature of cortical dynamics. In recent years, another line of research has attracted great interest: the observation of a bimodal distribution of the membrane potential defining up states and down states at the single cell level (Wilson & Kawaguchi, 1996; Steriade, Contreras, & Amzica, 1994; Contreras & Steriade, 1995; Steriade, 2001). Here we use a theoretical approach to demonstrate that the latter phenomenon is a natural consequence of the former. In particular, we show that weak pairwise correlations of the inputs to a compartmental model of a layer V pyramidal cell can induce bimodality in its membrane potential. We show how this relationship can account for the observed increase of the power in the γ frequency band during up states, as well as the increase in the standard deviation and fraction of time spent in the depolarized state (Anderson, Lampl, Reichova, Carandini, & Ferster, 2000). In order to quantify the relationship between the correlation properties of a cortical network and the bistable dynamics of single neurons, we introduce a number of new indices. Subsequently, we demonstrate that a quantitative agreement with the experimental data can be achieved, introducing voltage-dependent mechanisms in our neuronal model such as Ca2+ and Ca2+-dependent K+ channels. In addition, we show that the up states and down states of the membrane potential are dependent on the dendritic morphology of cortical neurons.
Furthermore, bringing together network and single cell dynamics under a unified view allows the direct transfer of results obtained in one context to the other and suggests a new experimental paradigm: the use of specific intracellular analysis as a powerful tool to reveal the properties of the correlation structure present in the network dynamics.

Neural Computation 16, 2351–2378 (2004) © 2004 Massachusetts Institute of Technology
A. Benucci, P. Verschure, and P. König
1 Introduction

In this study, we address the relationship between two experimentally observed phenomena. At the network level, correlated spiking activity between ensembles of neurons has been described in recent years. At the cellular level, the observation that the membrane potential dynamics of single neurons can show distinct up states and down states has received a lot of attention.

Regarding the first phenomenon, multielectrode recordings in cat visual cortex have demonstrated that pairs of neurons sharing similar orientation tuning properties tend to have synchronized spiking activity (Singer, 1993). This finding has been confirmed in different species (Bair, 1999) and different cortical areas (Salinas & Sejnowski, 2001). The synchronization pattern depends on the properties of the stimulus. For example, when coherently moving gratings or randomly moving dots are used as visual stimuli, they elicit cortical activity that displays pairwise correlations of different degree (Gray, Engel, König, & Singer, 1990; Usrey & Reid, 1999). Moreover, cross-correlation analysis shows that rather than being precisely synchronized, optimally driven neurons lead over suboptimally driven ones (König, Engel, & Singer, 1995a) (see Figure 1B). This suggests that under realistic conditions, cortical dynamics is highly structured in the temporal domain (Roskies et al., 1999). The impact of a large number of randomly timed or synchronized inputs on the subthreshold dynamics of single neurons has been studied in simulations (Salinas & Sejnowski, 2000; Bernander, Koch, & Usher, 1994; Destexhe & Paré, 1999; Shadlen & Newsome, 1998; Softky & Koch, 1993).
However, taking into account our current knowledge of the correlation structure of cortical activity, we have little insight into the cellular dynamics under realistic conditions (Singer, 1993; Lampl, Reichova, & Ferster, 1999; Douglas, Martin, & Whitteridge, 1991; Stern, Kincaid, & Wilson, 1997; Agmon-Snir & Segev, 1993; Mel, 1993; Destexhe & Paré, 2000; Singer & Gray, 1995).

Regarding the second phenomenon, a number of intracellular studies have shown that the membrane potential of neurons does not take on any value between rest and threshold with equal probability, but rather assumes either a depolarized state, associated with spiking activity, or a resting state in which the cell is silent (see Figures 2A and 2B, right panels). This behavior has been observed in different animals and brain structures (Anderson, Lampl, Reichova, Carandini, & Ferster, 2000; Stern, Jaeger, & Wilson, 1998; Steriade, Timofeev, & Grenier, 2001). This bistability of the membrane potential is referred to as up states and down states, and its biophysical properties have been characterized (Wilson & Kawaguchi, 1996; Lampl et al., 1999; Douglas et al., 1991; Stern et al., 1997; Lewis & O'Donnell, 2000; Wilson & Groves, 1981; Kasanetz, Riquelme, & Murer, 2002). The origin of up states and down states has been related to presynaptic events (Wilson & Kawaguchi, 1996); however, the underlying mechanisms have not yet been identified.
High-Order Correlations in Neural Activity
Figure 1: Model neuron and the temporal structure of its input. (A) Reconstructed layer 5 pyramidal cell (left) with a schematic of five input spike trains sharing weak pairwise correlations (right). For every pair of spike trains, there are moments in time when the probability of firing together is high and pairs of synchronized spikes occur. Due to the high level of input convergence, higher-order events emerge statistically and are shown as triplets in the example (Benucci et al., 2003). (B) Cross-correlation analysis of paired extracellular recordings from cat area 17 while using moving bars as visual stimuli (data taken from König et al., 1995a). The peak size is proportional to the strength of the correlation. The shift of the peak indicates a phase lag of the firing of one neuron relative to the other. In general, optimally driven cells tend to lead over suboptimally driven ones. Solid line: Gabor fit for parameter estimation. (C) Cross-correlogram of the synthetic data. Solid line: Gabor fit of the cross-correlogram.
Here, we take a theoretical approach to examine the dynamics of the membrane potential in single neurons given a physiologically constrained representation of the temporal properties of its afferent signals.

2 Materials and Methods

In the following, we describe a detailed model of a cortical neuron and a procedure to produce synthetic input spike trains with physiologically realistic first-order (i.e., firing rates) and second-order (i.e., pairwise correlations) statistical properties. Subsequently, our methods of analysis are introduced.
2.1 The Cellular Model. A morphologically detailed layer 5 pyramidal cell (Bernander, Koch, & Douglas, 1994; Bernander, Koch, & Usher, 1994; Destexhe & Paré, 2000) is simulated using the NEURON simulation environment (Hines & Carnevale, 2001) (see Figure 1A). (The original anatomical data are kindly made available by J. C. Anderson and K. A. C. Martin.) We deliberately keep the cell as simple as possible to avoid the introduction of strong a priori hypotheses. Nevertheless, the resulting simulated neuron is of considerable complexity. The parameters for the passive properties, the HH-like channels in the soma, and the synapses (4000 AMPA, 500 GABAa, 500 GABAb) are selected according to the available literature and previous computational studies (Bernander, Koch, & Douglas, 1994; Agmon-Snir & Segev, 1993; Bernander, Koch, & Usher, 1994; Mel, 1993; Destexhe & Paré, 2000). The parameter sets used are reported in Tables 1 through 3 (see the appendix). The files used to implement voltage-dependent mechanisms, as indicated in the tables, are freely available online from the NEURON web page (http://www.neuron.yale.edu). For the cell as a whole, we tune the parameters to obtain a consistent input-output firing-rate transfer function.

Figure 2: Facing page. Intracellular somatic recordings. (A) The membrane potential of the model neuron is shown during optimal stimulation (left panel). Spiking episodes (burst-like behavior) are associated with the depolarized states of the membrane potential (up states). Following spiking activity, there are periods in which the simulated cell is in a more hyperpolarized state and silent (down states). The pairwise correlation strength of the input spike trains has been set to 0.1 and the mean input firing rate to 85 Hz. The alternation between up states and down states has also been found in several experimental studies using intracellular recordings (right panel).
(B) Histogram of the distribution of the membrane potential recorded under the conditions shown in A, after removal of the action potentials. The membrane potential does not take any value between the up states and down states with equal probability, but the histogram shows two peaks (left panel). Similar biphasic behavior in the membrane potential has been observed experimentally (right panel). (C) If the cell in the simulation is stimulated solely by uncorrelated background activity (6 Hz), the bimodality disappears (left panel). The corresponding trace of the membrane potential is shown in the inset. Similar "shapes" in the histograms of the membrane potential under similar stimulation conditions have also been observed experimentally (right panel). Note that when not using visual stimulation, Anderson et al. (2000) found bimodal as well as nonbimodal distributions of the membrane potential. A nonbimodal example is shown here (see also section 4). (D) The duration of the up states has been increased 10 times with respect to Figure 2A due to the introduction of Ca2+ and Ca2+-dependent K+ channels in the modeled L5 pyramidal cell. (E) Relationship between the firing rate in the up states and the mean membrane potential with Ca2+ and Ca2+-dependent K+ currents. The input pairwise correlation strength is 0.1, and the mean input firing rate is 30 Hz.
Table 1: Simulation Parameters for the Active Mechanisms. [The table is flattened in this extraction. Its columns are: current (INa, ICa, IK, IC, IA, IM, Ih, IL), ion (Na+, Ca2+, K+), location (soma, dendrites, apical), Erev (mV), Ḡ (mS/cm2), exponents p and q, τact (ms), u1/2 (mV, active/inactive), slope k (active/inactive), and implementation file (iaspec.mod, iap.mod, ical.mod, iapspec.mod, ic.mod, ianew.mod, iA.mod, km.mod, IH.mod); the row-wise values cannot be reliably recovered here.]
Table 2: Synaptic Parameters.

Type     ḡ (nS)   er (mV)   τ1 (ms)   τ2 (ms)   τpeak (ms)
GABAa    1        −70       4.5       7         5.6
GABAb    0.1      −95       35        40        37.4
AMPA     20       0         0.3       0.4       0.35
NMDA     26.9     0         4         120       14

Table 3: Passive Properties.

Erest    −65 mV
Cm       1 µF
Gm       0.07 mS/cm2
Ri       90 Ω·cm
T        36 °C
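As a consistency check (ours, not the article's), the τpeak column of Table 2 follows from τ1 and τ2 if each synapse is taken to be a standard dual-exponential conductance g(t) ∝ exp(−t/τ2) − exp(−t/τ1), as in NEURON's Exp2Syn mechanism; the peak then occurs at τpeak = τ1τ2/(τ2 − τ1) · ln(τ2/τ1).

```python
# Verify that the tau_peak values in Table 2 match the analytic peak time
# of a dual-exponential conductance (a consistency check, not code from
# the article).
import math

table2 = {                       # type: (tau1, tau2, tau_peak) in ms
    "GABAa": (4.5, 7.0, 5.6),
    "GABAb": (35.0, 40.0, 37.4),
    "AMPA":  (0.3, 0.4, 0.35),
    "NMDA":  (4.0, 120.0, 14.0),
}

def peak_time(tau1, tau2):
    """Time of the maximum of exp(-t/tau2) - exp(-t/tau1)."""
    return tau1 * tau2 / (tau2 - tau1) * math.log(tau2 / tau1)

for name, (t1, t2, tp) in table2.items():
    assert abs(peak_time(t1, t2) - tp) < 0.05 * tp, name
```

All four rows agree with the formula to within rounding, supporting the dual-exponential reading of the table.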
2.2 Input Stage, First-Order Statistics. The input stage reproduces measured anatomical and physiological conditions (Douglas & Martin, 1990). Layer 5 neurons in cat’s visual cortex do not receive significant input from the lateral geniculate nucleus; the overwhelming majority of inputs are cortico-cortical afferents (Gabbott, Martin, & Whitteridge, 1987). The distributions of orientation preferences of the afferent excitatory signals to our model neuron are graded and match the experimentally observed specificity of intra-areal corticocortical connections (Kisvarday, Toth, Rausch, & Eysel, 1997; Salin & Bullier, 1995): 57% of the inputs originate from neurons with similar (±30 degrees) orientation preference, and 30% and 13% of afferents originate from neurons with a preferred orientation differing by between 30 and 60 degrees and 60 and 90 degrees, respectively, from the neuron under consideration. This connectivity automatically provides the cell with a feedforward mechanism for its orientation tuning. Whether this mechanism correctly describes the origin of orientation-selective responses in primary visual cortex is still unresolved (Sompolinsky & Shapley, 1997) and not within the scope of this study. To take into account these physiological conditions, we separate the total of 5000 afferents of the simulated neuron according to the localization of the respective synapse in layer 1 to layer 5 into five groups. This gives the flexibility to make selective changes in the correlation properties of inputs targeting different cortical layers. Each of the five groups was subdivided into 36 subpopulations of afferents with similar orientation preference, resulting in a discretization of ±5 degrees. This mimics the situation of input spike trains from cells having different tuning properties than the target neuron. The size of the subpopulations ranged linearly from 25 to 100 cells, for orthogonally tuned inputs to identically tuned inputs, respectively. 
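The connectivity statistics above can be sketched in a few lines (illustrative only; the actual model additionally splits the 5000 afferents by target layer and into 36 subpopulations of linearly graded size).

```python
# Draw the orientation-preference offsets of the 5000 afferents so that
# 57% / 30% / 13% fall within 0-30, 30-60, and 60-90 degrees of the
# target cell's preferred orientation (sketch, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n_afferents = 5000
bands = [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0)]
band_idx = rng.choice(3, size=n_afferents, p=[0.57, 0.30, 0.13])
lo = np.array([bands[i][0] for i in band_idx])
hi = np.array([bands[i][1] for i in band_idx])
# Uniform offset within the chosen band, with a random sign
offsets = rng.uniform(lo, hi) * rng.choice([-1.0, 1.0], size=n_afferents)
fractions = np.bincount(band_idx, minlength=3) / n_afferents
print(fractions)   # close to [0.57, 0.30, 0.13]
```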
The firing rates of the inputs are determined with a "typical" visual stimulus in mind (i.e., an oriented grating). The maximum input firing rate of the optimally stimulated neurons is set to 85 Hz. Nonoptimally stimulated cells are assumed to have an orientation tuning of ±30 degrees half width at half height, and the firing rate is reduced accordingly. In addition, background activity is simulated with uniform, uncorrelated 6 Hz activity. Inhibitory inputs are implemented with a broader orientation tuning than excitatory inputs.

2.3 Input Stage, Second-Order Statistics. The rationale behind the synthetic spike train generation algorithm is that for each pair of spike trains, there are intervals in time when the probability of correlated spiking activity is increased. The method closely resembles other published algorithms for producing correlated spike trains (Stroeve & Gielen, 2001; Feng & Brown, 2000), and the source code is available from the authors upon request. The statistical properties of the synthetic spike trains conform to known physiological constraints, such as the average strength and precision of correlation, and for this reason we will refer to them as physiologically realistic inputs.

We do not introduce a dependence of the strength of correlation on the orientation tuning of the neurons. This conforms to the hypothesis that the synchronization patterns induced in the network reflect global stimulus properties and are not a fixed property of the network (Singer & Gray, 1995). In particular, when only a single stimulus is presented, the strength of synchronization of neuronal activity in the primary visual cortex over short distances is found not to depend on the orientation tuning of the neuron (Engel, König, Gray, & Singer, 1990).

The details of the algorithm are as follows: within the overall time window of analysis, T (typically 10 seconds), time epochs are selected during which the probability of spiking is increased for all 5000 cells in the population afferent to the simulated neuron.
Nevertheless, neurons spike only during a randomly chosen subset of these epochs, and for any two given cells, there is always a nonzero overlap between such subsets. The total number of possible overlaps depends, in a combinatorial way, on the size of the subsets, which is a controlled parameter of the simulation. It is the number and extent of these overlaps that determine the pairwise correlation strength and the different degrees of temporal alignment of the spikes.

The epochs are time intervals centered on specific points in time, whose duration and distribution are controlled parameters of the algorithm. We used a Poisson process to distribute these points in time, and their frequency was fixed at 75 Hz. The duration of the time epochs was 10 ms. The algorithm is very flexible: choosing a distribution other than a Poisson one (exponential or gamma of any order) would allow the creation of correlations with or without overall oscillatory activity (Gray & Viana di Prisco, 1997; Engel et al., 1990). The duration of the time epochs determines the precision of the correlations, and it can be changed to affect the overlaps between the epochs within the same spike train. The frequency of the time epochs is itself a free parameter. Spikes are assigned within the
epochs according to a distribution whose skewness and integral are parametrically controlled. The skewness controls the shape of the peaks in the cross-correlograms, with the possibility of creating very “sharp” or “broad” correlation peaks without changing the correlation strength itself. In all the simulations, we used a gaussian distribution. The integral of the distribution controls the correlation strength. Thus, the absolute fraction of spikes assigned to the epochs changed depending on the desired pairwise correlation strength. The remaining spikes are randomly distributed according to a Poisson process with constraints related to refractoriness (2 ms in the simulation). This ensures that the coefficient of variation (CV) of each spike train ranges around one (Dean, 1981). A key point of the simulation is that by varying the frequency and the duration of the time epochs, it is possible to create pairwise correlations in the population with different degrees of high-order correlation events (see the following section for more details). The temporal dynamics of the input stage thus reproduce in a controlled fashion the correlation strengths and time lags that have been observed experimentally. In the following, the temporal precision of the correlations will be held constant, with the width of the peaks in the cross-correlograms always on the order of 10 ms. Furthermore, the time lags of the correlations are determined by a parameter that is kept fixed in all simulations (König, Engel, & Singer, 1995b). We vary only the correlation strength, that is, the height of the peaks in the cross-correlogram. We will refer to these dynamical features as the correlation structure of the inputs (König, Engel, & Singer, 1995a; Roskies et al., 1999) (see Figures 1B and 1C). In our model, the inhibitory inputs follow the same correlation structure as the excitatory inputs. All simulations are analyzed in epochs of 10 s duration.
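The epoch-based scheme described above can be sketched as follows. This is a simplified illustration, not the authors' code (which they state is available on request): the function name and all parameter values other than the 75 Hz epoch rate, 10 ms epoch duration, and 85 Hz mean rate are our assumptions, and refractoriness is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_correlated_trains(n_cells=50, T=10.0, rate=85.0,
                           epoch_rate=75.0, epoch_dur=0.010,
                           frac_in_epochs=0.3, subset_frac=0.5):
    """Epoch-based correlated Poisson spike trains (simplified sketch).

    Epoch centers are drawn from a Poisson process; each cell spikes
    inside a random subset of the shared epochs, so any two cells share
    some epochs and are thereby pairwise correlated.
    """
    # Poisson-distributed epoch centers over the analysis window T
    n_epochs = rng.poisson(epoch_rate * T)
    centers = np.sort(rng.uniform(0.0, T, n_epochs))
    trains = []
    for _ in range(n_cells):
        n_spikes = rng.poisson(rate * T)
        n_corr = int(frac_in_epochs * n_spikes)   # spikes placed in epochs
        # each cell uses only a random subset of the shared epochs
        my_epochs = rng.choice(centers, size=max(1, int(subset_frac * n_epochs)),
                               replace=False)
        # gaussian placement inside an epoch (shapes the correlogram peak)
        corr_spikes = rng.choice(my_epochs, size=n_corr, replace=True) \
                      + rng.normal(0.0, epoch_dur / 2, n_corr)
        # remaining spikes are uncorrelated Poisson background
        bg_spikes = rng.uniform(0.0, T, n_spikes - n_corr)
        trains.append(np.sort(np.clip(np.concatenate([corr_spikes, bg_spikes]),
                                      0.0, T)))
    return trains

trains = make_correlated_trains()
```

Varying `subset_frac` and `frac_in_epochs` plays the role of the overlap and integral parameters discussed in the text.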
For the quantitative evaluation of our data, we use a five-parameter Gabor fit of the cross-correlograms (König, 1994). We calculate the pairwise correlation strength as the area of the peak of the Gabor function divided by the area of the rectangular portion of the graph under the peak, delimited by the offset.
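As a rough sketch of this quantification: the five-parameter Gabor fit itself requires a nonlinear fitting routine, so the fragment below replaces the fitted offset with a simple median-baseline estimate; the function names, bin widths, and peak window are our choices, not the article's.

```python
import numpy as np

def cross_correlogram(t1, t2, max_lag=0.1, bin_w=0.002):
    """Histogram of spike-time differences t2 - t1 within +/- max_lag (s)."""
    diffs = t2[None, :] - t1[:, None]          # all pairwise lags
    diffs = diffs[np.abs(diffs) <= max_lag]
    bins = np.arange(-max_lag, max_lag + bin_w, bin_w)
    counts, edges = np.histogram(diffs, bins=bins)
    lags = 0.5 * (edges[:-1] + edges[1:])
    return lags, counts

def correlation_strength(lags, counts, peak_half_width=0.01):
    """Peak area divided by the offset area under the peak: a simplified
    stand-in for the five-parameter Gabor fit used in the article."""
    offset = np.median(counts)                  # flat baseline estimate
    in_peak = np.abs(lags) <= peak_half_width
    peak_area = np.sum(counts[in_peak] - offset)
    base_area = offset * np.sum(in_peak)
    return peak_area / base_area
```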
2.4 Higher-Order Correlations. Experimentally, neuronal interactions in the cortex are typically measured in terms of correlations between the activity of pairs of neurons. Hence, the algorithm used here synthesizes spike trains with a specified pairwise correlation that matches the correlations observed in the cortex (see Figures 1B and 1C). Nevertheless, due to the high level of convergence observed in the cortex, higher-order statistical events must and do appear. This holds even for the low values of the experimentally observed pairwise correlations (Benucci, Verschure, & König, 2003; Destexhe & Paré, 2000; Bothé, Spekreijse, & Roelfsema, 2000; Grün, Diesmann, & Aertsen, 2001). Accordingly, the above-described algorithm not only generates weak pairwise correlations between the input spike trains, but also creates high-order correlations in the presynaptic population of neurons. Higher-order events indicate episodes of the presynaptic dynamics characterized by spiking activity from a large fraction of afferent neurons, occurring together within a small time window on the order of a few milliseconds. How these nearly synchronized events naturally emerge from the pairwise correlation constraint has been formally investigated (Benucci et al., 2003; Bothé et al., 2000). It can be intuitively understood by considering that the number of pairwise coincidences rises quadratically with the number of afferents, whereas the number of spikes available to generate such coincidences rises only linearly. It follows that spikes have to be “used” multiple times to generate coincidences; that is, higher-order correlations must appear. With an increasing number of inputs, the effect is amplified. This is important when considering realistic values for the number of afferent inputs to a given cortical cell, typically of the order of 10^4. At this level of convergence, the statistical effect explained above becomes dominant, and higher-order events are a prominent feature of the dynamics (Mazurek & Shadlen, 2002).

2.5 Analysis of Intracellular Dynamics. The intracellular dynamics is evaluated using different measures in the time and frequency domain. The up states and down states of the subthreshold membrane potential are determined as in Anderson et al. (2000) using two thresholds. A sliding window of 50 ms is used to find segments in which the membrane potential is above the up threshold for more than 60% of the time. We use a similar procedure for identifying the down states (i.e., the membrane potential is below a down threshold). For the cumulative probability distribution of the membrane potential, any section where the membrane potential exceeds the up threshold is included in the analysis.
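The two-threshold, sliding-window state detection can be sketched as follows. The 50 ms window and 60% criterion follow the text; the threshold voltages are illustrative placeholders, since the article does not list them here.

```python
import numpy as np

def detect_states(vm, dt=0.001, up_thr=-60.0, down_thr=-75.0,
                  win=0.050, frac=0.60):
    """Label up and down states in a membrane-potential trace (mV).

    A sample is marked 'up' when Vm exceeds up_thr for more than 60%
    of the surrounding 50 ms window, and 'down' analogously using
    down_thr.  Threshold values are illustrative assumptions.
    """
    w = int(win / dt)
    above = (vm > up_thr).astype(float)
    below = (vm < down_thr).astype(float)
    kernel = np.ones(w) / w                      # sliding-window average
    up = np.convolve(above, kernel, mode='same') > frac
    down = np.convolve(below, kernel, mode='same') > frac
    return up, down
```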
To eliminate the spikes and check for the inverse relationship between the spike threshold and the slope of the rising phase of the action potential, we use the same procedure as in Azouz and Gray (2000). Furthermore, we compute a number of indices characterizing the relation of the subthreshold dynamics to observable measures. These indices are defined in section 3.

2.6 Control Simulations. To verify the scope and robustness of the results found, we ran several simulations varying the values of the parameters. The rationale is that if a small change in the value of a given parameter causes a dramatic change in the results, then the mechanisms the parameter refers to should be considered critical for the emergence of the phenomenon studied. But if the changes are minor and smoothly affect the results, the associated mechanism is considered to be relevant for the modulation and quantitative features of the phenomenon, but not for the emergence of the effect per se. We changed the width of the tuning curves of the inhibition and excitation by a factor of ±10% without noticing significant changes in the degree of bimodality. We also checked for the robustness of the results with respect to changes in the time constants and conductance peak amplitudes of AMPA and GABA synapses. In the extreme cases of long time constants (NMDA synapses) and large (twice the mean) conductance amplitudes of AMPA synapses (see the tables for mean values), the cell was driven to an epileptic-like state. In the case of GABA, the cell would be totally silenced. In between these extremes, the cell showed smooth changes in the degree of bistable dynamics. We used exactly the same principles when performing controls with voltage-dependent mechanisms, such as calcium and sodium voltage-dependent currents, muscarinic channels, or anomalous rectifying currents (see the tables for complete listings of mechanisms used). We varied the values of the parameters around their means (as reported in the tables) and noticed only smooth modulatory effects. These controls confirm the generality of the phenomena as reported below. Finally, for controls regarding the neuronal morphology, we used two strategies: we took a different cell model, and we reduced the whole cell, or part of the dendritic tree, to single equivalent compartments while keeping the electrotonic properties unaffected (Destexhe, 2001). The second cell model (a spiny stellate neuron) was kindly made available by Y. Banitt and I. Segev. It includes active mechanisms such as Ca2+ dynamics, fast repolarizing K+ currents, Ca2+-dependent K+ currents, spike frequency adaptation, and Na+ channels (parameters as in Banitt & Segev, personal communication, 2003). The input spike trains are generated with the MATLAB software package.

3 Results

3.1 Emergence of Up and Down States. When we provide the model neuron with correlated inputs as explained, the membrane potential at the soma is characterized by up and down states of the dynamics (see Figures 2A and 2B, left panels). However, when the cell is stimulated with uncorrelated background activity, this bistability of the dynamics disappears (see Figure 2C, left panel).
Indeed, in physiological recordings in primary visual cortex using stimulation paradigms that are not associated with correlation structures (awake cats during spontaneous activity), up and down states are not observed (Anderson et al., 2000; Steriade et al., 2001; see Figure 2C). When comparing the results of the simulation to recently published data (Azouz & Gray, 2003), the similarities are obvious. Compared to other experimental studies, however, a remarkable difference in timescales is apparent (see Figure 2A). In the purely passive model, the duration of the up states is typically around 30 ms, which is at least a factor of 10 lower than the experimental value. As discussed in Wilson and Kawaguchi (1996), the duration of the up and down states is mainly determined by the kinetic properties of Ca2+ and Ca2+-dependent K+ currents. Including these active mechanisms in our model shows that this also holds true in the simulation (see Figure 2D). In comparison to the passive model, the duration of up states increases by a factor of 10. It reaches a duration of 300 ms, resulting in an improved match to the experimental results.

3.2 Quantification of Intracellular Dynamics. To analyze the relationship between network and single-cell dynamics in a more quantitative way, we compare a number of characteristic measures of the simulated neuron to known physiological results (Anderson et al., 2000). First, as reported in the literature, we find a significant correlation between the membrane potential in the up state and the spiking frequency (see Figures 2E and 3A). In Figure 2E, the calcium dynamics exerts its modulating effect by compressing the dynamic range of the firing frequency in the up states. The range is reduced from 10 to 220 Hz in the passive case (see Figure 3A) to 5 to 50 Hz when calcium is introduced (see Figure 2E). Second, we find an increase of the standard deviation of the membrane potential in the up state that matches that observed in the visual cortex (see Figure 3B). This is a large (more than twofold) and highly significant effect. The significance holds also in the comparison between the variability of the membrane potential during a simulated visual stimulation and spontaneous activity (see the Figure 3B inset and right panel). Finally, we observe an increase of the power in the 20 to 50 Hz frequency band in the up state versus the down state (see Figure 3C). This is a smaller effect, but it is still statistically significant and is comparable to reported experimental findings (see Figure 3C, right panel). These three results match experimental findings that are considered to be among the central characteristics of up and down state dynamics (see Figures 3A to 3C, right side). Our results show that these characteristics emerge naturally in a detailed model of a cortical neuron when exposed to realistically structured input.

Figure 3: Facing page. Characteristic features of up states and down states. (A) The firing rate in the up state (vertical axis) is shown as a function of the membrane potential in the up state (horizontal axis) for both the simulation of a passive model (left) and the experimental data (right). In both cases, the average membrane potential in the up state is correlated with the firing rate of the cell in the up state. The difference in the scale between the left and right panels depends on the mean input firing rate chosen for this specific data set. The input correlation strength is 0.1, and the mean firing rate of the input is 85 Hz. See Figures 2D and 2E for a comparison of the active model with the experimental data. (B) The standard deviation (STD) of the membrane potential of the simulated neuron is increased in the up state as compared to the down state (left). The inset shows the subthreshold dynamics during stimulation and spontaneous background activity, respectively. The corresponding experimental data are shown to the right. In both cases, stimulation increases the variance of the subthreshold membrane potential. (C) The plot of the power spectra in the 20–50 Hz frequency band of the membrane potential for the optimal stimulation condition (left panel) shows an increase in the up state as compared to the down state. The same phenomenon is visible in the plot of the experimental data (right panel). Because no information is available on the normalization used for the power spectra in Anderson et al. (2000), we use arbitrary units, and a comparison of the absolute scale in the two panels is not possible. The vertical and horizontal dotted lines indicate the median for each corresponding axis; the light gray circle is the center of gravity of the distribution for a better comparison with the result of the simulation as shown in the left panel. The input pairwise correlation strength is 0.1 in Figures 3A and 3C and 0.2 in Figure 3B, while the mean input rate is always 85 Hz.
As a next step, we investigate measures of intracellular dynamics and relate them to the properties of the network’s activity. We define a set of indices to capture different aspects of the subthreshold activity. The first of these quantifies the relation between the strength of the correlations in the network activity and the bimodality of the membrane potential histogram. This dimensionless index, S, accounts for the dependence of the bimodality on the input correlation strength: S = (Vmax − Vmin)/|Vmax|, where Vmax and Vmin are the locations of the peaks in the membrane potential histograms (see Figures 2A and 2B). Increasing the correlation strength from its typically reported value of about 0.1 as used above (König et al., 1995a; Salinas & Sejnowski, 2000; Koch, Bernander, & Douglas, 1995) leads to an enhancement of the bimodality of the membrane potential histogram, resulting in a sharpening of the two peaks (see Figure 4A). This index is a monotonically increasing function of the correlation strength (see Figure 4B). The next index represents the fraction of the total time spent in the up state (TUS). For each simulation, the total time the membrane potential spends above the up threshold is divided by the total simulation time. This index is related to the integral of the peaks in the histograms, as explained in Anderson et al. (2000). The results indicate that this measure is strongly dependent on the input correlation strength. The TUS index is measured to be 7%, 13%, and 42% for the correlation strengths of 0.01, 0.1, and 0.2, respectively.

Figure 4: Facing page. Sensitivity of the membrane potential to the input correlation strength. (A) Histograms of the distribution of the membrane potential for three different conditions. The input correlation strength is increased from left to right: 0.01, 0.1, and 0.2, respectively. Bimodality emerges, and the hyperpolarized peak moves further away with increasing correlation strength.
(B) The “S” index (see section 3.2), which quantifies the increasing separation of the peaks in the membrane potential histograms, is shown for five choices of correlation strength. Note that the data points are connected by lines simply to improve visualization. We have no a priori hypotheses about specific functional relationships. (C) Cumulative probability distribution of the time intervals during which the membrane potential dwells above the up threshold. It is shown for six trials at two levels of correlation strength each. (D) For each action potential of the simulated neuron, the maximum slope of the rising phase of the membrane potential and the threshold potential for the spike generation are shown in a scatter plot. (E) The identical measure used in an experimental study demonstrates an inverse relationship as well. (F) The slope of the corresponding linear fit (solid line in Figure 4D) provides an index that is related to the input correlation strength. (G) To perform a reverse correlation analysis, for every transition from down states to up states, a time window was centered on the corresponding population peristimulus time histogram (PSTH) at the input stage to identify presynaptic events associated with the transition. The plot shows the average population activity in the temporal vicinity of a down-to-up transition.
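The two indices defined above can be computed directly from a voltage trace. In this sketch, the fixed split voltage used to separate the two histogram modes, and the up-threshold value, are our assumptions; the definitions of S and TUS follow the text.

```python
import numpy as np

def s_index(vm, split=-67.5):
    """S = (Vmax - Vmin) / |Vmax|, where Vmax and Vmin are the modes of
    the depolarized and hyperpolarized peaks of the Vm histogram.
    The fixed split voltage separating the two modes is an assumption."""
    lo, hi = vm[vm < split], vm[vm >= split]
    counts_lo, edges_lo = np.histogram(lo, bins=50)
    counts_hi, edges_hi = np.histogram(hi, bins=50)
    i, j = np.argmax(counts_lo), np.argmax(counts_hi)
    vmin = 0.5 * (edges_lo[i] + edges_lo[i + 1])   # hyperpolarized mode
    vmax = 0.5 * (edges_hi[j] + edges_hi[j + 1])   # depolarized mode
    return (vmax - vmin) / abs(vmax)

def tus_index(vm, up_thr=-60.0):
    """TUS: fraction of total time spent above the up threshold."""
    return np.mean(vm > up_thr)
```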
The third measure relates the correlation strength of the inputs to the cumulative probability (CUP) distribution of the up state (Anderson et al., 2000). The cumulative probability distributions of the time intervals during which the membrane potential is above the up threshold are computed for different values of the input correlation strength. Though it is not independent of the TUS index, it is a refinement of the previous index in that it controls for possible artifacts due to the spike-cutting procedure (see Figure 4C). The CUP measure can easily separate the diverse input correlation strengths, that is, a pairwise correlation strength of 0.01 (black lines) from one of 0.2 (light gray lines). Finally, the fourth index measures the slope at threshold (SAT) and thus characterizes the relationship between the input correlation structure and the dependence of the voltage threshold for spike generation on the maximum slope of the rising phase of an action potential (see Figures 4D and 4E). Indeed, Azouz and Gray (2000) have observed an inverse relationship between the maximum slope of the rising phase of an action potential and the voltage threshold for the generation of the spike (see Figure 4E). A similar inverse relationship also appears in the simulated neuron (see Figure 4D). More important, we find that the slope of the fit quantifying this relationship provides a measure of the input correlation strength (see Figure 4F): the index tends to decrease with increasing correlation strength. These analyses of the simulation data suggest that these four indices, based on the membrane potential measurable within a single neuron, can be used to extract the correlation strength of the network activity from the intracellular dynamics.
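The SAT and CUP measures reduce to simple computations once the per-spike quantities and the up-state labels are available. The sketch below assumes those have already been extracted from the traces; the linear fit follows the description in the text.

```python
import numpy as np

def sat_index(max_slopes, thresholds):
    """SAT index: slope of the linear fit of spike voltage threshold (mV)
    against the maximum dV/dt of the rising phase.  A negative slope
    reproduces the inverse relationship of Azouz & Gray (2000)."""
    slope, _intercept = np.polyfit(max_slopes, thresholds, 1)
    return slope

def up_state_durations(up, dt=0.001):
    """Durations (s) of contiguous up-state segments: the raw material
    for the cumulative probability (CUP) distribution."""
    x = up.astype(int)
    starts = np.flatnonzero(np.diff(x) == 1) + 1
    ends = np.flatnonzero(np.diff(x) == -1) + 1
    if x[0]:
        starts = np.r_[0, starts]
    if x[-1]:
        ends = np.r_[ends, x.size]
    return (ends - starts) * dt
```

The CUP curves in Figure 4C then correspond to empirical cumulative distributions of `up_state_durations` at different input correlation strengths.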
We focus now on the mechanisms of the observed bimodality. An important result comes from performing reverse correlation analysis of the transitions from down to up states with respect to the population activity. On average, the total presynaptic activity around the transition from down to up states shows a sharp peak (see Figures 4G and 5A). This indicates that the switch of the intracellular dynamics between up and down states is induced by short-lasting, highly correlated input events. This confirms that up states are induced by presynaptic higher-order events. When the pairwise correlation strength is lowered, the amplitude of these sharp correlation peaks decreases, and the bimodality smoothly disappears. This is true for the passive L5 pyramidal cell, as well as for the model with voltage-dependent mechanisms. As shown before, the quantitative features differ in the two cases (for example, the duration of up states or the mean firing rate), but the bimodality itself is not affected. Moreover, there is no intrinsic bimodality in the modeled neuron; the dynamics is fully input driven. When high-order correlation events disappear following a decrease in the pairwise correlations, the bimodality is destroyed, even though the mean input firing rate to the neuron is kept constant.
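This reverse correlation analysis amounts to averaging the afferent population activity around each down-to-up transition. A minimal sketch, assuming the population PSTH and the up-state labels are binned on the same time axis (function name and window size are our choices):

```python
import numpy as np

def transition_triggered_average(pop_psth, up, win=100):
    """Average the afferent population PSTH in a window of +/- win bins
    around each down-to-up transition of the postsynaptic state."""
    trans = np.flatnonzero(np.diff(up.astype(int)) == 1) + 1
    # keep only transitions with a full window inside the recording
    trans = trans[(trans >= win) & (trans + win <= pop_psth.size)]
    segments = [pop_psth[t - win:t + win] for t in trans]
    return np.mean(segments, axis=0)
```

A sharp peak at the center of the returned average corresponds to the presynaptic higher-order event described in the text.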
3.3 Simulating Network Effects. According to the above findings, up and down state fluctuations could manifest themselves in a large-scale coherent fashion if all the neurons in the population experience the same correlated dynamics (as for the up and down fluctuations observed in slow-wave sleep states or in anesthetized conditions; see section 4). Instead, groups of neurons exposed to different network correlation structures would undergo different subthreshold dynamics. We tested this idea by running 11 different simulations of a neuron exposed to a corresponding number of different sets of input spike trains. While the mean firing rate and pairwise correlation strength were kept constant, the timing of the epochs used to generate the correlations changed linearly: from identical for the first set to completely different for the last set. In other words, the timing of high-order correlation events, as highlighted by the reverse correlation analyses, became more and more dissimilar from set to set. For the first six sets, we observe a significant correlation between the membrane potentials of the simulated neurons (p < 0.001). When the afferent signals are only slightly overlapping, the membrane potentials are no longer significantly correlated. High values of correlation, as experimentally described, are observed only when pairs of cells share a large fraction of synaptic inputs (see Figure 5B).

3.4 Cell Morphology. An important question is whether the detailed morphology of a neuron contributes to the emergence of up and down states and, if so, what the key properties involved are. We manipulate the gross morphological structure of the pyramidal cell in a series of experiments. First, we delete all morphological specificity of the pyramidal neuron by morphing the cell into a single spherical compartment of equal surface area, while keeping the firing-rate transfer function unchanged. This reduces the cell to a conductance-based integrate-and-fire (IF) model.
This procedure has been done with and without Ca2+ and Ca2+-dependent K+ currents. Second, we morph the cell into a three-compartment model (basal dendritic tree plus soma, proximal apical dendritic tree, and distal apical dendritic tree) that preserves dendritic voltage attenuation (Destexhe, 2001). Third, we morph only the basal part into an equivalent compartment while preserving the apical dendritic tree in detail. In all these cases, when exposed to the same inputs used in the simulation experiments described above, the up and down dynamics disappear (see Figure 5D). Interestingly, when exposing the IF model to the same input statistics and using a parameter choice for the Ca2+ currents that elicited long-lasting bistability in the L5 pyramidal cell, we do not find any up and down states. Note that this does not imply that no choice of parameters and input statistics exists for which bimodality would emerge. However, when we keep the morphology of the basal dendritic tree unchanged and reduce the apical part to a single equivalent compartment, the qualitative aspects of the bimodal intracellular dynamics are preserved (see Figure 5E). Thus, we find that in the passive model, an intact basal dendritic tree is the
Figure 5: Facing page. (A) Somatic intracellular recordings of the membrane potential showing up-down fluctuations. The modeled cell incorporates calcium dynamics. The raster plot of the corresponding input activity is shown in the bottom panel. High-order correlation events (darker vertical stripes) of a few tens of milliseconds duration induce transitions to the up states. The distribution of these correlated events in time follows a Poisson process. The duration of up states can be prolonged by increasing the frequency of high-order events, as happens in the time window of 2 to 2.5 seconds. The same effect is obtained by increasing the calcium peak conductance and time constants (see controls in section 2), with the result of eventually merging all the up states shown in the figure into a single long-lasting up state. (B) Decorrelating the subthreshold dynamics: a template set of points in time (in the raster plot shown in panel A, such a set of points would correspond to the moments in time when the vertical stripes occur) has been used to generate 10 other sets of points, whose numeric values differ more and more from the original one. The first set is a perfect copy of the original template; the second one has 10% difference, and so on. The last one is 100% different. This group of 11 sets is then used to generate 11 corresponding ensembles of 5000 correlated inputs. These 11 populations have increasingly different degrees of correlation between them, since they share fewer and fewer time epochs for generating pairwise correlations (see section 2). We ran the simulations and recorded the voltage traces for 10 seconds of simulation time. The subthreshold dynamics of the second trace is identical to the first one (the template), while the following traces differ more and more.
(C) To quantify the degree of decorrelation, we windowed each voltage trace using a 1 second time window, and for each segment, we computed the linear correlation coefficient between the reference trace and the following ones. By using the windowing procedure, we could obtain 10 different estimates of the correlation coefficient for every pair of voltage traces, thus deriving an estimate of the mean correlation values and standard deviations. “Seeds” on the abscissa refers to the points in time used to create the correlations (see section 2). (D) The histogram of the membrane potential for a single spherical compartment, preserving the original input-output firing-rate transfer function, shows a central predominant peak and two small satellite peaks. The more depolarized one is an effect of the spike-cutting procedure, and the more hyperpolarized one is a result of the after-spike hyperpolarization. (E) Histogram of the membrane potential with intact basal morphology and a reduced apical one, substituted by a single cylinder of equivalent surface. (F) Histogram of the membrane potential of a spiny stellate cell with active mechanisms included (see the text for details). (G) Bimodality in the histogram of the membrane potential of an IF neuron with modified EPSP rise and decay time constants (1 ms and 12 ms, respectively).
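The windowed correlation procedure of panel (C) can be sketched as follows; the function name is ours, and nonoverlapping 1 s windows are assumed, matching the description in the caption.

```python
import numpy as np

def windowed_correlation(v_ref, v_other, dt=0.001, win=1.0):
    """Linear correlation between two voltage traces, computed in
    nonoverlapping 1 s windows; returns mean and s.d. over windows."""
    w = int(win / dt)
    n = v_ref.size // w
    cc = [np.corrcoef(v_ref[i * w:(i + 1) * w],
                      v_other[i * w:(i + 1) * w])[0, 1]
          for i in range(n)]
    return float(np.mean(cc)), float(np.std(cc))
```

Applied to a 10 s trace, this yields the 10 estimates per trace pair mentioned in the caption.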
minimal condition necessary for the emergence of up and down states. The interesting observation here is that the parameter set that robustly produces bistability in an L5 pyramidal cell does not produce the experimentally observed bistability in a point neuron. As a further test of the hypothesis that the cable properties of the basal dendritic tree are essential for the generation of up and down states, we test a model of a spiny stellate cell developed by Banitt and Segev (personal communication, 2003). From the point of view of its gross morphological structure, this spiny stellate cell can be considered a pyramidal cell without an apical dendritic tree. It thus resembles the morphological characteristics of the cell used in the latter control. It is as if, instead of an equivalent cylindrical compartment, the apical part had been “cut off.” Furthermore, this model contains a number of active mechanisms (see section 2). Because Banitt and Segev used this model for a different purpose, we extended it with AMPA and GABA synapses supplying correlated afferent input as described above. In this simulation, using an alternative detailed model neuron with several voltage-dependent mechanisms (see section 2), the same bimodality appears (see Figure 5F). This demonstrates that the basal dendritic tree is an important morphological compartment for the induction of up and down states. In order to investigate the role of the basal dendritic tree, we separate the effects of the interaction of many excitatory postsynaptic potentials (EPSPs) in the dendrites from the effects of electrotonic propagation on individual EPSPs by developing a neuron model that retains some aspects of dendritic processing but radically simplifies others. We increased the duration of the EPSPs in an IF model without any detailed morphology (i.e., a point neuron).
We kept the total charge flow and the input correlation structure unchanged: the correlation strength remained 0.1, and the rise and decay time constants and the peak amplitude of the conductance change, g(t), were modified so as to keep its integral constant. Also in this simulation, up and down states emerge (see Figure 5G). It should be emphasized that the quantitative mismatch in the bimodality between Figure 5G and Figure 2B is not surprising, since the models used are completely different: an L5 pyramidal cell with detailed morphology versus a single-point IF neuron. The intuitive explanation is that in the real neuron, as well as in the detailed simulations, when a barrage of EPSPs occurs synchronously in several basal dendrites, these long, thin cables quickly become isopotential compartments, each simultaneously depolarizing the soma. Moreover, the temporal filtering properties of these cables have the net effect of prolonging the effective duration of the EPSPs. Note that this temporal broadening is effective even though the intrinsic dendritic time constant, τ, is lowered by the arrival of massive excitation, which increases the conductance (Paré, Shink, Gaudreau, Destexhe, & Lang, 1998). The overall effect is a strong, sustained current to the soma. We conclude that for the passive model, higher-order events, which naturally result from weak pairwise correlations in the network, combined
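The broadened EPSP conductance with a fixed integral can be written down explicitly. The difference-of-exponentials waveform below is our assumption for the synaptic kinetics; the 1 ms rise and 12 ms decay time constants are those given for Figure 5G, and the amplitude is rescaled analytically so that the integral of g(t), and hence the total charge at a fixed driving force, stays constant.

```python
import numpy as np

def epsp_conductance(t, tau_rise=0.001, tau_decay=0.012, q=1.0):
    """Difference-of-exponentials g(t), normalized so that its time
    integral equals q.  Since the integral of
    exp(-t/tau_decay) - exp(-t/tau_rise) over t >= 0 is
    tau_decay - tau_rise, dividing by that factor fixes the charge
    while the kinetics are broadened."""
    g = np.exp(-t / tau_decay) - np.exp(-t / tau_rise)
    g[t < 0] = 0.0
    return q * g / (tau_decay - tau_rise)

t = np.arange(0.0, 0.1, 1e-5)   # 100 ms at 10 us resolution
g = epsp_conductance(t)
```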
with physiologically realistic electrotonic dendritic properties, explain the bimodal distribution of the membrane potential, and that active conductances shape its detailed temporal properties.

4 Discussion

Here we show in a detailed model of a pyramidal neuron that weak pairwise correlations in its inputs, in combination with the electrotonic properties of its basal dendritic tree, cause up and down states in its membrane potential dynamics. Furthermore, several experimental characterizations of up and down states, such as an increase in gamma power and in the standard deviation of the membrane potential, can be explained in terms of presynaptic correlation structures. By introducing several statistical indices, we demonstrate a way to derive the correlation strength of the inputs, and thus of the activity in the network, from the subthreshold dynamics of a single neuron. In this sense, correlated activity in the network and the bimodality of the membrane potential are different views on one and the same phenomenon. Previous explanations of up and down states have focused on a presynaptic origin and have suggested that synchronous barrages of excitation may be the major agents involved (Wilson & Kawaguchi, 1996). What kind of mechanism in the cortex could be responsible for their generation is an issue that has not been fully resolved. Here we show that no additional hypothesis other than the well-described weak pairwise correlations is required to fill this gap. However, we also show in our model that such a presynaptic source can account for the bimodality in the membrane potential of the postsynaptic neuron only when it is complemented by temporal filtering of the input spike trains by the basal dendrites. Active intracellular mechanisms sustain the triggered transitions to up states. Importantly, the neuron exhibits bimodality in its membrane potential with, as well as without, insertion of voltage-dependent mechanisms.
The key effect of the Ca2+-dependent K+ and Na+ channels is their strong modulation of the quantitative properties of the subthreshold dynamics. This includes a sixfold compression of the dynamic range of the mean firing rates in the up state. These channels can stretch the up-state duration from 25 to 300 ms, providing a better match with the electrophysiological data. However, they are not responsible for the emergence of the phenomenon per se. The differential impact of the morphology of the basal and apical dendrites on the response properties of pyramidal cells has been pointed out in other experimental (Larkum, Zhu, & Sakmann, 1999; Berger, Larkum, & Lüscher, 2001) and theoretical studies (Körding & König, 2001). In these studies, properties of the inputs play a key role in the interaction of apical and basal dendritic compartments. In another study, it has been shown that the dynamic regulation of the dendritic electrotonic length can give rise to highly specific interactions in a cortical network that can account for the differential processing of multiple stimuli (Verschure & König, 1999).
A. Benucci, P. Verschure, and P. König
4.1 The Scope of the Model. In our simulations, we do not reconstruct a complete visual system but approximate the first- and second-order statistical structure of the afferent inputs to a single neuron derived from experimental results. This approximation can be experimentally verified by intracellular recordings in area 17 of an anesthetized cat using full field gratings as visual stimuli. This setup has been used in a fair number of laboratories to demonstrate the existence of weak pairwise correlations in neuronal activity (Bair, 1999). Such synchronized activity has been observed over a spatial extent of several millimeters, roughly matching the scale of monosynaptic tangential connections in the cortex (Singer, 1993). Therefore, a pyramidal neuron in visual cortex samples mainly the activity of a region where neuronal activity is weakly correlated. Furthermore, the pyramidal cell reconstructed and simulated in this study has been recorded from and filled in primary visual cortex. Pyramidal cells are the predominant neuronal type in the whole cortex, and weak pairwise correlations have been found in many other cortical areas and species as well (Bair, 1999). Thus, our simulations apply to widely used experimental paradigms. Another important question pertaining to our study and the physiological data it relates to is whether the phenomena studied here generalize to the awake, behaving animal. The impact of the behavioral state on the cortical dynamics is not fully understood, and the bulk of physiological experiments are performed under anesthesia. Electroencephalogram data show marked differences in the neuronal dynamics between different behavioral states (Steriade et al., 2001), and we can expect that the detailed dynamics of neuronal activity are affected. Furthermore, it is known that anesthetics influence the dynamics of up and down states during spontaneous activity (Steriade, 2001). 
However, whether the subthreshold dynamics is influenced by anesthetics when visual stimuli are applied is as yet unknown. Furthermore, in the few studies where awake cats are used (Gray & Viana di Prisco, 1997; Siegel, Körding, & König, 2000), correlated activity on a millisecond timescale has been observed, compatible with the results obtained in the anesthetized preparation. Interestingly, a bimodality in the subthreshold dynamics has been observed in slow-wave sleep (Steriade et al., 2001). This raises the question of how the correlation structure of neuronal activity during sleep matches that observed under anesthetics. Hence, whether the assumptions made in our study, and the data they are based on, generalize to awake or sleeping animals has to be further investigated. The few results available suggest, however, that the relationship reported here between network dynamics and up and down states could be valid for different behavioral states.

4.2 Simplifications and Assumptions. The simulations presented in this study incorporate a number of assumptions and simplifications. Although within the framework of the simulation it is possible to use statistical indices to quantify the relationship between membrane potential dynamics and the statistical structure of the inputs, a full quantitative match between experimental and simulation results is difficult to achieve. To understand the reasons behind these differences (for standard deviation, see Figure 3B; for slope, see Figures 4D and 4E), it has to be considered that we investigated detailed compartmental models where many known channels and currents (Wilson & Kawaguchi, 1996) have been omitted. In particular, voltage-dependent currents with long time constants are known to play an important role in stabilizing and prolonging up and down states (Wilson & Kawaguchi, 1996). Once the upward or downward transition has occurred, active currents contribute to stabilizing the membrane potential in either an up state or a down state. Their duration is related not only to the input dynamics but also to the kinetic properties of such active mechanisms, which essentially implement a bistable attractor of the dynamics. Without them, the passive filtering properties of the dendrites would simply be responsible for the emergence of stretched-in-time envelopes of the coherent presynaptic events. Up and down states would be triggered but not maintained for prolonged periods. These considerations have been validated by a tenfold increase in the duration of the up states as soon as Ca2+ and Ca2+-dependent K+ currents were included in the model. The relative importance of voltage-dependent mechanisms and presynaptic events in inducing and maintaining a bistable neuronal dynamics seems to vary considerably depending on the animal and brain area. In view of this, the temptation arises to include a much larger number of active mechanisms to capture this complexity. However, it quickly becomes obvious that the data needed to specify the precise distribution and strength of each mechanism are not available.
Therefore, the number of free parameters of the model, to be fitted by reaching a consistent input-output firing rate and other constraints, is dramatically increased and surpasses the number of constraints available. An interesting alternative to the above scenario can be envisioned. Instead of increasing the complexity of the modeled cell, it is possible to lower the complexity of the real cell. Essentially, an experiment can be designed in which the focus shifts from an investigation of the cellular properties of the recorded neuron as such, to using the neuron as a probe to investigate the correlation structure in the network. The main guidelines of the experiment should be to record intracellularly from a neuron in primary visual cortex while presenting different visual stimuli that will induce different correlation structures. Gratings are known to induce pairwise correlations, while randomly moving dots lead to weak or no correlations (Usrey & Reid, 1999; Brecht, Goebel, Singer, & Engel, 2001; Gray et al., 1990). Alternatively, synchronous network activity could be simulated by using an intracortical electrode and applying short-lasting depolarizing current pulses. Applying blockers of voltage-dependent channels or hyperpolarizing the neuron strongly affects the membrane conductance and simplifies the dynamics within the real neuron. The aim is to reduce as much as possible the impact
of active mechanisms so that the real cell becomes simply a passive receiver of the input spike trains and their higher-order correlation statistics. Reducing the number of parameters that can vary in a real experiment to the smaller set of controlled parameters employed in the simulation study allows further validation of the conclusions derived from our model against physiological reality. This makes it possible to infer the correlation strength of the activity in the cortical network from the subthreshold dynamics as quantified by the indices described above. In this sense, the simulation study facilitates a new experimental approach, using a neuron under nonphysiological conditions as a passive probe to investigate the dynamics of the network.

Appendix

The tables describe the parameter values used in the NEURON simulation. "Place" (in Table 1) refers to the dendritic location of the inserted mechanism; "dendrites" indicates both basal and apical dendrites. $\bar{G}$ (and $\bar{g}$ for synaptic mechanisms) indicates the peak conductance value. For Ih channels, the peak conductance varies according to the dendritic location, as reported by Berger et al. (2001). The parameters m and n refer to the kinetic scheme of the Hodgkin-Huxley-like formalism used to describe the activation and inactivation properties of voltage-dependent mechanisms, according to the following equation:

I_A = \bar{G}_A \, m_A^p \, n_A^q \, (u - E_A),

where A is a generic ionic type. The parameters k and u_{1/2} refer to the Boltzmann equation that describes the steady-state conductance behavior of the inserted ionic mechanisms:

f(u) = \frac{1}{1 + \exp\left((u_{1/2} - u)/k\right)}.
Finally, the time course of the synaptic conductances (AMPA and GABA; see Table 2) follows an alpha-function behavior:

f(t) = \exp(-t/\tau_1) - \exp(-t/\tau_2).

NMDA channels have been used solely in control experiments (see section 2.6).

Acknowledgments

This work was supported by the Swiss National Fund (31-61415.01), ETH Zürich, and EU/BBW (IST-2000-28127/01.0208-1).
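The appendix formulas above can be sketched numerically. This is a minimal illustration, not the model itself: the function names and the example parameter values below are ours, not those of Tables 1 and 2.

```python
import math

def boltzmann(u, u_half, k):
    """Steady-state activation f(u) = 1 / (1 + exp((u_half - u) / k))."""
    return 1.0 / (1.0 + math.exp((u_half - u) / k))

def alpha_conductance(t, tau1, tau2):
    """Alpha-function synaptic time course f(t) = exp(-t/tau1) - exp(-t/tau2).

    For tau1 > tau2 this rises from 0 and decays back toward 0.
    """
    return math.exp(-t / tau1) - math.exp(-t / tau2)

# Illustrative (hypothetical) values: half-activation at -40 mV, slope 5 mV,
# synaptic decay/rise time constants 5 ms and 1 ms.
f_half = boltzmann(-40.0, u_half=-40.0, k=5.0)       # 0.5 by construction
g_peak_region = alpha_conductance(2.0, tau1=5.0, tau2=1.0)
```

At `u = u_half` the Boltzmann curve is exactly 0.5, and the alpha function is zero at `t = 0`, which makes for easy sanity checks.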
High-Order Correlations in Neural Activity
References

Agmon-Snir, H., & Segev, I. (1993). Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol., 70, 2066–2085.
Anderson, J., Lampl, I., Reichova, I., Carandini, M., & Ferster, D. (2000). Stimulus dependence of two-state fluctuations of membrane potential in cat visual cortex. Nature Neurosci., 3, 617–621.
Azouz, R., & Gray, C. M. (2000). Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. PNAS, 97, 8110–8115.
Azouz, R., & Gray, C. M. (2003). Adaptive coincidence detection and dynamic gain control in visual cortical neurons in vivo. Neuron, 37, 513–523.
Bair, W. (1999). Spike timing in the mammalian visual system. Curr. Opin. Neurobiol., 9, 447–453.
Benucci, A., Verschure, P. F., & König, P. (2003). Existence of high-order correlations in cortical activity. Phys. Rev. E, 68(4 Pt 1), 041905.
Berger, T., Larkum, M. E., & Lüscher, H. R. (2001). High I(h) channel density in the distal apical dendrite of layer V pyramidal cells increases bidirectional attenuation of EPSPs. J. Neurophysiol., 85(2), 855–868.
Bernander, Ö., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic inputs to cortical pyramidal cells. J. Neurophysiol., 72, 2743–2753.
Bernander, Ö., Koch, C., & Usher, M. (1994). The effects of synchronized inputs at the single neuron level. Neural Computation, 6, 622–641.
Bothé, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effect of pair-wise and higher order correlations on the firing rate of a post-synaptic neuron. Neural Computation, 12, 153–179.
Brecht, M., Goebel, R., Singer, W., & Engel, A. K. (2001). Synchronization of visual responses in the superior colliculus of awake cats. Neuroreport, 12, 43–47.
Contreras, D., & Steriade, M. (1995). Cellular basis of EEG slow rhythms: A study of dynamic corticothalamic relationships. J. Neurosci., 15, 604–622.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44(4), 437–440.
Destexhe, A. (2001). Simplified models of neocortical pyramidal cells preserving somatodendritic voltage attenuation. Neurocomputing, 38–40, 167–173.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Destexhe, A., & Paré, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32–33, 113–119.
Douglas, R. J., & Martin, K. A. C. (1990). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 389–438). Oxford: Oxford University Press.
Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). An intracellular analysis of the visual responses of neurons in cat visual cortex. J. Physiol. (London), 440, 659–696.
Engel, A. K., König, P., Gray, C. M., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci., 2, 588–606.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Comput., 12, 671–692.
Gabbott, P. L., Martin, K. A. C., & Whitteridge, D. (1987). Connections between pyramidal neurons in layer 5 of cat visual cortex (area 17). J. Comp. Neurol., 259, 364–381.
Gray, C. M., Engel, A. K., König, P., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex: Receptive field properties and feature dependence. Eur. J. Neurosci., 2, 607–619.
Gray, C. M., & Viana di Prisco, G. (1997). Stimulus-dependent neuronal oscillations and local synchronization in striate cortex of the alert cat. J. Neurosci., 17, 3239–3253.
Grün, S., Diesmann, M., & Aertsen, A. (2001). Unitary events in multiple single-neuron spiking activity: I. Detection and significance. Neural Computation, 14, 43–80.
Hines, M. L., & Carnevale, N. T. (2001). NEURON: A tool for neuroscientists. Neuroscientist, 7, 123–135.
Kasanetz, F., Riquelme, L. A., & Murer, M. G. (2002). Disruption of the two-state membrane potential of striatal neurones during cortical desynchronization in anesthetized rats. J. Physiol., 543, 577–589.
Kisvarday, Z. F., Toth, E., Rausch, M., & Eysel, U. T. (1997). Orientation-specific relationship between populations of excitatory and inhibitory lateral connections in the visual cortex of the cat. Cerebral Cortex, 7, 605–618.
Koch, C., Bernander, Ö., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comput. Neurosci., 2, 63–82.
König, P. (1994). A method for the quantification of synchrony and oscillatory properties of neuronal activity. J. Neurosci. Meth., 54, 31–37.
König, P., Engel, A. K., & Singer, W. (1995a). How precise is neuronal synchronization? Neural Computation, 7, 469–485.
König, P., Engel, A. K., & Singer, W. (1995b). Relation between oscillatory activity and long-range synchronization in cat visual cortex. PNAS, 92, 290–294.
Körding, K. P., & König, P. (2001). Supervised and unsupervised learning with two sites of synaptic integration. J. Comput. Neurosci., 11, 207–215.
Lampl, I., Reichova, I., & Ferster, D. (1999). Synchronous membrane potential fluctuations in neurons of the cat visual cortex. Neuron, 22, 361–374.
Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398, 338–341.
Lewis, B. L., & O'Donnell, P. (2000). Ventral tegmental area afferents to the prefrontal cortex maintain membrane potential "up" states in pyramidal neurons via D1 dopamine receptors. Cerebral Cortex, 10, 1168–1175.
Mazurek, M. E., & Shadlen, M. N. (2002). Limits to the temporal fidelity of cortical spike rate signals. Nature Neurosci., 5, 463–471.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70, 1086–1101.
Paré, D., Shink, E., Gaudreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical pyramidal neurons in vivo. J. Neurophysiol., 79, 1450–1460.
Roskies, A., et al. (1999). Reviews on the binding problem. Neuron, 24(1), 7–110.
Salin, P. A., & Bullier, J. (1995). Corticocortical connections in the visual system: Structure and function. Physiological Rev., 75, 107–154.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20, 6193–6209.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Siegel, M., Körding, K. P., & König, P. (2000). Integrating top-down and bottom-up sensory processing by somato-dendritic interactions. J. Neurosci., 8, 161–173.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Sompolinsky, H., & Shapley, R. (1997). New perspectives on the mechanisms for orientation selectivity. Current Opinion in Neurobiol., 7, 514–522.
Steriade, M. (2001). Impact of network activities on neuronal properties in corticothalamic systems. J. Neurophysiol., 86, 1–39.
Steriade, M., Contreras, D., & Amzica, F. (1994). Synchronized sleep oscillations and their paroxysmal developments. Trends Neurosci., 17, 199–208.
Steriade, M., Timofeev, I., & Grenier, F. (2001). Natural waking and sleep states: A view from inside neocortical neurons. J. Neurophysiol., 85, 1968–1985.
Stern, E. A., Jaeger, D., & Wilson, C. J. (1998). Membrane potential synchrony of simultaneously recorded striatal spiny neurons in vivo. Nature, 394, 475–478.
Stern, E. A., Kincaid, A. E., & Wilson, C. J. (1997). Spontaneous subthreshold membrane potential fluctuations and action potential variability of rat corticostriatal and striatal neurons in vivo. J. Neurophysiol., 77, 1697–1715.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Comput., 13, 2005–2029.
Usrey, W. M., & Reid, R. C. (1999). Synchronous activity in the visual system. Annu. Rev. Physiol., 61, 435–456.
Verschure, P. F., & König, P. (1999). On the role of biophysical properties of cortical neurons in binding and segmentation of visual scenes. Neural Comput., 11, 1113–1138.
Wilson, C. J., & Groves, P. M. (1981). Spontaneous firing patterns of identified spiny neurons in the rat neostriatum. Brain Res., 220, 67–80.
Wilson, C. J., & Kawaguchi, Y. (1996). The origins of two-state spontaneous membrane potential fluctuations of neostriatal spiny neurons. J. Neurosci., 16, 2397–2410.

Received September 4, 2003; accepted April 7, 2004.
LETTER
Communicated by Yair Weiss
On the Uniqueness of Loopy Belief Propagation Fixed Points Tom Heskes [email protected] SNN, University of Nijmegen, 6525 EZ, Nijmegen, The Netherlands
We derive sufficient conditions for the uniqueness of loopy belief propagation fixed points. These conditions depend on both the structure of the graph and the strength of the potentials and naturally extend those for convexity of the Bethe free energy. We compare them with (a strengthened version of) conditions derived elsewhere for pairwise potentials. We discuss possible implications for convergent algorithms, as well as for other approximate free energies. 1 Introduction Loopy belief propagation is Pearl’s belief propagation (Pearl, 1988) applied to networks containing cycles. It can be used to compute approximate marginals in Bayesian networks and Markov random fields. Whereas belief propagation is exact only in special cases, for example, for tree-structured (singly connected) networks with just gaussian or just discrete nodes, loopy belief propagation empirically often leads to good performance (Murphy, Weiss, & Jordan, 1999; McEliece, MacKay, & Cheng, 1998). That is, the approximate marginals computed with loopy belief propagation are in many cases close to the exact marginals. In gaussian graphical models, the means are guaranteed to coincide with the exact means (Weiss & Freeman, 2001). The notion that fixed points of loopy belief propagation correspond to extrema of the so-called Bethe free energy (Yedidia, Freeman, & Weiss, 2001) is an important step in the theoretical understanding of this success and paved the road for interesting generalizations. However, when applied to graphs with cycles, loopy belief propagation does not always converge. So-called double-loop algorithms have been proposed that do guarantee convergence (Yuille, 2002; Teh & Welling, 2002; Heskes, Albers, & Kappen, 2003), but are an order of magnitude slower than standard loopy belief propagation. It is generally believed that there is a close connection between (non)convergence of loopy belief propagation and (non)uniqueness of loopy belief propagation fixed points. 
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2379–2413 (2004)

More specifically, the working hypothesis is that uniqueness of a loopy belief propagation fixed point guarantees convergence of loopy belief propagation to this fixed point. The goal of this study, then, is to derive sufficient
conditions for uniqueness. Such conditions are not only relevant from a theoretical point of view, but can also be used to derive faster algorithms and suggest different free energies, as will be discussed in section 9.

2 Outline

Before getting into the mathematical details, we first sketch the line of reasoning that will be followed in this article. It is inspired by the connection between fixed points of loopy belief propagation and extrema of the Bethe free energy: by studying the Bethe free energy, we can learn about properties of loopy belief propagation. The Bethe free energy is an approximation to the exact variational Gibbs-Helmholtz free energy. Both are concepts from (statistical) physics. Abstracting from the physical interpretation, the Gibbs-Helmholtz free energy is "just" a functional with a unique minimum, the argument of which corresponds to the exact probability distribution. However, the Gibbs-Helmholtz free energy is as intractable as the exact probability distribution. The idea is then to approximate the Gibbs-Helmholtz free energy in the hope that the minimum of such a tractable approximate free energy relates to the minimum of the exact free energy. Examples of such approximations are the mean-field free energy, the Bethe free energy, and the Kikuchi free energy. The connections between the Gibbs-Helmholtz free energy, the Bethe free energy, and loopy belief propagation are reviewed in section 3. The Bethe free energy is a function of so-called pseudomarginals or beliefs. For the minimum of the Bethe free energy to make sense, these pseudomarginals have to be properly normalized as well as consistent. Our starting point, the upper-left corner in Figure 1, is a constrained minimization problem. In general, it is in fact a nonconvex constrained minimization problem, since the Bethe free energy is a nonconvex function of the pseudomarginals (the constraints are linear in these pseudomarginals).
However, using the constraints on the pseudomarginals, it may be possible to rewrite the Bethe free energy in a form that is convex in the pseudomarginals. When this is possible, we call the Bethe free energy "convex over the set of constraints" (Pakzad & Anantharam, 2002). Now, if the Bethe free energy is convex over the set of constraints, we have, in combination with the linearity of the constraints, a convex constrained minimization problem. Convex constrained minimization problems have a unique solution (see, e.g., Luenberger, 1984), which explains link d in Figure 1. Sufficient conditions for convexity over the set of constraints, link b in Figure 1, can be found in Pakzad and Anantharam (2002) and Heskes et al. (2003). They are (re)derived and discussed in section 4. These conditions depend on only the structure of the graph, not on the (strength of the) potentials that make up the probability distribution defined over this graph. A corollary of these conditions, derived in section 4.3, is that the Bethe free energy for a graph with a single loop is "just" convex over the set of
Figure 1: Layout of correspondences and implications. See the text for details.
constraints: with two or more connected loops, the conditions fail (see also McEliece & Yildirim, 2003). Milder conditions for uniqueness, which do depend on the strength of the interactions, follow from the track on the right-hand side of Figure 1. First, we note that nonconvex constrained minimization of the Bethe free energy is equivalent to an unconstrained nonconvex/concave minimax problem (Heskes, 2002), link a in Figure 1. Convergent double-loop algorithms like CCCP (Yuille, 2002) and faster variants thereof (Heskes et al., 2003) in fact solve such a minimax problem: the concave problem in the maximizing parameters (basically Lagrange multipliers) is solved by a message-passing algorithm very similar to standard loopy belief propagation in the inner loop, where the outer loop changes the minimizing parameters (a remaining set of pseudomarginals) in the proper downward direction. The transformation
from a nonconvex constrained minimization problem to an unconstrained nonconvex/concave minimax problem is, in a particular setting relevant to this article, repeated in section 5.1. Rather than requiring the Bethe free energy to be convex (over the set of constraints), we then in sections 6 and 8 work toward conditions under which this minimax problem is convex/concave. These indeed depend on the strength of the potentials, defined in section 7. These conditions can be considered the main result of this article. Link c follows from the observation, in section 5.2, that the minimax problem corresponding to a Bethe free energy that is convex over the set of constraints has to be convex or concave. As indicated by link e, convex/concave minimax problems have a unique solution. This then also implies that the Bethe free energy has a unique extremum satisfying the constraints, which, since the Bethe free energy is bounded from below (see section 5.3), has to be a minimum: link f. The concluding statement by link g in the lower-right corner is, to the best of our knowledge, no more than a conjecture. We discuss it in more detail in section 9.

3 The Bethe Free Energy and Loopy Belief Propagation

3.1 The Gibbs-Helmholtz Free Energy. The exact probability distribution in Bayesian networks and Markov random fields can be written in the factorized form

P_{\rm exact}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha). \qquad (3.1)

Here \Psi_\alpha is a potential, some function of the potential subset X_\alpha, and Z is an unknown normalization constant. Potential subsets typically overlap, and they span the whole domain X. The convention that we adhere to in this article is that there are no potential subsets X_\alpha and X_{\alpha'} such that X_\alpha is fully subsumed by X_{\alpha'}. The standard choice of a potential in a Bayesian network is a child with all its parents. We further restrict ourselves to probabilistic models defined on discrete random variables, each of which runs over a finite number of states. The potentials are positive and finite. The typical goal in Bayesian networks and Markov random fields is to compute the partition function Z or marginals, for example,

P_{\rm exact}(X_\alpha) = \sum_{X \setminus X_\alpha} P_{\rm exact}(X).
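For concreteness, equation 3.1 and the marginalization can be sketched by brute-force enumeration on a toy model; the variable names and potential values below are illustrative, and this approach is of course feasible only for very small domains.

```python
from itertools import product

# Two overlapping potentials over three binary variables (x0, x1, x2);
# the table values are arbitrary, chosen only to illustrate equation 3.1.
psi_01 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # Psi(x0, x1)
psi_12 = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.5}   # Psi(x1, x2)

def unnormalized(x):
    """Product of potentials, the numerator of equation 3.1."""
    return psi_01[(x[0], x[1])] * psi_12[(x[1], x[2])]

states = list(product((0, 1), repeat=3))
Z = sum(unnormalized(x) for x in states)        # partition function
P = {x: unnormalized(x) / Z for x in states}    # P_exact(X)

# Marginal over the subset X_alpha = (x0, x1): sum out the remaining variable.
P_01 = {}
for x in states:
    P_01[(x[0], x[1])] = P_01.get((x[0], x[1]), 0.0) + P[x]
```

Both the joint and the marginal sum to one, as they must; the point of the algorithms discussed next is to obtain such marginals without the exponential enumeration.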
One way to do this is with the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). However, the junction tree algorithm scales exponentially with the size of the largest clique and may become intractable for complex models. The alternative is then to resort to approximate methods, which can be
roughly divided into two categories: sampling approaches and deterministic approximations. Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha) \psi_\alpha(X_\alpha) + \sum_X P(X) \log P(X),

with shorthand \psi_\alpha \equiv \log \Psi_\alpha. Minimizing this variational free energy over the set \mathcal{P} of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum and minus the log of the partition function as the value at the minimum:

P_{\rm exact} = \mathop{\rm argmin}_{P \in \mathcal{P}} F(P) \quad \text{and} \quad -\log Z = \min_{P \in \mathcal{P}} F(P).
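These two identities are easy to verify numerically on a tiny model. The sketch below (a single potential over two binary variables, values illustrative) evaluates F at the exact distribution and at a perturbed one; only the exact distribution attains -log Z.

```python
import math
from itertools import product

# One potential Psi(x0, x1) over two binary variables; values illustrative.
psi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
states = list(product((0, 1), repeat=2))
Z = sum(psi[x] for x in states)
P_exact = {x: psi[x] / Z for x in states}

def free_energy(P):
    """F(P) = -sum_X P(X) psi(X) + sum_X P(X) log P(X), with psi = log Psi."""
    energy = -sum(P[x] * math.log(psi[x]) for x in states)
    neg_entropy = sum(P[x] * math.log(P[x]) for x in states if P[x] > 0)
    return energy + neg_entropy

F_min = free_energy(P_exact)          # equals -log Z at the exact distribution
F_uniform = free_energy({x: 0.25 for x in states})   # strictly larger
```

Algebraically, F(P) = -log Z + KL(P || P_exact), which is why any P other than the exact distribution gives a strictly larger value.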
Since the Gibbs-Helmholtz free energy is convex in P, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself, we have not gained anything: the entropy may still be intractable to compute.

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

\sum_X P(X) \log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha) \log P(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} P(x_\beta) \log P(x_\beta),

with x_\beta a (super)node and n_\beta = \sum_{\alpha \supset \beta} 1, the number of potentials that contain node x_\beta. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes x_\beta are themselves subsets of the potential subsets, that is,

x_\beta \cap X_\alpha = \emptyset \ \text{or} \ x_\beta \cap X_\alpha = x_\beta \quad \forall \alpha, \beta,

and partition the domain X,

x_\beta \cap x_{\beta'} = \emptyset \ \forall \beta, \beta' \quad \text{and} \quad \bigcup_\beta x_\beta = X.

Typically the x_\beta are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by \beta and x_\beta in lowercase, to contrast them with the potentials \Psi_\alpha and potential subsets X_\alpha in uppercase.
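On a tree-structured model this entropy approximation is in fact exact, which is one reason belief propagation is exact on trees. A brute-force check on a three-node chain (illustrative potentials, our own toy example) makes the discounting argument concrete: the middle node appears in n = 2 potentials, so its entropy contribution is counted twice and discounted once.

```python
import math
from itertools import product

# A chain x0 - x1 - x2 (a tree), with illustrative pairwise potentials.
psi_01 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
psi_12 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}

states = list(product((0, 1), repeat=3))
w = {x: psi_01[x[0], x[1]] * psi_12[x[1], x[2]] for x in states}
Z = sum(w.values())
P = {x: w[x] / Z for x in states}

def marg(P, idx):
    """Marginal of P onto the variables with the given indices."""
    out = {}
    for x, p in P.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

P01, P12, P1 = marg(P, (0, 1)), marg(P, (1, 2)), marg(P, (1,))

def plogp(Q):
    return sum(p * math.log(p) for p in Q.values() if p > 0)

exact = plogp(P)  # sum_X P(X) log P(X), the left-hand side
# Right-hand side: potential terms minus the overcounted node term (n_1 = 2).
bethe = plogp(P01) + plogp(P12) - (2 - 1) * plogp(P1)
```

For graphs with cycles the two sides generally differ, which is exactly where the Bethe free energy becomes an approximation.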
Note that the Bethe free energy depends on only the marginals P(X_\alpha) and P(x_\beta). We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy,

F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} Q_\beta(x_\beta) \log Q_\beta(x_\beta), \qquad (3.2)

over sets of "pseudomarginals"¹ or beliefs \{Q_\alpha, Q_\beta\}. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,²

\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad \text{and} \quad Q_\alpha(x_\beta) \equiv \sum_{X_{\alpha \setminus \beta}} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \qquad (3.3)

Let \mathcal{Q} denote the set of all consistent and properly normalized pseudomarginals. Then our goal is to solve

\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(Q_\alpha, Q_\beta).
The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals P_{\rm exact}(X_\alpha) and P_{\rm exact}(x_\beta).

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported on by Yedidia et al. (2001). It starts with the Lagrangian

L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) = F(Q_\alpha, Q_\beta) + \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \big[ Q_\beta(x_\beta) - Q_\alpha(x_\beta) \big] + \sum_\alpha \lambda_\alpha \Big[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \Big] + \sum_\beta \lambda_\beta \Big[ 1 - \sum_{x_\beta} Q_\beta(x_\beta) \Big]. \qquad (3.4)
1 Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals. 2 Strictly speaking we also have to take inequality constraints into account, namely, those of the form Qα (Xα ) ≥ 0. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those; they never become “active.” For convenience, we will not consider them any further.
At an extremum of the Bethe free energy satisfying the constraints, all derivatives of L are zero: the ones with respect to the Lagrange multipliers λ give back the constraints; the ones with respect to the pseudomarginals Q give an extremum of the Bethe free energy. Setting the derivatives with respect to Q_\alpha and Q_\beta to zero, we can solve for Q_\alpha and Q_\beta in terms of the Lagrange multipliers:

Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\Big[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \lambda_{\alpha\beta}(x_\beta) \Big]

Q^*_\beta(x_\beta) = \exp\Big[ \frac{1}{n_\beta - 1} \Big( 1 - \lambda_\beta + \sum_{\alpha \supset \beta} \lambda_{\alpha\beta}(x_\beta) \Big) \Big].
In terms of the "message" µβ→α(xβ) ≡ exp[λαβ(xβ)] from node β to potential α, the pseudomarginal Q∗α(Xα) reads

Q∗α(Xα) ∝ Ψα(Xα) Π_{β⊂α} µβ→α(xβ),    (3.5)

where proper normalization yields the Lagrange multiplier λα. With the definition

µα→β(xβ) ≡ Q∗β(xβ) / µβ→α(xβ),    (3.6)

the fixed-point equation for Q∗β(xβ) can, after some manipulation, be written in the form

Q∗β(xβ) ∝ Π_{α⊃β} µα→β(xβ),    (3.7)

where again the Lagrange multiplier λβ follows from normalization. Finally, the constraint Q∗α(xβ) = Q∗β(xβ) in combination with equation 3.6 suggests the update

µα→β(xβ) = Q∗α(xβ) / µβ→α(xβ).    (3.8)

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages, with a step size ε ≤ 1:

log µnew_{α→β}(xβ) = log µα→β(xβ) + ε [ {log Q∗α(xβ) − log µβ→α(xβ)} − log µα→β(xβ) ].    (3.9)
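Equations 3.5 through 3.9 translate directly into an algorithm. The sketch below is a hedged illustration, not the paper's implementation: plain-Python sum-product on a factor graph with the damping of equation 3.9 applied in the log domain. All names (`loopy_bp`, `eps`, the dictionary-based message storage) are inventions of this sketch; `eps = 1` recovers the undamped updates.

```python
import math

def loopy_bp(potentials, num_states, max_iters=200, eps=0.5, tol=1e-10):
    """Damped sum-product on a factor graph (no node potentials, matching
    the convention of this section).

    potentials: dict mapping a tuple of node ids (a subset alpha) to a table
                {state tuple: positive real} for Psi_alpha.
    num_states: dict node id -> number of states.
    eps:        step size of the log-domain damping in equation 3.9.
    """
    # node-to-factor and factor-to-node messages, initialized uniform
    n2f = {(a, b): [1.0] * num_states[b] for a in potentials for b in a}
    f2n = {(a, b): [1.0] * num_states[b] for a in potentials for b in a}

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    for _ in range(max_iters):
        delta = 0.0
        for a, table in potentials.items():
            for b in a:
                # pseudomarginal Q*_alpha summed onto node b:
                # potential times all incoming messages (equation 3.5)
                marg = [0.0] * num_states[b]
                for xs, val in table.items():
                    w = val
                    for node, x in zip(a, xs):
                        w *= n2f[(a, node)][x]
                    marg[xs[a.index(b)]] += w
                # outgoing message = pseudomarginal / incoming message (3.8),
                # damped in the log domain (3.9)
                target = normalize([marg[x] / n2f[(a, b)][x]
                                    for x in range(num_states[b])])
                old = f2n[(a, b)]
                f2n[(a, b)] = normalize([
                    math.exp((1.0 - eps) * math.log(o) + eps * math.log(t))
                    for o, t in zip(old, target)])
                delta = max(delta, max(abs(o - n)
                                       for o, n in zip(old, f2n[(a, b)])))
        # node-to-factor messages: product of the other factors' messages
        for a in potentials:
            for b in a:
                prod = [1.0] * num_states[b]
                for a2 in potentials:
                    if b in a2 and a2 != a:
                        for x in range(num_states[b]):
                            prod[x] *= f2n[(a2, b)][x]
                n2f[(a, b)] = normalize(prod)
        if delta < tol:
            break
    # node beliefs: product of incoming messages (equation 3.7)
    return {b: normalize([math.prod(f2n[(a, b)][x]
                                    for a in potentials if b in a)
                          for x in range(num_states[b])])
            for b in num_states}
```

On a singly connected graph, the damped iteration converges to the exact marginals, which gives a simple sanity check of the sketch.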
Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points are the zero derivatives of the Lagrangian.

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

min_{Qα,Qβ} max_{λαβ,λα,λβ} L(Qα, Qβ, λαβ, λα, λβ).
The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints enforcing nonnegativity of the pseudomarginals indeed are convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters {Qα, Qβ}. This is what makes it a difficult optimization problem. Luckily the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in {Qα, Qβ}.

4.2 Conditions for Convexity. The problem is with the term

Sβ(Qβ) ≡ −Σ_{xβ} Qβ(xβ) log Qβ(xβ),

which is concave in Qβ. Using the constraint Qβ(xβ) = Qα(xβ), we can turn it into a functional that is convex in Qα and Qβ separately, but not necessarily jointly. That is, with the substitution Qβ(xβ) = Qα(xβ) for any α ⊃ β, the entropy, and thus the Bethe free energy, is convex in Qα and in Qβ, but not necessarily in {Qα, Qβ}. However, if we add to Sβ(Qβ) a convex entropy contribution,

−Sα(Qα) ≡ Σ_{Xα} Qα(Xα) log Qα(Xα),
the combination of −Sα and Sβ is convex in {Qα, Qβ}, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1.

Γαβ(Qα, Qβ) ≡ Σ_{Xα} Qα(Xα) log Qα(Xα) − Σ_{xβ} Qα(xβ) log Qβ(xβ)

is convex in {Qα, Qβ}.

Proof.
The matrix with second derivatives of Γαβ has the components

H(Xα, X′α) ≡ ∂²Γαβ / [∂Qα(Xα) ∂Qα(X′α)] = δ_{Xα,X′α} / Qα(Xα)

H(Xα, x′β) ≡ ∂²Γαβ / [∂Qα(Xα) ∂Qβ(x′β)] = −δ_{xβ,x′β} / Qβ(x′β)

H(x′β, x″β) ≡ ∂²Γαβ / [∂Qβ(x′β) ∂Qβ(x″β)] = [Qα(x′β) / Q²β(x′β)] δ_{x′β,x″β},

where we note that Xα and xβ should be interpreted as indices. Convexity requires that for any "vector" (Rα(Xα) Rβ(xβ)),

0 ≤ (Rα Rβ) H (Rα Rβ)ᵀ = Σ_{Xα} R²α(Xα)/Qα(Xα) − 2 Σ_{Xα} Rα(Xα) Rβ(xβ)/Qβ(xβ) + Σ_{xβ} Qα(xβ) R²β(xβ)/Q²β(xβ)
= Σ_{Xα} Qα(Xα) [ Rα(Xα)/Qα(Xα) − Rβ(xβ)/Qβ(xβ) ]²,

which indeed holds, since the quadratic form is a sum of squares.
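The perfect-square identity used in this proof can be checked numerically. The sketch below (all names hypothetical; pure Python) evaluates the quadratic form of the Hessian and the claimed sum-of-squares form on random positive pseudomarginals for a toy two-node subset and confirms that they agree:

```python
import random

def lemma1_identity(num_states=3, seed=0):
    """Evaluate both sides of the perfect-square identity of lemma 1 for a
    toy two-node subset alpha = {1, 2} with beta = node 1: the quadratic
    form of the Hessian versus
    sum_X Q_a(X) * (R_a(X)/Q_a(X) - R_b(x_b)/Q_b(x_b))**2."""
    rng = random.Random(seed)
    states = [(x1, x2) for x1 in range(num_states) for x2 in range(num_states)]
    Q_a = {X: rng.uniform(0.1, 1.0) for X in states}
    Q_b = {x: rng.uniform(0.1, 1.0) for x in range(num_states)}
    R_a = {X: rng.uniform(-1.0, 1.0) for X in states}
    R_b = {x: rng.uniform(-1.0, 1.0) for x in range(num_states)}
    # marginal Q_a(x_beta): Q_a summed over the second node
    Q_am = {x: sum(Q_a[X] for X in states if X[0] == x)
            for x in range(num_states)}
    quad = (sum(R_a[X] ** 2 / Q_a[X] for X in states)
            - 2.0 * sum(R_a[X] * R_b[X[0]] / Q_b[X[0]] for X in states)
            + sum(Q_am[x] * R_b[x] ** 2 / Q_b[x] ** 2
                  for x in range(num_states)))
    square = sum(Q_a[X] * (R_a[X] / Q_a[X] - R_b[X[0]] / Q_b[X[0]]) ** 2
                 for X in states)
    return quad, square
```

Note that the identity holds for arbitrary positive Qα and Qβ; the consistency constraint is not needed for this particular step.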
The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources Qα log Qα to compensate for the concave −Qβ log Qβ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix Aαβ between potentials α and nodes β satisfying

1. Aαβ ≥ 0 ∀α, β⊂α    (positivity)
2. Σ_{β⊂α} Aαβ ≤ 1 ∀α    (sufficient amount of resources)    (4.1)
3. Σ_{α⊃β} Aαβ ≥ nβ − 1 ∀β    (sufficient compensation).
Proof. First, we note that we do not have to worry about the energy terms that are linear in Qα. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

−S(Q) = −[ Σ_α Sα(Qα) − Σ_β (nβ − 1) Sβ(Qβ) ],

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation. For each concave contribution (nβ − 1)Sβ, we have to find convex contributions −Sα to compensate for it. Let Aαβ denote the "amount of resources" that we take from potential subset α to compensate for node β. Now, in shorthand notation and with a little bit of rewriting,

−S(Q) = −Σ_α Sα + Σ_β (nβ − 1) Sβ
= −Σ_α [ 1 − Σ_{β⊂α} Aαβ ] Sα − Σ_α Σ_{β⊂α} Aαβ [ Sα − Sβ ] − Σ_β [ Σ_{α⊃β} Aαβ − (nβ − 1) ] Sβ.

Convexity of the first term is guaranteed if 1 − Σ_{β⊂α} Aαβ ≥ 0 (condition 2), of the second term if Aαβ ≥ 0 (condition 1 and lemma 1), and of the third term if Σ_{α⊃β} Aαβ − (nβ − 1) ≥ 0 (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.
Proof. The proof is by construction. Choose one of the leaf nodes as the root β∗ and define

Aαβ = 1  iff β ⊂ α and β is closer to the root β∗ than any other β′ ⊂ α,
Aαβ = 0  for all other β.

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β∗, there is just a single potential α ⊃ β that is closer to the root β∗ than β itself (see the illustration in Figure 2), and thus there are precisely nβ − 1 contributions Aαβ = 1. The root itself gets nβ∗ contributions Aαβ∗ = 1, which is even better. Hence, condition 3 is also satisfied:

Σ_{α⊃β} Aαβ = nβ − 1  ∀β≠β∗  and  Σ_{α⊃β∗} Aαβ∗ = nβ∗ > nβ∗ − 1.
With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again the proof is by construction. Break the loop at one particular place, that is, remove one node β∗ from a potential α∗ such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β∗ as the root. The matrix A constructed in this way also works for the graph with the closed loop since still

Σ_{α⊃β} Aαβ = nβ − 1  ∀β≠β∗,  and now  Σ_{α⊃β∗} Aαβ∗ = nβ∗ − 1.
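The constructions in the two corollaries are easy to reproduce in code. The sketch below (function names invented, not from the paper) builds the "eating up resources toward the root" allocation — optionally breaking one loop by removing a node from a potential, as in the proof above — and checks the three conditions of equation 4.1:

```python
from collections import deque

def check_conditions(factors, A, tol=1e-12):
    """Check the three conditions of equation 4.1 for an allocation matrix A,
    given as a dict {(alpha, beta): value}; factors is a list of tuples of
    node ids (the potential subsets alpha)."""
    nodes = {b for a in factors for b in a}
    n = {b: sum(1 for a in factors if b in a) for b in nodes}   # n_beta
    positivity = all(A.get((a, b), 0.0) >= 0.0 for a in factors for b in a)
    resources = all(sum(A.get((a, b), 0.0) for b in a) <= 1.0 + tol
                    for a in factors)
    compensation = all(sum(A.get((a, b), 0.0) for a in factors if b in a)
                       >= n[b] - 1 - tol for b in nodes)
    return positivity and resources and compensation

def allocate_toward_root(factors, root, remove=None):
    """'Eating up resources toward the root': every factor allocates its
    unit of resources to its member node closest to the chosen root.
    Passing remove=(alpha, beta) first deletes node beta from factor alpha,
    which breaks a single loop as in the proof of corollary 2."""
    member = {a: [b for b in a if remove is None or (a, b) != remove]
              for a in factors}
    dist, frontier = {root: 0}, deque([root])
    while frontier:                      # BFS over the factor graph
        b = frontier.popleft()
        for a in factors:
            if b in member[a]:
                for b2 in member[a]:
                    if b2 not in dist:
                        dist[b2] = dist[b] + 1
                        frontier.append(b2)
    return {(a, min(member[a], key=lambda b: dist[b])): 1.0 for a in factors}
```

For two connected loops, no allocation can work: the total resources (one unit per potential) fall short of the total compensation Σβ (nβ − 1), which is the counting argument behind the failure discussed next.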
It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions. 4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.
Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers," for the Bethe free energy 1 − nβ with nβ the number of neighboring potentials. The arrows, pointing from potentials α to nodes β, visualize the allocation matrix A with Aαβ = 1 if there is an arrow and Aαβ = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly-connected structure (a), all nonroot nodes have precisely nβ − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − nβ. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.
Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have

Σ_{β∈B} (1 − nβ) + Σ_{α∈π(B)} 1 ≥ 0,    (4.2)

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.

Proposition 1.
The conditions in theorems 1 and 2 are equivalent.
Proof. Let us first suppose that there does exist an allocation matrix Aαβ satisfying the conditions of equation 4.1. Then for any set B,

Σ_{β∈B} (nβ − 1) ≤ Σ_{β∈B} Σ_{α⊃β} Aαβ ≤ Σ_{α∈π(B)} Σ_{β⊂α} Aαβ ≤ Σ_{α∈π(B)} 1,
where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2. Next let us suppose that the conditions in Theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component. Since validity implies validity and violation implies violation, the conditions must be equivalent. Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by “unwrapping” the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy. In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops. The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials α (Xα ) do not play any role. These potentials appear only in the energy term that is linear in the pseudomarginals
and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψα(Xα) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Qα(xβ) = Qβ(xβ) to write the Bethe free energy in the "more convex" form,

F(Qα, Qβ) = −Σ_α Σ_{Xα} Qα(Xα) ψα(Xα) + Σ_α Σ_{Xα} Qα(Xα) log Qα(Xα) − Σ_β Σ_{α⊃β} Aαβ Σ_{xβ} Qα(xβ) log Qβ(xβ),    (5.1)

where the allocation matrix Aαβ can be any matrix that satisfies

Σ_{α⊃β} Aαβ = nβ − 1.    (5.2)
And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Qα alone. This then yields

L(Qα, Qβ, λαβ, λα) = −Σ_α Σ_{Xα} Qα(Xα) ψα(Xα) + Σ_α Σ_{Xα} Qα(Xα) log Qα(Xα)
  − Σ_α Σ_{β⊂α} Aαβ Σ_{xβ} Qα(xβ) log Qβ(xβ)
  + Σ_β Σ_{α⊃β} Σ_{xβ} λαβ(xβ) [ (1/(nβ − 1)) Σ_{α′⊃β} Aα′β Qα′(xβ) − Qα(xβ) ]
  + Σ_α λα [ 1 − Σ_{Xα} Qα(Xα) ] + Σ_β (nβ − 1) [ Σ_{xβ} Qβ(xβ) − 1 ].    (5.3)

3 We would like to conjecture that this is not possible—that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that convexity by itself is sufficient, not necessary.
Note that the constraint Qβ(xβ) = Qα(xβ), as well as its normalization, is no longer incorporated with Lagrange multipliers but follows when we take the minimum with respect to Qβ. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero. Although the Bethe free energy and thus the Lagrangian, equation 5.3, may not be convex in {Qα, Qβ}, they are convex in Qα and Qβ separately. Therefore, we can interchange the minimum over the pseudomarginals Qα and the maximum over the Lagrange multipliers, as long as we leave the minimum over Qβ as the final operation:4

min_{Qα,Qβ} max_{λαβ,λα} L(Qα, Qβ, λαβ, λα) = min_{Qβ} max_{λαβ,λα} min_{Qα} L(Qα, Qβ, λαβ, λα).
Rewriting

Σ_β Σ_{α⊃β} Σ_{xβ} λαβ(xβ) [ (1/(nβ − 1)) Σ_{α′⊃β} Aα′β Qα′(xβ) − Qα(xβ) ] = −Σ_α Σ_{β⊂α} Σ_{xβ} λ̄αβ(xβ) Qα(xβ),

with

λ̄αβ(xβ) ≡ λαβ(xβ) − (1/(nβ − 1)) Σ_{α′⊃β} Aα′β λα′β(xβ),
we can easily solve for the minimum with respect to Qα:

Q∗α(Xα) = Ψα(Xα) exp[ λα − 1 + Σ_{β⊂α} { Aαβ log Qβ(xβ) + λ̄αβ(xβ) } ].    (5.4)
4 In principle, we could also first take the minimum over Qβ and leave the minimum over Qα, but this does not seem to lead to any useful results.
Plugging this into the Lagrangian, we obtain the "dual,"

G(Qβ, λαβ, λα) ≡ L(Q∗α, Qβ, λαβ, λα)
= −Σ_α Σ_{Xα} Ψα(Xα) exp[ λα − 1 + Σ_{β⊂α} { Aαβ log Qβ(xβ) + λ̄αβ(xβ) } ]
  + Σ_α λα + Σ_β (nβ − 1) [ Σ_{xβ} Qβ(xβ) − 1 ].    (5.5)
Next, we find for the maximum with respect to λα,

exp[ 1 − λ∗α ] = Σ_{Xα} Ψα(Xα) exp[ Σ_{β⊂α} { Aαβ log Qβ(xβ) + λ̄αβ(xβ) } ] ≡ Z∗α,    (5.6)
where we have to keep in mind that Z∗α by itself, like Q∗α, is a function of the remaining pseudomarginals Qβ and Lagrange multipliers λαβ. Substituting this solution into the dual, we arrive at

G(Qβ, λαβ) ≡ G(Qβ, λαβ, λ∗α) = −Σ_α log Z∗α + Σ_β (nβ − 1) [ Σ_{xβ} Qβ(xβ) − 1 ].    (5.7)
Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Qα, has a unique minimum in Qα (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ∗α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation. 5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λαβ and then the minimum over the remaining pseudomarginals Qβ. The duality theorem, a standard result from constrained optimization (see, e.g., Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in
Qβ. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution. Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Qα, Qβ} implies convexity of the dual, equation 5.7, in Qβ.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y∗(x) ≡ argmin_y f(x, y),

f(x + δ, y∗(x + δ)) + f(x − δ, y∗(x − δ)) ≥ 2 f(x, [y∗(x + δ) + y∗(x − δ)]/2) ≥ 2 f(x, y∗(x)),

where the first inequality follows from the convexity of f in {x, y} and the second inequality from y∗(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Qβ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Qα, Qβ}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λαβ, λα}. Next, we note that the maximum of a convex or concave function over its maximizing parameters is again convex: with y∗(x) ≡ argmax_y f(x, y),

f(x + δ, y∗(x + δ)) + f(x − δ, y∗(x − δ)) ≥ f(x + δ, y∗(x)) + f(x − δ, y∗(x)) ≥ 2 f(x, y∗(x)),

where the first inequality follows from y∗(x ± δ) being the unique maximum of f(x ± δ, y) and the second inequality from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Qβ.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψα(Xα) ≤ Ψmax for all α and Xα, the Bethe free energy is bounded from below on the set of constraints.
Proof. It is sufficient to prove that the function G(Qβ) ≡ max_{λαβ} G(Qβ, λαβ) is bounded from below for a particular choice of Aαβ satisfying equation 5.2. Considering Aαβ = (nβ − 1)/nβ, we then have

G(Qβ) ≥ −Σ_α log Σ_{Xα} Ψα(Xα) exp[ Σ_{β⊂α} ((nβ − 1)/nβ) log Qβ(xβ) ] + Σ_β (nβ − 1) [ Σ_{xβ} Qβ(xβ) − 1 ]
 ≥ −Σ_α Σ_{β⊂α} ((nβ − 1)/nβ) log[ Σ_{Xα} Ψα(Xα) Qβ(xβ) ] + Σ_β (nβ − 1) [ Σ_{xβ} Qβ(xβ) − 1 ]
 ≥ −Σ_α Σ_{β⊂α} ((nβ − 1)/nβ) log[ Ψmax Σ_{Xα\β} 1 ] + Σ_β (nβ − 1) [ −log Σ_{xβ} Qβ(xβ) + Σ_{xβ} Qβ(xβ) − 1 ]
 ≥ −Σ_α Σ_{β⊂α} ((nβ − 1)/nβ) log[ Ψmax Σ_{Xα\β} 1 ],

where the first inequality follows by substituting the choice λαβ(xβ) = 0 for all α, β, and xβ in G(Qβ, λαβ), the second from the concavity of the function y^{(nβ−1)/nβ}, the third from the upper bound on the potentials, and the last step from log y ≤ y − 1.
6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian—the second derivative of the dual with respect to the pseudomarginals Qβ. The first derivative yields

∂G/∂Qβ(xβ) = −Σ_{α⊃β} Aαβ Q∗α(xβ)/Qβ(xβ) + (nβ − 1),

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives

Hββ′(xβ, xβ′) ≡ ∂²G / [∂Qβ(xβ) ∂Qβ′(xβ′)],

we make use of

∂Q∗α(xβ)/∂Qβ′(xβ′) = Aαβ′ [ Q∗α(xβ, xβ′) − Q∗α(xβ) Q∗α(xβ′) ] / Qβ′(xβ′),

where both β and β′ should be a subset of α and with the conventions Q∗α(xβ, xβ) = Q∗α(xβ) and Q∗α(xβ, x′β) = 0 if x′β ≠ xβ. Here, the first term follows from the differentiation of equation 5.4 and the second term from the normalization as in equation 5.6. Distinguishing between β′ = β and β′ ≠ β, we then have

Hββ(xβ, x′β) = Σ_{α⊃β} Aαβ (1 − Aαβ) [ Q∗α(xβ)/Q²β(xβ) ] δ_{xβ,x′β} + Σ_{α⊃β} A²αβ Q∗α(xβ) Q∗α(x′β) / [ Qβ(xβ) Qβ(x′β) ]

Hββ′(xβ, xβ′) = −Σ_{α⊃{β,β′}} Aαβ Aαβ′ [ Q∗α(xβ, xβ′) − Q∗α(xβ) Q∗α(xβ′) ] / [ Qβ(xβ) Qβ′(xβ′) ]  for β′ ≠ β,

where δ_{xβ,x′β} = 1 if and only if x′β = xβ. Here, it should be noted that both β and xβ play the role of indices; that is, xβ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λαβ and pseudomarginals Qβ. The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Qβ, λαβ}, that is, conditions that guarantee

K ≡ Σ_{β,β′} Σ_{xβ,xβ′} Sβ(xβ) Hββ′(xβ, xβ′) Sβ′(xβ′) ≥ 0    (K)

for any choice of the "vector" S with elements Sβ(xβ). Straightforward manipulations yield

K = Σ_α Σ_{β⊂α} Σ_{xβ} Aαβ (1 − Aαβ) Q∗α(xβ) R²β(xβ)    (K1)
 + Σ_α Σ_{β,β′⊂α} Σ_{xβ,xβ′} Aαβ Aαβ′ Q∗α(xβ) Q∗α(xβ′) Rβ(xβ) Rβ′(xβ′)    (K2)
 − Σ_α Σ_{β≠β′⊂α} Σ_{xβ,xβ′} Aαβ Aαβ′ Q∗α(xβ, xβ′) Rβ(xβ) Rβ′(xβ′),    (K3)

where Rβ(xβ) ≡ Sβ(xβ)/Qβ(xβ).
6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

K2 = Σ_α [ Σ_{β⊂α} Σ_{xβ} Aαβ Q∗α(xβ) Rβ(xβ) ]² ≥ 0

and5

K3 = Σ_α Σ_{β≠β′⊂α} Σ_{xβ,xβ′} Aαβ Aαβ′ Q∗α(xβ, xβ′) × [ (1/2)(Rβ(xβ) − Rβ′(xβ′))² − (1/2)R²β(xβ) − (1/2)R²β′(xβ′) ]
 ≥ Σ_α Σ_{β⊂α} Σ_{xβ} Aαβ [ Aαβ − Σ_{β′⊂α} Aαβ′ ] Q∗α(xβ) R²β(xβ),    (6.1)

we have

K = K1 + K2 + K3 ≥ Σ_α Σ_{β⊂α} Σ_{xβ} Aαβ [ 1 − Σ_{β′⊂α} Aαβ′ ] Q∗α(xβ) R²β(xβ).

That is, sufficient conditions for K to be nonnegative are

Aαβ ≥ 0 ∀α, β⊂α  and  Σ_{β⊂α} Aαβ ≤ 1 ∀α,
precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case and these fake interactions drop out as we would expect them to. Suppose that we have a fake interaction Ψα(Xα) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q∗α(Xα) factorizes:6

Q∗α(xβ, xβ′) = Q∗α(xβ) Q∗α(xβ′)  ∀{β,β′}⊂α.

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.
6 The exact marginal Pexact(Xα) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.
Consequently, the terms involving α in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:

K2 + K3 = Σ_α Σ_{β⊂α} Σ_{xβ,x′β} A²αβ Q∗α(xβ) Q∗α(x′β) Rβ(xβ) Rβ(x′β)    (K̃2)
 − Σ_α Σ_{β≠β′⊂α} Σ_{xβ,xβ′} Aαβ Aαβ′ [ Q∗α(xβ, xβ′) − Q∗α(xβ) Q∗α(xβ′) ] Rβ(xβ) Rβ′(xβ′).    (K̃3)

This leaves us with the weaker requirement (from K1) Aαβ(1 − Aαβ) ≥ 0 for all β ⊂ α. The best choice is then to take Aαβ = 1, which turns condition 3 of equation 4.1 into

Σ_{α′⊃β, α′≠α} Aα′β + 1 ≥ nβ − 1.
The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials nβ by 1 for all β that are part of the fake interaction α. We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this “success,” we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K˜ 2 and K˜ 3 where, since K˜ 2 ≥ 0, we will concentrate on K˜ 3 . 7 The Strength of a Potential 7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K˜ 3 such that we can again combine this bound with the (positive) term K1 . However, before we get to that, we take some time to introduce and derive properties of the “strength” of a potential. Lemma 2.
Two-node correlations of loopy belief marginals obey the bound

Q∗α(xβ, xβ′) − Q∗α(xβ) Q∗α(xβ′) ≤ σα Q∗α(xβ, xβ′)  ∀{β,β′}⊂α, β′≠β, ∀xβ,xβ′,    (7.1)

with the "strength" σα a function of the potential ψα(Xα) ≡ log Ψα(Xα) only:

σα = 1 − exp(−ωα)  with  ωα ≡ max_{Xα,X̂α} [ ψα(Xα) + (nα − 1) ψα(X̂α) − Σ_{β⊂α} ψα(X̂α\β, xβ) ],    (7.2)

where nα ≡ Σ_{β⊂α} 1.
Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals. Substituting the properly normalized version of equation 3.5 — a loopy belief pseudomarginal is proportional to the potential times incoming messages — and canceling the messages µβ(xβ), we obtain

Q∗(X) / Π_{β=1}^{n} Q∗(xβ) = Ψ(X) [ Σ_{X′} Ψ(X′) Π_β µβ(x′β) ]^{n−1} / Π_{β=1}^{n} [ Σ_{X\β} Ψ(X\β, xβ) Π_{β′≠β} µβ′(xβ′) ].    (7.3)

The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages µ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

V(µ) = (n − 1) log Σ_X Ψ(X) Π_{β=1}^{n} µβ(xβ) − Σ_{β=1}^{n} log Σ_{X\β} Ψ(X\β, x∗β) Π_{β′≠β} µβ′(xβ′),
with respect to the messages µ under the constraints Σ_{xβ} µβ(xβ) = 1 for all β and µβ(xβ) ≥ 0 for all β and xβ, occurs at an extreme point µβ(xβ) = δ_{xβ,x̂β} for some x̂β to be found.

Proof. Let us consider optimizing the message µ1(x1) with fixed messages µβ(xβ) for β > 1. The first and second derivatives are easily found to obey

∂V/∂µ1(x1) = [ (n − 1) Q(x1) − Σ_{β≠1} Q(x1 | x∗β) ] / µ1(x1)

∂²V / [∂µ1(x1) ∂µ1(x′1)] = −[ (n − 1) Q(x1) Q(x′1) − Σ_{β≠1} Q(x1 | x∗β) Q(x′1 | x∗β) ] / [ µ1(x1) µ1(x′1) ],

where

Q(X) ≡ Ψ(X) Π_β µβ(xβ) / Σ_{X′} Ψ(X′) Π_β µβ(x′β).

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, µ1(x1) > 0 for two or more values of x1. At such an extremum, the first derivative should obey

(n − 1) Q(x1) − Σ_{β≠1} Q(x1 | x∗β) = λ,

with λ a Lagrange multiplier implementing the constraint Σ_{x1} µ1(x1) = 1. Summing over x1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of µ). For the matrix with second derivatives at such an extremum, we then have

∂²V / [∂µ1(x1) ∂µ1(x′1)] ∝ Σ_{β≠1} Q(x1 | x∗β) Q(x′1 | x∗β) − (1/(n − 1)) Σ_{β≠1} Σ_{β′≠1} Q(x1 | x∗β) Q(x′1 | x∗β′),

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of µβ(xβ), β > 1, it follows by induction that the maximum with respect to all µβ(xβ) must be at an extreme point as well.

The function V(µ) is, up to a term independent of µ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages µ by maximization over values X̂:

max_µ [ Q∗(X) / Π_β Q∗(xβ) ] = max_{X̂} [ Ψ(X) Ψ(X̂)^{n−1} / Π_β Ψ(X̂\β, xβ) ].

Next, we take the maximum over X as well and define the "strength" σ to be used in equation 7.1 through

1/(1 − σ) ≡ max_{X,µ} [ Q∗(X) / Π_β Q∗(xβ) ] = max_{X,X̂} [ Ψ(X) Ψ(X̂)^{n−1} / Π_β Ψ(X̂\β, xβ) ].    (7.4)
The inequality 7.1 then follows by summing out X\{β,β′} in

Q∗(X) − Π_β Q∗(xβ) ≤ σ Q∗(X).

The form of equation 7.2 then follows by rewriting equation 7.4 as

ω ≡ −log(1 − σ) = max_{X,X̂} W(X; X̂)  with  W(X; X̂) = ψ(X) + (n − 1) ψ(X̂) − Σ_β ψ(X̂\β, xβ),
where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes, that is,

if Ψ(X) = Ψ̃(X) Π_β µβ(xβ), then ω(Ψ̃) = ω(Ψ) for any choice of µ.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̂ that differ in fewer than two nodes. To see this, consider

W(x1, x2, x\1\2; x̂1, x̂2, x\1\2) = ψ(x1, x2, x\1\2) + ψ(x̂1, x̂2, x\1\2) − ψ(x̂1, x2, x\1\2) − ψ(x1, x̂2, x\1\2) = −W(x1, x̂2, x\1\2; x̂1, x2, x\1\2).

If now also x̂2 = x2, we get W(x1, x\1; x̂1, x\1) = −W(x1, x\1; x̂1, x\1) = 0. Furthermore, if W(x1, x2, x\1\2; x̂1, x̂2, x\1\2) ≤ 0, then it must be that W(x1, x̂2, x\1\2; x̂1, x2, x\1\2) ≥ 0 and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.
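The enumeration of combinations described above can be sketched in a few lines. The helper below is hypothetical (not the paper's code); it evaluates W(X; X̂) of equation 7.2 over all pairs of configurations of a single potential and returns ω together with σ = 1 − exp(−ω):

```python
import itertools
import math

def strength(psi, sizes):
    """Brute-force evaluation of equation 7.2 for one potential subset.

    psi:   function mapping a configuration tuple X to psi(X) = log Psi(X).
    sizes: list with the number of states of each member node.
    Returns (omega, sigma) with sigma = 1 - exp(-omega)."""
    n = len(sizes)
    configs = list(itertools.product(*[range(s) for s in sizes]))
    omega = 0.0   # W vanishes for X = X-hat, so omega >= 0
    for X in configs:
        for Xh in configs:
            w = psi(X) + (n - 1) * psi(Xh)
            for b in range(n):
                # X-hat with node b's component replaced by its value in X
                w -= psi(Xh[:b] + (X[b],) + Xh[b + 1:])
            omega = max(omega, w)
    return omega, 1.0 - math.exp(-omega)
```

For a pairwise binary Boltzmann factor ψ(x1, x2) = w·x1·x2 + θ1·x1 + θ2·x2, this enumeration reproduces ω = |w| independently of the thresholds, and ω = 0 for a factorizing (fake) potential.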
• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x1||x2|(|x1| − 1)(|x2| − 1)/4 combinations. And indeed, for binary nodes x1,2 ∈ {0, 1}, we immediately obtain

ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)|.    (7.5)

Any pairwise binary potential can be written as a Boltzmann factor: Ψ(x1, x2) ∝ exp[w x1 x2 + θ1 x1 + θ2 x2]. In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models, there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ̃(X)/T], where ψ̃(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix Aαβ between potentials α and nodes β with properties

1. Aαβ ≥ 0 ∀α, β⊂α    (positivity)
2. (1 − σα) max_{β⊂α} Aαβ + σα Σ_{β⊂α} Aαβ ≤ 1 ∀α    (sufficient amount of resources)    (8.1)
3. Σ_{α⊃β} Aαβ ≥ nβ − 1 ∀β    (sufficient compensation),
with the strength σα a function of the potential Ψα(Xα) as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with
extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex/concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K̃2 + K̃3 ≥ 0 for any choice of Rβ(xβ). Substituting the bound, equation 7.1, into the term K̃3, we obtain

K̃3 ≥ −Σ_α Σ_{β≠β′⊂α} Σ_{xβ,xβ′} Aαβ Aαβ′ σα Q∗α(xβ, xβ′) Rβ(xβ) Rβ′(xβ′)
  ≥ −Σ_α σα Σ_{β⊂α} Σ_{xβ} Aαβ [ Σ_{β′⊂α, β′≠β} Aαβ′ ] Q∗α(xβ) R²β(xβ),

where in the last step, we applied the same trick as in equation 6.1. Since K̃2 ≥ 0, combining K1 and (the above lower bound on) K̃3, we get

K = K1 + K̃2 + K̃3 ≥ Σ_α Σ_{β⊂α} Σ_{xβ} Aαβ [ 1 − Aαβ − σα Σ_{β′⊂α, β′≠β} Aαβ′ ] Q∗α(xβ) R²β(xβ).

This implies

(1 − σα) Aαβ + σα Σ_{β′⊂α} Aαβ′ ≤ 1  ∀α, β⊂α,
which, in combination with Aαβ ≥ 0 and σα ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences. We get back the stricter conditions of theorem 1 if σα = 1 for all potentials α. Furthermore, “fake interactions” play no role: with σα = 0, condition 2 becomes maxβ⊂α Aαβ ≤ 1, suggesting the choice Aαβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials nβ in condition 3. 8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs
Uniqueness of Loopy Belief Propagation Fixed Points
2405
with a single loop and Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs. The setup in Tatikonda and Jordan (2002) is slightly different; it is based on the factorization
$$P_{\mathrm{exact}}(X) = \frac{1}{Z} \prod_{\alpha} \hat{\psi}_\alpha(X_\alpha) \prod_{\beta} \hat{\psi}_\beta(x_\beta),$$
to be compared with our equation 3.1, where there are no self-potentials ψ̂β(xβ). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if
$$\sum_{\alpha \supset \beta} \Bigl[ \max_{X_\alpha} \hat{\psi}_\alpha(X_\alpha) - \min_{X_\alpha} \hat{\psi}_\alpha(X_\alpha) \Bigr] < 2 \quad \forall_\beta. \tag{8.2}$$
To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if
$$\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \tag{8.3}$$
with ωα defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of the self-potentials ψ̂β(xβ). In fact, it is valid for any choice
$$\hat{\psi}_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),$$
where ψα(Xα) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting α and renumbering the nodes from 1 to 2, we have
$$\min_{\phi_1,\phi_2} \Bigl[ \max_{x_1,x_2} \hat{\psi}(x_1,x_2) - \min_{x_1,x_2} \hat{\psi}(x_1,x_2) \Bigr] = \min_{\phi_1,\phi_2} \Bigl[ \max_{x_1,x_2} [\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)] - \min_{x_1,x_2} [\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)] \Bigr].$$
In the case of binary nodes (two-by-two matrices ψ(x1, x2)), it is easy to check that the optimal φ1 and φ2 that yield the smallest gap are such that
$$\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\hat{x}_1,\hat{x}_2) + \phi_1(\hat{x}_1) + \phi_2(\hat{x}_2) \geq \psi(x_1,\hat{x}_2) + \phi_1(x_1) + \phi_2(\hat{x}_2) = \psi(\hat{x}_1,x_2) + \phi_1(\hat{x}_1) + \phi_2(x_2), \tag{8.4}$$
for some x1, x2, x̂1, and x̂2 with x1 ≠ x̂1 and x2 ≠ x̂2. Solving for φ1 and φ2, we find
$$\phi_1(x_1) - \phi_1(\hat{x}_1) = \frac{1}{2} \bigl[ \psi(\hat{x}_1,x_2) - \psi(x_1,\hat{x}_2) + \psi(\hat{x}_1,\hat{x}_2) - \psi(x_1,x_2) \bigr]$$
$$\phi_2(x_2) - \phi_2(\hat{x}_2) = \frac{1}{2} \bigl[ \psi(x_1,\hat{x}_2) - \psi(\hat{x}_1,x_2) + \psi(\hat{x}_1,\hat{x}_2) - \psi(x_1,x_2) \bigr].$$
Substitution back into equation 8.4 yields
$$\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1,\hat{x}_2) - \phi_1(x_1) - \phi_2(\hat{x}_2) = \frac{1}{2} \bigl[ \psi(x_1,x_2) + \psi(\hat{x}_1,\hat{x}_2) - \psi(\hat{x}_1,x_2) - \psi(x_1,\hat{x}_2) \bigr],$$
which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find
$$\min_{\phi_1,\phi_2} \Bigl[ \max_{x_1,x_2} \hat{\psi}(x_1,x_2) - \min_{x_1,x_2} \hat{\psi}(x_1,x_2) \Bigr] = \frac{1}{2} \bigl| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \bigr| = \frac{\omega}{2},$$
with ω from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next we derive the following weaker corollary of theorem 4:
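The closed-form gap ω/2 in the proof of corollary 3 can be checked numerically. The sketch below (my own helper names; a grid search stands in for the exact optimization over the self-potentials φ1, φ2) scans the two free differences φ(0) − φ(1) and confirms that the smallest achievable gap matches ω/2 for a random binary potential:

```python
import random

random.seed(0)

def gap(psi, d1, d2):
    # max - min of psi(x1, x2) + phi1(x1) + phi2(x2) over binary x1, x2;
    # only the differences d1 = phi1(0)-phi1(1), d2 = phi2(0)-phi2(1) matter.
    v00 = psi[0][0] + d1 + d2
    v01 = psi[0][1] + d1
    v10 = psi[1][0] + d2
    v11 = psi[1][1]
    return max(v00, v01, v10, v11) - min(v00, v01, v10, v11)

psi = [[random.uniform(0.0, 1.0) for _ in range(2)] for _ in range(2)]
omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

# scan the two free differences on a fine grid over [-2, 2]
grid = [i / 100.0 for i in range(-200, 201)]
best = min(gap(psi, d1, d2) for d1 in grid for d2 in grid)

# the optimized gap equals omega / 2, up to the grid resolution
assert abs(best - omega / 2.0) < 0.03
```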
Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if
$$\sum_{\alpha \supset \beta} \omega_\alpha \leq 1 \quad \forall_\beta, \tag{8.5}$$
with ωα defined in equation 7.2.

Proof. Consider the allocation matrix with components Aαβ = 1 − σα for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σα ≤ 1 and (condition 2)
$$(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2\sigma_\alpha(1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1.$$
Substitution into condition 3 yields
$$\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \geq \sum_{\alpha \supset \beta} 1 - 1 \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \leq 1. \tag{8.6}$$
Since ωα = −log(1 − σα) ≥ σα, condition 8.5 is weaker than condition 8.6.

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration we consider a 3 × 3 Ising grid with toroidal boundary conditions as in Figure 3a and uniform ferromagnetic potentials proportional to
$$\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.$$
The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical αcritical = 2/3 ≈ 0.67. For α > 2/3, we find two minima: one with “spins up” and the other one with “spins down.” In this symmetric problem, the strength of each potential is given by
$$\omega = 2 \log \frac{\alpha}{1 - \alpha} \quad \text{and thus} \quad \sigma = 1 - \Bigl( \frac{1 - \alpha}{\alpha} \Bigr)^2.$$
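These two expressions are easy to sanity-check; a short script (my own helper names) verifies that they are consistent with σ = 1 − e^{−ω} and reproduces the α thresholds quoted for the toroidal grid in this section:

```python
import math

def strength(alpha):
    """Strength of the uniform binary Ising potential [[a, 1-a], [1-a, a]]:
    omega = 2*log(alpha/(1-alpha)) and sigma = 1 - ((1-alpha)/alpha)**2."""
    omega = 2.0 * math.log(alpha / (1.0 - alpha))
    sigma = 1.0 - ((1.0 - alpha) / alpha) ** 2
    return omega, sigma

# omega and sigma are consistent with sigma = 1 - exp(-omega)
for a in (0.55, 0.6, 0.67, 0.8):
    omega, sigma = strength(a)
    assert abs(sigma - (1.0 - math.exp(-omega))) < 1e-12

# Toroidal 3x3 grid, A = 3/4: sigma <= 1/3 translates to alpha <= 1/(1+sqrt(2/3))
assert abs(1.0 / (1.0 + math.sqrt(2.0 / 3.0)) - 0.5505) < 5e-4
# Corollary 3 (four neighbors per node): omega < 1, i.e. alpha < 1/(1+exp(-1/2))
assert abs(1.0 / (1.0 + math.exp(-0.5)) - 0.6225) < 5e-4
```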
Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.
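A small reasoning step left implicit in the text: each bound on σ in this section translates into a bound on α by inverting the strength relation σ = 1 − ((1 − α)/α)² stated above:

```latex
\sigma \;=\; 1 - \left(\frac{1-\alpha}{\alpha}\right)^{2} \;\le\; s
\;\Longleftrightarrow\;
\frac{1-\alpha}{\alpha} \;\ge\; \sqrt{1-s}
\;\Longleftrightarrow\;
\alpha \;\le\; \frac{1}{1+\sqrt{1-s}} .
```

For example, s = 1/3 gives α ≤ 1/(1 + √(2/3)) ≈ 0.55, the toroidal-grid value quoted in the text.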
The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields
$$\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{2/3}} \approx 0.55.$$
The critical value that follows from corollary 3 is in this case slightly better:
$$\omega < 1 \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + e^{-1/2}} \approx 0.62.$$
Next we consider the same grid with aperiodic boundary conditions as in Figure 3b. Numerically, we find a critical αcritical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem, it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4 combined with symmetry considerations yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:
$$(2 - 2A)\sigma + \frac{3}{4} \leq 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \leq 1.$$
The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding
$$\sigma \leq \frac{1}{2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1/2}} \approx 0.58,$$
still slightly worse than the condition from corollary 3. An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis following the same recipe as for Figure 3b yields A = 1 − √(1/8) with
$$\sigma \leq \frac{1}{\sqrt{2}} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1 - 1/\sqrt{2}}} \approx 0.65,$$
better than the α < 0.62 from corollary 3 and to be compared with the critical αcritical ≈ 0.88.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. Nevertheless, they have the following positive features:
• They generalize the conditions for convexity of the Bethe free energy.
• They incorporate the (local) strength of potentials.
• They scale naturally as a function of the “temperature.”
• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. And the second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered “bound optimization algorithms,” similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and have a unique minimum.
Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Qβ, λαβ) with respect to Qβ. But in fact we need only G(Qβ) ≡ maxλαβ G(Qβ, λαβ) to be convex, which is a weaker requirement. The Hessian of G(Qβ), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions or substituting a particular choice of Aαβ).
• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.
Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise. We consider a Boltzmann machine with four binary nodes, weights
$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$
zero thresholds, and potentials ψij(xi, xj) = exp[wij/4] if xi = xj and ψij(xi, xj) = exp[−wij/4] if xi ≠ xj. Running loopy belief propagation, possibly damped as in equation 3.9, we observe “convergent” and “nonconvergent” behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with Pi(xi) = 0.5 for all nodes i and xi ∈ {0, 1}, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this “convergent” and “nonconvergent” behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size). For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node) for very large weights: loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work we hope to elaborate on these issues.
7 Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where “nonconvergent” behavior sets in.
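The simulation described above can be reproduced in a few lines. The following sketch is my own implementation, with equation 3.9's damping replaced by a simple convex combination of old and new messages; for a small weight strength it converges to the trivial fixed point:

```python
import numpy as np

# Damped loopy belief propagation on the four-node Boltzmann machine from the
# text: weights w = omega * W, zero thresholds, and pairwise potentials
# psi_ij(xi, xj) = exp(+w_ij/4) if xi == xj, exp(-w_ij/4) otherwise.
W = np.array([[0, 1, -1, -1],
              [1, 0, 1, -1],
              [-1, 1, 0, -1],
              [-1, -1, -1, 0]], dtype=float)

def loopy_bp(omega, step=0.2, iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(W)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    psi = {}
    for i, j in edges:
        w = omega * W[i, j]
        psi[i, j] = np.array([[np.exp(w / 4), np.exp(-w / 4)],
                              [np.exp(-w / 4), np.exp(w / 4)]])
    # m[i, j] is the (normalized) message from node i to node j, random init
    m = {e: rng.uniform(0.5, 1.5, size=2) for e in edges}
    for e in m:
        m[e] /= m[e].sum()
    for _ in range(iters):
        new = {}
        for i, j in edges:
            incoming = np.ones(2)
            for k in range(n):
                if k != i and k != j:
                    incoming *= m[k, i]
            msg = psi[i, j].T @ incoming
            msg /= msg.sum()
            new[i, j] = (1 - step) * m[i, j] + step * msg  # damped update
        m = new
    # node marginals: product of all incoming messages (no self-potentials)
    P = np.ones((n, 2))
    for i, j in edges:
        P[j] *= m[i, j]
    return P / P.sum(axis=1, keepdims=True)

# small weight strength: converges to the trivial fixed point P_i = (0.5, 0.5)
P = loopy_bp(omega=1.0, step=0.2)
assert np.allclose(P, 0.5, atol=1e-4)
```

Raising `omega` toward the strengths reported in the text, one can reproduce the dependence of the convergent/nonconvergent transition on the step size.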
Figure 4: The transition between “convergent” and “nonconvergent” behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.
Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical αcritical values in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.
Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.
Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.
Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's “belief propagation” algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.
McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.
Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.
Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.
Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.
Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.
Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.
Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.
Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.
Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.
Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.
Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.
LETTER
Communicated by Steven Nowlan
Neural Network Uncertainty Assessment Using Bayesian Statistics: A Remote Sensing Application F. Aires [email protected] Department of Applied Physics and Applied Mathematics, Columbia University, NASA Goddard Institute for Space Studies, New York, NY 10025, U.S.A., and CNRS/IPSL/Laboratoire de Météorologie Dynamique, École Polytechnique, 91128 Palaiseau Cedex, France
C. Prigent [email protected] CNRS, LERMA, Observatoire de Paris, Paris 75014, France
W.B. Rossow [email protected] NASA Goddard Institute for Space Studies, New York, NY 10025, U.S.A.
Neural network (NN) techniques have proved successful for many regression problems, in particular for remote sensing; however, uncertainty estimates are rarely provided. In this article, a Bayesian technique to evaluate uncertainties of the NN parameters (i.e., synaptic weights) is first presented. In contrast to more traditional approaches based on point estimation of the NN weights, we assess uncertainties on such estimates to monitor the robustness of the NN model. These theoretical developments are illustrated by applying them to the problem of retrieving surface skin temperature, microwave surface emissivities, and integrated water vapor content from a combined analysis of satellite microwave and infrared observations over land. The weight uncertainty estimates are then used to compute analytically the uncertainties in the network outputs (i.e., error bars and correlation structure of these errors). Such quantities are very important for evaluating any application of an NN model. The uncertainties on the NN Jacobians are then considered in the third part of this article. Used for regression fitting, NN models can be used effectively to represent highly nonlinear, multivariate functions. In this situation, most emphasis is put on estimating the output errors, but almost no attention has been given to errors associated with the internal structure of the regression model. The complex structure of dependency inside the NN is the essence of the model, and assessing its quality, coherency, and physical character makes all the difference between a black-box model with small output errors and a reliable, robust, and physically coherent model. Such dependency structures are described to the first order by the NN Jacobians: they indicate the sensitivity of one output with respect to the inputs of the model for given input data. We use a Monte Carlo integration procedure to estimate the robustness of the NN Jacobians. A regularization strategy based on principal component analysis is proposed to suppress the multicollinearities in order to make these Jacobians robust and physically meaningful.

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2415–2458 (2004)

1 Introduction

Neural network (NN) techniques have proved very successful in developing computationally efficient algorithms for geophysical applications. We are interested, in this study, in the application of the NN retrieval methods for satellite remote sensing (Aires, Rossow, Scott, & Chédin, 2002a): the NN is used as a nonlinear multivariate regression to represent the inverse radiative transfer function in the atmosphere. This is an application of the inverse theory: remote sensing requires the estimation of geophysical variables from indirect measurements by applying the inverse radiative transfer function to radiative measurements. NNs are well adapted to solve nonlinear problems and are especially designed to capitalize more completely on the inherent statistical relationships among the input and output variables. A rigorous statistical approach requires not only a minimization of output errors, but also an uncertainty estimate of the model parameters (Saltelli, Chan, & Scott, 2000). The reliability of the inverse model is as important as its answer, but until now, probably because of the lack of adequate tools, the uncertainty of an NN statistical model has rarely been quantified. Our work is based on the developments of Le Cun, Denker, and Solla (1990) and MacKay (1992).
These studies introduced error bar estimates for neural networks using a Bayesian approach, but these tools were developed and tested in simple cases for a unique network output. In this article, we use a slightly different approach from the more traditional full Bayesian method, where scalar hyperparameters are estimated using the so-called evidence approach. A multiple output method is used in order to develop uncertainty tools for real-world applications. Our Bayesian methodology first provides uncertainty estimates for the parameters of the neural network (i.e., the network weights). A similar approach, in a simpler presentation (a monovariate case), is used, for example, in Bishop (1996), Neal (1996), and Nabney (2002). The robustness of the NN parameters is assessed using the Hessian matrix (second derivative) of the log likelihood with respect to the NN weights. Uncertainty estimates for the parameters of the neural network can then be used for the determination of a variety of other probabilistic quantities related to the overall stochastic character of the NN model. Such possible applications can use theoretical derivations when they are available. In this
article, one such analytical application provides uncertainty estimates of the network output (error bars plus their correlation structure). Reliability of the NN predictions is very important for any application. Confidence intervals (CI) have been developed for classical linear regression theory with well-established results (e.g., Koroliouk, Portenko, Skorokhod, & Tourbine, 1983). For nonlinear models, such results are more recent (Bates & Watts, 1988), and for NNs they are rarely available. Generally, only the root mean square (RMS) of the generalization error is provided, but this single quantity is not situation dependent. Other approaches use bootstrap techniques to estimate such CI, but they are limited by the large number of computations that such techniques require. Recently, Rivals and Personnaz (2000, 2003) introduced a new method for estimating CI by using a linear Taylor expansion of the NN outputs (which makes traditional estimation of CI for nonlinear models a tractable problem). In this article, we separate the errors that are due to the NN weight uncertainty and the errors from all remaining sources. Such additional sources of uncertainty can be, for example, noise in the inputs of the NN (Wright, Ramage, Cornford, & Nabney, 2000). We will comment on an approach to analyze in even more detail the various contributions to output errors. These errors are described in terms of covariance matrices that can be interpreted using eigenvectors called error patterns (Rodgers, 1990). When a theoretical derivation is too complex to be obtained, another possible application of weight uncertainty is the empirical estimation of probabilistic quantities. Modern Bayesian statistics are used here together with Monte Carlo (MC) simulations (Gelman, Carlin, Stern, & Rubin, 1995) to estimate uncertainties on the NN Jacobian. These Jacobians, or sensitivities, of an NN model are defined as the partial first derivatives of the model outputs with respect to its inputs.
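The Hessian-based weight uncertainty mentioned above is easiest to see in the linear-Gaussian special case, where the Laplace approximation is exact. The sketch below is a toy setup of my own (a linear "network," scalar stand-ins for the hyperparameter matrices), not the authors' retrieval model: the posterior weight covariance is the inverse Hessian of the negative log posterior.

```python
import numpy as np

# Toy linear "network" y = w @ x with Gaussian noise precision (stand-in for
# A_in) and weight-decay precision matrix A_r.  The log-posterior Hessian is
# H = noise_prec * X^T X + A_r; its inverse gives error bars on the weights.
rng = np.random.default_rng(0)
N, n_in = 200, 4
X = rng.normal(size=(N, n_in))
w_true = rng.normal(size=n_in)
noise_prec = 25.0
t = X @ w_true + rng.normal(0.0, 1.0 / np.sqrt(noise_prec), N)
A_r = 0.1 * np.eye(n_in)                # weight-decay hyperparameter

H = noise_prec * X.T @ X + A_r          # Hessian of the negative log posterior
w_map = np.linalg.solve(H, noise_prec * X.T @ t)   # MAP weight estimate
C_w = np.linalg.inv(H)                  # posterior weight covariance
w_std = np.sqrt(np.diag(C_w))           # error bars on the weights

# the true weights should lie well within the estimated error bars
assert np.all(np.abs(w_map - w_true) < 5.0 * w_std + 0.1)
```

For a nonlinear MLP the same recipe applies with the Hessian of the log likelihood evaluated at the trained weights, as discussed in section 2.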
These quantities are very useful. However, the NN model is trained to obtain good fit statistics for its outputs, but most of the time, no constraint is applied to structure the internal regularities of the model. Statistical inference is an ill-posed inverse problem (Tarantola, 1987; Vapnik, 1997; Aires, Schmitt, Scott, & Chédin, 1999): many solutions can be found for the NN parameters (i.e., the synaptic weights) for similar output statistics. One of the reasons for this nonunique solution comes from the fact that multicollinearities can exist among the variables. Such correlations on input or output variables are a major problem even for linear regressions: the parameters of the regression are very unstable and can vary drastically from one experiment to another. The Jacobians are the equivalent of the linear regression parameters, so similar behavior is expected: when multicollinearities are present, the Jacobians will probably be highly variable and unreliable, even if the output statistics of the NN are very good. The aim of this article is to investigate this problem, analyze it, and suggest a solution. Many regularization techniques exist to reduce the number of degrees of freedom in the model for the multicollinearity problem or for any other ill-posed problem. For example, one approach is to reduce the number of
inputs to the NN (Rivals & Personnaz, 2003); this is a model selection tool. However, the introduction of redundant information in the input of the NN can be useful for reducing the observational noise (e.g., Aires et al., 2002a; Aires, Rossow, Scott, & Chédin, 2002b) as long as the NN learning is regularized in some way. Furthermore, the input variables used in this work are highly correlated (among brightness temperatures, among first guesses, or between observations and first guesses), so it would be difficult to extract a few of the original variables and avoid the multicollinearities by input selection. We propose to solve this nonrobustness by using a principal component analysis (PCA) regression approach. Our technological developments are illustrated by application to an NN inversion algorithm for remote sensing over land. Such NN methods have already been used to retrieve columnar water vapor, liquid water, or wind speed over ocean using special sensor microwave/imager observations (Stogryn, Butler, & Bartolac, 1994; Krasnopolsky, Breaker, & Gemmill, 1995; Krasnopolsky, Gemmill, & Breaker, 2000). Our algorithm includes for the first time the use of a first guess to retrieve the surface skin temperature Ts, the integrated water vapor content WV, the cloud liquid water path LWP, and the microwave land surface emissivities Em between 19 and 85 GHz from SSM/I and infrared observations. Neural network techniques have proved very successful in developing computationally efficient algorithms for remote sensing (e.g., Aires et al., 2002b), but uncertainty estimates on the retrievals have been a limiting factor for the use of such methods. Our technical developments on this remote sensing application provide a new framework for the characterization and the analysis of various sources of neural network errors. Estimation of the Jacobian uncertainties is then used as a diagnostic tool to identify nonrobust regressions, resulting from unstable learning processes.
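The Monte Carlo estimation of Jacobian uncertainty described above can be sketched in a few lines. The example below is a toy illustration of my own (a tiny one-hidden-layer MLP with an assumed isotropic Gaussian weight posterior), not the authors' retrieval network: weight samples are drawn around a point estimate, and the spread of the resulting Jacobians dy/dx provides their error bars.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 2
W1 = rng.normal(0.0, 1.0, (n_hid, n_in))   # point estimates of the weights
W2 = rng.normal(0.0, 1.0, (n_out, n_hid))

def jacobian(W1, W2, x):
    """Jacobian dy/dx of the MLP y = W2 @ tanh(W1 @ x) at input x."""
    h = np.tanh(W1 @ x)
    return W2 @ ((1.0 - h ** 2)[:, None] * W1)   # W2 diag(1 - h^2) W1

x = rng.normal(0.0, 1.0, n_in)             # the Jacobian is input dependent
sigma_w = 0.05                             # assumed posterior std of weights
samples = np.stack([
    jacobian(W1 + rng.normal(0.0, sigma_w, W1.shape),
             W2 + rng.normal(0.0, sigma_w, W2.shape), x)
    for _ in range(500)
])
J_mean = samples.mean(axis=0)   # Monte Carlo estimate of the Jacobian
J_std = samples.std(axis=0)     # its uncertainty (error bars)
assert J_mean.shape == (n_out, n_in) and np.all(J_std > 0.0)
```

With collinear inputs, the spread `J_std` blows up even when the output fit is good, which is exactly the diagnostic use of Jacobian uncertainty made in section 4.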
The Bayesian approach for the estimation of NN weight uncertainty is presented in section 2. The NN technique is described, the theoretical formulation of a posteriori distributions for the NN weights is developed, and a remote sensing application is presented as an example to illustrate some first results of the application of our weight uncertainty analysis. The theoretical computation of the predictive distribution of network outputs is developed in section 3. Theoretical developments are used to characterize the NN output uncertainty sources. Section 4 presents the uncertainty estimate of NN Jacobians. The PCA of the NN input and output data is described together with its regularization properties. Network Jacobians are presented with their corresponding uncertainties. Conclusions and perspectives are discussed in section 5.

2 Network Weights Uncertainty

2.1 The Quality Criterion. In this study, we use a classical multilayer perceptron (MLP) trained by the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). For the definition of the quality criterion to maximize, we present a general matrix formulation of the problem and link our derivation to the “classical” literature on Bayesian error estimation often introduced with a scalar formulation (MacKay, 1992; Bishop, 1996). The first and main term in the quality criterion used to train a neural network is the “data” term, expressed using the difference between the target data and the NN estimates as measured by a particular distance. Many distance measures can be used, but it is often supposed that the differences follow a gaussian probability distribution function (PDF), which means that the right distance is the Mahalanobis distance (Crone & Crosby, 1995). The ideal covariance matrix for the gaussian PDF, denoted Cin = Ain−1, describes what we call the intrinsic noise (or natural variability) of the physical variables y to retrieve. Note that Cin takes into account only the intrinsic variability and not the error associated with the retrieval scheme itself: this makes this measure coherent physically. The information encoded in Cin is difficult to obtain a priori; we will see how to estimate this quantity, but we suppose here that it is known. The data quality term becomes
$$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} [\varepsilon_y^{(n)}]^T \cdot A_{in} \cdot \varepsilon_y^{(n)}, \tag{2.1}$$
where εy(n) = (t(n) − y(n)) is the output error and the index (n) indicates the sample number in database B. This criterion leads to a weighted least squares when the matrix Cin is just diagonal. When no a priori information is available, Cin = I, and the criterion becomes the classical least squares. In order to regularize the learning process, a regularization term is sometimes added to the data term in the quality criterion. The weight decay (Hertz, Krogh, & Palmer, 1991) is probably the most common regularization technique for NN:
$$E_r(w) = \frac{1}{2} w^T \cdot A_r \cdot w, \tag{2.2}$$
where Cr = Ar−1 is the covariance matrix of the gaussian a priori distribution for the network weights. The overall quality criterion that is minimized during the learning stage is the sum of the data and the regularization terms,
$$E(w) = E_D(w) + E_r(w). \tag{2.3}$$
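The criterion of equations 2.1 through 2.3 can be written out directly. The sketch below uses illustrative stand-in arrays (random targets, outputs, and weights; simple choices for the hyperparameter matrices Ain and Ar), not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_out, n_w = 100, 3, 20
t = rng.normal(size=(N, n_out))                  # targets t^(n)
y = t + 0.1 * rng.normal(size=(N, n_out))        # network outputs y^(n)
w = rng.normal(size=n_w)                         # network weights

A_in = np.linalg.inv(0.01 * np.eye(n_out))       # inverse of C_in (stand-in)
A_r = 0.1 * np.eye(n_w)                          # weight-decay hyperparameter

eps = t - y                                      # output errors eps_y^(n)
E_D = 0.5 * np.einsum('ni,ij,nj->', eps, A_in, eps)   # equation 2.1
E_r = 0.5 * w @ A_r @ w                               # equation 2.2
E = E_D + E_r                                         # equation 2.3

# with A_in = I, equation 2.1 reduces to the classical least-squares criterion
lsq = 0.5 * np.einsum('ni,ij,nj->', eps, np.eye(n_out), eps)
assert np.isclose(lsq, 0.5 * (eps ** 2).sum())
assert E > E_D > 0.0 and E_r > 0.0
```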
The two matrices Ain and Ar are called hyperparameters. They are generally simplified in the classical literature by using two scalars instead, respectively, β and α, so that the general quality criterion becomes E(w) = βED + αEr , where ED and Er are simplified quadratic forms. In this formulation, β represents the inverse of the observation noise variance for all
2420
F. Aires, C. Prigent, and W. Rossow
outputs, and α is a weight for the regularization term, linked to the a priori general variance of the weights. This is obviously poorer and less general than our matrix formulation in equation 2.3, but the hyperparameters A_in and A_r are difficult to guess a priori.

2.2 Intrinsic Uncertainty of Targets. The conditional probability P(t|x, w) represents the variability of target t for input x and network weights w, due to a variety of sources like the errors in the model linking x to t in B or the observational noise on x. This variability includes all sources of uncertainty except those from the NN regression model, represented by uncertainties on the network weights w, which are fixed in the conditional probability. If the neural network g_w fits the data well (after the learning stage), the intrinsic variability is evaluated by comparing the target values t, matched with each input x in the data set B, to the NN outputs y. Generally, this distribution can be approximated locally to first order by a gaussian distribution with zero mean and a covariance matrix C_in = A_in^{-1}:

P(t|x, w) = \frac{1}{Z} e^{-\frac{1}{2} \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y},   (2.4)
where Z is a normalization factor. The likelihood of the parameters w, given the inverse model structure g of the trained NN g_w, is expressed by evaluating this probability over the database B that includes D = {t^{(n)}; n = 1, ..., N}, the set of output samples:

P(D|x, w) = \prod_{n=1}^{N} P(t^{(n)}|x^{(n)}, w) = \frac{1}{Z^N} e^{-\frac{1}{2} \sum_{n=1}^{N} \varepsilon_y^{(n)T} \cdot A_{in} \cdot \varepsilon_y^{(n)}},   (2.5)

which we simplify to

P(D|x, w) = \frac{1}{Z^N} e^{-E_D},   (2.6)
using the definition of E_D in equation 2.1. The smaller E_D is, the likelier the output data sample D is (i.e., the closer all y are to the targets t). The conditioning of the previous probabilities, as in equation 2.5, depends on the input x, but since the distribution of x is not of interest here, this variable will be omitted in the following notation for simplicity.

2.3 Theoretical Derivation of the Weight PDF. In classical regression techniques, a point estimate of the parameters w is searched for (i.e., only one estimate of the weight vector w is evaluated). In the Bayesian context, the uncertainty of w, described by a PDF P(w), can also be characterized. This
distribution of the weights conditional on a database is given by Bayes' theorem:

P(w|D) = \frac{P(D|w) P(w)}{P(D)}.   (2.7)
P(D) does not depend on the weights, and the prior P(w) is a uniform distribution in this application (since there is no prior information on w), meaning that no regularization term E_r(w) is used in equation 2.3. So we can use for P(w|D) the expression for P(D|x, w) from equation 2.6, the other terms in equation 2.7 being treated as constant normalization factors. Laplace's method is now used: it consists in using a local quadratic approximation of the log-posterior distribution. A second-order Taylor expansion of E_D(w) is performed about w*, the set of final optimized network weights (the parameters of the neural network regression) found at the end of the learning process:

E_D(w) = E_D(w^*) + b^T \cdot \Delta w + \frac{1}{2} \Delta w^T \cdot H \cdot \Delta w,   (2.8)

where \Delta w = w - w^*, b is the gradient vector b = \nabla E_D(w)|_{w^*}, and H is the Hessian matrix H = \nabla \nabla E_D(w)|_{w^*}. The linear term b^T \cdot \Delta w disappears because we are at the optimum w*, which means that the gradient b is zero. For the local quadratic approximation to be valid, w* must be a real optimum (at least locally in the weight space); otherwise, the gradient b cannot be neglected anymore, and the matrix H might not be positive definite, which would make its use difficult for subsequent uncertainty estimates. The second-order approximation leads to

P(w|D) = \frac{1}{Z^N} e^{-E_D(w^*) - \frac{1}{2} \Delta w^T \cdot H \cdot \Delta w} \propto e^{-\frac{1}{2} \Delta w^T \cdot H \cdot \Delta w}.   (2.9)
This means that the a posteriori PDF of the neural network weights follows a gaussian distribution with mean w* and covariance matrix H^{-1}. This probability represents a plausibility (in the Bayesian sense) of the weights w, not the probability of obtaining the weights w when using the learning algorithm. If a regularization term, such as the one described in equation 2.2, is used, then this probability becomes

P(w|D) \propto e^{-\frac{1}{2} \Delta w^T \cdot (H + A_r) \cdot \Delta w}.   (2.10)
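Under the Laplace approximation, equation 2.10 says the weights are gaussian around w* with covariance (H + A_r)^{-1}. A minimal sketch of drawing one plausible weight configuration, with a hypothetical positive definite Hessian standing in for the real one:

```python
import numpy as np

# Sample one weight configuration from the Laplace posterior of equation 2.10:
# w ~ N(w_star, (H + A_r)^-1). H, A_r, and w_star are hypothetical stand-ins.
rng = np.random.default_rng(1)
W = 4
w_star = rng.normal(size=W)              # optimized weights w*
A = rng.normal(size=(W, W))
H = A @ A.T + W * np.eye(W)              # a positive definite Hessian stand-in
A_r = 0.1 * np.eye(W)                    # regularization hyperparameter

cov = np.linalg.inv(H + A_r)             # posterior covariance of the weights
L = np.linalg.cholesky(cov)
w_sample = w_star + L @ rng.normal(size=W)   # one plausible weight vector
```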
These two terms are used to weight the contribution to the variability of the weights due to the network model and the variability of the weights due to
the gaussian distribution of the a priori information on the weights. What is interesting about this formula is that, to obtain the covariance matrix of the weights, we invert H + A_r instead of H only, which is more robust (see section 2.6), since A_r is the inverse of a positive definite matrix.

2.4 Hessian Matrix for a One-Hidden-Layer Network. The Hessian H of the previously defined log likelihood is a matrix of dimension W × W (W is the dimension of w) whose components are defined by

H_{ij}(x) = \left. \frac{\partial^2 E(w)}{\partial w_i \, \partial w_j} \right|_x,   (2.11)
where w_i and w_j are two weights from the set w. There are many ways of estimating the Hessian matrix; some are generic methods, and some are specific to the MLP. For example, one generic approximation method uses finite differences, but in our case, it is possible to derive a mathematical expression for the Hessian based on the NN model. This theoretical Hessian is less demanding computationally than the approximation by finite differences: it scales as O(W^2), where W is the number of weights in the neural network, instead of O(W^3).

2.5 A Remote Sensing Example. An NN inversion scheme has been developed to retrieve surface temperature (Ts), water vapor column amount (WV), and microwave surface emissivities at each frequency/polarization (Em), over snow- and ice-free land, from a combined analysis of satellite microwave (SSM/I) and infrared International Satellite Cloud Climatology Project (ISCCP) data (Aires, Prigent, Rossow, & Rothstein, 2001; Prigent, Aires, & Rossow, 2003). This study aims, in part, to provide uncertainty estimates for these retrievals. To avoid nonuniqueness and instability in an inverse problem, it is essential to use all the a priori information available. The chosen solution is then constrained so that it is physically more consistent (Rodgers, 1976). We introduce a priori first-guess information into the input of the NN model, so the neural transfer function becomes
y = g_w(y^b, x°),   (2.12)
where y is the retrieval (i.e., the retrieved physical parameters), g_w is the NN with parameters w, y^b is the first guess for the physical parameters to be retrieved, and x° the observations. In this approach, the first guess is considered to be an independent estimate of the state obtained from sources other than the indirect measurements (here, the satellite observations). Such estimates are sometimes called virtual measurements (Rodgers, 1990). The extensive learning database used in this study, together with the characteristics of the a priori first-guess information and related background
errors, are presented in Aires et al. (2001). Of the 9,830,211 samples for clear, snow- and ice-free measurements from a whole year of data, we have used only N = 20,000 samples, chosen randomly, to construct the learning database B. The learning algorithm and the network architecture are able to infer the inverse radiative transfer equation from these N samples. The conjugate gradient optimization algorithm used to train the NN is fast and efficient: the learning errors decrease extremely fast and then stabilize after a few thousand iterations, each iteration involving the whole learning database B. This learning stage determines the optimal weights w*. Once trained, the neural network g_w represents statistically the inverse of the radiative transfer equation. The NN model is then valid for all observations (i.e., global inversion), whereas iterative methods, such as variational assimilation, have to compute an estimator for each observation (i.e., local inversion). Table 1 gives the RMS scores for the first guesses and the retrievals. For each output, the retrieval is a considerable improvement over the first guess.

2.6 Neural Network Hessian Regularization. The Hessian H is computed using the data set B. A few comments about the inversion of H are required. This matrix can be very large when the NN considered is big (W, the size of H and the number of parameters in the NN, can reach a few thousand). This means that the inversion can be sensitive to numerical problems. As a consequence, the estimation of H needs to be done with enough samples from B; otherwise, the subspace spanned by the samples describing H might be too small, or the eigenvalues of H too close to zero or even negative, making the inversion numerically impossible. We noted in section 2.3 that the gradient b in equation 2.8 is supposed to be zero; otherwise, the local quadratic approximation is not good enough, and the Hessian matrix H might not be positive definite.
As a consequence, it is very important that the learning of the NN converges close enough to the optimal solution w*. Monitoring of the convergence algorithm could be enhanced by checking in parallel the positive definite character of the corresponding Hessian of the network.

Table 1: First Guess and Retrieval RMS Errors.

                 First Guess   Retrieval
Ts (K)           3.52411       1.46250
WV (kg·m−2)      7.99485       3.83521
Em 19V           0.01504       0.00494
Em 19H           0.01837       0.00495
Em 22V           0.01659       0.00562
Em 37V           0.01425       0.00497
Em 37H           0.01802       0.00501
Em 85V           0.01764       0.00682
Em 85H           0.02137       0.00820

Even when enough samples from B are used to estimate H and the learning convergence is reached, numerical problems can still exist. This situation can be related to an inconsistency between the complexity of the NN and the complexity of the function to be estimated: too many degrees of freedom in the NN can produce an ill-conditioned Hessian matrix H. A solution often used in this context is to introduce a diagonal regularization matrix: H is replaced by H + λI, where λ is a small scalar and I is the identity matrix. The regularization factor λ is chosen to be small enough not to change the structure in H but big enough to allow the inversion: a compromise must be found. To determine the factor λ representing the right trade-off, we use three regularization criteria together with a discrepancy measure between the nonregularized matrix H and the regularized matrix H + λI. The regularization criteria are the condition number with respect to inversion (the lower the better); the P-number, which is a positive integer if the matrix is not positive definite and zero otherwise (the lower the better); and the number of negative diagonal terms in the matrix (the lower the better). For the discrepancy measure between H and H + λI, we use the RMS difference between the square roots of the positive diagonal elements of the two matrices (the lower the better). This quantity measures the differences that the regularization has introduced in the standard deviations of the two covariance matrices. In Figure 1, the variations of these four quantities are shown for λ increasing from 0 to 50. A good compromise is found to be λ = 12.0: the regularization criteria are satisfactory (positive definite matrix, all diagonal terms positive, minimum condition number), and the discrepancy measure is still small.

2.7 PDF of Network Weights.
To complete the analysis of the uncertainties due to the inversion algorithm, the posterior distribution of the network weights needs to be determined. As previously stated, this PDF represents a plausibility of weights w, not a probability of finding the particular weights. We saw in section 2.3 that this distribution follows a gaussian PDF with mean w and covariance matrix H −1 . In Figure 2, the optimum weights w are shown together with ± two standard deviations. As previously noted, weights between the hidden and the output layers are more variable than weights between the input and the hidden layers. This is due to the fact that the first processing stage of the NN, at the hidden layer level, is a high-level processing that includes the nonlinearity of the network. The second processing stage of the NN, from the hidden layer to the output layer, is just a linear postprocessing of the hidden layer.
Figure 1: Quality criteria for variable λ. (A) The number of negative diagonal terms in the matrix. (B) RMS differences between the square roots of the positive diagonal elements of the matrices. (C) The condition number with respect to inversion. (D) The P-number, which is a positive integer if the matrix is not positive definite and zero otherwise. See the text.
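The λ scan behind Figure 1 can be sketched as follows. The ill-conditioned Hessian is a hypothetical stand-in, and this sketch simply returns the smallest scanned λ that makes H + λI positive definite, whereas the text additionally trades this off against the condition number and the discrepancy measure:

```python
import numpy as np

# Sketch of the Hessian regularization H -> H + lambda*I of section 2.6.
rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
H = A @ A.T - 1e-3 * np.eye(6)           # hypothetical near-singular Hessian

def regularize(H, lambdas):
    for lam in lambdas:
        Hr = H + lam * np.eye(H.shape[0])
        eigvals = np.linalg.eigvalsh(Hr)
        if eigvals.min() > 0:            # positive definite (P-number criterion)
            cond = eigvals.max() / eigvals.min()   # condition-number criterion
            # discrepancy: RMS change of the diagonal "standard deviations"
            rms = np.sqrt(np.mean((np.sqrt(np.diag(Hr))
                                   - np.sqrt(np.abs(np.diag(H)))) ** 2))
            return lam, cond, rms
    raise ValueError("no lambda in the scanned range gives a definite matrix")

lam, cond, rms = regularize(H, np.linspace(0.0, 50.0, 501))
```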
Figure 2: Mean network weights w* ± 2 standard deviations. (A) The first 100 NN weights, corresponding to input/hidden layer connections. (B) All 821 NN weights, with weights 510 to 819 corresponding to hidden/output layer connections.

It is possible to know much more than just the output estimates of an NN. From the distribution of weights, samples {w_r; r = 1, ..., R} of NN weights can be drawn. Each of the R samples w_r represents a particular NN. Together, they represent the uncertainty on the NN weights. These samples can be used later to integrate under the PDF of weights in a Monte Carlo approach. For neural networks, the number of parameters (i.e., the size of w) is large, so it is preferable to use an advanced sampling technique. Even if these samples lie within the large variability of the two-standard-deviation envelope, correlation constraints avoid random oscillations from noise by imposing some structure on them. The weights have considerable latitude to change, but their correlations constrain them to follow a strong dependency structure. This is why different weight configurations can result in the same outputs. Most important for network processing is the structure of these correlations. For example, if the difference of two inputs is a good predictor, then as long as two weights linked to the two inputs compute the difference, the absolute value of the weights is not essential. Another source of uncertainty for the weights is the fact that some permutations of neurons have no impact on the network output. For example, if two neurons in the hidden layer of the network are permuted, the network answer does not change. The sigmoid function used in the network saturates when the incoming neuron activity is too low or too high, which means that a change of a weight going to such a neuron has a negligible consequence. These are just a few reasons that explain why the network weights can vary and still provide a good general fitting model. The variability of the network weights is considered a natural variability, inherent to the neural technique. Furthermore, what is important for the NN user is not the variability of the weights but the uncertainty that this variability produces in the network outputs or in even more complex quantities such as the network Jacobians.
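The Monte Carlo use of the weight samples {w_r} can be sketched as follows, with a hypothetical tiny one-hidden-layer network and a stand-in Hessian; the spread of the R network outputs at a given input is the quantity of interest:

```python
import numpy as np

# Draw R networks from the approximate weight posterior N(w_star, H^-1) and
# propagate one input through all of them. Network and Hessian are stand-ins.
rng = np.random.default_rng(3)
n_in, n_hid = 3, 4
W = n_hid * n_in + n_hid                   # weights of a tiny one-output MLP

def net(w, x):
    W1 = w[:n_hid * n_in].reshape(n_hid, n_in)   # input -> hidden weights
    w2 = w[n_hid * n_in:]                        # hidden -> output weights
    return w2 @ np.tanh(W1 @ x)                  # nonlinear hidden layer, linear output

w_star = 0.5 * rng.normal(size=W)
B = rng.normal(size=(W, W))
H = B @ B.T + 10 * np.eye(W)               # stand-in positive definite Hessian

L = np.linalg.cholesky(np.linalg.inv(H))
R = 500
samples = w_star + rng.normal(size=(R, W)) @ L.T   # R correlated weight draws
x = np.array([0.2, -1.0, 0.5])
outputs = np.array([net(w, x) for w in samples])
spread = outputs.std()                     # output uncertainty induced by the weight PDF
```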
3 Uncertainty in Network Outputs

3.1 Theoretical Derivation of the Network Output Error PDF. The distribution of uncertainties of the NN output, y, is given by

P(y|x, D) = \int P(y|x, w) \cdot P(w|D) \, dw,   (3.1)

where D is the set of target outputs t in the data set B = {(x^{(n)}, t^{(n)}); n = 1, ..., N} of N matched input-output couples. Using equations 2.4 and 2.9, we find that this probability is equal to

\frac{1}{Z} \int e^{-\frac{1}{2} (t - g_w(x))^T \cdot A_{in} \cdot (t - g_w(x))} \cdot e^{-\frac{1}{2} \Delta w^T \cdot H \cdot \Delta w} \, dw,   (3.2)

where A_in is the inverse of C_in, the covariance matrix of the "intrinsic noise" of the physical variables y, and H is the Hessian matrix of the quality criterion used by the learning process. Note that all the terms not dependent on w have been put together in the normalization factor Z. A first-order expansion of the NN function g_w about the optimum weights w* is now used:

g_w(x) = g_{w^*}(x) + G^T \cdot \Delta w,   (3.3)

where

G = \nabla_w g_w |_{w = w^*}   (3.4)

is a W × M matrix. Introducing equation 3.3 into 3.2 and using \varepsilon_y = (t - g_{w^*}(x)), we obtain

P(t|x, D) \propto \int e^{-\frac{1}{2} \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y} \, e^{\varepsilon_y^T \cdot A_{in} \cdot G^T \cdot \Delta w} \, e^{-\frac{1}{2} \Delta w^T \cdot (G \cdot A_{in} \cdot G^T + H) \cdot \Delta w} \, dw   (3.5)

\propto e^{-\frac{1}{2} \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y} \int e^{h^T \cdot \Delta w - \frac{1}{2} \Delta w^T \cdot O \cdot \Delta w} \, dw,   (3.6)

where h = G \cdot A_{in} \cdot \varepsilon_y and O = G \cdot A_{in} \cdot G^T + H. The integral term in equation 3.6 evaluates to

(2\pi)^{W/2} \, |O|^{-1/2} \, e^{\frac{1}{2} h^T \cdot O^{-1} \cdot h}.   (3.7)

We can rewrite equation 3.6 using this simplification to obtain

P(t|x, D) \propto e^{-\frac{1}{2} \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y} \, e^{\frac{1}{2} \varepsilon_y^T \cdot A_{in} \cdot G^T (G \cdot A_{in} \cdot G^T + H)^{-1} G \cdot A_{in} \cdot \varepsilon_y}   (3.8)

\propto e^{-\frac{1}{2} \varepsilon_y^T \cdot [A_{in} - A_{in} \cdot G^T (G \cdot A_{in} \cdot G^T + H)^{-1} G \cdot A_{in}] \cdot \varepsilon_y}.   (3.9)
This means that the distribution of t follows a gaussian distribution with mean g_{w^*}(x) and covariance matrix

C_0 = [A_{in} - A_{in} \cdot G^T (G \cdot A_{in} \cdot G^T + H)^{-1} G \cdot A_{in}]^{-1}.   (3.10)

This covariance matrix can be simplified, using the matrix inversion lemma, to

C_0 = C_{in} + G^T \cdot H^{-1} \cdot G.   (3.11)
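A minimal numerical sketch, with hypothetical small matrices, can also serve as a check that the two forms 3.10 and 3.11 of C_0 agree:

```python
import numpy as np

# Check numerically that eq. 3.10 and eq. 3.11 give the same covariance C0.
# W weights, M outputs; all matrices are hypothetical stand-ins.
rng = np.random.default_rng(4)
W, M = 6, 3
C_in = np.diag([0.5, 0.2, 0.1])            # intrinsic noise covariance
A_in = np.linalg.inv(C_in)
A = rng.normal(size=(W, W))
H = A @ A.T + W * np.eye(W)                # Hessian of the quality criterion
G = rng.normal(size=(W, M))                # gradient of g_w at w*, a W x M matrix

C0 = C_in + G.T @ np.linalg.inv(H) @ G     # eq. 3.11
C0_alt = np.linalg.inv(                    # eq. 3.10
    A_in - A_in @ G.T @ np.linalg.inv(G @ A_in @ G.T + H) @ G @ A_in)
```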
We see that the uncertainty in the network outputs is due to the intrinsic noise of the target data, embodied in C_in, and to the uncertainty described by the posterior distribution of the weight vector w, embodied in G^T · H^{-1} · G. This relation describes the fact that the uncertainties are approximately related to the inverse data density. As expected, uncertainties are larger where the data are less dense, since there the learning algorithm gets less information.

3.2 Sources of Uncertainty. In equation 3.11, we have separated the sources of error into two terms: the intrinsic noise, with covariance matrix C_in, and the neural inversion term, with covariance matrix G^T H^{-1} G. The neural inversion term refers to the errors due only to the uncertainty in the inverse model parameters; all the remaining outside sources of errors are grouped in C_in. The inversion uncertainty can itself be decomposed into three sources, corresponding to the three main components of an NN model:

1. The imperfections of the learning data set B, which include simulation errors when B is simulated by a model; collocation and instrument errors when B is a collection of coincident inputs and outputs; null-space errors; and others. This is probably the most important source of uncertainty due to the inversion technique.

2. The limitations of the network architecture: the model might not be optimum, with too few degrees of freedom or a structure that is not optimal. This is usually a lower-level source of uncertainty because the network can partly compensate for these deficiencies.

3. A nonoptimum learning algorithm: however good the optimization technique is, it is impossible in practice to be sure that the global minimum w* has been found instead of a local one. We think that this source of uncertainty is limited.

The matrix C_in includes all other sources of errors. Our approach allows for the estimation of the global C_in, but if some individual terms are known,
it is possible to subtract them from C_in. For example, if the instrument noise is known, it is possible to measure the impact of this noise on the NN outputs. The individual terms can then be subtracted from the global C_in. For simplification, and because we do not use such a priori information, we adopt the hypothesis that C_in is constant for each situation; only the inversion term is situation dependent. But any a priori information about a nonconstant term in C_in could be used in this very flexible approach. Note that the specification of the sources of uncertainty in the approach of Rodgers (1990) mainly uses the concept of Jacobians of either the direct or the inverse model in order to linearize the impact of each error source. Linear and gaussian variables are easily manageable analytically, the algebra being essentially based on the covariance matrices. For example:

• C_M = D_x · E · D_x^T, the covariance of the errors due to instrument noise, where D_x = ∂g_w/∂x is the contribution function and E = ⟨η · η^T⟩ is the covariance matrix of the instrument noise η; or

• F = A_b · C_b · A_b^T, the covariance of the forward model errors, where C_b is the error covariance matrix of the forward model parameters b, and A_b is the sensitivity matrix of the observations with respect to b (Rodgers, 1990).

Some bridges can be built between our error analysis and the approach used in variational assimilation by Rodgers (1990). In section 4, such Jacobians are analytically derived in the neural network framework. This makes the use of Rodgers' estimates feasible. The difference is that our linearization uses Jacobians that are situation dependent, which means that the estimation of the error sources would be nonlinear in nature. This will be the subject of another study. In Wright et al. (2000), noise in NN inputs is considered an additional source of uncertainty. An approach for the empirical characterization of the various sources of uncertainty is to use simulations.
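The first Rodgers-style term above, C_M = D_x E D_x^T, can be sketched with a hypothetical inverse model whose Jacobian D_x is obtained by finite differences:

```python
import numpy as np

# Propagate instrument noise through a stand-in inverse model g (2 outputs,
# 3 inputs): C_M = D_x E D_x^T, with D_x the Jacobian of g with respect to x.
def g(x):
    return np.array([x[0] + 0.5 * np.tanh(x[1]),
                     x[1] - 0.2 * x[0] * x[2]])

x = np.array([1.0, -0.5, 0.3])
h = 1e-6
D_x = np.column_stack([(g(x + h * e) - g(x - h * e)) / (2 * h)
                       for e in np.eye(3)])       # central differences
E_noise = np.diag([0.1, 0.1, 0.2])                # instrument noise covariance E
C_M = D_x @ E_noise @ D_x.T                       # output error covariance
```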
For example, for the instrument-noise-related uncertainty, it is easy to introduce a sample of noise into the network inputs and analyze the resulting error distribution of the outputs. The advantage of such a simulation approach is that it is very flexible and allows for the manipulation of nongaussian distributions. This will be the subject of another study.

3.3 Distribution of Network Outputs. After the learning stage, we estimate C_0, the covariance matrix of the network errors ε_y = (t − g_{w*}(x)), over the database B. Equation 3.11 shows that this covariance adds the errors due to the neural network uncertainties and all other sources of uncertainty. Table 2 gives the numerical values of C_0 for the particular example from Prigent et al. (2003). The right/top triangle is for the correlation, and the left/bottom triangle is for the covariance. The diagonal values give the variance of the error for each quantity. The correlation part indicates clearly that some errors are highly
Table 2: Covariance Matrix C0 of Network Output Error Estimated over the Database B.

          Ts          WV          Em 19V     Em 19H     Em 22V     Em 37V     Em 37H     Em 85V     Em 85H
Ts        2.138910    -0.24       -0.87      -0.72      -0.76      -0.84      -0.72      -0.49      -0.32
WV        -1.392113   14.708836   0.16       -0.06      0.14       0.05       -0.15      -0.18      -0.37
Em 19V    -0.006294   0.003179    0.000024   0.77       0.88       0.89       0.74       0.60       0.42
Em 19H    -0.005261   -0.001143   0.000019   0.000024   0.72       0.73       0.81       0.60       0.56
Em 22V    -0.006274   0.003140    0.000024   0.000020   0.000031   0.84       0.71       0.71       0.54
Em 37V    -0.006121   0.001049    0.000021   0.000018   0.000023   0.000024   0.81       0.70       0.50
Em 37H    -0.005290   -0.002954   0.000018   0.000020   0.000020   0.000020   0.000025   0.65       0.67
Em 85V    -0.004895   -0.004945   0.000020   0.000020   0.000027   0.000023   0.000022   0.000046   0.79
Em 85H    -0.003906   -0.011933   0.000017   0.000022   0.000024   0.000020   0.000027   0.000044   0.000067

Notes: The right/top triangle is for correlation, and the left/bottom triangle is for covariance; the diagonal gives the variance. Correlations with absolute value higher than 0.3 are in bold.
correlated. This is why it would be a mistake to monitor only the error bars, even though they are easier to understand. The correlations of errors exhibit the expected physical behavior. Errors in Ts are negatively correlated with the other errors, with large correlations with the vertical polarization emissivities for the channels that are least sensitive to water vapor (Em 19V and Em 37V). The vertical polarization emissivities are larger than the horizontal polarization emissivities and are often close to one, with the consequence that the radiative transfer equation in the channels least sensitive to water vapor (the 19 and 37 GHz channels) is quasi-linear in Ts and in Em V. In contrast, errors in water vapor are weakly correlated with the other errors: the largest correlation is with the emissivity at 85 GHz in the horizontal polarization. The 85 GHz channel is the most sensitive to water vapor, and since the emissivity for the horizontal polarization is lower than for the vertical, the horizontal polarization channel is more sensitive to water vapor. Correlations between the water vapor and emissivity errors are positive or negative, depending on the respective contributions of the emitted and reflected energy at the surface (which is related not only to the surface emissivity but also to the atmospheric contribution at each frequency). Correlations between emissivity errors always have the same signs and are high for the same polarizations, decreasing as the difference in frequency increases. The correlations involved in the PDF of the errors described by the covariance matrix C_0 make it necessary to understand the uncertainty in a multidimensional space.
This is more challenging than just determining the individual error bars, but it is also much more informative: the diagonal elements of the covariance matrix provide the variance of each output error, but the off-diagonal terms show the level of dependence among these output errors. To statistically analyze the covariance matrix C_0, this matrix can be decomposed into its orthogonal eigenvectors (not shown). These basis functions constitute a set of error patterns (Rodgers, 1990).

3.4 Covariance of Output Errors Due to the Neural Inversion. The matrix H^{-1} is the covariance of the PDF of the network weights. The use of the gradient G transforms this matrix into G^T H^{-1} G, the error covariance of the NN outputs associated with the uncertainty of the weights. Note that the multiplication by G partially regularizes H^{-1}, so that for this particular purpose of estimating the output errors, H does not need to be regularized. Table 3 represents this covariance matrix G^T H^{-1} G averaged over the whole learning database B. Even if some of the bottom-left values representing the covariance are close to zero (this is an artifact, since the variability ranges of the variables are quite different from each other), structure is still present in this matrix, as shown in the correlation part (top right). The error correlation matrix G^T H^{-1} G, related to the NN inversion method, has relatively small magnitudes, with a maximum of 0.55. However,
Table 3: Covariance Matrix G^T H^{-1} G of Error Due to Network Uncertainty, Averaged over the Database B.

          Ts          WV          Em 19V     Em 19H     Em 22V     Em 37V     Em 37H     Em 85V     Em 85H
Ts        0.493615    -0.14       -0.28      -0.14      -0.25      -0.32      -0.16      -0.19      -0.06
WV        -0.106484   1.063071    0.10       -0.02      0.09       0.02       -0.07      -0.15      -0.25
Em 19V    -0.000325   0.000167    0.000002   0.33       0.55       0.55       0.28       0.27       0.08
Em 19H    -0.000255   -0.000060   0.000001   0.000006   0.26       0.22       0.29       0.10       0.13
Em 22V    -0.000268   0.000152    0.000001   0.000001   0.000002   0.50       0.26       0.28       0.12
Em 37V    -0.000330   0.000033    0.000001   0.000000   0.000001   0.000002   0.34       0.38       0.14
Em 37H    -0.000270   -0.000183   0.000001   0.000001   0.000000   0.000001   0.000005   0.16       0.26
Em 85V    -0.000231   -0.000282   0.000000   0.000000   0.000000   0.000000   0.000000   0.000002   0.43
Em 85H    -0.000128   -0.000681   0.000000   0.000000   0.000000   0.000000   0.000001   0.000001   0.000006

Notes: The right/top triangle is for correlation, and the left/bottom triangle is for covariance; the diagonal gives the variance. Correlations with absolute value higher than 0.3 are in bold.
it has a structure similar to that of the global correlation matrix, with the same signs of correlation and similar relative values between the variables.

3.5 Covariance of the Intrinsic Noise of Target Values. To estimate C_in, we use equation 3.11:

C_{in} = \langle C_0 \rangle_B - \langle G^T \cdot H^{-1} \cdot G \rangle_B,   (3.12)
where the two right-hand terms are the covariance matrix of the total output errors averaged over B (see section 3.3) and the covariance matrix of the output errors due to the network inversion scheme averaged over B (see section 3.4). Table 4 gives the numerical values of the matrix C_in: the right/top triangle is for the correlation, and the left/bottom triangle is for the covariance. Intrinsic error correlations can be very large (up to 0.99). The structure of C_in is also very similar to the structure of the global error correlation matrix, the only noticeable difference being the larger correlation values. An important remark is that the error covariance is dominated, in this application, by the intrinsic errors. This behavior is entirely dependent on the particular application that is treated.

3.6 Network Output Error Estimates. Once C_in is available, we can estimate a C_0(x) that is dependent on the observations x, the term G^T H^{-1} G varying with the input x. It should be noted that the use of the regularization for the matrix H has virtually no consequences for the error bar results that follow. Using no regularization for the Hessian matrix is possible since H is multiplied by the gradients in G^T H^{-1} G. This is an additional argument that the regularization helps the matrix inversion without damaging the information in the Hessian. C_0(x) is estimated for each of the 1,239,187 samples for clear-sky pixels in July 1992. Figure 3 presents the monthly mean standard deviations (square roots of the diagonal terms in C_0(x)) for four outputs: the surface skin temperature Ts, the columnar integrated water vapor WV, and the microwave emissivities at 19 GHz for vertical and horizontal polarizations. The errors exhibit the expected geographical patterns. Large errors on Ts are concentrated in regions where the emissivities are lower or highly variable: inundated areas and deserts.
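The situation-dependent error bars can be sketched as follows: a hypothetical tiny network, Hessian, and C_in, with the gradient G(x) of the output with respect to the weights recomputed (here by finite differences) for each observation x.

```python
import numpy as np

# Situation-dependent covariance C0(x) = C_in + G(x)^T H^-1 G(x), section 3.6.
rng = np.random.default_rng(8)
n_in, n_hid = 2, 3
W = n_hid * n_in + n_hid                   # weights of a tiny one-output MLP

def net(w, x):
    W1 = w[:n_hid * n_in].reshape(n_hid, n_in)
    return w[n_hid * n_in:] @ np.tanh(W1 @ x)

w_star = 0.5 * rng.normal(size=W)
A = rng.normal(size=(W, W))
H = A @ A.T + 10 * np.eye(W)               # stand-in positive definite Hessian
C_in = np.array([[0.05]])                  # intrinsic noise (single output)

def C0(x, h=1e-6):
    # finite-difference gradient G(x) = d net / d w at w_star, shape (W, 1)
    G = np.array([[(net(w_star + h * e, x) - net(w_star - h * e, x)) / (2 * h)]
                  for e in np.eye(W)])
    return C_in + G.T @ np.linalg.inv(H) @ G

c_a = C0(np.array([0.1, 0.2]))             # error bar for one observation
c_b = C0(np.array([3.0, -2.5]))            # another observation, another error bar
```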
In inundated areas, for instance (around rivers like the Amazon or the Mississippi), or in coastal regions, the contribution from the surface is weaker, and the sensitivity to Ts is lower because the emissivities are lower. In sandy regions throughout desert areas, due to the higher transmission in the very dry sandy medium, the microwave radiation does not come from the very first millimeters of the surface but from deeper below it (the lower the frequency, the deeper; Prigent & Rossow, 1999). As a consequence, the microwave radiation is not directly related to the skin surface temperature (see Prigent & Rossow, 1999, for a detailed explanation), and Ts cannot be retrieved with the same accuracy.
Table 4: Covariance Matrix Cin of Intrinsic Noise Errors, Estimated over the Database B.

          Ts          WV          Em 19V     Em 19H     Em 22V     Em 37V     Em 37H     Em 85V     Em 85H
Ts        1.645294    -0.27       -0.99      -0.92      -0.86      -0.95      -0.88      -0.55      -0.37
WV        -1.285629   13.645765   0.17       -0.06      0.14       0.05       -0.16      -0.19      -0.39
Em 19V    -0.005968   0.003011    0.000021   0.89       0.91       0.92       0.83       0.63       0.46
Em 19H    -0.005006   -0.001083   0.000017   0.000017   0.83       0.86       0.98       0.71       0.66
Em 22V    -0.006005   0.002988    0.000023   0.000019   0.000029   0.87       0.80       0.75       0.58
Em 37V    -0.005790   0.001015    0.000020   0.000017   0.000022   0.000022   0.90       0.72       0.54
Em 37H    -0.005019   -0.002770   0.000017   0.000018   0.000019   0.000019   0.000019   0.74       0.76
Em 85V    -0.004663   -0.004662   0.000019   0.000019   0.000026   0.000022   0.000021   0.000043   0.82
Em 85H    -0.003777   -0.011251   0.000016   0.000021   0.000024   0.000019   0.000026   0.000042   0.000060

Notes: The right/top triangle is for correlation, and the left/bottom triangle is for covariance; the diagonal gives the variance. Correlations with absolute value higher than 0.3 are in bold.
Figure 3: Standard deviation of error maps for (A) surface skin temperature Ts, (B) columnar integrated water vapor WV, (C) microwave emissivity at 19 GHz vertical polarization, and (D) microwave emissivity at 19 GHz horizontal polarization.
The same arguments hold for the errors in emissivity. All the parameters being tightly related for a given pixel, the water vapor errors are also rather large in inundated regions and in sandy areas.

3.7 Outlier Detection

What is the behavior of the neural retrieval when the situation is particularly difficult, as when the first guess is far from the actual solution? In principle, the nonlinearity of the NN allows it to put different weights on the observations and on the first-guess information, depending on the situation. For example, if the first guesses are better in tropical cases than in polar cases, the NN will have inferred this behavior during the learning stage and will then give less emphasis to the first guess when a polar situation is to be inverted. This assumes, once again, that the training data set is correctly sampled.

To understand the behavior of the uncertainty estimates better, a good strategy is to introduce artificial errors in each source of information and to analyze the resulting impact on the network outputs. In Figure 4, the retrieval STD error change index is presented to show the effect of perturbing the mean inputs or the mean first guesses (FGs) by an artificial error. The impact of these artificial errors is measured as a percentage of the regular STD retrieval error. For example, an impact index of 120% means that the regular STD retrieval error estimate increases by 20% when the input is perturbed. The impact indices can be compared for each of the nine network outputs. These results are obtained by averaging over the 20,000 samples in B. Figure 4A presents the error impacts when all 17 network inputs are changed by a factor ranging from −5% to +5%. Obviously, this corresponds to incoherent situations, since the complex nonlinear relationships between vertical and horizontal brightness temperatures and first guesses are not respected. As expected, the error increases monotonically with the absolute value of the perturbation.
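This perturbation diagnostic can be sketched with a toy network standing in for the actual retrieval NN. All sizes, weights, and noise levels below are illustrative, not taken from the article; only the shape of the diagnostic (perturb the inputs by a fixed percentage, report the perturbed error STD as a percentage of the unperturbed one) follows the text:

```python
import numpy as np

# Toy stand-in for the retrieval NN: all sizes, weights, and noise levels
# below are illustrative, not taken from the article.
rng = np.random.default_rng(0)
n_samples, n_in, n_hid, n_out = 2000, 17, 30, 9
X = rng.normal(size=(n_samples, n_in))            # pseudo inputs (TB + FG)
w1 = rng.normal(scale=0.3, size=(n_in, n_hid))
w2 = rng.normal(scale=0.3, size=(n_hid, n_out))

def predict(Xm):
    return np.tanh(Xm @ w1) @ w2

# Pseudo targets: network output plus noise, so the unperturbed error is small
T = predict(X) + 0.05 * rng.normal(size=(n_samples, n_out))

def std_error(Xp):
    return (predict(Xp) - T).std(axis=0)          # per-output STD of the error

baseline = std_error(X)
for pct in (-0.05, -0.01, 0.01, 0.05):            # mean input changes, -5%..+5%
    index = 100.0 * std_error(X * (1.0 + pct)) / baseline
    print(f"{pct:+.0%}:", np.round(index, 1))     # index > 100: larger error
```

As in Figure 4, the index grows with the absolute size of the perturbation and differs from one output to another.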
Figure 4: Estimated STD error change index for an artificial perturbation: (A) of the mean input, (B) of the mean first-guess input, (C) of individual first-guess negative changes, and (D) of individual first-guess positive changes. See the detailed explanation in the text. Statistics are performed over 20,000 samples from B.

However, the impact is not uniform among the output variables. For WV, which is retrieved with a rather low accuracy, changes in the inputs do not have a large influence. The impact on the emissivities is larger for horizontal polarizations than for vertical: horizontal polarization emissivities are much more variable than the vertical ones, so the vertical polarization emissivities take rather similar output values whatever the situation and therefore depend less on the inputs. It can also be noted that positive perturbations have a slightly stronger impact than negative ones. This is related to the distribution of the variables in the training database. For the emissivities, for instance, the distribution has a steep cutoff at unit emissivity, above which emissivities are not physical. On the contrary, a large range of lower emissivity values exists in the training database (see Figure 3 in Aires et al., 2001). As a consequence, decreasing the emissivity first guess remains physically realistic, whereas increasing it may not be. Figure 4B is the same except that the changes are made only for the first-guess inputs. We note a similar behavior (nonuniform impact among the output variables, with a larger impact for positive perturbations), but we also observe that the errors are larger than when all the inputs are perturbed, as in Figure 4A. This suggests that the error estimate is able to detect inconsistencies between observations and first-guess inputs. In Figures 4C and 4D, the first-guess input variables are perturbed individually with, respectively, negative and positive amplitudes of 5%. For negative perturbations, the biggest impact is produced by the Ts first-guess perturbation: it is noticeable that the Ts error impact is similar for the retrieval of Em 19H and for its own retrieval. For the other variables, the impacts are lower, with almost no impact from the WV first guess. The WV first guess is associated with a large error (40%), and as a consequence the NN gives little importance to it. For positive individual perturbations (Figure 4D), the results are similar to those for negative perturbations. The magnitude of the positive changes as compared to the negative ones is again related to the distribution of the variables in the training data set (see Figure 3 in Aires et al., 2001): if the distribution is not symmetrical around
a mode value, then, depending on the shape of the distribution, increasing or decreasing the value can be more or less realistic.

In Figure 5, "incoherencies" have been introduced between the vertical and horizontal polarizations in the brightness temperature (TB) observations and in the first-guess emissivities Em, by increasing or decreasing one while keeping the other polarization constant. In Figure 5A, the horizontal TBs are artificially increased and decreased by 5%, and in Figure 5B, the same is done for the vertical polarizations. Figures 5C and 5D are similar but for first-guess emissivities instead of TBs.

Figure 5: Estimated STD error change index for an artificial perturbation: (A) of horizontal polarization brightness temperatures, (B) of vertical polarization brightness temperatures, (C) of horizontal polarization first-guess emissivities, and (D) of vertical polarization first-guess emissivities. Statistics are performed over 20,000 samples from B.

Several comments can be made. First, the impact is larger for observation errors than for first-guess errors, which suggests that the observations are more important for the retrieval, the first guess being used mostly as an additional constraint. Second, these polarization inconsistencies have a bigger impact than the changes of the means in Figure 4: the NN may emphasize the difference between polarizations for the retrieval, in which case such inconsistencies have a very strong impact. This shows that the NN, using complex nonlinear multivariate relationships, is sensitive to inconsistencies among its inputs. It is encouraging to see that our error estimates are able to detect such situations. Finally, the relative impact of the positive and negative changes can again be explained by the distribution of the variables in the learning database. For the emissivities, whatever the polarization and frequency, the histograms are not symmetrical: they have a broad tail toward lower values and an abrupt end at the higher values. As a consequence, artificially increasing the emissivities produces unrealistic values, which is not the case when decreasing them. (See Aires et al., 2001, for a complete description of the distributions of the learning database and the histograms of the inputs.)

The results shown in Figures 4 and 5 are consistent with a coherent physical behavior, confirming that the new tools developed in this study and its companion articles can be used to diagnose difficult retrieval situations, such as those caused by bad first guesses, inconsistent measurements, situations not included in the training data set, or uncertainties of the NN on the possible retrievals. Our a posteriori probability distributions for the NN retrieval define confidence intervals on the retrieved quantities that allow the detection of such situations. It could be argued that a limitation of our retrieval uncertainty estimates is that our technique is based on statistics over a data set B, which could mean that the error estimates are valid only inside the variability spanned by B. On the contrary, the local quadratic approximation approach has been shown to increase error estimates in sparsely sampled domains of the data space (see, e.g., MacKay, 1992).

4 Network Jacobian Uncertainties

The a posteriori distribution of weights is useful to estimate the uncertainties of network outputs (see section 3).
We will now show that these distributions can also be used for the estimation of complex probabilistic quantities via Monte Carlo simulations. As an example of such an approach, we use it to estimate the uncertainties of the NN Jacobians.

4.1 Definition of Neural Network Sensitivities

The NN technique not only provides a statistical model relating the input and output quantities; it also enables an analytical and fast calculation of the neural Jacobians (the derivatives of the analytical expression of the NN model), also called the neural sensitivities or the adjoint model (Aires et al., 1999). For example, the neural Jacobians for the two-layered MLP (an MLP network with one hidden layer) are

    ∂yk/∂xi = Σ_{j∈S1} wjk · σ'( Σ_{i∈S0} wij · xi ) · wij,    (4.1)

where σ' = dσ/da is the derivative of the activation function with respect to its pre-activation argument, S0 denotes the input units, and S1 the hidden units.
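Equation 4.1 can be checked numerically. The sketch below assumes a tanh activation and randomly drawn weights (both illustrative; they are not specified at this point in the article), computes the analytic Jacobian of a two-layer MLP, and verifies it against central finite differences:

```python
import numpy as np

# Two-layer MLP with an assumed tanh activation and random illustrative weights.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 17, 30, 9
w1 = rng.normal(scale=0.3, size=(n_in, n_hid))   # input-to-hidden weights w_ij
w2 = rng.normal(scale=0.3, size=(n_hid, n_out))  # hidden-to-output weights w_jk

def forward(x):
    return np.tanh(x @ w1) @ w2

def jacobian(x):
    a = x @ w1                                   # hidden pre-activations a_j
    dsig = 1.0 - np.tanh(a) ** 2                 # sigma'(a_j) for tanh
    # dy_k/dx_i = sum_j w_jk * sigma'(a_j) * w_ij  (equation 4.1)
    return (w1 * dsig) @ w2                      # shape (n_in, n_out)

x = rng.normal(size=n_in)
J = jacobian(x)

# Central finite-difference check of the analytic Jacobian
eps = 1e-6
J_fd = np.empty_like(J)
for i in range(n_in):
    dx = np.zeros(n_in)
    dx[i] = eps
    J_fd[i] = (forward(x + dx) - forward(x - dx)) / (2 * eps)
assert np.allclose(J, J_fd, atol=1e-6)
```

Because the network is nonlinear, `jacobian(x)` depends on the particular input x, in line with the situation dependence discussed below.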
For a more complex MLP network with more hidden layers, a backpropagation algorithm exists that efficiently computes the neural Jacobians (Bishop, 1996). Since the NN is nonlinear, these Jacobians depend on the situation defined by the particular input x. The neural Jacobian concept is a very powerful tool since it allows a statistical estimation of the multivariate and nonlinear sensitivities connecting the input and output variables of the model under study, which makes it a useful data analysis tool (Aires & Rossow, 2003). The Jacobian matrix with terms given by equation 4.1 describes the global sensitivities for each retrieved parameter: its entries indicate the relative contribution of each input to the retrieval of a given output parameter. The Jacobian is situation dependent, which means that, depending on the situation x, the NN uses the available information in different ways.

4.2 Sampling Strategy for Network Weights

To go beyond the point estimation approach, in which a learning algorithm estimates only the optimal set of weights, the uncertainty distribution of the weights w must be investigated. This distribution of weights can be used to estimate complex probabilistic quantities such as confidence intervals of stochastic variables, the distribution of the outputs, and other probabilities of quantities that depend on the output of the network. All these applications require integration under the PDF of the weights. Fortunately, the a posteriori distribution of weights is gaussian (see section 2). This means that the normalization term 1/Z_N in equation 2.9 is easily obtained (obtaining it is a main difficulty when integrating a PDF). The integration and manipulation of a gaussian PDF is particularly easy compared to other distributions. However, when faced with the estimation of complex quantities, the analytical solution of such integrations can still be difficult to obtain. The estimation of the PDF of the network Jacobians is such a situation.
This is why simulation strategies have to be used. Simulations first sample the PDF of weights, {wr; r = 1, . . . , R}, and then use this sample to approximate the integration under the whole weight PDF. Using only the MAP weights to directly estimate dependent quantities (such as the NN Jacobians) may not be optimal, even if we are not interested in uncertainty estimates. In fact, in a high-dimensional space, most of the mass of the distribution (i.e., the domain where the probability is concentrated) can be far from the most probable state (i.e., the MAP state): the high dimension places the mass of the PDF more on the periphery of the density domain and less at its center. Nonlinearities can also distort the distribution of the estimated quantity. This is why it is preferable to use R samples of the weights {wr; r = 1, . . . , R} to estimate the density of the quantity of interest. Concerning the network Jacobians, the MAP network Jacobian is obtained by using the most probable network weights. The mean Jacobian is not sufficient for a real sensitivity analysis; a measure of the uncertainty in this
estimate is required as well. In fact, the NN is designed to reproduce the right outputs, but without any a priori information, the internal regularities of the network are unconstrained. As a consequence, these internal regularities, such as the NN Jacobians, are expected to have a large variability, and this variability needs to be monitored. To estimate the uncertainties of the Jacobians, we use R = 1000 samples from the weight PDF described in section 2. Using an adequate sampling algorithm is a key issue here. To sample this gaussian distribution in a very high-dimensional space (about 800 network weights), the Metropolis algorithm is used. This method is also suitable for nongaussian PDFs. For each weight sample wr, we estimate the mean Jacobian over the entire data set B. We thus have at our disposal a sample of R = 1000 mean Jacobians. They are then averaged, and a PDF for each individual term in the Jacobian matrix is obtained.

4.3 Multicollinearity Problem

Table 5 gives the mean neural Jacobian values for the inputs xi and outputs yk of the neural network, as defined in equation 4.1. These values indicate the relative contribution of each input to the retrieval of a given output. The numbers are global means over B, which may mask rather different behaviors in various regions of the input space. The standard deviations of the uncertainty PDF are also indicated. The variability of the Jacobians is large: the uncertainty of the neural sensitivities can be up to several times the mean value. In most cases, the confidence interval is too wide for the Jacobian value to be significant. In linear regression, obtaining nonsignificant parameters is often a signal that multicollinearities are a problem for the regression. The distribution of the Jacobians shows that most of them are not statistically significant.
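The sampling strategy of section 4.2 can be sketched on a toy gaussian weight posterior of dimension 8 rather than the article's roughly 800 weights; all names, sizes, and the stand-in "Jacobian" below are illustrative:

```python
import numpy as np

# Toy gaussian weight posterior N(w_map, Sigma) in dimension 8 (the article's
# network has about 800 weights); all names and sizes are illustrative.
rng = np.random.default_rng(1)
d = 8
w_map = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma_inv = A @ A.T + d * np.eye(d)           # posterior precision (Hessian-like)

def log_post(w):                              # log N(w_map, Sigma), up to a constant
    r = w - w_map
    return -0.5 * r @ Sigma_inv @ r

samples, w, lp = [], w_map.copy(), 0.0        # start at the MAP state
for step in range(20000):
    w_prop = w + 0.05 * rng.normal(size=d)    # random-walk proposal
    lp_prop = log_post(w_prop)
    if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
        w, lp = w_prop, lp_prop
    if step >= 5000 and step % 10 == 0:       # burn-in, then thinning
        samples.append(w.copy())
samples = np.asarray(samples)                 # retained weight samples

# Each weight sample yields one Jacobian; here a stand-in scalar sensitivity.
jac = samples[:, 0] * samples[:, 1]
print("mean:", jac.mean(), "std (uncertainty):", jac.std())
```

The spread of `jac` across weight samples is exactly the kind of per-term uncertainty reported as the ± values in Table 5.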
The reason for such uncertainty can be the pollution of the learning process by multicollinearities in the data (inputs and outputs), which introduce compensation phenomena. For example, if two inputs are correlated and both are used by the statistical regression to predict an output component, then the learning has some indeterminacy: it can give more or less emphasis to the first of the inputs as long as it compensates this under- or overallocation by, respectively, an over- or underallocation to the second, correlated input variable. This means that the two corresponding sensitivities will be highly variable from one training run to another. The output prediction would be just as good in both cases, but the internal structure of the model would be different. Since it is these internal structures (i.e., the Jacobians) that are of interest here, this problem needs to be resolved. To see whether multicollinearities and the consequent compensation phenomena are at the origin of the sensitivity uncertainties, the correlation between sensitivities is measured. If some of these sensitivities are correlated or anticorrelated, it means that from one learning cycle to another, the sensitivities will always be related following the compensation principle. The correlation
Table 5: Global Mean Nonregularized Neural Sensitivities ∂y/∂x.

                  Ts          WV          Em 19V      Em 19H      Em 22V      Em 37V      Em 37H      Em 85V      Em 85H
TB19V         0.26±0.19   0.04±0.23   0.91±0.18  −0.20±0.23  −0.46±0.36   0.02±0.19  −0.29±0.15  −0.17±0.21  −0.12±0.19
TB19H         0.08±0.19   0.42±0.27  −0.16±0.24   1.26±0.40   0.59±0.28  −0.54±0.23   0.03±0.18  −0.30±0.23  −0.43±0.27
TB22V         0.11±0.19  −0.79±0.27   0.17±0.21  −0.14±0.25   0.57±0.22  −0.15±0.21  −0.09±0.17  −0.77±0.21  −0.26±0.22
TB37V         0.20±0.18  −0.16±0.21   0.19±0.18  −0.25±0.21   0.25±0.21   1.12±0.19   0.05±0.15   0.63±0.20   0.01±0.19
TB37H         0.15±0.18  −0.67±0.23  −0.28±0.17  −0.00±0.22  −0.13±0.21   0.18±0.19   0.84±0.15  −0.20±0.20   0.61±0.21
TB85V         0.24±0.16  −0.05±0.20  −0.54±0.17  −0.14±0.19  −0.61±0.23  −0.29±0.18  −0.33±0.14   1.06±0.18  −0.15±0.20
TB85H        −0.13±0.15   1.60±0.18   0.05±0.15  −0.16±0.17   0.09±0.18  −0.12±0.16   0.02±0.13  −0.14±0.16   0.45±0.17
Ts (FG)       0.18±0.08  −0.15±0.11  −0.27±0.08  −0.14±0.09  −0.26±0.10  −0.31±0.08  −0.12±0.07  −0.26±0.09  −0.07±0.09
WV (FG)      −0.04±0.05   0.33±0.07   0.03±0.06   0.04±0.08  −0.01±0.09   0.03±0.06  −0.04±0.05  −0.06±0.07  −0.15±0.08
Em 19V (FG)  −0.07±0.04   0.07±0.06   0.12±0.05   0.11±0.08   0.09±0.07   0.15±0.05   0.06±0.04   0.16±0.05   0.03±0.05
Em 19H (FG)  −0.11±0.08  −0.03±0.10   0.22±0.09  −0.04±0.15   0.29±0.14   0.19±0.09   0.09±0.07   0.16±0.10   0.18±0.10
Em 22V (FG)  −0.06±0.04   0.04±0.05   0.11±0.04   0.04±0.04   0.15±0.04   0.13±0.04   0.05±0.03   0.13±0.04   0.06±0.04
Em 37V (FG)  −0.07±0.04   0.02±0.05   0.11±0.04   0.07±0.05   0.13±0.05   0.16±0.04   0.07±0.03   0.15±0.04   0.07±0.04
Em 37H (FG)  −0.08±0.06  −0.07±0.07   0.14±0.06   0.09±0.07   0.15±0.07   0.18±0.06   0.11±0.05   0.20±0.06   0.15±0.06
Em 85V (FG)  −0.04±0.04  −0.05±0.06   0.07±0.04   0.07±0.05   0.11±0.05   0.10±0.05   0.04±0.04   0.20±0.05   0.10±0.05
Em 85H (FG)  −0.03±0.07  −0.18±0.09   0.12±0.09  −0.04±0.11   0.17±0.10   0.12±0.08   0.05±0.06   0.21±0.08   0.21±0.07
Tlay (FG)    −0.03±0.06   0.13±0.09  −0.04±0.08   0.01±0.09  −0.07±0.09  −0.06±0.08  −0.03±0.06  −0.13±0.08  −0.04±0.07

Notes: Columns are network outputs, y, and rows are network inputs, x. The first part of the table is for SSM/I observations; the second part corresponds to first guesses (FG). Sensitivities with absolute value higher than 0.3 are in bold, and positive 5% significance tests are indicated by an asterisk.
of a set of sensitivities is shown in Table 6; some of the correlations are significant. For example, as expected, the correlation between the sensitivities of Ts to TB19V and to TB22V is larger in absolute value than that between the sensitivities of Ts to the higher-frequency TBs. The negative sign of this correlation is explained by the fact that, TB19V being highly correlated with TB22V, a large sensitivity of Ts to TB19V will be compensated in the NN by a low sensitivity to TB22V, leading to a negative correlation. The absolute values of the correlations are not extremely high (about 0.3 or 0.4), but taken together, all these correlations define a quite complex and strong dependency structure among the sensitivities. This is a sign that multicollinearities and subsequent compensations are acting in the network model. To avoid such multicollinearity problems, the network learning needs to be regularized, either by using physical a priori information to better constrain the learning, in particular the dependency structure among the variables, or by employing statistical a priori information that reduces the number of degrees of freedom of the learning process in a physically meaningful way. In the following sections, we investigate the latter regularization strategy, using PCA.

4.4 Principal Component Analysis of Inputs and Outputs

Let Cx be the K × K covariance matrix of the inputs to the neural network and Cy be the M × M covariance matrix of the outputs. We use the eigendecomposition of these two matrices to obtain Fx and Fy, the K × K and M × M matrices whose columns are the corresponding eigenvectors. Instead of the full matrices, we can use the truncated K × K′ matrix F′x and the M × M′ matrix F′y (K′ < K and M′ < M), retaining only the lower-order components (Aires et al., 2002a). Inputs x and outputs y are projected using
    x′ = F′x · S1x^(−1) · (x − m1x),    (4.2)
    y′ = F′y · S1y^(−1) · (y − m1y),    (4.3)
where S1x and S1y are the diagonal matrices whose diagonal terms are the standard deviations of, respectively, the inputs and the outputs, and the vectors m1x and m1y are the input and output means. The vectors x′ and y′ are a compression of the real data; the inverse transformations of equations 4.2 and 4.3 go back from the compression to the full representation, with, of course, some compression error. PCA is optimal in the least-squares sense: the square errors between the data and their PCA representation are minimized. Using a reduced PCA representation reduces the dimension of the data, but a compromise needs to be found between a good compression level (a smaller number of PCA components) and a small compression error (a larger number of PCA components): the more PCA components are used for compression, the lower the compression error. Another advantage of the PCA representation is to suppress part of
Table 6: Correlation Matrix for a Sample of Neural Network Sensitivities.

             ∂Ts/∂Ts  ∂Ts/∂TB19V  ∂Ts/∂TB19H  ∂Ts/∂TB22V  ∂Ts/∂TB37H  ∂Ts/∂TB85V  ∂Ts/∂TB85H  ∂Ts/∂Em19V  ∂Ts/∂Em19H  ∂Ts/∂Em85H
∂Ts/∂Ts       1.00     −0.19       −0.15       −0.05        0.13       −0.08       −0.01        0.14       −0.00       −0.01
∂Ts/∂TB19V    ...       1.00       −0.18       −0.44       −0.00       −0.04        0.18       −0.17        0.14        0.18
∂Ts/∂TB19H    ...       ...         1.00       −0.16       −0.59        0.12       −0.05        0.25       −0.44       −0.41
∂Ts/∂TB22V    ...       ...         ...         1.00       −0.01       −0.18       −0.08       −0.16        0.09        0.17
∂Ts/∂TB37H    ...       ...         ...         ...         1.00       −0.03       −0.38       −0.06        0.03        0.06
∂Ts/∂TB85V    ...       ...         ...         ...         ...         1.00       −0.45        0.15       −0.03        0.03
∂Ts/∂TB85H    ...       ...         ...         ...         ...         ...         1.00        0.04        0.01       −0.12
∂Ts/∂Em19V    ...       ...         ...         ...         ...         ...         ...         1.00       −0.41       −0.31
∂Ts/∂Em19H    ...       ...         ...         ...         ...         ...         ...         ...         1.00        0.26
∂Ts/∂Em85H    ...       ...         ...         ...         ...         ...         ...         ...         ...         1.00

Note: Correlations with absolute value higher than 0.3 are in bold.
the noise during the compression process: the lower-order principal components of a PCA decomposition describe the real variability of the observations, that is, the signal, while the remaining principal components describe higher-frequency variability. The higher orders are more likely to be related to the gaussian noise of the instrument or to very minor variability. In the following, we consider that the higher-order components describe noise (instrumental noise plus unimportant information) and use the reduced instead of the full PCA representation. We will not comment on compression or denoising considerations in this study (see Aires et al., 2002a). Figure 6 shows the cumulative percentage of variance explained as a function of the number of PCA components for the input and output data. The first PCA components for the inputs and the outputs of the neural network are represented in Figures 7A and 7B, respectively. The physical consistency of the PCA has been checked (not shown) by projecting the samples of the database B onto the map of the first two principal components, which represent most of the variability. Clusters of points are related to surface characteristics. Since surface types are known to represent a large part of the variability, the fact that the PCA coherently separates them demonstrates the physical significance of the PCA representation. This is particularly important because the PCA will be used, in the following, to regularize the NN
Figure 6: Percentage of variance explained by the first PCA components of inputs (solid) and outputs (dotted) of the neural network.
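The projections of equations 4.2 and 4.3 and the explained-variance curve of Figure 6 can be sketched as follows. The truncation K′ = 12 matches section 4.5; the data and everything else are synthetic and illustrative:

```python
import numpy as np

# Synthetic correlated "inputs"; K' = 12 matches section 4.5, the data are
# illustrative only.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 17)) @ rng.normal(size=(17, 17))

m1x = X.mean(axis=0)
S1x = X.std(axis=0)
Z = (X - m1x) / S1x                          # center and normalize: S1x^-1 (x - m1x)

C = np.cov(Z, rowvar=False)                  # covariance matrix Cx
eigval, Fx = np.linalg.eigh(C)               # eigendecomposition
order = np.argsort(eigval)[::-1]             # sort by decreasing variance
eigval, Fx = eigval[order], Fx[:, order]

Kp = 12                                       # truncation K' < K
Fxp = Fx[:, :Kp]                              # truncated eigenvector matrix F'x
Xp = Z @ Fxp                                  # eq. 4.2: compressed representation x'

Z_rec = Xp @ Fxp.T                            # inverse transform, with compression error
rmse = np.sqrt(((Z - Z_rec) ** 2).mean())
explained = eigval[:Kp].sum() / eigval.sum()  # cumulative explained variance (cf. Figure 6)
print(f"explained: {100 * explained:.2f}%  compression RMSE: {rmse:.4f}")
```

Sweeping Kp from 1 to 17 traces out a curve of the same kind as Figure 6: the explained variance grows and the compression error shrinks as components are added.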
Figure 7: First PCA components of inputs (A) and outputs (B) of the neural network.
learning. The patterns found by the PCA will distribute the contribution of each input and each output for a given sensitivity; it is essential that these patterns have a physical meaning.

4.5 PCA Regression Approach

Reducing the dimension of the inputs decreases the number of parameters in the regression model (i.e., weights in the neural network) and consequently the number of degrees of freedom of the model, which benefits any statistical technique: the variance in determining the actual values of the neural weights is reduced. The training of the NN is also simpler because the inputs are decorrelated. Correlated inputs in a regression are called multicollinearities, and they are well known to cause problems for the model fit (Gelman et al., 1995). Suppressing these multicollinearities makes the minimization of the quality criterion more efficient: it is easier to minimize, with less probability of becoming trapped in a local minimum. It therefore has the general effect of suppressing uncertainty in the determination of the parameters of the NN model. (For a detailed description of PCA-based regression, see Jolliffe, 2002.)

How many PCA components should the regression use? From section 4.4, it is preferable to use the optimal compromise between compression fit and denoising in terms of global statistics. This statement concerns only the PCA representation and does not take into account how the NN uses these components. No theoretical result defines the optimal number of PCA components to be used in a regression; it depends entirely on the problem to be solved, and various tests can be performed. Experience with the NN technique shows that if the problem is well regularized, then once sufficient information is provided as input, adding more PCA components to the inputs does not have a large impact on the retrieved results; the processing just requires more computation because of the increased data dimension.
Therefore, we recommend being conservative and taking more PCA components than the denoising optimum would indicate, in order to keep all possibly useful information. In terms of output retrieval quality, the number of PCA components used in the input of the NN needs to be reduced for reasons other than just denoising or compression. In fact, during the learning stage, the NN is able to relate each output to the inputs that help predict it, disregarding the inputs that vary randomly. In some cases, the dimension of the network input is so big (a few thousand; Aires et al., 2002a) that compression is necessary. In our case, K = 17 is easily manageable, so all the input variables could be used. For this study, K′ = 12 is chosen to reduce the number of degrees of freedom in the network architecture. This number of input PCA components is large enough for the retrieval, representing 99.46% of the total variance (see Table 7). No additional information would be gained from adding the higher-order PCA input components.
Table 7: Global Mean Neural Sensitivities ∂y′/∂x′ of Raw Network Output and Input.

NN Inputs    Compo 1      Compo 2      Compo 3      Compo 4      Compo 5
Compo 1    −0.81±0.01   −0.25±0.01    0.53±0.01   −0.05±0.01   −0.03±0.01
Compo 2    −0.21±0.01   −0.69±0.01   −0.62±0.01    0.16±0.01    0.10±0.02
Compo 3     0.51±0.01   −0.65±0.01    0.41±0.01    0.47±0.01    0.02±0.01
Compo 4     0.17±0.01   −0.46±0.01    0.05±0.01   −0.44±0.01   −0.07±0.01
Compo 5    −0.06±0.01    0.04±0.01   −0.00±0.01   −0.02±0.01    0.77±0.04
Compo 6    −0.01±0.01    0.02±0.01    0.01±0.01    0.02±0.01    0.07±0.01
Compo 7    −0.01±0.01    0.02±0.01   −0.01±0.01   −0.02±0.01    0.10±0.01
Compo 8    −0.03±0.01   −0.10±0.01    0.01±0.01   −0.04±0.01   −0.05±0.01
Compo 9     0.01±0.01    0.02±0.01   −0.01±0.01   −0.00±0.01   −0.25±0.01
Compo 10   −0.01±0.01    0.03±0.01    0.00±0.01    0.02±0.01    0.12±0.01
Compo 11   −0.14±0.01    0.22±0.01   −0.01±0.01    0.05±0.01    0.25±0.01
Compo 12    0.10±0.01   −0.18±0.01    0.10±0.01    0.08±0.01   −0.14±0.01

Notes: Columns are network outputs, y′, and rows are network inputs, x′. Sensitivities with absolute value higher than 0.3 are in bold.
The number of PCA components used in the NN output is related to the retrieval error magnitude of the nonregularized NN. If the compression error is small compared to the retrieval error of the nonregularized network, then M′, the number of output components used, is satisfactory. It would be useless to try to retrieve something that is essentially noise; furthermore, doing so could lead to numerical problems and interfere with the retrieval of the other, more important output components. In this application, M′ = 5 has been chosen, representing 99.93% of the total variability of the outputs. The outputs of the network, that is, the PCA components, are not homogeneous; they have different dynamic ranges, and the importance of each component in the output of the NN is not equal. The first PCA component represents 52.68% of the total variance of the data, whereas the fifth represents only 0.43%. Giving the same weight to each of these components during the learning process would be misleading. To resolve this, we give a different weight to each of the network outputs in the "data" part, ED, of the quality criterion used for the network learning (see section 2). For an output component, this weight is equal to the standard deviation of the component. This is equivalent to using equation 2.1, where Ain is the diagonal matrix with diagonal terms equal to the standard deviations of the PCA components. Off-diagonal terms are zero since, by definition, no correlation exists between the components in εy = (t(n) − gw(x(n))), the output error (target, or desired output, minus the network output).
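A sketch of this weighting follows. The exact placement of the weight is fixed by equation 2.1, which is not reproduced in this excerpt; multiplying the squared component error by the component standard deviation is an assumption of this sketch, and all numbers are made up:

```python
import numpy as np

# Illustrative: weight each output PCA component by its standard deviation in
# the data term E_D. Multiplying the squared component error is an assumption
# of this sketch (the exact form is given by equation 2.1, not shown here).
rng = np.random.default_rng(3)
n, m = 1000, 5
comp_std = np.array([7.3, 3.1, 1.9, 0.9, 0.4])      # illustrative component stds
t = comp_std * rng.normal(size=(n, m))              # target components t(n)
g = t + rng.normal(scale=0.1, size=(n, m))          # stand-in NN outputs g_w(x(n))

eps_y = t - g                                       # output error epsilon_y
Ain = np.diag(comp_std)                             # diagonal weighting matrix
E_D = 0.5 * np.einsum('nk,kl,nl->', eps_y, Ain, eps_y)
print(E_D)
```

Because Ain is diagonal, this reduces to a per-component weighted sum of squared errors, so the first, high-variance component dominates the criterion less than it would with uniform weights.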
4.6 Retrieval Results of PCA Regression. The mean RMS retrieval error for the new NN with PCA representation of its inputs and outputs is slightly higher than for the original nonregularized NN. For example, the surface skin temperature RMS error is 1.53 instead of 1.46 in the nonregularized NN. This is expected because we know that reducing variance (overfitting) by regularization increases the bias (RMS error). This is know as the bias/variance dilemma (Geman, Bienenstock, & Doursat, 1992). This dilemma describes the compromise that must be found between a good fitting on the learning database B and a robust model with physical meaning. The differences of RMS errors are, in this case, negligible. In order to estimate the NN weight uncertainties, we use the approach described in section 2: the Hessian matrix H must first be computed and then regularized in order to obtain the covariance matrix of the weights PDF. This regularization of the Hessian matrix is done to make it positive definite, which is not the same goal as the regularization of the NN behavior by the PCA representation. These two regularization steps should not be confused. Figure 8 presents the corresponding standard deviation for the NN weights with various regularization parameters λ around the optimal value,
[Figure 8 plot: standard deviation of each weight value (y-axis) versus weight number (x-axis), for λ = 600, 660, 670, 680, and 1000.]
Figure 8: Standard deviation of NN weights with increased regularization parameter λ: λ = 600 (dotted), λ = 660 (dot-dash), λ = 670 (dashed), λ = 680 (solid), and λ = 1000 (asterisk).
2450
F. Aires, C. Prigent, and W. Rossow
λ = 660, which is determined as described in section 2.6 using various quality criteria. It is interesting that the ill-conditioning of the Hessian matrix shows large sensitivity to some particular network weights. For λ too small, the standard deviation is very chaotic and nonmonotonic, with some values going from extremely large to even negative ones. Increasing λ makes the standard deviations of these particular weights converge to more acceptable, positive values, consistent with the other standard deviations. At the same time, increasing λ uniformly decreases the standard deviation of all the network weights. A balance must be found: λ must be large enough to regularize H but not so large that it changes the standard deviations of well-behaved weights. This is probably the most important issue for the uncertainty estimates described in this study. Another approach to obtaining a well-conditioned Hessian would be to constrain the Hessian matrix H to stay positive definite during the learning stage.

4.7 PCA-Regularized Jacobians. Before they are introduced as inputs and outputs of the neural network, the reduced-PCA representations, x and y, need to be centered and normalized. This is a requirement for the neural network method to work efficiently. The new inputs and outputs of the neural network are given by:
x′ = S_2x^{−1} · (x − m_2x),   (4.4)

y′ = S_2y^{−1} · (y − m_2y),   (4.5)

where S_2x and S_2y are the diagonal matrices of the standard deviations of, respectively, x and y (defined in equations 4.2 and 4.3), and the vectors m_2x and m_2y are the respective means. The NN formulation allows derivation of the network Jacobian ∂y′/∂x′ for the normalized quantities of equations 4.4 and 4.5. To obtain the Jacobian in physical units, one should use equations 4.2 through 4.5 to find

∂y/∂x = S_1y · F_y · S_2y · (∂y′/∂x′) · S_2x^{−1} · F_x^T · S_1x^{−1}.   (4.6)
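In code, equation 4.6 is a chain of matrix products. The sketch below assumes F_x and F_y are the PCA eigenvector matrices of equations 4.2 and 4.3, and the placement of the transpose on F_x follows our reading of equation 4.6; all names are illustrative:

```python
import numpy as np

def physical_jacobian(J_norm, S1y, Fy, S2y, S2x, Fx, S1x):
    """Map the network Jacobian dy'/dx' (normalized PCA quantities)
    back to physical units following equation 4.6:
        dy/dx = S1y . Fy . S2y . (dy'/dx') . S2x^-1 . Fx^T . S1x^-1
    The S* arguments are diagonal matrices of standard deviations and
    the F* arguments PCA eigenvector matrices (an assumption of this
    sketch)."""
    return (S1y @ Fy @ S2y @ J_norm
            @ np.linalg.inv(S2x) @ Fx.T @ np.linalg.inv(S1x))
```

With identity scaling and projection matrices, the physical Jacobian reduces to the normalized one, which is a quick sanity check of the chain.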
Equation 4.6 gives the neural Jacobian for the physical variables x and y. To enable comparison of the sensitivities between variables with different variation characteristics, the terms S_1y and S_1x^{−1} can be suppressed in this expression, so that each input and output variable is normalized by its standard deviation. The resulting nonlinear Jacobians indicate the relative contribution of each input in the retrieval of a given output variable.

4.8 Uncertainty of Regularized NN Jacobians. In Table 7, the PCA-regularized NN is used to estimate the mean Jacobian matrix ∂y/∂x of raw
network outputs and inputs, together with the corresponding standard deviations. The standard deviations are much more satisfactory in this case: some high sensitivities are present, but they are all significant at the 5% level. The structure of this sensitivity matrix is interesting and illustrates the way the NN connects inputs and outputs together. For example, the first output component is related to the first input component (0.81 sensitivity value) but also to the third input component (0.51). This shows that the PCA components are not the same in output and in input, so that the NN needs to nonlinearly transform the input components to retrieve the output ones. With increasing output component number, the input component number used increases too, but higher-order input components (more than five) have limited impact. Even if the mean sensitivity of an input component is low, this does not mean that it has no impact on the retrieval in some situations: the nonlinearity of the NN allows the sensitivities to be situation dependent, so that a particular input component can be valuable for particular situations.
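The Monte Carlo sampling of sensitivities used throughout this section (the strategy of section 2.7) can be sketched as follows; this is a toy illustration with invented names, and the noise-level scaling of the weight covariance is omitted:

```python
import numpy as np

def sample_sensitivities(w_opt, H, lam, jacobian_fn, n_samples=20, seed=0):
    """Monte Carlo sampling of network sensitivities: the weight PDF is
    taken as a Gaussian centered on the trained weights, with covariance
    (H + lam*I)^-1 (the regularized Hessian).  Weights are drawn from it
    and the Jacobian is evaluated for each draw; the spread of the
    returned samples estimates the sensitivity uncertainty."""
    rng = np.random.default_rng(seed)
    cov = np.linalg.inv(H + lam * np.eye(len(w_opt)))  # regularized weight covariance
    ws = rng.multivariate_normal(w_opt, cov, size=n_samples)
    return np.array([jacobian_fn(w) for w in ws])      # one Jacobian per draw
```

For a linear toy model y = w·x, the Jacobian is w itself, so the sampled "sensitivities" simply scatter around the trained weights with a spread set by the regularized Hessian.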
Using equation 4.6, we obtain the corresponding Jacobian matrix ∂y/∂x for the physical variables instead of the PCA components, but normalized so that individual sensitivities can be compared (see Table 8). The uncertainty of the sensitivities is now very low, and most of the mean sensitivities are significant at the 5% level. This demonstrates that the PCA regularization has solved, at least partially, the problem of Jacobian uncertainty by suppressing the multicollinearities in the statistical regression. Interferences among variables are suppressed, and the standard deviations calculated for each neural sensitivity are very small compared to the values previously estimated without regularization (see Table 5). In addition, the sensitivities make more sense physically, as expected. The retrieved Ts is very sensitive to the brightness temperatures at vertical polarizations for the lower frequencies (see the numbers in bold in the corresponding column). Since the emissivities are close to one for the vertical polarization (and higher than for the horizontal polarization), Ts is almost proportional to TB in window channels (i.e., those that are not affected by water vapor). Sensitivity to the Ts first guess is also rather high but associated with a higher standard deviation. Sensitivities to the first-guess emissivities are weak, regardless of frequency and polarization. WV information clearly comes from the 85 GHz horizontal polarization channel. It is worth emphasizing that the sensitivity of WV to TB85H is almost twice as large as to the WV first guess, meaning that real, pertinent information is extracted from this channel. Sensitivity of the retrieved emissivities to the inputs strongly depends on the polarization, the vertical polarization emissivities being more directly related to Ts and TBV given their values, generally close to one.
Emissivities in vertical polarization are essentially sensitive to Ts and to the TBV, whereas the emissivities in the horizontal polarization are dominated by the emissivity first guess.

Table 8: Global Mean Regularized Neural Sensitivities ∂y/∂x. Columns are network outputs, y, and rows are network inputs, x; the first part of the table is for the SSM/I observations (TB19V through TB85H), and the second part corresponds to the first guesses (Ts, WV, the emissivities, and Tlay). Sensitivities with absolute value higher than 0.3 are in bold.

Figure 9: (A) Twenty samples of five neural network sensitivities (∂Ts/∂Ts, ∂Ts/∂TB37V, ∂Ts/∂TB85V, ∂Ts/∂Em37V, and ∂Ts/∂Em85V). (B) Histogram of the same network sensitivities.

The sensitivity matrix clearly illustrates how the NN extracts the information from the inputs to derive the outputs. In Figure 9 (resp. Figure 10), 20 samples of five NN sensitivities are represented for the nonregularized (resp. regularized) network. These samples are obtained using the Monte Carlo sampling strategy described in section 2.7; the histograms of the same NN sensitivities are also shown. As can be seen in these two figures, the uncertainty on the NN sensitivities is largely reduced with the regularized NN (see Figure 10) compared to the nonregularized NN (see Figure 9). Experiments (not shown) establish that such PCA-regularized NNs have robust Jacobians even when the NN architecture is changed, for example, with a different number of neurons in the hidden layer. This shows how robust and reliable the new NN Jacobians and the NN model have become with the help of the PCA representation regularization.

5 Conclusion and Perspectives

This study provides insight into how the NN model works and how the NN outputs are estimated. These developments draw the NN technique closer to
Figure 10: (A) Twenty samples of five regularized neural network sensitivities (∂Ts/∂Ts, ∂Ts/∂TB37V, ∂Ts/∂TB85V, ∂Ts/∂Em37V, and ∂Ts/∂Em85V). (B) Histogram of the same network sensitivities.
better-understood classical methods, in particular linear regression. With these older techniques, estimating the uncertainties of the statistical fit parameters is standard practice and mandatory before the regression model is used. Having similar statistical tools at our disposal for the NN establishes it on a stronger theoretical and practical basis, so that NNs become a natural alternative to traditional regression methods, with the particular advantage of nonlinearity. The tools are very generic and can be used for different linear or nonlinear regression models. A fully multivariate formulation is introduced; its generality will allow future developments (like the iterative reestimation strategy or the fully Bayesian estimation of the hyperparameters). The uncertainty of the NN weights can be large but, as we saw, the complex structure of the correlations constrains this variability so that the NN outputs remain a good statistical fit to the desired function. In the Bayesian approach, the prediction (estimation of the NN output) does not use just a specific estimate of the weights w but rather integrates the outputs over the distribution of weights P(w), the "plausibility" distribution. This approach is different in the sense that the prediction is given in terms of a PDF instead of the mode value. This article describes a technique to estimate
the uncertainties of NN retrievals and provides a rigorous description of the sources of uncertainty. The second application of the weight PDF is the estimation of the NN Jacobian uncertainty. In this article, we show how to estimate the Jacobians of a nonlinear regression model, in particular an NN model. New tools are provided to check how robust and stable these Jacobians are by estimating their uncertainty PDF with Monte Carlo simulations, allowing the identification of situations where regularization needs to be used. As is often the case, regularization is a fundamental step of NN learning, especially for inverse problems (Badeva & Morosov, 1991; Tikhonov & Arsenin, 1977). We propose a regularization method based on PCA regression (using a PCA representation of the input and output data of the NN) to suppress the multicollinearities in the data that are at the origin of the NN Jacobian variability. Our approach makes the learning process more stable and the Jacobians more reliable and more easily interpreted physically. All these tools are very general and can be used for other nonlinear models of statistical inference. Together with the introduction of first-guess information, first described in Aires et al. (2001), error specification brings the neural network approach even closer to more traditional inversion techniques like variational assimilation (Ide, Courtier, Ghil, & Lorenc, 1997) and iterative methods in general. Furthermore, quantities obtained from NN retrievals can now be combined with a forecast model in a variational assimilation scheme, since the error covariance matrices can be estimated. These covariance matrices are not constant; they are situation dependent. This makes the scheme even better, since it is now possible to assimilate only inversions of good quality (low-uncertainty estimates).
Bad situations can be discarded from the assimilation or, even better, used as an "extreme" detection scheme that would, for example, signal the need for an increased number of simulations in an ensemble forecast. All these new developments establish the NN technique as a serious candidate for remote sensing in operational schemes, compared to the more classical approaches (Twomey, 1977). Our method provides a framework for the characterization, analysis, and interpretation of the various sources of uncertainty in any NN-based retrieval scheme. This makes improvements to the inversion schemes possible: any fault that can be detected can be corrected, whether a lack of data in the observation domains, errors of the model in some specific situations, or detection of extreme events. This should benefit a large community of NN users in meteorology and climatology. Many new algorithmic developments can be pursued, and we have provided a few ideas. For example, the network output uncertainties can easily be used for novelty detection (i.e., detecting data that have not been used to train the network) or fault detection (i.e., detecting data that are corrupted by errors, such as instrument-related problems). Our determination of error characteristics can also be used with adaptive learning algorithms (i.e., learning when a small additional data set is provided after the main learning of the network has been done). The NN Jacobians can be used to express the various sources of uncertainty in even more detail, using Rodgers's approach (Rodgers, 1990). One source of uncertainty can be the presence of noise in the NN inputs (Wright et al., 2000). Another technical development would be the optimization of the hyperparameters using an iterative reestimation strategy or an evidence measure in a Bayesian framework (Neal, 1996; Nabney, 2002). The Jacobians of a nonlinear model, such as the NN, are a very powerful concept, and many applications of the NN Jacobians can be derived from this study. The Hessian matrix can be used for many purposes (Bishop, 1996): (1) in several second-order optimization algorithms; (2) in adaptive learning algorithms; (3) to identify parameters (i.e., weights) that are not significant in the model, as indicated by small diagonal terms Hii, which is used by regularization processes like the "weight pruning" algorithm; (4) for automatic relevance determination, which is able to select the most informative network inputs and eliminate the negligible ones; and (5) to give a posteriori distributions of the neural weights, as we do here. We also saw that the regularization of the Hessian matrix is essential if one wants to use it; a regularization solution is given in this article, but for some purposes other techniques can also be used. In terms of the NN model, this allows us to obtain a robust model that generalizes well and suffers less from overfitting. Jacobians can also be used to analyze how the NN model links the inputs and the outputs in a nonlinear way.
This capacity for measuring the internal structure of the NN is especially important when the NN is used as an analysis tool, as proposed by Aires and Rossow (2003), where the NN Jacobians are estimated in order to analyze feedback processes in a dynamical system. The reliable estimation of physical Jacobians through the NN model is an ideal candidate for the study of climate feedbacks in both numerical models and observational data sets. The new ideas and techniques presented in this article will directly benefit such studies.

Acknowledgments

We thank Ian T. Nabney for providing the Netlab toolbox, from which some of the routines used in this work were taken. F.A. thanks Andrew Gelman for very interesting discussions about modern Bayesian statistics. This work was partly supported by the NASA Radiation Sciences and Hydrology Programs.

References

Aires, F., Prigent, C., Rossow, W. B., & Rothstein, M. (2001). A new neural network approach including first guess for retrieval of atmospheric water vapor,
cloud liquid water path, surface temperature and emissivities over land from satellite microwave observations. J. Geophys. Res., 106(D14), 14,887–14,907.
Aires, F., & Rossow, W. B. (2003). Inferring instantaneous, multivariate and nonlinear sensitivities for the analysis of feedback processes in a dynamical system: The Lorenz model case study. Q. J. Roy. Met. Soc., 129, 239–275.
Aires, F., Rossow, W. B., Scott, N. A., & Chédin, A. (2002a). Remote sensing from the infrared atmospheric sounding interferometer instrument. 1: Compression, de-noising, and first guess retrieval algorithms. J. Geophys. Res., 107(D22), 4619.
Aires, F., Rossow, W. B., Scott, N. A., & Chédin, A. (2002b). Remote sensing from the IASI instrument. 2: Simultaneous retrieval of temperature, water vapor and ozone atmospheric profiles. J. Geophys. Res., 107(D22), ACH-7.
Aires, F., Schmitt, M., Scott, N. A., & Chédin, A. (1999). The weight smoothing regularisation for resolving the input contribution's errors in functional interpolations. IEEE Trans. Neural Networks, 10, 1502–1510.
Badeva, V., & Morosov, V. (1991). Problèmes incorrectement posés. Paris: Masson.
Bates, D. M., & Watts, D. G. (1988). Nonlinear regression analysis and its applications. New York: Wiley.
Bishop, C. (1996). Neural networks for pattern recognition. Oxford: Clarendon Press.
Crone, L., & Crosby, D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37, 324–328.
Gelman, A. B., Carlin, J. S., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman & Hall.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Cambridge, MA: Perseus Books.
Ide, K., Courtier, P., Ghil, M., & Lorenc, A. C. (1997). Unified notation for data assimilation: Operational, sequential and variational. J. Meteorol. Soc. Japan, 75, 181–189.
Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.
Koroliouk, V., Portenko, N., Skorokhod, A., & Tourbine, A. (1983). Aide-mémoire de théorie des probabilités et de statistique mathématique. Moscow: Éditions Mir.
Krasnopolsky, V. M., Breaker, L. C., & Gemmill, G. H. (1995). A neural network as a nonlinear transfer function model for retrieving surface wind speeds from the special sensor microwave imager. J. Geophys. Res., 100, 11,033–11,045.
Krasnopolsky, V. M., Gemmill, G. H., & Breaker, L. C. (2000). A neural network multiparameter algorithm for SSM/I ocean retrievals: Comparisons and validations. Remote Sens. Environ., 73, 133–142.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
Nabney, I. T. (2002). Netlab: Algorithms for pattern recognition. New York: Springer-Verlag.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Prigent, C., Aires, F., & Rossow, W. B. (2003). Land surface skin temperatures from a combined analysis of microwave and infrared satellite observations for an all-weather evaluation of the differences between air and skin temperatures. J. Geophys. Res., 108(D10), 4310.
Prigent, C., & Rossow, W. B. (1999). Retrieval of surface and atmospheric parameters over land from SSM/I: Potential and limitations. Q. J. R. Met. Soc., 125, 2379–2400.
Rivals, I., & Personnaz, L. (2000). Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 13, 463–484.
Rivals, I., & Personnaz, L. (2003). MLPs (mono-layer polynomials and multilayer perceptrons) for nonlinear modeling. J. Machine Learning Research, 3, 1383–1398.
Rodgers, C. D. (1976). Retrieval of atmospheric temperature and composition from remote measurements of thermal radiation. Rev. Geophys., 14, 609–624.
Rodgers, C. D. (1990). Characterization and error analysis of profiles retrieved from remote sounding measurements. J. Geophys. Res., 95, 5587–5595.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Saltelli, A., Chan, K., & Scott, E. M. (2000). Sensitivity analysis. New York: Wiley.
Stogryn, A. P., Butler, C. T., & Bartolac, T. J. (1994). Ocean surface wind retrievals from special sensor microwave imager data with neural networks. J. Geophys. Res., 99, 981–984.
Tarantola, A. (1987). Inverse problem theory: Models for data fitting and model parameter estimation. Amsterdam: Elsevier.
Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. Washington, DC: V. H. Winston.
Twomey, S. (1977). Introduction to the mathematics of inversion in remote sensing and indirect measurements. Amsterdam: Elsevier.
Vapnik, V. (1997). The nature of statistical learning theory. New York: Springer-Verlag.
Wright, W. A., Ramage, G., Cornford, D., & Nabney, I. T. (2000). Neural network modelling with input uncertainty: Theory and application. Journal of VLSI Signal Processing, 26, 169–188.

Received December 1, 2003; accepted April 29, 2004.
LETTER
Communicated by Erkki Oja
Principal Components Analysis Competitive Learning

Ezequiel López-Rubio
[email protected]

Juan Miguel Ortiz-de-Lazcano-Lobato
[email protected]

José Muñoz-Pérez
[email protected]

José Antonio Gómez-Ruiz
[email protected]
Department of Computer Science and Artificial Intelligence, University of Málaga, Campus de Teatinos, s/n, 29071 Málaga, Spain
We present a new neural model that extends classical competitive learning by performing a principal components analysis (PCA) at each neuron. This model improves on known local PCA methods because it does not need to be presented with the entire data set at each computing step, which allows fast execution while retaining the dimensionality-reduction properties of PCA. Furthermore, every neuron is able to modify its behavior to adapt to the local dimensionality of the input distribution; hence, our model has a dimensionality estimation capability. The experimental results we present show the dimensionality-reduction capabilities of the model with multisensor images.
1 Introduction

Two of the best-known techniques for multidimensional data analysis are vector quantization (VQ) and principal components analysis (PCA). Our aim here is to combine them to obtain a new neural model. Next, we briefly introduce both methods. Vector quantization is a method for approximating a multidimensional signal space with a finite number of representative vectors (code vectors), which form a code book. The aim of competitive neural networks is to cluster or categorize the input data, and they can be used for data coding and compression through vector quantization. This is achieved by adapting the weight vector of each neuron so that it approximates the mean or centroid of a data cluster. On each computing step, only the winning neuron is updated, that is, the neuron whose weight vector is closest to the current input vector. © 2004 Massachusetts Institute of Technology Neural Computation 16, 2459–2481 (2004)
2460
E. López-Rubio et al.
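The winner-take-all scheme just described can be sketched in a few lines (a minimal illustration; initialization and learning-rate schedules are deliberately simplified and the names are ours):

```python
import numpy as np

def competitive_learning(data, n_units, eta=0.1, epochs=10):
    """Plain winner-take-all competitive learning: on each step only
    the neuron whose weight vector is closest to the input moves
    toward it, so the weights converge to cluster centroids."""
    w = data[:n_units].copy()  # initialize code vectors from first samples
    for _ in range(epochs):
        for x in data:
            c = np.argmin(np.linalg.norm(w - x, axis=1))  # winning neuron
            w[c] += eta * (x - w[c])                      # update the winner only
    return w
```

Run on two well-separated clusters with two units, each weight vector settles near one cluster's centroid, which is the VQ behavior described above.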
Competitive learning is an appropriate algorithm for VQ of unlabeled data. Ahalt, Krishnamurthy, Chen, and Melton (1990) discussed the application of competitive learning neural networks to VQ and developed a new training algorithm for designing VQ code books, which yields near-optimal results and can be used to develop adaptive vector quantizers. Yair, Zeger, and Gersho (1992) proposed a deterministic VQ design algorithm, called the soft competition scheme, which updates all the code vectors simultaneously with a step size proportional to each one's probability of winning. Pal, Bezdek, and Tsao (1993) proposed a generalization of learning vector quantization for clustering that avoids the necessity of defining an update neighborhood scheme and whose final centroids do not seem sensitive to initialization. Xu, Krzyzak, and Oja (1993) developed a new algorithm, called rival penalized competitive learning, in which, for each input, not only is the winner unit modified to adapt itself to the input, but its rival also delearns with a smaller learning rate. Ueda and Nakano (1994) presented a new competitive learning algorithm with a selection mechanism based on the equidistortion principle for designing optimal vector quantizers; the selection mechanism enables the system to escape from local minima. Uchiyama and Arbib (1994) showed the relationship between clustering and vector quantization, presented a competitive learning algorithm that generates units where the density of input vectors is high, and showed its efficiency as a tool for clustering color space in color image segmentation based on the least-sum-of-squares criterion. However, data analysis with competitive networks has severe limitations, because the model provides information only about the mean of each cluster. The PCA methods overcome this problem by obtaining the principal directions of the data, that is, the maximum variance directions (Kendall, 1975; Jolliffe, 1986).
Hence, if we have a D-dimensional input space, PCA computes the K principal directions, where K < D. This allows important dimensionality reductions by selecting K ≪ D. In fact, it has been proved that PCA is an optimal linear technique for dimensionality reduction in the mean squared error sense (Jolliffe, 1986). The original method, sometimes called the Karhunen-Loève (KL) transform or global PCA, considers the entire input distribution as a whole. A number of local PCA methods have been proposed to partition the distribution into meaningful clusters (Kambhatla & Leen, 1997; Tipping & Bishop, 1999; Roweis & Ghahramani, 1999). These methods have been widely used to compress multispectral and multilayer images (Saghri, Tescher, & Reagan, 1995; Tretter & Bouman, 1995). Nevertheless, most of these methods process the entire input data set as a whole, so they are not suited for online learning. Furthermore, they do not address the problem of selecting the correct number of basis vectors K. Our goal here is to propose solutions for these issues while retaining the adaptation capabilities of the above-mentioned proposals.
Finally, the adaptive subspace self-organizing map (ASSOM) model is a neural network approach to PCA (Kohonen, 1995, 1996). It is an evolution of the self-organizing feature map (SOFM) model (Kohonen, 1990). The ASSOM model uses a vector basis on each neuron; every vector basis is trained in order to extract the most significant features of a part of the input distribution. This letter is organized as follows. Section 2 presents our new neural model. In section 3, we prove some important properties of the model. We briefly discuss the advantages of our model in section 4. Sections 5 and 6 are devoted to setting out our experimental results and conclusions, respectively.

2 The PCACL Model

2.1 Neuron Weights Updating. The PCA method uses the covariance matrix to analyze the input distribution. The covariance matrix of an input vector x is defined as

R = E[(x − E[x])(x − E[x])^T],   (2.1)
where E[·] is the mathematical expectation operator, and we suppose that all the components of x are real random variables. If we have M input samples, x_1, …, x_M, we can make the following approximation:

R ≈ R_I = (1/(M − 1)) Σ_{i=1}^{M} (x_i − E[x])(x_i − E[x])^T.   (2.2)
Note that this is the best approximation that we can obtain with this information (it is an unbiased estimator with minimum variance). Now, if we obtain N new input samples, x_{M+1}, …, x_{M+N}, we may write:

A = (1/N) Σ_{i=M+1}^{M+N} (x_i − E[x])(x_i − E[x])^T   (2.3)

R ≈ R_II = (1/(M + N − 1)) ((M − 1) R_I + N A).   (2.4)
Both R_I and R_II are approximations of R, but R_II is more accurate because it takes into account all N + M input samples. Equation 2.4 is a method to accumulate the new information (the last N input samples) onto the old information (the first M input samples). Now we need to approximate the expectation of the input vector, E[x]. We may use a similar approach:

E[x] ≈ e_I = (1/M) Σ_{i=1}^{M} x_i,   e_II = (1/N) Σ_{i=M+1}^{M+N} x_i   (2.5)
E[x] ≈ e_III = (1/(M + N)) (Σ_{i=1}^{M} x_i + Σ_{i=M+1}^{M+N} x_i) = (1/(M + N)) (M e_I + N e_II).   (2.6)
Analogously, e_I, e_II, and e_III are approximations of E[x], but e_III is more accurate because it takes into account all the input samples. Note that if we suppose N ≪ M, we have e_I ≈ e_III. We use e_I to approximate E[x] in equation 2.2 and e_III to approximate E[x] in equation 2.4. Note that we do not use e_II in equation 2.4 because it is a poor approximation of E[x] while e_III is better, and both of them can be calculated only when the last N input samples are known. Every processing unit of our model stores approximations to the matrix R and the vector E[x] and uses the above equations to update them. So at time instant t, unit j stores the matrix R_j(t) and the vector e_j(t). Let the learning rate for the covariance matrix, η_R, and the learning rate for the mean, η_e, be

η_R = N/(N + M − 1),   η_e = N/(N + M).   (2.7)
Then we may rewrite equations 2.6 and 2.4 as

e_III = (1 − η_e) e_I + η_e e_II   (2.8)

R_II = (1 − η_R) R_I + η_R A.   (2.9)
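The update rules of equations 2.5 through 2.9 can be sketched as a small incremental estimator (names are ours; in this sketch the very first batch simply initializes R with its own sample covariance):

```python
import numpy as np

class RunningPCAStats:
    """Incremental estimates of E[x] and the covariance R following
    equations 2.4 through 2.9: each batch of N new samples is folded
    into the running estimates using the learning rates eta_e and
    eta_R of equation 2.7."""
    def __init__(self, dim):
        self.M = 0                      # samples seen so far
        self.e = np.zeros(dim)          # running mean estimate e
        self.R = np.zeros((dim, dim))   # running covariance estimate R

    def update(self, batch):
        batch = np.asarray(batch, dtype=float)
        N = len(batch)
        eta_e = N / (N + self.M)                       # eq. 2.7
        e_II = batch.mean(axis=0)                      # eq. 2.5
        self.e = (1 - eta_e) * self.e + eta_e * e_II   # eq. 2.8 (gives e_III)
        d = batch - self.e                             # centered with e_III, as in the text
        A = d.T @ d / N                                # eq. 2.3
        if self.M > 1:
            eta_R = N / (N + self.M - 1)               # eq. 2.7
            self.R = (1 - eta_R) * self.R + eta_R * A  # eq. 2.9
        else:
            self.R = A                                 # first batch initializes R
        self.M += N
```

Because equation 2.6 is exact for the mean, feeding the data in batches reproduces the full-sample mean exactly, which is a convenient sanity check.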
Note that these expressions are analogous to the weight update equations of competitive learning and the self-organizing feature map (SOFM). Equations 2.7 correspond to a hyperbolic decay of the learning rates, but a linear decay may be used as an alternative.

2.2 Competition Among Neurons. When an input sample is presented to a self-organizing map, a competition is held among the neurons. We denote the orthogonal projection of a vector x on an orthonormal vector basis B = {b_h | h = 1, …, K} as x̂ = Orth(x, B). The vector x can be decomposed into two vectors, the orthogonal projection and the projection error, that is, x = x̂ + x̃. Every unit (say, j) of our network has an associated vector basis B_j(t) at every time instant t. It is formed by the K_j(t) eigenvectors corresponding to the K_j(t) largest eigenvalues of R_j(t); this is rooted in the PCA. Note that B_j(t) must be orthonormalized in order for the system to operate correctly. The differences between the input vectors x_i(t), i = 1, …, N, and the estimated means are projected onto the vector bases of all the neurons. The neuron c that has the minimum sum of projection errors is the winner:

z_ji(t) = Orth(x_i(t) − e_j(t), B_j(t))   (2.10)
Principal Components Analysis Competitive Learning
$$c = \arg\min_j \sum_{i=1}^{N} \left\| x_i(t) - e_j(t) - z_{ji}(t) \right\|^2. \quad (2.11)$$
So we are looking for the neuron that best represents the inputs x_i(t). A neuron j represents the ith sample by the following approximation:

$$x_i(t) \approx e_j(t) + \mathrm{Orth}(x_i(t) - e_j(t), B_j(t)). \quad (2.12)$$
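A minimal sketch of the competition defined by equations 2.10 to 2.12, under the assumption that each basis is stored as a matrix of orthonormal rows (all names are ours):

```python
import numpy as np

def orth(x, B):
    """Orth(x, B): orthogonal projection of x onto the span of the
    rows of B, which are assumed to be orthonormal."""
    return B.T @ (B @ x)

def winner(X, means, bases):
    """Competition of equations 2.10 and 2.11: the winner is the neuron
    whose mean and basis give the minimum summed projection error."""
    errors = []
    for e_j, B_j in zip(means, bases):
        err = 0.0
        for x in X:
            z = orth(x - e_j, B_j)            # equation 2.10
            err += np.sum((x - e_j - z) ** 2)
        errors.append(err)
    return int(np.argmin(errors))             # equation 2.11
```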
Approximation 2.12 shows that our model tries to express the input vectors as the sum of a mean vector and a term that depends on the covariance matrix.

2.3 Vector Bases Size Updating.

Our model considers a variable number of basis vectors K_j(t), which is computed independently for each neuron j. This number reflects the intrinsic dimensionality of the data in the receptive field of neuron j (i.e., the set of inputs for which neuron j is the winner). There are several algorithms to approximate the intrinsic dimensionality of the data (Fukunaga & Olsen, 1971; Verveer & Duin, 1995). Most of them are based on the analysis of the eigenvalues λ_j^p(t), p = 1, …, D, of the covariance matrix R_j(t). We propose two methods.

2.3.1 Explained Variance Method.

Each neuron of the PCACL represents a cluster of data, where e_j and R_j are estimations of the mean and the covariance matrix of the cluster, respectively. If the cluster of data were represented only by its sample mean, that is, K = 0, then the mean squared error of this representation would be

$$\mathrm{MSE}_0 = E\left[ \| e_j - x \|^2 \right] = E\left[ \sum_{p=1}^{D} (e_{jp} - x_p)^2 \right]. \quad (2.13)$$
Let b_i be the ith principal direction of neuron j, with ‖b_i‖ = 1 (the index j is dropped for simplicity). If we use a vector base B_Z = {b_1, …, b_Z}, where Z is the number of basis vectors, the mean squared error associated with this representation is given by

$$\mathrm{MSE}_Z = E\left[ \| x - e_j - \mathrm{Orth}(x - e_j, B_Z) \|^2 \right]. \quad (2.14)$$
Note that if Z = 0, we get the maximum possible mean squared error. Our goal here is to select a number of basis vectors K that ensures that at least a fraction α of the maximum mean squared error is removed:

$$K_j = \min\{ Z \in \{0, 1, \ldots, D\} \mid \mathrm{MSE}_0 - \mathrm{MSE}_Z \ge \alpha\, \mathrm{MSE}_0 \}, \quad (2.15)$$
E. López-Rubio et al.
where α ∈ [0, 1]. Next, we show how this can be achieved. First, we rewrite equation 2.14 in terms of projections on the principal directions:

$$\mathrm{MSE}_Z = E\left[ \left\| \sum_{i=1}^{D} (b_i^T (x - e_j))\, b_i - \sum_{i=1}^{Z} (b_i^T (x - e_j))\, b_i \right\|^2 \right] = E\left[ \left\| \sum_{i=Z+1}^{D} (b_i^T (x - e_j))\, b_i \right\|^2 \right]. \quad (2.16)$$
Since the principal directions are orthonormal, this means that

$$\mathrm{MSE}_Z = E\left[ \sum_{i=Z+1}^{D} (b_i^T (x - e_j))^2 \right]. \quad (2.17)$$
From equation 2.17 and the definition of variance,

$$\mathrm{MSE}_Z = \sum_{i=Z+1}^{D} \mathrm{var}(b_i^T (x - e_j)). \quad (2.18)$$
On the other hand, we also know from PCA that the variance of the projection of the input minus its mean onto the pth principal direction b_p equals the pth largest eigenvalue of R:

$$\forall p \in \{1, 2, \ldots, D\}, \qquad \mathrm{var}(b_p^T (x - e_j)) = \lambda_j^p. \quad (2.19)$$
Substitution of equation 2.19 into equation 2.18 yields

$$\mathrm{MSE}_Z = \sum_{p=Z+1}^{D} \lambda_j^p. \quad (2.20)$$
If we remember that e_{jp} = E[x_p], equation 2.13 can be rewritten in terms of variances:

$$\mathrm{MSE}_0 = \sum_{p=1}^{D} \mathrm{var}(x_p). \quad (2.21)$$
From PCA we know that the sum of variances equals the trace of R:

$$\mathrm{MSE}_0 = \mathrm{trace}(R_j) = \sum_{p=1}^{D} \lambda_j^p. \quad (2.22)$$
Finally, we substitute equations 2.20 and 2.22 into 2.15:

$$K_j = \min\left\{ Z \in \{0, 1, \ldots, D\} \;\middle|\; \sum_{p=1}^{D} \lambda_j^p - \sum_{p=Z+1}^{D} \lambda_j^p \ge \alpha\, \mathrm{trace}(R_j) \right\}. \quad (2.23)$$
This can be simplified as follows:

$$K_j = \min\left\{ Z \in \{0, 1, \ldots, D\} \;\middle|\; \sum_{p=1}^{Z} \lambda_j^p \ge \alpha\, \mathrm{trace}(R_j) \right\}. \quad (2.24)$$
Note that there is no need to compute all the eigenvalues of R_j, but only the K_j largest ones. If we add the time instant t, we get the equation to be used in practice:

$$K_j(t) = \min\left\{ Z \in \{0, 1, \ldots, D\} \;\middle|\; \sum_{p=1}^{Z} \lambda_j^p(t) \ge \alpha\, \mathrm{trace}(R_j(t)) \right\}. \quad (2.25)$$
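The selection rule of equation 2.25 amounts to thresholding the cumulative explained variance; a minimal sketch (the function name is ours):

```python
import numpy as np

def explained_variance_k(eigvals, alpha):
    """Equation 2.25: the smallest number of principal directions whose
    eigenvalues account for at least a fraction alpha of trace(R_j).
    `eigvals` must be sorted in decreasing order."""
    total = float(np.sum(eigvals))
    if alpha * total <= 0.0:
        return 0            # Z = 0 already satisfies the threshold
    cumulative = 0.0
    for Z, lam in enumerate(eigvals, start=1):
        cumulative += lam
        if cumulative >= alpha * total:
            return Z
    return len(eigvals)
```

Note that α = 0 yields K = 0, in agreement with Corollary 1 below.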
The quotient between λ_j^p(t) and the trace of R_j(t) is the amount of variance explained by the pth principal direction of R_j(t). So we can see the parameter α as the amount of variance that we want the neurons to explain. Hence, we select K_j(t) so that the amount of variance explained by the directions associated with the K_j(t) largest eigenvalues is at least α.

2.3.2 Laplace-Minka Method.

This method has been proposed by Minka (2001) to choose the number of principal components to be retained when PCA is used to analyze an input distribution. It interprets PCA as density estimation, and Bayesian model selection is used to estimate the intrinsic dimensionality of the data. The procedure is based on Laplace's method for approximating integrals, which allows us to obtain the desired estimation with a small computational effort. The probability density of the input data {x_i, i = 1, …, M} corresponding to a neuron j, given an intrinsic dimensionality r, is computed as follows:

$$p(\{x_i\} \mid r) \approx p(U) \left( \prod_{q=1}^{r} \lambda_j^q \right)^{-M/2} \hat{v}^{-M(D-r)/2}\, (2\pi)^{(m+r)/2}\, |A_Z|^{-1/2}\, M^{-r/2}, \quad (2.26)$$
where λ_j^p are the eigenvalues of the covariance matrix R_j sorted in decreasing order, U is an orthogonal matrix obtained from a decomposition of the matrix of the linear transformation defined by PCA, v̂ is the estimated noise
variance, and A_Z is a Hessian matrix associated with a parameterization of U:

$$p(U) = 2^{-r} \prod_{q=1}^{r} \Gamma\!\left( \frac{D-q+1}{2} \right) \pi^{-(D-q+1)/2} \quad (2.27)$$

$$\hat{v} = \frac{1}{D-r} \sum_{q=r+1}^{D} \lambda_j^q \quad (2.28)$$

$$|A_Z| = \prod_{q=1}^{r} \prod_{s=q+1}^{D} \left( (\hat{\lambda}_j^s)^{-1} - (\hat{\lambda}_j^q)^{-1} \right) \left( \lambda_j^q - \lambda_j^s \right) M \quad (2.29)$$

$$\hat{\lambda}_j^q = \begin{cases} \lambda_j^q & \text{if } 1 \le q \le r \\ \hat{v} & \text{if } r < q \le D \end{cases} \quad (2.30)$$
The eigenvalues in equation 2.30 are sorted in decreasing order. The estimated intrinsic dimension for neuron j is that with the largest density (Bayesian model selection):

$$K_j = \arg\max_{r = 1, \ldots, D-1} \{ p(\{x_i\} \mid r) \}. \quad (2.31)$$
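For illustration, our reading of equations 2.26 to 2.31 can be sketched as follows, working in the log domain for numerical stability. This is a hedged sketch, not the authors' implementation: the parameter count m of U is our assumption taken from Minka (2001), and strictly decreasing eigenvalues are required to keep every logarithm finite.

```python
import numpy as np
from math import lgamma, log, pi

def minka_dimension(eigvals, M):
    """Estimate the intrinsic dimension of a cluster from the eigenvalues
    of R_j (sorted in decreasing order) and the sample count M, following
    our reading of equations 2.26 to 2.31, in the log domain."""
    lam = np.asarray(eigvals, dtype=float)
    D = len(lam)
    best_r, best_logp = 1, -np.inf
    for r in range(1, D):                     # equation 2.31: r = 1..D-1
        v = lam[r:].mean()                    # equation 2.28
        lam_hat = np.concatenate([lam[:r], np.full(D - r, v)])  # eq 2.30
        # Equation 2.27: log p(U).
        log_pU = -r * log(2.0) + sum(
            lgamma((D - q + 1) / 2.0) - (D - q + 1) / 2.0 * log(pi)
            for q in range(1, r + 1))
        # Equation 2.29: log |A_Z|, summing the log of each factor.
        log_AZ = sum(
            log((1.0 / lam_hat[s] - 1.0 / lam_hat[q])
                * (lam[q] - lam[s]) * M)
            for q in range(r) for s in range(q + 1, D))
        # m: number of free parameters of U (our assumption, Minka 2001).
        m = D * r - r * (r + 1) // 2
        # Equation 2.26 in log form.
        logp = (log_pU
                - M / 2.0 * np.sum(np.log(lam[:r]))
                - M * (D - r) / 2.0 * log(v)
                + (m + r) / 2.0 * log(2 * pi)
                - 0.5 * log_AZ
                - r / 2.0 * log(M))
        if logp > best_logp:
            best_r, best_logp = r, logp
    return best_r
```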
The method is explained in more detail by Minka (2001). It has the advantage that there are no parameters to adjust, unlike the α parameter of the explained variance method.

2.4 Network Initialization.

In order to complete the description of the learning process, the initialization of the system must be given, that is, the choice of the matrix R_j(0) and the vector e_j(0) for every unit j. In most neural networks, the initial weights are randomly chosen around a "zero" or "neutral" situation. Here we follow this approach. The components of the vector e_j(0) can be built by using small random values, either negative or positive. Building the matrix R_j(0) is more complex. First, it must be a symmetric nonnegative matrix in order to be the covariance matrix of a real-valued vector. Furthermore, it should be similar to a "neutral" covariance matrix. The null matrix cannot be used as a covariance matrix, so we choose the unit matrix as a reference. Note that the eigenvectors of this matrix form the canonical vector basis, and all the eigenvalues are equal to one. This vector basis is unbiased in the sense that it spans the whole space uniformly. The equality of all eigenvalues means that all the eigenvectors have the same importance in terms of PCA. Hence, we select R_j(0) by adding small, positive random values to the elements of the main diagonal and below the main diagonal of the unit matrix. The other elements are fixed by the symmetry condition.
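A sketch of this initialization (the function name and the size of the random values are our choices):

```python
import numpy as np

def init_unit(D, rng, scale=0.01):
    """Section 2.4 initialization (our reading): e_j(0) has small random
    components of either sign; R_j(0) is the unit matrix plus small,
    positive random values on and below the main diagonal, mirrored
    above the diagonal so that the matrix is symmetric."""
    e0 = rng.uniform(-scale, scale, size=D)
    lower = np.tril(rng.uniform(0.0, scale, size=(D, D)))
    R0 = np.eye(D) + lower + np.tril(lower, -1).T
    return e0, R0
```

For a small scale, the eigenvalues of R_j(0) stay close to one, as the "neutral" reference requires.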
2.5 Summary.

The algorithm that summarizes this model can be stated as follows:

1. For every neuron j, obtain the initial covariance matrix R_j(0) by generating a random symmetric nonnegative matrix, with the elements of the main diagonal near to one and all the others near to zero.

2. For every neuron j, build the vector e_j(0) by using small, random values, either negative or positive, for its components.

3. At time instant t, select the input vectors x_i(t), i = 1, …, N, with N ≥ 1, from the input distribution. Compute the winning neuron c according to equation 2.11.

4. Update the winning neuron c, and leave the rest of the neurons unchanged, by using the following equations:

$$e_c(t+1) = e_c(t) + \eta_e(t)\, \frac{1}{N} \sum_{i=1}^{N} (x_i - e_c(t)) \quad (2.32)$$

$$A(t+1) = \frac{1}{N} \sum_{i=1}^{N} (x_i - e_c(t+1))(x_i - e_c(t+1))^T \quad (2.33)$$

$$R_c(t+1) = R_c(t) + \eta_R(t)\, [A(t+1) - R_c(t)] \quad (2.34)$$

$$e_j(t+1) = e_j(t) \quad \forall j \ne c \quad (2.35)$$

$$R_j(t+1) = R_j(t) \quad \forall j \ne c. \quad (2.36)$$

The number of basis vectors of the winning neuron, K_c, must also be updated by either the explained variance method, equation 2.25, or the Laplace-Minka method, equation 2.31.

5. If convergence has been reached, stop the simulation. Otherwise, go to step 3.

As the model we have presented uses both PCA techniques and a competitive learning procedure, we call it principal components analysis competitive learning (PCACL).

3 Properties

Here we prove three properties of the PCACL model. First, we prove that the maximum number of basis vectors per base can be controlled by adjusting the parameter α when the explained variance method is used. If the value of α is low, the number of basis vectors will be small, and vice versa. Next, we show that the PCACL model with the explained variance method is a formal extension of classical competitive learning. And finally, we prove that there exists a strictly positive upper bound for the learning rate for the covariance matrix η_R such that the number of basis vectors K_j does not change
by more than 1 in each time step, when the explained variance approach is used. This means that the amount of variation in K_j can be controlled with precision by adjusting the learning rate η_R.

Proposition 1. Let z ∈ {0, 1, …, D}. If we use the explained variance method with α ∈ [0, z/D], then for every neuron j, it holds that 0 ≤ K_j(t) ≤ z.

Proof. As the eigenvalues are sorted in decreasing order, it holds that

$$\forall p \in \{1, \ldots, z\}\ \forall q \in \{z+1, \ldots, D\}, \quad \lambda_j^p(t) \ge \lambda_j^q(t) \;\Rightarrow\; (D - z) \sum_{p=1}^{z} \lambda_j^p(t) \ge z \sum_{q=z+1}^{D} \lambda_j^q(t). \quad (3.1)$$

Next we add $z \sum_{p=1}^{z} \lambda_j^p(t)$ to both sides:

$$D \sum_{p=1}^{z} \lambda_j^p(t) \ge z \sum_{q=1}^{D} \lambda_j^q(t) \;\Rightarrow\; \sum_{p=1}^{z} \lambda_j^p(t) \ge \frac{z}{D} \sum_{q=1}^{D} \lambda_j^q(t). \quad (3.2)$$

By the hypotheses, we have that α ∈ [0, z/D], so

$$\sum_{p=1}^{z} \lambda_j^p(t) \ge \alpha \sum_{q=1}^{D} \lambda_j^q(t). \quad (3.3)$$

Then, by equations 2.22, 2.25, and 3.3, we have

$$0 \le K_j(t) \le z. \quad (3.4)$$
Corollary 1. If we take α = 0, the PCACL model with the explained variance method reduces to classical competitive learning (CL).

Proof. We use z = 0 in the previous proposition and get α = 0 ⇒ K_j(t) = 0 for every neuron j. Then the covariance matrices are not used in the competitive phase, so it reduces to that of CL. The estimated mean vectors e_j(t) play the role of the weight vectors of CL, and the learning rate for the mean, η_e, corresponds to the learning rate of CL.

Theorem 1. For each time instant t and each neuron j, there is a constant ξ such that if the learning rate for the covariance matrix η_R(t) satisfies

$$\eta_R(t) < \frac{\lambda_{K_j(t)-1} + \xi}{\left| (K_j(t) - 2)\, \kappa(X)\, \| A_j(t) - R_j(t) \| - \alpha\, \mathrm{trace}(A_j(t) - R_j(t)) \right|}, \quad (3.5)$$
where λ_i is the ith largest eigenvalue of R_j(t) and κ(X) is the condition number of an eigenvector matrix X of R_j(t), then it holds that

$$K_j(t+1) \ge K_j(t) - 1, \quad (3.6)$$

that is, the number of basis vectors of neuron j does not decrease by more than 1.

Proof. If j is not the winning neuron, K_j(t+1) = K_j(t), and the theorem holds. Hence, we must focus on the case where j is the winning neuron. In this case, we have

$$R_j(t+1) = R_j(t) + \eta_R(t)\, (A_j(t) - R_j(t)). \quad (3.7)$$

Here, η_R(t)(A_j(t) − R_j(t)) is a perturbation to the matrix R_j(t), and R_j(t+1) is the perturbed matrix. Let λ_i be the ith largest eigenvalue of the original matrix R_j(t) and λ_i′ be the ith largest eigenvalue of the perturbed matrix R_j(t+1). By the Bauer-Fike theorem (Bauer & Fike, 1960), we know that

$$|\lambda_i' - \lambda_i| \le \kappa(X)\, \| \eta_R(t)\, (A_j(t) - R_j(t)) \| = \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \|, \quad (3.8)$$

where κ(X) = ‖X‖ ‖X⁻¹‖ is the condition number of an eigenvector matrix X of R_j(t). From the explained variance method, the condition K_j(t+1) ≥ K_j(t) − 1 is ensured if

$$\sum_{i=1}^{K_j(t)-2} \lambda_i' < \alpha\, \mathrm{trace}(R_j(t+1)). \quad (3.9)$$

By application of equation 3.8, condition 3.9 will be fulfilled if

$$(K_j(t) - 2)\, \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \| + \sum_{i=1}^{K_j(t)-2} \lambda_i < \alpha\, \mathrm{trace}(R_j(t+1)). \quad (3.10)$$

Next we apply equation 3.7 to 3.10, and we obtain

$$(K_j(t) - 2)\, \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \| + \sum_{i=1}^{K_j(t)-2} \lambda_i < \alpha\, \mathrm{trace}(R_j(t)) + \alpha\, \eta_R(t)\, \mathrm{trace}(A_j(t) - R_j(t)). \quad (3.11)$$

From the explained variance method, we know that

$$\alpha\, \mathrm{trace}(R_j(t)) = \xi + \sum_{i=1}^{K_j(t)-1} \lambda_i, \qquad 0 < \xi \le \lambda_{K_j(t)}. \quad (3.12)$$
We use equation 3.12 to further rewrite 3.11:

$$(K_j(t) - 2)\, \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \| + \sum_{i=1}^{K_j(t)-2} \lambda_i < \xi + \sum_{i=1}^{K_j(t)-1} \lambda_i + \alpha\, \eta_R(t)\, \mathrm{trace}(A_j(t) - R_j(t)). \quad (3.13)$$

Next we simplify and rearrange equation 3.13 to obtain

$$\eta_R(t) \left( (K_j(t) - 2)\, \kappa(X)\, \| A_j(t) - R_j(t) \| - \alpha\, \mathrm{trace}(A_j(t) - R_j(t)) \right) < \xi + \lambda_{K_j(t)-1}. \quad (3.14)$$

Finally, we express equation 3.14 as an upper bound for η_R(t):

$$\eta_R(t) < \frac{\lambda_{K_j(t)-1} + \xi}{\left| (K_j(t) - 2)\, \kappa(X)\, \| A_j(t) - R_j(t) \| - \alpha\, \mathrm{trace}(A_j(t) - R_j(t)) \right|}. \quad (3.15)$$
Theorem 2. For each time instant t and each neuron j, there is a constant ψ such that if the learning rate for the covariance matrix η_R(t) satisfies

$$\eta_R(t) \le \frac{\lambda_{K_j(t)+1} + \psi}{\left| (K_j(t) + 1)\, \kappa(X)\, \| A_j(t) - R_j(t) \| + \alpha\, \mathrm{trace}(A_j(t) - R_j(t)) \right|}, \quad (3.16)$$
where λ_i is the ith largest eigenvalue of R_j(t) and κ(X) is the condition number of an eigenvector matrix X of R_j(t), then it holds that

$$K_j(t+1) \le K_j(t) + 1, \quad (3.17)$$

that is, the number of basis vectors of neuron j does not increase by more than 1.

Proof. If j is not the winning neuron, K_j(t+1) = K_j(t), and the theorem holds. Hence, we must focus on the case where j is the winning neuron. In this case, we have

$$R_j(t+1) = R_j(t) + \eta_R(t)\, (A_j(t) - R_j(t)). \quad (3.18)$$

Here, η_R(t)(A_j(t) − R_j(t)) is a perturbation to the matrix R_j(t), and R_j(t+1) is the perturbed matrix. Let λ_i be the ith largest eigenvalue of the original matrix R_j(t) and λ_i′ be the ith largest eigenvalue of the perturbed matrix R_j(t+1). By the Bauer-Fike theorem (Bauer & Fike, 1960), we know that

$$|\lambda_i' - \lambda_i| \le \kappa(X)\, \| \eta_R(t)\, (A_j(t) - R_j(t)) \| = \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \|, \quad (3.19)$$
where κ(X) = ‖X‖ ‖X⁻¹‖ is the condition number of an eigenvector matrix X of R_j(t). From the explained variance method, the condition K_j(t+1) ≤ K_j(t) + 1 is ensured if

$$\sum_{i=1}^{K_j(t)+1} \lambda_i' \ge \alpha\, \mathrm{trace}(R_j(t+1)). \quad (3.20)$$

By application of equation 3.19, condition 3.20 will be fulfilled if

$$\sum_{i=1}^{K_j(t)+1} \lambda_i \ge \alpha\, \mathrm{trace}(R_j(t+1)) + (K_j(t) + 1)\, \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \|. \quad (3.21)$$

Next, we apply equation 3.18 to 3.21, and we obtain

$$\sum_{i=1}^{K_j(t)+1} \lambda_i \ge (K_j(t) + 1)\, \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \| + \alpha\, \mathrm{trace}(R_j(t)) + \alpha\, \eta_R(t)\, \mathrm{trace}(A_j(t) - R_j(t)). \quad (3.22)$$

From the explained variance method, we know that

$$\alpha\, \mathrm{trace}(R_j(t)) = -\psi + \sum_{i=1}^{K_j(t)} \lambda_i, \qquad 0 \le \psi < \lambda_{K_j(t)}. \quad (3.23)$$

We use equation 3.23 to further rewrite 3.22:

$$\sum_{i=1}^{K_j(t)+1} \lambda_i \ge (K_j(t) + 1)\, \kappa(X)\, \eta_R(t)\, \| A_j(t) - R_j(t) \| - \psi + \sum_{i=1}^{K_j(t)} \lambda_i + \alpha\, \eta_R(t)\, \mathrm{trace}(A_j(t) - R_j(t)). \quad (3.24)$$

Next, we simplify and rearrange equation 3.24 to obtain

$$\eta_R(t) \left( (K_j(t) + 1)\, \kappa(X)\, \| A_j(t) - R_j(t) \| + \alpha\, \mathrm{trace}(A_j(t) - R_j(t)) \right) \le \lambda_{K_j(t)+1} + \psi. \quad (3.25)$$

Finally, we express equation 3.25 as an upper bound for η_R(t):

$$\eta_R(t) \le \frac{\lambda_{K_j(t)+1} + \psi}{\left| (K_j(t) + 1)\, \kappa(X)\, \| A_j(t) - R_j(t) \| + \alpha\, \mathrm{trace}(A_j(t) - R_j(t)) \right|}. \quad (3.26)$$
Corollary 2. For each time instant t and each neuron j, there is a strictly positive upper bound for η_R(t) so that |K_j(t+1) − K_j(t)| ≤ 1.

Proof. The desired upper bound is the minimum of the upper bounds presented in the previous two theorems. Since both upper bounds are strictly positive, the obtained upper bound is also strictly positive.

4 Discussion

The PCACL algorithm is quite different from ASSOM and known local PCA methods. The PCACL model has the following advantages:

• It is solidly rooted in statistics (KL transform, PCA), because the model equations have been derived directly from statistical considerations.

• It is a generalization of competitive learning, as proven in the previous section.

• It allows any number of inputs to be presented to the network at every computation step. This means that there is no need to present the entire data set to the network, as in local PCA methods like the one proposed by Kambhatla and Leen (1997). Other models such as the ASSOM network need to group the input samples into homogeneous episodes, which prevents their use when no grouping information is available. In contrast, PCACL is able to learn from one input sample per step while supporting batch mode as an optional feature.

• The neurons are able to adapt to the local dimensionality of the data. Neither the ASSOM nor many local PCA methods have this capability (Kambhatla & Leen, 1997; Kohonen, 1995, 1996). These methods rely on the user's experience and knowledge to select the appropriate number of basis vectors, which is the same for all neurons. This leads to severe problems if the local dimensionality of the data varies significantly from one region of the input space to another. Hence, we may expect the PCACL to achieve better dimensionality reduction than those methods because of its flexibility.

• It does not assume that the input follows any particular probability distribution.
This is not the case with probabilistic PCA approaches, where typically a mixture of multivariate gaussian distributions is assumed a priori (Tipping & Bishop, 1999; Roweis & Ghahramani, 1999). Hence, our approach may be particularly useful if the input distribution is unknown or is suspected to have nongaussian clusters.

5 Experimental Results

5.1 Batch Learning Versus Online Learning.

This experiment shows the differences between online (N = 1) and batch learning (N > 1). Other
neural models like the ASSOM are based on the assumption that the input samples are grouped into meaningful episodes prior to presentation to the network. Our aim here is to show that the PCACL overcomes this limitation by allowing a single input sample per computation step, while batch mode can be used when grouping information is available.

In order to be able to plot the results, which can be seen in Figure 1, we have used D = 2. We have used a mixture of five gaussians as the input distribution. In all cases, we have drawn 500 samples from every gaussian distribution; that is, every cluster has 500 samples. These clusters have been selected so that some of them share the same orientation. Furthermore, some of the clusters are very close; in fact, there are some cases of nonseparable clusters. We have used elongated clusters so that the neurons could learn their directions, and clusters with different elongations appear in the same distribution. The clusters have been marked with gray ellipses. Note that the gaussian distributions used do not end sharply, unlike the ellipses plotted. For each neuron, the estimated mean has been marked with an open circle, and the first principal direction has been plotted as a straight line passing through the estimated mean.

Figure 1: Results with the three learning strategies: no grouping (left), meaningful groups (right), and meaningless groups (lower).

In all the experiments, we have used PCACL networks with five neurons. We have considered a linear decay of the learning rates, with initial values η_R(0) = η_e(0) = 1 and final values η_R(t_final) = η_e(t_final) = 0. The explained variance method has been selected, always with α = 0.5. We have tested three approaches for the learning mode:

1. No input sample grouping (N = 1). We ran 20,000 iterations in this case.

2. Grouped samples (N = 10), where the samples of each group are drawn from the same gaussian, so the groups are meaningful. We ran 5000 iterations (each with 10 samples) in this approach.

3. Grouped samples (N = 10), where the samples of each group are drawn from the overall distribution, so the groups are meaningless (samples from different clusters are present in the same group). We ran 5000 iterations (each with 10 samples) in this approach.

We can see in Figure 1 that for the first two approaches, every neuron finds a cluster, with its estimated mean in the centroid of the cluster and the principal component direction along the orientation of the cluster. This means that these approaches have been successful in adapting to the structure of the input distribution. On the other hand, the meaningless-groups approach shows no adaptation to the structure of the input, and the neurons are randomly placed, with their estimated means around the global mean. This is because a group is not associated with a single cluster, so the winning neuron is unable to adapt to a specific feature of the input distribution. We conclude that no attempt to group the input data should be made unless a reliable criterion to perform this grouping is known in advance. Since this is not the case in most applications, we have chosen not to group the samples in the experiments that follow.

5.2 Dimensionality Reduction Experiments.

We have designed a set of experiments to test the dimensionality-reduction performance of the PCACL model.
We have selected the local PCA approach of Kambhatla and Leen (1997) and the probabilistic PCA method of Tipping and Bishop (1999) for comparison with our method. These two PCA networks use a fixed number of basis vectors K for all neurons. Both aim to reduce the projection error, as does the PCACL model. Hence, we use the mean normalized squared projection error (MNSPE) to measure the performance of the three systems:

$$\mathrm{MNSPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\| \tilde{x}_i \|^2}{\| x_i \|^2}. \quad (5.1)$$
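Equation 5.1 is straightforward to compute once the projection residuals x̃_i are available; a sketch with names of our choosing:

```python
import numpy as np

def mnspe(X, residuals):
    """Equation 5.1: mean normalized squared projection error, where
    `residuals` holds the projection error x-tilde of each sample
    (same shape as X, one sample per row)."""
    num = np.sum(residuals ** 2, axis=1)
    den = np.sum(X ** 2, axis=1)
    return float(np.mean(num / den))
```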
In order to reduce the dimensionality of the input distribution, the projection error minimization must be achieved with the minimum possible number of basis vectors, so we are interested in the relation between these two values.
Our experiments have used a freely accessible data set from the NASA Earth Observatory (2002). This data set is made up of images taken every month by satellites, which represent the values of several atmospheric parameters over the surface of the planet. We selected six months (January 1988–June 1988) and 14 parameters for each month, so we have 14 images per month, each with 360 × 180 pixels. These 14 images are combined to form a single multisensor image, where every pixel is a vector with 14 components. The values of the components are real numbers in the interval [0, 1]. Each of the vectors is an input sample. Every experiment starts by presenting the input samples of a month to a network. When the training is finished, the projection error is computed for all the input samples of the month. Then these projection errors are used to compute the mean error of the month. Finally, the mean errors from the six months are averaged to yield the final mean error. The parameters of the local PCA network were selected as follows. We ran 20 iterations of the method, that is, we presented the data of the considered month 20 times to the network. Because the algorithm needs the number of basis vectors K to be specified, we carried out experiments with K = 1, 2, 3, 4, 6, and 8. All of these values of K were used with 2, 4, 8, and 16 neurons. The same setup was used with the probabilistic PCA network, but with 10 iterations, because further computations gave no better results. For the PCACL network with the explained variance approach, we used different values of the parameter α: 0.5, 0.6, 0.7, 0.8, and 0.9. All of these values of α were tested with the same numbers of neurons as local PCA. We used a linear decay of the learning rates, with initial values ηR (0) = ηe (0) = 1 and final values ηR (t f inal ) = ηe (t f inal ) = 0. The input samples were presented to the network one at a time (i.e., N = 1). 
In order to train the network adequately, every input sample was presented twice. The same parameter selections were used with the Laplace-Minka method, except for the absence of the parameter α. The plots of the error MNSPE versus the number of basis vectors K are shown in Figures 2 and 3. The number of basis vectors K was averaged over all the neurons (this is why fractional values appear). Note that for the PCACL network (explained variance approach), the results corresponding to lower α appear to the left, because a smaller amount of variance can be explained with fewer basis vectors. We can see that the PCACL network with the explained variance approach performs better than local PCA and probabilistic PCA in all the tests, except with two neurons, where the PCACL with explained variance and local PCA algorithms have equal performance, and probabilistic PCA is better than both of them. The PCACL network with the Laplace-Minka method is the best in all cases. This method always selects the simplest model (K = 1), since the image data have low variability. Note that points nearer to the coordinate origin are better. The required CPU time is shown in Tables 1, 2, and 3. It can be seen that both PCACL methods are faster than the local PCA approach, and the
Figure 2: MNSPE with two neurons (upper) and four neurons (lower). Local PCA = triangles. PCACL (Minka) = square. PCACL (explained variance) = circles. PPCA = diamonds.

Table 1: Average CPU Time Used to Train the PCACL Network, in Seconds.

Number of Neurons   α = 0.5   α = 0.6   α = 0.7   α = 0.8   α = 0.9   Minka
 2                    1329      1494      1637      1598      1856     1096
 4                    1663      1055      1718      1678      2346     1315
 8                    2934      3405      2154      2122      2341     1762
16                    3498      3490      3146      3123      3393     2668
Table 2: Average CPU Time Used to Train the Local PCA Network, in Seconds.

Number of Neurons   K = 1    K = 2    K = 3    K = 4    K = 6    K = 8
 2                   6070     7762     8822     7928     9063     9325
 4                   7520     8358     8695    10,172   11,625   13,166
 8                 10,347   12,304   14,686   16,336   19,058   21,875
16                 18,343   22,056   26,096   33,357   35,891   41,261
Figure 3: MNSPE with 8 neurons (upper) and 16 neurons (lower). Local PCA = triangles. PCACL (Minka) = square. PCACL (explained variance) = circles. PPCA = diamonds.

Table 3: Average CPU Time Used to Train the PPCA Network, in Seconds.

Number of Neurons   K = 1   K = 2   K = 3   K = 4   K = 6   K = 8
 2                    967    1007    1051    1049    1074    1092
 4                   1844    1928    2090    2106    2088    2119
 8                   3458    3636    3852    3940    3993    4079
16                   8270    8394    9084    9142    9420    9122
probabilistic PCA is the fastest (at the expense of very poor projection error results when the number of neurons increases, as we have seen).

5.3 Intrinsic Dimensionality Estimation Experiments.

We designed a second set of experiments to show the capability of the PCACL to estimate the intrinsic dimensionality of the data. In this case, no comparison with the Kambhatla and Leen and Tipping and Bishop approaches is possible, because those approaches do not have this capability. The database selected for our experiments is the A data set of the Santa Fe time series competition (Hübner, Weiss, Abraham, & Tang, 1994).
Figure 4: Intrinsic dimensionality estimation results with PCACL (16 neurons).
This is a real-world time series generated by a chaotic system implemented with lasers. It has been used as a benchmark for intrinsic dimensionality estimators (Camastra & Vinciarelli, 2002). Given the time series s(t), t = 1, …, T, the input vectors for the PCACL network were built as follows:

$$x_t = (s(t), s(t+1), \ldots, s(t+D-1)), \qquad t = 1, \ldots, T - D + 1. \quad (5.2)$$
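The delay embedding of equation 5.2 can be sketched as follows (the function name is ours):

```python
import numpy as np

def delay_vectors(s, D):
    """Equation 5.2: build the delay-embedding vectors
    x_t = (s(t), ..., s(t + D - 1)) from a scalar time series s."""
    T = len(s)
    return np.array([s[t:t + D] for t in range(T - D + 1)])
```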
If D is adequately large, it is possible to measure the dimensionality of this vector set to infer the dimension of the attractor of the system that generated the time series. This technique is known as the method of delays (Ott, 1993; Kaplan & Glass, 1995). Since the fractal dimension of this attractor is approximately 2.06, the intrinsic dimensionality of the vector set must be near 2.

We formed different input vector sets by choosing D = 5, 10, 15, 20, 25, and 30. For each of these data sets, we trained PCACL networks with the explained variance method (α = 0.5, 0.6, 0.7, 0.8, and 0.9) and the Laplace-Minka method. The rest of the parameters were the same as in the dimensionality-reduction experiment: we used a linear decay of the learning rates, with initial values η_R(0) = η_e(0) = 1 and final values η_R(t_final) = η_e(t_final) = 0, the input samples were presented to the network one at a time (i.e., N = 1), and every input sample was presented twice.

Figures 4 to 6 show the results of these experiments. We can see that the explained variance method detects more dimensionality (i.e., K is larger) as the parameter α grows. This is because more basis vectors are needed to explain more variance of the input data. The detected dimensionality K also grows with the input dimension D, because data with more dimensions are more difficult to represent if a fixed amount of variance is to be explained. Nevertheless, the obtained intrinsic dimensionality K is always near the true value (K = 2). Note that K is smaller when more neurons are used, because each neuron has to represent fewer input vectors, and this
Figure 5: Intrinsic dimensionality estimation results with PCACL (32 neurons).
Figure 6: Intrinsic dimensionality estimation results with PCACL (64 neurons).
means lower variance. If the fractal dimension were to be estimated precisely, a bootstrapping technique could be used (Camastra & Vinciarelli, 2002). This method would learn the relation between the number of basis vectors and the fractal dimension of the input distribution. With respect to the Laplace-Minka method, we can see that its automatic choice of the best model for K yields better results as D grows, as suggested by the method of delays. The best results are obtained with more neurons, because the intrinsic dimensionality is a local property, and the subset of the input space represented by each neuron becomes smaller as the number of neurons grows.

6 Conclusion

We have presented a new neural network that combines competitive learning and PCA. It represents an improvement over known local PCA methods,
because it does not require presenting the entire data set to the network at a time. This allows fast execution while retaining the dimensionality-reduction properties of PCA. Furthermore, every neuron is able to modify its behavior to adapt to the local dimensionality of the input distribution. Hence, our model has a dimensionality-estimation capability and greater plasticity. This feature can be controlled by a parameter that specifies the desired amount of explained variance. Alternatively, an automatic dimensionality-estimation method can be used. The experimental results presented show the dimensionality-reduction performance of the model with multisensor images, its advantages with respect to well-known local and probabilistic PCA networks, and its intrinsic dimensionality-estimation capability.

References

Ahalt, S. C., Krishnamurthy, A. K., Chen, P., & Melton, D. E. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277–290.

Bauer, F., & Fike, C. (1960). Norms and exclusion theorems. Numer. Math., 2, 137–141.

Camastra, F., & Vinciarelli, A. (2002). Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(10), 1404–1407.

Fukunaga, K., & Olsen, D. R. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Computers, 20(2), 176–183.

Hübner, U., Weiss, C. O., Abraham, N. B., & Tang, D. (1994). Lorenz-like chaos in NH3-FIR lasers. In A. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 73–104). Reading, MA: Addison-Wesley.

Jolliffe, I. T. (1986). Principal component analysis. Berlin: Springer-Verlag.

Kambhatla, N., & Leen, T. K. (1997). Dimension reduction by local principal component analysis. Neural Computation, 9(7), 1493–1516.

Kaplan, D., & Glass, L. (1995). Understanding nonlinear dynamics. New York: Springer-Verlag.

Kendall, M. (1975). Multivariate analysis. London: Charles Griffin & Co.

Kohonen, T. (1990). The self-organizing map. Proc. IEEE, 78, 1464–1480.

Kohonen, T. (1995). The adaptive-subspace SOM (ASSOM) and its use for the implementation of invariant feature detection. In F. Fogelman-Soulié & P. Gallinari (Eds.), Proc. ICANN'95, International Conference on Artificial Neural Networks, 1 (pp. 3–10). Paris: EC2 et Cie.

Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace SOM. Biological Cybernetics, 75, 281–291.

Minka, T. P. (2001). Automatic choice of dimensionality for PCA. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 598–604). Cambridge, MA: MIT Press.

NASA Earth Observatory. (2002). Data and images. Available online at: http://earthobservatory.nasa.gov/Observatory/datasets.html.

Ott, E. (1993). Chaos in dynamical systems. Cambridge: Cambridge University Press.

Pal, N. R., Bezdek, J. C., & Tsao, E. C. (1993). Generalized clustering networks and Kohonen's self-organizing scheme. IEEE Trans. Neural Networks, 4(4), 549–557.

Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11, 305–345.

Saghri, J. A., Tescher, A. G., & Reagan, J. T. (1995, January). Practical transform coding of multispectral imagery. IEEE Signal Processing Magazine, 32–43.

Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11, 443–482.

Tretter, D., & Bouman, C. A. (1995). Optimal transforms for multispectral and multilayer image coding. IEEE Trans. Image Processing, 4(3), 296–308.

Uchiyama, T., & Arbib, M. A. (1994). Color image segmentation using competitive learning. IEEE Trans. Pattern Analysis and Machine Intelligence, 16(2), 1197–1206.

Ueda, N., & Nakano, R. (1994). A new competitive learning approach based on an equidistortion principle for designing optimal vector quantizers. Neural Networks, 7(8), 1211–1227.

Verveer, P. J., & Duin, R. P. W. (1995). An evaluation of intrinsic dimensionality estimators. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(1), 81–86.

Xu, L., Krzyzak, A., & Oja, E. (1993). Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. Neural Networks, 4(4), 636–649.

Yair, E. K., Zeger, K., & Gersho, A. (1992). Competitive learning and soft competition for vector quantizer design. IEEE Trans. Signal Processing, 40(2), 294–308.

Received August 8, 2003; accepted April 5, 2004.
LETTER
Communicated by Joachim Buhmann
How Many Clusters? An Information-Theoretic Perspective

Susanne Still
[email protected]

William Bialek
[email protected]

Department of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, U.S.A.
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises due to sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.

Neural Computation 16, 2483–2506 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Much of our intuition about the world around us involves the idea of clustering: many different acoustic waveforms correspond to the same syllable, many different images correspond to the same object, and so on. It is plausible that a mathematically precise notion of clustering in the space of sensory data may approximate the problems solved by our brains. Clustering methods are also used in many different scientific domains as a practical tool to evaluate structure in complex data. Interest in clustering has increased recently because of new areas of application, such as data mining, image and speech processing, and bioinformatics. In particular, many groups have used clustering methods to analyze the results of genome-wide expression experiments, hoping to discover genes with related functions as members of the same cluster (see, e.g., Eisen, Spellman, Brown, & Botstein, 1998). A central issue in these and other applications of clustering is how many clusters provide an appropriate description of the data. The estimation of the true number of classes has been recognized as "one of the most difficult problems in cluster analysis" by Bock (1996), who gives a review of some methods that address the issue.

The goal of clustering is to group data in a meaningful way. This is achieved by optimization of a so-called clustering criterion (an objective function), and a large variety of intuitively reasonable criteria have been used in the literature (a summary is given in Gordon, 1999). Clustering methods include agglomerative procedures such as that described by Ward (1963) and iterative reallocation methods, such as the commonly used K-means algorithm (Lloyd, 1957; MacQueen, 1967), which reduces the sum-of-squares criterion, the average of the within-cluster squared distances. More recently, algorithms with physically inspired criteria were introduced (Blatt, Wiseman, & Domany, 1996; Horn & Gottlieb, 2002). All of these clustering methods have in common that the number of clusters has to be found by another criterion. Often a two-step procedure is performed: the optimal partition is found for a given data set, according to the defined objective function, and then a separate criterion is applied to test the robustness of the results against noise due to finite sample size.
Such procedures include the definition of an intuitively reasonable criterion for the goodness of the classification, as in Tibshirani, Walther, and Hastie (2001), or performing cross-validation (Stone, 1974) and related methods in order to estimate the prediction error and to find the number of clusters that minimizes this error (e.g., Smyth, 2000). Roth, Lange, Braun, and Buhmann (2004) quantify the goodness of the clustering with a resampling approach. It would be attractive if these two steps could be combined in a single principle. In a sense, this is achieved in the probabilistic mixture model approach, but at the cost of assuming that the data can be described by a mixture of Nc multivariate distributions with some parameters that determine their shape. The problem of finding the number of clusters then becomes a statistical model selection problem, with a trade-off between the complexity of the model and the goodness of fit. One approach to model selection is to compute the total probability that models with Nc clusters can give rise to the data; one then finds that phase-space factors associated with the integration over model parameters serve to discriminate against more complex models (Balasubramanian, 1997). This Bayesian approach has been used to determine the number of clusters (Fraley & Raftery, 2002).
From an information-theoretic point of view, clustering is most fundamentally a strategy for lossy data compression: the data are partitioned into groups such that the data could be described in the most efficient way (in terms of bit cost) by appointing a representative to each group. The clustering of the data is achieved by compressing the original data, throwing away information that is not relevant to the analysis. While most classical approaches in statistics give an explicit definition of a similarity measure, in rate distortion theory, we arrive at a notion of similarity through a fidelity criterion implemented by a distortion function (Shannon, 1948). The choice of the distortion function provides an implicit distinction between relevant and irrelevant information in the raw data. The notion of relevance was made explicit by Tishby, Pereira, and Bialek (1999), who defined relevant information as the information that the data provide about an auxiliary variable and performed lossy compression, keeping as much relevant information as possible. This formulation, termed the information bottleneck method (IB), is attractive because the objective function follows only from information-theoretic principles. In particular, this formulation does not require an explicit definition of a measure for similarity or distortion. The trade-off between the complexity of the model, on one hand, and the amount of relevant information it captures, on the other hand, is regulated by a trade-off parameter. For a given problem, the complete range of this trade-off is meaningful, and the structure of the trade-off characterizes the “clusterability” of the data. However, for a finite data set, there should be a maximal value for this trade-off, after which we start to “overfit,” and this issue has not yet been addressed in the context of the IB. 
In related work, Buhmann and Held (2000) derived for a particular class of histogram clustering models a lower bound on the annealing temperature from a bound on the probability of a large deviation between the error made on the training data and the expected error. In this article, we follow the intuition that if a model—which, in this context, is a (probabilistic) partition of the data set—captures information (or structure) in the data, then we should be able to quantify this structure in a way that corrects automatically for the overfitting of finite data sets. Attempts to capture only this “corrected information” will, by definition, not be sensitive to noise. Put another way, if we would separate at the outset real structure from spurious coincidences due to undersampling, then we could fit only the real structure. In the context of information estimation from finite samples, there is a significant literature on the problem, and we argue here that a finite sample correction to information estimates is (in some limits) sufficient to achieve the “one-step” compression and clustering in the sense described above, leading us naturally to a principled method of finding the best clustering that is consistent with a finite data set. We should point out that in general, we are not looking for the “true” number of clusters, but rather for the maximum number of clusters that can be resolved from a finite data set. This number equals the true number only
if there exists a true number of classes and if the data set is large enough to allow us to resolve them.

2 Rate Distortion Theory and the Information Bottleneck Method

If data x ∈ X are chosen from a probability distribution P(x), then a complete description of a single data point requires an average code length equal to the entropy of the distribution, $S(x) = -\sum_x P(x)\log_2 P(x)$ bits. On the other hand, if we assign points to clusters c ∈ {1, 2, ..., Nc}, then we need at most log2(Nc) bits. For $N_c \ll |X|$, we have $\log_2(N_c) \ll S(x)$, and our intuition is that many problems will allow substantial compression at little cost if we assign each x to a cluster c and approximate x by a representative xc.

2.1 Rate Distortion Theory. The cost of approximating the signal x by xc is measured as the expected value of some distortion function, d(x, xc) (Shannon, 1948). This distortion measure can, but need not, be a metric. Lossy compression is achieved by assigning the data to clusters such that the mutual information

$$I(c;x) = \sum_{x,c} P(c|x)P(x)\log_2\frac{P(c|x)}{P(c)} \tag{2.1}$$
is minimized. The minimization is constrained by fixing the expected distortion,

$$\langle d(x, x_c)\rangle = \sum_{x,c} P(c|x)P(x)\, d(x, x_c). \tag{2.2}$$

This leads to the variational problem:

$$\min_{P(c|x)} \Big[ \langle d(x, x_c)\rangle + T\, I(c;x) \Big]. \tag{2.3}$$
The (formal) solution is a Boltzmann distribution,¹

$$P(c|x) = \frac{P(c)}{Z(x;T)} \exp\left[-\frac{1}{\tilde{T}}\, d(x, x_c)\right], \tag{2.4}$$

with the distortion playing the role of energy, and the normalization,

$$Z(x;T) = \sum_c P(c) \exp\left[-\frac{1}{\tilde{T}}\, d(x, x_c)\right], \tag{2.5}$$

¹ $\tilde{T} = T/\ln(2)$, because the information is measured in bits in equation 2.1. P(c) is calculated as $P(c) = \sum_x P(c|x)P(x)$.
playing the role of a partition function (Rose, Gurewitz, & Fox, 1990). The representatives, xc, often simply called cluster centers, are determined by the condition that all of the "forces" within each cluster balance for a test point located at the cluster center,²

$$\sum_x P(x|c)\, \frac{\partial}{\partial x_c}\, d(x, x_c) = 0. \tag{2.6}$$
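Equations 2.4 through 2.6 suggest a simple alternating fixed-point iteration. The following sketch is our own illustration, not code from the article; all names and parameter values are ours, and it commits to the squared distortion d(x, xc) = (x − xc)² on one-dimensional data:

```python
import numpy as np

def soft_cluster(x, p_x, n_clusters, T_tilde, n_iter=100, seed=0):
    """Alternate equation 2.4 (soft Boltzmann assignments) with
    equation 2.6 (center-of-mass representatives) for the squared
    distortion d(x, x_c) = (x - x_c)^2 on a 1-D data set."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=n_clusters, replace=False).astype(float)
    p_c = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        d = (x[:, None] - centers[None, :]) ** 2
        logits = np.log(p_c + 1e-12)[None, :] - d / T_tilde   # equation 2.4
        logits -= logits.max(axis=1, keepdims=True)           # stabilize Z(x; T)
        p_c_given_x = np.exp(logits)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
        p_c = p_c_given_x.T @ p_x                             # footnote 1: P(c)
        p_x_given_c = (p_c_given_x * p_x[:, None]) / (p_c[None, :] + 1e-12)
        centers = p_x_given_c.T @ x                           # equation 2.6
    return centers, p_c_given_x
```

At small T̃ the assignments harden and the update essentially reduces to K-means; raising T̃ softens the cluster memberships, as discussed below.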
Recall that if the distortion measure is the squared distance, d(x, xc) = (x − xc)², then equation 2.6 becomes $x_c = \sum_x x\, P(x|c)$: the cluster center is in fact the center of mass of the points assigned to the cluster. The Lagrange parameter T regulates the trade-off between the detail we keep and the bit cost we are willing to pay. In analogy with statistical mechanics, T is often referred to as the temperature (Rose et al., 1990); it measures the softness of the cluster membership. The deterministic limit (T → 0) is the limit of hard clustering solutions. As we lower T, there are phase transitions among solutions with different numbers of distinct clusters, and if we follow these transitions, we can trace out a curve of ⟨d⟩ versus I(c; x), both evaluated at the minimum. This is the rate distortion curve and is analogous to plotting energy versus (negative) entropy, with temperature varying parametrically along the curve. Crucially, there is no optimal temperature that provides the unique best clustering, and thus there is no optimal number of clusters: more clusters always provide a more detailed description of the original data and hence allow us to achieve smaller average values of the distortion, while the cost of the encoding increases.

2.2 Information Bottleneck Method. The distortion function implicitly selects the features that are relevant for the compression. For many problems, however, we know explicitly what it is that we want to keep information about while compressing the data, but one cannot always construct the distortion function that selects for these relevant features. In the information bottleneck method (Tishby, Pereira, & Bialek, 1999), the relevant information in the data is defined as information about another variable, v ∈ V. Both x and v are random variables, and we assume that we know the distribution of co-occurrences, P(x, v).
We wish to compress x into clusters c such that the relevant information (the information about v) is maximally preserved. This leads directly to the optimization problem,

$$\max_{P(c|x)} \Big[ I(c;v) - T\, I(c;x) \Big]. \tag{2.7}$$
² This condition is not independent of the original variational problem. Optimizing the objective function with respect to xc, we find $\frac{\partial}{\partial x_c}\big[\langle d(x,x_c)\rangle + T I(c;x)\big] = 0 \;\Leftrightarrow\; \frac{\partial}{\partial x_c}\langle d(x,x_c)\rangle = P(c)\sum_x P(x|c)\frac{\partial}{\partial x_c} d(x,x_c) = 0\;\;\forall c \;\Rightarrow$ equation 2.6.
One obtains a solution similar to equation 2.4,

$$P(c|x) = \frac{P(c)}{Z(x;T)} \exp\left[-\frac{1}{\tilde{T}}\, D_{KL}\big[P(v|x)\,\|\,P(v|c)\big]\right], \tag{2.8}$$

in which the Kullback-Leibler divergence,

$$D_{KL}\big[P(v|x)\,\|\,P(v|c)\big] = \sum_v P(v|x) \ln\frac{P(v|x)}{P(v|c)}, \tag{2.9}$$

emerges in the place of the distortion function, providing a notion of similarity between the distributions P(v|x) and P(v|c), where P(v|c) is given by (Tishby et al., 1999)

$$P(v|c) = \frac{1}{P(c)} \sum_x P(v|x)P(c|x)P(x). \tag{2.10}$$

When we plot I(c; v) as a function of I(c; x), both evaluated at the optimum, we obtain a curve similar to the rate distortion curve, the slope of which is given by the trade-off between compression and preservation of relevant information:

$$\frac{\delta I(c;v)}{\delta I(c;x)} = T. \tag{2.11}$$
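Equations 2.8 and 2.10 can be solved by alternating them to self-consistency. The sketch below is our own minimal implementation, not the authors' code; all names are ours, and the multiple random restarts are our choice (they mirror the random reinitializations the authors describe later for the numerical experiments). It assumes a known joint table P(x, v):

```python
import numpy as np

def mutual_info_bits(p_ab):
    """I(a; b) in bits for a joint probability table."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float((p_ab[m] * np.log2(p_ab[m] / (pa @ pb)[m])).sum())

def ib_update(p_xv, p_c_given_x, T):
    """One self-consistent pass of equations 2.8 through 2.10."""
    eps = 1e-12
    p_x = p_xv.sum(axis=1)
    p_v_given_x = p_xv / (p_x[:, None] + eps)
    p_c = p_c_given_x.T @ p_x                                    # footnote 1
    p_v_given_c = (p_c_given_x.T @ p_xv) / (p_c[:, None] + eps)  # equation 2.10
    dkl = (p_v_given_x[:, None, :]
           * (np.log(p_v_given_x[:, None, :] + eps)
              - np.log(p_v_given_c[None, :, :] + eps))).sum(axis=2)  # equation 2.9
    logits = np.log(p_c + eps)[None, :] - dkl / (T / np.log(2))  # equation 2.8
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def information_bottleneck(p_xv, n_clusters, T, n_iter=100, n_restarts=10):
    """Iterate from several random starts; keep the solution with the
    most relevant information I(c; v) (at small T this dominates 2.7)."""
    best, best_icv = None, -np.inf
    for seed in range(n_restarts):
        rng = np.random.default_rng(seed)
        p_c_given_x = rng.dirichlet(np.ones(n_clusters), size=len(p_xv))
        for _ in range(n_iter):
            p_c_given_x = ib_update(p_xv, p_c_given_x, T)
        p_cv = np.einsum('xc,xv->cv', p_c_given_x, p_xv)         # P(c, v)
        icv = mutual_info_bits(p_cv)
        if icv > best_icv:
            best, best_icv = p_c_given_x, icv
    return best
```

At small T the returned assignments P(c|x) are essentially deterministic, which is the hard clustering limit used throughout section 3.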
3 Finite Sample Effects

The formulation above assumes that we know the probability distribution underlying the data, but in practice, we have access to only a finite number of samples, and so there are errors in the estimation of the distribution. These random errors produce a systematic error in the computation of the cost function. The idea here is to compute the error perturbatively and subtract it from the objective function. Optimization with respect to the assignment rule is then by definition insensitive to noise, and we should (for the IB) find a value for the trade-off parameter, T*, at which the relevant information is kept maximally. The compression problem expressed in equation 2.7 gives us the right answer if we evaluate the functional 2.7 at the true distribution P(x, v). But in practice, we do not know P(x, v); instead, we have to use an estimate P̂(x, v) based on a finite data set. We use perturbation theory to compute the systematic error in the cost function that results from the uncertainty in the estimate. We first consider the case that P(x) is known, and we have to estimate only the distribution P(v|x). This is the case in many practical clustering problems, where x is merely an index to the identity of samples, and hence
P(x) is constant, and the real challenge is to estimate P(v|x). In section 5, we discuss the error that comes from uncertainty in P(x) and also what happens when we apply this approach to rate distortion theory. Viewed as a functional of P(c|x), I(c; x) can have errors arising only from uncertainty in estimating P(x). Therefore, if P(x) is known, then there is no bias in I(c; x). We assume for simplicity that v is discrete. Let N be the total number of observations of x and v. For a given x, the (average) number of observations of v is then NP(x). We assume that the estimate P̂(v|x) converges to the true distribution in the limit of large data set size N → ∞. However, for finite N, the estimated distribution will differ from the true distribution, and there is a regime in which N is large enough such that we can approximate (compare Treves & Panzeri, 1995),

$$\hat{P}(v|x) = P(v|x) + \delta P(v|x), \tag{3.1}$$

where we assume that δP(v|x) is some small perturbation and its average over all possible realizations of the data is zero:

$$\langle \delta P(v|x) \rangle = 0. \tag{3.2}$$
Taylor expansion of $I^{emp}(c;v) := I(c;v)\big|_{\hat{P}(v|x)}$ around P(v|x) leads to a systematic error ⟨ΔI(c; v)⟩,

$$\Big\langle I(c;v)\big|_{\hat{P}(v|x) = P(v|x)+\delta P(v|x)} \Big\rangle = I(c;v)\big|_{P(v|x)} + \langle \Delta I(c;v)\rangle, \tag{3.3}$$

where the error, calculated as an average over realizations of the data,

$$\langle \Delta I(c;v)\rangle = \left\langle \sum_{n=1}^{\infty} \frac{1}{n!} \sum_v \sum_{x^{(1)}} \cdots \sum_{x^{(n)}} \frac{\delta^n I(c;v)}{\prod_{k=1}^{n} \delta P(v|x^{(k)})} \prod_{k=1}^{n} \delta P(v|x^{(k)}) \right\rangle, \tag{3.4}$$

with

$$\frac{\delta^n I(c;v)}{\prod_{k=1}^{n} \delta P(v|x^{(k)})} = \frac{(-1)^n (n-2)!}{\ln 2} \left[ \sum_c \frac{\prod_{k=1}^{n} P(c,x^{(k)})}{(P(c,v))^{n-1}} - \frac{\prod_{k=1}^{n} P(x^{(k)})}{(P(v))^{n-1}} \right], \tag{3.5}$$

is given by

$$\langle \Delta I(c;v)\rangle = \frac{1}{\ln 2} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \sum_v \left\langle \sum_c \frac{\big(\sum_x \delta P(v|x) P(c,x)\big)^n}{(P(c,v))^{n-1}} - \frac{\big(\sum_x \delta P(v|x) P(x)\big)^n}{(P(v))^{n-1}} \right\rangle. \tag{3.6}$$
Note that the terms with n = 1 vanish because of equation 3.2 and that the second term in the sum is constant with respect to P(c|x).
Our idea is to subtract this error from the objective function, equation 2.7, and recompute the distribution that maximizes the corrected objective function,

$$\max_{P(c|x)} \left[ I^{emp}(c;v) - T\, I^{emp}(c;x) - \langle \Delta I(c;v)\rangle + \sum_x \mu(x) \sum_c P(c|x) \right]. \tag{3.7}$$

The last constraint ensures normalization, and the optimal assignment rule P(c|x) is now given by

$$P(c|x) = \frac{P(c)}{Z(x;T)} \exp\left[ -\frac{1}{\tilde{T}} \left( D_{KL}\big[P(v|x)\,\|\,P(v|c)\big] - \sum_v \sum_{n=2}^{\infty} (-1)^n \left( P(v|x)\, \frac{\langle(\delta P(v|c))^n\rangle}{n\, (P(v|c))^n} - \frac{\langle \delta P(v|x)\, (\delta P(v|c))^{n-1}\rangle}{(n-1)(P(v|c))^{n-1}} \right) \right) \right], \tag{3.8}$$

which has to be solved self-consistently together with equation 2.10 and

$$\delta P(v|c) := \sum_x \delta P(v|x)\, P(x|c). \tag{3.9}$$
The error ⟨ΔI(c; v)⟩ is calculated in equation 3.6 as an asymptotic expansion, and we are assuming that N is large enough to ensure that δP(v|x) is small ∀(x, v). Let us thus concentrate on the term of leading order in δP(v|x), which is given by (disregarding the term that does not depend on P(c|x))³

$$\langle \Delta I(c;v)\rangle^{(2)} \simeq \frac{1}{2\ln(2)\,N} \sum_{v,c} \frac{\sum_x [P(c|x)]^2\, P(x|v)}{\sum_x P(c|x)\, P(x|v)}, \tag{3.10}$$

where we have made use of equation 2.10 and the approximation (for counting statistics)

$$\langle \delta P(v|x)\, \delta P(v|x')\rangle \simeq \delta_{xx'}\, \frac{P(v|x)}{N P(x)}. \tag{3.11}$$
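The counting-statistics approximation 3.11 is easy to check numerically. The following Monte Carlo sketch is our own, with arbitrary illustrative parameter values: it re-estimates P(v|x) from N samples many times and compares the variance of δP(v|x) with the prediction. Note that the exact binomial variance carries an extra factor (1 − P(v|x)) that equation 3.11 drops, which matters only for probable bins:

```python
import numpy as np

rng = np.random.default_rng(1)
Nx, Kv, N, trials = 4, 5, 20000, 400   # arbitrary illustrative sizes
p_x = np.full(Nx, 1.0 / Nx)            # P(x) known and uniform
p_v_given_x = rng.dirichlet(np.ones(Kv), size=Nx)   # an arbitrary true P(v|x)

x0, v0 = 0, 0
p = p_v_given_x[x0, v0]
deltas = []
for _ in range(trials):
    counts_x = rng.multinomial(N, p_x)                    # occurrences of each x
    n_v = rng.multinomial(counts_x[x0], p_v_given_x[x0])  # v-counts given x0
    deltas.append(n_v[v0] / counts_x[x0] - p)             # one draw of deltaP(v0|x0)

empirical_var = np.var(deltas)
approx_var = p / (N * p_x[x0])             # equation 3.11 with x = x'
exact_var = p * (1 - p) / (N * p_x[x0])    # exact binomial variance
```

The empirical variance matches the exact binomial expression, and approx_var bounds it from above, tightly so whenever P(v|x) is small.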
³ We arrive at equation 3.10 by calculating the first leading-order term (n = 2) in the sum of equation 3.6:

$$\langle \Delta I(c;v)\rangle^{(2)} \cong \frac{1}{2\ln(2)} \sum_v \left\langle \sum_c \sum_x \sum_{x'} \frac{P(c,x)\, P(c,x')}{P(c,v)}\, \delta P(v|x)\, \delta P(v|x') \right\rangle,$$

making use of approximation 3.11 and summing over x′, which leads to

$$\langle \Delta I(c;v)\rangle^{(2)} \cong \frac{1}{2\ln(2)\,N} \sum_{v,c} \sum_x \frac{[P(c|x)]^2\, P(v|x)\, P(x)}{P(c|v)\, P(v)},$$

and then substituting P(v|x)P(x)/P(v) = P(x|v) and $P(c|v) = \sum_x P(c|x)P(x|v)$ (compare equation 2.10).
Can we say something about the shape of the resulting "corrected" optimal information curve by analyzing the leading-order error term, equation 3.10? This term is bounded from above by the value it assumes in the deterministic limit (T → 0), in which assignments P(c|x) are either 1 or 0 and thus [P(c|x)]² = P(c|x):⁴

$$\langle \Delta I(c;v)\rangle^{(2)}_{T\to 0} = \frac{1}{2\ln(2)}\, \frac{K_v N_c}{N}. \tag{3.12}$$
Note that Kv is the number of bins we have used to obtain our estimate P̂(v|x), and that if we had adopted a continuous, rather than a discrete, treatment, then the volume of the (finite) V-space would arise instead of Kv.⁵ If one does not assume that P(x) is known, and instead calculates the bias by Taylor expansion of I^emp(c; v) around P(x, v) (see section 5), then the upper bound on the leading-order term (see equation 5.7) is given by Nc(Kv − 1)/(2N ln(2)). This is the Nc-dependent part of the leading correction to the bias as derived in Treves and Panzeri (1995, equation 2.11, term C1). Similar to what these authors found when they computed higher-order corrections, we also found in numerical experiments that the leading term is a surprisingly good estimate of the total bias, and we therefore feel confident in approximating the error by equation 3.10, although we cannot guarantee convergence of the series in equation 3.6.⁶

⁴ Substitution of [P(c|x)]² = P(c|x) into equation 3.10 gives

$$\langle \Delta I(c;v)\rangle^{(2)} = \frac{1}{2\ln(2)N} \sum_{v,c} \frac{\sum_x P(c|x)P(x|v)}{\sum_{x'} P(c|x')P(x'|v)} = \frac{1}{2\ln(2)N} \sum_{v,c} 1 = \frac{1}{2\ln(2)N}\, K_v N_c.$$
⁵ For choosing the number of bins, Kv, as a function of the data set size N, we refer to the large body of literature on this problem, for example, Hall and Hannan (1988).

⁶ For counting statistics (binomial distribution), we have

$$\langle(\delta P(v|x))^n\rangle = \frac{(P(v|x))^n}{N^n} \sum_{k=0}^{n-1} (-1)^{(n-k)}\, \frac{N^k\, n!}{k!\,(n-k)!\,(P(v|x))^k} \sum_{\{l_1 \cdots l_k\}} \frac{1}{k!}\, \frac{N!}{(N-l)!}\, (1-p)^{(N-l)} \prod_{q=1}^{k} \frac{p^{l_q}}{l_q!\, q!},$$

where $l = \sum_{q=1}^{k} l_q$, the lq are positive integers, and the sum over {l₁ ⋯ lk} runs over all partitions of k, that is, values of l₁, ..., lk such that $\sum_{q=1}^{k} q\, l_q = k$. There is a growing number of contributions to the sum at each order n, some of which can be larger than the smallest terms of the expression at order n − 1. If there are enough measurements such that NP(x) is large, the binomial distribution approaches a normal distribution with $\langle(\delta P(v|x))^{2n}\rangle = (2n-1)!!\, \frac{1}{N^n P(x)^n}\big(P(v|x) - P(v|x)^2\big)^n$ and $\langle(\delta P(v|x))^{2n-1}\rangle = 0$ (n = 1, 2, ...). Substituting this into equation 3.6, and considering only terms with x⁽¹⁾ = x⁽²⁾ = ⋯ = x⁽ⁿ⁾, we get

$$\frac{1}{\ln 2} \sum_{x,v,c} P(c,v) \sum_{k=1}^{\infty} \frac{(2k-1)!!}{2k(2k-1)} \left[ \frac{1}{N}\, \frac{P(x)\,(P(c|x))^2}{P(v)^2\,(P(c|v))^2}\, \big(P(v|x) - P(v|x)^2\big) \right]^k,$$

which is not guaranteed to converge.
The lower bound of the leading-order error, equation 3.10, is given by⁷

$$\frac{1}{2\ln(2)}\, \frac{2^{I(c;x)}}{N}, \tag{3.13}$$

and hence the "corrected information curve," which we define as

$$I^{corr}(c;v) := I^{emp}(c;v) - \langle \Delta I(c;v)\rangle, \tag{3.14}$$

is (to leading order) bounded from above by

$$I^{corr}_{UB}(c;v) = I^{emp}(c;v) - \frac{1}{2\ln(2)}\, \frac{2^{I(c;x)}}{N}. \tag{3.15}$$

The slope of this upper bound is $T - 2^{I(c;x)}/2N$ (using equation 2.11), and there is a maximum at

$$T^*_{UB} = \frac{2^{I(c;x)}}{2N}. \tag{3.16}$$

If the hard clustering solution assigns equal numbers of data to each cluster, then the upper bound on the error, equation 3.12, can be rewritten as

$$\frac{1}{2\ln(2)}\, \frac{K_v\, 2^{I(c;x)}}{N}, \tag{3.17}$$

and therefore the lower bound on the information curve,

$$I^{corr}_{LB}(c;v) = I^{emp}(c;v) - \frac{1}{2\ln(2)}\, \frac{K_v\, 2^{I(c;x)}}{N}, \tag{3.18}$$

has a maximum at

$$T^*_{LB} = \frac{K_v\, 2^{I(c;x)}}{2N}. \tag{3.19}$$
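To get a feeling for the scales involved, here is a small worked example of the two bounding temperatures; the numbers are illustrative values of our own choosing, not taken from the article:

```python
# With I(c;x) = 2 bits of compressed description, N = 1000 samples,
# and Kv = 100 bins (all hypothetical values), equations 3.16 and 3.19 give:
I_cx, N, Kv = 2.0, 1000, 100
T_star_ub = 2 ** I_cx / (2 * N)        # equation 3.16 -> 0.002
T_star_lb = Kv * 2 ** I_cx / (2 * N)   # equation 3.19 -> 0.2
print(T_star_ub, T_star_lb)
```

The optimal temperature T* of equation 3.20 below lies between these two values, here spanning two orders of magnitude set by Kv.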
Since both upper and lower bound coincide at the end point of the curve, where T → 0 (see Figure 1), the actual corrected information curve must
⁷ Proof:

$$\sum_{x,v,c} \frac{[P(c|x)]^2\, P(x|v)}{P(c|v)} = \sum_{x,c} P(x,c)\, \frac{P(c|x)}{P(c)} \sum_v \frac{P(v|x)}{P(v|c)} \;>\; \sum_{x,c} P(x,c)\, \frac{P(c|x)}{P(c)} \;\ge\; 2^{\sum_{x,c} P(x,c)\, \log_2\left[\frac{P(c|x)}{P(c)}\right]} = 2^{I(c;x)}.$$
Figure 1: Sketch of the lower and upper bounds, I^corr_LB(c;v) and I^corr_UB(c;v), on the corrected information curve as a function of I(c;x). Both bounds have a maximum under some conditions (see equations 3.16 and 3.19), indicated by x, compared to the empirical information curve, I^emp(c;v), which is monotonically increasing.
have a maximum at

$$T^* = \gamma\, \frac{2^{I(c;x)}}{2N}, \tag{3.20}$$

where 1 < γ < Kv. In general, for deterministic assignments, the information we gain by adding another cluster saturates for large Nc, and it is reasonable to assume that this information grows sublinearly in the number of clusters. That means that the lower bound on I^corr(c; v) has a maximum (or at least a plateau). This ensures that I^corr(c; v) must have a maximum (or plateau) and, hence, that an optimal temperature exists. In the context of the IB, asking for the number of clusters that are consistent with the uncertainty in our estimation of P(v|x) makes sense only for deterministic assignments. From the above discussion, we know the leading-order error term in the deterministic limit, and we define the "corrected relevant information" in the limit T → 0 as⁸
corr IT→0 (c; v) = IT→0 (c; v) −
Kv Nc , 2 ln(2)N
(3.21)
emp
where IT→0 (c; v) is calculated by fixing the number of clusters and cooling emp the temperature to obtain a hard clustering solution. While IT→0 (c; v) incorr creases monotonically with Nc , we expect IT→0 (c; v) to have a maximum (or 8
This quantity is not strictly an information anymore—thus, the quotation marks.
2494
S. Still and W. Bialek
at least a plateau) at Nc∗ , as we have argued above. Nc∗ is then the optimal number of clusters in the sense that using more clusters, we would not capture more meaningful structure (or, in other words, would overfit the data), and although in principle we could always use fewer clusters, this comes at the cost of keeping less relevant information I(c; v). 4 Numerical Results 4.1 Simple Synthetic Test Data. We test our method for finding Nc∗ on data that we understand well and where we know what the answer should be. We thus created synthetic data drawn from normal distributions with zero mean and five different variances (for Figures 2 and 3).9 We emphasize that we chose an example with gaussian distributions not because any of our analysis makes use of gaussian assumptions, but rather because in the gaussian case, we have a clear intuition about the similarity of different distributions and hence about the difficulty of the clustering task. This will become important later, when we make the discrimination task harder (see ˆ Figure 6). We use Kv = 100 bins to estimate P(v|x). In Figures 2 and 3, we emp corr compare how IT→0 (c; v) and IT→0 (c; v) behave as a function of the number of clusters. The number of observations of v, given x, is Nv = N/Nx , and the average number of observations per bin is given by Nv /Kv . Figure 3 corr shows the average IT→0 (c; v), computed from 31 different realizations of 10 the data. All of the 31 individual curves have a maximum at the true number of clusters, Nc∗ = 5, for Nv /Kv ≥ 2. They are offset with respect to each other, which is the source of the error bars. When we have too few data (Nv /Kv = 1), then we can resolve only four clusters (65% of the individual curves peak at Nc∗ = 4, the others at Nc∗ = 5). As Nv /Kv becomes very large, emp corr IT→0 (c; v) approaches IT→0 (c; v), as expected. The curves in Figures 2 and 3 differ in the average number of examples per bin, Nv /Kv . 
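In the hard clustering limit, the recipe of equation 3.21 can be sketched end to end. The code below is our own illustration, not the authors' annealing implementation: it substitutes a greedy agglomerative merge for the annealed hard clustering, and all names and parameter values are ours. It builds the empirical and corrected curves over Nc and reads off Nc*:

```python
import numpy as np

def mutual_info_bits(p_ab):
    """I(a; b) in bits for a joint probability table."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float((p_ab[m] * np.log2(p_ab[m] / (pa @ pb)[m])).sum())

def corrected_info_curve(counts, n_total, kv):
    """Greedy agglomerative hard clustering of the rows of a count table
    (one row per x, one column per v-bin). Returns {Nc: (I_emp, I_corr)},
    where I_corr applies the bias correction of equation 3.21."""
    p_cv = counts / counts.sum()        # start: every x is its own cluster
    curve = {}
    while True:
        nc = len(p_cv)
        i_emp = mutual_info_bits(p_cv)
        curve[nc] = (i_emp, i_emp - kv * nc / (2 * np.log(2) * n_total))
        if nc == 1:
            break
        # merge the pair of clusters whose merge loses the least I(c; v)
        best, best_pair = -np.inf, None
        for a in range(nc):
            for b in range(a + 1, nc):
                merged = np.vstack([np.delete(p_cv, [a, b], axis=0),
                                    (p_cv[a] + p_cv[b])[None, :]])
                i_m = mutual_info_bits(merged)
                if i_m > best:
                    best, best_pair = i_m, (a, b)
        a, b = best_pair
        p_cv = np.vstack([np.delete(p_cv, [a, b], axis=0),
                          (p_cv[a] + p_cv[b])[None, :]])
    return curve
```

On synthetic counts with three well-separated classes of x, the empirical curve rises monotonically with Nc, while the corrected curve jumps up to the true class number and then flattens, so Nc* = argmax over the corrected values recovers the resolvable structure.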
The classification problem becomes harder as we see fewer data. However, it also becomes harder when the true distributions are closer. To separate the two effects, we create synthetic data drawn from gaussian distributions with unit variance and NA different, equidistant means α, which are dα apart.11 NA is the true number of clusters. This problem becomes intrinsically harder as dα becomes smaller. Examples are shown in Figures 4 and 5. The problem becomes easier as we are allowed to look at more data, which corresponds to an increase in Nv /Kv . We are interested
⁹ P(x) = 1/Nx and P(v|x) = 𝒩(0, α(x)), where α(x) ∈ A and |A| = 5, with P(α) = 1/5 and Nx = 50. Nx is the number of "objects" we are clustering.

¹⁰ Each time we compute I^emp_{T→0}(c;v), we start at 100 different, randomly chosen initial conditions to increase the probability of finding the global maximum of the objective functional.

¹¹ P(v|x) = 𝒩(α(x), 1), α(x) ∈ A, N_A := |A|.
Figure 2: Result of clustering synthetic data with P(v|x) = 𝒩(0, α(x)); five possible values for α. Displayed is the relevant information kept in the compression, computed from the empirical distribution, I^emp_{T→0}(c;v), which increases monotonically as a function of the number of clusters Nc. Each curve is computed as the mean of 31 different curves, obtained by virtue of creating different realizations of the data. The error bars are ±1 standard deviation. Nv/Kv equals 1 (diamonds), 2 (squares), 3 (triangles), 5 (stars), and 50 (X's). Nx = 50 and Kv = 100 for all curves. The line indicates the value of the information I(x; v), estimated from 10⁶ data points.
in the regime in the space spanned by Nv /Kv and dα in which our method retrieves the correct number of clusters. In Figure 6, points mark those values of dα and Nv /Kv (evaluated on the shown grid) at which we find the true number of clusters. The different shapes of the points summarize results for 2, 5, and 10 clusters. A missing point on the grid indicates a value of dα and Nv /Kv at which we did not find the correct number of clusters. All of these missing points lie in a regime characterized by a strong overlap of the true distributions combined with scarce data. In that regime, our method always tells us that we can resolve fewer clusters than the true number of clusters. For small sample sizes, the correct number of clusters is resolved only if the clusters are well separated, but as we accumulate more data, we can recover the correct number of classes for more and more overlapping clusters. To illustrate the performance of
Figure 3: Result of clustering synthetic data with P(v|x) = 𝒩(0, α(x)); five possible values for α. Displayed is the "corrected relevant information" in the hard clustering limit, I^corr_{T→0}(c;v) (see equation 3.21), as a function of the number of clusters Nc. Each curve is computed as the mean of 31 different curves, obtained by virtue of creating different realizations of the data. For Nv/Kv ≥ 2, all 31 individual curves peak at Nc* = 5, but are offset with respect to each other. The error bars are ±1 standard deviation. Nv/Kv equals 1 (diamonds), 2 (squares), 3 (triangles), 5 (stars), and 50 (X's). For Nv/Kv = 1, 20 of the 31 curves peak at Nc* = 4, the other 11 at Nc* = 5. The line indicates the value of the information I(x; v), estimated from 10⁶ data points.
the method, we show in Figure 5 the distribution P(x, v) in which α(x) has five different values that occur with equal probability, P(α(x)) = 1/5, and differ by dα = 0.2. For this separation, our method still retrieves five as the optimal number of clusters when we have at least N_v = 2000.¹² Our method detects when only one cluster is present, a case in which many methods fail (Gordon, 1999). We verified this for data drawn from one gaussian distribution and for data drawn from the uniform distribution.

12. We used K_v = 100 bins to estimate P(v|x). P(x) = 1/N_x with N_x = 20.
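The information values quoted in these experiments (e.g., I(x; v) estimated from 10^6 samples) are plug-in estimates computed from binned counts. A minimal sketch of such an estimator, assuming the data have already been collected into a joint count table n(x, v):

```python
import numpy as np

def empirical_mutual_info(counts):
    """Plug-in estimate of I(x; v) in bits from a joint count table n(x, v).
    This is the naive estimator whose upward finite-sample bias
    (Treves & Panzeri, 1995) the corrections in this paper address."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                 # joint P(x, v)
    px = p.sum(axis=1, keepdims=True)         # marginal P(x)
    pv = p.sum(axis=0, keepdims=True)         # marginal P(v)
    nz = p > 0                                # skip empty bins (0 log 0 = 0)
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ pv)[nz])))
```

For a diagonal count table the estimate equals log2 of the number of bins; for a flat table it is zero. With sparse counts the estimate is biased upward, which is exactly why the corrected quantities in equations 3.21 and 5.8 subtract an explicit bias term.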
How Many Clusters? An Information-Theoretic Perspective
Figure 4: A trivial example of those data sets on which we found the correct number of clusters (results are summarized in Figure 6). Here, P(v|x) = N (α(x), 1) with five different values for α, spaced dα = 2 apart. Kv = 100, Nx = 20, Nv /Kv = 20.
4.2 Synthetic Test Data That Explicitly Violate Mixture Model Assumptions. We consider data drawn from a radial normal distribution, according to P(r) = N(1, 0.2), with x = r cos(φ), v = r sin(φ), and P(φ) = 1/2π, as shown in Figure 7. The empirical information curves (see Figure 8) and corrected information curves (see Figure 9) are computed as the mean of seven different realizations of the data for different sample sizes.¹³ The corrected curves peak at N_c*, which is shown as a function of N in Figure 10. For fewer than a few thousand samples, the optimal number of clusters grows roughly as N_c* ∝ N^{2/3}, but there is a saturation around N_c* ≈ 25. This number corresponds to half of the number of x-bins (and therefore half of the number of "objects" we are trying to cluster), which makes sense given the symmetry of the problem.

13. Each time we compute I_{T→0}^{emp}(c; v), we start at 20 different, randomly chosen initial conditions to increase the probability of finding the global maximum of the objective functional. Increasing the number of initial conditions would decrease the error bars at the cost of computational time.
Figure 5: One of the difficult examples of those data sets on which we found the correct number of clusters (results are summarized in Figure 6). Here, P(v|x) = N (α(x), 1) with five different values for α, spaced dα = 0.2 apart. Kv = 100, Nx = 20, Nv /Kv = 20.
5 Uncertainty in P(x)

In the most general case, x can be a continuous variable drawn from an unknown distribution P(x). We then have to estimate the full distribution P(x, v), and if we want to follow the same treatment as above, we have to assume that our estimate approximates the true distribution,

P̂(v, x) = P(v, x) + δP(v, x),   (5.1)
where δP(v, x) is some small perturbation and its average over all possible realizations of the data is zero:
⟨δP(v, x)⟩ = 0.
(5.2)
Figure 6: Result of finding the correct number of clusters with our method for a synthetic data set of size N = N_x N_v (N_x = 20), with P(v|x) = N(α(x), 1) and with either 2, 5, or 10 possible values for α, spaced dα apart. We indicate values of dα and the "resolution" N_v/K_v (K_v = 100) at which the correct number of clusters is found: for 2, 5, and 10 clusters (squares); for only 2 and 5 clusters (stars); for only 2 clusters (circles). The classification error (on the training data) is 0 for all points except for the one that is labeled with 95% correct.
Now, this estimate induces an error not only in I^{emp}(c; v) but also in I^{emp}(c; x). Taylor expansion of these two terms gives

ΔI(c; v) = −(1/ln 2) Σ_{v,c} Σ_{n=2}^∞ [(−1)^n/(n(n−1))] [1/(P(c))^{n−1} − 1/(P(v, c))^{n−1}] (Σ_x P(c|x) δP(x, v))^n − Δ(P(v)),   (5.3)

where

Δ(P(v)) = (1/ln 2) Σ_v Σ_{n=2}^∞ [(−1)^n/(n(n−1))] [1/(P(v))^{n−1}] (Σ_x δP(x, v))^n,   (5.4)

and

ΔI(c; x) = −(1/ln 2) Σ_{v,c} Σ_{n=2}^∞ [(−1)^n/(n(n−1))] [1/(P(c))^{n−1}] (Σ_x P(c|x) δP(x, v))^n.   (5.5)
Figure 7: Twenty thousand data points drawn from a radial distribution, according to P(r) = N(1, 0.2), with x = r cos(φ), v = r sin(φ), and P(φ) = 1/2π. Displayed is the estimated probability distribution (normalized histogram with 50 bins along each axis).
This results in a correction to the objective function (F_corrected = F_emp − ΔF), given by

ΔF = (1/ln 2) Σ_{v,c} Σ_{n=2}^∞ [(−1)^n/(n(n−1))] [1/(P(c))^{n−1}] [1/(P(v|c))^{n−1} + T − 1] (Σ_x P(c|x) δP(x, v))^n − Δ(P(v)),   (5.6)
where Δ(P(v)) is constant in P(c|x) and therefore not important. At the critical temperature T = 1, the error due to uncertainty in P(x) made in calculating I^{emp}(c; v) cancels that made in computing I^{emp}(c; x). For small T, the largest contribution to the error is given by the first term in the sum of equation 5.6, since 1/(P(v|c))^n ≥ 1, ∀ {n, v, c}. Therefore, the procedure that we have suggested for finding the optimal number of clusters in the deterministic limit (T → 0) remains unchanged, even if P(x) is unknown. To illustrate this
Figure 8: I_{T→0}^{emp}(c; v) as a function of the number of clusters N_c, averaged over seven different realizations of the data. Error bars are ±1 standard deviation. The information I(x; v), calculated from 100,000 data points, equals 0.58 bits (line). Data set size N equals 100 (diamonds), 300 (squares), 1000 (triangles), 3000 (stars), and 100,000 (crosses).
point, let us consider, as before, the leading-order term of the error (using the approximation in equation 3.11),

(ΔF)^{(2)} ≈ (1/(2N ln 2)) Σ_{c,v} (1/p(c)) [1/p(v|c) + T − 1] Σ_x (P(c|x))^2 P(x, v).   (5.7)

In the T → 0 limit, this term becomes N_c(K_v − 1)/(2N ln 2), and we find

I_{T→0}^{corr}(c; v) = I_{T→0}^{emp}(c; v) − N_c(K_v − 1)/(2 ln(2) N),   (5.8)
which is insignificantly different from equation 3.21 in the regime K_v ≫ 1. Only for very large temperatures T ≫ 1 (i.e., at the onset of the annealing process) could the error that results from uncertainty in P(x) make a significant difference. The corrected objective function is now given by

F_corr = I^{emp}(c; v) − T I^{emp}(c; x) − ΔF + μ(x) Σ_c P(c|x),   (5.9)
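In the hard clustering limit this criterion is straightforward to apply: compute the empirical relevant information for each number of clusters, subtract the bias term N_c(K_v − 1)/(2 ln(2) N) from equation 5.8, and take the peak. A sketch (the I_emp values below are hypothetical placeholders, not numbers from the paper's experiments):

```python
import numpy as np

def optimal_n_clusters(i_emp, n_clusters, k_v, n_samples):
    """Correct empirical relevant information for finite-sample bias
    (equation 5.8) and return the N_c at which the corrected curve peaks."""
    i_emp = np.asarray(i_emp, dtype=float)
    n_clusters = np.asarray(n_clusters, dtype=float)
    i_corr = i_emp - n_clusters * (k_v - 1) / (2 * np.log(2) * n_samples)
    return int(n_clusters[np.argmax(i_corr)]), i_corr

# Hypothetical monotone I_emp curve: uncorrected it keeps rising with N_c;
# after the correction it peaks at the resolvable number of clusters.
i_emp = [0.05, 0.12, 0.17, 0.20, 0.205, 0.208]
best, _ = optimal_n_clusters(i_emp, [2, 3, 4, 5, 6, 7], k_v=100, n_samples=10000)
```

Here best is 5: each additional cluster costs (K_v − 1)/(2 ln(2) N) bits, so the marginal gain in I_emp beyond five clusters no longer pays for the bias penalty.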
Figure 9: I_{T→0}^{corr}(c; v) as a function of the number of clusters N_c, averaged over seven different realizations of the data. Error bars are ±1 standard deviation. The information I(x; v) calculated from 100,000 data points equals 0.58 bits (line). Data set size N equals 100 (diamonds), 300 (squares), 1000 (triangles), 3000 (stars), and 100,000 (crosses).
and the optimal assignment rule is given by

P(c|x) = (P(c)/Z(x, T)) exp{ −(1/T) [ D_KL[P(v|x) ∥ P(v|c)] − (1/P(x)) Σ_v Σ_{n=2}^∞ (−1)^n ( (P(v|x)/(n (P(c))^n)) [1/(P(v|c))^n + T − 1] (δP(v, c))^n − (1/((n−1)(P(c))^{n−1})) [1/(P(v|c))^{n−1} + T − 1] δP(x, v) (δP(v, c))^{n−1} ) ] },   (5.10)

which has to be solved self-consistently together with equation 2.10 and

δP(v, c) := Σ_x δP(v, x) P(c|x).   (5.11)
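The leading factor of equation 5.10 (dropping the finite-sample correction terms) is the familiar annealed IB assignment rule, P(c|x) ∝ P(c) exp(−D_KL[P(v|x)∥P(v|c)]/T), iterated to self-consistency together with the updates for P(c) and P(v|c). A sketch of one such iteration (KL divergence in nats; the clipping constant is an implementation detail, not from the paper):

```python
import numpy as np

def ib_assignment_update(p_c_given_x, p_x, p_v_given_x, T, eps=1e-300):
    """One self-consistent iteration of the uncorrected IB assignment rule."""
    p_c = p_c_given_x.T @ p_x                                   # P(c)
    p_v_given_c = (p_c_given_x * p_x[:, None]).T @ p_v_given_x  # P(c) P(v|c)
    p_v_given_c /= np.maximum(p_c[:, None], eps)                # P(v|c)
    log_ratio = (np.log(np.maximum(p_v_given_x, eps))[:, None, :]
                 - np.log(np.maximum(p_v_given_c, eps))[None, :, :])
    kl = np.sum(p_v_given_x[:, None, :] * log_ratio, axis=2)    # D_KL[P(v|x)||P(v|c)]
    logits = np.log(np.maximum(p_c, eps))[None, :] - kl / T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))      # stable softmax
    return w / w.sum(axis=1, keepdims=True)                     # new P(c|x)
```

At high T the assignments stay soft; annealing T downward recovers progressively harder clusterings, and the corrected objective indicates when further cooling stops resolving relevant information.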
Figure 10: Optimal number of clusters, N_c*, as found by the suggested method, as a function of the data set size N (log-log axes). The middle curve (crosses) represents the average over seven different realizations of the data; points on the upper/lower curve are maximum/minimum values, respectively. Line at 25.
5.1 Rate Distortion Theory. Let us assume that we estimate the distribution P(x) by P̂(x) = P(x) + δP(x), with ⟨δP(x)⟩ = 0, as before. While there is no systematic error in the computation of d, this uncertainty in P(x) does produce a systematic underestimation of the information cost I(c; x):

ΔI(c; x) = −(1/ln 2) Σ_{n=2}^∞ [(−1)^n/(n(n−1))] Σ_c (Σ_x P(c|x) δP(x))^n / (P(c))^{n−1}.   (5.12)

When we correct the cost functional for this error (with λ = 1/T),

F_corr := I(c; x) + λ ⟨d(x, x_c)⟩ − ΔI(c; x) + μ(x) Σ_c P(c|x),   (5.13)

we obtain for the optimal assignment rule (with λ' = λ ln(2)),

P(c|x) = (P(c)/Z(x, λ)) exp{ −λ' [ d(x, x_c) + Σ_{n=2}^∞ (−1)^n ( (Σ_x P(c|x) δP(x))^n / (n (P(c))^n) − (1/P(x)) δP(x) (Σ_x P(c|x) δP(x))^{n−1} / ((n−1)(P(c))^{n−1}) ) ] }.   (5.14)
Let us consider the leading-order term of the error made in calculating the information cost,

(ΔI(c; x))^{(2)} = −(1/(2 ln(2))) Σ_c (Σ_x P(c|x) δP(x))^2 / P(c).   (5.15)

For counting statistics, we can approximate, as before,

(ΔI(c; x))^{(2)} ≈ −(1/(2 ln(2) N)) Σ_c Σ_x (P(c|x))^2 P(x) / P(c).   (5.16)
The information cost is therefore underestimated by at least 2^{I(c;x)}/(2 ln(2) N) bits.¹⁴ The corrected rate distortion curve, with

I^{corr}(c; x) := I(c; x) − ΔI(c; x),   (5.17)

is then bounded from below by

I_{LB}^{corr}(c; x) = I(c; x) + 2^{I(c;x)}/(2 ln(2) N),   (5.18)

and this bound has a rescaled slope given by

λ̃ = λ (1 − 2^{I(c;x)}/(2N)),   (5.19)
but no extremum. Since there is no optimal trade-off, it is not possible to use the same arguments as we have used before to determine an optimal number of clusters in the hard clustering limit. To do this, we would have to carry the results obtained from the treatment of the finite sample size effects in the IB over to rate distortion theory, using insights we have gained in Still, Bialek, and Bottou (2004) about how to use the IB for data that are given with some measure of distance (or distortion).

14. Using Σ_{x,c} P(x, c) [P(c|x)/P(c)] ≥ 2^{Σ_{x,c} P(x,c) log_2[P(c|x)/P(c)]} = 2^{I(c;x)}.

6 Summary

Clustering, as a form of lossy data compression, is a trade-off between the quality and complexity of representations. In general, a data set (or clustering problem) is characterized by the whole structure of this trade-off—the rate distortion curve or the information curve in the IB method—which quantifies our intuition that some data are more clusterable than others. In
this sense, there is never a single "best" clustering of the data, just a family of solutions evolving as a function of temperature. As we solve the clustering problem at lower temperatures, we find solutions that reveal more and more detailed structure and hence have more distinct clusters. If we have only finite data sets, however, we expect that there is an end to the meaningful structure that can be resolved—at some point, separating clusters into smaller groups just corresponds to fitting the sampling noise. The traditional approach to this issue is to solve the clustering problem in full and then to test for significance or validity of the results by some additional statistical criteria. What we have presented in this work is, we believe, a new approach. Because clustering is formulated as an optimization problem, we can try to take account of the sampling errors and biases directly in the objective functional. In particular, for the IB method, all terms in the objective functional are mutual information, and there is a large literature on the systematic biases in information estimation. There is a perturbative regime in which these biases have a universal form and can be corrected. Applying these corrections, we find that at fixed sample size, the trade-off between complexity and quality really does have an end point beyond which lowering the temperature or increasing the number of clusters does not resolve more relevant information. We have seen numerically that in model problems, this strategy is sufficient to set the maximum number of resolvable clusters at the correct value.

Acknowledgments

We thank N. Tishby for helpful comments on an earlier draft. S. S. thanks M. Berciu and L. Bottou for helpful discussions and acknowledges support from the German Research Foundation (DFG), grant Sti197.

References

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9, 349–368.
Blatt, M., Wiseman, S., & Domany, E. (1996). Superparamagnetic clustering of data. Phys. Rev. Lett., 76, 3251–3254.
Bock, H.-H. (1996). Probability models and hypotheses testing in partitioning cluster analysis. In P. Arabie, L. J. Hubert, & G. De Soete (Eds.), Clustering and classification (pp. 378–453). Singapore: World Scientific.
Buhmann, J. M., & Held, M. (2000). Model selection in clustering by uniform convergence bounds. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Eisen, M., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. (USA), 95, 14863.
Fraley, C., & Raftery, A. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc., 97, 611.
Gordon, A. D. (1999). Classification. London: Chapman and Hall/CRC Press.
Hall, P., & Hannan, E. J. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Horn, D., & Gottlieb, A. (2002). Algorithm for data clustering in pattern recognition problems based on quantum mechanics. Phys. Rev. Lett., 88, 018702.
Lloyd, S. (1957). Least squares quantization in PCM (Tech. Rep.). Murray Hill, NJ: Bell Laboratories.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. L. Cam & J. Neyman (Eds.), Proc. 5th Berkeley Symp. Math. Statistics and Probability (Vol. 1, pp. 281–297). Berkeley: University of California Press.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett., 65, 945.
Roth, V., Lange, T., Braun, M. L., & Buhmann, J. M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16, 1299–1323.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379–423, 623–656.
Smyth, P. (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing, 10, 63.
Still, S., Bialek, W., & Bottou, L. (2004). Geometric clustering using the information bottleneck method. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc., 36, 111.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a dataset via the Gap statistic. J. R. Stat. Soc. B, 63, 411.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proc. 37th Annual Allerton Conf. Urbana: University of Illinois.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399.
Ward, J. H. (1963). Hierarchical groupings to optimize an objective function. J. Am. Stat. Assoc., 58, 236.

Received June 17, 2003; accepted May 28, 2004.
LETTER
Communicated by Idan Segev
A Learning Rule for Local Synaptic Interactions Between Excitation and Shunting Inhibition

Chun-Hui Mo
[email protected]
Ming Gu [email protected]
Christof Koch [email protected] Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125, U.S.A.
The basic requirement for direction selectivity is a nonlinear interaction between two different inputs in space-time. In some models, the interaction is hypothesized to occur between excitation and inhibition of the shunting type in the neuron's dendritic tree. How can the required spatial specificity be acquired in an unsupervised manner? We here propose an activity-based, local learning model that can account for direction selectivity in visual cortex based on such a local veto operation and that depends on synaptically induced changes in intracellular calcium concentration. Our biophysical simulations suggest that a model cell with our learning algorithm can develop direction selectivity organically after unsupervised training. The learning rule is also applicable to a neuron with multiple direction-selective subunits and to a pair of cells with opposite-direction selectivities, and it is stable under different starting conditions, delays, and velocities.

Neural Computation 16, 2507–2532 (2004)  © 2004 Massachusetts Institute of Technology

1 Introduction

The ability to distinguish the direction of movement is important. Cats reared in a stroboscopically illuminated environment developed normal orientation-selective neurons in cortex, but direction-selective neurons were virtually abolished. This effect remained after long periods of normal visual exposure (Cynader & Chernenko, 1976; Humphrey & Saul, 1998; Saul & Feidler, 2002). Therefore, direction selectivity is likely to require spatiotemporally structured synaptic input during early developmental stages. These neurons are thus excellent targets for the study of activity-dependent synaptic weight changes.

Hebb (1949) proposed his famous learning rule based on the correlation between pre- and postsynaptic activities. A computational study showed that a Hebbian learning rule performs poorly in direction-selective synapse
placement (Feidler, Saul, Murthy, & Humphrey, 1997). This is not surprising since the basic requirement for direction selectivity is a nonlinear interaction between two different inputs in space-time. In a natural environment, as many stimuli are expected to move in the preferred as in the null direction. While there exists a correlation between pre- and postsynaptic firing in the preferred direction, this is not the case for motion in the opposite, null, direction. There are modifications of the original Hebbian learning rule that could account for the development of direction selectivity, in particular, postsynaptic "gating" rules that link synaptic weight changes to postsynaptic activities (Feidler et al., 1997; Blais, Cooper, & Shouval, 2000). Spike-time-dependent plasticity (STDP), a temporally asymmetric Hebbian learning rule (Montague & Sejnowski, 1994; Montague, Dayan, Person, & Sejnowski, 1995; Markram, Lubke, Frotscher, & Sakmann, 1997), in which the change in synaptic weight depends on the relative timing of the presynaptic input and the back-propagating postsynaptic spike, might be crucial here. If the postsynaptic spike arrives a few milliseconds before the presynaptic input, the synaptic weight is weakened; if it arrives a few milliseconds later, the weight is increased. A network of neocortical neurons implementing STDP developed direction selectivity after training (Rao & Sejnowski, 2000, 2001; Buchs & Senn, 2002; see also Song, Miller, & Abbott, 2002).

A great deal is known about direction-selective cortical cells. Asymmetrical delayed inhibition is likely to be one of the mechanisms that underlie direction selectivity (Koch & Poggio, 1985; Livingstone, 1998). A bar moving in the preferred direction reaches the excitatory input before the inhibitory one, which acts only after an additional delay (see Figure 1A). The excitatory input reaches the soma and causes the cell to spike because of the temporal
Figure 1: Facing page. A direction selective (DS) mechanism and basic kinds of information available to an excitatory synapse for its learning strategy. (A) A DS neuron in V1 receives two LGN inputs—one excitatory and one delayed inhibitory (via a cortical interneuron). Solid curves show excitatory synaptic conductance changes during the preferred and null direction motion stimuli. Dashed curves show inhibitory synaptic conductance changes plotted in negative. The difference in temporal alignment of these two inputs during motion stimuli forms the basis of the cell’s direction selectivity. (B) Connection diagram for two DS cells with (bottom) and without (top) DS subunits on their dendrites. Both receive inputs from the same LGN cell array. A bar needs to move across the middle line between LGN input cell 2 and 3’s receptive fields to elicit directional response in the top cell. The DS cell at the bottom has superior position-invariant direction selectivity. (C) Three types of information available to an excitatory synapse on a remote dendrite of a DS cell to determine whether its own activity is contributing to the direction selectivity of the host cell. (D) The model neuron and relative excitatory and inhibitory synapses placement. Left: multiple DS subunits. Right: single unit model.
offset between the two inputs. In the opposite, null direction, the excitatory input is "vetoed" by the inhibition if the bar's speed is approximately matched to the delay. The wiring requirement for such a scheme is simple: excitation in visual space can reside on either side of the inhibitory zone but not on both sides, in which case the model receives symmetric input in space-time and is thus not direction selective. Large conductance changes that reverse around the cell's resting potential have been observed in V1 during visual stimuli (Anderson, Carandini, & Ferster, 2000; Borg-Graham, Monier, & Frégnac, 1998). Given the special property of shunting inhibition, it is interesting to investigate its possible role in synaptic learning.

Both retinal and cortical direction-selective cells have subunit structures within their receptive fields (Barlow & Levick, 1965; Emerson, Citron, Vaughan, & Klein, 1987; Livingstone, Pack, & Born, 2001). A comparison of two fictitious cortical DS cells with and without subunit structures is shown in Figure 1B. Both cells receive inputs from the same group of lateral geniculate (LGN) cells, but the one with subunit structures utilizes its resources more efficiently and has superior position-invariant direction tuning. Synaptic logic models involving complex, branch-specific synaptic placements have been proposed (Poggio & Torre, 1978; Poggio, 1982; Koch & Poggio, 1987). Recent work by Mel and colleagues suggests that the dendritic tree of cortical pyramidal neurons may function as a two-layer neural network (Poirazi, Brannon, & Mel, 2003). Standard Hebbian learning rules, describing changes in the overall connection strength between pre- and postsynaptic neurons, cannot account for dendritic learning (Mel, 2002), for they do not distinguish among postsynaptic connection locations.
We demonstrate, using an idealized compartmental simulation, how a learning rule for local synaptic interactions between excitation and shunting inhibition can, in principle, account for direction selectivity and subunit learning at the dendritic level of a single neuron.

2 Methods

All compartmental simulations are carried out using the program NEURON (Hines & Carnevale, 1997). The idealized cell morphology of a direction-selective neuron includes eight dendrites (width 0.5 µm, length 100 µm) that are directly connected to the soma (width 16 µm, length 16 µm). Each dendrite is unbranched and has 20 compartments, for a total of 161 compartments (the soma is modeled as a single compartment). The dendrites are passive except for an N-type calcium conductance, while the cell body contains sodium and potassium conductances that give rise to fast Hodgkin-Huxley-like action potentials without spike adaptation. The biophysical parameters are: Ra = 250 Ω·cm, Cm = 0.5 µF/cm², Eleak = −60 mV, Rm = 10 kΩ·cm², gNa = 0.030 S/cm², gK = 0.028 S/cm², ENMDA = 0 mV, gNMDA = 0–2 nS, τNMDAon = 0.1 ms, τNMDAoff = 80 ms, EAMPA = 0 mV, gAMPA = 0–2 nS, τAMPAon = 0.1 ms, τAMPAoff = 2 ms, EGABA = −60 mV, gGABA = 5.0 nS,
τGABAon = 1 ms, and τGABAoff = 80 ms. Synaptic input is modeled using the point process in NEURON (adopted from Archie & Mel, 2000). The N-type voltage-gated calcium conductance is taken from Benison, Keizer, Chalupa, and Robinson (2001) and mapped to dendrites with a density of 1 mS/cm². We assume that excitatory synapses are located on dendritic spines and that the rapid rise in intracellular calcium concentration at the postsynaptic site inside the spine following synaptic activation has two additive sources (see Figure 2A): the calcium current through the NMDA synapse, I_Ca,NMDA, and the calcium current, I_CaN, located in the dendrite, a certain amount of which diffuses up into the spine. I_Ca,NMDA is calculated as one-third of the total current through the NMDA synapse with a reversal potential E_Ca = 130 mV. Of the I_CaN entering the dendritic compartment associated with the spine, 5% is assumed to contribute instantaneously to the calcium concentration in the spine head compartment (Koch, 1999). These scaling factors are chosen to bring the amounts of calcium at the spine entering through NMDA channels and through N-type voltage-gated calcium channels within the same order of magnitude. The final concentration of free, intracellular calcium at the spine, [Ca2+], is given by a simple decay equation,

d[Ca2+]/dt = I_Ca,NMDA + I_CaN − [Ca2+]/τ_Ca2+,

with τ_Ca2+ = 15 ms. We model all internal calcium buffers and calcium pumps using a single decay constant. The stimulus is a one-dimensional bar moving at 10 degrees per second across the receptive fields of six LGN cells (see Figure 1B; for details, see Mo & Koch, 2003). There are eight excitatory and four inhibitory, deterministic synapses in the model (each corresponding to a cluster of probabilistic synapses; see Figure 1D). Each geniculate input is assumed to directly excite its appropriate dendrite at a single synaptic cluster and, via a local interneuron, to inhibit a dendrite. This is modeled by a single inhibitory synapse that is delayed by 10 ms with respect to excitation. Excitatory synapses are mapped to the dendritic compartment 60 µm away from the soma; inhibitory synapses are located 50 µm away from the cell body. In the main model, inhibition is assumed to be of the shunting (or silent) type, with a reversal potential EGABA = −60 mV (the cell's resting potential is −60 mV). The temporal dynamics of the excitatory (NMDA) and inhibitory (GABA) synaptically induced conductance changes are as described by Destexhe, Mainen, and Sejnowski (1994). The change in the weight of the excitatory synapse is a function of the intracellular calcium concentration in the spine head compartment following synaptic activation:

Weight Change = a · sqrt(exp(u − exp(u))) + b,   where u = c · (−[Ca2+] − d),

with a = −3.3, b = 1, and c = 13. d is a constant that varies continuously between −0.10 and −0.22 as gNMDA varies from 0 to 2 nS. The learning curve is chosen to give a negative output at a medium calcium concentration and a positive one at high calcium con-
centration (see Figure 2C). The parameters a and b are chosen to restrict the function's output to between −1 and 1. The parameter c is a scaling factor that determines the width of the curve, and parameter d is a sliding threshold that linearly shifts the curve according to different values of gNMDA (see Figure 2D). gAMPA is always set to equal gNMDA. Other types of equations that decrease synaptic weight at medium calcium concentrations and increase it at high calcium concentrations can also be used to model the learning curve (personal communication with Ming Tao, 2003).

3 Results

We consider the local information available to an excitatory geniculate synapse on a dendrite of a cortical neuron to correctly judge whether its activity contributed to or degraded the direction selectivity (DS) of the host cell, assuming the inhibitory input has already been connected and fixed. There are three major pieces of information accessible to local mechanisms: the state of the excitatory input, the state of the inhibitory input, and whether the host cell generated a somatic action potential within a small time window and this spike propagated back into the dendrite to the postsynaptic site of excitation. Assuming binary states (e.g., excitatory input is either on or off), this gives rise to eight possible scenarios (for instance, both excitation and inhibition are active and the host cell spikes). We assume the excitatory synapse can be modified only when it is active; this reduces the combinations to four scenarios (see Figure 1C). In scenario 1, there is no inhibition, and the cell spikes after the excitatory synapse opens. The assumption is that the excitatory synapse directly contributes to the cell's direction

Figure 2: Facing page. Calcium dynamics at a spine and the BCM learning curve based on local calcium concentration change. (A) A schematic drawing of calcium entrance points at spines and the local dendritic branch. Calcium can enter spines directly through NMDA channels or indirectly through N-type voltage-gated calcium channels activated by backpropagating somatic action potentials. Shunting inhibition is located between the spine and soma; it can block backpropagating spikes and clamp the membrane voltage to reduce calcium entry. (B) Local calcium concentration changes (solid curves) following synaptic activation for four different learning scenarios. Dashed (resp. dotted) curves show the calcium concentration changes due to calcium entering through N-type voltage-gated calcium channels (resp. local NMDA channels). Peak calcium exposures within 30 ms are marked. (C) A BCM-type learning curve (gNMDA = 1 nS). Peak calcium concentration reached in each of the four scenarios described. The corresponding synaptic weight changes are marked on the curve. Calcium concentrations are in an arbitrary unit. The synaptic weight change is relative to the maximum allowed learning step size in each trial. (D) A linear sliding threshold is chosen for the learning curve. The three learning curves shown are calculated at gNMDA = 0 nS (dashed), gNMDA = 1 nS (solid), gNMDA = 2 nS (dotted).
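The calcium-dependent weight update quoted in the Methods can be written compactly as ΔW = a·sqrt(exp(u − e^u)) + b with u = c(−[Ca2+] − d). A sketch using the constants given there; the linear form of the sliding threshold d is our reading of "varies continuously between −0.10 and −0.22 as gNMDA varies from 0 to 2 nS":

```python
import math

def weight_change(ca, g_nmda_ns):
    """BCM-like learning curve from the Methods: depression (approaching -1)
    for [Ca2+] near the sliding threshold -d, potentiation (approaching b = 1)
    at high [Ca2+]. Calcium is in the paper's arbitrary units."""
    a, b, c = -3.3, 1.0, 13.0
    d = -0.10 - 0.06 * g_nmda_ns      # assumed linear slide: -0.10 .. -0.22
    u = c * (-ca - d)
    return a * math.sqrt(math.exp(u - math.exp(u))) + b
```

For gNMDA = 1 nS the trough sits at [Ca2+] = −d = 0.16, where the change is ≈ −1; for the high calcium exposures of scenario 1 the change approaches +1, matching the curve in Figure 2C.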
selectivity, and thus its connection strength should increase. In scenario 2, there is no inhibition and no spike when the excitatory synapse opens. The fact that the DS cell is not spiking suggests that the stimulus moved in the null direction, yet the excitatory input is not gated by the inhibition and counteracts the cell's direction selectivity. Its connection strength should decrease. In scenario 3, both excitation and inhibition are active, yet the cell still fires an action potential. The assumption is that this scenario corresponds to null direction of motion and that the excitatory synapse landed on a spot with incorrect matching of inhibition. Its connection strength should therefore decrease. In scenario 4, the excitation is successfully blocked by inhibition, suggesting a null direction movement, in which case the blocking by inhibition is legitimate. On the other hand, this could also correspond to a preferred direction movement. The cell generates an action potential in the presence of inhibition, yet this spike, propagating back from the soma into the dendrite, is blocked by inhibition from reaching the site of excitation. Given this ambiguity, the best possible action is to do nothing and keep the excitatory weight constant. An excitatory synapse can adjust its weight, via the local calcium concentration change, provided it can distinguish these four scenarios. This biophysical variable then determines whether the amplitude of the excitatory input is increased, remains the same, or is decreased.

3.1 Calcium Dynamics at the Spine and the Local Learning Rule. Figure 2A illustrates a dendritic branch with two spines, each with an independent excitatory input. Inhibition is mapped to a dendritic compartment between excitation and the soma, fulfilling the "on-the-path" requirement (Koch, Poggio, & Torre, 1982). We mapped the two excitatory synapses onto one electrically equivalent dendritic compartment.
Local, Synaptic Learning Rule
2515

While calcium can enter the spine from the dendrite, there is likely to be a severe calcium concentration gradient from the spine to the dendrite (given the large volume difference between the spine head and the dendrite and the calcium pumps along the thin neck of the spine). Therefore, two nearby spines may be chemically independent, although they are electrically equivalent (Zador, Koch, & Brown, 1990). The excitatory and inhibitory inputs and the backpropagating spike from the soma each affect the intracellular calcium concentration at the spine in their own way. Excitation directly correlates with the calcium current entering through NMDA channels, which is spine and synapse specific. Its time course is mainly determined by the conductance change of the local NMDA synapse. Given the NMDA synapse's reversal potential (set to zero), however, the synaptic input current by itself cannot elevate the membrane potential high enough to cause significant activation of the N-type voltage-gated calcium channels, which are mainly activated when there is a backpropagating spike signaling a global activation state of the DS cell. On-the-path shunting inhibition affects the time course of both calcium currents. It clamps the membrane voltage to the resting potential when activated, thereby reducing the amount of calcium current entering through NMDA channels. Furthermore, it blocks backpropagating spikes, which significantly reduces calcium entry into the spine through N-type voltage-gated calcium channels. The clamping and blocking effects are branch specific due to the local action of shunting inhibition. We implemented the above calcium scheme and computed the resultant changes in free, intracellular calcium concentration at the spine for the four scenarios considered earlier (see Figure 2B). In scenario 1, there is no local inhibition, and the cell spikes after the excitatory synapse opens. Calcium enters through both channel populations. Changes of [Ca2+] are high. In scenario 2, there is neither local inhibition nor a backpropagating spike immediately after the excitatory synapse opens. Calcium mainly enters through NMDA channels. Changes of [Ca2+] fall into an intermediate range. In scenario 3, the excitatory input is blocked by the inhibition, but the cell spikes. The amount of calcium entering through NMDA channels is reduced by inhibition. Meanwhile, residual calcium enters through N-type voltage-gated calcium channels due to the backpropagating action potential immediately before the synapse opens, adding to the total calcium concentration. Changes of [Ca2+] fall, again, into an intermediate range. In scenario 4, excitation is blocked by inhibition in the absence of any action potential. Only a limited amount of calcium enters through the NMDA synapse due to the clamping effect of shunting inhibition. Changes of [Ca2+] are low. As we mentioned earlier, the proper action for scenario 1 is to increase the weight of the excitatory synapse. Changes of [Ca2+] in this case are high. The proper action for scenarios 2 and 3 is to decrease the synaptic weight. Changes of [Ca2+] in these cases are medium.
The proper action for scenario 4 is to keep the synaptic weight unchanged. Changes of [Ca2+] are low. In order to link the synaptic weight change with maximum calcium exposure, the amplitude of each excitatory synapse from the geniculate input to the target cell is changed in accordance with the maximum calcium concentration change at the synapse within 30 ms of synaptic activation. Such a change is computed using the BCM rule (see Figure 2C), a well-known learning rule for cortical plasticity proposed by Bienenstock, Cooper, and Munro (1982). As training progresses and the synaptic weight increases, the threshold for learning needs to increase to prevent runaway excitation and to stabilize the synapse (Bienenstock et al., 1982; Abbott & Nelson, 2000). We use a linear sliding threshold to shift the learning curve without changing its shape, as shown in Figure 2D. We link the sliding threshold directly to the excitatory synaptic connection strength (see section 2 for details): the larger the connection strength (i.e., its postsynaptic conductance), the larger the increase in intracellular calcium concentration necessary to trigger a further increase in synaptic weight.
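As a rough sketch of this BCM-type update with a linearly sliding threshold: low calcium produces no change (scenario 4), intermediate calcium produces depression (scenarios 2 and 3), and calcium above the sliding threshold produces potentiation (scenario 1). The threshold values below are made up for illustration; the actual curves are those of Figures 2C and 2D:

```python
def bcm_update(peak_ca, g_nmda, max_step=0.01):
    """BCM-type weight change driven by the peak calcium within 30 ms.

    The LTD/LTP transition threshold slides linearly with the synapse's
    current conductance g_nmda (nS): the stronger the synapse, the more
    calcium is needed to strengthen it further. Threshold values are
    illustrative, not the fitted values used in the paper.
    """
    theta_low = 0.2                   # below this: no change (scenario 4)
    theta_ltp = 0.6 + 0.4 * g_nmda    # sliding LTD/LTP transition point
    if peak_ca < theta_low:
        return 0.0
    if peak_ca < theta_ltp:
        return -max_step              # intermediate calcium: depression
    return +max_step                  # high calcium: potentiation
```

With this form, the same peak calcium that potentiates a 1 nS synapse can depress a 3 nS one, which is the stabilizing effect the sliding threshold is meant to provide.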
3.2 Direction-Selective Single-Unit Learning. We first tested our learning rule in a model cell without subunits (see Figure 3A). The model initially receives balanced excitatory inputs from both left and right LGN neurons. The two excitatory inputs are mapped onto the same compartment of the dendrite. The initial connection strength is 1 nS each. Delayed inhibition is fixed at 5 nS and is mapped to a compartment between excitation and the soma. During each trial, a bright bar randomly moves either to the left or to the right, and the maximum change of [Ca2+] at each excitatory synapse within 30 ms of activation is recorded. After each trial, the synaptic weight change is calculated based on the learning curve shown in Figure 2D. A synapse can be rewarded only if the host cell spikes and the spike successfully invades the dendrite. Our learning model cannot become direction selective if neither of the excitatory inputs is strong enough to generate a spike. Therefore, we need to impose a "competition" rule to control the model's excitability. We do this by holding the total excitatory connection strength over each local dendrite constant during simulations. This rule is of the "subtraction" type (Abbott & Nelson, 2000); that is, after each trial, half of the excess (or deficit) of the summed synaptic weights relative to the total connection strength is subtracted from (or added to) each excitatory synapse. If the two excitatory synaptic weights are ge1 and ge2, the rule specifies ge1,new = ge1,old − (ge1,old + ge2,old − 2 nS)/2, and similarly for ge2. This competition rule prevents both inputs from slipping to zero. We assume there is no initial bias, and the model cell initially receives balanced input from the left and the right. The outcome is dependent on the training sequence. The weight changes for both synapses during a simulation run that lasted 300 trials are shown in Figure 3B. Before training, the model cell responds equally to motion in either direction (see Figure 3D).
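The subtraction-type competition rule above can be sketched as follows (assuming, as in the text, a fixed total of 2 nS for the two excitatory synapses on a dendrite; the function name is ours):

```python
def compete(g_e1, g_e2, total=2.0):
    """Subtractive normalization of two excitatory weights (in nS).

    Half of the summed weights' excess over (or deficit below) `total`
    is subtracted from (or added to) each synapse, keeping g_e1 + g_e2
    pinned at `total` after every trial.
    """
    excess = (g_e1 + g_e2 - total) / 2.0
    return g_e1 - excess, g_e2 - excess
```

For example, compete(1.2, 0.9) returns approximately (1.15, 0.85), restoring the 2 nS total while preserving the difference between the two weights.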
During the initial training period, if a rightward-moving bar is present, the left excitatory synapse opens first and causes the host cell to spike (scenario 1 for the left synapse). Then the inhibitory synapse opens and blocks the excitatory input from the right excitatory synapse (scenario 3 for this synapse). After the trial, the left connection is strengthened and the right one weakened. Similarly, if a leftward-moving bar is shown to the model, the right connection is strengthened and the left weakened after the trial. If the training regime consisted of alternating left and right stimuli, we would expect the synaptic strengths of both sides to oscillate within a range close to the learning step size but never converge. However, a random training sequence contains consecutive left or right trials and thus causes the oscillation to be larger than one learning step. The longer the run of trials in the same direction, the larger the expected oscillation amplitude. Once the oscillation reaches a large enough value that the connection strength of one excitatory input, say, the right input, drops below the value sufficient to elicit a somatic spike, the oscillation stops. Now, during preferred-direction motion (a bar moving from right to left), its weight is decreased according to scenario 2 instead of being increased according to scenario 1. The right input thus enters a downward spiral and gradually decreases its weight to zero, while its counterpart, the left input, gradually increases its weight to the maximum allowed value. In the simulation shown in Figure 3B, this transition occurs around trial 80. After training, only the excitatory input on the left side of the inhibition remains, and the cell fires in a direction-selective manner (see Figure 3D). Figure 3C shows another simulation run during which the right excitatory input cell won the competition and the preferred direction of motion was reversed. We ran 100 simulations with a 0.032 nS learning step size; in every case, the model cell converged to a DS cell within 200 trials (in 52 of these 100 simulations, the preferred direction was rightward). An index of direction selectivity (DI) is computed as (preferred-direction response − null-direction response) / (preferred-direction response + null-direction response). DI values close to zero indicate a lack of direction selectivity, while the maximal extent of selectivity yields DI = 1. In all of the above cases, the model cell reached DI = 1 after training.

Figure 3: A model cell with a single subunit implementing these rules acquires DS following unsupervised training. (A) The model cell initially receives balanced inputs from the left and right and is not direction selective. After training with random bar movements, the input connection from one side is strengthened while the input connection from the other side is weakened, and the model cell becomes direction selective. (B) Synaptic weight changes for the left (solid curve) and right (dashed curve) inputs during one simulation run. Learning step size is 0.01 nS. (C) A different simulation run. After learning, the cell responds only to leftward motion. (D) The cell's response to a bright bar moving at 10 degrees per second across its receptive field before and after training.

3.3 Learning Multiple Direction-Selective Subunits. How can our learning rule ensure that direction selectivity in different dendritic subunits of the host neuron is the same? To answer this question, we tested our DS learning rule in a model cell with four direction-selective subunits on four of its eight dendrites (see Figure 4A). Each of the middle four LGN cells (1–4) provides delayed inhibitory input to one dendrite (1–4 from the left) of the model cell. LGN cells 0–3 each provide a left excitatory input to dendrites 1–4, respectively. We refer to this group as the left input connection group. LGN cells 2–5 each provide a right excitatory input to dendrites 1–4, respectively (the right input connection group).
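The geniculate-to-dendrite wiring just described can be summarized programmatically (the indices follow the text; the dictionary layout is our own):

```python
def lgn_wiring(n_dendrites=4):
    """Connections of the four-subunit model: dendrite d (1..4) receives
    delayed inhibition from LGN cell d, a left excitatory input from
    LGN cell d-1, and a right excitatory input from LGN cell d+1."""
    return {
        d: {"inhibition": d, "left_exc": d - 1, "right_exc": d + 1}
        for d in range(1, n_dendrites + 1)
    }

wiring = lgn_wiring()
left_group = [wiring[d]["left_exc"] for d in sorted(wiring)]    # LGN cells 0-3
right_group = [wiring[d]["right_exc"] for d in sorted(wiring)]  # LGN cells 2-5
```

Note that LGN cells 2 and 3 belong to both groups: each projects a left excitatory input to one dendrite and a right excitatory input to another, which is what makes branch-specific learning necessary.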
Figure 4: The model with four DS subunits after unsupervised training with a balanced initial setting. (A) All four dendrites initially received balanced excitatory inputs. After training with random bar movements, the input connection from one side is strengthened while the input connection from the other side is weakened. Due to coupling among dendrites and the local nature of shunting inhibition, all subunits are selective for the same direction of motion. (B) Synaptic weight changes for the left (solid curves) and right (dashed curves) excitatory inputs at each dendrite during one simulation run. Learning step size is 0.003 nS. After training, the cell responds only to rightward motion.

We refer to the connections within a group as "friends" and the connections between groups as "competitors." The learning goal is to have all members within one group outcompete their competitors after training. If the four subunits were completely independent and all received the exact same sequence of visual stimuli, we would expect them to converge to the same direction selectivity. Unfortunately, neither of these two conditions is true. Although at each trial the same moving bar is presented to each subunit, the exact timing of the bar reaching the receptive field of each geniculate cell is different. Both NMDA and GABA synapses decay slowly; their late currents can cause differences in the status of the subunits. The shunting inhibition mostly affects local connections, but it also has a more global effect. Because the same learning curve is used for all synapses, these differences can cause different branches to learn to respond to opposite directions of motion. The model cell may thus not become direction selective. The problem can be solved if there are internal links between group members and competition between the groups. The links between group members do exist in our model, through somatic spikes. For example, LGN cell 1's input connection to dendrite 2 and LGN cell 2's input connection to dendrite 3 belong to the same left input group. In a rightward-movement trial, a late spike caused by LGN cell 1's input will also be counted as a spike caused by LGN cell 2, given the overlap of their input time courses. If LGN cell 2's connection strength drops below the transition threshold, this can "rescue" it from scenario 2 to scenario 1. The same spike, however, will not help LGN cell 2's connection to dendrite 1, which belongs to the right input group: the inhibitory input to dendrite 1 always opens before LGN cell 2's synapse on dendrite 1 during a rightward-movement trial and thus blocks any backpropagating spike from reaching LGN cell 2's connection point. So this "link forward" effect benefits only group friends, not competitors. A similar "link backward" effect also exists: an early spike caused by LGN cell 2 will also be registered as LGN cell 1's own spike if it happens within 30 ms of LGN cell 1's firing. The "rescue" effect occurs only during preferred-direction movement; there are no linkages among group members during null-direction movement. If a group member is stuck far away from the divergence point, long runs of consecutive same-direction trials are required to increase its connection strength above the spiking threshold. To speed up convergence, we imposed an additional "majority" rule: we scale the learning step size of each trial with the total number of action potentials generated at the soma during that trial. In such a setting, a group member's weight is increased more in its preferred direction and decreased less in its null direction once its group responds with more spikes than the other group. This creates direct competition between groups and thus facilitates convergence. We also experimented with a modified version of this majority rule that is more plausible on biophysical grounds: here, the scaling is applied only if the synaptic weight increases; the decrease of weight does not depend on the number of spikes triggered.
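The two majority-rule variants can be sketched as a single step-size function (the function and argument names are ours):

```python
def learning_step(base_step, n_somatic_spikes, weight_increasing,
                  scale_decreases=True):
    """'Majority' rule: scale the per-trial learning step by the number
    of somatic action potentials fired during that trial.

    With scale_decreases=False (the biophysically more plausible
    variant), only weight increases are scaled; weight decreases use
    the base step size unchanged.
    """
    if weight_increasing or scale_decreases:
        return base_step * n_somatic_spikes
    return base_step
```

Because the group whose inputs drive more spikes gets proportionally larger potentiation steps, the rule lets the winning group pull the whole cell toward a single preferred direction.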
Both rules yielded the same results. Initially, the model cell received balanced inputs at each of its dendrites (see Figure 4B) and was not direction selective. After 300 training trials, the entire left input group won over the right input group, and the cell developed four DS subunits sensitive to rightward motion. Each subunit reached its divergence point at a different trial and went through a different weight-change trajectory. Note that initially, LGN cell 2 provided excitatory input to both dendrites 1 and 3. After training, only the connection to dendrite 3 remains. The learning process is therefore indeed branch specific. We carried out 100 simulations with a 0.032 nS learning step size, scaled linearly with the number of action potentials generated according to our majority rule. In all cases, the model cell achieved a uniform DS subunit structure within 200 trials with DI = 1. The model converged to a right-direction-selective unit in 47 simulation runs and to a left-direction-selective unit in the remaining runs.

In the above simulations, all dendrites received balanced input from each side. We further tested our learning model in a "random start" configuration. The total input connection strength to a dendrite is still fixed, but the relative contribution from the left input cell and the right input cell is randomly assigned (see Figure 5A). After 200 training trials, the right input group had won at dendrite 1, and the left input group was leading at dendrites 3 and 4. Because of the majority rule and the link-forward and link-backward effects, the left input group eventually increased its "friends'" connection strengths at dendrites 1 and 2 and destabilized its "competitors." After 1000 training trials, the cell had developed four rightward-motion-selective subunits. To test the stability of our learning model in the random start condition, we ran 10 simulations at each of four learning step sizes: 0.1 nS, 0.032 nS, 0.01 nS, and 0.003 nS. Training periods were 100 trials, 500 trials, 1000 trials, and 2000 trials, respectively. DI converged to 1 in all conditions. Another possible scenario during development is that initially all the excitatory geniculate inputs to the V1 cell are very weak, insufficient to generate a spike. As the input connections are strengthened, the cell starts to spike, and competition among input synapses begins. We tested our learning model in such a developmental configuration. Initially, all connection weights are zero. They are then gradually increased because of the "competition" rule: the total excitatory connection strength to a dendrite is low, so at each trial both inputs are increased by the maximum allowed learning step size (see Figure 5B). During the first 100 trials, there are no spikes, and all the input connections increase at each trial. Once the input connection strength reaches about 0.3 nS, the model cell starts to spike, and the connection strengths of the different input groups start to diverge. After 600 training trials, the model cell develops four rightward-motion-selective subunits. To test the stability of our learning model in the development condition, we carried out 10 simulations at each of four learning step sizes.
The model cell converged to DI = 1 under all conditions.

Figure 5: The model with four DS subunits after unsupervised training under the "random start" setting and the "development" setting. (A) All four dendrites initially received random excitatory inputs. The total excitatory connection strength at each branch was set to 1.2 nS, and the learning step size was 0.003 nS. After training, the cell responded only to rightward motion. (B) The excitatory input connection strengths to the model were initially zero. Synaptic weight changes for the left and right excitatory inputs are shown. Learning step size was 0.003 nS. After training, the cell responded only to rightward motion.

3.4 Differential Excitation-Inhibition Learning. Up to now, the inhibitory connection strength was kept fixed throughout the simulations. Here we investigate what happens when we relax this constraint and let inhibition converge to its optimal connection strength. Given that all inhibitory synaptic connection strengths are the same on all dendrites, we do not need a local, dendrite-specific learning rule. We do not need a direction-selective one either, since the excitatory learning rule accomplishes this purpose. We used a simple inhibition learning rule that links the amplitude of inhibition to the model cell's average response, as described by Soto-Trevino, Thoroughman, Marder, and Abbott (2001). We set a target of 3 spikes per trial and keep a 30-trial spiking history. At each trial, the inhibitory synaptic weight change is linearly related to the difference between the average spike number and the target value: ginhibition,new = ginhibition,old + (average − target) × 0.1 nS/spike. We cap the inhibitory synaptic weight at 20 nS. This learning rule is not direction selective: a DS cell that spikes six times in its preferred direction and zero times in its null direction, or a non-DS cell that spikes three times in both directions, both fulfill the requirement. Synaptic weight changes for one such simulation are shown in Figure 6. Initially, all excitatory (see Figure 6C) and inhibitory (see Figure 6B) synaptic weights are zero, and so is the average spike number. Because of the "competition" rule, the weights of the excitatory synapses increase gradually until they exceed the spiking threshold. The learning model is not direction selective at this point because there is no inhibition. The average number of spikes per trial quickly increases above the target value of three spikes per trial. As the inhibitory synaptic connection strength starts to increase, so does the competition among excitatory synapses. The inhibitory synaptic weight overshoots until it gradually settles down (see Figure 6B) and the model cell stabilizes to become direction selective. We carried out 20 simulations: the model cell converged to DI = 0.97 (on average) after 200 trials in each case.

3.5 Two-Cell Network Learning. Finally, we investigated whether the same learning principle, together with mutual inhibition, could develop tuning for opposite directions of motion between two neurons. We assume that both cortical cells share the same connections from the LGN input array (see Figure 7A). In addition, each cell inhibits the other cell at the soma. This inhibition has a fixed delay of 10 ms, a time course given by the difference of two exponentials with time constants of 0.5 ms and 4 ms, a reversal potential EGABA = −70 mV, and a constant conductance gGABA = 5.0 nS. The associated postsynaptic conductance change rises to its maximum value within the first 2 ms and decays over the following 20 ms. This corresponds to a GABAA inhibitory synapse, generally located close to the soma. In order to model inhibitory feedback from nearby cells, an additional inhibitory synapse, identical to the one that connects the two-cell network, is added to each soma (see Figure 7A).
The time of the inhibitory input from nearby regions is randomly selected within the first 20 ms of every trial for each cell in this two-cell network. Due to this stochastic fluctuation, the activities of the two cells are not identical. Cell 1's somatic firing induces inhibition on cell 2. This inhibition prevents cell 2 from firing and thus pushes its direction selectivity further away from cell 1's. Indeed, from a theoretical point of view, the evolution of the network reaches equilibrium when both cells have DI = 1 and are tuned toward opposite directions: mutual inhibition has no effect when the bar moves in cell 1's preferred direction (since cell 2 will not fire and so will not inhibit cell 1's firing), and it guarantees that cell 1 will not fire when the bar moves in cell 1's null direction (since cell 2 will fire in its preferred direction and thus inhibit cell 1). In practice, however, the limited number of simulation trials and the model's stochasticity mean that not all simulation runs end in this global sink. We carried out 60 simulations of this two-cell network learning with a 0.032 nS learning step size for 2000 trials. The weights of all excitatory synapses of both cells were initially set to 0.6 nS. Analogous to the index of direction selectivity DI, we define REcell pair = |(REFcell1 − REFcell2)/2| for each two-cell network (REF = DI when a cell is tuned more toward the left direction of motion and REF = −DI when the cell is tuned more toward the right) to assess how far apart the two cells' direction selectivities are. In the ideal case, when the two cells develop the maximal extent of selectivity toward opposite directions of motion, RE for the pair will be 1. Across all 60 runs, DI averaged 0.94 and RE averaged 0.89. Figure 7B shows the synaptic weight changes for one simulation of such a mutually connected two-cell network. The two cells individually developed tuning for opposite directions of motion.

Figure 6: Differential learning of both excitation and inhibition. (A) Schematic drawing of the starting condition. All initial excitatory and inhibitory connection strengths are zero. (B) Synaptic weight changes for an inhibitory input cell (solid curve) and the average spike number of the model cell (dotted curve, averaged over 30 trials) during a simulation run. (C) Synaptic weight changes for the left input cells (solid curves) and the right input cells (dotted curves) at each dendrite during one simulation run. The model cell is direction selective after learning.

Figure 7: Illustration of the training of a two-cell network. (A) Each neuron receives the same geniculate input. Each cell inhibits the other one and is also the target for common inhibition from other sources (not shown). After training (on the right), both cells have become selective for motion in the two opposing directions. (B) Synaptic weight changes of two dendrites (out of four) in each neuron. Synaptic weight changes for the other two dendrites have similar trajectories. After training, cell 1 develops tuning for the rightward direction of motion and cell 2 for the leftward direction, with DI = 1.

4 Discussion

Our scheme depends on several testable assumptions. First, learning of DS is broken down into four different scenarios, depending on the absence or presence of postsynaptic inhibition and whether the cell spikes (see Figure 1C). These could be tested in calcium imaging experiments. Shunting inhibition can be mimicked using the dynamic clamp (Chance, Abbott, & Reyes, 2002). This, together with two-photon calcium imaging, can be used to determine whether shunting inhibition can indeed direct local synaptic modifications via localized changes in the calcium concentration. Second, our "competition" rule suggests that neurons with direction-selective subunit structures on their dendrites have the ability to control, independently, the total excitatory input connection strength to each of their major dendrites. Evidence for such a mechanism, albeit operating at the whole-cell level, has been provided by Turrigiano, Leslie, Desai, Rutherford, and Nelson (1998). Our rule differs from other rules that depend on an initial bias (Rao & Sejnowski, 2000) in that it is experience driven.
This suggests that cats reared in an environment with little motion in one particular direction (for instance, by having the animals wear LCD goggles) will show a deficit in direction-selective cells tuned for that direction relative to the opposite direction of motion. In our learning model, random motion plays a key role in breaking the balance between the left and right input cells. Symmetry could also be broken by a bias in the initial connection strengths of the left and right inputs. Our learning scheme relies on a few additional assumptions. Delayed inhibition is of the shunting type, which is crucial for our multiple-subunit learning model in achieving the branch-specific veto of excitation and the branch-specific blocking of backpropagating spikes. We compared the single-unit model and the multiple-subunit models with four, six, and eight subunits under shunting inhibition (at −60 mV) and under hyperpolarizing inhibition with EGABA between −60 mV and −90 mV (the cell's resting potential is −60 mV). While single-unit learning does not depend on shunting inhibition, the more subunits a cell has, the more it relies on shunting inhibition (data not shown; Mo, 2003).
The model is quite robust to perturbations in the initial conditions and in the delay between excitation and feedforward inhibition. We reduced this delay from its standard value of 10 ms to 5 and 2.5 ms (while doubling the inhibition rise time from 1 to 2 ms) without any effect on DI. We also varied the velocity of the moving bar from its fixed value of 10 degrees per second to one of four values (5, 10, 15, and 20 degrees per second) randomly selected for each trial during learning. When a bar moves very fast in the null direction, delayed inhibition acts too slowly to reduce excitation significantly. This reduces direction selectivity. In combination, however, slow velocities served to stabilize the response to higher velocities, with the final DI converging to an average of 0.9 over all velocities. This model assumes a one-dimensional input space. It will be a challenge to extend it to direction selectivity in two dimensions; in that case, both orientation and direction selectivity will have to be considered simultaneously. Rapidly rising calcium concentration changes in dendritic spines mediated by action potentials, and long, sustained rises in calcium concentration induced by synaptic inputs, have been observed in calcium imaging experiments (Sabatini, Oertner, & Svoboda, 2002). The measured calcium decay constant is 12 ms at spines and 15 ms at small dendrites. We used a single decay constant of 15 ms for both calcium sources and assumed instantaneous dendrite-to-spine diffusion. We have no evidence to suggest that a more sophisticated treatment of calcium dynamics would change our conclusions appreciably. Experimental evidence suggests that voltage-gated calcium channels exist in spines, while little calcium diffuses between the spine and the dendritic shaft in either direction (Sabatini et al., 2002).
Such a scheme is computationally equivalent to our model setting, given that we used instantaneous, unidirectional dendrite-to-spine calcium diffusion and a single electrical compartment for the spine and the dendritic shaft. We chose the N-type voltage-gated calcium channel to provide voltage-sensitive calcium dynamics distinct from the calcium flowing through NMDA channels; the high-threshold L-type voltage-gated calcium channel would serve our purpose equally well. We exploit a calcium gain-control mechanism that dynamically shifts the learning curve according to the average local activity level. The key to the stability of the original BCM learning rule (Bienenstock et al., 1982) is a nonlinear threshold that decreases and increases faster than the average response. Such a sliding threshold control requires that the tuning curve be narrowed for small responses and broadened for large responses. For simplicity, we simulated a linear sliding of the learning curve without changing its shape. This can be implemented in many ways, such as an increase of the calcium pump rate within the spine with time-averaged calcium exposure or adaptation of calcium-dependent enzyme activities. Both the duration and the amplitude of the postsynaptic calcium concentration affect long-term potentiation (LTP) and long-term depression (LTD) (Yang, Tong, & Zucker, 1999). Brief and large postsynaptic calcium concentration changes lead to LTP, while sustained and moderate changes cause LTD. However, a brief and moderate calcium concentration change can lead to either LTP or LTD. A calcium gain-control mechanism or sliding learning threshold can explain such phenomena: sustained calcium concentration elevation may shift the learning curve, and with it the LTD-LTP transition threshold, toward higher calcium concentrations and thus increase the probability of LTD formation. Learning of multiple direction-selective subunits requires a rule whereby the amplitude of the learning step scales with the total number of action potentials generated at the soma during that trial (which, in our model, is roughly proportional to the peak calcium concentration at the synapse; data not shown): a group of synapses on a dendrite have their weights increased more in their preferred direction and decreased less in the null direction once the soma responds to these synapses with more spikes than to the other group of synapses associated with the opposite direction. This creates direct competition between the two groups and thus facilitates convergence. As noted in section 3.3, a biophysically more plausible variant in which only weight increases are scaled yielded the same results. Spike-timing-dependent plasticity (STDP) is a temporally asymmetric Hebbian learning rule: the synaptic weight change depends on the relative timing of the presynaptic input and the backpropagating spike. Our learning rule is different from the "prediction and sequence learning" mechanism (Montague & Sejnowski, 1994; Montague et al., 1995; Markram et al., 1997).
In our learning scenario 1, the excitatory connection is increased if there are backpropagating spikes within a certain period following synaptic activation; in learning scenario 3, the excitatory connection is decreased if there is a spike a few milliseconds before its opening and the backpropagating spike is clamped by inhibition at opening. These fit into the general framework of STDP, with the exception that our learning rule takes into account not only the temporal sequence between the input and the output but also local inhibition (via its effect on calcium). This difference is critical for the learning of branch-specific DS subunits. Our model depends on action potentials generated at the soma that carry the direction-selective signal propagating back to the excitatory synapses whose weights need to be adjusted. If these synapses were located in remote parts of the distal tree where the action potential may fail to propagate (Golding, Staff, & Spruston, 2002; Mehta, 2004), the learning process would fail to converge. Thus, we would predict that these synapses, which carry the visual signals, are close enough to the soma for the backpropagating action potential to create a reliable signal for guiding the learning process. In addition to feedforward connections, there are extensive feedback interactions among V1 cells, and these feedback currents are likely to be important for sharpening directional tuning (Douglas, Koch, Mahowald, Martin, & Suarez, 1995; Maex & Orban, 1996). Nonlinear shunting inhibition is likely to act during the initial stage of visual cortical processing to set the balance between opponent "on" and "off" responses in different locations of the visual receptive field (Borg-Graham et al., 1998). Our simulation demonstrates that using mutual inhibition, it is possible to generate two direction-selective cells (or, presumably, two tightly coupled groups of such cells) tuned toward opposite directions of motion. It will be interesting to test if mutual excitation among cells selective to the same direction of motion helps to achieve this architecture. In addition, in V1, many columns of neurons that are selective for a given orientation are subdivided into patches preferring opposite directions of motion (Weliky, Bosking, & Fitzpatrick, 1996; Shmuel & Grinvald, 1996). We believe mutual inhibition can also serve as a possible mechanism to organize the distribution of direction-selective cells within an orientation-tuned column. However, the actual mechanisms that the visual system uses during development to achieve an overall balanced directional tuning, as well as an organized distribution of such tuning within single orientation columns, remain to be tackled experimentally. It will be exciting to see more experimental evidence that supports or refutes our model.

Acknowledgments

We thank Patrick Wilken for an insightful discussion of the BCM learning rule and Bartlett Mel for providing us with cell models, help, and criticism. Thanks to Ying Gong for a critical reading of the manuscript. This research was supported by grants from the NSF-sponsored Engineering Research Center at Caltech, the National Institutes of Health, and the National Institute of Mental Health.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3 (Suppl.), 1178–1183.
Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiology, 84, 909–926. Archie, K. A., & Mel, B. W. (2000). A model for intradendritic computation of binocular disparity. Nature Neuroscience, 3, 54–63. Barlow, H. B., & Levick, W. R. (1965). The mechanism of directionally selective units in rabbit's retina. J. Physiology, 178, 477–504. Benison, G., Keizer, J., Chalupa, L. M., & Robinson, D. W. (2001). Modeling temporal behavior of postnatal cat retinal ganglion cells. J. Theor. Biol., 210, 187–199.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neuroscience, 2, 32–48. Blais, B., Cooper, L. N., & Shouval, H. (2000). Formation of direction selectivity in natural scene environments. Neural Computation, 12, 1057–1066. Borg-Graham, L. J., Monier, C., & Frégnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393, 369–373. Buchs, N. J., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. J. Comput. Neurosci., 13, 167–186. Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782. Cynader, M., & Chernenko, G. (1976). Abolition of direction selectivity in the visual cortex of the cat. Science, 193, 504–505. Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci., 1, 195–230. Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A. C., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985. Emerson, R. C., Citron, M. C., Vaughn, W. J., & Klein, S. A. (1987). Nonlinear directionally selective subunits in complex cells of cat striate cortex. J. Neurophysiology, 58, 33–65. Feidler, J. C., Saul, A. B., Murthy, A., & Humphrey, A. L. (1997). Hebbian learning and the development of direction selectivity: The role of geniculate response timings. Network: Comput. Neural. Syst., 8, 195–214. Golding, N. L., Staff, N. P., & Spruston, N. (2002). Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature, 418, 326–331. Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley. Hines, M. L., & Carnevale, N. T. (1997).
The NEURON simulation environment. Neural Computation, 9, 1179–1209. Humphrey, A. L., & Saul, A. B. (1998). Strobe rearing reduces direction selectivity in area 17 by altering spatiotemporal receptive-field structure. J. Neurophysiol., 22, 2945–2955. Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press. Koch, C., & Poggio, T. (1985). The synaptic veto mechanism: Does it underlie direction and orientation selectivity in the visual cortex? In D. Rose & V. Dobson (Eds.), Models of the visual cortex (pp. 408–419). New York: Wiley. Koch, C., & Poggio, T. (1987). Biophysics of computation: Neurons, synapses, and membranes. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic function (pp. 637–697). New York: Wiley. Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Philos. Trans. R. Soc. Lond. B. Biol. Sci., 298, 227–263.
Livingstone, M. S. (1998). Mechanisms of direction selectivity in macaque V1. Neuron, 20, 509–526. Livingstone, M. S., Pack, C. C., & Born, R. T. (2001). 2-D substructure of MT receptive fields. Neuron, 30, 781–793. Maex, R., & Orban, G. A. (1996). Model circuit of spiking neurons generating directional selectivity in simple cells. J. Neurophysiol., 75, 1515–1545. Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215. Mehta, M. R. (2004). Cooperative LTP can map memory sequences on dendritic branches. Trends in Neuroscience, 27, 69–72. Mel, B. W. (2002). Have we been Hebbing down the wrong path? Neuron, 34, 275–288. Mo, C. H. (2003). Synaptic learning rules for local synaptic interaction: Theory and application to direction-selectivity. Unpublished doctoral dissertation, California Institute of Technology. Mo, C. H., & Koch, C. (2003). Modeling reverse-phi motion selective neurons in cortex: Double synaptic veto mechanism. Neural Computation, 15, 735–759. Montague, P. R., Dayan, P., Person, C., & Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725–728. Montague, P. R., & Sejnowski, T. J. (1994). The predictive brain: Temporal coincidence and temporal order in synaptic learning mechanisms. Learn. Mem., 1, 1–33. Poggio, T. (1982). Visual algorithms (A.I. Memo No. 683). Cambridge, MA: Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Poggio, T., & Torre, V. (1978). A new approach to synaptic interactions. In R. Heim & G. Palm (Eds.), Lecture notes in biomathematics: Theoretical approaches to complex systems, 21 (pp. 89–115). Berlin: Springer-Verlag. Poirazi, P., Brannon, T., & Mel, B. W. (2003). Pyramidal neuron as 2-layer neural network. Neuron, 37, 989–999. Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. A.
Solla, T. K. Lee, & K. R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press. Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13, 2221–2237. Sabatini, B. L., Oertner, T. G., & Svoboda, K. (2002). The life cycle of Ca(2+) ions in dendritic spines. Neuron, 33, 439–452. Saul, A. B., & Feidler, J. C. (2002). Development of response timing and direction selectivity in cat visual thalamus and cortex. J. Neurosci., 22, 2945–2955. Shmuel, A., & Grinvald, A. (1996). Functional organization for direction of motion and its relationship to orientation maps in cat area 18. J. Neurosci., 16, 6945–6964. Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Soto-Trevino, C., Thoroughman, K. A., Marder, E., & Abbott, L. F. (2001). Activity-dependent modification of inhibitory synapses in models of rhythmic neural networks. Nature Neuroscience, 4, 297–303. Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896. Weliky, M., Bosking, W. H., & Fitzpatrick, D. (1996). A systematic map of direction preference in primary visual cortex. Nature, 379, 725–728. Yang, S. N., Tang, Y. G., & Zucker, R. S. (1999). Selective induction of LTP and LTD by postsynaptic [Ca2+]i elevation. J. Neurophysiology, 81, 781–787. Zador, A. M., Koch, C., & Brown, T. (1990). Biophysical model of a Hebbian synapse. PNAS, 87, 6718–6722. Received July 22, 2003; accepted May 4, 2004.
LETTER
Communicated by Anthony Burkitt
Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Encoding Model Liam Paninski [email protected] Howard Hughes Medical Institute, Center for Neural Science, New York University, New York, NY 10003, U.S.A., and Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Jonathan W. Pillow [email protected] Howard Hughes Medical Institute, Center for Neural Science, New York University, New York, NY 10003, U.S.A.
Eero P. Simoncelli [email protected] Howard Hughes Medical Institute, Center for Neural Science, and Courant Institute for Mathematical Science, New York University, New York, NY 10003, U.S.A.
We examine a cascade encoding model for neural response in which a linear filtering stage is followed by a noisy, leaky, integrate-and-fire spike generation mechanism. This model provides a biophysically more realistic alternative to models based on Poisson (memoryless) spike generation, and can effectively reproduce a variety of spiking behaviors seen in vivo. We describe the maximum likelihood estimator for the model parameters, given only extracellular spike train responses (not intracellular voltage data). Specifically, we prove that the log-likelihood function is concave and thus has an essentially unique global maximum that can be found using gradient ascent techniques. We develop an efficient algorithm for computing the maximum likelihood solution, demonstrate the effectiveness of the resulting estimator with numerical simulations, and discuss a method of testing the model's validity using time-rescaling and density evolution techniques.

Neural Computation 16, 2533–2561 (2004). © 2004 Massachusetts Institute of Technology

1 Introduction

A central issue in systems neuroscience is the experimental characterization of the functional relationship between external variables, such as sensory stimuli or motor behavior, and neural spike trains. Because neural responses to identical experimental input conditions are variable, we frame the problem statistically: we want to estimate the probability of any spiking response
conditioned on any input. Of course, there are typically far too many possible observable signals to measure these probabilities directly. Thus, our real goal is to find a good model—some functional form that allows us to predict spiking probability even for signals we have never observed directly. Ideally, such a model will be both accurate in describing neural response and easy to estimate from a modest amount of data. A good deal of recent interest has focused on models of cascade type. These models consist of a linear filtering stage in which the observable signal is projected onto a low-dimensional subspace, followed by a nonlinear, probabilistic spike generation stage. The linear filtering stage is typically interpreted as the neuron's "receptive field," efficiently representing the relevant information contained in the possibly high-dimensional input signal, while the spiking mechanism accounts for simple nonlinearities like rectification and response saturation. Given a set of stimuli and (extracellularly) recorded spike times, the characterization problem consists of estimating both the linear filter and the parameters governing the spiking mechanism. Unfortunately, biophysically realistic models of spike generation, such as the Hodgkin-Huxley model or its variants (Koch, 1999), are generally quite difficult to fit given only extracellular data. As such, it has become common to assume a highly simplified model in which spikes are generated according to an inhomogeneous Poisson process, with rate determined by an instantaneous ("memoryless") nonlinear function of the linearly filtered input (see Simoncelli, Paninski, Pillow, & Schwartz, in press, for review and partial list of references). In addition to its conceptual simplicity, this linear-nonlinear-Poisson (LNP) cascade model is computationally tractable.
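The LNP cascade can be sketched in a few lines, together with the spike-triggered average that recovers its linear stage under white-noise stimulation (the 20-tap exponential filter, the exponential nonlinearity, and all constants are assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

dt = 0.001                                    # 1 ms bins
k = np.exp(-np.arange(20) / 5.0)              # assumed 20-tap temporal filter
k /= np.linalg.norm(k)
x = rng.standard_normal(5000)                 # white-noise stimulus
drive = np.convolve(x, k, mode="full")[: len(x)]   # linear filtering stage
rate = 20.0 * np.exp(drive)                   # static nonlinearity (spikes/s)
spikes = rng.poisson(rate * dt)               # memoryless Poisson spike generation

# spike-triggered average: mean stimulus segment preceding each spike
sta = np.zeros(len(k))
for t in np.nonzero(spikes)[0]:
    if t >= len(k) - 1:
        sta += spikes[t] * x[t - len(k) + 1 : t + 1][::-1]
sta /= max(spikes.sum(), 1)
```

For gaussian white-noise input, `sta` is proportional to the true filter `k` in expectation; correlated or "naturalistic" stimuli require the corrections analyzed in the references discussed next.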
In particular, reverse correlation analysis provides a simple unbiased estimator for the linear filter (Chichilnisky, 2001), and the properties of estimators for both the linear filter and static nonlinearity have been thoroughly analyzed, even for the case of highly nonsymmetric or "naturalistic" stimuli (Paninski, 2003). Unfortunately, however, memoryless Poisson processes do not readily capture the fine temporal statistics of neural spike trains (Berry & Meister, 1998; Keat, Reinagel, Reid, & Meister, 2001; Reich, Victor, & Knight, 1998; Aguera y Arcas & Fairhall, 2003). In particular, the probability of observing a spike is not a functional of the recent stimulus alone; it is also strongly affected by the recent history of spiking. This spike history dependence can significantly bias the estimation of the linear filter of an LNP model (Berry & Meister, 1998; Pillow & Simoncelli, 2003; Paninski, Lau, & Reyes, 2003; Paninski, 2003; Aguera y Arcas & Fairhall, 2003). In this letter, we consider a model that provides an appealing compromise between the oversimplified Poisson model and more biophysically realistic but intractable models for spike generation. The model consists of a linear filter (L) followed by a probabilistic, or noisy (N), form of leaky integrate-and-fire (LIF) spike generation (Koch, 1999). This L-NLIF model is illustrated in Figure 1, and is essentially the standard LIF model driven
Figure 1: Illustration of the L-NLIF model. [Schematic labels: stim → linear filter k → leaky integrator (noise Wt, threshold) → response spikes; post-spike current h feeds back into the integrator.]
by a noisy, filtered version of the stimulus; the spike history dependence introduced by the integrate-and-fire mechanism allows the model to emulate many of the spiking behaviors seen in real neurons (Gerstner & Kistler, 2002). This model thus combines the encoding power of the LNP cell with the flexible spike history dependence of the LIF model and allows us to explicitly model neural firing statistics. Our main result is that the estimation of the L-NLIF model parameters is computationally tractable. Specifically, we formulate the problem in terms of classical estimation theory, which provides a natural "cost function" (likelihood) for model assessment and estimation of the model parameters. We describe algorithms for computing the likelihood function and prove that this likelihood function contains no nonglobal local maxima, implying that the maximum likelihood estimator (MLE) can be computed efficiently using standard ascent techniques. Desirable statistical properties of the estimator (such as consistency and efficiency) are all inherited "for free" from classical estimation theory (van der Vaart, 1998). Thus, we have a compact and powerful model for the neural code and a well-motivated, efficient way to estimate the parameters of this model from extracellular data.

2 The Model

We consider a model for which the (dimensionless) subthreshold voltage variable V evolves according to

dV = (−g(V(t) − Vleak) + Istim(t) + Ihist(t)) dt + Wt,    (2.1)
and resets instantaneously to Vreset < 1 whenever V = 1, the threshold potential (see Figure 2). Here, g denotes the membrane leak conductance, Vleak the leak reversal potential, and the stimulus current Istim is defined as Istim (t) = k · x(t),
Figure 2: Behavior of the L-NLIF model during a single interspike interval, for a single (repeated) input current. (Top) Observed stimulus x(t) and response spikes. (Third panel) Ten simulated voltage traces V(t), evaluated up to the first threshold crossing, conditional on a spike at time zero (Vreset = 0). Note the strong correlation between neighboring time points, and the gradual sparsening of the plot as traces are eliminated by spiking. (Fourth panel) Evolution of P(V, t). Each vertical cross section represents the conditional distribution of V at the corresponding time t (i.e., for all traces that have not yet crossed threshold). Note the boundary conditions P(Vth , t) = 0 and P(V, tspike ) = δ(V − Vreset ) corresponding to threshold and reset, respectively. See section 4 for computational details. (Bottom panel) Probability density of the interspike interval (ISI) corresponding to this particular input. Note that probability mass is concentrated at the times when input drives the mean voltage V0 (t) close to threshold. Careful examination reveals, in fact, that peaks in p(ISI) are sharper than peaks in the deterministic signal V0 (t), due to the elimination of threshold-crossing traces that would otherwise have contributed mass to p(ISI) at or after such peaks (Berry & Meister, 1998).
the projection of the input signal x(t) onto the spatiotemporal linear kernel k; the spike-history current Ihist is given by

Ihist(t) = Σ_{j=0}^{i−1} h(t − tj),

where h is a postspike current waveform of fixed amplitude and shape¹ whose value depends only on the time since the last spike ti−1 (with the sum above including terms back to t0, the first observed spike); finally, Wt is an unobserved (hidden) noise process, taken here to be standard gaussian white noise (although we will consider more general Wt later). As usual, in the absence of input, V decays back to Vleak with time constant 1/g. Thus, the nonlinear behavior of the model is completely determined by only a few parameters, namely, {g, Vreset, Vleak}, and h(t). In practice, we assume the continuous aftercurrent h(t) may be written as a superposition of a small number of fixed temporal basis functions; we will refer to the vector of coefficients in this basis as h. We should note that the inclusion of the Ihist current in equation 2.1 introduces additional parameters (namely, h) to the model that need to be fit; in cases where there is insufficient data to properly fit these extra parameters, h could be set to zero, reducing the model, equation 2.1, to the more standard LIF setting. It is important to emphasize that in the following, V(t) itself will be considered a hidden variable; we are assuming that the spike train data we are trying to model have been collected extracellularly, without any access to the subthreshold voltage V. This implies that the parameters of the usual LIF model can only be estimated up to an unlearnable mean and scale factor. Thus, by a standard change of variables, we have not lost any generality by setting the threshold potential, Vth, and the scale of the hidden noise process, σ, to 1 (corresponding to mapping the physical voltage V → 1 + (V − Vth)/σ); the relative noise level (the effective scale of Wt) can be changed by scaling Vleak, Vreset, k, and h together. Of course, other changes of variable are possible (e.g., letting σ change freely and fixing Vreset = 0), but will not affect the analysis here.
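For concreteness, equation 2.1 can be integrated with a simple Euler-Maruyama scheme (a sketch only: the filter k, after-current h, and conductance g below are arbitrary choices, with the threshold fixed at 1 as in the text):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_lnlif(x, k, h, g=5.0, v_leak=0.0, v_reset=0.0, dt=0.001):
    """Euler-Maruyama integration of equation 2.1 with the threshold at 1.
    x: stimulus samples; k: temporal filter; h: post-spike current sampled
    on the same time grid. All parameter values are illustrative."""
    i_stim = np.convolve(x, k, mode="full")[: len(x)]   # Istim(t) = k . x(t)
    i_hist = np.zeros(len(x))                           # Ihist(t), built up as spikes occur
    v, spikes = v_reset, []
    for t in range(len(x)):
        dv = (-g * (v - v_leak) + i_stim[t] + i_hist[t]) * dt
        v += dv + np.sqrt(dt) * rng.standard_normal()   # white-noise term Wt
        if v >= 1.0:                                    # threshold crossing
            spikes.append(t)
            v = v_reset                                 # instantaneous reset
            i_hist[t + 1 : t + 1 + len(h)] += h[: len(x) - t - 1]
    return np.array(spikes)

k = 2.0 * np.ones(10)                        # assumed stimulus filter
h = -5.0 * np.exp(-np.arange(50) / 10.0)     # hyperpolarizing after-current
spikes = simulate_lnlif(rng.standard_normal(2000) + 1.0, k, h)
```

Varying the sign and shape of `h` in this sketch reproduces the qualitative regimes discussed below (adaptation, bursting, bistability).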
The dynamical properties of this type of “spike response model” have been extensively studied (Gerstner & Kistler, 2002); for example, it is known that this class of models can effectively capture much of the behavior of apparently more biophysically realistic models (e.g., Hodgkin-Huxley). We illustrate some of these diverse firing properties in Figures 3 and 4. These figures also serve to illustrate several of the important differences between the L-NLIF and LNP models. In Figure 3, note the fine structure of spike timing in the responses of the L-NLIF model, which is qualitatively similar to in vivo experimental observations (Berry and Meister, 1998; Reich et al., 1 The letter h here was chosen to stand for “history” and should not be confused with the physiologically defined Ih current.
Figure 3: Simulated responses of L-NLIF and LNP models to 20 repetitions of a fixed 100-ms stimulus segment of temporal white noise. (Top) Raster of responses of L-NLIF model to a dynamic input stimulus. The top row shows the fixed (deterministic) response of the model with the noise set to zero. (Middle) Raster of responses of LNP model to the same stimulus, with parameters fit with standard methods from a long run of the L-NLIF model responses to nonrepeating stimuli. (Bottom) Poststimulus time histogram (PSTH) of the simulated L-NLIF response (black line) and PSTH of the LNP model (gray line). Note that the LNP model, due to its Poisson output structure, fails to preserve the fine temporal structure of the spike trains relative to the L-NLIF model.
1998; Keat et al., 2001). The LNP model fails to capture this fine temporal reproducibility. At the same time, the L-NLIF model is much more flexible and representationally powerful: by varying Vreset or h, for example, we can match a wide variety of interspike interval distributions and firing-rate curves, even given a single fixed stimulus. For example, the model can mimic the FI curves of type I or II models, with either smooth or discontinuous growth of the FI curve away from 0 at threshold, respectively (Gerstner & Kistler, 2002). More generally, the L-NLIF model can exhibit adaptive behavior (Rudd & Brown, 1997; Paninski, Lau, & Reyes, 2003; Yu & Lee, 2003) and display rhythmic, tonic, or even bistable dynamical behavior, depending on the parameter settings (see Figure 4).
Figure 4: Diversity of NLIF model response patterns. (A) Firing-rate adaptation. A positive DC current was injected into three different NLIF cells, all with slightly different settings for h (top, h = 0; middle, h hyperdepolarizing; bottom, h depolarizing). Note that all three voltage response traces are identical until the time of the first spike, but adapt to the constant input in three different ways. (For clarity, the noise level is set to zero in all panels.) (B) Rhythmic, bursting responses. DC current (top trace) injected into an NLIF cell with h shown at left. As amplitude c of current increases (voltage traces, top to bottom), burst frequency and duration increase. (C) Tonic and bistable (“memory”) responses. The same current (top trace) was injected into two different NLIF cells with different settings for h. The biphasic h in the bottom panel leads to a self-sustaining response that is inactivated only by the subsequent negative pulse.
3 The Estimation Problem

Our problem now is to estimate the model parameters θ ≡ {k, g, Vleak, Vreset, h} from a sufficiently rich, dynamic input sequence x(t) and the response spike times {ti}. We emphasize again that we are not discussing the problem of estimating the model parameters given intracellularly recorded voltage traces (Stevens & Zador, 1998; Jolivet, Lewis, & Gerstner, 2003); we assume that these subthreshold responses, which greatly facilitate the estimation problem, are unknown—"hidden"—to us. A natural choice is the maximum likelihood estimator (MLE), which is easily proven to be consistent and statistically efficient here (van der Vaart, 1998). To compute the MLE, we need to compute the likelihood and develop an algorithm for maximizing the likelihood as a function of the parameters θ. The tractability of the likelihood function for this model arises directly from the linearity of the subthreshold dynamics of the voltage V(t) during an interspike interval. In the noiseless case (Pillow & Simoncelli, 2003), the voltage trace during an interspike interval t ∈ [ti−1, ti] is given by the solution to equation 2.1 with the noise Wt turned off, with initial condition V0(ti−1) = Vreset:

V0(t) = Vleak + (Vreset − Vleak) e^{−g(t−ti−1)} + ∫_{ti−1}^{t} [ k · x(s) + Σ_{j=0}^{i−1} h(s − tj) ] e^{−g(t−s)} ds,    (3.1)
which is simply a linear convolution of the input current with a filter that decays exponentially with time constant 1/g. It is easy to see that adding gaussian noise to the voltage during each time step induces a gaussian density over V(t), since linear dynamics preserve gaussianity (Karlin & Taylor, 1981). This density is uniquely characterized by its first two moments; the mean is given by equation 3.1, and its covariance is

Cov(t1, t2) = Eg Eg^T = (1/(2g)) (e^{−g|t2−t1|} − e^{−g(t1+t2)}),    (3.2)
where Eg is the convolution operator corresponding to e−gt . We denote this gaussian density G(V(t)| xi , θ ), where index i indicates the ith spike and the corresponding stimulus segment xi (i.e., the stimuli that influence V(t) during the ith interspike interval). Note that this density is highly correlated for nearby points in time; intuitively, smaller leak conductance g leads to stronger correlation in V(t) at nearby time points. On any interspike interval t ∈ [ti−1 , ti ], the only information we have is that V(t) is less than threshold for all times before ti and exceeds threshold during the time bin containing ti . This translates to a set of linear constraints
on V(t), expressed in terms of the set

Ci = {V(t) < 1, ti−1 ≤ t < ti} ∩ {V(ti) ≥ 1}.

Therefore, the likelihood that the neuron first spikes at time ti, given a spike at time ti−1, is the probability of the event V(t) ∈ Ci, which is given by

∫_{V∈Ci} G(V(t) | xi, θ),

the integral of the gaussian density G(V(t) | xi, θ) over the set Ci of (unobserved) voltage paths consistent with the observed spike train data. Spiking resets V to Vreset; since Wt is white noise, this means that the noise contribution to V in different interspike intervals is independent. This "renewal" property, in turn, implies that the density over V(t) for an entire experiment factorizes into a product of conditionally independent terms, where each of these terms is one of the gaussian integrals derived above for a single interspike interval. The likelihood for the entire spike train is therefore the product of these terms over all observed spikes. Putting all the pieces together, then, defines the full likelihood as

L{xi,ti}(θ) = Π_i ∫_{V∈Ci} G(V(t) | xi, θ),
where the product, again, is over all observed spike times {ti } and corresponding stimulus segments { xi }. Now that we have an expression for the likelihood, we need to be able to maximize it over the parameters θ. Our main result is that we can use simple ascent algorithms to compute the MLE without fear of becoming trapped in local maxima.2 Theorem 1. The likelihood L{xi ,ti } (θ) has no nonglobal local extrema in the parameters θ, for any data { xi , ti }. The proof of the theorem (in appendix A) is based on the log concavity of the likelihood L{xi ,ti } (θ) under a certain relabeling of the parameters (θ ). The classical approach for establishing the nonexistence of local maxima 2
More precisely, we say that a smooth function has no nonglobal local extrema if the set of points at which the gradient vanishes is connected and (if nonempty) contains a global extremum; thus, all “local extrema” are in fact global, if a global maximum exists. (This existence, in turn, is guaranteed asymptotically by classical MLE theory whenever the model’s parameters are identifiable and guaranteed in general if we assume θ takes values in some compact set.) Note that the L-NLIF model has parameter space isomorphic
to the convex domain R^{dim(k)+dim(h)+1} × R+², with R+ denoting the positive axis (recall that the parameter h takes values in a finite-dimensional space, g > 0, and Vreset < 1).
of a given function is concavity, which corresponds roughly to the function having everywhere nonpositive second derivatives. However, the basic idea can be extended with the use of any invertible function: if f has no local extrema, neither will g( f ), for any strictly increasing real function g. The logarithm is a natural choice for g in any probabilistic context in which independence plays a role, since sums are easier to work with than products. Moreover, concavity of a function f is strictly stronger than log concavity, so log concavity can be a powerful tool even in situations for which concavity is useless (the gaussian density is log concave but not concave, for example). Our proof relies on a particular theorem (Bogachev, 1998) establishing the log concavity of integrals of log concave functions, and proceeds by making a correspondence between this type of integral and the integrals that appear in the definition of the L-NLIF likelihood above. 4 Computational Methods and Numerical Results Theorem 1 tells us that we can ascend the likelihood surface without fear of getting stuck in local maxima. Now how do we actually compute the likelihood? This is a nontrivial problem: we need to be able to quickly compute (or at least approximate, in a rational way) integrals of multivariate gaussian densities G over simple but high-dimensional orthants Ci . We describe two ways to compute these integrals; each has its own advantages. The first technique can be termed density evolution (Knight, Omurtag, & Sirovich, 2000; Haskell, Nykamp, & Tranchina, 2001; Paninski, Lau, & Reyes, 2003). The method is based on the following well-known fact from the theory of stochastic differential equations (Karlin & Taylor, 1981): given the data ( xi , ti−1 ), the probability density of the voltage process V(t) up to the next spike ti satisfies the following partial differential (Fokker-Planck) equation, ∂P(V, t) ∂[(V − Vrest )P] 1 ∂ 2P +g = , ∂t 2 ∂V 2 ∂V
(4.1)
under the boundary conditions P(V, t_{i−1}) = δ(V − V_reset) and P(V_th, t) = 0, enforcing the constraints that the voltage resets at V_reset and is killed (due to spiking) at V_th, respectively. V_rest(t) is defined, as usual, as the stationary point of the noiseless subthreshold dynamics 2.1:

V_rest(t) ≡ V_leak + (1/g) [ k·x(t) + Σ_{j=0}^{i−1} h(t − t_j) ].
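The density evolution idea can be made concrete with a small finite-difference propagation of the drift-diffusion equation above. The sketch below uses illustrative parameter values and unit noise variance (none of them taken from the paper), a delta initial condition at the reset potential, and a simple absorbing boundary at threshold; it returns the survival probability S(t) = ∫P(V, t)dV and the spike density f(t) = −dS/dt discussed next:

```python
import numpy as np

def density_evolution(g=1.0, v_rest=0.5, v_reset=0.0, v_th=1.0,
                      sigma2=1.0, T=2.0, nv=200):
    """Explicit finite-difference sketch of the Fokker-Planck propagation:
    evolve P(V, t) from a delta at the reset potential, absorb at threshold,
    and read off the survival probability S(t) and spike density f(t).
    Parameter values are illustrative only."""
    v = np.linspace(v_th - 3.0, v_th, nv)
    dv = v[1] - v[0]
    dt = 0.4 * dv**2 / sigma2                     # explicit-scheme stability bound
    P = np.zeros(nv)
    P[np.argmin(np.abs(v - v_reset))] = 1.0 / dv  # delta at V_reset
    S, f = [1.0], [0.0]
    for _ in range(int(T / dt)):
        drift = np.gradient(g * (v - v_rest) * P, dv)   # g d[(V - V_rest)P]/dV
        lap = np.zeros(nv)
        lap[1:-1] = (P[2:] - 2.0 * P[1:-1] + P[:-2]) / dv**2
        P = P + dt * (drift + 0.5 * sigma2 * lap)
        P[0] = P[-1] = 0.0       # absorbing boundaries (spike at v_th)
        s = P.sum() * dv         # survival probability: no spike yet
        f.append((S[-1] - s) / dt)
        S.append(s)
    return np.array(S), np.array(f), dt
```

The likelihood contribution of an interspike interval is then f evaluated at the observed next spike time.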
ML Estimation of a Stochastic Integrate-and-Fire Model
The integral ∫P(V, t)dV is simply the probability that the neuron has not yet spiked at time t, given that the last spike was at t_{i−1}; thus, 1 − ∫P(V, t)dV is the cumulative distribution of the spike time since t_{i−1}. Therefore,

f(t) ≡ −(∂/∂t) ∫P(V, t)dV,

the conditional probability density of a spike at time t (defined at all times t ∉ {t_i} and at all times t_i by left-continuity), satisfies

∫_{t_{i−1}}^{t} f(s)ds = 1 − ∫P(V, t)dV.
Thus, standard techniques (Press, Teukolsky, Vetterling, & Flannery, 1992) for solving the drift-diffusion evolution equation, 4.1, lead to a fast method for computing
f(t) (as illustrated in Figure 2). Finally, the likelihood L_{x_i,t_i}(θ) is simply ∏_i f(t_i). While elegant and efficient, this density evolution technique turns out to be slightly more powerful than what we need for the MLE. Recall that we do not need to compute the conditional probability of spiking f(t) at all times t, but rather at just a subset of times {t_i}. In fact, while we are ascending the likelihood surface (in particular, while we are far from the maximum), we do not need to know the likelihood precisely and can trade accuracy for speed. Thus, we can turn to more specialized, approximate techniques for faster performance. Our algorithm can be described in three steps. The first is a specialized algorithm due to Genz (1992), designed to compute exactly the kinds of integrals considered here, which works well when the orthants C_i are defined by fewer than ≈ 10 linear constraints. The number of actual constraints grows linearly in the length of the interspike interval (t_{i+1} − t_i); thus, to use this algorithm in typical data situations, we adopt a strategy proposed in our work on the deterministic form of the model (Pillow & Simoncelli, 2003), in which we discard all but a small subset of the constraints. The key point is that only a few constraints are actually needed to approximate the integrals to a high degree of precision, basically because of the strong correlations between the value of V_t at nearby time points. This idea provides us with an efficient approximation of the likelihood at a single point in parameter space. To find the maximum of this function using standard ascent techniques, we obviously have to compute the likelihood at many such points. We can make this ascent process much quicker by applying a version of the coarse-to-fine idea. Let L_j denote the approximation to the likelihood given by allowing only j constraints in the above algorithm.
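The constraint-subset idea can be sketched using SciPy's multivariate normal CDF, whose implementation is itself based on Genz's algorithm. The function below is illustrative only: the names are hypothetical, and a stationary Ornstein-Uhlenbeck covariance is substituted for the model's exact conditional covariance. It estimates the probability that the voltage stays below threshold while keeping only a few evenly spaced constraint times:

```python
import numpy as np
from scipy.stats import multivariate_normal

def subthreshold_prob(mu, times, g, v_th=1.0, n_constraints=8):
    """Approximate P(V(t) < v_th at every sampled time) while retaining
    only n_constraints evenly spaced constraint times.  The gaussian CDF
    call is evaluated by scipy via Genz's algorithm."""
    idx = np.unique(np.linspace(0, len(times) - 1, n_constraints).astype(int))
    t = np.asarray(times, float)[idx]
    m = np.asarray(mu, float)[idx]
    # stationary OU covariance (illustrative): Cov(V_s, V_t) = exp(-g|t-s|)/(2g)
    C = np.exp(-g * np.abs(t[:, None] - t[None, :])) / (2.0 * g)
    return float(multivariate_normal(mean=m, cov=C).cdf(np.full(len(t), v_th)))
```

Because nearby voltages are strongly correlated, the estimate changes little once a modest number of constraints is retained.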
Then we know, by a proof identical to that of theorem 1, that L_j has no local maxima; in addition, by the above logic, L_j → L as j grows. It takes little additional effort to prove that

argmax_{θ∈Θ} L_j(θ) → argmax_{θ∈Θ} L(θ)
as j → ∞; thus, we can efficiently ascend the true likelihood surface by ascending the coarse approximants L_j, then gradually refining our approximation by letting j increase. The j = ∞ term is computed via the full density evolution method. The last trick is a simple method for choosing a good starting point for each ascent. To do this, we borrow the jackknife idea from statistics (Efron & Stein, 1981; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998): set our initial guess for the maximizer of L_{j_N} to be

θ⁰_{j_N} = θ^∞_{j_{N−1}} + [(j_N^{−1} − j_{N−1}^{−1}) / (j_{N−1}^{−1} − j_{N−2}^{−1})] (θ^∞_{j_{N−1}} − θ^∞_{j_{N−2}}),
the linear extrapolant on a 1/j scale. Now that we have an efficient ascent algorithm, we need to provide it with a sensible initialization of the parameters. We employ the following simple method, related to our previous work on the deterministic LIF model (Pillow & Simoncelli, 2003): we set g⁰ to some physiologically plausible value (say, 50 ms⁻¹), then set k⁰, h⁰, and V⁰_leak to the ML solution of the following regression problem:

E_g [ k·x_i + g V_leak + Σ_{j=0}^{i−1} h(t − t_j) ] = 1 + σ_i ε_i,
with E_g the exponential convolution matrix and ε_i a standard independent and identically distributed (i.i.d.) normal random variable scaled by

σ_i = Cov(t_i − t_{i−1}, t_i − t_{i−1})^{1/2} = [ (1 − e^{−2g(t_i − t_{i−1})}) / (2g) ]^{1/2},

the standard deviation of the Ornstein-Uhlenbeck process V (recall expression 3.2) at time t_i − t_{i−1}. Note that the reset potential V_reset is initially fixed at zero, away from the threshold voltage V_th = 1, to prevent the trivial θ = 0 solution. The solution to this regression problem has the usual least-squares form and can thus be quickly computed analytically (see Sahani & Linden, 2003, for a related approach), and serves as a kind of j = 1 solution (with the single voltage constraint placed at t_i, the time of the spike). See also Brillinger (1992) for a discrete-time formulation of this single-constraint approximation. To summarize, we provide pseudocode for the full algorithm in Figure 5. One important note is that, due to its ascent structure, the algorithm can be gracefully interrupted at any time without catastrophic error. In addition, the time complexity of the algorithm is linear in the number of spikes. An application of this algorithm to simulated data is shown in Figure 6. Further applications to both simulated and real data will be presented elsewhere.
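The coarse-to-fine ascent with jackknife warm starts might be sketched as follows. All names are illustrative, and a toy quadratic surrogate (whose maximizer sits at target + 1/j, so that L_j → L as j grows) stands in for the actual j-constraint likelihood approximations:

```python
import numpy as np
from scipy.optimize import minimize

def jackknife_start(th1, th2, j, j1, j2):
    """Warm start for the next fit: linear extrapolation of the two
    previous maximizers on a 1/j scale (the jackknife trick)."""
    w = (1.0 / j - 1.0 / j1) / (1.0 / j1 - 1.0 / j2)
    return th1 + w * (th1 - th2)

def coarse_to_fine(neg_Lj, theta0, js):
    """Ascend a sequence of coarse approximations: maximize L_j for
    increasing j, reusing extrapolated warm starts.  neg_Lj(theta, j)
    is a stand-in for the negative log of the j-constraint likelihood."""
    fits = []
    for n, j in enumerate(js):
        if n >= 2:
            theta0 = jackknife_start(fits[-1], fits[-2], j, js[n - 1], js[n - 2])
        theta0 = minimize(neg_Lj, theta0, args=(j,)).x
        fits.append(theta0)
    return fits[-1]

# toy surrogate family: the maximizer of L_j sits at target + 1/j
target = np.array([0.5, -1.0])
neg_Lj = lambda th, j: np.sum((th - (target + 1.0 / j)) ** 2)
theta = coarse_to_fine(neg_Lj, np.zeros(2), js=[2, 4, 8, 16, 32, 64])
```

For this quadratic family the extrapolated warm start is exact: at j = 8 it lands directly on target + 1/8, so each refinement begins essentially converged.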
• Initialize (k, V_leak, h) to the regression solution
• Normalize by the observed scale of ε_i
• For increasing j:
    let θ_j maximize L_j
    jackknife-extrapolate to initialize θ_{j+1}
  end
• Let θ_MLE ≡ θ_∞ maximize L
Figure 5: Pseudocode for the L-NLIF MLE.
Figure 6: Demonstration of the estimator's performance on simulated data. Dashed lines show the true kernel k and aftercurrent h; k is a 12-sample function chosen to resemble the biphasic temporal impulse response of a macaque retinal ganglion cell (Chichilnisky, 2001), while h is a weighted sum of five gamma functions whose biphasic shape induces a slight degree of burstiness in the model's spike responses (see Figure 4). With only 600 spikes of output (given temporal white noise input), the estimator is able to retrieve estimates of k and h that closely match the true functions. Note that the spike-triggered average, which is an unbiased estimator for the kernel of an LNP neuron (Chichilnisky, 2001), differs significantly from the true k (and, of course, provides no estimate for h).
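For reference, the spike-triggered average shown for comparison in the figure is straightforward to compute from stimulus and spike data. A minimal sketch (all names illustrative; the caption's point is that this estimator is unbiased for an LNP cell but not for the L-NLIF model):

```python
import numpy as np

def spike_triggered_average(stim, spike_idx, n_lags=12):
    """Mean stimulus segment preceding each spike (the STA)."""
    segs = [stim[i - n_lags:i] for i in spike_idx if i >= n_lags]
    return np.mean(segs, axis=0)
```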
5 Time Rescaling

Once we have obtained our estimate of the parameters (k, g, V_leak, V_reset, h), how do we verify that the resulting model provides a self-consistent description of the data? This important model validation question has been the focus of recent elegant research, under the rubric of "time rescaling" techniques (Brown, Barbieri, Ventura, Kass, & Frank, 2002). While we lack the room here to review these methods in detail, we can note that they depend essentially on knowledge of the conditional probability of spiking f(t). Recall that we showed how to efficiently compute this function in section 4 and examined some of its qualitative properties in the L-NLIF context in Figure 2. The basic idea is that the conditional probability of observing a spike at time t, given the past history of all relevant variables (including the stimulus and spike history), can be very generally modeled as a standard (homogeneous) Poisson process, under a suitable transformation of the time axis. The correct such "time change" is fairly intuitive: we want to speed up the clock exactly at those times for which the conditional probability of spiking is high (since the probability of observing a Poisson process spike in any given time bin is directly proportional to the length of time in the bin). This effectively "flattens" the probability of spiking. To return to our specific context, if a given spike train was generated by an L-NLIF cell with parameters θ, then the following variables should constitute an i.i.d. sequence from a standard uniform density:

q_i ≡ ∫_{t_i}^{t_{i+1}} f(s)ds,
where f(t) = f_{x_i,t_i,θ}(t) is the conditional probability (as defined in the preceding section) of a spike at time t given the data (x_i, t_i) and parameters θ. The statement follows directly from the time-rescaling theorem (Brown et al., 2002), the inverse cumulative integral transform, and the fact that the L-NLIF model generates a conditional renewal process. This uniform representation can be tested by standard techniques such as the Kolmogorov-Smirnov test and tests for serial correlation.

6 Extensions

It is worth noting that the methods discussed above can be extended in various ways, enhancing the representational power of the model significantly.

6.1 Interneuronal Interactions. First, we should emphasize that the input signal x(t) is not required to be a strictly "external" observable; if we have access to internal variables such as local field potentials or multiple single-unit activity, then the influences of this network activity can be easily
included in the basic model. For example, say we have observed multiple (single-unit) spike trains simultaneously, via a multielectrode array or tetrode. Then one effective model might be

dV = (−g(V(t) − V_leak) + I_stim(t) + I_hist(t) + I_interneuronal(t)) dt + W_t,

with the interneuronal current defined as a linearly filtered version of the other cells' activity:

I_interneuronal(t) = Σ_l k^n_l · n_l(t).
Here, n_l(t) denotes the spike train of the lth simultaneously recorded cell, and the additional filters k^n_l model the effect of spike train l on the cell of interest. Similar models have proven useful in a variety of contexts (Tsodyks, Kenet, Grinvald, & Arieli, 1999; Harris, Csicsvari, Hirase, Dragoi, & Buzsaki, 2003; Paninski, Fellows, Shoham, Hatsopoulos, & Donoghue, 2003). The main point is that none of the results mentioned above are at all dependent on the identity of x(t), and they can therefore be applied unchanged in this new, more general setting.

6.2 Nonlinear Input. Next, we can use a trick from the machine learning and regression literature (Duda & Hart, 1972; Cristianini & Shawe-Taylor, 2000; Sahani, 2000) to relax our requirement that the input be a strictly linear function of x(t). Instead, we can write

I_stim = Σ_k a_k F_k[x(t)],
where k indexes some finite set of functionals Fk [.] and ak are the parameters we are trying to learn. This reduces exactly to our original model when Fk are defined to be time-translates, that is, Fk [ x(t)] = x(t − k). We are essentially unrestricted in our choice of the nonlinear functionals Fk , since, as above, all we are doing is redefining the input x(t) in our basic model to be x∗ (t) ≡ {Fk ( x(t))}. Under the obvious linear independence restrictions on {Fk ( x(t))}, then, the model remains identifiable (in particular, the MLE remains consistent and efficient under smoothness assumptions on {Fk ( x(t))}). Clearly the postspike and interneuronal currents Ihist (t) and Iinterneuronal (t), which are each linear functionals of the network spike history, may also be replaced by nonlinear functionals; for example, Ihist (t) might include current contributions just from the preceding spike (Gerstner & Kistler, 2002), not the sum over all previous spikes. Some obvious candidates for {Fk } are the Volterra operators formed by taking products of time-shifted copies of the input x(t) (Dayan & Abbott, 2001; Dodd & Harris, 2002): F[ x(t)] = x(t − τ1 ) · x(t − τ2 ),
2548
L. Paninski, J. Pillow, and E. Simoncelli
for example, with τ_i ranging over some compact support. Of course, it is well known that the Volterra expansion (essentially a high-dimensional Taylor series) can converge slowly when applied to neural data; other, more sophisticated choices for F_k might include a set of basis functions (Zhang, Ginzburg, McNaughton, & Sejnowski, 1998) that span a reasonable space of possible nonlinearities, such as the principal components of previously observed nonlinear tuning functions (see also Sahani & Linden, 2003, for a similar idea, but in a purely linear setting).

6.3 Regularization. The extensions discussed in the two previous sections have made our basic model considerably more powerful, but at the cost of a larger number of parameters that must be estimated from data. This is problematic, as it is well known that the phenomenon of overfitting can actually hurt the predictive power of models based on a large number of parameters (see, e.g., Sahani & Linden, 2003; Smyth, Willmore, Baker, Thompson, & Tolhurst, 2003; Machens, Wehr, & Zador, 2003, for examples, again in a linear regression setting). How do we control for overfitting in the current context? One simple approach is to use a maximum a posteriori (MAP, instead of ML) estimate for the model parameters. This entails maximizing an expression of the penalized form log L(θ) + Q(θ) instead of just log L(θ), where L(θ) is the likelihood function, as above, and −Q is some "penalty" function (in the classical Bayesian setting, e^Q is required to be a probability measure on the parameter space Θ). If Q is taken to be concave, a glance at the proof of theorem 1 shows that the MAP estimator shares the MLE's global extrema property. As usual, simple regularity conditions on Q ensure that the MAP estimator converges to the MLE given enough data and therefore inherits the MLE's asymptotic efficiency. Thus, we are free to choose Q as we like within the class of smooth, concave functions, bounded above.
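A minimal sketch of this penalized (MAP) estimation, using a toy gaussian log likelihood and a concave weighted-L1-type penalty (all names illustrative; Nelder-Mead is one simple way to handle the non-differentiability at zero, not the paper's method):

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(neg_loglik, b, theta0):
    """MAP sketch: maximize log L(theta) + Q(theta), with the concave
    penalty Q(theta) = -sum_l b_l |theta_l|, by minimizing the
    penalized negative log likelihood."""
    objective = lambda th: neg_loglik(th) + np.sum(b * np.abs(th))
    return minimize(objective, theta0, method="Nelder-Mead",
                    options={"xatol": 1e-6, "fatol": 1e-10}).x

# toy gaussian likelihood: -log L = 0.5 * ||theta - mu||^2, for which
# the MAP estimate is the soft-thresholded mu (shrunk toward zero)
mu = np.array([2.0, 0.05])
theta_map = map_estimate(lambda th: 0.5 * np.sum((th - mu) ** 2),
                         b=np.array([0.5, 0.5]), theta0=np.zeros(2))
```

Note the shrinkage behavior: the large coefficient is pulled toward zero by the penalty weight, and the small one is set (essentially) exactly to zero, illustrating the sparsity property discussed in the text.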
If Q peaks at a point such that all the weight coefficients (a_i or k, depending on the version of the model in question) are zero, the MAP estimator will basically be a more conservative version of the MLE, with the chosen coefficients shifted nonlinearly toward zero. This type of "shrinkage" estimator has been extremely well studied from a variety of viewpoints (e.g., James & Stein, 1960; Donoho, Johnstone, Kerkyacharian, & Picard, 1995; Tipping, 2001) and is known, for example, to perform strictly better than the MLE in certain contexts. Again, see Sahani and Linden (2003), Smyth et al. (2003), and Machens et al. (2003) for some illustrations of this effect. One particularly simple choice for Q is the weighted L1 norm,

Q(k) = −Σ_l |b(l) k(l)|,
where the weights b(l) set the relative scale of Q against the likelihood and may be chosen by symmetry considerations, cross-validation (Machens et al., 2003; Smyth et al., 2003), or evidence optimization (Tipping, 2001; Sahani & Linden, 2003). This choice for Q has the property that sparse solutions (i.e., solutions for which as many components of k as possible are set to zero) are favored; the desirability of this feature is discussed in, for example, Girosi (1998) and Donoho and Elad (2003).

6.4 Correlated Noise. In some situations (particularly when the cell is poorly driven by the input signal x(t)), the whiteness assumption on the noise W_t will be inaccurate. Fortunately, it is possible to generalize this part of the model as well, albeit with a bit more effort. The simplest way to introduce correlations in the noise (Fourcaud & Brunel, 2002; Moreno, de la Rocha, Renart, & Parga, 2002) is to replace the white W_t with an Ornstein-Uhlenbeck process N_t defined by

dN = −(N/τ_N) dt + W_t.    (6.1)
As above, this is simply white noise convolved with a simple exponential filter of time constant τ_N (and therefore the conditional gaussianity of V(t) is retained); the original white noise model is recovered as τ_N → 0, after suitable rescaling. (N_t here is often interpreted as synaptic noise, with τ_N the synaptic time constant, but it is worth emphasizing that N_t is not voltage dependent, as would be necessary in a strict conductance-based model.) Somewhat surprisingly, the essential uniqueness of the global likelihood maximum is preserved for this model: for any τ_N ≥ 0, the likelihood has no local extrema in (k, g, V_leak, V_reset, h).
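On a regular time grid, the exponentially filtered noise of equation 6.1 is just an AR(1) sequence; a minimal sampling sketch (parameter values and names illustrative):

```python
import numpy as np

def ou_noise(tau_n, dt, n, sigma=1.0, rng=None):
    """Sample the correlated noise N_t on a grid of spacing dt: white
    noise passed through an exponential filter with time constant tau_n.
    The stationary variance is sigma**2 * tau_n / 2."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.exp(-dt / tau_n)                   # one-step decay factor
    sd_stat = sigma * np.sqrt(tau_n / 2.0)    # stationary standard deviation
    innov = sd_stat * np.sqrt(1.0 - a**2) * rng.standard_normal(n)
    N = np.empty(n)
    N[0] = sd_stat * rng.standard_normal()    # start in the stationary state
    for t in range(1, n):
        N[t] = a * N[t - 1] + innov[t]
    return N
```

As τ_N → 0 the one-step decay factor vanishes, the samples decorrelate, and (after rescaling) white noise is recovered, matching the limit described in the text.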
where the integral is over all noise paths Nt , under the gaussian measure p(Nt , τN ) induced on Nt by expression 6.1; the multiplicands on the right are 1 or 0 according to whether the voltage trace Vt , given the noise path Nt , the stimulus xi , and the parameters θ, was in the constraint set Ci or not, respectively. Despite the loss of the renewal property, Nt is still a Gauss-Markov diffusion process, and we can write the Fokker-Planck equation (now in two
dimensions, V and N),

∂P(V, N, t)/∂t = (1/2) ∂²P/∂N² + g ∂[(V − V_rest − N/g)P]/∂V + (1/τ_N) ∂[NP]/∂N,
under the boundary conditions

P(V_th, N, t) = 0,

P(V, N, t⁺_{i−1}) = −(1/Z) δ(V − V_reset) [∂P(V, N, t⁻_{i−1})/∂V]|_{V=V_th} × R(N/g − V_th + V_rest(t⁻_{i−1})),

with R the usual linear rectifier,

R(u) = 0 for u ≤ 0;  R(u) = u for u > 0,

and Z the normalization factor

Z = −∫ [∂P(V, N, t⁻_{i−1})/∂V]|_{V=V_th} R(N/g + V_rest(t⁻_{i−1}) − V_th) dN.

The threshold condition here is the same as in equation 4.1, while the reset condition reflects the fact that V is reset to V_reset with each spike, but N is not (the complicated term on the right is obtained from the usual expression by conditioning on V(t⁻_{i−1}) = V_th and ∂V(t⁻_{i−1})/∂t > 0). Note that the relevant discretized differential operators are still extremely sparse, allowing for efficient density propagation, although the density must now be propagated in two dimensions, which does make the solution significantly more computationally costly than in the white noise case. Simple approximate approaches like those described in section 4 (via the Genz algorithm) are available as well.
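The sparsity of the discretized two-dimensional operators is easy to see by assembling the generator from Kronecker products of one-dimensional finite-difference matrices. The sketch below is illustrative only: names are hypothetical, central differences are used throughout, and the boundary and reset terms are omitted entirely:

```python
import numpy as np
import scipy.sparse as sp

def fp2d_generator(v, n, g, v_rest, tau_n):
    """Sparse sketch of the two-dimensional Fokker-Planck generator in
    (V, N): diffusion in N, drift g(V - v_rest - N/g) in V, drift
    N/tau_n in N.  Boundary conditions are not included."""
    dv, dn = v[1] - v[0], n[1] - n[0]
    Iv, In = sp.identity(len(v)), sp.identity(len(n))
    lap_n = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], (len(n), len(n))) / dn**2
    d_v = sp.diags([-1.0, 1.0], [-1, 1], (len(v), len(v))) / (2.0 * dv)
    d_n = sp.diags([-1.0, 1.0], [-1, 1], (len(n), len(n))) / (2.0 * dn)
    # dP/dt = (1/2) d2P/dN2 + g d[(V - v_rest)P]/dV - N dP/dV + (1/tau) d[NP]/dN
    return (sp.kron(0.5 * lap_n, Iv)
            + sp.kron(In, d_v @ (g * sp.diags(v - v_rest)))
            - sp.kron(sp.diags(n), d_v)
            + sp.kron(d_n @ sp.diags(n), Iv) / tau_n)
```

The number of nonzero entries grows only linearly in the grid size, which is what makes the two-dimensional density propagation described in the text feasible.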
6.5 Subthreshold Resonance. Finally, it is worth examining how easily generalizable our methods and results might be to subthreshold dynamics more interesting than the (linear) leaky integrator employed here. While the density evolution methods developed in section 4 can be generalized easily to nonlinear and even time-varying subthreshold dynamics, the Genz algorithm obviously depends on the gaussianity of the underlying distributions (which is unfortunately not preserved by nonlinear dynamics), and the proof of theorem 1 appears to depend fairly strongly on the linearity of the transformation between input current and subthreshold membrane voltage (although linear filtering by nonexponential windows is allowed). Perhaps the main generalization worth noting here is the extension from purely “integrative” to “resonant” dynamics. We can accomplish this by
the simple trick of allowing the membrane conductance g to take complex values (see, e.g., Izhikevich, 2001, for further details and background on subthreshold resonance). This transforms the low-pass exponential filtering of equation 3.1 to a bandpass filtering by a damped sinusoid: a product of an exponential and a cosine whose frequency is determined, as usual, by the imaginary part of g. All of the equations listed above remain otherwise unchanged if we ignore the imaginary part of this new filter's output, and theorem 1 continues to hold for complex g, with g restricted to the upper-right quadrant (real(g), imag(g) ≥ 0) to eliminate the conjugate symmetry of the filter corresponding to g. The only necessary change is in the density evolution method, where we need to propagate the density in an extra dimension to account for the imaginary part of the resulting dynamics (importantly, however, the Markov nature of model 2.1 is retained, preserving the linear diffusion nature of equation 4.1).

7 Discussion

We have shown here that the L-NLIF model, which couples a filtering stage to a biophysically plausible and flexible model of neuronal spiking, can be efficiently estimated from extracellular physiological data. In particular, we proved that the likelihood surface for this model has no local peaks, ensuring the essential uniqueness of the maximum likelihood and maximum a posteriori estimators in some generality. This result leads directly to reliable algorithms for computing these estimators, which are known by general likelihood theory to be statistically consistent and efficient. Finally, we showed that the model lends itself directly to analysis using tools from the modern theory of point processes, such as time-rescaling tests for model validation. As such, we believe the L-NLIF model could become a fundamental tool in the analysis of neural data—a kind of canonical encoding model.
Our primary goal was an elaboration of the LNP model to include spike history (e.g., refractory) effects. As detailed in Simoncelli et al. (in press), the basic LNP model provides a powerful framework for analyzing neural encoding of high-dimensional signals; however, it is well known that the Poisson spiking model is inadequate to capture the fine temporal properties of real spike trains. Previous attempts to address this shortcoming have fallen into two classes: multiplicative models (Snyder & Miller, 1991; Miller & Mark, 1992; Iyengar & Liao, 1997; Berry & Meister, 1998; Brown et al., 2002; Paninski, 2003), of the basic form p(spike(t) | stimulus, spike history) = F(stimulus)H(history) —in which H encodes purely spike-history-dependent terms like refractory or burst effects—and additive models like p(spike(t) | stimulus, history) = F(stimulus + H(history)),
(Brillinger, 1992; Joeken, Schwegler, & Richter, 1997; Keat et al., 2001; Truccolo, Eden, Fellows, Donoghue, & Brown, 2003), in which the spike history is basically treated as a kind of additional input signal; the L-NLIF model is of the latter form, with the postspike current h injected directly into expression 2.1 along with the filtered input k·x(t). It is worth noting that one popular form of the multiplicative history-dependence functional H(·) above, the "inverse-gaussian" density model (Seshadri, 1993; Iyengar & Liao, 1997; Brown et al., 2002), arises as the first-passage time density for the Wiener process, effectively the time of the first spike in the L-NLIF model given constant input and no leak (g = 0) (see Stevens & Zador, 1996, and Plesser & Gerstner, 2000, for further such multiplicative-type approximations). It seems that the treatment of history effects as simply another form of stimulus might make the additive class slightly easier to estimate (this was certainly the case here, for example); however, any such statement remains to be verified by systematic comparison of the accuracy of these two classes of models, given real data. We based our model on the LIF cell in an attempt to simultaneously maximize two competing objectives: flexibility (explanatory power) and tractability (in particular, ease of estimation, as represented by theorem 1). We attempted to make the model as general as possible without violating the conditions necessary to ensure the validity of this theorem. Thus, we included the h current and the various extensions described in section 6 but did not, for example, attempt to model postsynaptic conductances directly, or permit any nonlinearity in the subthreshold dynamics (Brunel & Latham, 2003), or allow any rate-dependent modulations of the membrane conductance g (Stevens & Zador, 1998; Gerstner & Kistler, 2002); it is unclear at present whether theorem 1 can be extended to these cases.
Of course, due largely to its simplicity, the LIF cell has become the de facto canonical model in cellular neuroscience (Koch, 1999). Although the model's overriding linearity is often emphasized (due to the approximately linear relationship between input current and firing rate, and the lack of active conductances), the nonlinear reset has significant functional importance for the model's response properties. In previous work, we have shown that standard reverse-correlation analysis fails when applied to a neuron with deterministic (noise-free) LIF spike generation. We developed a new estimator for this model and demonstrated that a change in leakiness of such a mechanism might underlie nonlinear effects of contrast adaptation in macaque retinal ganglion cells (Pillow & Simoncelli, 2003). We and others have explored other "adaptive" properties of the LIF model (Rudd & Brown, 1997; Paninski, Lau, & Reyes, 2003; Yu & Lee, 2003). We provided a brief sampling of the flexibility of the L-NLIF model in Figures 3 and 4; of course, similar behaviors have been noted elsewhere (Gerstner & Kistler, 2002), although the spiking diversity of this particular model (with no additional time-varying conductances, for example) has not, to our knowledge, been previously collected in one place, and some aspects of this flexibility (e.g., Figure 4C) might come as a surprise in such a simple model. The probabilistic nature of the L-NLIF model provides several important advantages over the deterministic version we have considered previously (Pillow & Simoncelli, 2003). First, clearly, this probabilistic formulation is necessary for our entire likelihood-based presentation; moreover, use of an explicit noise model greatly simplifies the discussion of spiking statistics. Second, the simple subthreshold noise source employed here could provide a rigorous basis for a metric distance between spike trains, useful in other contexts (Victor, 2000). Finally, this type of noise influences the behavior of the model itself (see Figure 2), giving rise to phenomena not observed in the purely deterministic model (Levin & Miller, 1996; Rudd & Brown, 1997; Burkitt & Clark, 1999; Miller & Troyer, 2002; Paninski, Lau, & Reyes, 2003; Yu & Lee, 2003). We are currently in the process of applying the model to physiological data recorded both in vivo and in vitro in order to assess whether it accurately accounts for the stimulus preferences and spiking statistics of real neurons. One long-term goal of this research is to elucidate the different roles of stimulus-driven and stimulus-independent activity on the spiking patterns of both single cells and multineuronal ensembles (Warland, Reinagel, & Meister, 1997; Tsodyks et al., 1999; Harris et al., 2003; Paninski, Fellows et al., 2003).

Appendix A: Proof of Theorem 1

Proof. We prove the main result indirectly, by establishing the more general statement in section 6.4: for any τ_N ≥ 0, the likelihood function for the L-NLIF model has no local extrema in θ = (k, g, V_leak, V_reset, h) (including possibly complex g); the theorem will be recovered in the special case that τ_N → 0 and g is real.
As discussed in the text, we need only establish that the likelihood function is log concave in a certain smoothly invertible reparameterization of θ. The proof is based on the following fact (Bogachev, 1998):

Theorem (integrating out log-concave functions). Let f(x, y) be jointly log concave in x ∈ ℝ^j and y ∈ ℝ^k, j, k < ∞, and define

f_0(x) ≡ ∫ f(x, y) dy.

Then f_0 is log concave in x.
To apply this theorem, we write the likelihood in the following "path integral" form,

L_{x_i,t_i}(θ) = ∫ p(N_t, τ_N) ∏_i 1(V_t(x_i, θ, N_t) ∈ C_i) dN_t,    (A.1)
where we are integrating over each possible path of the noise process N_t, p(N_t, τ_N) is the (gaussian) probability measure induced on N_t under the parameter τ_N, and 1(V_t(x_i, θ, N_t) ∈ C_i) is the indicator function for the event that V_t(x_i, θ, N_t)—the voltage path driven by the noise sample N_t under the model settings θ and input data x_i—is in the set C_i. Recall that C_i is defined as the convex set satisfying a collection of linear inequalities that must be satisfied by any V(t) path consistent with the observed spike train {t_i}; however, the precise identity of these inequalities will not play any role below (in particular, C_i depends on only the real part of V(t) and is independent of τ_N and θ). The logic of the proof is as follows. Since the product of two log-concave functions is log concave, L(θ) will be log concave under some reparameterization if p and 1 are both log concave under the same reparameterization of the variables N and θ, for any fixed τ_N. This follows by (1) approximating the full path integral by (finite-dimensional) integrals over suitably time-discretized versions of path space, (2) applying the above integrating-out theorem, (3) noting that the pointwise limit of a sequence of (log-)concave functions is (log-)concave, and (4) applying the usual separability and continuity limit argument to lift the result from the arbitrarily finely discretized (but still finite) setting to the full (infinite-dimensional) path space setting. To discretize time, we simply sample V(t) and N(t) (and bin t_i) at regular intervals dt, where dt > 0 is an arbitrary small parameter we will send to zero at the end of the proof. We prove the log concavity of p and 1 in the reparameterization (g, V_leak) → (α, I_DC) ≡ (e^{−g dt}, g V_leak). This map is clearly smooth, but due to aliasing effects, the map g → α is smoothly invertible only if the imaginary part of g satisfies imag(g) dt < 2π.
Thus, we restrict the parameter space further to (0 ≤ real(g), 0 ≤ imag(g) ≤ π(dt)⁻¹), an assumption that becomes negligible as dt → 0. Finally, importantly, note that this reparameterization preserves the convexity of the parameter space Θ. Now we move to the proof of the log concavity of the components of the integrand in equation A.1. Clearly, p is the easy part: p(N, τ_N) is independent of all variables but N and τ_N; p is gaussian in N and is thus the prototypical log-concave function. Now we consider the function 1(V_t(x_i, θ, N_t) ∈ C_i). First, note that this function is independent of τ_N given N. Next, an indicator function for a set
is log concave if and only if the set is convex. Thus, it is sufficient to prove that the set of (N, θ) such that V_t(N, θ) ∈ C is convex, for any convex C. To see this, we write out the dependence of V_t on N and θ in operator form:

V_t = E_g [ V_reset δ(0) + I_DC + k·x(t) + Σ_j h(t − t_j) + N_t ],

where E_g, recall, is the exponential convolution operator corresponding to g. Now, the key fact is that E_g⁻¹ depends linearly on α:

E_g =
  | 1              |
  | α    1         |
  | α²   α    1    |
  | ⋮         ⋱  ⋱ |

while

E_g⁻¹ =
  | 1              |
  | −α   1         |
  |      −α   1    |
  |           ⋱  ⋱ |
as can be shown by direct computation. Thus, the set of (N, θ) such that V_t(N, θ) ∈ C can be written as the set N ∈ A(θ)C, with A(θ) an invertible operator, affine in θ, namely,

A(θ)V(t) = E_g⁻¹ V(t) − V_reset δ(0) − I_DC − k·x(t) − Σ_j h(t − t_j)
for any V(t) ∈ C. Since C, Θ, and the set of all possible N are convex, the proof is complete, because the union of the graphs of a convex set of nonsingular affine translates of a convex set is itself convex. We have theorem 1 as a corollary upon restricting α (or equivalently, g) to the real axis and letting τ_N → 0, rescaling, and again noting that the pointwise limit of a sequence of (log-)concave functions is (log-)concave. In a previous version of this article, we gave a different proof, in which the key log-concavity property was established not by the result on integrating out but rather by an appeal to the Prekopa-Rinott theorem (Bogachev, 1998; Rinott, 1976) on log-concave measures. This earlier proof relied on a
2556
L. Paninski, J. Pillow, and E. Simoncelli
somewhat complex construction of convex translations of sets and required a more involved reparameterization; the current proof seems simpler. In addition, the current proof clarifies the generality of the result in at least two directions. First, it is clear that the proof is valid for any fixed log-concave noise measure p(N) (possibly including correlations, nongaussianity, and nonstationarities), not just gaussian white noise. Second, integrating over hyperparameters (e.g., in a Bayesian model selection setting; Sahani & Linden, 2003) does not induce any local maxima as long as the log concavity of the integrands is undisturbed.

Finally, it is interesting to note that a nearly identical proof demonstrates that the likelihood of the model introduced in Keat et al. (2001) contains no nonglobal local maxima, in all parameters except for the time constant τ_p of the after-potential introduced in equation 7 of Keat et al. (2001); however, this proof does not extend in any obvious way to the non-likelihood-based cost function minimized by Keat et al.

It is also worth noting that this proof cannot directly give us log concavity in τ_N for gaussian densities. In fact, no gaussian density with diagonal covariance of the form

$$\mathrm{Diag}\big(f_1(\tau_N),\, f_2(\tau_N),\, \ldots,\, f_i(\tau_N),\, \ldots\big)$$

(we have in mind the covariance operator of a stationary process, expressed in the Fourier basis) can be jointly log concave in (N, τ_N). To see this, set N = 0. This implies that f_i^{-1} must be of the form e^h, for h a concave function. Since the determinant of the Hessian of the function −N²/f_i(τ_N) = −e^{h(τ_N)} N², namely 2e^{2h} N² (h'' − (h')²), is nonpositive in general (since h is concave, i.e., h'' ≤ 0), −e^h N² cannot be jointly concave, and this implies that the gaussian cannot be jointly log concave either (to see this, let N → ∞). Nevertheless, it is not difficult to think of reasonable densities that are jointly log concave in N and additional parameters like τ_N.
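The Hessian-determinant computation above can be checked symbolically; a small sympy sketch (with h left as an unspecified function of τ, as in the text):

```python
import sympy as sp

N, t = sp.symbols('N tau', real=True)
h = sp.Function('h')

# The function -exp(h(tau)) * N^2 from the argument above
f = -sp.exp(h(t)) * N**2

# Determinant of its Hessian with respect to (N, tau)
det = sp.hessian(f, (N, t)).det()

# Should equal 2 e^{2h} N^2 (h'' - (h')^2), which is <= 0 when h is concave
expected = 2 * sp.exp(2 * h(t)) * N**2 * (sp.diff(h(t), t, 2) - sp.diff(h(t), t)**2)
assert sp.simplify(det - expected) == 0
```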
This may prove useful in other contexts (Williams & Barber, 1998; Seeger, 2002).

Appendix B: Computing the Likelihood Gradient

The ascent of the likelihood surface is greatly accelerated by the computation of the gradient. This gradient can always be computed by finite-differencing schemes, of course; however, in the case of a large number of parameters (see sections 6.1 and 6.2), it is much more efficient to compute gradients with respect to a few auxiliary parameters and then arrive at the gradient with respect to the full parameter set using the chain rule for derivatives. We focus on the discretized case for clarity. Thus, we take the derivatives with respect to the mean function V_0(t), evaluated at the constraint times
{t_k}_{1≤k≤j}. These derivatives turn out to be gaussian integrals themselves, albeit over a (j−1)- instead of j-dimensional box, and can be easily translated into derivatives with respect to the parameters.

In order to derive the gradient, note that the discretized approximation to the likelihood can be written

$$L_j = \int_{-\infty}^{z_1} \cdots \int_{-\infty}^{z_{j-1}} \int_{z_j}^{\infty} p(y_1, \ldots, y_j)\, dy_1 \cdots dy_j,$$

where the y_k represent the transformed variables y_k = V(t_k) − V_0(t_k), z_k = 1 − V_0(t_k), and p denotes the corresponding gaussian density, with zero mean and covariance we will call Σ (recall expression 3.2). Now, the partial derivatives of L with respect to the z_k are

$$\frac{\partial}{\partial z_k} L = \int_{-\infty}^{z_1} \cdots \int_{-\infty}^{z_{k-1}} \int_{-\infty}^{z_{k+1}} \cdots \int_{z_j}^{\infty} p(y_1, \ldots, y_k = z_k, \ldots, y_j)\, dy_1 \cdots dy_{k-1}\, dy_{k+1} \cdots dy_j = \left( \int_{C_{i \neq k}} p(\vec y_{i \neq k} \mid y_k = z_k)\, d\vec y_{i \neq k} \right) p(y_k = z_k),$$

with a sign change (for k = j) to account for the upward integral corresponding to the final, above-threshold constraint. We can compute the marginal and conditional densities p(y_k = z_k) and p(y_{i≠k} | y_k = z_k) using standard gaussian identities:

$$p(y_k = z_k) = \mathcal{N}(0, \Sigma_{k,k})(z_k),$$

$$p(\vec y_{i \neq k} \mid y_k = z_k) = \mathcal{N}(\mu^*, \Sigma^*),$$

where

$$\mu^* = \vec V_0(t_{i \neq k}) + z_k \frac{\Sigma_{i \neq k, k}}{\Sigma_{k,k}}, \qquad \Sigma^* = \Sigma_{i \neq k, i \neq k} - \frac{\Sigma_{i \neq k, k}\, \Sigma_{k, i \neq k}}{\Sigma_{k,k}}.$$

Thus, the gradient ∇_z L requires computing one gaussian integral for each constraint z_k. From the vector ∇_z L, we can use simple linear operations to obtain the gradient with respect to any of the parameters that enter only via V_0(t), namely, h, k, and V_leak.

Acknowledgments

We thank E. J. Chichilnisky, W. Gerstner, Z. Ghahramani, B. Lau, and S. Shoham for helpful suggestions. L. P. was partially supported by pre- and postdoctoral fellowships from HHMI; J. W. P. was partially supported by a predoctoral fellowship from NSF and by an NYU Dean's Dissertation Fellowship.
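The gradient identity derived in appendix B is easy to sanity-check numerically in the simplest nontrivial case (j = 2); the sketch below (correlation and constraint values are arbitrary illustration choices) compares the analytic derivative, built from the Schur-complement conditional, with a finite difference of a directly integrated likelihood:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.integrate import dblquad

# Two constraints (j = 2): y1 <= z1 (stay below threshold), y2 >= z2 (crossing).
rho, z1, z2 = 0.5, 0.3, -0.2   # arbitrary illustration values
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def L(a, b):
    # L = int_{-inf}^{a} int_{b}^{inf} p(y1, y2) dy2 dy1, truncated at +-8 sigma
    return dblquad(lambda y2, y1: joint.pdf([y1, y2]), -8.0, a, b, 8.0)[0]

# Identity for k = 1:  dL/dz1 = p(y1 = z1) * P(y2 >= z2 | y1 = z1),
# with the conditional given by the Schur complement:
#   y2 | y1 = z1  ~  N(rho * z1, 1 - rho^2)
analytic = norm.pdf(z1) * (1.0 - norm.cdf(z2, loc=rho * z1,
                                          scale=np.sqrt(1.0 - rho**2)))
eps = 1e-3
numeric = (L(z1 + eps, z2) - L(z1 - eps, z2)) / (2 * eps)
assert abs(analytic - numeric) < 1e-4
```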
References

Agüera y Arcas, B., & Fairhall, A. (2003). What causes a neuron to spike? Neural Computation, 15, 1789–1807.
Berry, M., & Meister, M. (1998). Refractoriness and neural precision. Journal of Neuroscience, 18, 2200–2211.
Bogachev, V. (1998). Gaussian measures. New York: AMS.
Brillinger, D. (1992). Nerve cell spike train data analysis: A progression of technique. Journal of the American Statistical Association, 87, 260–271.
Brown, E., Barbieri, R., Ventura, V., Kass, R., & Frank, L. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14, 325–346.
Brunel, N., & Latham, P. (2003). Firing rate of the noisy quadratic integrate-and-fire neuron. Neural Computation, 15, 2281–2306.
Burkitt, A., & Clark, G. (1999). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output. Neural Computation, 11, 871–901.
Chichilnisky, E. (2001). A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12, 199–213.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Dodd, T., & Harris, C. (2002). Identification of nonlinear time series via kernels. International Journal of Systems Science, 33, 737–750.
Donoho, D., & Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. PNAS, 100, 2197–2202.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., & Picard, D. (1995). Wavelet shrinkage: Asymptopia? J. R. Statist. Soc. B, 57(2), 301–337.
Duda, R., & Hart, P. (1972). Pattern classification and scene analysis. New York: Wiley.
Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1, 141–149.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 1455–1480.
Harris, K., Csicsvari, J., Hirase, H., Dragoi, G., & Buzsaki, G. (2003). Organization of cell assemblies in the hippocampus. Nature, 424, 552–556.
Haskell, E., Nykamp, D., & Tranchina, D. (2001). Population density methods for large-scale modelling of neuronal networks with realistic synaptic kinetics. Network, 12, 141–174.
Iyengar, S., & Liao, Q. (1997). Modeling neural activity using the generalized inverse gaussian distribution. Biological Cybernetics, 77, 289–295.
Izhikevich, E. (2001). Resonate-and-fire neurons. Neural Networks, 14, 883–894.
James, W., & Stein, C. (1960). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, 361–379.
Joeken, S., Schwegler, H., & Richter, C. (1997). Modeling stochastic spike train responses of neurons: An extended Wiener series analysis of pigeon auditory nerve fibers. Biological Cybernetics, 76, 153–162.
Jolivet, R., Lewis, T., & Gerstner, W. (2003). The spike response model: A framework to predict neuronal spike trains. Springer Lecture Notes in Computer Science, 2714, 846–853.
Karlin, S., & Taylor, H. (1981). A second course in stochastic processes. New York: Academic Press.
Keat, J., Reinagel, P., Reid, R., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30, 803–817.
Knight, B., Omurtag, A., & Sirovich, L. (2000). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Computation, 12, 1045–1055.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Levin, J., & Miller, J. (1996). Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380, 165–168.
Machens, C., Wehr, M., & Zador, A. (2003). Spectro-temporal receptive fields of subthreshold responses in auditory cortex. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Miller, K., & Troyer, T. (2002). Neural noise can explain expansive, power-law nonlinearities in neural response functions. Journal of Neurophysiology, 87, 653–659.
Miller, M., & Mark, K. (1992). A statistical study of cochlear nerve discharge patterns in response to complex speech stimuli. Journal of the Acoustical Society of America, 92, 202–209.
Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Physical Review Letters, 89, 288101.
Paninski, L. (2003). Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14, 437–464.
Paninski, L., Fellows, M., Shoham, S., Hatsopoulos, N., & Donoghue, J. (2003). Nonlinear population models for the encoding of dynamic hand position signals in primary motor cortex. Poster session presented at the annual Computational Neuroscience meeting, Alicante, Spain.
Paninski, L., Lau, B., & Reyes, A. (2003). Noise-driven adaptation: In vitro and mathematical analysis. Neurocomputing, 52, 877–883.
Pillow, J., & Simoncelli, E. (2003). Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52, 109–115.
Plesser, H., & Gerstner, W. (2000). Noise in integrate-and-fire neurons: From stochastic input to escape rates. Neural Computation, 12, 367–384.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Reich, D., Victor, J., & Knight, B. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. Journal of Neuroscience, 18, 10090–10104.
Rinott, Y. (1976). On convexity of measures. Annals of Probability, 4, 1020–1026.
Rudd, M., & Brown, L. (1997). Noise adaptation in integrate-and-fire neurons. Neural Computation, 9, 1047–1069.
Sahani, M. (2000). Kernel regression for neural systems identification. Presented at the NIPS00 workshop on Information and Statistical Structure in Spike Trains; abstract available online at http://www-users.med.cornell.edu/∼jdvicto/nips2000speakers.html.
Sahani, M., & Linden, J. (2003). Evidence optimization techniques for estimating stimulus-response functions. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Seeger, M. (2002). PAC-Bayesian generalisation error bounds for gaussian process classifiers. Journal of Machine Learning Research, 3, 233–269.
Seshadri, V. (1993). The inverse gaussian distribution. Oxford: Clarendon.
Simoncelli, E., Paninski, L., Pillow, J., & Schwartz, O. (in press). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge, MA: MIT Press.
Smyth, D., Willmore, B., Baker, G., Thompson, I., & Tolhurst, D. (2003). The receptive-field organization of simple cells in primary visual cortex of ferrets under natural scene stimulation. Journal of Neuroscience, 23, 4746–4759.
Snyder, D., & Miller, M. (1991). Random point processes in time and space. New York: Springer-Verlag.
Stevens, C., & Zador, A. (1996). When is an integrate-and-fire neuron like a Poisson neuron? In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 103–108). Cambridge, MA: MIT Press.
Stevens, C., & Zador, A. (1998). Novel integrate-and-fire-like model of repetitive firing in cortical neurons. In Proceedings of the 5th Joint Symposium on Neural Computation, UCSD. La Jolla, CA: Institute for Neural Computation. Available online at http://www.sloan.salk.edu/zador/rep fire inc.html.
Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–202.
Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Truccolo, W., Eden, U., Fellows, M., Donoghue, J., & Brown, E. (2003). Multivariate conditional intensity models for motor cortex. Society for Neuroscience Abstracts, 607.11.
Tsodyks, M., Kenet, T., Grinvald, A., & Arieli, A. (1999). Linking spontaneous activity of single cortical neurons and the underlying functional architecture. Science, 286, 1943–1946.
van der Vaart, A. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
Victor, J. (2000). How the brain uses time to represent and process visual information. Brain Research, 886, 33–46.
Warland, D., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350.
Williams, C., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE PAMI, 20, 1342–1351.
Yu, Y., & Lee, T. (2003). Dynamical mechanisms underlying contrast gain control in single neurons. Physical Review E, 68, 011901.
Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.

Received January 9, 2004; accepted May 5, 2004.
LETTER
Communicated by Eero Simoncelli
Image Representation by Complex Cell Responses

Ingo J. Wundrich
[email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany

Christoph von der Malsburg
[email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany, and Department of Computer Science, University of Southern California, Los Angeles, CA 90089, U.S.A.

Rolf P. Würtz
[email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
We present an analysis of the representation of images as the magnitudes of their transform with complex-valued Gabor wavelets. Such a representation is a model for complex cells in the early stage of visual processing and of high technical usefulness for image understanding, because it makes the representation insensitive to small local shifts. We show that if the images are band limited and of zero mean, then reconstruction from the magnitudes is unique up to the sign for almost all images.

Neural Computation 16, 2563–2575 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

The first stage of processing of visual stimuli in the cortex is constituted by simple and complex cells in V1. Simple cell responses are modeled to considerable accuracy by linear convolution with Gabor functions (Daugman, 1985; Jones & Palmer, 1987). Pairs of cells differing in phase by 90 degrees are frequently found (Pollen & Ronner, 1981). Complex cells differ from simple cells by showing less specificity concerning the position of the stimulus. Their responses are well modeled by the magnitudes of the Gabor filter responses (Gabor magnitudes, for short) (Pollen & Ronner, 1983). Although the properties of these cells are more complicated, especially concerning temporal behavior, this functional description remains a good approximation to the cell responses.

Besides the biological modeling, convolution with Gabor functions and magnitude building are very useful for technical image processing purposes. The complex cells can be combined into more complicated feature detectors such as corner detectors (Würtz & Lourens, 2000). They have also
proven useful for higher image understanding tasks such as texture classification (Fogel & Sagi, 1989; Portilla & Simoncelli, 2000), recognition of faces (Lades et al., 1993; Würtz, 1997; Duc, Fischer, & Bigün, 1999; Wiskott, Fellous, Krüger, & von der Malsburg, 1997), vehicles (Wu & Bhanu, 1997), and hand gestures (Triesch & von der Malsburg, 1996), and analysis of electron microscopic sections of brain tissue (König, Kayser, Bonin, & Würtz, 2001). Regarding texture classification, in Portilla and Simoncelli (2000) the successful steerable filters have been enhanced to matched filters, at the price of redundancy, because the magnitudes capture important properties of natural textures so well.

The deeper reason for this is that the magnitude operation introduces robustness under local shifts: in the presence of small displacements, the Gabor magnitudes are more robust than the full complex-valued responses because they are much smoother. This robustness is crucial for the registration part of recognition systems, which have to cope with small local deformations. As a practical consequence, similarity landscapes between local features are smoother if magnitudes are used, which makes matching faster and less prone to local maxima (Lades et al., 1993; Würtz, 1997; Wiskott et al., 1997).

If the Gabor functions are arranged into a wavelet transform and the sampling is dense enough (see section 2), then the original image can be recovered from the transform values with arbitrary quality (except for the DC value). Given the useful properties of the magnitudes of the Gabor transform, an important theoretical question is how much image information can be recovered from them. It has been observed experimentally that a recognizable image can be recovered from the Gabor magnitudes (von der Malsburg & Shams, 2002).

In this letter, we present a proof that, given appropriate transform parameters and band limitation, no image information is lost besides the DC value of the image and a global sign. The proof uses techniques from Hayes (1982) and applies to all images except a possible subset of measure zero. The extension to localized Fourier transforms such as Gabor wavelets has not been shown before. Theorems 4 and 5 have been mentioned without details of the proof in Wundrich, von der Malsburg, and Würtz (2002), where we presented an algorithm for image reconstruction from the Gabor amplitudes.

The fact that images can be reconstructed from their Fourier amplitudes or phases alone is not widely known within the computer vision community. There is an extensive discussion on the relative importance of phase and amplitude information, which is reviewed briefly at the beginning of section 4. The practical importance of reconstruction from Gabor amplitudes seems quite limited. For the visual system, it is clearly not a problem because the simple cell information is readily available. The importance of our result lies in the theoretical analysis of the effects of the nonlinearity introduced
by the magnitudes. We demonstrate that, on the one hand, the resulting representation shows invariance under small shifts and, on the other hand, no more information is lost than by subsampling by a factor of two, the global sign, and the DC value.

2 Gabor Wavelets

For the analysis of image properties at various scales, the wavelet transform is in wide use. The image is projected onto a family of wavelet functions, which are derived from a single mother wavelet by translation x_0, rotation with a matrix Q(ϑ), and scaling by a of the image plane:

$$\tilde I(\vec x_0, a, \vartheta) = \frac{1}{a} \int_{\mathbb{R}^2} d^2x \; I(\vec x)\, \psi^*\!\left(\frac{1}{a} Q(\vartheta)(\vec x - \vec x_0)\right). \tag{2.1}$$

The mother wavelet (and, consequently, all wavelets) must satisfy the admissibility condition (Kaiser, 1994), which means that their DC value is zero (ψ̂(0) = 0).

For modeling biological properties as well as to fulfill admissibility, standard Gabor functions are modified by a term that removes their DC value, turning the real and imaginary part into a strict matched filter pair. Following Murenzi (1989), we let

$$\psi(\vec x) = \frac{1}{\sigma\tau} \exp\left(-\frac{1}{2}\left\|S_{\sigma,\tau}\vec x\right\|^2\right)\left(\exp(j \vec x^T \vec e_1) - \exp\left(-\frac{\sigma^2}{2}\right)\right), \tag{2.2}$$

$$\hat\psi(\vec\omega) = \exp\left(-\frac{1}{2}\left\|S_{\sigma,\tau}^{-1}(\vec\omega - \vec e_1)\right\|^2\right) - \exp\left(-\frac{1}{2}\left(\left\|S_{\sigma,\tau}^{-1}\vec\omega\right\|^2 + \sigma^2\right)\right). \tag{2.3}$$

In these equations, the diagonal matrix S_{σ,τ} = Diag(1/σ, 1/τ) controls the shape of the elliptical gaussian relative to the wavelength.

We now switch from continuous functions to discretely sampled images of N_1 × N_2 pixels. This lattice is denoted by S_N, the sampling interval for image and translation space by Δ. To avoid confusion, we use three different symbols for the different Fourier transforms: the continuous one (FT) is denoted by Î(ω), the 2D equivalent of Fourier series (DSFT) by Ǐ(ν), and the completely discretized and finite version (DFT) by Ĭ(ρ), defined on the same lattice S_N. For simplification, we also use normalized DFT coordinates ρ̃ = [ρ_1/N_1, ρ_2/N_2]^T.

The final discretization of our wavelet families in both the spatial and the frequency domain takes the form

$$\psi_{\vec n_0, m, l}(\vec n) = a_{\min}^{-1} a_0^{-m}\, \psi\!\left(a_{\min}^{-1} a_0^{-m}\, Q\!\left(\frac{2\pi l}{L}\right) \Delta(\vec n - \vec n_0)\right), \tag{2.4}$$

$$\breve\psi_{\vec n_0, m, l}(\vec\rho) = \frac{2\pi a_{\min} a_0^m}{\sqrt{N_1 N_2}\,\Delta^2}\, \hat\psi\!\left(a_{\min} a_0^m\, Q\!\left(\frac{2\pi l}{L}\right) \frac{2\pi}{\Delta}\tilde\rho\right) \exp(-j 2\pi \vec n_0^T \tilde\rho). \tag{2.5}$$
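The admissibility of the DC-corrected Gabor function of equation 2.2, i.e., ψ(x) = (1/(στ)) exp(−‖S_{σ,τ}x‖²/2)(exp(jx·e_1) − exp(−σ²/2)), can be checked numerically; the sketch below (σ, τ, and the grid are arbitrary illustration values) samples ψ on a fine grid and verifies that its integral over the plane, and hence its DC value, nearly vanishes:

```python
import numpy as np

# sigma, tau, grid extent, and step are illustration values
sigma, tau = 2.0, 2.0
d = 0.05
x1, x2 = np.meshgrid(np.arange(-15, 15, d), np.arange(-15, 15, d))

envelope = np.exp(-0.5 * ((x1 / sigma)**2 + (x2 / tau)**2))
psi = (1.0 / (sigma * tau)) * envelope * (np.exp(1j * x1) - np.exp(-sigma**2 / 2))

dc = np.abs(psi.sum() * d * d)        # approximates |integral of psi| ~ |psi_hat(0)|
scale = np.abs(psi).sum() * d * d     # reference scale for the comparison
assert dc < 1e-6 * scale              # the DC-removing term makes this vanish
```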
Now, the discrete Gabor wavelet transform can be computed in either domain by the inner product,

$$\tilde I(\vec n_0, m, l) = \sum_{\vec n \in S_N} I(\vec n)\, \psi^*_{\vec n_0, m, l}(\vec n) = \sum_{\vec\rho \in S_N} \breve I(\vec\rho)\, \breve\psi^*_{\vec n_0, m, l}(\vec\rho), \tag{2.6}$$

and the inverse transform becomes

$$\breve I(\vec\rho) = \frac{\Delta^4}{4\pi^2}\, Y_0^{-1}(\vec\rho) \sum_{\vec n_0 \in S_N} \sum_{m=0}^{M-1} \sum_{l=0}^{L-1} a_{\min}^{-2} a_0^{-2m}\, \tilde I(\vec n_0, m, l)\, \breve\psi_{\vec n_0, m, l}(\vec\rho), \tag{2.7}$$

with

$$Y_0(\vec\rho) = \sum_{m=0}^{M-1} \sum_{l=0}^{L-1} \left| \hat\psi\!\left(a_{\min} a_0^m\, Q\!\left(\frac{2\pi l}{L}\right) \frac{2\pi}{\Delta}\tilde\rho\right) \right|^2, \tag{2.8}$$

where Y_0(ρ) is regularized for inversion by assigning Y_0^{-1}(ρ) = 0 where Y_0(ρ) drops below an appropriate threshold. The transform constitutes a frame for a given image class M if ∀I ∈ M: (Y_0(ρ) = 0 ⇒ Ĭ(ρ) = 0). For full details of the Gabor wavelet framework, see Lee (1996).

3 Shift Insensitivity of Gabor Magnitudes

We now argue that the magnitude of the Gabor wavelet transform is less sensitive to shift than the transform itself. Generally, the amount of shift that can be tolerated by a representation is directly related to the frequency content of the underlying signal. For signals without band limitation (like Gabor-filtered images), the frequency content can be measured by the normalized frequency moments introduced by Gabor (1946), who called them mean frequencies. The first one,

$$\vec F_1(f) = \frac{\int \vec\omega\, |\hat f(\vec\omega)|^2\, d^2\omega}{\|f\|^2}, \tag{3.1}$$
can be interpreted as the center of signal energy in frequency space. For the Gabor function from equation 2.3, it is easily checked that F_1(ψ) ≈ e_1; that is, any Gabor function is centered at the frequency that appears in the exponent of the first gaussian in equation 2.3 (due to the admissibility correction, this holds only approximately). For nonpathological images, F_1(I ∗ ψ) ≈ e_1 will also hold, and for any real nonzero image, F_1(I ∗ ψ) must be different from 0. On the other hand, the Fourier transform of |I ∗ ψ|² is
the autocorrelation of Î · ψ̂. Any autocorrelation is symmetric around the origin, and therefore F_1(|I ∗ ψ|²) = 0. This argument shows that the magnitudes consist of lower frequencies and consequently show slower variation than the Gabor bands themselves. It can be refined by looking at the second moment, the variance in the frequency domain, as well.

4 Reconstruction from Fourier Magnitudes

Due to their translation invariance, Fourier magnitudes have been used for pattern recognition (Gardenier, McCallum, & Bates, 1986). The fact that the inverse DFT applied to a modified transform with all magnitudes set to 1 and original phases preserves essential image properties (Oppenheim & Lim, 1981) is frequently interpreted as saying that the Fourier magnitudes contain "less" image information than the phases. However, analytical results and existing phase retrieval algorithms provide hints that the situation is not as simple.

Tadmor and Tolhurst (1993) show that the assumption of the global Fourier amplitudes being irrelevant for the image contents is too simple, although the amplitude spectrum averaged over orientations does not vary too much for natural images. As a consequence, the distribution of image energy across orientations is an important distinguishing image property, and strong orientations can be conveyed by the amplitude spectrum alone. Lohmann, Mendlovic, and Shabtay (1997) show that the relative importance of Fourier phase and amplitude can be reversed by modification of a single pixel. These examples are rejected by Millane and Hsiao (2003) because of remaining correlations of the final phases with the original phases. So far, the discussion is not concluded, because generally accepted notions of relative importance and of images are lacking. The possibility of reconstructing recognizable images from phase or amplitude information alone is not a contradiction to the above results.
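The shift-insensitivity argument of section 3 is easy to illustrate numerically: filter a random 1D signal with a Gabor kernel, shift the input by one sample, and compare how much the complex responses change with how much their magnitudes change. (All kernel parameters here are arbitrary illustration values.)

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)

# 1D Gabor kernel: gaussian envelope (sigma = 8) at center frequency 0.8 rad/sample
x = np.arange(-40, 41)
kernel = np.exp(-x**2 / (2 * 8.0**2)) * np.exp(1j * 0.8 * x)

r = np.convolve(signal, kernel, mode='same')
r_shift = np.convolve(np.roll(signal, 1), kernel, mode='same')

# A one-sample shift rotates the local phase by ~0.8 rad, so the complex
# responses change substantially, while the slowly varying magnitudes barely move.
d_complex = np.linalg.norm(r_shift - r)
d_mag = np.linalg.norm(np.abs(r_shift) - np.abs(r))
assert d_mag < 0.5 * d_complex
```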
It only shows that it is very hard to trace the image information in the presence of the simplest of nonlinearities. In the following, we review two articles (Hayes, 1982; Hayes & McClellan, 1982) that show that almost all images can be reconstructed from their Fourier magnitudes. The argument identifies unique reconstructability with the reducibility of a polynomial in D variables, where D is the signal dimension.

4.1 Polynomials in One or More Dimensions. The set P(n, D) of all polynomials with complex coefficients and total degree n in D variables is a vector space over C of dimension α(n, D). In the case D = 1, all polynomials are reducible according to the fundamental theorem of algebra; that is, they can be factored into polynomials of lower degree. For polynomials in two or more variables, the situation is different. The following is a modified version of a theorem from Hayes and McClellan (1982) and shows that in a
certain sense, reducible polynomials are very uncommon in more than one variable.

Theorem 1. The subset of polynomials in P(n, D) that are reducible over the complex numbers corresponds to a set of measure zero in R^{2α(n,D)}, provided D > 1 and n > 1.

In the following sections, we will make full use of this result for 2D signal processing. The idea is that only images on a finite support are taken into account and that the many wrong phase functions lead to reconstructed images with nonvanishing values outside this support. Throughout this article, we mean by the support of a function of two variables the smallest rectangle with edges parallel to the coordinate axes that contains all nonzero pixels. This should probably bear a different name, but we do not expect any confusion by this simple terminology.

4.2 The Hayes Theorem and Extensions. Hayes's theorem identifies the 2D z-transform,

$$\check I(\vec z) = \frac{1}{2\pi} \sum_{\vec n \in S_N} I(\vec n)\, z_1^{-n_1} z_2^{-n_2}, \tag{4.1}$$

and the 2D discrete space Fourier transform (DSFT) on a compact support, with polynomials in two variables, to which theorem 1 applies.

Theorem 2 (Hayes, 1982). Let I_1, I_2 be 2D real sequences with support S_N, and let L(Ω) be a lattice of |Ω| distinct points in [−π, π[² with |Ω| ≥ (2N_1 − 1)(2N_2 − 1). If Ǐ_1(z) has at most one irreducible nonsymmetric factor and

$$|\check I_1(\vec\nu)| = |\check I_2(\vec\nu)| \quad \forall \vec\nu \in L(\Omega), \tag{4.2}$$

then

$$I_1(\vec n) \in \{ I_2(\vec n),\; I_2(\vec N - \vec n - \vec 1),\; -I_2(\vec n),\; -I_2(\vec N - \vec n - \vec 1) \}. \tag{4.3}$$

Theorem 2 states that DSFT magnitudes-only reconstruction yields either the original, or a negated, a point-reflected, or a negated point-reflected version of the input signal. Together with the statement of theorem 1 that the set of all reducible polynomials Ǐ(z) is of measure zero, the technicality about the irreducible nonsymmetric factors can be omitted, and we generalize theorem 2 to complex-valued sequences as follows (the proof is in appendix A):
Theorem 3. Let I_1, I_2 be complex sequences defined on the compact support S_N, let Ǐ_1(ν) and Ǐ_2(ν) be only trivially reducible (i.e., have factors only of the form z_1^{p_1} z_2^{p_2}), and let

$$|\check I_1(\vec\nu)| = |\check I_2(\vec\nu)| \quad \forall \vec\nu \in L(\Omega), \tag{4.4}$$

with L(Ω), |Ω| as in theorem 2. Then

$$I_1(\vec n) \in \big\{ \exp(j\eta)\, I_2(\vec n),\; \exp(j\eta)\, I_2^*(\vec N - \vec n - \vec 1) \;\big|\; \eta \in [0, 2\pi[ \big\}. \tag{4.5}$$
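The trivial ambiguities listed in equation 4.5 can be confirmed numerically on the DFT grid: a global phase factor and a conjugated point reflection both leave all Fourier magnitudes unchanged. A numpy sketch (random complex image and phase η are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(1)
I1 = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
eta = 0.7  # arbitrary global phase

# Global phase: I2(n) = exp(j eta) I1(n)
I2 = np.exp(1j * eta) * I1
# Conjugated point reflection: I3(n) = exp(j eta) I1*(N - n - 1)
I3 = np.exp(1j * eta) * np.conj(I1[::-1, ::-1])

mag = lambda I: np.abs(np.fft.fft2(I))
assert np.allclose(mag(I1), mag(I2))
assert np.allclose(mag(I1), mag(I3))
```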
5 Extension to Gabor Magnitudes

Theorem 3 is a theorem about arbitrary polynomials; therefore, it can be applied equally well to the magnitudes of a complex spatial image signal for the reconstruction of the discrete Fourier transform. Thus, the following is a consequence of theorem 3:

Theorem 4. Let I_1, I_2 be complex sequences defined on the compact support T_K = {−(K_1 − 1)/2, ..., (K_1 − 1)/2} × {−(K_2 − 1)/2, ..., (K_2 − 1)/2} with K_1, K_2 odd numbers, N_1 ≥ 2K_1 − 1, N_2 ≥ 2K_2 − 1, let I_1(n) and I_2(n) be only trivially reducible, and let

$$|I_1(\vec n)| = |I_2(\vec n)| \quad \forall \vec n \in T_K. \tag{5.1}$$

Then

$$\breve I_1(\vec\rho) \in \big\{ \exp(j\eta)\, \breve I_2(\vec\rho),\; \exp(j\eta)\, \breve I_2^*(-\vec\rho) \;\big|\; \eta \in [0, 2\pi[ \big\},$$

and consequently

$$I_1(\vec n) \in \big\{ \exp(j\eta)\, I_2(\vec n),\; \exp(j\eta)\, I_2^*(\vec n) \;\big|\; \eta \in [0, 2\pi[ \big\}. \tag{5.2}$$

To complete the argument, we now relate the reconstructability from the Gabor magnitudes to the reconstructability from the Gabor transform itself.

Theorem 5 (Gabor magnitude theorem). Let B(N_1, N_2) be the space of all zero-mean band-limited functions on S_N such that Ĭ(ρ) = 0 for |ρ_1| > N_1/4, |ρ_2| > N_2/4, and Ĭ(0) = 0, and let the wavelet family ψ_{n_0,m,l}(n) constitute a frame in B(N_1, N_2). For all I_1, I_2 ∈ B(N_1, N_2) such that Ĩ_1(n_0, m, l) = ⟨I_1, ψ_{n_0,m,l}⟩ and Ĩ_2(n_0, m, l) = ⟨I_2, ψ_{n_0,m,l}⟩ are only trivially reducible polynomials and

$$|\tilde I_1(\vec n_0, m, l)| = |\tilde I_2(\vec n_0, m, l)| \quad \forall \vec n_0, m, l, \tag{5.3}$$

it follows that

$$I_1(\vec n) = \pm I_2(\vec n). \tag{5.4}$$
The proof of theorem 5 is presented in appendix C. Theorem 1 states that almost all images fulfill the irreducibility condition. It follows that images that fulfill the band limitation condition of theorem 5 and have zero mean can be reconstructed from their Gabor amplitudes up to the sign. The band limitation implies that the images are sampled at twice the necessary density in each dimension.

6 Discussion

We have shown that almost all images sampled at twice their critical rate can be recovered from their Gabor magnitudes, DC value, and sign. This means that images are still represented well on the complex cell level. The higher sampling rate looks plausible in neuronal terms, if one considers a single complex number to be represented by four positive real numbers (because cell activities cannot be negative). Four simple cells, which code for the linear wavelet coefficient, must be replaced by four complex cells at slightly different positions in order to convey the same information. Thus, the functional relevance of complex cells seems to be the local translational invariance, and no further information seems to be lost. There is, of course, a residual possibility for some natural images to fall into the subset of measure zero of those images not represented uniquely by their Gabor magnitudes.

We have treated only Gabor wavelets, because we are interested in their properties from the point of view of biological models as well as their technical applications. The reconstructability will also hold for other wavelet types. The conditions on the wavelet should be that its Fourier transform is real, localized in frequency space (to satisfy lemma 1; see appendix B), and without too many zeros in frequency space, so that common nonzero points between subbands can always be found.

Another important question that arose is the stability of the uniqueness result in the presence of noise on the amplitudes.
We were not able to find a satisfactory theoretical answer to this. We have, however, developed a reconstruction algorithm that is based on the principles of the proof presented here and are in the process of tackling this issue numerically. A brief description of the algorithm can be found in Wundrich et al. (2002).

Appendix A: Proof of Theorem 3

From the equality of the squared moduli, it follows that

Ǐ1(ν) Ǐ1*(ν) = Ǐ2(ν) Ǐ2*(ν). (A.1)
The factorization of Ǐ1 and Ǐ2 is

Ǐ1(ν) = |α| exp(jβ) exp(−jν1 p1) exp(−jν2 p2) Ǐ10(ν),
Ǐ2(ν) = |γ| exp(jζ) exp(−jν1 q1) exp(−jν2 q2) Ǐ20(ν),
(A.2)
Image Representation by Complex Cell Responses
2571
where Ǐ10, Ǐ20 are irreducible and normalized. Substituting equation A.2 in A.1 and simplifying yields

|α|² Ǐ10(ν) Ǐ10*(ν) = |γ|² Ǐ20(ν) Ǐ20*(ν).
(A.3)
Since the polynomials on both sides of equation A.3 are normalized, it follows that |γ| = |α|. Because of the irreducibility, one polynomial on the left has to be equal to one on the right of equation A.3. This results in the following two cases:

1. Ǐ20(ν) = Ǐ10(ν)
   Ǐ2(ν) = Ǐ1(ν) exp(j(ζ − β)) exp(−jν1(q1 − p1)) exp(−jν2(q2 − p2))
   I2(n) = exp(j(ζ − β)) I1(n1 − q1 + p1, n2 − q2 + p2),
(A.4)
2. Ǐ20(ν) = Ǐ10*(ν)
   Ǐ2(ν) = Ǐ1*(ν) exp(j(ζ + β)) exp(−jν1(q1 + p1)) exp(−jν2(q2 + p2))
   I2(n) = exp(j(ζ + β)) I1*(−n1 + q1 + p1, −n2 + q2 + p2).
(A.5)
That these two possibilities are compatible with each other can be checked by simple calculations. From the fact that I1 and I2 have the same support, the shifts can be determined, and after substitution of η = ζ ± β, we end up with the two cases

1. I2(n) = exp(jη) I1(n), (A.6)
2. I2(n) = exp(jη) I1*(N − n − 1), (A.7)
which concludes the proof.

Appendix B: Lemma About the Gabor Function

The following lemma is a technicality required for the proof of theorem 5 and allows distinguishing a subband from its point-reflected version.

Lemma 1.

ωᵀe(ϑ) > 0 ⇒ |ψ̂x0,a,ϑ(ω)|² > |ψ̂x0,a,ϑ(−ω)|².
(B.1)
Proof. Because of the wavelet property, it suffices to prove the proposition for the mother wavelet: x0 = 0, ϑ = 0, and a = 1. Then the |·| can be removed, because the functions are real in frequency space. ωᵀe(ϑ) reduces to ω1. After substituting equation 2.3, simplification, and removal of common positive factors, it remains to show that

exp(2σ²ω1) − 2 exp(σ²ω1) > exp(−2σ²ω1) − 2 exp(−σ²ω1),
(B.2)
which is equivalent to

4 sinh(σ²ω1) exp(−σ²ω1) > 0
(B.3)
by elementary manipulations. Equation B.3 holds for all ω1 > 0 and all σ ≠ 0.

Appendix C: Proof of Theorem 5

Due to Plancherel's theorem, I(n0, m, l) is a polynomial:
I1(n0, m, l) = (2π a0^m / (L √(N1 N2))) Σρ∈SN ψ̂(amin a0^m Q2πl/L ρ̃) Ĭ1(ρ) exp(j2π n0ᵀρ̃). (C.1)
I1 and I2 are defined on SN, and their Gabor wavelet transforms are only trivially reducible polynomials in each subband (m, l). For the frequency support argument, we shift the DFT frequency box so that ρ = 0 is located in the middle of it. Then the support of both Ĭ1 and Ĭ2 becomes TK = {−(K1 − 1)/2, . . . , (K1 − 1)/2} × {−(K2 − 1)/2, . . . , (K2 − 1)/2}. Since the images are real, K1 and K2 must be odd numbers, and the support is symmetrical around 0. (By support, we mean that K1 and K2 are the smallest numbers to include all nonzero elements of the images.) Furthermore, we define Ĭs(ρ, m, l) and Ĭs(ρ) as the restrictions of Ĭ(ρ, m, l) and Ĭ(ρ) to TK. From the condition of the theorem, N1 ≥ 2K1 − 1, N2 ≥ 2K2 − 1. Thus, theorem 4 can be applied and leads to the following:

|I2(n0, m, l)| = |I1(n0, m, l)| ∀n0 ∈ SN, ∀m, l ⇒
Ĭ2s(ρ, m, l) ∈ { exp(jη(m, l)) Ĭ1s(ρ, m, l), exp(jη(m, l)) Ĭ1s*(−ρ, m, l) }, η(m, l) ∈ [0, 2π[.
(C.2)
That is, both Gabor transforms must be equal up to a phase and a possible point reflection, both of which may depend on the subband. In the following steps, we remove the ambiguities by exploiting known inter- and intrasubband structure in the frequency domain. First, we show that the magnitudes must be equal and that the point-reflected case cannot happen. Suppose

|Ĭ2s(ρ, m, l)| = |Ĭ1s*(−ρ, m, l)|
(C.3)
or, equivalently,

|Ĭ2s(ρ)| |ψ̆0,m,l(ρ)| = |Ĭ1s(ρ)| |ψ̆0,m,l(−ρ)|.
(C.4)
We pick a ρ0 ≠ 0 such that Ĭ1s(ρ0) ≠ 0 and such that the angle between ρ̃0 and the kernel's center frequency is less than π/2. From lemma 1, it can be concluded that

|ψ̆0,m,l(ρ0)| > |ψ̆0,m,l(−ρ0)|.
(C.5)
Now, if we substitute ρ0 into equation C.4, then equation C.5 can be satisfied only if |Ĭ1s(ρ0)| > |Ĭ2s(ρ0)|, while substituting −ρ0 into equation C.4 makes |Ĭ1s(ρ0)| < |Ĭ2s(ρ0)| necessary. This is a contradiction. Second, we consider the Fourier phases in each subband:

arg Ĭ2s(ρ, m, l) = η(m, l) + arg Ĭ1s(ρ, m, l).
(C.6)
Because the phases of the Gabor functions are equal for both images, it follows that

arg Ĭ2s(ρ) = η(m, l) + arg Ĭ1s(ρ),
(C.7)
except at the points where the Gabor function itself is zero, which are the grid points on the axis through 0 with the angle (2π/L)(l + L/4), that is, the axis perpendicular to the kernel's center frequency. Applying equation C.7 to a ρ0 and −ρ0 outside that axis and outside the zeros of Ĭ2s yields, together with the fact that the images are real,

η(m, l) = 0 ∨ η(m, l) = π.
(C.8)
Choosing any two combinations of m, l, there is always some point that lies on neither of the exceptional axes and where the image is nonzero. Thus, η(m, l) must be equal for those two levels, and consequently for all. This means that either all subbands have the correct Fourier phase or the phase function of all subbands is offset by π, and we conclude that

Ĭ2s(ρ) = ±Ĭ1s(ρ).
(C.9)
From the band limitation in the conditions of the theorem, Ĭ1 and Ĭ2 are zero outside TK, and therefore equation C.9 also holds for them. Finally, the inverse DFT yields

I2(n) = ±I1(n),
(C.10)
which concludes the proof of the theorem.

Acknowledgments

This work has been supported by grants from BMBF, ONR, and ARO.
References

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7), 1362–1373.

Duc, B., Fischer, S., & Bigün, J. (1999). Face authentication with Gabor information on deformable graphs. IEEE Transactions on Image Processing, 8(4), 504–516.

Fogel, I., & Sagi, D. (1989). Gabor filters as texture discriminator. Biological Cybernetics, 61, 103–113.

Gabor, D. (1946). Theory of communication. Journal IEE, 93(III), 429–457.

Gardenier, P. H., McCallum, B. C., & Bates, R. H. T. (1986). Fourier transform magnitudes are unique pattern recognition templates. Biological Cybernetics, 54, 385–391.

Hayes, M. H. (1982). The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30(2), 140–154.

Hayes, M. H., & McClellan, J. H. (1982). Reducible polynomials in more than one variable. Proceedings of the IEEE, 70(2), 197–198.

Jones, J., & Palmer, L. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1233–1258.

Kaiser, G. (1994). A friendly guide to wavelets. Boston: Birkhäuser.

König, P., Kayser, C., Bonin, V., & Würtz, R. P. (2001). Efficient evaluation of serial sections by iterative Gabor matching. Journal of Neuroscience Methods, 111(2), 141–150.

Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3), 300–311.

Lee, T. S. (1996). Image representation using 2D Gabor wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10), 959–971.

Lohmann, A., Mendlovic, D., & Shabtay, G. (1997). Significance of phase and amplitude in the Fourier domain. Journal of the Optical Society of America A, 14(11), 2901–2904.

Millane, R., & Hsiao, W. (2003). On apparent counterexamples to phase dominance. Journal of the Optical Society of America A, 20(4), 753–756.

Murenzi, R. (1989). Wavelet transforms associated to the n-dimensional Euclidean group with dilations: Signal in more than one dimension. In J. M. Combes, A. Grossmann, & P. Tchamitchian (Eds.), Wavelets—Time-frequency methods and phase space (pp. 239–246). New York: Springer-Verlag.

Oppenheim, A. V., & Lim, J. S. (1981). The importance of phase in signals. Proceedings of the IEEE, 69(5), 529–541.

Pollen, D. A., & Ronner, S. F. (1981). Phase relationships between adjacent simple cells in the visual cortex. Science, 212, 1409–1411.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 907–916.
Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71.

Tadmor, Y., & Tolhurst, D. (1993). Both the phase and the amplitude spectrum may determine the appearance of natural images. Vision Research, 33(1), 141–145.

Triesch, J., & von der Malsburg, C. (1996). Robust classification of hand postures against complex backgrounds. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (pp. 170–175). Piscataway, NJ: IEEE Computer Society Press.

von der Malsburg, C., & Shams, L. (2002). The role of complex cells in object recognition. Vision Research, 42(22), 2547–2554.

Wiskott, L., Fellous, J.-M., Krüger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775–779.

Wu, X., & Bhanu, B. (1997). Gabor wavelet representation for 3D object recognition. IEEE Transactions on Image Processing, 6(1), 47–64.

Wundrich, I. J., von der Malsburg, C., & Würtz, R. P. (2002). Image reconstruction from Gabor magnitudes. In H. H. Bülthoff, S.-W. Lee, T. A. Poggio, & C. Wallraven (Eds.), Biologically motivated computer vision (pp. 117–126). New York: Springer-Verlag.

Würtz, R. P. (1997). Object recognition robust under translations, deformations and changes in background. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 769–775.

Würtz, R. P., & Lourens, T. (2000). Corner detection in color images through a multiscale combination of end-stopped cortical cells. Image and Vision Computing, 18(6–7), 531–541.

Received July 22, 2003; accepted May 11, 2004.
LETTER
Communicated by Klaus Pawelzik
Modeling of Synchronized Bursting Events: The Importance of Inhomogeneity

Erez Persi
[email protected]
David Horn [email protected]
Vladislav Volman [email protected]
Ronen Segev [email protected]
Eshel Ben-Jacob [email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
Cultured in vitro neuronal networks are known to exhibit synchronized bursting events (SBEs), during which most of the neurons in the system spike within a time window of approximately 100 msec. Such phenomena can be obtained in model networks based on Markram-Tsodyks frequency-dependent synapses. In order to account correctly for the detailed behavior of SBEs, several modifications have to be implemented in such models. Random input currents have to be introduced to account for the rising profile of SBEs. Dynamic thresholds and inhomogeneity in the distribution of neuronal resistances enable us to describe the profile of activity within the SBE and the heavy-tailed distributions of interspike intervals and interevent intervals. Thus, we can account for the interesting appearance of Lévy distributions in the data.

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2577–2595 (2004)

1 Introduction

How an ensemble of neurons links together to form a functional unit is a fundamental problem in neuroscience. The difficulty in visualizing and measuring the activity of in vivo neuronal networks has accelerated the tendency to pursue surrogate methods. As a result, much effort has been devoted to studying cultured neuronal networks in vitro. It has been suggested (Marom & Shahaf, 2002; Shahaf & Marom, 2001) that large, random networks in vitro are probably the most appropriate experimental model to study the formation of activity of groups of neurons and the response of this functional connectivity to stimulation. Unlike in vivo networks, randomly formed in vitro networks are free of any predefined functional role and enable one to study how a system self-organizes.

Many different forms of synchronized activity of populations of neurons have been observed in the central nervous system (CNS) through the years. However, the role of synchronization in the neural code and information processing is still a matter of considerable debate. Recently, Segev et al. (2001) and Segev, Shapira, Benveniste, and Ben-Jacob (2001) performed long-term measurements of spontaneous activity of in vitro neuronal networks placed on multielectrode arrays. These developing networks show interesting temporal and spatiotemporal properties on many timescales. At early stages (up to a week), the neurons spike sporadically, after which they begin to correlate, leading to the emergence of synchronized bursting events (SBEs). An SBE involves extensive spiking of a large fraction of neurons in the system during approximately 100 msec. One hypothesis (Segev, 2002) is that these dynamic SBEs are the substrate for information encoding at the network level, in analogy with action potentials at the neuronal level. Different morphological structures and sizes of networks were explored (Segev, Benveniste, et al., 2001). All networks exhibit (1) long-term correlation of the SBEs' time sequence manifested by power law decay of the power spectrum density, (2) temporal scaling behavior of neuronal activity during the SBEs (where most of the activity takes place), and (3) temporal scaling behavior of the SBEs' time sequence. Temporal scaling is usually expressed by a power law decay. The time sequences of spikes and SBEs were well fitted to a Lévy distribution, which has a power law decay. Such scale-invariant behavior is characteristic of dynamical systems, which are composed of many nonlinear components (Mantegna & Stanley, 1995; Peng et al., 1993).
Our goal is to elucidate the underlying mechanisms and properties of the different elements in the system responsible for the generation and structure of the observed neuronal (synchronized) activity.

2 Experimental Observations

The experimental system is based on a two-dimensional (2D) 60-multielectrode array on which the biological system is grown. The biological system is composed of both neurons and glia, taken from the cortex of a baby rat. The coupling between the electrodes and the neuronal membranes enables the recording of neuronal electric activity (action potentials). Different morphological types and sizes of networks were explored (Segev, Benveniste, et al., 2001): (1) quasi one-dimensional (1DS), a small network composed of 50 cells; (2) rectangular (2DM), a medium-sized network composed of 10⁴ cells; and (3) circular (2DL), a large-sized network composed of 10⁶ cells. All of these networks shared the same characteristic patterns of activity. The mean rate of SBEs increases with the size of the network, while the variance decreases, but only up to a factor of 2.
2.1 Synchronized Bursting Events. Figure 1 illustrates a characteristic raster plot, revealing the appearance of SBEs and their structure. The SBE involves rapid spiking of almost all the neurons in the network. It starts abruptly and decays over a 100 msec period. During the SBE, each neuron has its own firing pattern, consisting of up to 20 spikes. The SBEs are separated by long quiescent periods (1–10 sec) during which there is almost no activity except for a few sporadic spikes. The mean rate of SBEs is between 0.1 Hz and 0.4 Hz.

2.2 Distribution of Neuronal Activity. The distributions of the interspike interval (ISI) at the neuron level and the interevent interval (IEI) at the network level were analyzed (Segev, Benveniste, et al., 2001). Since the SBE width is of the order of 100 msec, this defines the resolution of IEIs. The ISI and IEI distributions of experimental data reveal heavy tails, indicating possible temporal scaling behavior. To study the appearance of heavy tails in ISI and IEI, the distributions of the ISI and IEI increments, defined by

ΔISI(t) = ISI(t) − ISI(t − 1) (2.1)

ΔIEI(t) = IEI(t) − IEI(t − 1), (2.2)
were investigated. Unlike the ISI and IEI, the increments have symmetric stationary distributions with zero mean. Figure 2 displays the histograms of the ISI increments of three different neurons, as well as the average of all experimental data and the margin of variation. We note that since the activity is very low between any two consecutive SBEs, the first part of the histogram (up to 100 msec) is separated from the second part (from 2 sec), which reveals the heavy-tailed distribution of the IEI increments (since ISI increments larger than 1–2 sec are caused mainly by the SBEs). Between 100 msec and 2 sec, the probability increases slightly for each neuron due to both the synchrony and sporadic activity in the system. We also calculate the distance between the individual neurons' histograms and the averaged histogram using the KL divergence measure, defined by

DKL = Σx P(x) · log2(P(x)/Q(x)). (2.3)
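In discrete form, the divergence between a neuron's histogram P and the averaged histogram Q can be computed as follows (a sketch; the normalization and the handling of empty bins are our assumptions, not specified by the authors):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P||Q) in bits (log base 2).

    p, q: histograms over the same bins; they are normalized here,
    and bins where p is zero contribute nothing to the sum.
    """
    sp, sq = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / sp, qi / sq
        if pi > 0:
            if qi == 0:
                return float("inf")  # P puts mass where Q has none
            total += pi * math.log2(pi / qi)
    return total
```

Note that the measure is asymmetric (D_KL(P||Q) ≠ D_KL(Q||P) in general) and equals zero exactly when the two normalized histograms coincide.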
The values are given in the figure caption. These values will be compared later with model neurons. Since these distributions exhibit large tails, it seems only natural to try to fit them by Lévy distributions, which would also be expected from central limit theorem arguments (see below). In this article, we follow the parameterization S(α, β, γ, δ; 0) (Nolan, 1998, 2004) to describe the Lévy distribution. Every Lévy distribution is a stable distribution. A random variable X is stable if X1 + X2 + · · · + Xn ∼ Cn X + Dn, meaning that the normalized
Figure 1: Typical activity of the experimental network. (Top) The raster plot of 1 msec time bins reveals SBEs. (Middle) Zoom into an SBE. (Bottom) Detailed structure of an SBE, displayed by the number of spikes in the network per time bin.
Figure 2: Positive half of the neurons' ISI histograms (PDF versus time in msec), taken from the experimental data. Solid lines represent the margins of variability. They are calculated by averaging the histograms of all neurons and plotting this averaged PDF ± the standard deviation. Symbols represent data of three selected neurons, also showing the diversity in the data. The KL distance measures of these neurons from the average PDF are 4.1193 · 10⁻⁶, 3.1369 · 10⁻⁶, and 3.1355 · 10⁻⁵, ordered according to the order of the neuron values at Time = 1 from top to bottom.
sum of independent and identically distributed (i.i.d.) random variables is also distributed as X, with a scaling factor Cn = n^(1/α) and a shift Dn. For the data in Figure 2, we find that the ISI distributions are well fitted with a Lévy distribution S(α, 0, γ, 0; 0) up to 100 msec, while the IEIs are well fitted with a Lévy distribution over another three decades. 0 < α ≤ 2 is the index of stability, which
determines the long tail of the distribution, and γ > 0 is a scale factor, which determines the location of the bending point (see the examples in Figure 3). Special cases of the Lévy distributions are the gaussian distribution (α = 2) and the Cauchy distribution (α = 1). All Lévy distributions with α < 2 have divergent variance.

There are several important properties of Lévy distributions worth pointing out. (1) The generalized central limit theorem (Nolan, 2004) states that the normalized sum of i.i.d. variables converges to a Lévy distribution. If the variance is finite, then α = 2, and the convergence is to a gaussian distribution; otherwise, α < 2. (2) If α < 1, the mean is infinite too. (3) For α < 2, the asymptotic behavior of the distribution is given by the power law f(|X|) ∝ |X|^(−(1+α)). In Figure 3, we present examples of different Lévy distributions plotted on log-log scales of the type used in Figure 2. This should help develop the understanding that γ replaces the SD parameter, specifying where the bending of the PDF occurs on the log-log plot, while α determines the slope of the long tail. Since there are no closed forms for these stable distributions (except for particular cases), we use numerical simulations and display histograms instead of analytic distribution functions.

The observed behavior of ISI and IEI implies that their variances diverge; there is no characteristic timescale in the system. Previous models of IF neurons with dynamic synapses were unable to account for these heavy-tailed distributions (Segev, Benveniste, et al., 2001). Our aim is to suggest mechanisms that will generate distributions of the kind observed experimentally.

3 The Model

Our model is based on leaky integrate-and-fire (IF) neurons endowed with frequency-dependent synapses (Markram & Tsodyks, 1996). An IF neuron captures the most important aspects of neuronal behavior: the integration of inputs during the subthreshold period and the generation of a spike once the threshold is reached.
The IF neuron is described by

τmem · dv/dt = −v + Rmem · (Isyn + Iext), (3.1)
where v, τmem, and Rmem are the voltage, time constant, and resistance of the cell membrane, respectively. Once v reaches a threshold, the neuron fires, and v is reset to vres.

Neural networks composed of IF neurons are able to generate bursting activity (Tsodyks, Uziel, & Markram, 1999) in a model based on frequency-dependent synapses. The phenomenological model of dynamic synapses was shown to accurately predict the behavior and properties of various neocortical connections (Markram, Pikus, Gupta, & Tsodyks, 1998; Tsodyks, Pawelzik, & Markram, 1998). The strength of a synapse is described by a
Figure 3: Examples of the positive half of gaussian and Lévy distributions on a logarithmic scale. (First row) Gaussian distributions with SD = 5 (left) and SD = 50 (right). Notice that increasing the variance shifts the bending point to the right but does not affect the asymptotic behavior of the distribution. (Second row) Lévy distributions with α = 1.5, γ = 5 (left), and γ = 50 (right). Notice the slow convergence to zero of the tails. (Third row) Lévy distributions with α = 1.05, γ = 5 (left), and γ = 50 (right). (Fourth row) Lévy distributions with α = 0.5, γ = 5 (left), and γ = 50 (right).
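Histograms like those in Figure 3 can be generated numerically. A minimal sketch using the Chambers-Mallows-Stuck construction for symmetric stable variables (our choice of sampler and function names; the article does not specify how its histograms were produced):

```python
import math
import random

def symmetric_stable_sample(alpha, gamma=1.0, rng=random):
    """Draw one sample from the symmetric stable law S(alpha, 0, gamma, 0; 0)
    via the Chambers-Mallows-Stuck construction."""
    u = (rng.random() - 0.5) * math.pi      # uniform on (-pi/2, pi/2)
    w = rng.expovariate(1.0)                # exponential with mean 1
    if alpha == 1.0:
        return gamma * math.tan(u)          # Cauchy special case
    x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
         * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return gamma * x

def stable_histogram(alpha, gamma, n=100_000, seed=0):
    """Decade-binned histogram of |X|, suitable for log-log plots as in Figure 3."""
    rng = random.Random(seed)
    bins = {}
    for _ in range(n):
        s = abs(symmetric_stable_sample(alpha, gamma, rng))
        if s > 0:
            decade = int(math.log10(s))
            bins[decade] = bins.get(decade, 0) + 1
    return bins
```

For α = 2 this construction reduces to a gaussian (with SD √2·γ in this parameterization); for α < 2 the samples exhibit the heavy tails visible in the lower rows of Figure 3.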
parameter Aij, representing the efficacy of a synapse connecting a presynaptic neuron (j) to a postsynaptic neuron (i). The dynamics of the synapse is described by the following system of differential equations:

dx/dt = z/τrec − u · x · δ(t − tsp) (3.2)
dy/dt = −y/τina + u · x · δ(t − tsp) (3.3)

dz/dt = y/τina − z/τrec, (3.4)
where x, y, z are state variables representing the fraction of ionic channels in the synapse in the recovered, active, and inactive states, respectively, with x + y + z = 1. u represents the fraction of utilization of the recovered state by each presynaptic spike. Once a spike from a presynaptic neuron arrives at the synaptic terminal at time tsp, a fraction u of the recovered state is transferred to the active state, which represents the fraction of open ionic channels through which neurotransmitters can flow. The synaptic current from all n presynaptic neurons to the postsynaptic neuron is therefore

Isyn_i = Σ_{j=1..n} Aij · yij. (3.5)
After a short time τina, the ionic channels switch into the inactive state. From the inactive state, there is a slow process of recovery (τrec >> τina) back to the recovered state, completing a cycle of synapse dynamics. The above description (with a constant variable u) captures well the dynamics of a depressing synapse. The variable u describes the effective use of synaptic resources and can be interpreted as the probability of release of neurotransmitters. In facilitating synapses, each presynaptic spike increases the probability of secreting neurotransmitters into the synaptic cleft. In order to capture the dynamics of facilitating synapses, another equation was added to the model:

du/dt = −u/τfac + U · (1 − u) · δ(t − tsp), (3.6)
where the constant parameter U determines the increase in the value of u each time a presynaptic spike arrives. The initial condition is u = U. Note that when τfac approaches zero, facilitation is not exhibited. When a presynaptic spike arrives, u is updated first, and then all other state variables (x, y, z).

Tsodyks et al. (1999) have demonstrated that a network of IF neurons with dynamic synapses of the type described above generates synchronized bursts. Their network was composed of 400 excitatory neurons and 100 inhibitory neurons, with a probability of 0.1 for a connection between two neurons. Each postsynaptic excitatory neuron is connected to a presynaptic neuron through a depressing synapse, while each postsynaptic inhibitory neuron is connected through a facilitating synapse. The network is partially balanced: on average 3·AEE = AEI, but AIE = AII. The network was fed a fixed external input current Iext that was generated by a random flat distribution centered at the firing threshold (with a range of 5% of the threshold).
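A minimal Euler-integration sketch of one IF neuron driven by a single depressing synapse (equations 3.1–3.5 with constant u = U; all parameter values and function names below are illustrative placeholders of ours, not the values used in the article):

```python
# Sketch of eqs. 3.1-3.5: a leaky IF neuron driven by one depressing synapse.
# All parameter values here are illustrative, not the article's.

DT = 0.1          # integration step, msec
TAU_MEM = 10.0    # membrane time constant, msec
TAU_INA = 3.0     # inactivation time constant, msec
TAU_REC = 800.0   # recovery time constant, msec
R_MEM = 1.0       # membrane resistance
A = 6.0           # synaptic efficacy A_ij
U = 0.5           # utilization of the recovered state per spike
V_TH, V_RES = 1.0, 0.0

def simulate(presyn_spike_times, t_end=200.0, i_ext=0.0):
    """Euler-integrate the synapse state (x, y, z) and membrane voltage v.
    Returns the postsynaptic spike times."""
    x, y = 1.0, 0.0                         # start fully recovered
    v = 0.0
    spikes, t = [], 0.0
    spike_set = set(presyn_spike_times)
    while t < t_end:
        if round(t, 1) in spike_set:        # presynaptic spike arrives:
            dx = U * x                      # a fraction u of the recovered state
            x -= dx
            y += dx                         # moves into the active state
        # continuous part of eqs. 3.2-3.4
        x += DT * ((1.0 - x - y) / TAU_REC) # recovery from the inactive state z
        y += DT * (-y / TAU_INA)            # inactivation of open channels
        i_syn = A * y                       # eq. 3.5 with a single synapse
        v += DT / TAU_MEM * (-v + R_MEM * (i_syn + i_ext))  # eq. 3.1
        if v >= V_TH:                       # threshold crossing: fire and reset
            spikes.append(t)
            v = V_RES
        t += DT
    return spikes
```

A suprathreshold constant input makes the neuron fire periodically, while a single presynaptic spike with these parameters produces only a subthreshold voltage transient; repeated presynaptic spikes deplete x, reproducing the depression described above.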
The rate of SBEs obtained with this description is approximately 1 Hz. After an SBE occurs, it fades away rapidly, since the fast-firing neurons with depressing connections cause a sharp decline of recovering synapses. Between SBEs, there is low-rate activity that enables the recovery of synapses, so Isyn builds up and leads to a new burst. The choice of parameters is given in the appendix.

Trying this approach on our system, we find that it needs modifications. It is difficult to find parameters that fit both the rate of SBEs and the low firing rate of neurons in between SBEs. Moreover, the profile of the experimental SBE (the activity of neurons within this event) rises sharply and decays exponentially, while the model as described so far leads to a narrow (20 msec) gaussian profile. The modifications that we propose are discussed in the next section.

4 Importance of Noise, Inhomogeneity, and Dynamic Thresholds

Here we investigate three modifications of the model. We add noise to the input current, introduce inhomogeneity into the resistances, and add dynamic thresholds that are also inhomogeneously distributed. The simulations reported below were performed on a network of 27 excitatory and 3 inhibitory neurons with 25% connectivity. For simplicity, we chose Vth and Vres to be 1 and 0, respectively, and τmem = 10 msec. Since τina determines the decay of the synaptic currents that dominate during the SBE, we increased τina to 10 msec. This leads to wide bursts on the order of 100 msec.

The biological system is grown on top of a biochemical substrate. Therefore, the neurons in the dish are subjected to sustained changes in the concentration of the different substances that compose their external environment. (For more details on the experimental methods, see Segev, 2002, and Segev, Shapira, et al., 2001.) In addition, each neuron (in the large networks) receives thousands of noisy synaptic inputs.

We introduce an external noisy current in our model in order to account for all these effects, as well as for other hidden biological mechanisms, on both the neuronal and the synaptic levels. The external noise is chosen to be gaussian with an expectation value of µ = 0.86 and a standard deviation of σ = 0.15. Every 10 time steps (of 0.1 msec each), a different value of the external current is drawn. This leads to both quiescence between successive SBEs and a sharp increase in neuronal activity once an SBE starts, as shown in Figure 4. Since µ is below threshold, the neurons' firing rate is very low. This enables the synapses to recover quickly and almost completely before the next SBE. At the point where the synapses are recovered, a single spike from one of the neurons in the network generates a large Isyn in its targets, thus increasing the probability of generating an SBE. When the SBE fades away, the synapses are in the inactive state; hence, the only activity is the one driven by the noisy external currents. The expectation value and variance of the noise control the mean rate of SBEs, which can be adjusted to fit the
Figure 4: The bursting profile due to noisy external currents. (Top) Raster plot shows the quiescence between SBEs. (Middle) Zoom into the SBE at 20 sec. The SBE width is 100 msec. (Bottom) The activity during the SBE has the right profile but is too intense.
experimental values. However, the profile of the simulated SBE builds up too fast in comparison with experiments. The ISI distribution possesses a heavy tail (power-law behavior up to 100 msec), but it does not fit a Lévy distribution. This is a direct result of the high activity during the SBE, since the probability of large ISI during the SBE is too low. To cure these problems, we introduce inhomogeneity into the system. We start with inhomogeneity in neuronal resistances, the Rmem parameters of eq. 2.3. We select them randomly from a flat distribution over the interval [0.1, 0.4]. This modifies each neuronal Isyn, and the correlations between the activity of neurons during the SBE are weakened, leading to a slower SBE buildup. Since Isyn ∼ Aij, this is actually a rescaling of synaptic strengths. The resulting SBE perfectly matches the observed spatiotemporal structure (see Figure 5). Moreover, this choice leads to a Lévy distribution in ISI up to 50 msec, which suggests that the system still lacks a mechanism that allows the probability of large ISI to increase. Next, we introduce dynamic thresholds in order to improve the Lévy distributions of ISI and IEI. Adaptation and regulation are well-known characteristics of neuronal activity. They may be represented by dynamical thresholds, allowing fatigue to set in upon receiving strong stimulation over long periods. For recent discussions of these effects, see Horn & Usher, 1989; Horn & Opher, 2000; Ying-Hui & Wang, 2000; Brandman & Nelson, 2002. Here we introduce dynamic thresholds that change as a function of the neuronal firing rate:

dθi/dt = −(θi − θi(t = 0))/τth + ηi · δ(t − tsp).    (4.1)
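As a minimal sketch (not the authors' code), the dynamic threshold of eq. 4.1 can be coupled to a leaky integrate-and-fire neuron with Euler integration. The threshold parameters echo the appendix (Vreset = 0, Vthreshold = 1, τth = 100 msec); the noisy input drive is a generic stand-in for the network input, and for simplicity we fix a single fatigue-type η rather than drawing it from U(−0.05, 0.05) as the paper does.

```python
import numpy as np

# Sketch of a leaky IF neuron with the dynamic threshold of eq. 4.1:
#   d(theta)/dt = -(theta - theta(t=0))/tau_th + eta * delta(t - t_sp)
# The constant-plus-noise drive is a stand-in, not the paper's synaptic input.

rng = np.random.default_rng(0)
dt, T = 0.1, 2000.0              # time step and duration (ms)
tau_mem, tau_th = 10.0, 100.0    # membrane and threshold time constants (ms)
theta0, v_reset = 1.0, 0.0
eta = 0.02                       # fatigue case; the paper draws eta ~ U(-0.05, 0.05)

v, theta = 0.0, theta0
spike_times = []
for step in range(int(T / dt)):
    noise = 0.5 * rng.standard_normal() * np.sqrt(dt / tau_mem)
    v += dt / tau_mem * (0.9 - v) + noise    # leaky integration, stand-in drive
    theta += dt / tau_th * (theta0 - theta)  # threshold relaxes back to theta(t=0)
    if v >= theta:
        spike_times.append(step * dt)
        v = v_reset
        theta += eta                         # delta-kick at each spike time

print(f"{len(spike_times)} spikes; final threshold {theta:.3f}")
```

With a positive η the threshold rises during strong activity and relaxes during quiescence, which is the fatigue mechanism described in the text; a negative η gives the facilitation case.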
θ describes the change in threshold, ηi is a factor chosen from a flat distribution over the interval [−0.05, +0.05], and τth equals the SBE time width (100 msec), consistent with biological data. Note that our choices for ηi mean that there are two kinds of neurons, with thresholds that increase or decrease during an SBE. In other words, one kind of neuron displays fatigue while the other displays facilitation. Using this description, we obtain a good match (over five decades) to the probability distribution functions of ISI and IEI and to the experimental spatiotemporal structure of SBEs, as demonstrated in Figures 5 and 6. The fit to IEI is interesting. Note that with homogeneous synaptic strengths, refractory periods, membrane time constants, and so forth, the probability of generating an SBE is determined by the time-varying external inputs and the specific connectivity of the network. This leads to periodic-like SBE behavior, as in Tsodyks et al. (1999). By allowing slight dynamical changes in these parameters, we are able to generate aperiodic behavior with long periods of quiescence, resulting in a Lévy distribution. The choice of parameters is given in the appendix. τmem and τref are selected from flat distributions. This enables us to limit the dispersion of the selected
E. Persi, D. Horn, V. Volman, R. Seger, and E. Ben-Jacob
[Figure 5: three panels — (A) number of spikes versus time (msec); (B, C) PDFs versus time on log-log axes.]
Figure 5: Effect of dynamic thresholds and inhomogeneous resistances. (Top) The SBE profile is similar to the experimental one (see Figure 1). (Middle) Circles represent the averaged histogram of ISI histograms of all the simulated neurons. An averaged value is obtained for each time bin. The solid line is a computer-generated Lévy histogram with α = 1.25, γ = 5. Fits to the data followed the maximum likelihood method of Nolan (2001). The temporal resolution of spikes is set at 1 msec. (Bottom) Circles represent the histogram of the simulated IEI time sequence. The solid line is a computer-generated Lévy histogram (α = 1.6, γ = 25). The temporal resolution of SBEs is 100 msec.
[Figure 6: averaged PDF versus time (msec) on log-log axes.]
Figure 6: Effect of τth on the model for ISI. Solid curve: the averaged histogram of all the neurons of the experimental system (same data as in Figure 2). Symbols: averaged histograms calculated from the results of the simulation. (+) curve: τth = 100 msec. (o) curve: τth = 450 msec, α = 0.8, γ = 20. Note that τth affects mainly the slope of the histogram. The KL distance measures between these two models and the averaged experimental histogram are DKL(+) = 9.5695 · 10−6 and DKL(o) = 2.5147 · 10−7.
values of the resistances but is not crucial to obtain the effects we already discussed.

5 Testing the Model

5.1 Sensitivity to Choice of Parameters. To gain further understanding of the model, we test its sensitivity to different parameters. The ratio of the number of excitatory to inhibitory neurons affects the rate and the shape of SBEs. Increasing the number of inhibitory neurons decreases the probability of generating SBEs. Moreover, strong inhibition decreases the activity of neurons during an SBE, resulting in a very dilute event. In the absence of inhibition (i.e., in a purely excitatory network), the network operates in an asynchronous mode, without generating SBEs. However, when synaptic strengths are decreased ten-fold, the network reduces its total activity, and SBEs appear. This effect was reported in Tsodyks et al. (1999). When synaptic strengths are reduced further, SBEs are still obtained, but their profile becomes symmetric. The synaptic strengths (or resistances) determine the shape of the distributions. When Aij is reduced by the same factor for all synapses attached to a postsynaptic neuron, the activity during an SBE is less dense (small Isyn), and the density of SBEs is reduced. This results in an increase of both γ (the bending point) and α (the long-tail decay) of the Lévy distribution of ISI. The effect on the shape of the IEI distribution is very weak. There is a limit to our ability to control the shape of the distributions. Reducing the synaptic strengths leads to a low activity rate with symmetric SBEs and eventually to the disappearance of SBEs. Increasing the synaptic strengths leads to a high activity rate, eventually merging the SBEs into a tonic firing mode. The time constant of the dynamic thresholds, τth, provides further flexibility in adjusting ISI to experimental data. An example is shown in Figure 6, where we compare our model to a specific experiment, obtaining a good fit to a Lévy distribution almost up to 1000 msec.
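The Lévy fits invoked throughout this section can be checked numerically. The following sketch draws symmetric α-stable samples with the Chambers-Mallows-Stuck method and recovers the tail exponent from the survival function; α = 1.25 echoes the ISI fit of Figure 5, while the scale and sample size are arbitrary choices of ours.

```python
import numpy as np

def stable_symmetric(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for standard symmetric alpha-stable variates."""
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(1)
x = np.abs(stable_symmetric(1.25, 200_000, rng))

# P(X > x) ~ x^(-alpha) for large x: fit the slope of the survival
# function on log-log axes over the tail region.
grid = np.logspace(1.0, 2.3, 15)
survival = np.array([(x > g).mean() for g in grid])
slope, _ = np.polyfit(np.log(grid), np.log(survival), 1)
print(f"estimated tail exponent: {-slope:.2f} (samples drawn with alpha = 1.25)")
```

The same survival-function slope, computed on simulated ISI sequences, is one way to quantify how far the power-law regime extends (100 msec versus 1000 msec in the comparison above).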
In fact, the increase of τth allowed us to extend the domain over which the Lévy distribution is valid from 100 to 1000 msec.

5.2 Self-Consistency. Using experimental data, we can test the model. We feed a simulated neuron with spike sequences of the experimental neurons via frequency-dependent synapses while removing the external input. We have data from 36 real neurons and arbitrarily label 32 of them as excitatory presynaptic neurons and 4 as inhibitory ones. We expect the simulated neuron to be synchronized with the network activity, that is, to spike mainly within SBEs. The question to be tested is whether the distribution of its output spike train will match the experimental one (the average over all experimental neurons). The results indicate that the model is consistent with the experimental activity. When a simulated neuron receives a Lévy distributed
stimulus, it responds with a Lévy distributed pattern of activity. The exact pattern is determined mainly by the choice of synaptic strengths. There are only slight differences between the response of an IF neuron with a fixed threshold and an IF neuron with a dynamic threshold, given our set of parameters. In some experimental preparations, we notice that there are a few neurons that have much higher firing rates than others. It would be interesting to determine the character of these neurons, in particular, to understand whether they are inhibitory neurons, whose role is to regulate the activity of the network, or excitatory ones, reacting to the changing environment. An experimental method to answer this question can be suggested on the basis of our model. The real neurons can be stimulated electrically through the electrodes they are attached to. One can inject positive currents into one of the neurons, causing it to spike intensively. The specific balance in our simulated network enables a higher firing rate for inhibitory neurons than for excitatory neurons. This could be tested. Injecting an excitatory neuron with positive currents, thus making it permanently active, we find in our model that this is sufficient to induce permanent activity in any inhibitory neuron connected to that excitatory neuron, but not in any other excitatory neuron. This is demonstrated in Figure 7. It can be used as a basis for mapping out inhibitory neurons in the experimental preparation. It also suggests that the observed spiking neurons in the (unstimulated) preparation are inhibitory neurons whose activity may regulate the system.
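The self-consistency protocol above can be sketched under stated assumptions: a leaky IF neuron driven through Tsodyks-Markram frequency-dependent (depressing) synapses. Parameter values U = 0.5 and τrec = 800 msec echo the appendix; we use only the 32 excitatory inputs, replace the recorded spike trains with Poisson surrogates, and pick a stand-in synaptic weight, so this is an illustration of the mechanism rather than the authors' test.

```python
import numpy as np

# Tsodyks-Markram depressing synapses: resources x recover toward 1 with
# tau_rec and a fraction U of the available resources is released per spike.

rng = np.random.default_rng(2)
dt, T = 0.1, 5000.0                 # ms
n_pre, rate = 32, 5.0               # 32 excitatory inputs at 5 Hz (surrogate)
U, tau_rec, tau_syn = 0.5, 800.0, 3.0
tau_mem, theta, v_reset = 10.0, 1.0, 0.0
weight = 15.0                       # stand-in synaptic efficacy

x = np.ones(n_pre)                  # fraction of available synaptic resources
i_syn, v = 0.0, 0.0
out_spikes = 0
for _ in range(int(T / dt)):
    pre = rng.random(n_pre) < rate * dt / 1000.0   # Poisson presynaptic spikes
    i_syn += weight * np.sum(U * x[pre])           # released resources -> PSC
    x[pre] -= U * x[pre]                           # depression at each spike
    x += dt / tau_rec * (1.0 - x)                  # recovery toward 1
    i_syn -= dt / tau_syn * i_syn                  # PSC decay
    v += dt / tau_mem * (-v + i_syn)
    if v >= theta:
        out_spikes += 1
        v = v_reset

print(f"output spikes in {T / 1000:.0f} s: {out_spikes}")
```

Because the synapses depress, the postsynaptic neuron responds preferentially to coordinated bursts of presynaptic input, which is what makes it lock onto the SBEs of the recorded network.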
[Figure 7: two raster plots of neuron index versus time (sec).]
Figure 7: Response of the model to external stimulations. (Top) Positive external current is injected into an excitatory neuron (number 9). The target inhibitory neurons 2 and 21 respond with continuous spiking. Target neuron 8 is also inhibitory and shows extensive spiking outside SBEs. The rest are excitatory neurons that spike only within SBEs. (Bottom) Positive external current is injected into an excitatory neuron (number 5) that does not have any inhibitory neurons as targets. Hence, no other neurons exhibit continuous spiking behavior.

6 Discussion

From the results of section 4, we conclude that inhomogeneities in neural parameters lead to correct behavior of the neuronal activity: (1) time-varying external currents lead to the rising spatiotemporal profile of an SBE; (2) inhomogeneity in the resistances leads to the detailed structure of an SBE; (3) the accurate distribution of spikes is obtained using dynamic thresholds; and (4) the Lévy distribution of SBEs is also obtained due to inhomogeneity in the system. Dynamic thresholds are often used to account for neuronal adaptation through their increase. In the model of section 4, we have employed two types of neurons, with both increasing and decreasing dynamical thresholds. This choice is consistent with known biological observations of a large firing repertoire of neurons (Meunier & Segev, 2001). Moreover, a recent study of interneurons (Markram, 2003) reported the existence of many types of inhibitory neurons, each having its own characteristic electrical response to stimulus. It was suggested long ago (Gerstein & Mandelbrot, 1964) that if the membrane voltage has a random walk behavior, one can obtain many heavy-tailed distributions. Dynamic thresholds of the kind we use impose a random walk-like behavior on the membrane voltage and were necessary to obtain the distributions observed experimentally, so our results support their suggestion. To reinforce this claim, we also examined nonlinear one-dimensional models of neurons, which describe more accurately the changes in the membrane voltage. Using these models in the absence of dynamic thresholds did not lead to better results. However, a two-dimensional model that endows the membrane with a more complicated response, by making partial use of ionic channel dynamics, can explain the neuronal distributions we see in the experiments (Volman, Baruchi, Persi, & Ben-Jacob, 2004). The variance of the heavy-tailed distributions of ISI diverges. This may imply that the system we study has the potential to operate on many different timescales. This flexibility is important to allow different neuronal networks, with different functional roles, to be responsive to different timescales depending on the specific inputs they have to process. Therefore, when processing a given task (or stimulation), we would expect to observe changes in the shape and variability of the discussed distributions. It will be interesting to understand which features of the stimulation will be coded by individual neurons and which features (if any) will be coded at the whole-network level. This will be tested in the future. An important open problem is to understand why networks of different sizes behave similarly. We believe that this can be understood by invoking the concept of neural regulation (Horn, Levy, & Ruppin, 1998) or synaptic scaling (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Abbott & Nelson, 2000). Preliminary studies show this to be the case, but it requires further investigation.

Appendix

Parameters of the simulation in Tsodyks et al. (1999): 400 excitatory neurons and 100 inhibitory neurons. Reset and threshold membrane potentials: Vreset = 13.5 and Vthreshold = 15. Resistances are 1.
External currents were drawn from a uniform distribution centered at the threshold level with a range of 5% of the threshold. Time constants of the neurons: τmem = 30 msec; refractory periods of 2 and 3 msec for inhibitory and excitatory neurons, respectively. Synaptic parameters were taken from gaussian distributions. Standard deviations are half of the following average values (with cutoffs of 2·average and 0.2·average): A(ee) = 1.8, A(ei) = 5.4, A(ie) = A(ii) = 7.2. Average values of utilization: U(ee) = U(ei) = 0.5, U(ie) = U(ii) = 0.04. Average time constants: τrec(ie) = τrec(ii) = 100 msec, τrec(ee) = τrec(ei) = 800 msec, τfac(ie) = τfac(ii) = 1000 msec. Fixed time constant: τina = 3 msec. It is reported that the results of this simulation are robust under changes of up to 50% in the average values. Parameters of our model: 27 excitatory neurons and 3 inhibitory neurons. Reset and threshold membrane potentials: Vreset = 0 and Vthreshold = 1. Resistances to synaptic currents ∼ U(0.1, 0.4). Resistance to external currents
remains 1. External currents ∼ N(0.86, 0.15). Time constants of the neurons: τmem ∼ U(9.5, 10.5) msec; refractory periods ∼ U(2, 3) msec for all neurons. Fixed time constant: τina = 10 msec. All other parameters are unchanged.

Acknowledgments

This work was supported by a research grant of the Israel Science Foundation.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience Supplement, 3, 1178–1183.
Brandman, R., & Nelson, M. E. (2002). A simple model of long-term spike train regularization. Neural Computation, 14, 1575–1597.
Gerstein, G., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Horn, D., Levy, N., & Ruppin, E. (1998). Memory maintenance via neuronal regulation. Neural Computation, 10, 1–18.
Horn, D., & Opher, I. (2000). Complex dynamics of neuronal thresholds. Neurocomputing, 32, 161–166.
Horn, D., & Usher, M. (1989). Neural networks with dynamical thresholds. Phys. Rev. A, 40, 1036–1044.
Mantegna, R. N., & Stanley, H. E. (1995). Scaling behaviour in the dynamics of an economic index. Nature, 376, 46–49.
Markram, H. (2003). Molecular determinants of neuronal diversity. Lecture at the Brain in Motion symposium, Lausanne.
Markram, H., Pikus, D., Gupta, A., & Tsodyks, M. (1998). Potential for multiple mechanisms, phenomena and algorithms for synaptic plasticity at single synapses. Neuropharmacology, 37, 489–500.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Marom, S., & Shahaf, G. (2002). Development, learning and memory in large random networks of cortical neurons: Lessons beyond anatomy. Quarterly Reviews of Biophysics, 35, 63–87.
Meunier, C., & Segev, I. (2001). Neurons as physical objects: Structure, dynamics and function. In F. Moss & S. Gielen (Eds.), Handbook of biological physics (Vol. 4, pp. 353–466). Dordrecht: Elsevier.
Nolan, J. P. (1998). Parameterization and modes of stable distributions. Statistics and Probability Letters, 38, 187–195.
Nolan, J. P. (2001). Maximum likelihood estimation and diagnostics for stable distributions. In O. E. Barndorff-Nielsen, T. Mikosch, & S. I. Resnick (Eds.), Lévy processes: Theory and applications. Boston: Birkhäuser.
Nolan, J. P. (2004). Stable distributions: Models for heavy-tailed data. Boston: Birkhäuser.
Peng, C. K., Mietus, J., Hausdorff, J. M., Havlin, S., Stanley, H. E., & Goldberger, A. L. (1993). Long-range anticorrelations and non-gaussian behavior of the heartbeat. Phys. Rev. Lett., 70, 1343–1346.
Segev, R. (2002). Self-organization of in vitro neuronal networks. Unpublished doctoral dissertation, Tel-Aviv University.
Segev, R., Benveniste, M., Shapira, Y., Hulata, E., Cohen, N., Palevski, A., Kapon, E., & Ben-Jacob, E. (2001). Long-term behavior of lithographically prepared in vitro neural network. Physical Review Letters, 88, 118102.
Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2001). Observation and modeling of synchronized bursting in two-dimensional neural network. Physical Review E, 64, 011920.
Shahaf, G., & Marom, S. (2001). Learning in networks of cortical neurons. Journal of Neuroscience, 21, 8782–8788.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835.
Tsodyks, M., Uziel, A., & Markram, H. (1999). Synchrony generation in recurrent networks with frequency-dependent synapses. Journal of Neuroscience, 20, RC50, 1–5.
Turrigiano, G., Leslie, K., Desai, N., Rutherford, L., & Nelson, S. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–895.
Volman, V., Baruchi, I., Persi, E., & Ben-Jacob, E. (2004). Generative modelling of regulated dynamical behavior in cultured neuronal networks. Physica A, 335, 249–278.
Ying-Hui, L., & Wang, X. J. (2000). Spike-frequency adaptation of a generalized leaky integrate-and-fire model neuron. Journal of Computational Neuroscience, 10, 25–45.

Received July 30, 2003; accepted May 21, 2004.
LETTER
Communicated by John Hertz
Mean Field and Capacity in Realistic Networks of Spiking Neurons Storing Sparsely Coded Random Memories

Emanuele Curti
[email protected]
INFM, Dipartimento di Fisica, Università di Roma La Sapienza, 00185 Rome, Italy

Gianluigi Mongillo
[email protected]
Dipartimento di Fisiologia Umana, Università di Roma La Sapienza, 00185 Rome, Italy

Giancarlo La Camera
[email protected]
Institute of Physiology, University of Bern, 3012 Bern, Switzerland

Daniel J. Amit
[email protected]
INFM, Dipartimento di Fisica, Università di Roma La Sapienza, 00185 Rome, Italy, and Racah Institute of Physics, Hebrew University, Jerusalem, Israel
Mean-field (MF) theory is extended to realistic networks of spiking neurons storing in their synaptic couplings randomly chosen stimuli of a given low coding level. The underlying synaptic matrix is the result of a generic, slow, long-term synaptic plasticity of two-state synapses, upon repeated presentation of the fixed set of stimuli to be stored. The neural populations subtending the MF description are classified by the number of stimuli to which their neurons are responsive (multiplicity). This involves 2p + 1 populations for a network storing p memories. The computational complexity of the MF description is then significantly reduced by observing that at low coding levels (f), only a few populations remain relevant: the population of mean multiplicity (∼ pf) and those of multiplicity of order pf around the mean. The theory is used to produce (predict) bifurcation diagrams (the onset of selective delay activity and the rates in its various stationary states) and to compute the storage capacity of the network (the maximal number of single items used in training for each of which the network can sustain a persistent, selective activity state). This is done in various regions of the space of constitutive parameters for the neurons and for the learning process. The capacity is computed in MF versus potentiation amplitude, ratio of potentiation to depression probabilities, and coding level f. The MF results compare well with recordings of delay activity rate distributions in simulations of the underlying microscopic network of 10,000 neurons.

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2597–2637 (2004)
E. Curti, G. Mongillo, G. La Camera, and D. Amit
1 Introduction

Mean-field (MF) theory has provided a useful tool for obtaining fast and reliable insight into the stationary states (alias persistent delay activity states, attractors, working memory) of recurrent neural networks. For networks of binary neurons (the Amari-Hopfield model; Amari, 1972; Hopfield, 1982), it allowed a very detailed description of the landscape of stationary states and even an accurate estimate of the storage capacity, the maximum number of memories that can be stored in the synaptic matrix and recalled (Amit, Gutfreund, & Sompolinsky, 1987; Amit, 1989). Its effectiveness is due to the high connectivity of the network (the large number of presynaptic contacts afferent on each neuron) and the relatively weak effect of a single neuron on another. Its precision is substantiated by detailed microscopic simulations. The systematic MF study of the Amari-Hopfield model relied on two features: (1) the symmetry of the synaptic matrix, which allowed the application of techniques of statistical mechanics, and (2) the separability of synaptic efficacies, which allowed a description of the states of the network in terms of global, collective variables. These studies were extended to the storage of (0-1) neurons with low coding levels (Tsodyks & Feigelman, 1988; Buhmann, Divko, & Schulten, 1989) and to neurons with continuous response curves (Kuhn, 1990). In all these cases, it was found possible to describe the stationary states of the network in terms of collective variables, that is, in terms of population-averaged similarities (overlaps) between the state of the system and the memories embedded in the synapses. The Willshaw model (Willshaw, Buneman, & Longuet-Higgins, 1969), which is fully symmetric, has not found an MF description due to the nonseparability of its synaptic efficacies, except in the limit of low coding level and high storage (Golomb, Rubin, & Sompolinsky, 1990).
As we show below, this treatment is a particular case of the approach we present: the limit of vanishing depression probability in training. The experimental reality that such models capture is learned persistent selective delay spiking activity in temporal and prefrontal cortex (Miyashita & Chang, 1988; Miller, Erickson, & Desimone, 1996). To come nearer to the biological systems, the elements are chosen as spike-emitting units of two types, excitatory and inhibitory, occurring in different proportions and with different characteristics. The synaptic matrix connecting the neurons expresses the randomness of the connectivity, as well as the efficacies that existing synapses assume. In such networks, the afferent currents are driven by presynaptic spikes, and these currents feed the neural membrane depolarization. A neuron emits a spike whenever its depolarization crosses the threshold. Given a choice of the numbers of neurons of both types and their individual characteristics, as well as the synaptic matrix, the dynamics of the system can be simulated at the microscopic level. Such simulation serves as a substrate for recordings, upon presentation of stimuli in protocols representing various tasks, as would be the case for the corresponding experimental
Mean Field with Random Sparse Memories
situation (see, e.g., Amit, 1998). In such recordings, one can register spike rasters for samples of neurons and monitor average rates (or higher-order statistics) over relevant time intervals, as is commonly done in experiments. The space of parameters, though, is vast: for example, each type of neuron has several parameters, and the synapses of different types (excitatory-excitatory, excitatory-inhibitory, inhibitory-excitatory, inhibitory-inhibitory) can have different mean amplitudes and involve different delays. This renders an MF description quite essential. Scanning the parameter space to find realistic yet computationally interesting zones of parameters in microscopic simulations with a sensible number of spiking units (about 10,000) is computationally very intensive, while MF scanning of network state space is very rapid. This becomes even more pressing when one investigates the collective properties of learning, with dynamic, plastic, spike-driven synapses. The difficulty is that in more realistic situations, the synaptic matrix is strongly nonsymmetric, so there is no recourse to statistical mechanics. Nor can one expect a factorizable synaptic matrix. In the case considered here, the sparse and weak connectivity renders the neuronal firing rather weakly correlated. As a consequence, the high number of synaptic connections renders the afferent current to a neuron approximately gaussian and uncorrelated in time (for a step toward more general situations, see Brunel & Sergi, 1998; Fulvi Mari, 2000; Moreno & Parga, 2004).
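The MF closure built on this gaussian-current approximation can be sketched compactly: population rates set the mean and variance of the afferent current, and a transduction function maps those back into rates, with the stationary states given by the self-consistent fixed points. The LIF transduction of Amit and Brunel (1997a) requires the Ricciardi integral, so this sketch substitutes a generic sigmoid; all parameter values are illustrative stand-ins, not the paper's.

```python
import numpy as np

C, J, tau = 1000, 0.005, 0.01        # contacts, efficacy, membrane tau (s)
nu_ext, nu_max, theta = 5.0, 40.0, 0.5

def current_stats(nu):
    """Mean and variance per unit time of the (gaussian) afferent current."""
    mu = C * J * (nu + nu_ext) * tau
    sigma2 = C * J**2 * (nu + nu_ext) * tau
    return mu, sigma2

def transduction(mu, sigma2):
    # Stand-in transfer function: rate rises with the mean current,
    # smoothed by the size of its fluctuations.
    return nu_max / (1.0 + np.exp(-(mu - theta) / np.sqrt(sigma2)))

def fixed_point(nu0, n_iter=500):
    nu = nu0
    for _ in range(n_iter):          # damped iteration: output rate = input rate
        mu, sigma2 = current_stats(nu)
        nu = 0.5 * nu + 0.5 * transduction(mu, sigma2)
    return nu

low, high = fixed_point(0.1), fixed_point(30.0)
print(f"low-activity state: {low:.3f} Hz, elevated state: {high:.2f} Hz")
```

With these stand-in numbers the iteration finds two coexisting self-consistent states depending on the starting rate, which is the kind of bistability (spontaneous versus selective delay activity) that the bifurcation diagrams of the paper map out.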
This allows for an MF description in terms of the average rates in functionally distinct populations (Amit & Brunel, 1997a).(1) In this situation, it is possible to pass from the dynamics of the depolarization of all neurons (in the microscopic description) to the dynamics (when slow) of the quantities driving the neural spike rates: the means and the variances per unit time of the afferent currents.(2) The dynamics of these quantities is driven by the rates of all neurons in the system. Given these two quantities for a given neuron, the average emission rate of the neuron is determined by its transduction function. Thus, the dynamics is closed, in that the rates determine the dynamics of the afferent currents, and those in turn determine the rates (Amit & Tsodyks, 1991; Amit, Brunel, & Tsodyks, 1994; Hopfield, 1984). Thus far, the theory has been pursued only for the case of disjoint populations corresponding to different stimuli. In other words, each neuron was responsive to at most one visual stimulus.(3) While this approach turned out to be useful and rather accurate in its results, compared with the microscopic simulations (Brunel, 2000), it suffered from several defects. First, the extremely sharp tuning curves are not realistic; neurons in experiments are rarely responsive to only one stimulus. Nor are they so selective in delay activity. Second, disjoint populations imply a very low storage capacity of the network. If the fraction of excitatory neurons in a selective population is f, one can store at most 1/f different stimuli in the network. On the empirical side, progress in this field must cope with the finding of Miyashita and Chang (1988) and Miyashita (1988), who observed that a column 1 mm in diameter of IT cortex is able to sustain as many as 100 distinct, selective delay activity distributions, each employing about 2% to 3% of the cells in the column (Brunel, 1996). Each of these delay activity states corresponds to one fractal image. Hence, cells must be active in the delay activity state of more than one image, since 100 × (2–3%) is more than 100%. Here we present an extension of the MF (population rate) dynamics to the case in which the populations selected (as responsive) for p different stimuli are selected at random. Hence, if the coding level is f, any two populations have a fraction f of each common to both (a fraction f² of the total number of cells); a fraction f³ of the neurons in the network will be common, on average, to any three populations coding for different stimuli. The basic ideas have been suggested by La Camera (1999): to classify neurons in the network by the number of stimuli (of the training set) to which they are responsive (the multiplicity of a cell) and by whether the selected stimulus is or is not among them. Previously, when one population was selected by a stimulus (either in response to stimuli or in selective delay activity), the excitatory neurons were divided into three homogeneous populations: one for the cells currently selected, one for all neurons selective to remaining stimuli and not currently selected, and one for all neurons not selective to any stimulus (background).

(1) For a precursor of this approach, see, e.g., Wilson and Cowan (1972).
(2) In the following, we will refer to them as the mean and the variance, to retain the language used in Amit and Brunel (1997a).
(3) Visual is used here in the restricted sense that cells in IT and PF express an elevated rate when visual stimuli are presented. These responses are not considered to be directly related to specific features of the visual objects.
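The counting behind the multiplicity classification can be sketched directly: if each of p stimuli recruits a random fraction f of the cells, a neuron's multiplicity is Binomial(p, f), so population sizes peak sharply around the mean pf when f is small. The values p = 100 and f = 0.025 echo the IT column figures quoted above; the code itself is our illustration, not the paper's.

```python
import numpy as np
from math import comb

p, f = 100, 0.025
# Probability that a neuron is responsive to exactly a of the p stimuli.
pmf = np.array([comb(p, a) * f**a * (1 - f) ** (p - a) for a in range(p + 1)])

mean = p * f                       # mean multiplicity, here 2.5
mode = int(np.argmax(pmf))         # modal multiplicity
mass_near_mean = pmf[:6].sum()     # multiplicities 0..5, i.e., within ~pf of the mean

print(f"mean multiplicity {mean}, modal multiplicity {mode}, "
      f"fraction with multiplicity <= 5: {mass_near_mean:.3f}")
```

Almost all of the probability mass sits within a few multiplicities of pf, which is why only a reduced set of populations contributes appreciably to the feedback in the MF reduction.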
Now there are 2p populations of excitatory neurons: p for selected neurons with all multiplicities (1, ..., p) and p for nonselected neurons, with all possible multiplicities (0, ..., p − 1). The latter include neurons of multiplicity 0 (not selective to any stimulus). Including inhibition, the MF dynamics would deal with 2p + 1 populations. It is then verified that if the average rates of cells within each of these 2p + 1 populations are equal and the synaptic matrix is generated in a rather organic learning process of long-term potentiation (LTP) and long-term depression (LTD) transitions between two discrete states of synaptic efficacy (Brunel, Carusi, & Fusi, 1998), then the statistics of afferent currents (mean and variance) to the cells in each of these populations is equal, and hence these populations are natural candidates for MF analysis. The approach is reinforced by observing that in the sparse coding limit, for any finite value of p, the number of populations that actually affect the afferents of all neurons is very much lower than 2p + 1. This is due to the fact that the number of neurons in a population varies strongly with the multiplicity of the population. In fact, when f ≪ 1, this number is sharply peaked in populations with multiplicities around the mean pf. As a consequence, only a reduced number of relevant populations (of order pf)
contributes significantly to the feedback. The rates of neurons in all other populations are determined by the afferents from the relevant populations. We first present a general description of the MF approach. Section 3 describes the methods used to derive and solve the full and reduced MF equations for the stationary states, as well as the simulation process employed to test the MF predictions. These are followed by the results section, including MF bifurcation diagrams compared with simulation rates and the dependence of the critical capacity of the network on the potentiation-to-depression efficacy, the potentiation-to-depression probability, the inhibitory efficacy, and the coding level. In section 5, we consider prospects for applications of the new approach as well as its relation to the Willshaw model.

2 Mean-Field Approach and Synaptic Structuring

2.1 Generalities. In this section, we recapitulate general ideas that allow for a network of spiking neurons to be described by collective variables, that is, mean rates in statistically homogeneous neural populations, and extend these ideas to the case of randomly chosen selectivity for stimuli and to a synaptic matrix generated by slow learning over many repeated presentations of such sets of stimuli. Section 3 will deal with the technical aspects of the resulting framework with a specific model for the spiking neuron. The MF approach consists in dividing the network into distinct and statistically homogeneous subpopulations: a neuron belongs to at most one subpopulation, and two neurons belonging to the same subpopulation have the same statistics of afferent synapses. One then assumes that all neurons in the same subpopulation have an equal average spike emission rate. This renders the statistics of afferent currents homogeneous within each subpopulation. The statistics of the afferent currents in turn determines the average emission rate in the subpopulation.
The steady mean emission rate in each subpopulation is obtained by requiring self-consistency: that the output rates (generated by the transduction function) be equal to the input rates (which determine the input currents). When neurons within the network respond to more than one stimulus, the subpopulations formed by collecting neurons responding to a given stimulus are in general not disjoint. Cells responding to more than one stimulus belong to more than one subpopulation. The consequence is that these overlapping populations cannot be used to carry out an MF analysis, in the sense described above, as was done in Amit and Brunel (1997a). In order to obtain subpopulations suitable for MF analysis, we follow La Camera (1999) and lump together neurons responding to the same number of stimuli, regardless of their identity. These, we show next, solve the problem.

2.2 The Model. The model network is much as in Amit and Brunel (1997a). It is composed of NE excitatory and NI inhibitory spiking neurons. Each neuron receives, on average, CE excitatory and CI inhibitory synaptic
contacts from presynaptic neighbors (i.e., neurons with a direct afferent on the given neuron), randomly and independently selected. Neurons also receive Cext excitatory contacts, with efficacy Jext, coming from outside the network and carrying noise as well as signals of presented stimuli. These external afferents emit spikes independently, as Poisson processes with mean rate νext. Plasticity is restricted to the excitatory-to-excitatory (EE) synapses. An existing EE synapse has two possible efficacies: potentiated, Jp, or depressed, Jd < Jp. In this study, the distribution of the EE synapses is conceived as the outcome of a long, slow training stage, in which the p stimuli to be memorized are repeatedly presented in a random sequence. The synaptic distribution is kept fixed throughout the analysis (see below). The remaining synapses have fixed, unstructured efficacies. The distribution of these efficacies (mean and variance) depends on the type of pre- and postsynaptic neurons (EI, IE, II). A stimulus is characterized by the set of excitatory neurons that are responsive to it. Of particular interest is the subset of cells whose rate is enhanced by the stimulus presentation to a level that can lead to synaptic plasticity. When the cell populations are selected at random, with a given coding level f (average fraction of cells in the assembly responsive to the stimulus), cells responsive to one stimulus may respond to another (a fraction f^2 of the cells, on average) and to more than two. Every cell can be characterized (uniquely) by specifying the list of stimuli it responds to and does not respond to. If the network has been exposed systematically to p stimuli, a cell is fully defined by the p-bit word, its response pattern, with 1 (0) at position k indicating that it is responsive (nonresponsive) to stimulus k. Thus, the p-bit word (0, 1, 1, 1, 0, 0, . . .
, 0) corresponding to a given cell indicates that the cell responds exclusively to stimuli 2, 3, and 4 out of p, although the rates are not necessarily equal. The number of 1's in the response pattern of a neuron will be called its multiplicity, α, β, . . . . It should be emphasized that the (0/1) are not the responses of the neurons to a stimulus, but merely shorthand for whether a neuron has a strong or weak response to the stimulus. Consider two given cells and the synapse connecting them. The corresponding two p-bit response patterns determine what will happen to that synapse when a given stimulus is presented. The structuring process is envisaged to be stochastic and slow. If the bit corresponding to the stimulus is 1 in both response patterns and the synapse is in its low state, the synapse will potentiate with probability q_+ (J_d → J_p, LTP); if the presynaptic bit is 1, the postsynaptic bit is 0, and the synapse is in its high state, the synapse will depress with probability q_− (J_p → J_d, LTD);4 and if the presynaptic bit is 0, it will not change. In the following we use

q_+ = q,   q_− = fρq.   (2.1)

4 This LTD mechanism can be easily generalized. See, e.g., Brunel et al. (1998).
Mean Field with Random Sparse Memories
This scaling of q_−/q_+ optimizes network capacity (Amit & Fusi, 1994). ρ is a parameter of order 1, defined by equation 2.1, regulating the LTD-to-LTP probability ratio. Suppose that during training, the p stimuli are shown at a uniform rate, with equal probability. The total number P of potentiation conditions (1,1) seen by a given synapse and the total number D of depression conditions (0,1) are determined by the two p-bit response patterns of the two cells connected by it.

2.3 Asymptotic Synaptic Structuring. The synaptic dynamics is a Markov process: the presentation of the stimuli induces a random walk among the two states of every synapse. In general, because the synapse has only two states, its final state following a given sequence of stimuli could depend on the order of presentation. If the transition probabilities are low (slow learning limit, small q_+, q_−), the variability from sequence to sequence becomes negligible. What matters is the number of potentiating or depressing patterns seen by the synapse as a consequence of the presentation of the whole training set, and not the order in which they occurred (Brunel et al., 1998). In this limit, the probability of finding a synapse in its potentiated state, J_p, is obtained by averaging over all possible realizations of the random sequences of presentation (Brunel et al., 1998), as

γ(P, D) = q_+P / (q_+P + q_−D) = P / (P + fρD).   (2.2)
P (D) is the number of potentiating (depressing) pairs of activity seen by the synapse over the presentation of the whole sequence. Equation 2.2 applies when at least one of the p stimuli tends to potentiate or depress the synapse. Otherwise, the synapse is not affected, and the final probability (i.e., following training) that the synapse is in the potentiated state coincides with the initial probability γ_0 (i.e., prior to training and uncorrelated with any of the stimuli):

γ(P = 0, D = 0) = γ_0.   (2.3)
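In code, the asymptotic structuring rule of equations 2.2 and 2.3 amounts to a few lines. The following Python sketch is our illustration (the function name and argument order are ours, not from the text):

```python
def gamma(P, D, f, rho, gamma0):
    """Asymptotic probability that a synapse is potentiated (eqs. 2.2-2.3).

    P, D: number of potentiating (1,1) and depressing (0,1) conditions seen
    by the synapse; f: coding level; rho: LTD-to-LTP probability ratio
    (q- = f*rho*q+); gamma0: potentiation probability prior to learning.
    """
    if P == 0 and D == 0:
        # Synapse untouched by training: keeps its prior statistics (eq. 2.3).
        return gamma0
    return P / (P + f * rho * D)
```

In the Willshaw limit ρ → 0, gamma returns 1 whenever P > 0, as in equation 2.4 below.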
In the absence of depression (the limit ρ → 0), the statistical description of the synaptic structuring (see equations 2.2 and 2.3) also describes a Willshaw-like synaptic matrix. The formal correspondence is exact if J_p = 1, J_d = 0, and γ_0 = 0 (Willshaw et al., 1969). In this limit, equation 2.2 becomes

lim_{ρ→0} γ(P, D) = γ_0 if P = 0;  1 if P > 0.   (2.4)

Every synapse that experiences at least one potentiating pattern of activity is in its potentiated state with probability 1. Otherwise, it remains unstructured, that is, potentiated with probability γ_0. In other words, the stimuli can only potentiate the synaptic efficacy. Thus, MF equations for a recurrent network with a Willshaw-like synaptic matrix (see, e.g., Golomb et al., 1990) can be obtained from the MF theory developed below, in the limit of vanishing ρ.

2.4 Afferent Currents and Multiplicity. Next, we show that cells with the same multiplicity form natural populations for a MF analysis. Note first that the average (replacing spikes by rates; e.g., Amit & Tsodyks, 1991) current afferent to neuron i from other excitatory neurons is

I_i = Σ_j J_ij ν_j,   (2.5)

where j runs over all presynaptic neurons. If neurons with the same multiplicity have the same mean emission rate ν_β, the sum can be rewritten as a sum over multiplicities,

I_i = C_E Σ_β ⟨J_ij⟩_β π(β) ν_β,   (2.6)
where C_E is the average EE connectivity, π(β) is the probability that the presynaptic neuron is of multiplicity β, and ⟨J_ij⟩_β is the average synaptic efficacy from a neuron of multiplicity β to postsynaptic neuron i. This last average of J_ij is over all possible presynaptic response patterns of neurons of multiplicity β. It is given by

⟨J_ij⟩_β = Σ_{P,D} [γ(P, D)J_p + (1 − γ(P, D))J_d] ψ_β^(i)(P, D),   (2.7)

where γ(P, D) is given by equations 2.2 and 2.3, and ψ_β^(i)(P, D) is the joint probability distribution of P and D (in an average MF sense; Brunel et al., 1998). Given the response pattern of neuron i (of multiplicity α), ψ_β^(i)(P, D) can be determined (see below). Suppose, without loss of generality, that the postsynaptic neuron responds to the first α stimuli and that α < β. The maximal value of P is α, since α is the maximal number of coincident active bits in the two response patterns. If α + β ≤ p, the minimal value of P is zero. If α + β > p, there must be at least α + β − p coincident bits in the two response patterns. Hence, P ranges from 0 (or from α + β − p when α + β > p) to α. The reasoning is analogous when α > β, exchanging α and β. The actual value of P is determined by the number of 1's among the first α bits in the response pattern of the presynaptic cell. Each active bit in this
position will find a corresponding active bit in the response pattern of the postsynaptic cell. The corresponding value of D is the number of 1's among the remaining p − α bits of the response pattern of the presynaptic neuron: each active bit in this position in the presynaptic response pattern will find an inactive bit in the response pattern of the postsynaptic cell. Because the presynaptic cell responds to just β stimuli, D is given by D = β − P. There are C(p, β) possible response patterns for a neuron of multiplicity β, where C(n, k) denotes the binomial coefficient. Among these, C(α, h)·C(p − α, β − h) have h 1's among the first α bits and β − h 1's among the remaining p − α. For all these, P = h and D = β − h. Hence, if neuron i is of multiplicity α,

ψ_β^(i)(P, D) = C(α, P) C(p − α, D) / C(p, β),  with D = β − P,   (2.8)
depending only on the multiplicity α of the postsynaptic neuron and not on its particular response pattern. Replacing ψ_β^(i)(P, D) ≡ ψ_αβ(P, D) in equation 2.7, and equation 2.7 in 2.6, we find for the afferent current

µ_α = C_E Σ_β J_αβ π(β) ν_β,   (2.9)

where µ_α is the mean current to a neuron of multiplicity α, while J_αβ is given by

J_αβ = Σ_{P,D} [γ(P, D)J_p + (1 − γ(P, D))J_d] ψ_αβ(P, D).   (2.10)
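Equations 2.8 and 2.10 can be evaluated directly with binomial coefficients. A minimal Python sketch (our naming; the loop bounds follow the constraints on P derived above):

```python
from math import comb

def psi(P, D, alpha, beta, p):
    """Joint probability of (P, D) for a synapse between a postsynaptic
    neuron of multiplicity alpha and a presynaptic one of multiplicity
    beta (eq. 2.8); nonzero only when D = beta - P."""
    if D != beta - P:
        return 0.0
    return comb(alpha, P) * comb(p - alpha, D) / comb(p, beta)

def J_mean(alpha, beta, p, f, rho, Jp, Jd, gamma0):
    """Mean efficacy J_{alpha,beta} of eq. 2.10: the two-state efficacy
    averaged over the distribution of (P, D)."""
    total = 0.0
    for P in range(0, min(alpha, beta) + 1):
        D = beta - P
        if D > p - alpha:
            continue  # impossible pattern, psi would be zero anyway
        denom = P + f * rho * D
        if (P == 0 and D == 0) or denom == 0:
            g = gamma0  # untouched synapse (or P = 0 in the rho -> 0 limit)
        else:
            g = P / denom
        total += (g * Jp + (1 - g) * Jd) * psi(P, D, alpha, beta, p)
    return total
```

By the Vandermonde identity, psi sums to 1 over the admissible values of P, and J_mean always lies between J_d and J_p.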
Thus, all neurons with the same multiplicity receive the same mean afferent current, with the same variance, provided the rates in a population of equal multiplicity are equal. Consequently, the cells in each such population would emit at the same mean emission rate, which expresses the consistency of the underlying assumption. In fact, the mean emission rate varies from neuron to neuron, mostly due to the random connectivity in the network. However, for large networks, the distribution of rates within a statistically homogeneous population is strongly peaked around its (population) mean value. This is confirmed by simulations (see below) and makes the cells with the same multiplicity a neural population suitable for MF analysis. To exemplify the import of the above arguments, we present in Figure 1 the average fraction of potentiated synapses afferent on a neuron, as a function of the multiplicity α of that neuron, defined as

⟨γ⟩_α = Σ_β π(β) Σ_{P,D} γ(P, D) ψ_αβ(P, D),   (2.11)
[Figure 1 appears here: ⟨γ⟩_α plotted against α, for α from 0 to 40.]

Figure 1: Sensitivity of afferent synaptic potentiation to neural multiplicity. Mean fraction of potentiated synapses versus postsynaptic multiplicity, in a network with f = 0.05, ρ = 1, and p = 40. This precludes the use of a single excitatory population for MF purposes.
with γ(P, D) and ψ_αβ(P, D) given by equations 2.2 and 2.8, respectively. As can be seen, the fraction of potentiated synapses varies with the multiplicity. Consequently, in a steady state, the mean emission rates of the neurons responding to the same stimulus are not homogeneous; they depend on the multiplicity of each neuron. Thus, the number of subpopulations of excitatory neurons of potentially different rates, required for MF analysis, is p + 1: one for each multiplicity (0, . . . , p). The above discussion was limited to the case without selective activity. When neurons selective to a given stimulus are active, not all populations of a given multiplicity are expected to have the same rate: those responsive to the stimulus will emit at a higher rate than the others within the same subpopulation. The arguments presented above can be extended (see the appendix), leading to the conclusion that 2p excitatory populations would be required, with rates ν_β^s for selective neurons of multiplicity β and ν_β^n for nonselective neurons of the same multiplicity. The resulting 2p populations consist of p nonselective ones, of multiplicity 0, 1, . . . , p − 1, and p selective ones, of multiplicity 1, 2, . . . , p. By contrast, in the case of nonoverlapping memories, the description of the retrieval state needs only three excitatory populations (selective neurons with multiplicity 1 and nonselective neurons with multiplicities 0 or 1), regardless of the number of stored memories (Amit & Brunel, 1997a).
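Since each neuron responds to each of the p stimuli independently with probability f, its multiplicity follows a binomial distribution (the π(β) used below, in equation 2.16). A short illustrative computation (function and variable names are ours), with the parameters of Figure 1, shows how concentrated this distribution is:

```python
from math import comb

def pi_beta(beta, p, f):
    """Probability that a neuron has multiplicity beta: Binomial(p, f)."""
    return comb(p, beta) * f**beta * (1 - f)**(p - beta)

# With f = 0.05 and p = 40, the mean multiplicity is f*p = 2,
# and almost all probability mass sits at small beta.
mass_low = sum(pi_beta(b, 40, 0.05) for b in range(7))  # beta = 0..6
```

Although 2p populations are needed in principle, only those few multiplicities carrying appreciable probability mass matter in practice, which is the basis of the reduced treatment of section 3.2.4.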
2.5 Mean-Field Analysis. To study whether the network can sustain single-item working memory states, we consider 2p + 1 populations, adding the population of inhibitory neurons to the 2p populations mentioned above. The mean currents to selective (s) and nonselective (n) neurons in the excitatory populations of multiplicity α are

µ_α^s = C_E [ Σ_{β=1}^{p} J_αβ^ss π_s(β) ν_β^s + Σ_{β=0}^{p−1} J_αβ^sn π_n(β) ν_β^n ] + µ_I + µ_ext
µ_α^n = C_E [ Σ_{β=1}^{p} J_αβ^ns π_s(β) ν_β^s + Σ_{β=0}^{p−1} J_αβ^nn π_n(β) ν_β^n ] + µ_I + µ_ext.   (2.12)
C_E is the average number of recurrent excitatory collaterals received by a neuron; J_αβ^ss is the mean efficacy of synapses received by a selective neuron of multiplicity α from selective neurons of multiplicity β, and J_αβ^sn is the mean efficacy received by a selective neuron of multiplicity α from nonselective neurons of multiplicity β. J_αβ^ns and J_αβ^nn are defined analogously; π_s(β) (π_n(β)) is the probability that a selective (nonselective) neuron has multiplicity β; ν_β^s (ν_β^n) is the mean emission rate in the subpopulation of selective (nonselective) neurons of multiplicity β; µ_I and µ_ext are, respectively, the mean afferent inhibitory current and the mean external current, given in equations A.13. The latter includes the selective stimulus current. Analogously, the variances of the afferent currents are obtained as

(σ_α^s)^2 = C_E [ Σ_{β=1}^{p} (J^2)_αβ^ss π_s(β) ν_β^s + Σ_{β=0}^{p−1} (J^2)_αβ^sn π_n(β) ν_β^n ] + σ_I^2 + σ_ext^2
(σ_α^n)^2 = C_E [ Σ_{β=1}^{p} (J^2)_αβ^ns π_s(β) ν_β^s + Σ_{β=0}^{p−1} (J^2)_αβ^nn π_n(β) ν_β^n ] + σ_I^2 + σ_ext^2,   (2.13)
where σ_I^2 and σ_ext^2 are, respectively, the variance of the inhibitory current and the variance of the external current, given in equations A.14. The mean efficacies J_αβ^xy, the mean squared efficacies (J^2)_αβ^xy, and the π_x(β) are computed in the appendix. Note that in computing the variance of the currents, we have ignored the correlations between synaptic efficacies that share a common presynaptic neuron. This is justified as long as the correlated contribution to the variance is small compared to the uncorrelated one. The ratio of the two is, for low f and high loading (p ∼ 1/f^2), of order Cf^2 · pf^2 exp(−2pf^2) ≪ 1 (see Amit & Fusi, 1994; Golomb et al., 1990). In our simulations, it is always < 0.72. Moreover, with our parameters, the variance of the afferent currents is dominated by the external and the inhibitory contributions, while the potential correlations affect exclusively the recurrent excitatory contribution.
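Given the mean and mean-squared efficacies, equations 2.12 and 2.13 are just weighted sums over the retained populations. A sketch with NumPy (array layout and names are ours; the mean squared efficacies Jss2, Jsn2 stand in for the second-moment terms of equation 2.13):

```python
import numpy as np

def mean_and_var(alpha_idx, Jss, Jsn, Jss2, Jsn2, pis, pin, nus, nun,
                 CE, mu_I, mu_ext, var_I, var_ext):
    """Mean and variance of the current to a selective neuron of
    multiplicity alpha (eqs. 2.12-2.13).  Jss[a, b] is the mean efficacy
    from selective neurons of multiplicity b; Jss2[a, b] the mean
    squared efficacy.  pis, pin: multiplicity probabilities; nus, nun:
    rates of the selective and nonselective populations."""
    mu = CE * (Jss[alpha_idx] @ (pis * nus) + Jsn[alpha_idx] @ (pin * nun)) \
         + mu_I + mu_ext
    var = CE * (Jss2[alpha_idx] @ (pis * nus) + Jsn2[alpha_idx] @ (pin * nun)) \
          + var_I + var_ext
    return mu, var
```

The nonselective currents (ns, nn) are assembled identically, with the corresponding efficacy arrays.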
The statistics of the current—its mean and variance—determines in turn the mean emission rate of the neuron according to

ν_α^(s,n) = Φ(µ_α^(s,n), σ_α^(s,n)),   (2.14)
where the functional form of the transduction function Φ(·) depends on the neural model chosen. The mean emission rates within the subpopulations determine the statistics of the afferent currents. Self-consistency then implies that the mean emission rates determining a given statistics of the currents be equal to the rates at which spikes are produced in the various subpopulations given those currents:

ν_α^s = Φ(µ_α^s({ν_β^s, ν_β^n}), σ_α^s({ν_β^s, ν_β^n}))
ν_α^n = Φ(µ_α^n({ν_β^s, ν_β^n}), σ_α^n({ν_β^s, ν_β^n})).   (2.15)

µ_α^x({ν_β^s, ν_β^n}) and σ_α^x({ν_β^s, ν_β^n}) represent the dependence of the means and the variances of the afferent currents on all the population rates. The dependence on the inhibitory and the external emission rates is not explicitly indicated. A solution of equations 2.15 provides the mean emission rates of all subpopulations in a stationary state of activity. The phenomenology one expects is (Amit & Brunel, 1997a):

• Spontaneous activity (SA). For low values of Jp/Jd ∼ 1, there exists a single solution of equations 2.15 with ν_β^s = ν_β^n for β = 1, . . . , p − 1. In other words, selective and nonselective populations have the same mean emission rate. Rates may depend on the multiplicity of the cells, due to potentiations that did take place between certain groups of neurons. The system is ergodic. See section 4.

• Retrieval state or working memory (WM). Above some value of Jp/Jd, a bifurcation takes place, and besides the SA solution, a new solution of equations 2.15 appears with ν_β^s > ν_β^n for β = 1, . . . , p − 1. In this state, one population of neurons, responsive to a given stimulus, has elevated activity, and the remaining neurons, not selective to that stimulus, emit at low rates. But within each of the two populations (selective and not selective), the rates will depend on the multiplicity of the cells. There exist p different retrieval states, one for each of the learned stimuli.

• Destabilized SA. When Jp/Jd becomes too high, the spontaneous activity state becomes unstable and only selective delay activity states exist.

2.5.1 Reduced Mean Field. Since one expects the number of stored memories to become large, a system of 2p + 1 dynamical variables in mean field is still excessive and may become as large as the number of cells in the network, or larger. A further step, beyond the 2p + 1 reduction, can be achieved
by observing that π_x(β) becomes negligible for values of β distant from the peak of the distribution, at fp. Disregarding selectivity for the sake of simplicity, π(β) is

π(β) = C(p, β) f^β (1 − f)^{p−β},   (2.16)

where C(p, β) is the binomial coefficient and f^β (1 − f)^{p−β} is the probability that a neuron responds to a given set of exactly β out of the p stimuli. For large values of p, this distribution becomes gaussian, with mean and variance given by

µ_p = fp,   σ_p^2 = f(1 − f)p.

But if f is low enough, even for pf ∼ 2, 3 the distribution is quite peaked, as we show below. If, for example, one considers populations within 3σ of the peak, about 99% of the excitatory neurons of the network are included. But σ_p ∼ √(fp), and hence this number of populations is quite low compared to 2p. One then takes this reduced number of populations as the only ones that contribute to the currents, reducing the number of equations to be treated for self-consistency. The rates of neurons in the other populations are computed directly from the afferent currents produced by the populations retained. For example, for f = 0.05 and p = 100, σ_p ≈ 2.24, 3σ_p ∼ 7, and when selective and nonselective populations are considered, one needs only 24 populations instead of the initial 200.

3 Methods

3.1 Spiking Neuron Model. The underlying single unit in our network is the standard integrate-and-fire element, without adaptation. The membrane depolarization integrates the afferent current I(t) according to

V̇ = −V/τ + I(t),   (3.1)
where τ is the membrane time constant. Whenever the depolarization reaches a threshold θ, the cell emits a spike and remains refractory for a time τ_arp. Then V is reset to V_r and normal dynamics resumes. The current I(t) to a neuron, produced by the afferent spikes, is

I(t) = I_ext(t) + Σ_j Σ_k J_j δ(t − t_j^(k) − δ_j),   (3.2)
where I_ext is the current coming from outside the network (see, e.g., Amit & Brunel, 1997a); j goes over the presynaptic sites; J_j is the efficacy of the corresponding synapse; k runs over all spikes arriving at a given site; t_j^(k) is the emission time of the kth spike by neuron j; and δ_j is the associated transmission delay. The effect of a spike is very brief. When I(t) is stationary, gaussian, and independent at different times, the transduction function Φ(·) of a neuron is (Amit & Tsodyks, 1991; Ricciardi, 1977)

ν = Φ(µ, σ) = [ τ_arp + τ ∫_{(V_r−µ)/σ}^{(θ−µ)/σ} √π e^{u^2} (1 + erf(u)) du ]^{−1},   (3.3)
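Equation 3.3 can be evaluated by ordinary numerical quadrature. The sketch below is our illustration (a simple trapezoidal rule, with a quadrature resolution n of our choosing), not the authors' implementation:

```python
from math import erf, exp, pi, sqrt

def phi(mu, sigma, tau, theta, Vr, tarp, n=2000):
    """Transduction function of eq. 3.3 (Ricciardi, 1977): mean rate of an
    integrate-and-fire neuron driven by gaussian current with mean mu and
    variance sigma**2.  Time constants and the result share the same time
    unit (ms here gives spikes/ms)."""
    a = (Vr - mu) / sigma      # lower integration limit
    b = (theta - mu) / sigma   # upper integration limit
    f = lambda u: sqrt(pi) * exp(u * u) * (1.0 + erf(u))
    h = (b - a) / n
    integral = h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n))
                    + 0.5 * f(b))
    return 1.0 / (tarp + tau * integral)
```

With the excitatory parameters of Table 1 (τ = 20 ms, θ = 20 mV, V_r = 10 mV, τ_arp = 4 ms), the returned rate is in spikes/ms; multiplying by 1000 gives Hz. The rate grows monotonically with µ and saturates below 1/τ_arp.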
where µ and σ^2 are, respectively, the mean and the variance of the afferent current. Since in a randomly and weakly connected network the afferent current is gaussian to a good approximation (Amit & Brunel, 1997b), we use equation 3.3 as the transduction function in our MF studies.

3.2 Stationary States in Mean Field

3.2.1 The Full Case. For most of the results obtained in MF, we used a simplified set of dynamical equations for the evolution of the rates. In the full case, we integrate the following 2p + 1 equations (all multiplicities),

τ_E ν̇_α^s(t) = −ν_α^s(t) + Φ_E({ν_β^s(t)}, {ν_β^n(t)}, ν_I(t)),   α = 1, . . . , p
τ_E ν̇_α^n(t) = −ν_α^n(t) + Φ_E({ν_β^s(t)}, {ν_β^n(t)}, ν_I(t)),   α = 0, . . . , p − 1
τ_I ν̇_I(t) = −ν_I(t) + Φ_I({ν_β^s(t)}, {ν_β^n(t)}, ν_I(t)),   (3.4)
performing a first-order numerical integration with a time step Δt = 0.01 ms (about 10^−3 of the time constants involved). τ_E, τ_I are fictitious excitatory and inhibitory time constants, whose actual values are immaterial, since we consider only stationary states. We take them equal to the membrane time constants of the excitatory and inhibitory spiking neurons, respectively (see Table 1). {ν_β^s(t)}, {ν_β^n(t)} are, respectively, the sets of rates in the selective and nonselective populations of multiplicity β. The Φ's are the transduction functions defined in equation 3.3. Their dependence on the type of neuron (excitatory or inhibitory) is via τ, V_r, and θ. Their dependence on the rates is via the mean and variance of the afferent current to each type of neuron, equations 2.12 and 2.13. We looked for the steady states of this dynamical system, which are the solutions of equations 2.15. The test for stationarity is that all 2p + 1 rates, upon the next integration step, undergo a relative variation less than a given tolerance ε_f ∼ 10^−6, that is,

|ν_α^x(t + Δt) − ν_α^x(t)| / ν_α^x(t) < ε_f.

The iterative approach to the stationary states, via equations like equation 3.4, guarantees their stability with respect to this dynamics.
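The relaxation scheme and the relative-change stopping rule just described can be sketched compactly (phi_vec is an abstract placeholder for the full transduction map of equation 3.4; names and the toy usage are ours):

```python
import numpy as np

def relax_to_fixed_point(phi_vec, nu0, tau, dt=0.01, tol=1e-6,
                         max_steps=10**6):
    """First-order integration of tau * dnu/dt = -nu + phi_vec(nu)
    (cf. eq. 3.4), stopped when every rate changes by a relative amount
    below tol over one step."""
    nu = np.asarray(nu0, dtype=float)
    for _ in range(max_steps):
        nu_new = nu + (dt / tau) * (-nu + phi_vec(nu))
        if np.all(np.abs(nu_new - nu) < tol * np.abs(nu)):
            return nu_new
        nu = nu_new
    raise RuntimeError("no stationary state within max_steps")
```

For example, with the toy contraction phi_vec(ν) = 0.5ν + 1 the iteration settles near the fixed point ν = 2 from any initial condition, mirroring how equations 3.4 are driven to a solution of equations 2.15.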
Table 1: Parameters Used in MF Calculations and Simulations.

Single-Cell Parameters                                        E        I
θ — Spike emission threshold                                  20 mV    20 mV
V_r — Reset potential                                         10 mV    10 mV
τ — Membrane time constant                                    20 ms    10 ms
τ_arp — Absolute refractory period                            4 ms     2 ms

Network Parameters                                            Values
C_E — Number of recurrent excitatory connections per cell     1600
C_I — Number of recurrent inhibitory connections per cell     400
C_ext — Number of external connections per cell               3200
ν_ext — Spike rate from external neurons                      5.00 Hz

Synaptic Parameters                                           Values
J_ext^(E) — Synaptic efficacy ext → E                         0.070 mV
J_ext^(I) — Synaptic efficacy ext → I                         0.115 mV
J_IE — Synaptic efficacy E → I                                0.080 mV
J_EI — Synaptic efficacy I → E                                0.275 mV
J_II — Synaptic efficacy I → I                                0.178 mV
J_d — Depressed efficacy of plastic EE synapses               0.030 mV
J_p — Potentiated efficacy of plastic EE synapses             gJ_d
γ_0 — Fraction of potentiated synapses before learning        0.05

Variable Parameters                                           Values
g — Ratio between potentiated/depressed efficacies (J_p = gJ_d)    1–10
ρ — Ratio between LTD/LTP probabilities (q_− = fρq_+)              0–10
f — Fraction of E cells responding to a stimulus                   0.005–0.2
p — Number of stored memories                                      1–80

Spiking Network Parameters (Simulation)                       Values
N_E — Number of excitatory cells                              8000
N_I — Number of inhibitory cells                              2000
c — Probability of synaptic contact                           0.2
δ — Synaptic delay                                            1–10 ms
T_stim — Duration of visual presentation                      500 ms
T_delay — Interval between two successive presentations       1000 ms
G_stim — External rate increase during presentation (ν_ext → G_stim ν_ext)   1.5

Notes: The first four sections are common to both MF calculations and simulations. The bottom section lists parameters specific to the simulation.
3.2.2 Initial Conditions. To find the solution corresponding to spontaneous activity (SA), the initial conditions are chosen to be

ν_α^s(t = 0) ∼ ν_α^n(t = 0) ∼ ν_I(t = 0) ∼ ν_ext,   (3.5)
with ν_ext the rate of the afferent external noise (see section 2.2). For the retrieval states (delay activity, WM), if they exist, we start with

ν_α^s(t = 0) ∼ 10 ν_ext,   ν_α^n(t = 0) ∼ ν_I(t = 0) ∼ ν_ext,   (3.6)
which mimics the presentation of a stimulus. The behavior of the system is rather insensitive to the precise values of the initial conditions, except near critical points. We consider three scenarios, which depend on the various network parameters:

• No selective delay activity—ergodic dynamics. The stationary rates in populations of the same multiplicity are equal, regardless of whether they are selective, and independent of the initial conditions. The resulting rate distributions are denoted ν_β(SA).

• Coexistence of single-item WM and spontaneous activity. The stationary solutions arrived at from initial conditions 3.5 and 3.6 are different. For condition 3.5, ν_β^s(SA) = ν_β^n(SA) for β = 1, . . . , p − 1. The solution corresponding to the WM initial conditions, 3.6, has ν_β^s(WM) > ν_β^n(WM) for β = 1, . . . , p − 1. Ergodicity is broken—the final state depends on the initial condition. There will be one single-item WM solution for each selected stimulus.5

• Disappearance of spontaneous activity. Upon integration with initial conditions 3.5, the system does not end up in a symmetric state as in the above two cases. Instead, it is always attracted to a WM state. Which of the p patterns is actually in the active state is determined by a fluctuation in the initial conditions or in the dynamics. Upon presentation of a stimulus, initial condition 3.6, the system ends up in a WM state corresponding to the stimulus presented.

3.2.3 Observables in MF. To obtain a compact description of the state of activity of the excitatory subnetwork, we use the rate averaged over multiplicities in the selective populations, ν̃_s, and in the nonselective populations, ν̃_n, given by

ν̃_s = Σ_β ν_β^s π_s(β) / Σ_β π_s(β),   ν̃_n = Σ_β ν_β^n π_n(β) / Σ_β π_n(β).   (3.7)

We are interested primarily in determining the region in the space of the parameters in which the network is able to exhibit both the SA and the WM

5 There may also be multi-item solutions—states in which several populations sustain high delay activity simultaneously, as with nonoverlapping stimuli (e.g., Amit, Bernacchia, Yakovlev, & Orlov, 2003). Those could be studied by the present approach, but this goes beyond the scope of this work.
states of activity. In particular, we study the space of states of the network as we vary: the ratio g (≡ J_p/J_d) between the potentiated and the depressed synaptic efficacy (varying J_p at fixed J_d); the number p of stored patterns; ρ, the ratio between the depression and potentiation synaptic transition probabilities, equation 2.1; and the coding level f, varying one parameter at a time.

3.2.4 Reduced Population Number. To reduce the number of dynamical equations in equation 2.15, we proceeded as follows. The number of relevant rates was determined starting from the mean multiplicity fp and including only the rates, in selective and nonselective populations, of multiplicity within 3σ_p (≈ 3√(pf)) of the mean, as discussed in section 2.5.1.6 In some instances, this set of populations is cut off at the lower end, for example, if it reaches the lowest possible multiplicity (0 or 1). We denote the reduced set of multiplicities {n_r}, where n_r is the number of subpopulations considered in the reduced treatment. The number of equations in the dynamical system (see equation 3.4) reduces correspondingly to n_r equations, like equation 3.4, with n_r variables. Then n_r is increased to n_r′, adding one population on each side of the previously retained set of populations (where possible), and the solution is compared with that of {n_r}. If
max_α { |ν_α^x({n_r}) − ν_α^x({n_r′})| / ν_α^x({n_r}) , |ν_I({n_r}) − ν_I({n_r′})| / ν_I({n_r}) } < ε_r = 10^−2,   (3.8)

where α ∈ {n_r} and x = s, n, then n_r is taken as the number of populations to be retained. Otherwise, if the condition is not satisfied, the number of populations is increased and the condition checked again. Finally, the neglected rates ν_α^x are obtained, via the transduction function, by computing the mean and variance of the currents afferent on neurons of each neglected multiplicity, using the rates of the retained populations, that is,

µ_α^x = C_E [ Σ_{β∈{n_r}} J_αβ^xs ν_β^s π_s(β) + Σ_{β∈{n_r}} J_αβ^xn ν_β^n π_n(β) ] + µ_I + µ_ext,
(σ_α^x)^2 = C_E [ Σ_{β∈{n_r}} (J^2)_αβ^xs ν_β^s π_s(β) + Σ_{β∈{n_r}} (J^2)_αβ^xn ν_β^n π_n(β) ] + σ_I^2 + σ_ext^2.   (3.9)
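The initial 3σ window of retained multiplicities can be computed directly. The sketch below is our illustration of the counting only (the subsequent enlargement loop governed by equation 3.8 is omitted):

```python
from math import ceil, floor, sqrt

def retained_window(p, f, n_sigma=3):
    """Initial set of retained multiplicities: those within n_sigma
    standard deviations of the mean multiplicity f*p (section 3.2.4),
    clipped to the admissible range [0, p]."""
    mean = f * p
    sigma = sqrt(p * f * (1 - f))
    lo = max(0, floor(mean - n_sigma * sigma))
    hi = min(p, ceil(mean + n_sigma * sigma))
    return list(range(lo, hi + 1))
```

For f = 0.05 and p = 100, this gives multiplicities 0 through 12; counting selective and nonselective copies, that is of the order of the roughly 24 populations quoted in the text, against the initial 200.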
In some cases, we compare the solution with the reduced number of populations to the full MF solution (2p + 1 populations).

6 A limiting case of this approach, f → 0, pf very large, was employed in Golomb et al. (1990) to study the MF of the Willshaw model.

In some sample
situations, we also double-checked the convergence of our dynamics by considering the MF dynamical equations for the evolution of the means and variances of the currents, as in Amit and Brunel (1997a), rather than the simplified equations, 3.4, for the evolution of the rates.

3.2.5 Bifurcation Diagram: Varying g at Fixed p. As examples of the different possible stationary states of the network, we generate sample bifurcation diagrams. At fixed p, the synaptic structure is set up according to equations A.2. Then g is varied from 1 up. For each value of g, the MF equations, 3.4, are solved for the stationary states with both initial conditions, 3.5 and 3.6. We then plot the average rates ν̃_s and ν̃_n, equations 3.7, versus g. The phenomenology is the following:

• For 1 ≤ g < g_c^(1)(p), there is a single solution at low rate, with ν̃_s ∼ ν̃_n—spontaneous activity.

• g_c^(1)(p) is a bifurcation point. For g_c^(1)(p) ≤ g < g_c^(2)(p), a second solution appears, with ν̃_s ≫ ν̃_n, that is, selective activity rates are significantly higher than nonselective rates. Both increase with increasing g.

• For g ≥ g_c^(2)(p), there is again a single solution: either the WM state, for p ≪ 1/f, or the symmetric (SA) state at high rate, for p > 1/f (see below).

3.2.6 Storage Capacity. For a given set of parameters, the network, if it has WM states, will have a critical capacity: there will be a highest value of p (≡ p_c) for which the WM solutions still exist, while for p_c + 1 the only stationary state is spontaneous activity. For a given set of parameters and a given p, the synaptic couplings among the various neural populations are computed according to equations A.2, and the corresponding steady-state solutions are obtained, as above. Keeping all parameters fixed and increasing p, we obtain p_c corresponding to that set of parameters. Then one parameter is varied at a time, giving p_c as a function of that parameter.

3.3 Simulations

3.3.1 The Network Architecture. The model network is composed of N_E excitatory and N_I inhibitory integrate-and-fire spiking neurons, equation 3.1. The direct afferent presynaptic cells of a given neuron are selected, independently and randomly, by a binary process with probability c, so that each neuron receives, on average, cN_E = C_E excitatory and cN_I = C_I inhibitory contacts. The structure of the connectivity remains fixed throughout the simulation. The existing excitatory as well as inhibitory synapses onto inhibitory neurons are assigned a uniform value. The structuring of the excitatory-to-excitatory synapses (i.e., the statistics of the potentiated/depressed synapses) is generated according to the set of stimuli supposed to have been presented in training (see below).
To each recurrent synapse is associated a transmission delay δ—the time after which the emitted spikes are delivered to the postsynaptic cell. The synaptic delays are uniformly distributed within a bounded interval. C_ext excitatory contacts are afferent on each neuron from outside the network. Each is modeled by a random and independent Poissonian spike train of rate ν_ext. Hence, all neurons receive a nonselective external current I_ext = J_ext η(t), where J_ext is the efficacy of the external synapses and η(t) is a Poisson process with mean C_ext ν_ext per unit time. See Table 1.

3.3.2 The Simulation Process. The simulation consists of numerical integration of the discretized dynamical equations of the membrane depolarizations, equation 3.1, of all (N_E + N_I) neurons. The temporal step is Δt (= 0.05 ms), shorter than the average time between two successive afferent spikes. The initial distribution of depolarization in the network is set uniform at a subthreshold value. The actual value has little effect on the network dynamics—the network reaches its stationary state, corresponding to the SA state, within short relaxation times (∼ 50 ms). Spikes begin to be emitted due to the external currents. The depolarization of every neuron is sequentially updated. If V_j(t + Δt) > θ, a spike is delivered to all postsynaptic neurons, and the depolarization is reset to V_j = V_r and kept fixed for τ_arp milliseconds. The spike adds to the depolarization of the postsynaptic neuron i, at time t + Δt + δ_ij, the value J_ij of the corresponding synaptic efficacy. (For more details on the simulation process, see Amit & Brunel, 1997b.) The complete list of parameters is reported in Table 1.
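One update step of the discretized dynamics just described might look as follows (a minimal per-neuron Euler sketch, our simplification: the afferent current I is taken as already accumulated for the step, and spike delivery to postsynaptic targets is handled outside this function):

```python
def lif_step(V, I, tau, dt, theta, Vr, refr_left, tarp):
    """One Euler step of eq. 3.1 for a single neuron.  Returns the new
    depolarization, the remaining refractory time, and whether a spike
    was emitted."""
    if refr_left > 0:
        # During the absolute refractory period, V is clamped at reset.
        return Vr, max(0.0, refr_left - dt), False
    V = V + dt * (-V / tau + I)
    if V > theta:
        return Vr, tarp, True   # spike: reset and become refractory
    return V, 0.0, False
```

With a constant suprathreshold drive, repeated calls produce periodic firing separated by the refractory period, as in the full simulation.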
Each stimulus corresponds (on average) to a pool of f NE visually responsive excitatory neurons. These pools are kept fixed all along the simulation. Hence, µ one can associate with each neuron a response pattern {ξi }, µ = 1, . . . , p, µ where ξi = 1 if neuron i responds to (i.e., is selective to) the stimulus number µ, and 0 otherwise. To repeat, the ξ s are not rates but merely a bookkeeping of the selectivity of the cells. The asymptotic effect of learning is expressed in the following way. The (two-state) synapse between the presynaptic excitatory neuron j and the postsynaptic excitatory neuron i, if it exists, is set in the up state (Jij = Jp ) with probability Pr(Jij = Jp ) =
γ0 Pij /(Pij + fρDij )
if Pij = Dij = 0 otherwise
(3.10)
2616
E. Curti, G. Mongillo, G. La Camera, and D. Amit
where Pij =
p µ=1
µ µ
ξi ξj ; Dij =
p µ=1
µ
µ
(1 − ξi )ξj ,
(3.11)
where Pij and Dij are, respectively, the P and D introduced at the end of section 2.2, for the specific synapse connecting neuron j to i. γ0 , the synaptic distribution prior to learning, is defined in equation 2.3. The synaptic matrix is generated for every value of p according to this recipe. This structuring on top of an unstructured synaptic background, follows the tradition of Brindley (1969). 3.3.4 Testing Protocol. To check the MF predictions, the network is subjected to a testing stage during which the entire set of stimuli is presented, to check which of the stimuli has a corresponding delay activity with the given synaptic structure at the given loading level. The presentation of a stimulus is expressed by an increase in the rates of external afferents to the selective cells (the corresponding pool) for an interval of Tstim . The rate of spikes arriving at these neurons is increased by a factor Gstim > 1, Table 1. External currents to other excitatory neurons, as well as to the inhibitory neurons, are unaltered. Accordingly, the neurons selective to a given stimulus emit at elevated rates, during the presentation of the stimulus. Note that no synaptic plasticity takes place during stimulus presentation. The typical trial consists of a presentation interval of Tstim , followed by a delay interval of Tdelay , during which none of the populations is stimulated. The presentation sequence of the stimuli is either kept fixed (1 → 2 → · · · p → 1) or generated by choosing each stimulus to be presented independently and randomly, with probability 1/p, until each stimulus has been presented at least once. 3.3.5 Recordings and Observables. In order to compare MF predictions with the behavior of the simulated spiking network, the average emission rates within selective and nonselective populations are estimated by equation 3.7, within each trial period, for the entire set of stimuli. 
The mean emission rate of a population in an interval T is estimated as the total number of spikes emitted by the neurons belonging to that population in the interval divided by T and by the total number of neurons within the population. Single-cell data are measured in a randomly selected subset of excitatory neurons (10% of the total). The subset of neurons sampled is kept fixed during the “recordings” to mimic experimental procedures. The mean emission rate of neuron i in an interval T is estimated as the number of spikes emitted by the neuron in the interval divided by T. The mean emission rate of a neuron in the delay period is estimated starting 200 ms following the removal of the stimulus until Tdelay after the
Mean Field with Random Sparse Memories
2617 µ
removal of a stimulus µ, and is denoted by vi . This is to avoid the transient following the stimulus. µ From the single-cell rate estimates {vi }, measured in the simulations, we compute the average rates in populations of cells that are selective and nonselective to stimulus #µ, with a given multiplicity β, off line, as Vµs (β)
=
µ µ i∈β ξi vi µ , i∈β ξi
(3.12)
µ where i∈β vi is over all neurons of multiplicity β selective to stimulus µ. s s Vµ (β) → νβ for each µ, as N → ∞. Analogously,
i∈β (1
Vµn (β) =
µ
µ
− ξi )vi
i∈β (1
µ
− ξi )
,
(3.13)
which → νβn as N → ∞. The compact average rates, that is, selective and nonselective rates averaged over all multiplicities, are obtained directly from the simulations, as µ µ s β πs (β)νβ i ξi vi Vµs = → = ν˜ s µ β πs (β) i ξi n µ µ β πn (β)νβ i (1 − ξi )vi n Vµ = = ν˜ n → µ β πn (β) i (1 − ξi ) and could also be obtained from single-cell recordings. Note that Vµs , Vµn are the analogs of the overlaps in the Hopfield model (Amit, Gutfreund, & Sompolinsky, 1985). In this way, we obtain a Vµs and Vµn , for each µ. To compare with ν˜ x of the MF analysis, equation 3.7, we compute V x =
1 x Vµ → ν˜ x . p µ
(3.14)
The corresponding error is estimated as the standard deviation of the (p-component) vector of the Vµs ’s from its mean. In some cases, when we studied only compact average rates, we presented only half of the stimulus set, to reduce simulation time. 4 Results 4.1 Bifurcation Diagrams in MF and Simulations. The equations 2.15 for the stationary states of the network, in MF, with the set of parameters
2618
E. Curti, G. Mongillo, G. La Camera, and D. Amit
given in Table 1 and the synaptic matrix set-up as described in section 3, are solved with 2p + 1 populations for p = 40 and 60. Figures 2A and 2C present the population rates, averaged over all multiplicities, in three populations: (1) the selective, active population (˜νs ); (2) the remaining nonselective neurons, including those responsive to no stimulus (˜νn ); and (3) the inhibitory neurons (νI ). The rates are plotted versus g ≡ Jp /Jd ≥1 (Jd fixed). For low values of this parameter, there is a single low-rate solution with ν˜ s ∼ ν˜ n (spontaneous activity), independent of the initial condition. Both rates, as well as the mean emission rate in the inhibitory population, increase with the potentiation level. Above a critical value g = gc , which increases with p, gc =7.37 (see Figure 2A: p = 40), and gc = 8.35 (see Figure 2C: p = 60), a second solution appears, in which one selective population has a persistent elevated rate (˜νs ν˜ n ), starting at 23.02 Hz in Figure 2A and 29.83 Hz in Figure 2C. Also the average, selective enhanced rate increases with increasing g. From the microscopic simulation, we estimate (as described in section 3.3.5) the average emission rate in the selective population (i.e., µ neurons responsive to the stimulus, for which ξi = 1) and in the nonseµ lective population (i.e., neurons nonresponsive to the stimulus, ξi = 0), following stimulus removal. In this way, p values for ν˜s and p values for ν˜n are obtained (see section 3). The population emission rates averaged over all p stimuli, with the corresponding error bars (standard errors over the p emission rates), are also reported in Figures 2A and 2C for five values of g— two below and two above gc and one near gc . The results of the numerical simulations are in good agreement with theory, except near gc . 
Figure 2B shows the distribution of the measured (simulation) selective average emission rate, during the delay period, over the p(= 40) populations, for three values of g, marked (a,b,c). Far from gc , the distribution is unimodal, either at low rates (a) or at high rates (c). By contrast, for g ∼ gc (point b), the distribution is bimodal, with a peak at low rates, for populations unable to sustain WM and a peak at high rates, for those that sustain WM. The same distribution is reported in Figure 2D for p = 60. In this case, the distribution of the rates is bimodal at all measured points (as witnessed also by the large error bars in Figure 2C). This is due to the fact that p = 60 is very close to the critical capacity, pc (g), of the network and, as we show below, pc (g) does not vary much with g, in this region. At still higher values of g, there is another critical value beyond which there are two different behaviors: for low p 1/f , the network is close to one with nonoverlapping stimuli, the SA state destabilizes, and only WM states are stable. For higher p (> 1/f ), the system displays only a symmetric stable state, much as the one for g < gc , but at high rates. The data for these high values of g is not shown. Figure 3 presents the detailed dependence of the rates of selective and nonselective cells, in the delay activity state (WM) versus the multiplicity β, for g above gc , Figures 3A and 3B: p = 40, g = 8, and Figures 3C and 3D:
Mean Field with Random Sparse Memories
A
B
2619
1
60
ν [Hz]
50 40 30
(b)
20 10 0
C
8
g
9
D (c)
50
ν [Hz]
Fraction of populations
60
40 (b)
30 20 (a)
10
g
9
10
20
0
20
40
60 (c)
ν [Hz]
40
60
0
0
0.2
20
40
60
0
0.2
20
40
60
20
80 (c)
g>gc 0
80 (b)
g∼ gc
0.1 0
(a)
g
0.5
60 (b)
g>gc
0.1
8
40
1
0
7
0
0.1
7
20
g∼ gc
0.2 0
6
0
0.1
(a)
70
0
0 0.2 0
(a)
g
0.5
Fraction of populations
(c)
40
ν [Hz]
60
80
Figure 2: Bifurcation diagrams in MF and in simulation with sample (measured) rate distributions in several network states. (A, B) Networks with p = 40. (C, D) p = 60. (A, C) Average rates versus potentiation parameter g, in MF theory (curves, no fitting) and in simulation (points), in three populations: (black) selective, (red) nonselective, averaged over all multiplicities, and (green) inhibitory. Dashed curves: Retrieval (delay activity) state; continuous curves: SA state. The vertical dotted lines correspond to the bifurcation points in MF. Points in the figures on the left are average rates measured in the simulation, over an interval of 800 ms, starting 200 ms after the removal of the stimulus (see section 3). Error bars are standard errors computed over the p populations corresponding to the set of stimuli. Error bars for the inhibitory rates and the nonselective rates are too small to be noticed. (B, D) Rate histograms for selective populations, measured in the simulations, for three values of g, corresponding to the network on the left and to the point in the bifurcation diagram marked by the same letter. For p = 40, the rate distributions are unimodal away from the bifurcation point: B(a), at low rates for g = 6.5 < gc , and B(c), at high rates for g = 8.5 > gc . For g = 7.5 ∼ gc , the distribution is bimodal, due to inhomogeneities B(b). For p = 60, all distributions are bimodal, since 60∼ pc (critical capacity) for all three sample values: g = (7.5, 8.5, 9.5). See section 4.2 and Figure 6. Parameters of Table 1, with f = 0.05, ρ = 1.
2620
E. Curti, G. Mongillo, G. La Camera, and D. Amit
B
A 60 50
50
D
80
80
70
70
60
60
50
50
20
30
ν [Hz]
30
20
10
ν [Hz]
40
ν [Hz]
40
ν [Hz]
C
60
40 30
40 30
20
20
10
10
10
0
0 0
20
β
40
0
0
2
β
4
6
8
0 0
20
β
40
60
0
β
5
10
Figure 3: Average rate versus multiplicity of neural response in a delay activity state, in selective and nonselective cells, in MF and in simulations. (A, B): p = 40, g = 8; (C, D): p = 60, g = 9. (A, C): Rates for all (2p+1) multiplicities (in MF); (B, D): Rates for those multiplicities on the left actually found in the simulation. β ≤ 7 for p = 40; β ≤ 9 for p = 60, together with the corresponding simulation results. Upper curves: selective cells; lower curves: nonselective cells. Error bars in simulation data are the standard error over the neural sample. Network parameters as in Figure 2.
p = 60, g = 9. On the left of Figures 3A and 3C are the rates in populations of all multiplicities as obtained by MF theory. On the right of Figures 3B and 3D are zooms of the part of the figures on the left for the relevant multiplicities—those actually found in the simulation (β ≤7 for p=40 and β ≤ 9 for p = 60). Shown are MF rates together with the corresponding rates measured in the simulation, each with its standard error, as found across the subset of sampled neurons (10%). Again, simulation results are in good agreement with the theory. Also, here the fluctuations for p = 60 are much larger (see above). The relevant rates (of the multiplicities retained in the reduction) vary relatively slowly with the multiplicity. The large standard errors near gc are due to inhomogeneities. These have several sources. There is structural inhomogeneity due to the random selection of the number of presynaptic afferent cells (we fix only the average number C); another source of inhomogeneity is due to the fluctuations in the coding level (around the mean f N) (see sections 3.3.1 and 3.3.3). The random connectivity produces different afferent currents to different, equivalent neurons (see Amit & Brunel, 1997b). Similarly, the randomness in coding level leads to different numbers of neurons in different (equivalent) selective populations. These fluctuations are quenched (i.e., fixed in a simulation, once the network and the stimuli have been determined). Their main effect consists in allowing, for the same value of g, some populations to be able to sustain enhanced delay activity and others not. Consequently, the critical value of g varies from population to population (see Brunel,
Mean Field with Random Sparse Memories
2621
2000). If one were to define a critical g in the simulation as the lowest value for which all populations have persistent delay activity, it would be higher than the value obtained from MF theory, as is to be expected and as is confirmed by Figures 2A and 2C, where it is seen that for Figure 2A, it would be higher than 7.5 (compared to theoretical, MF, 7.37) and for Figure 2C higher than 9.5 (compared to theoretical 8.35). Yet, this empirical gc may vary from simulation to simulation of the same network. To exhibit these effects we report in Table 2 the results of simulations, with one or both of these parameters—connectivity and coding level—uniform, to compare the fluctuations with the case when they are both free (distributed as described in section 3). We do this for p = 40, at several values of g. Uniform connectivity implies that all neurons receive exactly the same number of afferent connections, cNE excitatory and cNI inhibitory; uniform coding level implies that all selective populations have the same number of neurons f NE . As can be read from the standard errors, the inhomogeneity that more strongly affects the dynamics of the network around gc is the variability in the coding level. When the number of neurons is set equal (at f NE ) for all populations, the standard errors are small even very near gc , despite the variability in connectivity. They are essentially the same, as if both parameters were kept uniform. In the opposite case, uniform connectivity and random coding level, the standard errors are as large as if both variables were free. For instance, at g = 7.5, the standard error at uniform f and random C is 1.90 (column 3 in Table 2), while it grows to 12.28 for random coding level and uniform connectivity (column 4). Far from gc , the various inhomogeneities only mildly affect the mean emission rate (averaged also over the stimuli) predicted by MF (last row in Table 2). 
For the network tested here, the residual inhomogeneities, such as in the synaptic structuring, in the number of neurons with a given multiplicity in a given selective population, or the finite number of neurons (finite-size effects) have a rather mild effect on the average rates.
4.1.1 Effectiveness of Multiplicity Reduction. Figure 4 is an example of the effectiveness of the multiplicity reduction approach. It shows the relative errors, defined as (rate in the reduced solution − rate in full solution)/(rate in full solution), for all rate branches in Figure 2, all along the potentiation process (g). The system memorized 60 stimuli with f = 0.05, so the full solution involves 121 populations. The reductions, performed around β = pf = 3, are done in three ways: with nr = 5, and nr = 7 and automatic, that is, until condition 3.8 is fulfilled (see section 3). For nr = 5, in some branches, the relative error is as large as 15% to 20%; with nr = 7, it is contained, along the entire g interval, below 2%. The automatic process constrains the error to below 1%, by construction. It is rather striking that in all three approximations, the critical value of g is equal to within about 2.5%.
2622
E. Curti, G. Mongillo, G. La Camera, and D. Amit
Table 2: Effect of Inhomogeneities: Average Selective Emission Rates and Corresponding Standard Errors Measured in the Microscopic Simulation for Several Values of g Around gc . g=
Jp Jd
7 7.37∗ 7.5 8
Both Uniform
Uniform Coding
Uniform Connection
Both Free
MF
3.32±1.95 27.94±2.12 31.28±1.72 41.75±1.63
2.35±1.17 26.08±2.46 29.39±1.90 40.65±1.69
11.04±11.32 26.99±11.07 29.73±12.28 45.25±3.41
7.97±9.59 25.92±11.90 29.75±11.15 42.68±4.66
2.08 23.02 28.53 40.82
Notes: Network parameters as in Figure 2A. Col. 2—uniform connectivity and uniform coding level. Col. 3—uniform coding level and random connectivity. Col. 4—uniform connectivity and random coding level. Col. 5—random connectivity and random coding level. Col. 6—theoretical (MF) emission rates. The asterisk indicates the MF critical potentiation level, gc = 7.37. For uniform coding level, regardless of the connectivity distribution, the standard errors are as small as when both variables are uniform, even at gc (cols. 2 and 3). In particular, for uniform coding level, at the MF gc , all populations exhibit persistent delay activity. By contrast, when the coding level is random, the standard errors are as large as if both parameters were free (cols. 4 and 5). These large standard errors express the fact that some populations are in a WM state, while others are in SA. Each entry in cols. 2–5 is obtained from a single simulation. The delay activity rate (in a time window of 800 ms, starting 200 ms following the removal of the stimulus; see section 3) is averaged over the cells belonging to each stimulus. The reported rate (in Hz) is the average over the 40 stimuli of the resulting population-averaged delay rates, given together with its corresponding standard error.
4.2 Storage Capacity in MF 4.2.1 Phenomenology of Storage Capacity. One expects that a given system (fixed constitutive parameters), memorizing random stimuli as stationary states, arrives at a critical point at which adding more memories results in overloading. This was foreseen by Hopfield (1982) and proved by Amit et al. (1987), for ±1 neurons. The considerations were extended to 0–1 neurons and low coding by Tsodyks and Feigelman (1988) and Buhmann et al. (1989). Here we basically follow the logic of Amit et al. (1987): for a given set of parameters, we look for the value of p for which the equations 2.15 cease to have retrieval solutions, with ν˜ s > ν˜ n . In Figure 5 we report the average emission rates, selective and nonselective, in the retrieval (WM) state, as a function of the number of stored memories, p. The set of parameters is that of Table 1 with g = Jp /Jd = 9 and ρ = 5, at coding level f = 0.05. The selective emission rate ν˜ s smoothly decreases as the number of memories increases until p = 40. Between p = 40 and p = 40+1, the MF selective delay rate collapses to the nonselective level, that is, ν˜ s ∼ ν˜ n . The retrieval solution disappears. pc = 40 is the storage capacity of the network. Note that the disappearance of the delay activity at this loading level is tantamount to a blackout, that is, disappearance of delay activity for all stimuli. This is due to the fact that the slow repetitive training,
Mean Field with Random Sparse Memories
2623
Relative error
0.1 0 −0.1 −0.2 0.02
6
7
8
9
10
6
7
8
9
10
6
7
8
9
10
0
−0.02 −0.04 0.01 0 −0.01 −0.02
g
Figure 4: Convergence of MF rates for reduced population number: p = 60, f = 0.05. Relative rate differences between reduced solution and the full MF solution (121 populations) versus potentiation level g (range corresponding to the bifurcation diagram in Figure 2C). (Top) nr = 5 (for nonselective (n) α = 0, 1, 2, 3, 4, for selective (s) α = 1, 2, 3, 4, total 9+1 populations). (Middle) nr = 7 (n, α = 0, . . . , 6; s, α = 1, . . . , 6, total 13+1 populations). (Bottom) Automatic reduction, converged at nr = 7 for g < 7; at nr = 8 for 7 < g < 9; and at nr = 9 for g > 9 (see section 3). The relative error <1%. Color coding as in Figure 2. Full curves: spontaneous activity, dashed curves: working memory.
implied by the synaptic matrix used, renders the network fully symmetric under permutations of the different stimuli. In Figure 5 we also present the average selective and nonselective rates in the p retrieval states, measured in the simulation of the spiking network, with the corresponding standard errors over the p rates. The results are in good agreement with theoretical predictions. As p approaches pc , the error bars increase. This is a signature of the bimodality of the distribution of the p selective rates, as presented, for example, in Figure 2: some populations sustain WM activity at enhanced rates, while others do not. Similar effects appear in the measured average selective rate in the spiking network, for p > pc . It smoothly decreases to the spontaneous level as p increases. By contrast, in MF, the transition is abrupt, and all memories are lost together.
2624
E. Curti, G. Mongillo, G. La Camera, and D. Amit
80 ∼ νs(WM)
ν [Hz]
60
40 ∼ νn(WM)
20
0
0
20
p
40
60
Figure 5: Selective and nonselective average rates versus number of stored memories, in MF and simulations, for g = 9 and ρ = 5, at coding level f = 0.05. The full curves represent average MF rates, selective (marked ν˜ s ) and nonselective (˜νn ), in the retrieval (WM) state. Points represent the corresponding average rates measured in the simulation. Error bars as in Figure 2. For p > pc ( = 40), MF selective rate collapses to spontaneous level, ν˜ s ∼ ν˜ n . The retrieval state disappears. The transition in the spiking network takes place around the value of pc predicted by MF theory. It appears smooth because of structural fluctuations in the connectivity, synaptic structuring, coding level, and so forth (see text). For p < pc , but nearby, some populations are unable to sustain WM activity. Similarly, for p > pc , a subset of populations (< p) can still sustain elevated delay activity. See also Figure 2 and Amit (1989) and Sompolinsky (1987).
4.2.2 Memory Capacity vs. Model Parameters. We study the dependence of the storage capacity, pc , on the ratio between the potentiated and depressed efficacies, g = Jp /Jd ; and on the balance of potentiation to depression probabilities in the learning process ρ ≡ q− /( f q+ ); on the inhibitoryto-excitatory synaptic efficacy JEI , all at fixed coding level f = 0.05. Then we study its dependence on f , with the parameters mentioned above kept fixed. In Figure 6 we plot pc versus g at several values of ρ. The storage capacity pc is obtained, at fixed g and ρ, as the largest p at which equations 2.15 still have a retrieval solution, while for p = pc + 1, equations 2.15 have only the ergodic, spontaneous activity solution (see section 3). For each value of ρ, pc rises monotonically as g increases, until it reaches a maximum at gmax (ρ). Near the maximum, pc is insensitive to g and then starts to decrease. This is connected to the large fluctuations observed in the discussion of Figures 2C and 2D.
Mean Field with Random Sparse Memories
2625
80 ρ=0.02 0 (Willshaw)
ρ=1
p
c
60
ρ
40 ρ=5 20
ρ=10 0
7
7.5
g8
8.5
9
Figure 6: Storage capacity versus g = Jp /Jd , for several values of ρ at f = 0.05. pc rises with g up to a maximum value, gmax , which rises with ρ. Near gmax , pc is insensitive to the value of g. Beyond gmax , the capacity begins to decrease. For g not too high, capacity increases when ρ decreases, that is, when depression is reduced relative to potentiation. The curves are not smooth due to the discrete nature of pc .
In Figure 7 we plot pc versus g at ρ = 1 for three values of the inhibitory synaptic efficacy onto excitatory neurons, JEI . As can be seen, gmax increases with JEI , as the inhibitory currents are made stronger. As expected on the basis of the signal-to-noise analysis, the storage capacity increases as gmax increases. Note that decreasing JEI results in a lowering of the minimal g for which the network sustains the retrieval (WM) state. As long as the inhibitory subnetwork is able to ensure the right inhibition-excitation balance (see Amit & Brunel, 1997a), the storage capacity increases with g and decreases with increasing ρ, that is, curves with higher ρ lie lower (see Figure 6). At fixed g, the storage capacity increases as the ratio of LTD to LTP transition probabilities decreases (ρ → 0). For instance, for g = 7.5, the storage capacity is approximately 10 at ρ = 10, while it grows to about 50 when ρ decreases to 0.02 (see Figure 6). This is further elaborated in Figure 8, which shows the dependence of pc on ρ, for g = 8, g = 9, and g → ∞ (the Willshaw model, Willshaw et al., 1969). As in Figure 6, pc is higher for higher g, and decreases with ρ for fixed g, due to excess LTD (large ρ), see e.g. (Amit & Fusi, 1992, 1994; Brunel et al., 1998). This is so because the number of depressing patterns is of order f , while the number of potentiating ones is
2626
E. Curti, G. Mongillo, G. La Camera, and D. Amit JEI=0.29mV
80 JEI=0.275mV
p
c
60
40
JEI=0.26mV
20
0
5
6
7
g
8
9
10
Figure 7: Storage capacity versus g = Jp /Jd , for several values of JEI , at ρ = 1 and f = 0.05 . It rises and reaches a maximum at gmax and starts to decrease. As JEI increases, so does gmax and the capacity at gmax (squares). The lower JEI , the lower the bifurcation value of g, corresponding to pc = 1, where WM starts.
of order f 2 , so for fρ ∼ 1, the resulting mean potentiation within a selective population is quite low and cannot sustain WM. To trace the dependence of the storage capacity on the relative strength of depression, ρ, we proceed as follows: pc was estimated using the criterion of the synaptic structuring dynamics (see Brunel et al., 1998). There the storage capacity for stationary states was estimated as the value of p at which the excess fraction of potentiated synapses within a selective population over the average potentiation level of all excitatory synapses became less than 0.5. This condition implies that pc is the solution of the equation p k=0
1 k ak exp(−a) = , k + ρa k 2(ρ + 1)
(4.1)
with a = pf 2 . But in Brunel et al. (1998), Jd = 0, which in our case corresponds to g → ∞. As such, it would be expected to give an overestimate of the capacity at finite g. Solving this equation at a discrete set of points 0 ≤ ρ ≤ 15 gives the higher set of points in Figure 8. These points were fitted with a functional form suggested by the signal-to-noise analysis (Amit & Fusi, 1994), together with the consideration that pc must be finite at ρ = 0, where the model coincides with the Willshaw model. The ansatz used was 2 1 ρ pc (ρ) = A + B · ln +C . (4.2) 1+ρ 1+ρ
Mean Field with Random Sparse Memories
2627
300 250
p
c
200 BCF estimate (g → ∞ )
150 100
g=9
50 0
g=8 0
5
ρ
10
15
80 BCF estimate (g → ∞ )
p
c
60
40
20
g=8 g=9
0
0
5
ρ
10
15
Figure 8: Storage capacity (in MF) versus ρ ( f = 0.05). Diamonds: storage capacity of the structuring criterion, equation 4.1, (Brunel et al., 1998), at a discrete set of point 0 ≤ ρ ≤ 15. Squares: MF, g = 8. Circles: MF, g = 9. Curves: Leastsquare fit with equation 4.2. (Top) extended range of pc : to highlight quality of the fit for the structuring criterion, which also produces the exact value corresponding to the Willshaw model for ρ = 0. Coefficients: A = 348.1, B = 31.76, C = 0.026. (Bottom) data and fit to equation 4.2 for MF results: for g = 8, A = 95.10, B = 27.82, C = 0.901; for g = 9, A = 154.4, B = 44.91, C = 0.346.
The success of the fit (see Figure 8) led us to use the same functional form to fit the MF results. This was done for two values of g (8 and 9). The data points (MF pc ) and the fitting curves are presented in Figure 8. Finally we study the dependence of the critical capacity on the coding level f , at fixed g for q− = f q+ , that is, ρ = 1. If trends are well captured by
2628
E. Curti, G. Mongillo, G. La Camera, and D. Amit
Amit and Fusi (1994), Golomb et al. (1990) and Willshaw et al. (1969), then pc should vary as f −2 when f goes to zero. Figure 9 shows pc versus f , for g = 7: pc increases with decreasing f , (circles). However, when f reaches the value f0 (=0.07), for the parameters used, the critical capacity pc starts to fall off, as the coding level is further decreased. This is due to the fact that the selective signal scales as f , for small f , while nonselective currents as well as inhibition remain at O(1). (See below and section 5.) The rising part of pc ( f ) is fit with the function pc ( f ) = a/f + b (see Figure 9, dotted line). The trend is well captured by the fit, but a fit by 1/f 2 would be only marginally worse. To compensate for the decrease of the signal with f , the efficacy of potentiated synapses and the selective contributions to the inhibitory neurons were scaled by 1/f , so that all contributions to the signal remain O(1) as f → 0 s (see also Golomb et al., 1990). This is obtained by multiplying g, as well as JIE (the excitatory-to-inhibitory synapses from selective cells) by f0 /f , with f0 = 0.07. The contributions to the variances of the currents scale correspondingly. In this case, the critical capacity continues to grow below f < f0 (see Figure 9, squares). The dashed curve is a least-square fit to a/f 2 +b. In general, we observe that f is the most relevant variable for the variation of the capacity, rather than the number of cells N or the connectivity, much as in the Willshaw model (Willshaw et al., 1969; Golomb et al., 1990; see also section 5.2). 5 Discussion 5.1 Mean-Field Theory. The main achievement of this study is the development of an effective MF theory for the delay activity states of a network of spiking neurons, in a situation that is very close to realistic. 
It is realistic in several important aspects: the network is composed of randomly connected spiking IF neurons (the IF model used for the neurons fits well with in-vitro studies; see, e.g., Rauch, La Camera, Luscher, ¨ Senn, & Fusi, 2003); neurons are divided into excitatory and inhibitory, and learning respects this division; no symmetry is assumed in the synaptic matrix; the network sustains spontaneous as well as selective activity; and it is the intensity of training that determines whether only the first or both are present (see Erickson & Desimone, 1999); the stimuli learned excite random (nonexclusive) subsets of cells in the network and the synaptic matrix that is generated is naturally connected to one that would be generated by unsupervised Hebbian learning in a system of synapses with two discrete stable states (Brunel et al., 1998). The existence of cells responsive to several different stimuli renders the theory more complicated than previous developments, with stimuli that elicited exclusive “visual” responses (Amit & Brunel, 1997a, 1997b).7 The 7 After the submission of the article, we became aware of the experimental study of Tamura, Kaneko, Kawasaky, & Fujita (2004), which measures the multiplicities of the cells in IT cortex during stimulus presentation (e.g., their Figure 4).
Mean Field with Random Sparse Memories
2629
120
scaling
100
p
c
80 60
no scaling
40 20 0 0
0.02
0.04
0.06
0.08
0.1
0.12
f Figure 9: Storage capacity (in MF) versus coding level f . q− = f q+ (ρ = 1) and g = 7. Circles: fixed parameters (no scaling). Squares: with scaling of g s and JIE (excitatory-to-inhibitory synapses from selective populations), by f/f0 ( f0 = 0.07). Dotted line: Least-square fit of the rising part of pc ( f ) by a/f + b: a = 3.45 and b = −7.14. Dashed line: Least-square fit to a/f 2 + b: a = 0.23 and b = −4.93.
natural populations for the study of MF theory, given overlapping stimuli, are of cells responding to a given number of stimuli. This is because the learning process envisaged respects the same population structure: it generates a synaptic structuring in which the probability of a synapse being potentiated depends only on the multiplicity of its pre- and postsynaptic neurons. The main complication is the need to consider a coupled system of equations (for population rates) whose number grows linearly with the number of stimuli learned. However, we show that the actual number of relevant multiplicities is much smaller, due to the fact that at low coding levels, corresponding to the experimental situation, the number of neurons of multiplicity away from the mean decreases very rapidly. Hence, the number of population rates, and of the equations required, reduces sharply. The mean multiplicity is fp, and the number of required populations is of order fp around the mean, where p is the number of memorized stimuli. For example, obtain precise solutions for the rates in the stationary states of the network, storing 100 memories at 5% coding level, with about 20 populations whose
2630
E. Curti, G. Mongillo, G. La Camera, and D. Amit
coupled rates have to be solved for. The solutions are precise in the sense that they vary little when the number of populations is increased, and also in their good correspondence with simulations of the underlying microscopic system of 10,000 IF spiking cells with the same synaptic matrix.

This result may open the way to an MF description of networks learning much more general sets of stimuli, not necessarily chosen at random, or when the coding level of different stimuli is different. The availability of an effective MF theory is also an important tool for the exploration of the space of states of networks in experimental situations of delay response tasks of higher complexity, such as the pair-associate task (Sakai & Miyashita, 1991; Erickson & Desimone, 1999; Mongillo, Amit, & Brunel, 2003) or tasks generating context correlations (Miyashita, 1988; Amit et al., 1994). Typically, in modeling such tasks, in order to explore the large parameter space using MF theory, it was assumed that stimuli had no overlaps in the populations of neurons that coded for them. In other words, neurons had perfectly sharp tuning curves. This limitation is eliminated by our study.

Our study also opens the way for the detailed study of the learning dynamics, when synaptic plasticity is driven by spikes, modulated by the emerging synaptic structure and by unconstrained afferent stimuli. This double dynamics amplifies the number of parameters, and MF theory becomes an indispensable tool. Again, in the past, this problem was approached with nonoverlapping stimuli (see, e.g., Amit & Mongillo, 2003; Del Giudice & Mattia, 2001).

5.2 Mean Field at Vanishing ρ. It was mentioned in section 2.3 that in the limit of vanishing ρ, the synaptic structure becomes that of the Willshaw model (Willshaw et al., 1969), when J_p = 1, J_d = 0, and the initial fraction of potentiated synapses γ_0 = 0.
However, Willshaw's original network was a fully synchronized system with a single-layer (feedforward) architecture. Golomb et al. (1990) developed an MF theory for a fully connected recurrent network with asynchronous Glauber dynamics and a Willshaw synaptic matrix. This theory is closely related to the theory presented here, from which it can be formally obtained in the limit ρ → 0. In fact, the two models share several characteristics. We defer the formal correspondences to another report. Here we mention two of the common results.

First, the dependence of the capacity on the connectivity (the number of neurons in Willshaw et al., 1969) is weak. What renders the capacity insensitive to the connectivity (N in the Willshaw model) is the fact that in both, the difference in the signals (mean current) to neurons of enhanced rate and those at low rates in a given attractor decreases with increasing p, the number of memorized items. This is in contrast to models such as that of Amari-Hopfield (Amari, 1972; Hopfield, 1982) or Tsodyks and Feigelman (1988), in which the signal is constant and only the noise depends on p. Second, if parameters are appropriately scaled when f is varied, the storage capacity increases as
1/f^2 (see below and section 4). The scaling of the parameters is implicit in the capacity estimates in the Willshaw model, where the selective signal is proportional to f: if the threshold were to remain constant as f → 0, the states would have been destabilized. Hence, one could either rescale the efficacies by 1/f, to maintain the signal constant, or scale down the threshold. This is explicitly effected in Golomb et al. (1990).

5.3 Storage Capacity, Blackout, and Palimpsest. The reliability of the MF description allows for a detailed study of the storage capacity of the network, that is, the maximal number of memories that can be stored in the synaptic couplings and retrieved by the network dynamics. The destabilization of the retrieval state depends principally on the balance between the excitatory and the inhibitory subnetworks. As the number of stored memories grows, the fraction of potentiated synapses between two selective populations, as well as within the whole excitatory subnetwork, increases. Near saturation, when a population is at elevated rates, the current afferent to the other selective populations becomes large. These, as a consequence, tend to emit at high rates, further increasing the inhibitory feedback. When the inhibitory feedback exceeds a given level, depending on the couplings within the inhibitory population and between the two subnetworks, the resulting effect is the destabilization of the retrieval state and a collapse of the network into the spontaneous activity state.

When overloading takes place, it eliminates all memories together: a blackout as in classic models (Willshaw et al., 1969; Amit et al., 1987). This, as was already mentioned, is due to the fact that the slow, repeated learning renders all stimuli symmetric for the network.
Yet here the situation is different, since the synaptic matrix employed derives from a realistic framework of learning (Amit & Mongillo, 2003; see Fusi, 2003, for a review) and includes depression as well as potentiation. Hence, following saturation, learning of the same or a different set of stimuli can recommence, and a new functioning synaptic matrix is asymptotically generated. This is not part of the older paradigms of associative memory. The evolution of synaptic structuring on retraining has been studied in Brunel et al. (1998), where the speeds of learning and of forgetting were computed. Despite the fact that in that study persistent delay activity was not explicitly considered, it strongly indicates a palimpsest effect (Nadal, Toulouse, Changeux, & Dehaene, 1986): as new stimuli enter the training set, the memory of the old ones is erased to make room for the new ones. This remains to be confirmed in the wider context of the double dynamics of neurons and synapses, where the interplay between neural activity and synaptic dynamics could generate various instabilities (see Amit & Mongillo, 2003).

Moreover, it is tempting to speculate that perhaps, given long experience, most associative networks (like those in the temporal and prefrontal lobes) work near saturation. If that were the case, one could reap yet another bonus: the emission rates of the network near saturation, for random
stimuli, become relatively low—that is, not much higher than spontaneous activity. The rates are significantly lower than those at low loading of the network (see Figure 5). This feature has been observed over a wide range of parameters for which the system was tested. The low delay rates are a feature observed in experiment and a lingering problem of neural modeling. The solutions projected for this problem are usually additional complexity in modeling, such as slow receptors and adaptive neural or synaptic elements. Here, low delay rates are a natural feature of the network with overlapping memories near saturation.

It is worth noting that the dependence of the storage capacity on network parameters like the ratio of LTD/LTP probabilities, the ratio of potentiated/depressed efficacies, and the coding level turns out to be consistent with the simple estimates of Amit and Fusi (1994) and Brunel et al. (1998), as long as the network operates in a balanced regime (excitation versus inhibition; Amit & Brunel, 1997a). For this to hold, the mean inhibitory rate must be proportional to the mean emission rate within the excitatory subnetwork. The storage capacity increases with an increasing ratio of potentiated/depressed efficacies (see Figure 6) and decreases logarithmically with an increasing ratio of LTD/LTP probabilities. One also observes the increase of capacity with decreasing coding level f, foreshadowed in Willshaw et al. (1969), Tsodyks and Feigelman (1988), Buhmann et al. (1989), and Gardner (1986). On the basis of S/N considerations (Amit & Fusi, 1994), the storage capacity would increase as f^{-2} in the limit of vanishing f and q_- ∼ f q_+ (ρ = 1). The agreement of the MF results with these S/N expectations is notable (see, e.g., Figure 9).
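The two f^{-2} capacity estimates can be put side by side numerically. A small sketch comparing the signal-separation estimate p_c ∼ 0.3/f^2 of Brunel et al. (1998) with the least-squares fit p_c ≈ 0.23/f^2 − 4.93 quoted in the Figure 9 caption (the grid of f values is our choice, for illustration only):

```python
# Compare the S/N-based capacity estimate of Brunel et al. (1998), pc ~ 0.3/f^2,
# with the least-squares fit to the MF results in the Figure 9 caption,
# pc ~ 0.23/f^2 - 4.93. The grid of f values below is ours.
def pc_brunel(f):
    return 0.3 / f ** 2

def pc_mf_fit(f, a=0.23, b=-4.93):
    return a / f ** 2 + b

for f in (0.03, 0.05, 0.07, 0.10):
    print(f"f={f:.2f}  Brunel estimate: {pc_brunel(f):7.1f}  MF fit: {pc_mf_fit(f):7.1f}")
```

Both expressions share the 1/f^2 rise at small f; the constant offset of the fit only matters at large coding levels.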
Furthermore, estimating capacity by the criterion (fraction of potentiated synapses within a selective population) − (fraction of potentiated synapses within the whole excitatory subnetwork) > 0.5, Brunel et al. (1998) found, for ρ = 1, p_c ∼ 0.3/f^2, surprisingly near our 0.23/f^2 (see Figure 9). Our results go deeper in that they relate the potential signal separation directly to the existence of retrieval solutions of the underlying MF equations. In some sense, the signal separation, as evaluated by S/N analysis, is a necessary condition for the existence of the retrieval solution of the MF equations. It may be compared to the relation between results for the Amari-Hopfield model, such as those of Weisbuch and Fogelman-Soulié (1985), and those of Amit et al. (1987).

Appendix: MF Populations for Selective Activity

A subpopulation of neurons in the network, corresponding to a given stimulus, may be active at selective rates, higher than the others, due to external selective stimulation or selective delay activity. In that case, we double the number of populations into which we divide the network. A subpopulation of neurons of a given multiplicity will be further subdivided into two (exclusive) populations—one of the neurons selective to the active stimulus
and the other for those that are not. There will be populations of multiplicity α = 1, …, p composed of neurons that are selective to the stimulus, and populations with α = 0, …, p − 1 of neurons that are not selective to the special stimulus—altogether, 2p populations of excitatory neurons. Note that the population with α = 0 cannot be divided, because its neurons cannot be selective; similarly, the population with α = p cannot, because its neurons must be selective.

These 2p excitatory populations are our candidates for MF homogeneous populations. We therefore assume that the rates within each of them are equal, and equal to ν^s_β (for selective cells of multiplicity β) and ν^n_β (for nonselective cells of the same multiplicity). The generalization of equation 2.6 is

I_i = C_E [ Σ_{β=1}^{p} ⟨J_{ij}⟩^s_β ν^s_β π_s(β) + Σ_{β=0}^{p-1} ⟨J_{ij}⟩^n_β ν^n_β π_n(β) ].  (A.1)
π_s(β) (π_n(β)) is the probability that a selective (nonselective) neuron is of multiplicity β. ⟨J_{ij}⟩^x_β is an average over synapses with selective (x = s) and nonselective (x = n) presynaptic neurons of multiplicity β connected to the postsynaptic neuron i.

The joint probability distributions of P potentiations and D depressions during the training with p stimuli, defined following equation 2.7, now have to be computed separately for selective and nonselective pre- and postsynaptic neurons, but do not depend on the particular postsynaptic response pattern of the neuron. They depend on the multiplicities and the selectivities of the two neurons: ψ^{xy}_{αβ}(P, D), where x (y) is the postsynaptic (presynaptic) selectivity, while α and β are the post- and presynaptic multiplicities. The generalization of equation 2.7 becomes

⟨J⟩^{xy}_{αβ} = Σ_{P,D} [ γ(P, D) J_p + (1 − γ(P, D)) J_d ] ψ^{xy}_{αβ}(P, D),  (A.2)
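Since equation A.2 averages J_p and J_d with weights γ(P, D) ∈ [0, 1] under the probability distribution ψ, the mean efficacy necessarily lies between J_d and J_p. A quick numerical check of that convex-combination structure (the particular ψ and γ used below are arbitrary placeholders, not the paper's; only the structure of A.2 matters here):

```python
import random

# Equation A.2 is a convex combination: <J> = sum_{P,D} [gamma*Jp + (1-gamma)*Jd] * psi,
# with psi a probability distribution and 0 <= gamma(P,D) <= 1, so Jd <= <J> <= Jp.
# The particular psi and gamma below are placeholders, not the paper's.
random.seed(0)
Jp, Jd = 1.0, 0.1

pairs = [(P, D) for P in range(6) for D in range(6)]
raw = [random.random() for _ in pairs]
total = sum(raw)
psi = {pd: w / total for pd, w in zip(pairs, raw)}   # placeholder distribution over (P, D)

def gamma(P, D):
    return (P + 1) / (P + D + 2)                     # placeholder value in (0, 1)

J_mean = sum((gamma(P, D) * Jp + (1 - gamma(P, D)) * Jd) * psi[(P, D)]
             for P, D in pairs)
print("Jd <= <J> <= Jp:", Jd <= J_mean <= Jp)
```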
where xy stands for the four possibilities of an (n, s) selectivity pair for the post- and presynaptic neurons. Following the reasoning leading to equation 2.8, one obtains, for the four possible combinations xy:

ψ^{ss}_{αβ}(P, D) = \binom{α-1}{P-1} \binom{p-α}{D} / \binom{p-1}{β-1},  P ≥ 1, D = β − P,  (A.3)

ψ^{sn}_{αβ}(P, D) = \binom{α-1}{P} \binom{p-α}{D} / \binom{p-1}{β},  P ≥ 0, D = β − P,  (A.4)

ψ^{ns}_{αβ}(P, D) = \binom{α}{P} \binom{p-α-1}{D-1} / \binom{p-1}{β-1},  P ≥ 0, D = β − P,  (A.5)

ψ^{nn}_{αβ}(P, D) = \binom{α}{P} \binom{p-α-1}{D} / \binom{p-1}{β},  P ≥ 0, D = β − P.  (A.6)

The multiplicity distributions are

π_s(β) = \binom{p-1}{β-1} f^{β-1} (1 − f)^{p-β}  (A.7)

and

π_n(β) = \binom{p-1}{β} f^{β} (1 − f)^{p-1-β}.  (A.8)
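Equations A.3 through A.8 can be sanity-checked numerically: for fixed α and β, each ψ^{xy} sums to 1 over P (a Vandermonde convolution), and π_s, π_n are normalized binomial distributions. Moreover, at low coding levels the mass of π concentrates on a few multiplicities around fp, which is what makes the reduced description of section 5.1 tractable. A short check (the parameter values and the 1e-6 "negligible mass" threshold are our choices):

```python
from math import comb

def C(n, k):
    """Binomial coefficient, zero outside the valid range."""
    return comb(n, k) if 0 <= k <= n else 0

p, alpha, beta = 20, 6, 5   # example values (ours); any 1 <= alpha <= p - 1 works

# Equations A.3-A.6 with D = beta - P; by the Vandermonde convolution,
# each psi^{xy} should sum to 1 over P.
psi = {
    "ss": lambda P: C(alpha - 1, P - 1) * C(p - alpha, beta - P) / C(p - 1, beta - 1),
    "sn": lambda P: C(alpha - 1, P) * C(p - alpha, beta - P) / C(p - 1, beta),
    "ns": lambda P: C(alpha, P) * C(p - alpha - 1, beta - P - 1) / C(p - 1, beta - 1),
    "nn": lambda P: C(alpha, P) * C(p - alpha - 1, beta - P) / C(p - 1, beta),
}
sums = {xy: sum(rule(P) for P in range(beta + 1)) for xy, rule in psi.items()}

# Equations A.7-A.8: normalized binomial distributions of the multiplicity.
f = 0.05
pi_s = sum(C(p - 1, b - 1) * f ** (b - 1) * (1 - f) ** (p - b) for b in range(1, p + 1))
pi_n = sum(C(p - 1, b) * f ** b * (1 - f) ** (p - 1 - b) for b in range(p))

# At low coding level the mass concentrates on few multiplicities around f*p.
p2 = 100   # the 100-memory, 5%-coding example of section 5.1
mass = [C(p2 - 1, b) * f ** b * (1 - f) ** (p2 - 1 - b) for b in range(p2)]
relevant = [b for b, q in enumerate(mass) if q > 1e-6]   # threshold is our choice
print(sums, pi_s, pi_n, len(relevant))
```

Only about 20 multiplicities carry non-negligible mass in the 100-memory example, consistent with the roughly 20 coupled populations reported in section 5.1.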
In the first case, the p − 1 and β − 1 are due to the fact that for the presynaptic cell, the selective bit must be on. Hence, there are \binom{p-1}{β-1} ways of distributing the remaining β − 1 1's among the remaining p − 1 bits. In the second case, for a nonselective presynaptic neuron with multiplicity β, since the first bit must be inactive, there are \binom{p-1}{β} ways of distributing the β 1's among the remaining p − 1 bits.

Using equations A.3 and A.4 for selective postsynaptic neurons and equations A.5 and A.6 for nonselective ones, together with the expressions for π_s(β) and π_n(β) in equation A.1, one obtains, for the average current afferent on selective and nonselective cells of multiplicity α, respectively:

μ^s_α ≡ ⟨I_i⟩ = C_E [ Σ_β ⟨J⟩^{ss}_{αβ} ν^s_β π_s(β) + Σ_β ⟨J⟩^{sn}_{αβ} ν^n_β π_n(β) ],  (A.9)

μ^n_α ≡ ⟨I_i⟩ = C_E [ Σ_β ⟨J⟩^{ns}_{αβ} ν^s_β π_s(β) + Σ_β ⟨J⟩^{nn}_{αβ} ν^n_β π_n(β) ].  (A.10)

For the variances of the afferent currents, we obtain

(σ^x_α)^2 = C_E Σ_{y=s,n} Σ_β ⟨J^2⟩^{xy}_{αβ} ν^y_β π_y(β),  (A.11)

where

⟨J^2⟩^{xy}_{αβ} = Σ_{P,D} [ γ(P, D) J_p^2 + (1 − γ(P, D)) J_d^2 ] ψ^{xy}_{αβ}(P, D).  (A.12)
Finally, for completeness, we write the expressions for the mean external and inhibitory currents, μ_ext and μ_I:

μ_ext = C_ext J_ext ν_ext;  μ_I = C_I J_EI ν_I.  (A.13)

The corresponding variances are given by

σ^2_ext = C_ext J^2_ext ν_ext;  σ^2_I = C_I J^2_EI ν_I.  (A.14)
Acknowledgments

This study was supported by the Center of Excellence Grant "Changing Your Mind" of the Israel Science Foundation and the Center of Excellence Grant "Statistical Mechanics and Complexity" of the INFM, Roma-1. We are grateful to N. Brunel and Y. Amit for very helpful comments on the manuscript. G.L.C. thanks S. Fusi, F. Carusi, and M. Mascaro for very helpful discussions.

References

Amari, S. I. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. Systems, Man and Cybernetics, SMC-2, 643–657.
Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Amit, D. (1998). Simulation in neurobiology—theory or experiment? TINS, 21, 231–237.
Amit, D., Brunel, N., & Tsodyks, M. V. (1994). Correlations of cortical Hebbian reverberations: Experiment versus theory. Journal of Neuroscience, 14, 6435–6445.
Amit, D., Bernacchia, A., Yakovlev, V., & Orlov, T. (2003). Multiple-object working memory—A model for behavioral performance. Cerebral Cortex, 13, 435–443.
Amit, D., & Brunel, N. (1997a). Model of global spontaneous activity and local structured (learned) delay activity during delay periods in cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network: Computation in Neural Systems, 8, 373–404.
Amit, D., Gutfreund, H., & Sompolinsky, H. (1985). Spin-glass models of neural networks. Phys. Rev. A, 32, 1007–1018.
Amit, D., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Annals of Physics, 173, 30–67.
Amit, D., & Fusi, S. (1992). Constraints on learning in dynamic synapses. Network, 3, 443–464.
Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.
Amit, D., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15, 565–596.
Amit, D., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates I: Substrate spikes, rates and neurons gain. Network, 2, 259–274.
Brindley, G. S. (1969). Nerve net models of plausible size that perform many simple learning tasks. Proc. Roy. Soc. Lond. B, 174, 173.
Brunel, N. (1996). Hebbian learning of context in recurrent neural networks. Neural Computation, 8, 1677–1710.
Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic currents dynamics. J. Theor. Biol., 195, 87–95.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes in recurrent neural networks. Network: Computation in Neural Systems, 9, 123–152.
Brunel, N. (2000). Persistent activity and the single cell f–I curve in a cortical network model. Network, 11, 261–280.
Buhmann, J., Divko, R., & Schulten, K. (1989). Associative memory with high information content. Phys. Rev. A, 39, 2689–2692.
Del Giudice, P., & Mattia, M. (2001). Long- and short-term synaptic plasticity and the formation of working memory: A case study. Neurocomputing, 38–40, 1175–1180.
Erickson, C. A., & Desimone, R. (1999). Responses of macaque perirhinal neurons during and after visual stimulus association learning. J. Neuroscience, 19, 10404–10416.
Fulvi Mari, C. (2000). Random networks of spiking neurons: Instability in the Xenopus tadpole moto-neural pattern. Physical Review Letters, 85, 210–213.
Fusi, S. (2003). Spike-driven synaptic plasticity for learning correlated patterns of mean firing rates. Reviews in the Neurosciences, 14, 73–84.
Gardner, E. (1986). Structure of metastable states in the Hopfield model. J. Phys. A: Math. Gen., 19, L1047–L1052.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent selective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Kuhn, R. (1990). Statistical mechanics of networks of analogue neurons. In L. Garrido (Ed.), Statistical mechanics of neural networks. Berlin: Springer-Verlag.
La Camera, G. (1999). Apprendimento di stimoli sovrapposti in una rete di neuroni impulsanti [Learning of overlapping stimuli in a network of spiking neurons]. Unpublished thesis, Università di Roma "La Sapienza."
Miller, E., Erickson, C. A., & Desimone, R. (1996). Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J. Neuroscience, 16, 5154–5167.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Mongillo, G., Amit, D., & Brunel, N. (2003). Retrospective and prospective persistent activity induced by Hebbian learning in a recurrent cortical network. European Journal of Neuroscience, 18, 2011–2024.
Moreno, R., & Parga, N. (2004). Response of a leaky integrate-and-fire neuron to white noise input filtered by synapses with an arbitrary time constant. Neurocomputing, 58–60, 197–202.
Nadal, J. P., Toulouse, G., Changeux, J. P., & Dehaene, S. (1986). Networks of formal neurons and memory palimpsests. Europhysics Letters, 1, 535–542.
Rauch, A., La Camera, G., Lüscher, H. R., Senn, W., & Fusi, S. (2003). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo–like input currents. J. Neurophysiol., 90, 1598–1612.
Ricciardi, L. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Sakai, K., & Miyashita, Y. (1991). Neural organization for the long-term memory of paired associates. Nature, 354, 152–155.
Sompolinsky, H. (1987). The theory of neural networks: The Hebb rule and beyond. In L. van Hemmen & I. Morgenstern (Eds.), Heidelberg colloquium on glassy dynamics. Heidelberg: Springer-Verlag.
Tamura, H., Kaneko, H., Kawasaky, K., & Fujita, I. (2004). Presumed inhibitory neurons in the macaque inferior temporal cortex: Visual response properties and functional interactions with adjacent neurons. J. Neurophysiol., 91, 2782–2796.
Tsodyks, M. V., & Feigelman, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
Weisbuch, G., & Fogelman-Soulié, F. (1985). Scaling laws for the attractors of Hopfield networks. J. Physique Lett., 46, 623–630.
Willshaw, D., Buneman, O. P., & Longuet-Higgins, H. (1969). Nonholographic associative memory. Nature (London), 222, 960–962.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24.

Received January 7, 2004; accepted June 3, 2004.
LETTER
Communicated by Erkki Oja
Canonical Correlation Analysis: An Overview with Application to Learning Methods

David R. Hardoon
[email protected]

Sandor Szedmak
[email protected]

John Shawe-Taylor
[email protected]

School of Electronics and Computer Science, Image, Speech and Intelligent Systems Research Group, University of Southampton, Southampton SO17 1BJ, U.K.
We present a general method using kernel canonical correlation analysis to learn a semantic representation of web images and their associated text. The semantic space provides a common representation and enables a comparison between the text and images. In the experiments, we look at two approaches to retrieving images based only on their content from a text query. We compare orthogonalization approaches against a standard cross-representation retrieval technique known as the generalized vector space model.
1 Introduction

During recent years, there have been advances in data learning using kernel methods. Kernel representations offer an alternative route to learning nonlinear functions: the data are projected into a high-dimensional feature space, which increases the computational power of linear learning machines, though this still leaves unresolved the issue of how best to choose the features or the kernel function in ways that will improve performance. We review some of the methods that have been developed for learning the feature space.

Principal component analysis (PCA) is a multivariate data analysis procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables known as principal components. PCA makes use of only the training inputs while making no use of the labels.

Independent component analysis (ICA), in contrast to correlation-based transformations such as PCA, not only decorrelates the signals but also reduces higher-order statistical dependencies, attempting to make the signals as independent as possible. In other words, ICA is a way of finding a linear, not only orthogonal, coordinate system in any multivariate data. The

Neural Computation 16, 2639–2664 (2004) © 2004 Massachusetts Institute of Technology
directions of the axes of this coordinate system are determined by both the second- and higher-order statistics of the original data. The goal is to perform a linear transform that makes the resulting variables as statistically independent from each other as possible.

Partial least squares (PLS) is a method similar to canonical correlation analysis. It selects feature directions that are useful for the task at hand, though PLS uses only one view of an object and the label as the corresponding pair. PLS could be thought of as a method that looks for directions that are good at distinguishing the different labels.

Canonical correlation analysis (CCA) is a method of correlating linear relationships between two multidimensional variables. CCA can be seen as using complex labels as a way of guiding feature selection toward the underlying semantics. CCA makes use of two views of the same semantic object to extract a representation of the semantics. Proposed by Hotelling in 1936, CCA can be seen as the problem of finding basis vectors for two sets of variables such that the correlations between the projections of the variables onto these basis vectors are mutually maximized. In an attempt to increase the flexibility of the feature selection, kernelization of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA has been applied in preliminary work by Fyfe and Lai (2001) and Akaho (2001), and more recently by Vinokourov, Shawe-Taylor, and Cristianini (2002), with improved results.

During recent years, there has been a vast increase in the amount of multimedia content available both off-line and online, though we are unable to access or make use of these data unless they are organized in such a way as to allow efficient browsing. To enable content-based retrieval with no reference to labeling, we attempt to learn the semantic representation of images and their associated text.
We present a general approach using KCCA that can be used for content-based retrieval (Hardoon & Shawe-Taylor, 2003) and mate-based retrieval (Vinokourov, Hardoon, & Shawe-Taylor, 2003; Hardoon & Shawe-Taylor, 2003). In both cases, we compare the KCCA approach to the generalized vector space model (GVSM), which aims at capturing some term-term correlations by looking at co-occurrence information.

This study aims to serve as a tutorial and to make several additional novel contributions. We follow the work of Borga (1999) in representing the eigenproblem as two eigenvalue equations, as this allows us to reduce the computation time and dimensionality of the eigenvectors. Further to that, we follow the idea of Bach and Jordan (2002) to compute a new correlation matrix with reduced dimensionality. Though Bach and Jordan address a very different problem, they use the same underlying technique of Cholesky decomposition to re-represent the kernel matrices. We show that using partial Gram-Schmidt orthogonalization (Cristianini, Shawe-Taylor, & Lodhi, 2001) is equivalent to incomplete Cholesky decomposition, in the sense that incomplete Cholesky decomposition can be seen
as a dual implementation of partial Gram-Schmidt. We show that the general approach can be adapted to two different types of problems, content and mate retrieval, by changing only the selection of eigenvectors used in the semantic projection. And to simplify the learning of the KCCA, we explore a method of selecting the regularization parameter a priori such that it gives a value that performs well in several different tasks.

In this study, we also present a generalization of the framework for canonical correlation analysis. Our approach is based on the work of Gifi (1990) and Kettenring (1971). The purpose of the generalization is to extend the canonical correlation as an associativity measure between two sets of variables to more than two sets, while preserving most of its properties. The generalization starts with the optimization problem formulation of canonical correlation. By changing the objective function, we will arrive at the multiset problem. Applying similar constraint sets in the optimization problems, we find that the feasible solutions are singular vectors of matrices, which are derived in the same way for the original and generalized problems.

In section 2, we present the theoretical background of CCA. In section 3, we present the CCA and KCCA algorithms. Approaches to deal with the computational problems that arose in section 3 are presented in section 4. Our experimental results are presented in section 5, and section 6 draws final conclusions.

2 Theoretical Foundations

Proposed by Hotelling in 1936, canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximized.
Correlation analysis is dependent on the coordinate system in which the variables are described, so even if there is a very strong linear relationship between two sets of multidimensional variables, this relationship might not be visible as a correlation, depending on the coordinate system used. Canonical correlation analysis seeks a pair of linear transformations, one for each of the sets of variables, such that when the sets of variables are transformed, the corresponding coordinates are maximally correlated.

Definition 1. ⟨·, ·⟩ denotes the Euclidean inner product of the vectors x, y, and it is equal to x'y, where we use A' to denote the transpose of a vector or matrix A.

Consider a multivariate random vector of the form (x, y). Suppose we are given a sample of instances S = ((x_1, y_1), …, (x_n, y_n)) of (x, y). Let S_x denote (x_1, …, x_n) and similarly S_y denote (y_1, …, y_n). We can consider defining a new coordinate for x by choosing a direction w_x and projecting
x onto that direction, x → ⟨w_x, x⟩. If we do the same for y by choosing a direction w_y, we obtain a sample of the new x coordinate. Let

S_{x,w_x} = (⟨w_x, x_1⟩, …, ⟨w_x, x_n⟩),

with the corresponding values of the new y coordinate being

S_{y,w_y} = (⟨w_y, y_1⟩, …, ⟨w_y, y_n⟩).

The first stage of canonical correlation is to choose w_x and w_y to maximize the correlation between the two vectors. In other words, the function to be maximized is

ρ = max_{w_x, w_y} corr(S_x w_x, S_y w_y) = max_{w_x, w_y} ⟨S_x w_x, S_y w_y⟩ / (‖S_x w_x‖ ‖S_y w_y‖).
If we use Ê[f(x, y)] to denote the empirical expectation of the function f(x, y), where

Ê[f(x, y)] = (1/m) Σ_{i=1}^{m} f(x_i, y_i),

we can rewrite the correlation expression as

ρ = max_{w_x, w_y} Ê[⟨w_x, x⟩⟨w_y, y⟩] / sqrt( Ê[⟨w_x, x⟩^2] Ê[⟨w_y, y⟩^2] )
  = max_{w_x, w_y} Ê[w_x' x y' w_y] / sqrt( Ê[w_x' x x' w_x] Ê[w_y' y y' w_y] ).

It follows that

ρ = max_{w_x, w_y} (w_x' Ê[x y'] w_y) / sqrt( (w_x' Ê[x x'] w_x)(w_y' Ê[y y'] w_y) ).
Now observe that the covariance matrix of (x, y) is

C(x, y) = Ê[ (x, y)(x, y)' ] = ( C_{xx}  C_{xy}
                                 C_{yx}  C_{yy} ) = C.  (2.1)
The total covariance matrix C is a block matrix where the within-sets covariance matrices are C_{xx} and C_{yy} and the between-sets covariance matrices are C_{xy} = C_{yx}', although equation 2.1 is the covariance matrix only in the zero-mean case. Hence, we can rewrite the function ρ as

ρ = max_{w_x, w_y} (w_x' C_{xy} w_y) / sqrt( (w_x' C_{xx} w_x)(w_y' C_{yy} w_y) ).  (2.2)
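Equation 2.2 can be checked empirically: for zero-mean data, the sample correlation of the projections S_x w_x and S_y w_y coincides with the matrix quotient built from the empirical covariance blocks. A minimal numpy sketch (the data and directions are random, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 500, 3, 2
X = rng.standard_normal((n, dx)); X -= X.mean(axis=0)   # zero-mean, as equation 2.1 assumes
Y = rng.standard_normal((n, dy)); Y -= Y.mean(axis=0)

Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
wx, wy = rng.standard_normal(dx), rng.standard_normal(dy)

# Left side: correlation of the projected samples S_x w_x and S_y w_y.
lhs = np.corrcoef(X @ wx, Y @ wy)[0, 1]
# Right side: the quotient of equation 2.2 built from the covariance blocks.
rhs = (wx @ Cxy @ wy) / np.sqrt((wx @ Cxx @ wx) * (wy @ Cyy @ wy))
print(lhs, rhs)
```

The normalization (1/n here) cancels between numerator and denominator, so the two sides agree up to floating-point error.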
The maximum canonical correlation is the maximum of ρ with respect to w_x and w_y.

3 Algorithm

In this section, we give an overview of the CCA and KCCA algorithms, where we formulate the optimization problem as a standard eigenproblem.

3.1 Canonical Correlation Analysis. Observe that the solution of equation 2.2 is not affected by rescaling w_x or w_y, either together or independently, so that, for example, replacing w_x by αw_x gives the quotient

(α w_x' C_{xy} w_y) / sqrt( α^2 (w_x' C_{xx} w_x)(w_y' C_{yy} w_y) ) = (w_x' C_{xy} w_y) / sqrt( (w_x' C_{xx} w_x)(w_y' C_{yy} w_y) ).
Since the choice of rescaling is therefore arbitrary, the CCA optimization problem formulated in equation 2.2 is equivalent to maximizing the numerator subject to wx Cxx wx = 1
wy Cyy wy = 1. The corresponding Lagrangian is λy λx (wx Cxx wx − 1) − (wy Cyy wy − 1). 2 2 Taking derivatives in respect to wx and wy , we obtain L(λ, wx , wy ) = wx Cxy wy −
∂L = Cxy wy − λx Cxx wx = 0 ∂wx ∂L = Cyx wx − λy Cyy wy = 0. ∂wy
(3.1) (3.2)
Subtracting w_y' times the second equation from w_x' times the first, we have

0 = w_x' C_{xy} w_y − λ_x w_x' C_{xx} w_x − w_y' C_{yx} w_x + λ_y w_y' C_{yy} w_y = λ_y w_y' C_{yy} w_y − λ_x w_x' C_{xx} w_x,
which, together with the constraints, implies that λ_y − λ_x = 0; let λ = λ_x = λ_y. Assuming C_{yy} is invertible, we have

w_y = C_{yy}^{-1} C_{yx} w_x / λ,  (3.3)

and so substituting in equation 3.1 gives

C_{xy} C_{yy}^{-1} C_{yx} w_x = λ^2 C_{xx} w_x.  (3.4)
We are left with a generalized eigenproblem of the form $Ax = \lambda Bx$. We can therefore find the coordinate system that optimizes the correlation between corresponding coordinates by first solving for the generalized eigenvectors of equation 3.4 to obtain the sequence of $w_x$'s and then using equation 3.3 to find the corresponding $w_y$'s.

If $C_{xx}$ is invertible, we can formulate equation 3.4 as a standard eigenproblem of the form $B^{-1}Ax = \lambda x$, although to ensure a symmetric standard eigenproblem, we do the following. As the covariance matrices $C_{xx}$ and $C_{yy}$ are symmetric positive definite, we can decompose them using a complete Cholesky decomposition, $C_{xx} = R_{xx} R_{xx}^T$, where $R_{xx}$ is a lower triangular matrix. If we let $u_x = R_{xx}^T w_x$, we can rewrite equation 3.4 as follows:

$$C_{xy} C_{yy}^{-1} C_{yx} R_{xx}^{-T} u_x = \lambda^2 R_{xx} u_x$$

$$R_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} R_{xx}^{-T} u_x = \lambda^2 u_x.$$
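As a concrete illustration, the derivation above (whiten with a Cholesky factor, solve a symmetric eigenproblem, map back through equation 3.3) can be sketched in NumPy. The function name `cca_primal`, the small ridge term added for numerical invertibility, and the toy data are our own illustrative choices, not part of the original text:

```python
import numpy as np

def cca_primal(X, Y, reg=1e-8):
    """Primal CCA sketch: X, Y are (n_samples, dx), (n_samples, dy), centered.

    Returns canonical correlations rho and weight matrices Wx, Wy.
    """
    n = X.shape[0]
    # Covariance matrices; a tiny ridge keeps Cxx, Cyy invertible.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Cholesky factor Cxx = Rx Rx', used to whiten the problem.
    Rx = np.linalg.cholesky(Cxx)
    # Symmetric eigenproblem: Rx^{-1} Cxy Cyy^{-1} Cyx Rx^{-T} u = rho^2 u.
    M = np.linalg.solve(Rx, Cxy) @ np.linalg.solve(Cyy, Cxy.T) @ np.linalg.inv(Rx).T
    rho2, U = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(rho2)[::-1]
    rho = np.sqrt(np.clip(rho2[order], 0, 1))
    Wx = np.linalg.solve(Rx.T, U[:, order])      # w_x = Rx^{-T} u
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx) / rho  # w_y from equation 3.3
    return rho, Wx, Wy

# Toy check: Y is a linear map of X plus small noise, so the top correlation
# should be close to 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
Y = X @ rng.standard_normal((3, 3)) + 0.01 * rng.standard_normal((500, 3))
rho, Wx, Wy = cca_primal(X - X.mean(0), Y - Y.mean(0))
```

Projecting the two views onto the leading pair of directions and correlating the projections recovers the leading canonical correlation.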
We are therefore left with a symmetric standard eigenproblem of the form $Ax = \lambda x$.

3.2 Generalization of the Canonical Correlation. In this section, we show a possible framework for generalizing the canonical correlation to more than two samples, and we show the relationship between the canonical correlation and the minimization of total distance. (For details, see Gifi, 1990, and Ketterling, 1971.)

3.2.1 The Simultaneous Formulation of the Canonical Correlation. Instead of using the successive formulation of the canonical correlation, we can join the subproblems into one. We use subscripts 1, 2, … instead of x, y to denote more than two classes. The matrices $W^{(1)}$ and $W^{(2)}$ contain the vectors $(w_1^{(1)}, \ldots, w_p^{(1)})$ and $(w_1^{(2)}, \ldots, w_p^{(2)})$. The simultaneous formulation
is the optimization problem, assuming p iteration steps:

$$\max_{(w_1^{(1)},\ldots,w_p^{(1)}),\,(w_1^{(2)},\ldots,w_p^{(2)})} \; \sum_{i=1}^{p} w_i^{(1)T} C_{12} w_i^{(2)} \tag{3.5}$$

$$\text{s.t.} \quad w_i^{(1)T} C_{11} w_j^{(1)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad w_i^{(2)T} C_{22} w_j^{(2)} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad i, j = 1, \ldots, p,$$

$$w_i^{(1)T} C_{12} w_j^{(2)} = 0, \qquad i, j = 1, \ldots, p, \; j \neq i.$$

Based on equation 3.5, we can give a compact form to this optimization:

$$\max_{W^{(1)}, W^{(2)}} \; \operatorname{Tr}\!\left( W^{(1)T} C_{12} W^{(2)} \right) \tag{3.6}$$

$$\text{s.t.} \quad W^{(k)T} C_{kk} W^{(k)} = I, \quad w_i^{(k)T} C_{kl} w_j^{(l)} = 0, \qquad k, l \in \{1, 2\}, \; l \neq k, \; i, j = 1, \ldots, p, \; j \neq i,$$
where I is the p × p identity matrix. The canonical correlation problem can be transformed into a distance minimization problem, where the distance between two matrices is measured by the Frobenius norm:

$$\min_{W^{(1)}, W^{(2)}} \; \left\| S^{(1)} W^{(1)} - S^{(2)} W^{(2)} \right\|_F \tag{3.7}$$

$$\text{s.t.} \quad W^{(k)T} C_{kk} W^{(k)} = I, \quad w_i^{(k)T} C_{kl} w_j^{(l)} = 0, \qquad k, l \in \{1, 2\}, \; l \neq k, \; i, j = 1, \ldots, p, \; j \neq i.$$
Unfolding the objective function of equation 3.7 shows that this optimization problem is the same as equation 3.6. Exploiting the distance formulation, we can give a generalization of the canonical correlation to more than two known samples. Suppose we are given a set of samples in matrix form, $\{S^{(1)}, \ldots, S^{(K)}\}$, with dimensions $m \times n_1, \ldots, m \times n_K$. We are looking for linear combinations of the columns of these matrices, in the matrix form $W^{(1)}, \ldots, W^{(K)}$, such that they give the optimum solution
of the problem:

$$\min_{W^{(1)}, \ldots, W^{(K)}} \; \sum_{k,l=1,\,k \neq l}^{K} \left\| S^{(k)} W^{(k)} - S^{(l)} W^{(l)} \right\|_F \tag{3.8}$$

$$\text{s.t.} \quad W^{(k)T} C_{kk} W^{(k)} = I, \quad w_i^{(k)T} C_{kl} w_j^{(l)} = 0, \qquad k, l = 1, \ldots, K, \; l \neq k, \; i, j = 1, \ldots, p, \; j \neq i.$$

Unfolding the objective function, we have the sum of the squared Euclidean distances between all pairs of column vectors of the matrices $S^{(k)} W^{(k)}$, $k = 1, \ldots, K$. One can show that this problem can be solved using singular value decomposition for arbitrary K.

3.3 Kernel Canonical Correlation Analysis. CCA may not extract useful descriptors of the data because of its linearity. Kernel CCA offers an alternative solution by first projecting the data into a higher-dimensional feature space (Cristianini & Shawe-Taylor, 2000),

$$\phi: x = (x_1, \ldots, x_m) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x)) \quad (m < N),$$

before performing CCA in the new feature space, essentially moving from the primal to the dual representation. Kernels implicitly map data into a higher-dimensional feature space, a method known as the kernel trick. A kernel is a function K such that for all $x, z \in X$,

$$K(x, z) = \langle \phi(x), \phi(z) \rangle, \tag{3.9}$$

where $\phi$ is a mapping from X to a feature space F. Kernels offer a great deal of flexibility, as they can be generated from other kernels. In the kernel formulation, the data appear only through entries in the Gram matrix; this gives a further advantage, as the number of tunable parameters and the updating time do not depend on the number of attributes being used.

Using the definition of the covariance matrix in equation 2.1, we can rewrite the covariance matrices using the data matrices X and Y, whose rows are the sample vectors and which are therefore of size $m \times N$; we obtain

$$C_{xx} = X^T X, \qquad C_{xy} = X^T Y.$$

The directions $w_x$ and $w_y$ (of length N) can be rewritten as projections of the data onto the dual directions $\alpha$ and $\beta$ (of length m):

$$w_x = X^T \alpha, \qquad w_y = Y^T \beta.$$
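The identity underlying this dual representation, namely that the objective written in terms of $w_x = X^T\alpha$ and $w_y = Y^T\beta$ depends on the data only through the Gram matrices $K_x = XX^T$ and $K_y = YY^T$, can be checked numerically. The variable names and toy data below are our own illustration:

```python
import numpy as np

# With wx = X' a and wy = Y' b, the primal numerator wx' (X'Y) wy equals the
# dual numerator a' Kx Ky b, where Kx = X X' and Ky = Y Y' are Gram matrices.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
Y = rng.standard_normal((50, 6))
a = rng.standard_normal(50)
b = rng.standard_normal(50)

Kx, Ky = X @ X.T, Y @ Y.T      # linear-kernel Gram matrices (m x m)
wx, wy = X.T @ a, Y.T @ b      # primal directions induced by the dual ones

primal = wx @ (X.T @ Y) @ wy   # wx' Cxy wy, up to the 1/m normalization
dual = a @ Kx @ Ky @ b         # alpha' Kx Ky beta
```

The two scalars agree exactly, which is why the optimization can be carried out entirely over the dual variables α and β.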
Substituting into equation 2.2, we obtain

$$\rho = \max_{\alpha,\beta} \frac{\alpha^T X X^T Y Y^T \beta}{\sqrt{\alpha^T X X^T X X^T \alpha \cdot \beta^T Y Y^T Y Y^T \beta}}. \tag{3.10}$$

Let $K_x = X X^T$ and $K_y = Y Y^T$ be the kernel matrices corresponding to the two representations. Substituting into equation 3.10,

$$\rho = \max_{\alpha,\beta} \frac{\alpha^T K_x K_y \beta}{\sqrt{\alpha^T K_x^2 \alpha \cdot \beta^T K_y^2 \beta}}. \tag{3.11}$$
We find that in equation 3.11, the variables are now represented in the dual form. Observe that, as with the primal form presented in equation 2.2, equation 3.11 is not affected by rescaling α and β either together or independently. Hence, the KCCA optimization problem formulated in equation 3.11 is equivalent to maximizing the numerator subject to

$$\alpha^T K_x^2 \alpha = 1, \qquad \beta^T K_y^2 \beta = 1.$$

The corresponding Lagrangian is

$$L(\lambda, \alpha, \beta) = \alpha^T K_x K_y \beta - \frac{\lambda_\alpha}{2}\left(\alpha^T K_x^2 \alpha - 1\right) - \frac{\lambda_\beta}{2}\left(\beta^T K_y^2 \beta - 1\right).$$

Taking derivatives with respect to α and β, we obtain

$$\frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha K_x^2 \alpha = 0 \tag{3.12}$$

$$\frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta K_y^2 \beta = 0. \tag{3.13}$$

Subtracting $\beta^T$ times equation 3.13 from $\alpha^T$ times equation 3.12, we have

$$0 = \alpha^T K_x K_y \beta - \lambda_\alpha \alpha^T K_x^2 \alpha - \beta^T K_y K_x \alpha + \lambda_\beta \beta^T K_y^2 \beta = \lambda_\beta \beta^T K_y^2 \beta - \lambda_\alpha \alpha^T K_x^2 \alpha,$$

which together with the constraints implies that $\lambda_\alpha - \lambda_\beta = 0$; let $\lambda = \lambda_\alpha = \lambda_\beta$. Considering the case where the kernel matrices $K_x$ and $K_y$ are invertible, we have

$$\beta = \frac{K_y^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{K_y^{-1} K_x \alpha}{\lambda}.$$
2648
D. Hardoon, S. Szedmak, and J. Shawe-Taylor
Substituting in equation 3.12, we obtain

$$K_x K_y K_y^{-1} K_x \alpha - \lambda^2 K_x K_x \alpha = 0.$$

Hence

$$K_x K_x \alpha - \lambda^2 K_x K_x \alpha = 0,$$

or

$$I\alpha = \lambda^2 \alpha. \tag{3.14}$$

We are left with a standard eigenproblem of the form $Ax = \lambda x$. We can deduce from equation 3.14 that $\lambda = 1$ for every vector α; hence, we can choose the projections α to be the unit vectors $j_i$, $i = 1, \ldots, m$, while the β are the columns of $\frac{1}{\lambda} K_y^{-1} K_x$. Hence, when $K_x$ or $K_y$ is invertible, perfect correlation can be formed. Since kernel methods provide high-dimensional representations, such independence is not uncommon. It is therefore clear that a naive application of CCA in a kernel-defined feature space will not provide useful results: these correlations fail to distinguish spurious features from those capturing the underlying semantics, highlighting the problem of overfitting that arises in high-dimensional feature spaces. In the next section, we investigate how this problem can be avoided.

4 Computational Issues

We observe from equation 3.14 that if $K_x$ is invertible, maximal correlation is obtained, suggesting learning is trivial. To force nontrivial learning, we introduce a control on the flexibility of the projections by penalizing the norms of the associated weight vectors via a convex combination of constraints based on partial least squares. Another computational issue is the use of large training sets, which can lead to computational problems and degeneracy. To overcome this, we apply partial Gram-Schmidt orthogonalization (equivalently, incomplete Cholesky decomposition) to reduce the dimensionality of the kernel matrices.

4.1 Regularization. To force nontrivial learning of the correlation by controlling the problem of overfitting (and hence avoiding nonrelevant correlations), we introduce a control on the flexibility of the projection mappings, using partial least squares (PLS) to penalize the norms of the associated weights. We convexly combine the PLS term with the KCCA term in the
denominator of equation 3.11, obtaining

$$\rho = \max_{\alpha,\beta} \frac{\alpha^T K_x K_y \beta}{\sqrt{\left(\alpha^T K_x^2 \alpha + \kappa \|w_x\|^2\right) \cdot \left(\beta^T K_y^2 \beta + \kappa \|w_y\|^2\right)}} = \max_{\alpha,\beta} \frac{\alpha^T K_x K_y \beta}{\sqrt{\left(\alpha^T K_x^2 \alpha + \kappa \alpha^T K_x \alpha\right) \cdot \left(\beta^T K_y^2 \beta + \kappa \beta^T K_y \beta\right)}}.$$

We observe that the new regularized equation is not affected by rescaling α or β; hence, the optimization problem is subject to

$$\alpha^T K_x^2 \alpha + \kappa \alpha^T K_x \alpha = 1, \qquad \beta^T K_y^2 \beta + \kappa \beta^T K_y \beta = 1.$$

Following the same approach as in section 3.3, and assuming the kernel matrices are invertible, we find that

$$K_x K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 K_x (K_x + \kappa I) \alpha$$

$$K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 (K_x + \kappa I) \alpha$$

$$(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 \alpha.$$

We obtain a standard eigenproblem of the form $Ax = \lambda x$.

4.2 Incomplete Cholesky Decomposition. Complete decomposition of a kernel matrix (Golub & Loan, 1983) is an expensive step and should be avoided with real-world data. Incomplete Cholesky decomposition, as described in Bach and Jordan (2002), differs from Cholesky decomposition in that all pivots below a certain threshold are skipped. If M is the number of nonskipped pivots, we obtain a lower triangular matrix $G_i$ with only M nonzero columns. Symmetric permutations of rows and columns are necessary during the factorization if we require the rank to be as small as possible (Golub & Loan, 1983).

In algorithm 1, we describe the algorithm from Bach and Jordan (2002) (with slight modification). The algorithm involves picking one column of K at a time, choosing the column to be added by greedily maximizing a lower bound on the reduction in the error of the approximation. After l steps, we have an approximation of the form $\tilde{K}_l = G_{il} G_{il}^T$, where $G_{il}$ is $N \times l$. The ranking of the remaining N − l vectors can be computed by comparing the diagonal elements of the remainder matrix $K - G_{il} G_{il}^T$.

4.3 Partial Gram-Schmidt Orthogonalization. We use the partial Gram-Schmidt orthogonalization (PGSO) algorithm, described in Cristianini, Shawe-Taylor, and Lodhi (2001), as our matrix decomposition approach. ICD can be seen
Algorithm 1: Pseudocode for ICD

Input: matrix K of size N × N, precision parameter η
Initialization: i = 1, K′ = K, P = I; for j ∈ [1, N]: G_jj = K_jj
while Σ_{j=1}^{N} G_jj > η and i ≠ N + 1:
    Find best new element: j* = argmax_{j ∈ [i, N]} G_jj
    Update j* = (j* + i) − 1
    Update permutation P: P_next = I; P_next[i, i] = 0; P_next[j*, j*] = 0; P_next[i, j*] = 1; P_next[j*, i] = 1; P = P · P_next
    Permute elements i and j* in K′: K′ = P_next · K′ · P_next
    Update (due to the new permutation) the already calculated elements of G: G_{i, 1:i−1} ↔ G_{j*, 1:i−1}
    Permute elements (j*, j*) and (i, i) of G: G(i, i) ↔ G(j*, j*)
    Set G_ii = sqrt(G_ii)
    Calculate the ith column of G: G_{i+1:n, i} = (1 / G_ii)(K′_{i+1:n, i} − Σ_{j=1}^{i−1} G_{i+1:n, j} G_{ij})
    Update only the diagonal elements: for j ∈ [i + 1, N]: G_jj = K_jj − Σ_{k=1}^{i} G_jk²
    Update i = i + 1
end while
M = i
Output: N × M lower triangular matrix G and a permutation matrix P such that ‖P^T K P − G G^T‖ ≤ η.
as equivalent to PGSO, as ICD is the dual implementation of PGSO. PGSO works as follows: the projection is built up as the span of a subset of the projections of a set of m training examples. These are selected by performing a Gram-Schmidt orthogonalization of the training vectors in the feature space. We slightly modify the Gram-Schmidt algorithm so that it uses a precision parameter as a stopping criterion, as shown in Bach and Jordan (2002). The PGSO pseudocode is shown in algorithm 2, and the pseudocode to classify a new example is given in algorithm 3. The advantage of PGSO over incomplete Cholesky decomposition (as described in section 4.2) is that there is no need for a permutation matrix P.

4.4 Kernel CCA with PGSO. So far, we have treated the kernel matrices as invertible, although in practice this may not be the case. In this section, we address the issue of large training sets, which may lead to computational problems and degeneracy. We use PGSO to approximate the kernel matrices so that we can re-represent the correlation with reduced dimensionality.
Algorithm 2: Pseudocode for PGSO

Input: kernel matrix K (N × N) from a training set, precision parameter η
Initialization: j = 1; size and index are vectors of length N; feat is an N × N zero matrix
for i = 1 to N: norm2[i] = K_ii
while Σ_i norm2[i] > η and j ≠ N + 1:
    i_j = argmax_i norm2[i]; index[j] = i_j; size[j] = sqrt(norm2[i_j])
    for i = 1 to N:
        feat[i, j] = (K(d_i, d_{i_j}) − Σ_{t=1}^{j−1} feat[i, t] · feat[i_j, t]) / size[j]
        norm2[i] = norm2[i] − feat(i, j) · feat(i, j)
    end for
    j = j + 1
end while
M = j
return feat
Output: ‖K − feat · feat^T‖ ≤ η, where feat is an N × M lower triangular matrix
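A minimal Python sketch of the greedy pivoting idea in algorithm 2 (equivalently, incomplete Cholesky on the kernel matrix) follows. The function name and toy data are ours, and the permutation bookkeeping is folded into an index list rather than an explicit matrix P:

```python
import numpy as np

def pgso(K, eta):
    """Partial Gram-Schmidt / incomplete Cholesky sketch.

    Greedily adds pivot columns until the trace of the residual K - G G'
    drops below the precision eta. Returns G (N x M) with K ~= G G',
    plus the list of chosen pivot indices.
    """
    N = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal (norm2 in alg. 2)
    G = np.zeros((N, 0))
    idx = []
    while d.sum() > eta and len(idx) < N:
        i = int(np.argmax(d))             # best new pivot
        idx.append(i)
        # New column: residual of K[:, i] w.r.t. previous columns, normalized.
        col = (K[:, i] - G @ G[i]) / np.sqrt(d[i])
        G = np.column_stack([G, col])
        d = d - col ** 2                  # update the residual diagonal
        d[d < 0] = 0.0                    # guard against round-off
    return G, idx

# Toy check on a rank-5 Gram matrix: the residual should vanish after 5 pivots.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
K = X @ X.T
G, idx = pgso(K, eta=1e-8)
```

On exactly low-rank kernels the loop stops after rank-many pivots, which is the dimensionality reduction exploited in section 4.4.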
Algorithm 3: Pseudocode to Classify a New Example at Location i

Input: kernel row K from a testing set
for j = 1 to M:
    newfeat[j] = (K_{i, index[j]} − Σ_{t=1}^{j−1} newfeat[t] · feat[index[j], t]) / size[j]
end for

Decomposing the kernel matrices $K_x$ and $K_y$ via PGSO, where each $R$ is a lower triangular matrix, gives

$$K_x \approx R_x R_x^T, \qquad K_y \approx R_y R_y^T.$$

Substituting this representation into equations 3.12 and 3.13 and multiplying the first equation by $R_x^T$ and the second by $R_y^T$ gives

$$R_x^T R_x R_x^T R_y R_y^T \beta - \lambda R_x^T R_x R_x^T R_x R_x^T \alpha = 0 \tag{4.1}$$

$$R_y^T R_y R_y^T R_x R_x^T \alpha - \lambda R_y^T R_y R_y^T R_y R_y^T \beta = 0. \tag{4.2}$$
Let Z denote the correlation matrices with reduced dimensionality:

$$Z_{xx} = R_x^T R_x, \qquad Z_{yy} = R_y^T R_y, \qquad Z_{xy} = R_x^T R_y, \qquad Z_{yx} = R_y^T R_x.$$

Let $\tilde\alpha$ and $\tilde\beta$ be the reduced directions, such that

$$\tilde\alpha = R_x^T \alpha, \qquad \tilde\beta = R_y^T \beta.$$

Substituting in equations 4.1 and 4.2, we find that we return to the primal representation of CCA with a dual representation of the data:

$$Z_{xx} Z_{xy} \tilde\beta - \lambda Z_{xx}^2 \tilde\alpha = 0$$

$$Z_{yy} Z_{yx} \tilde\alpha - \lambda Z_{yy}^2 \tilde\beta = 0.$$

Assume that $Z_{xx}$ and $Z_{yy}$ are invertible. Multiplying the first equation by $Z_{xx}^{-1}$ and the second by $Z_{yy}^{-1}$ gives

$$Z_{xy} \tilde\beta - \lambda Z_{xx} \tilde\alpha = 0 \tag{4.3}$$

$$Z_{yx} \tilde\alpha - \lambda Z_{yy} \tilde\beta = 0. \tag{4.4}$$

We can rewrite $\tilde\beta$ from equation 4.4 as

$$\tilde\beta = \frac{Z_{yy}^{-1} Z_{yx} \tilde\alpha}{\lambda},$$

and substituting in equation 4.3 gives

$$Z_{xy} Z_{yy}^{-1} Z_{yx} \tilde\alpha = \lambda^2 Z_{xx} \tilde\alpha. \tag{4.5}$$

We are left with a generalized eigenproblem of the form $Ax = \lambda Bx$. Let $SS^T$ be the complete Cholesky decomposition of $Z_{xx}$, where S is a lower triangular matrix, and let $\hat\alpha = S^T \tilde\alpha$. Substituting in equation 4.5, we obtain

$$S^{-1} Z_{xy} Z_{yy}^{-1} Z_{yx} S^{-T} \hat\alpha = \lambda^2 \hat\alpha.$$

We now have a symmetric standard eigenproblem of the form $Ax = \lambda x$.
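The reduced-dimensionality eigenproblem just derived, with the regularization parameter κ of section 4.1 included (κ = 0 recovers the unregularized case), might be sketched as follows. The function name and toy data are our own illustrative choices:

```python
import numpy as np

def kcca_reduced(Rx, Ry, kappa):
    """Regularized KCCA on low-rank factors (sketch; notation follows the text).

    Rx, Ry: N x Mx, N x My factors with Kx ~= Rx Rx', Ky ~= Ry Ry'.
    Solves S^{-1} Zxy (Zyy + kI)^{-1} Zyx S^{-T} a_hat = rho^2 a_hat,
    where S S' = Zxx + kI, and returns rho and tilde-alpha.
    """
    Zxx, Zyy = Rx.T @ Rx, Ry.T @ Ry
    Zxy = Rx.T @ Ry
    S = np.linalg.cholesky(Zxx + kappa * np.eye(Zxx.shape[0]))
    A = np.linalg.solve(S, Zxy) @ np.linalg.solve(
        Zyy + kappa * np.eye(Zyy.shape[0]), Zxy.T) @ np.linalg.inv(S).T
    rho2, U = np.linalg.eigh((A + A.T) / 2)     # symmetric standard eigenproblem
    order = np.argsort(rho2)[::-1]
    alpha_tilde = np.linalg.solve(S.T, U[:, order])   # tilde-alpha = S^{-T} a_hat
    return np.sqrt(np.clip(rho2[order], 0, None)), alpha_tilde

# Toy check: with two identical views, kappa = 0 yields perfect correlation,
# and kappa > 0 damps the correlations below 1.
rng = np.random.default_rng(3)
R = rng.standard_normal((30, 4))
rho0, _ = kcca_reduced(R, R, kappa=0.0)
rho1, _ = kcca_reduced(R, R, kappa=1.0)
```

The identical-views check makes the overfitting discussion of section 3.3 concrete: without regularization the problem reports perfect (and here trivial) correlation.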
We combine the dimensionality reduction with the regularization parameter of section 4.1. Following the same approach, it is easy to show that we are left with a generalized eigenproblem of the form $Ax = \lambda Bx$:

$$Z_{xy} (Z_{yy} + \kappa I)^{-1} Z_{yx} \tilde\alpha = \lambda^2 (Z_{xx} + \kappa I) \tilde\alpha.$$

Performing a complete Cholesky decomposition of $Z_{xx} + \kappa I = SS^T$, where S is a lower triangular matrix, and letting $\hat\alpha = S^T \tilde\alpha$, we obtain

$$S^{-1} Z_{xy} (Z_{yy} + \kappa I)^{-1} Z_{yx} S^{-T} \hat\alpha = \lambda^2 \hat\alpha,$$

a symmetric standard eigenproblem of the form $Ax = \lambda x$.

5 Experimental Results

In the following experiments, we address the problem of learning the semantics of multimedia content by combining image and text data, using the kernel canonical correlation analysis described in section 4.4. We test the derived semantic space in an image retrieval task that uses only image content: the aim is to retrieve images from a text query, but without reference to any labeling associated with the images. This can be viewed as a cross-modal retrieval task. We used the combined multimedia image-text web database kindly provided by Kolenda, Hansen, Larsen, and Winther (2002), on which we attempt mate retrieval on a test set. The data were divided into three classes, Sport, Aviation, and Paintball (see Figure 1), with 400 records each, consisting of JPEG images retrieved from the Internet with attached text. We randomly split each class into two halves, used as training and test data. The extracted features were the same as in Kolenda et al. (2002), where a detailed description can be found: image HSV color, image Gabor texture, and term frequencies in the text. The first view ($K_x$) comprises a Gaussian kernel (with σ set to the minimum distance between images) on the image HSV color, added to a linear kernel on the Gabor texture. The second view ($K_y$) is a linear kernel on the term frequencies.

We compute the value of κ for the regularization by running KCCA with the association between image and text randomized. Let λ(κ) be the spectrum without randomization (the database with itself) and $\lambda_R(\kappa)$ be the spectrum with randomization (the database with a randomized version of itself), where by spectrum we mean the vector whose entries are the eigenvalues.
We would like the nonrandom spectrum to be as distant as possible from the randomized spectrum: if the same correlations occur for λ(κ) and $\lambda_R(\kappa)$, then overfitting clearly is taking place. Therefore, for κ = 0 (no regularization) and $j = (1, \ldots, 1)^T$ (the all-ones vector), we may expect $\lambda(\kappa) = \lambda_R(\kappa) = j$, since it is possible that the examples are linearly
Figure 1: Example of images in database.
independent. Although we find that only 50% of the examples are linearly independent, this does not affect the selection of κ by this method. We choose κ so that the difference between the spectrum of the randomized set and the true spectrum is maximal (in the two-norm):

$$\kappa = \arg\max_{\kappa} \|\lambda_R(\kappa) - \lambda(\kappa)\|.$$

We find κ = 7 and set, via a heuristic technique, the Gram-Schmidt precision parameter η = 0.5.

To perform the test image retrieval, we compute the features of the test images and text queries using the Gram-Schmidt algorithm. Once we have obtained the features for the test query (text) and the test images, we project them into the semantic feature space using $\tilde\beta$ and $\tilde\alpha$ (computed during training), respectively. We can then compare them using an inner product of the semantic feature vectors: the higher the value of the inner product, the more similar the two objects are. Hence, we retrieve the images whose inner products with the test query are highest.

We compared the performance of our method with a retrieval technique based on the generalized vector space model (GVSM). This uses as a semantic feature vector the vector of inner products between either a text query and each training label, or a test image and each training image. For both methods, we used a Gaussian kernel, with σ = (maximum distance)/20, for the image color component, and all experiments were averaged over 10 runs. For clarity, we review the content-based and mate-based approaches in separate sections: whereas in content-based retrieval we are interested in retrieving images of the same class (label), in mate-based retrieval we are interested in retrieving the exact matching image. Although one could argue about the relevance of the content-based retrieval experiment, given the lack of labels, we feel it is important in showing the duality and the possibilities of the approach.

Figure 2: Images retrieved for the text query: "height: 6-11 weight: 235 lbs position: forward born: september 18, 1968, split, croatia college: none."

5.1 Content-Based Retrieval. In this experiment, we used the first 30 and the first 5 $\tilde\alpha$ and $\tilde\beta$ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images whose semantic feature vectors have the closest inner products with the semantic feature vector of the chosen text. Retrieval is considered successful if the images contained in the set have the same label as the query text (see Figure 2 for a retrieval example with a set of five images). In Tables 1 and 2, we compare the performance of the kernel CCA algorithm and the generalized vector space model. In Table 1, we present the performance of the methods over 10 and 30 image sets, and in Table 2, as plotted in Figure 3, we see the overall performance of the KCCA method against the GVSM for image sets of size 1–200. At image set size 200, a maximum of 200 × 600 same-labeled images over all text queries can be retrieved (we have only 200 images per label). The success rate in
Table 1: Success Cross-Results Between Kernel CCA and the Generalized Vector Space Model.

Image Set    GVSM Success    KCCA Success (30)    KCCA Success (5)
10           78.93%          85%                  90.97%
30           76.82%          83.02%               90.69%

Table 2: Success Rate Over All Image Sets (1–200).

Method       Overall Success
GVSM         72.3%
KCCA (30)    79.12%
KCCA (5)     88.25%
Table 1 and Figure 3 is computed as follows:

$$\text{success \% for image set } i = \frac{\sum_{j=1}^{600} \sum_{k=1}^{i} \text{count}_k^j}{i \times 600} \times 100,$$

where $\text{count}_k^j = 1$ if image k in the set has the same label as text query j, and $\text{count}_k^j = 0$ otherwise. The success rate in Table 2 is computed as above and averaged over all image sets.

As visible in Figure 4, we observe that adding eigenvectors to the semantic projection reduces the success of content-based retrieval. We speculate that this may be the result of unnecessary detail in the higher semantic eigenvectors, the semantic information needed being contained in the first few eigenvectors; hence a minimal selection of 5 eigenvectors is sufficient to obtain a high success rate.

5.2 Mate-Based Retrieval. In this experiment, we used the first 150 and the first 30 $\tilde\alpha$ and $\tilde\beta$ eigenvectors (corresponding to the largest eigenvalues). We computed the 10 and 30 images whose semantic feature vectors have the closest inner products with the semantic feature vector of the chosen text. A match is considered successful if the image that actually matches the chosen text is contained in this set. We computed the success as the average of 10 runs (see Figure 5 for a retrieval example with a set of 5 images). In Table 3, we compare the performance of the KCCA algorithm with the GVSM over 10 and 30 image sets, whereas in Table 4, we present the overall success over all image sets. In Figure 6, we see the overall performance of the KCCA method against the GVSM for all possible image sets.
Figure 3: Success plot for content-based KCCA against GVSM (success rate (%) against image set size), showing KCCA (5), KCCA (30), and GVSM.
Figure 4: Content-based plot of eigenvector selection against overall success (%).
Figure 5: Images retrieved for the text query: “at phoenix sky harbor on july 6, 1997. 757-2s7, n907wa phoenix suns taxis past n902aw teamwork america west america west 757-2s7, n907wa phoenix suns taxis past n901aw arizona at phoenix sky harbor on july 6, 1997.” The actual match is the middle picture in the top row.
Table 3: Success Cross-Results Between Kernel CCA and the Generalized Vector Space Model.

Image Set    GVSM Success    KCCA Success (30)    KCCA Success (150)
10           8%              17.19%               59.5%
30           19%             32.32%               69%
The success rate in Table 3 and Figure 6 is computed as follows:

$$\text{success \% for image set } i = \frac{\sum_{j=1}^{600} \text{count}_j}{600} \times 100,$$

where $\text{count}_j = 1$ if the exact matching image for text query j is present in the set, and $\text{count}_j = 0$ otherwise. The success rate in Table 4 is computed as above and averaged over all image sets.
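Both success measures might be computed as in the following sketch; the helper names and the toy data are our own illustration, not the authors' code:

```python
import numpy as np

def content_success(retrieved_labels, query_labels, i):
    """Content-based success (%): fraction of the first i retrieved images
    sharing the query's label, summed over queries and normalized by i * Q."""
    hits = sum(
        int(np.sum(np.asarray(r[:i]) == q))
        for r, q in zip(retrieved_labels, query_labels)
    )
    return hits / (i * len(query_labels)) * 100.0

def mate_success(retrieved_ids, mate_ids, i):
    """Mate-based success (%): how often the true matching image appears
    among the first i retrieved images."""
    hits = sum(m in r[:i] for r, m in zip(retrieved_ids, mate_ids))
    return hits / len(mate_ids) * 100.0

# Toy data: two queries, ranked retrieval lists of labels / image ids.
cb = content_success([["a", "a", "b"], ["b", "a", "b"]], ["a", "b"], i=2)
mb = mate_success([[1, 2, 3], [4, 5, 6]], [2, 9], i=2)
```

With the toy lists above, the content-based measure counts 3 label hits out of 4 retrieved images (75%), and the mate-based measure finds 1 of the 2 mates in the top two (50%).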
Figure 6: Success plot for KCCA mate-based against GVSM (success (%) against image set size), showing KCCA (150), KCCA (30), and GVSM.

Table 4: Success Rate Over All Image Sets.

Method        Overall Success
GVSM          70.6511%
KCCA (30)     83.4671%
KCCA (150)    92.9781%
As visible in Figure 7, we find that, unlike in content-based retrieval, increasing the number of eigenvectors used assists in locating the matching image for the query text. We speculate that this may be the result of the added detail toward exact correlation in the semantic projection. We do not compute with all eigenvectors, as this would be expensive and the remaining eigenvectors would not necessarily add meaningful semantic information. It is clear that kernel CCA significantly outperforms the GVSM method in both content retrieval and mate retrieval.

5.3 Regularization Parameter. We next verify that selecting the regularization parameter κ a priori gives a value that performs well. We randomly split each class into two halves, used as training and test data, and keep this division for all runs. We set the incomplete Gram-Schmidt orthogonalization precision parameter to η = 0.5
Figure 7: Mate-based plot of eigenvector selection against overall success (%).

Table 5: Overall Success of Content-Based (CB) KCCA with Respect to κ.

κ      CB-KCCA (30)    CB-KCCA (5)
0      46.278%         43.8374%
κ̂      83.5238%        91.7513%
90     88.4592%        92.7936%
230    88.5548%        92.5281%

Note: Bold type signifies the best overall success achieved.
and run over possible values of κ, testing for each value its content-based and mate-based retrieval performance. Let κ̂ be the previous a priori choice of the regularization parameter, κ̂ = 7. As we define the new optimal value of κ by its performance on the test set, this method is biased (loosely speaking, it is cheating); we will show that, despite this, the difference between the performance of the biased κ and our a priori κ̂ is slight. In Table 5, we compare the overall content-based (CB) performance with respect to different values of κ, and Figures 8 and 9 plot the comparison. We observe that the difference in performance between the a priori value κ̂ and the newly found optimal value of κ is 1.0423% for 5 eigenvectors and 5.031% for 30 eigenvectors. The more substantial increase in performance for the latter is due to the increase in the selection of the
Figure 8: Content-based κ selection against overall success for 30 eigenvectors.
Figure 9: Content-based κ selection against overall success for five eigenvectors.
Table 6: Overall Success of Mate-Based (MB) KCCA with Respect to κ.

κ      MB-KCCA (30)    MB-KCCA (150)
0      73.4756%        83.46%
κ̂      84.75%          92.4%
170    85.5086%        92.9975%
240    85.5086%        93.0083%
430    85.4914%        93.027%

Note: Bold type signifies the best overall success achieved.

Figure 10: Mate-based κ selection against overall success for 30 eigenvectors.
regularization parameter, which compensates for the substantial decrease in performance (see Figure 6) of content-based retrieval when a high-dimensional semantic feature space is used. In Table 6, we compare the overall mate-based (MB) performance with respect to different values of κ, and in Figures 10 and 11 we plot the comparison. We observe that in this case, the difference in performance between the a priori value κ̂ and the newly found optimal value of κ is 0.627% for 150 eigenvectors and 0.7586% for 30 eigenvectors. These results support our proposed method for selecting the regularization parameter κ a priori, since the difference between the actual optimal κ and the a priori κ̂ is very slight.
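The a priori κ-selection heuristic of section 5 (maximize the two-norm gap between the spectrum of the truly paired data and the spectrum with the image-text association randomized) might be sketched as follows. The full-matrix spectrum computation, the function names, and the toy data are all our own illustrative choices:

```python
import numpy as np

def kcca_spectrum(Kx, Ky, kappa):
    """Regularized kernel-CCA spectrum (sketch; full-matrix version).

    Eigenvalues of (Kx + kI)^{-1} Ky (Ky + kI)^{-1} Kx, returned as
    correlations (square roots), sorted in decreasing order.
    """
    n = Kx.shape[0]
    A = np.linalg.solve(Kx + kappa * np.eye(n),
                        Ky @ np.linalg.solve(Ky + kappa * np.eye(n), Kx))
    ev = np.linalg.eigvals(A).real          # may carry tiny imaginary parts
    return np.sqrt(np.clip(np.sort(ev)[::-1], 0, None))

def select_kappa(Kx, Ky, kappas, rng):
    """Pick the kappa maximizing || lambda_R(kappa) - lambda(kappa) ||."""
    perm = rng.permutation(Kx.shape[0])
    Ky_rand = Ky[np.ix_(perm, perm)]        # randomize the x-y association
    gaps = [np.linalg.norm(kcca_spectrum(Kx, Ky_rand, k)
                           - kcca_spectrum(Kx, Ky, k)) for k in kappas]
    return kappas[int(np.argmax(gaps))]

# Toy usage: two correlated views, candidate grid of regularization values.
rng = np.random.default_rng(4)
X = rng.standard_normal((25, 40))
Y = X @ rng.standard_normal((40, 40)) + 0.5 * rng.standard_normal((25, 40))
Kx, Ky = X @ X.T, Y @ Y.T
kappas = [0.1, 1.0, 10.0, 100.0]
kappa = select_kappa(Kx, Ky, kappas, rng)
```

At κ near 0 both spectra saturate toward the all-ones vector (the trivial perfect correlations of equation 3.14), so the gap is small; the heuristic favors a κ large enough that only the genuine cross-modal correlations survive.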
Figure 11: Mate-based κ selection against overall success for 150 eigenvectors.
6 Conclusions

In this study, we have presented a tutorial on canonical correlation analysis and have established a novel general approach to retrieving images based solely on their content, applied to content-based and mate-based retrieval. Experiments show that this image retrieval can be more accurate than with the generalized vector space model. We also demonstrate that the regularization parameter κ can be chosen a priori and performs well in very different regimes. We conclude that kernel canonical correlation analysis is a powerful tool for image retrieval via content. In the future, we will extend our experiments to other data collections.

In the process of generalizing canonical correlation analysis, we saw that the original problem can be transformed and reinterpreted as a total distance, or variance minimization, problem. This duality between correlation and distance requires more investigation, to give more suitable descriptions of the structure of the special spaces generated by different kernels. Such approaches can provide tools for handling problems in a kernel space where the inner products and the distances between points are known but the coordinates are not. For some problems, it is sufficient to know the coordinates of only a few special points, which can be expressed from the known inner products (e.g., performing cluster analysis in the kernel space and computing the coordinates of the cluster centers only).
Acknowledgments

We acknowledge the financial support of EU projects KerMIT, No. IST-2000-25341, and LAVA, No. IST-2001-34405. The majority of this work was done at Royal Holloway, University of London.

References

Akaho, S. (2001). A kernel method for canonical correlation analysis. In International Meeting of the Psychometric Society. Osaka, Japan.

Bach, F., & Jordan, M. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Borga, M. (1999). Canonical correlation. Online tutorial. Available online at: http://www.imt.liu.se/∼magnus/cca/tutorial.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.

Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2001). Latent semantic kernels. In C. Brodley & A. Danyluk (Eds.), Proceedings of ICML-01, 18th International Conference on Machine Learning (pp. 66–73). San Francisco: Morgan Kaufmann.

Fyfe, C., & Lai, P. L. (2001). Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10, 365–374.

Gifi, A. (1990). Nonlinear multivariate analysis. New York: Wiley.

Golub, G. H., & Loan, C. F. V. (1983). Matrix computations. Baltimore: Johns Hopkins University Press.

Hardoon, D. R., & Shawe-Taylor, J. (2003). KCCA for different level precision in content-based image retrieval. In Proceedings of the Third International Workshop on Content-Based Multimedia Indexing. Rennes, France: IRISA.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 312–377.

Ketterling, J. R. (1971). Canonical analysis of several sets of variables. Biometrika, 58, 433–451.

Kolenda, T., Hansen, L. K., Larsen, J., & Winther, O. (2002). Independent component analysis for understanding multimedia content. In H. Bourlard, T. Adali, S. Bengio, J. Larsen, & S. Douglas (Eds.), Proceedings of IEEE Workshop on Neural Networks for Signal Processing XII (pp. 757–766).
Piscataway, NJ: IEEE Press.

Vinokourov, A., Hardoon, D. R., & Shawe-Taylor, J. (2003). Learning the semantics of multimedia content with application to web image retrieval and classification. In Proceedings of the Fourth International Symposium on Independent Component Analysis and Blind Source Separation. Nara, Japan.

Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2002). Inferring a semantic representation of text via cross-language correlation analysis. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Received June 24, 2003; accepted May 4, 2004.
LETTER
Communicated by A. David Redish
Cognitive Map Formation Through Sequence Encoding by Theta Phase Precession

Hiroaki Wagatsuma
[email protected]
Laboratory for Dynamics of Emergent Intelligence, RIKEN BSI, Wako-shi, Saitama 351-0198, Japan
Yoko Yamaguchi
[email protected]
Laboratory for Dynamics of Emergent Intelligence, RIKEN BSI, Wako-shi, Saitama 351-0198, Japan; College of Science and Engineering, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan; and CREST, Japan Science and Technology Corporation
The rodent hippocampus has been thought to represent the spatial environment as a cognitive map. The associative connections in the hippocampus imply that a neural entity represents the map as a geometrical network of hippocampal cells in terms of a chart. According to recent experimental observations, the cells fire successively relative to the theta oscillation of the local field potential, called theta phase precession, when the animal is running. This observation suggests the learning of temporal sequences with asymmetric connections in the hippocampus, but it also gives rather inconsistent implications for the formation of the chart, which should consist of symmetric connections for space coding. In this study, we hypothesize that the chart is generated with theta phase coding through the integration of asymmetric connections. Our computer experiments use a hippocampal network model to demonstrate that a geometrical network is formed through running experiences in a few minutes. Asymmetric connections are found to remain and distribute heterogeneously in the network. The obtained network exhibits the spatial localization of activities at each instance, as the chart does, and their propagation, which represents behavioral motions with multidirectional properties. We conclude that theta phase precession and the Hebbian rule with a time delay can provide the neural principles for learning the cognitive map.

Neural Computation 16, 2665–2697 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

The rodent hippocampus stores information concerning environmental space. According to the cognitive map theory of O’Keefe and Nadel (1978), an entire environment is represented by the population activities of hippocampal cells. When an animal is in a specific portion of an environment, a “place cell” fires (O’Keefe & Dostrovsky, 1971). The portions associated with these individual place cells, called place fields, are distributed over the entire environment. After a brief exploration of an environment (about 10 minutes), pyramidal cells in the CA3 and CA1 regions of the hippocampus exhibit firing preferences based on the animal’s location (Wilson & McNaughton, 1993). This means that place cells for a new location are quickly established, and the geometry of the new environment soon becomes familiar. It is important to understand how the hippocampus acquires the cognitive map after such a short running experience. The neural basis of hippocampal learning was theoretically proposed by Marr (1971) as a framework of associative memory. After that, several theoretical models of hippocampal associative memory were proposed (McNaughton & Morris, 1987; McNaughton, 1989; McNaughton & Nadel, 1989; Rolls, 1989). In this framework, Muller and his colleagues (Muller, Kubie, & Saypoff, 1991; Muller, Stead, & Pach, 1996) proposed that simultaneous firing of place cells forms a network in which cells with neighboring place fields have strong synaptic connections. Accordingly, the geometric network of place cells, called a chart, represents the environment through the spatial relationships among place cells. Such a chart formation was demonstrated by Káli and Dayan (2000). The strength of the connection between a pair of place cells represents the distance between the two corresponding locations in the environment. Associative connections in the CA3 region serve as the functional basis of a cognitive map during spatial navigation, providing neural activities that are spatially localized in the chart. Recently, Nakazawa et al.
(2002) experimentally confirmed that recurrent connections of pyramidal cells in the CA3 region serve as associative memory, using mutant mice with the NMDA receptor gene ablated specifically in the CA3 region. These studies on the cognitive map are in line with the traditional Hebbian learning rule. Several recent studies of synaptic plasticity and neural dynamics in the hippocampus suggest the importance of asymmetric connections in memory encoding of the temporal sequence. O’Keefe and Recce (1993) discovered theta phase precession, which is an advancement of a place cell’s firing phase relative to the cycle of the theta rhythm (around 8 Hz) in the hippocampus as a rat goes through the place field. By using parallel recordings of the place cells, Skaggs, McNaughton, Wilson, and Barnes (1996) demonstrated that the phase precession is not only the phase advancement of individual cells, but also retains the robust phase difference in the firing of a population of cells. The firings with phase differences represent a behavioral sequence in a compressed form within each theta cycle. According to the results of Skaggs et al. (1996), the phase difference in place cells has a timescale similar to the asymmetric time window of synaptic plasticity in pyramidal cells. Therefore, they suggested that theta phase precession enables memory encoding of the behavioral temporal sequence in asymmetric connections. Rosenzweig, Ekstrom, Redish, McNaughton, and Barnes (2000) observed theta phase precession in a novel environment, which suggested its contribution to the initial stage of the learning process. The asymmetric time window of synaptic plasticity in pyramidal cells was discovered in the tetanic stimulus condition (Levy & Steward, 1983; Larson & Lynch, 1989) and was further clarified in the relative timing of individual spikes (Bi & Poo, 1998). When the presynaptic neuron fires within 50 msec before the postsynaptic neuron, the synapse is enhanced through long-term potentiation (LTP). The reverse timing depresses the synapse through long-term depression (LTD). This asymmetric property in synaptic plasticity enables memory storage through the firing order in the form of asymmetric connections. Yamaguchi (2003) proposed a hippocampal model of memory encoding that used theta phase precession, which encoded a one-time experience into the recurrent network in the CA3 region (Yamaguchi & McNaughton, 1998). According to experimental evidence (Skaggs et al., 1996; Yamaguchi, Aota, McNaughton, & Lipa, 2002), theta phase precession is assumed to appear at the entrance of the hippocampus. The phase precession is inherited by the CA3 region and conducts temporal coding of the rat’s behavioral sequences into asymmetric connections among the cells. The theoretical model enables memory encoding with a variety of timescales of the temporal sequence (Sato & Yamaguchi, 2003) and a variety of spatiotemporal patterns (Wu & Yamaguchi, 2004). The question is whether sequence learning based on theta phase precession in the presence of asymmetric synaptic plasticity can be extended to other hippocampal memories. The consideration of asymmetric synapses and theta phase precession in the cognitive map was developed by Redish and Touretzky (1998), and this theory consisted of two stages: one for chart formation and a separate one for route learning.
The chart formation follows the associative memory with symmetric connections, and theta phase precession is applied to route learning after the chart formation. This results in the embedding of asymmetric connections encoding sequence memory into the previously formed symmetric network of the chart. They concluded that these two could coexist in the hippocampal network. Samsonovich and McNaughton (1997) also studied theta phase precession in the cognitive map. In their model, a chart with symmetric connections is assumed, in accordance with previous studies. The phase precession during running is generated by an asymmetric property in the extrahippocampal layer encoding the direction of motion. In these studies, the chart formation is still within the framework of associative memory with symmetric connections and is independent of the experimental evidence for the synaptic plasticity of asymmetric connections. The purpose of this study is to elucidate the roles of theta phase precession in chart formation and spatial representation. We reported that it was possible for a hippocampal network model with theta phase precession to generate the geometrical network of the CA3 region (Wagatsuma & Yamaguchi, 1999, 2000). Our computer experiments are devoted to clarifying how the asymmetric property of synaptic plasticity contributes to the formation and other computational functions of the cognitive map.

2 Hypothesis

We hypothesize that theta phase precession provides memory encoding of the temporal sequence of a behavioral experience into asymmetric connections of the CA3 region, and that the one-dimensional network encoding of the sequence is integrated during the animal’s exploration. The resulting network, which encodes the two-dimensional properties of the environment, is the chart. In order to elucidate the animal’s spatial navigation, we consider the behavior and the hippocampal dynamics of our simulated animal, “rat,” as follows:

1. Time series of sensory inputs. As the rat runs in an environment, a time series of multimodal sensory information is provided to the hippocampus. For simplicity, we assume that the time series consists of visual scenes opened out before the rat.

2. Generation of theta phase precession. The time series of sensory input sequentially activates neural units in the entorhinal cortex as oscillators. Given the neural oscillation and an input of the local field potential with the theta rhythm (LFP theta), their mutual interaction leads the oscillators to generate theta phase precession. Through the feedforward projection, the temporal pattern of theta phase precession is inherited by the CA3 region, which keeps the phase differences among cells in CA3 stable.

3. Asymmetric Hebbian plasticity. Synaptic plasticity in CA3 is characterized by asymmetric Hebbian plasticity, which enhances synapses in a manner sensitive to the firing timing between pre- and postsynaptic neurons. The dynamics of theta phase precession in the presence of this plasticity generates synaptic connections that represent the running sequence.

4. Unification of sequential memories. In explorations of the same places, recent memory interferes with the past memories already stored in CA3 recurrent connections. The temporal structure of the sensory input, which implicitly includes some metric for a map of the environment, therefore allows a group of cells to activate recent and past running episodes and to share connections within the group. As the interference increases, the one-dimensional representation of the sequence develops into a two-dimensional representation as the network of the chart.
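The asymmetric plasticity of item 3 can be illustrated with a minimal sketch; the function name and all constants below are hypothetical, chosen only to show the shape of the rule, while the model's actual rule is given later in equation 3.1:

```python
# Hypothetical illustration of asymmetric Hebbian plasticity (item 3).
# A connection from unit j to unit i is potentiated only when unit i
# fires in a narrow window around a delay tau AFTER unit j (pre before
# post); any other pairing is weakly depressed. Constants are not taken
# from the paper.

def weight_change(dt_ij, a_ltp=0.02, a_ltd=0.005, tau=0.02, width=0.02):
    """Weight change for a firing-time difference dt_ij = t_i - t_j (s)."""
    ltp = a_ltp if tau - width <= dt_ij <= tau + width else 0.0
    return ltp - a_ltd

assert weight_change(0.02) > 0    # pre leads post near tau: net LTP
assert weight_change(-0.02) < 0   # post leads pre: net LTD
```

Because LTP is confined to the pre-before-post window while LTD is timing-independent, repeated traversals of a path strengthen only the "forward" connections along the running sequence.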
3 Model

Our neural network model consists of three layers: the input layer, the EC layer, and the CA3 layer, as illustrated in Figure 1A. The input layer represents the local view of visual information. The EC layer corresponds to the superficial layer of the entorhinal cortex, receiving the series of visual information from the input layer and generating the temporal coding as the phase precession. The CA3 layer receives the phase precession temporal patterns from the EC layer. Recurrent connections in CA3 undergo asymmetric Hebbian plasticity, the evolution of which is the central issue in chart formation. The dentate gyrus and CA1 region of the hippocampus are omitted for the present purpose. The number of units in each layer is represented in the same way, by the letter N, and the projections between subsequent layers are one-to-one connections for simplicity. This simple assumption is sufficient for the analysis here, restricted to the formation of a single chart, and our model can be easily extended to multiple charts in a network. Details of the mathematical equations are given in the appendix.

3.1 Environment and Rat Behavior. The environmental space in this study is a square area of the (x, y) coordinate space, given by [0, 1] × [0, 1]. N landmarks are distributed at the lattice points in the environmental space (see Figure 1B). Walls are set up along all of the edges in the environment so that the running direction of the rat must change. The rat keeps a constant running velocity of 0.043 space units per theta cycle. Supposing that the theta rhythm is 8 Hz and that one space unit is 1 m, the corresponding running velocity is 0.34 m/sec.

3.2 Input Layer. The behavior-dependent state of the input layer is represented by {I_i}, (i = 1, . . . , N), where each component, I_i, derives from a single landmark in the environmental space.
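As an illustration, the local-view input, with its semicircular visual field V (defined just below), can be sketched as follows; the linear decay with distance and the default radius and aperture here are assumptions, not the paper's exact magnitude function:

```python
import math

# Illustrative sketch of the local-view input (the linear distance decay
# and the default radius/aperture below are assumptions, not the paper's
# exact magnitude function). Landmark input I_i is positive only when
# the landmark lies inside a semicircular visual field of radius
# v_length centered on the rat's heading, and grows as the landmark
# gets closer.

def local_view_input(landmark, rat_pos, heading, v_length=0.5, v_angle=math.pi):
    """Return I_i for one landmark, given the rat's position and heading (rad)."""
    dx = landmark[0] - rat_pos[0]
    dy = landmark[1] - rat_pos[1]
    r = math.hypot(dx, dy)
    if r == 0.0:
        return 1.0
    if r > v_length:
        return 0.0                          # beyond the visual-field radius
    bearing = math.atan2(dy, dx) - heading  # angle of landmark vs. heading
    bearing = (bearing + math.pi) % (2.0 * math.pi) - math.pi
    if abs(bearing) > v_angle / 2.0:
        return 0.0                          # outside the semicircular aperture
    return 1.0 - r / v_length               # stronger for nearer landmarks
```

Evaluating this function for every landmark at each step along the trajectory yields the continuous sequence of local views that drives the EC layer.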
The input, I_i, at each instant takes a positive value when the landmark is within the visual field V, which spreads in a semicircular form in front of the rat’s location. This means that the input of the local view is body-centered information depending on the location of the rat and its running direction. Input during a behavioral sequence is given as a continuous sequence of local views according to the trajectory in the environmental space.

3.3 EC Layer. The EC layer consists of N neural units and an LFP theta unit. The LFP theta has a frequency of ω_0. Assuming that there is a one-to-one projection from the input layer to the EC layer, each neural unit receives only one component of the input vector {I_i}. We assume that a neural unit is an oscillator; units with an input from landmarks are activated as sustained oscillations, while others remain in resting states. The key assumption for the generation of theta phase precession is a change in the internal parameter of each unit, so that the intrinsic, or native, frequency
of neural oscillation, ω(t), gradually increases around ω_0 in the presence of the input (Yamaguchi & McNaughton, 1998; Yamaguchi, 2003).

Figure 1: Illustration of the network structure and input of local view in the model. (A) The input, EC, and CA3 layers. Each input (I_i) derives from a single landmark shown in B. Open circles in the EC and CA3 layers represent neural units. Projections from the input to the EC and from the EC to the CA3 are assumed to be one-to-one connections. Black half circles represent synaptic connections. Recurrent connections in the CA3 layer are assumed to follow the asymmetric Hebbian rule, as shown in section A.2 in the appendix. The LFP theta units representing the local field theta oscillation are shown at the bottom of the EC and CA3 layers. (B) Local view input given by a set of landmarks. Individual landmarks are assumed to be distributed at lattice points, m × m, in a square environment with two-dimensional coordinates (x, y). Landmarks around and in front of the rat activate individual units in the input layer within a semicircle of visual field V defined by the radius (V_length) and the angle (V_angle); the magnitude of each input depends on the distance between the landmark and the current position of the rat (r_i).

Lengyel, Szatmáry, and Érdi (2003) called this property dynamically detuned oscillations and examined its validity for theta phase precession at several levels of description of a single neuron, including an abstract model and an integrate-and-fire neuron. In our model, the LFP theta unit couples with each neural oscillation as a reference oscillator that leads to phase locking. As the native frequency increases and phase locking occurs, the relative phase between the neural and LFP units also increases gradually. The phase locking instantaneously forms a robust phase precession. The time series of local views is compressed by the phase precession into theta cycles when the rat is running. Activated units associated with different landmarks have distinct phase differences according to the onset times of the inputs. Our model does not require use of the geometrical structure in the environment that is used in the model of theta phase precession by Burgess, Recce, and O’Keefe (1994). Units in the group perpendicular to
the running direction activate with the same phase regardless of the spatial locations of the landmarks. The time evolution of the membrane potential of each unit is described using a phase model. A developed form of this phase model (Hoppensteadt, 1986; Yamaguchi, 1996) enables a simple description of nerve excitation and oscillation (see section A.1 in the appendix). In our model, the membrane potential of the unit i is given by cos φ_i^EC, normalized in the range [−1, 1], where the variable φ_i^EC represents the oscillation phase changing from 0 to 2π. The firing of the unit, f_i^EC, is given by a sigmoidal function of the membrane potential.

3.4 CA3 Layer. The CA3 layer also consists of N neural units and an LFP theta unit, both described by the same type of neural oscillator as in the EC layer, with the exception of the change in the native frequency. The CA3 unit i receives four types of input: the firing of the EC unit i, f_i^EC; the activity of the LFP theta of the CA3 layer, cos φ_0^CA3; recurrent excitation by other CA3 units j through recurrent connections, w_ij; and global inhibition. The LFP theta of CA3 has a stable theta oscillation with a π/4 phase delay from that of the EC layer. The value is estimated based on the difference between the LFP theta of the dentate gyrus and that of the CA1 region (Skaggs et al., 1996; Yamaguchi et al., 2002). The effect of recurrent connections is given by the excitation coefficient E. The global inhibition depends on the activities of all CA3 units, with an inhibition coefficient G. The global inhibition and excitatory connections control the number of active CA3 units through competition (Amari, 1977; Amit & Tsodyks, 1992). The time evolution of recurrent connections in the CA3 layer is given by a modified Hebbian rule with a time delay, according to the experimental results of Bi and Poo (1998).
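Before turning to the plasticity rule, the phase-locking mechanism of section 3.3 can be sketched numerically. This is a simplified stand-in for the model's phase equations, not the paper's actual dynamics; the coupling strength, detuning rate, and integration step are illustrative choices:

```python
import math

# Simplified stand-in for the detuned-oscillator mechanism of section
# 3.3 (coupling strength k, detuning rate, and the Euler step are
# illustrative choices, not the paper's parameters). While input is
# present, the unit's native frequency drifts above the LFP theta
# frequency; because the unit stays phase-locked to the LFP reference,
# its firing phase advances cycle by cycle: theta phase precession.

def simulate_precession(n_cycles=8, dt=0.001, k=25.0, detune_rate=0.4):
    """Return the unit's phase relative to LFP theta, sampled once per cycle."""
    theta_hz = 8.0
    omega0 = 2.0 * math.pi * theta_hz       # LFP theta frequency (rad/s)
    omega = omega0                          # unit's native frequency
    phi_lfp = phi_unit = 0.0
    phases = []
    t, t_end = 0.0, n_cycles / theta_hz
    next_sample = 1.0 / theta_hz
    while t < t_end:
        omega += detune_rate * omega0 * dt  # gradual detuning above omega0
        phi_lfp += omega0 * dt
        # phase-locking to the LFP reference oscillator
        phi_unit += (omega + k * math.sin(phi_lfp - phi_unit)) * dt
        t += dt
        if t >= next_sample:                # sample once per theta cycle
            phases.append((phi_unit - phi_lfp) % (2.0 * math.pi))
            next_sample += 1.0 / theta_hz
    return phases

phases = simulate_precession()
# The relative phase advances monotonically from one theta cycle to the next.
```

Units that receive input earlier start this detuning earlier, so a population of such oscillators fires in a fixed phase order within each cycle, compressing the behavioral sequence.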
The time window of LTP is given by a specific firing time difference between pre- and postsynaptic cells, with a nonspecific time difference for LTD. We call this learning rule the asymmetric Hebbian rule. The evolution of the connection weight from unit j to unit i, w_ij, is described as

$$
\dot{w}_{ij}(t) = \Bigl(\lambda_{\mathrm{LTP}}\, f_i^{\mathrm{CA3}}(t)\, f_j^{\mathrm{CA3}}(t-\tau) - \lambda_{\mathrm{LTD}}\,\bigl[f_i^{\mathrm{CA3}}(t) + f_j^{\mathrm{CA3}}(t-\tau)\bigr]\Bigr)\, R_w\bigl(w_{ij}(t)\bigr), \qquad i, j = 1, \ldots, N, \tag{3.1}
$$

where f_i^CA3 is the firing rate of the CA3 unit i, and R_w(w_ij) is the regulation factor that keeps the value of w_ij in the range from zero to one (see section A.2 in the appendix). The specific timing of LTP is simply given by a time point τ, which is physiologically known to be less than 50 ms (see Figure 2A). When the firing of unit i follows that of unit j with a time delay of τ, the equation then gives asymmetric changes that satisfy ẇ_ij > 0 and ẇ_ji < 0. As f_i^CA3 has
a duration period of 0.3 theta cycles in a single firing, equation 3.1 results in plasticity with a time window of around 50 ms, as shown in Figure 2B.

Figure 2: The asymmetric Hebbian rule used in the model. (A) Illustration of LTP and LTD in equation 3.1. The LTP is given at the specific time point τ, and LTD is independent of the timing of firing in units i and j. (B) The change of the connection weight after a pair of firings of units i and j, with a given time difference, in the CA3 layer of the present model. A single firing in the phase model has a time width of 0.3 theta cycles. Each data point is obtained by numerical integration with the initial value w_ij = 0.5. The horizontal axis represents the time difference in firings, given as Δt_ij = T_i − T_j, where T_i is the onset time of firing of unit i. It indicates that the time window of the synaptic plasticity has a time width nearly equal to that of a single firing around the time point τ = 0.19. This time window is consistent with the experimental result of Bi and Poo (1998).

Here, we assume that the CA3 network has no geometric connections before learning, and therefore the zero value is set as the initial value of all recurrent connections. It should be noted that the numbering of all of the units with i from 1 to N in equation 3.1 is merely a conventional tag and is independent of the anatomical positions of the neurons. This is consistent with experimental observations that the topology of place fields in a certain
environment is not preserved in different environments (Muller & Kubie, 1987; Bostock, Muller, & Kubie, 1991). The population activity of the CA3 units replicates the temporal pattern of theta phase precession from the EC units with a projection delay. According to the asymmetric Hebbian rule, recurrent connections develop according to the phase differences in the population activity. When the rat is running, sequential inputs of local cues along the running direction are transformed into the phase precession pattern in the EC layer. CA3 units inherit the phase differences from the EC units, and then asymmetric connections are formed between them. On the other hand, connections can become symmetric in pairs of units that are associated with local cues perpendicular to the running direction, because of their identical firing phase. Thus, the running direction affects the formation of the connections. Spatial exploration provides asymmetric and symmetric connections in the CA3 layer.

3.5 Method of Computer Experiments. To establish the basic properties of memory encoding of temporal sequences, we first examine learning in a short-run episode of a single trial and then learning during spatial exploration. We call this period of computer experimentation the learning stage. After learning, retrieval activities are analyzed in order to understand the information stored in the recurrent connections. This period is called the retrieval stage. For quantitative analysis of the chart formation, several quantities on the CA3 units and their recurrent connections are defined below.

• Expected position: The expected position of each CA3 unit in the chart is defined by its landmark position via projections. The expected positions of the units in the input, EC, and CA3 layers are all similarly defined.

• Expected array: The expected array of the CA3 unit arrangement is defined according to the expected positions. This array reveals the properties of the chart.
Since landmarks are latticed in the environmental space, the expected array is treated as an m × m (m² = N) matrix. This definition is also used for the input and EC layers.

• Distance in the expected array: The distance between units i and j in the CA3 layer of the expected array, d(i, j), is defined as the distance between the expected positions of units i and j.

• Efferent/afferent map of connection weights: In order to visualize the distribution of connections of CA3 units in the expected array, efferent and afferent maps are used. We distinguish the efferent connection weights and afferent connection weights of unit i, which are given by W_i^E = {w_ji : j = 1, . . . , N} and W_i^A = {w_ij : j = 1, . . . , N}, respectively. The efferent map of the presynaptic unit i is defined as the arrangement of the W_i^E values on the expected array. The afferent map of the postsynaptic unit i is defined as the arrangement of the W_i^A values on the expected array.

• Directionality map: In order to elucidate the distribution of the asymmetry of connections in each pair of units, we define the connection asymmetry of unit i as the set of differences between the efferent and afferent connection weights, given as W̄_i = W_i^E − W_i^A = {w_ji − w_ij : j = 1, . . . , N}. If W̄_i = 0, then every pair of connections associated with unit i is symmetric. When W̄_i ≠ 0, directionality can be defined as the direction from the expected position to the peak position of the connection asymmetry distribution. The directionality map is given as a spatial distribution of all units’ directionalities on the expected array, so that these directionalities are shown as arrows on the map.

The equations were numerically integrated using the Runge-Kutta method under various behavioral conditions. Time is represented by the number of theta cycles such that 2π/ω_0 = 1 theta cycle.

4 Results of Computer Experiments

4.1 Learning of a Short-Run Episode. First, we examine the fundamental dynamics of theta phase precession and memory encoding of the temporal sequence in the above model as the rat runs along a straight line from (0, 0.5) to (1, 0.5) in the (x, y) environmental space. Figure 3A illustrates the population activities of the individual layers in a theta cycle. In the input layer, the instantaneous population activity forms a semicircle in the expected array, moving in the running direction, which represents sequential local views. Within a semicircle of the local view of the EC layer, instantaneous population activity appears in a crescent shape. The population activity begins at the flat end of the semicircle and progresses forward until the region of active units reaches the outer curve of the semicircle during the theta cycle.
The spatiotemporal pattern of the theta cycle is a wave propagating along the trajectory in the expected array. This property is in good agreement with the phase precession pattern on a chart reconstructed from experimental data (Samsonovich & McNaughton, 1997). The wave propagation indicates that the population activities of the current local view are transformed into a sequence of instantaneous subpopulations representing the running temporal sequence in a compressed form. Results of the retrieval stage in the CA3 layer are shown in Figure 3B. A partial cue is given at the local view of the starting point of the experienced trajectory. Once the partial cue is turned off, spontaneous activity appears and propagates smoothly along the entire trajectory in the expected array. The instantaneous distribution of the active units forms a crescent, similar to that of the phase precession in the learning stage.

Figure 3: (A) Theta phase precession in the learning stage of a short run. The population activity at each instant is observed in the expected array, shown as a block of 20 × 20 pixels. In the input layer, each pixel denotes the intensity of each input unit. The semicircular shape of the population corresponds to the range of the local view. In the EC and CA3 layers, each pixel denotes the spike density of each unit. The clock shown in the top-right corner of each square represents the value of the LFP theta in the EC and CA3 layers, changing from 0 to 2π. The temporal pattern of phase precession can be observed in the EC and CA3 as a kind of wave propagation. Time series of the population activity, A1–6, are obtained every 0.16 theta cycles (E = 0.07, G = 2.0). (B) Retrieval activities after the running of (A), with the LFP theta fixed. Time series of the population activity of CA3, B1–6, are obtained every 0.5 theta cycles. Trigger input is given for a theta cycle, shown as black pixels in B1, as indicated by the arrow (E = 0.225, G = 2.25).

As shown in Figure 4, both efferent and afferent connections make crescents at the expected positions of any given unit’s neighborhood. However, the efferent connections extend ahead of the expected position of the presynaptic unit, and the afferent connections are distributed behind it. This shows that the units along the trajectory construct a network of unidirectional connections. It can be considered a kind of synfire chain, as proposed by Abeles (1991), where each layer of the synfire network corresponds to a crescent-shaped group of units, providing chain reactions of synchronized firing. This network encodes the temporal sequence of the running episode in a single trial. The results shown in Figure 4 indicate that the connections of individual units depend on the running direction, while they do not uniquely depend on the distance in the expected array. This implies that these connections do not satisfy the chart property.

Figure 4: An example of efferent/afferent maps after a short run. (A) The efferent map of connection weights from a unit, shown by the white square. (B) The afferent map of connection weights of the same unit. Schematic illustrations of efferent/afferent connections of a unit are shown at the bottom.

4.2 Formation of the Cognitive Map

4.2.1 Learning by Spatial Exploration. In the next analysis, we examine chart formation during spatial exploration. We assume that the rat continues to run for 1 minute in the square environment. The running direction of the rat is assumed to change once every 18 theta cycles. The amount of direction change is a random variable with a uniform distribution in the range [−90°, 90°]. At a wall, the rat’s direction reflects like a ball bouncing off the wall. Figure 5A shows the evolution of the trajectory during exploration. There are three representative points of time, T1 , T2 , and T3 , with T1 = 35 cycles
(4.375 s), T2 = 117 cycles (14.625 s), and T3 = 467 cycles (58.375 s). The behaviors of the periods [0, T1], [0, T2], and [0, T3] respectively correspond to a simple trajectory, a trajectory with crossings at several points, and thorough exploration. As the trajectory extends over more of the environment, the number of times a path is crossed also increases. The evolution of the efferent map of connection weights is shown in Figure 5B. At time T1, it has a crescent form similar to that in the previous experiment encoding a running episode. At time T2, the efferent map has a circular form around the presynaptic unit (indicated by the white bar). The concentric shape of the connection map sharpens at time T3. As the crossing points increase, the distribution of the connection weights on the map becomes more concentric around the expected position of the presynaptic unit. Figure 6A shows the different evolution of the efferent maps of individual presynaptic units. At time T1, similar crescent-shaped distributions appear with flattened surfaces. At time T2, some of the efferent maps display distributions with single peaks, while others display distributions with multiple peaks. At time T3, most of the distributions have single peaks. To elucidate the collective distribution of efferent maps, the average efferent map is obtained by superimposing individual efferent maps aligned on the individual expected positions of presynaptic units by parallel shift (see Figure 6B) and by parallel shift and rotation (see Figure 6C). Figure 6B shows the average distribution by the parallel shift of efferent maps at time T3. The distribution evidently forms a two-dimensional gaussian shape. This result indicates that the strength of the efferent connections decreases monotonically with respect to the distance on the expected array. Thus, the CA3 recurrent network sufficiently establishes the chart property within a minute.

Figure 5: (A) Trajectories in spatial exploration during time T1–T3. Solid lines denote trajectories of the rat. The running direction is shown as arrowheads on the line. The closed circle denotes the starting position. The open circle, shown only in T3, represents the final position of the exploration. (B) Development of the efferent map of an example unit during time T1–T3. The white bar standing on the expected array indicates the expected position of the presynaptic unit, to which the cross mark in A corresponds (E = 0.07).
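The parallel-shift averaging described above can be sketched as follows; the helper name and the toy weight matrix are illustrative, while the paper itself uses 40 × 40 bins over the relative coordinates in [−1, 1]²:

```python
from collections import defaultdict

# Sketch of the parallel-shift averaging behind the average efferent
# map (function name and toy data are illustrative): every efferent
# weight w_ji (from unit i to unit j) is accumulated into a bin indexed
# by the relative expected position (x_j - x_i, y_j - y_i), and each
# bin is then averaged.

def average_efferent_map(weights, positions, bins=40):
    """weights[j][i] = w_ji, the connection from unit i to unit j;
    positions[i] = (x_i, y_i), expected positions in [0, 1]^2.
    Returns {(bx, by): mean weight} over relative-coordinate bins."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            bx = int((xj - xi + 1.0) / 2.0 * bins)  # relative x in [-1, 1]
            by = int((yj - yi + 1.0) / 2.0 * bins)  # relative y in [-1, 1]
            sums[(bx, by)] += weights[j][i]
            counts[(bx, by)] += 1
    return {b: sums[b] / counts[b] for b in sums}

# Toy check: three units on a line with purely "forward" connections.
positions = [(0.2, 0.5), (0.5, 0.5), (0.8, 0.5)]
weights = [[0.0, 0.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
avg = average_efferent_map(weights, positions)
# Only bins ahead of the origin (bx > 20) accumulate nonzero averages here.
```

After sufficient exploration in all directions, the forward bias of individual maps cancels in the average, which is how the concentric, chart-like distribution of Figure 6B arises from asymmetric single-run maps.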
Figure 6: Facing page. (A) Individual efferent maps during time T1 − T3 for 7 × 7 sampled presynaptic units. In the entire layout, individual maps are aligned according to the expected position of each presynaptic unit, shown by a white pixel. (B) The average efferent map obtained by the parallel shift at time T3. For the efferent connection weights of unit i, the expected position of each unit j is transformed to a two-dimensional space bin by the parallel shift (xj, yj) − (xi, yi), where (xi, yi) is the expected position of unit i. The two-dimensional space bins are given as 40 × 40 bins in one square field. (C) The average efferent map obtained by the parallel shift and rotation at time T3. The peak position, used for the rotation of individual maps, is given as the center of mass of connection weights larger than 70% of the maximum value of each map. After the parallel shift, the expected position of each unit j is rotated by the angle toward the peak position. Thus, the horizontal gray line indicates the line connecting the peak and the expected positions of all presynaptic units. (D) Average distribution obtained by the parallel shift and rotation of the connection asymmetry at time T3.
This property is consistent with the results obtained by
Cognitive Map Formation by Theta Phase Precession
[Figure 6, panels A–D, appears here: panels B and C plot the average connection strength against the relative coordinate (after the parallel shift, and after the parallel shift and rotation, respectively); panel D plots the connection asymmetry.]
Muller et al. (1991, 1996) and Redish and Touretzky (1998) using a Hebbian rule with rate coding. In their models, simultaneous firing of neighboring place cells produces symmetric connections proper to the chart, whereas temporal coding with theta phase precession in the present model produces asymmetric connections.

Interestingly, in the average distribution obtained by parallel shift and rotation at time T3 (see Figure 6C), the peak statistically deviates from the expected position. Therefore, the overall structure of the learned network has a two-dimensional geometry with a symmetric distribution on the expected array, while the asymmetry is significantly preserved in the individual distributions of efferent/afferent connections.

Figure 6D shows the average distribution obtained by parallel shift and rotation of the connection asymmetry at time T3. The average distribution has a sharp peak and valley in opposite directions. The positive peak is located at a gap of 0.15 space units from the expected position (three bins in Figure 6D), at the same position as the peak in the average distribution of Figure 6C. Thus, the directionality statistically coincides with the direction from the expected position to the peak position of the efferent maps. This property was also observed in individual units.

The relationship between the average strength of efferent connections and distance, shown in Figure 6B, was confirmed in 20 computer experiments with different running trajectories (data not shown). The asymmetric property seen in Figure 6D was also well preserved throughout the entire learning process: between the heights of individual efferent maps and the peak-to-valley ranges of the individual connection asymmetries, the correlation coefficient remains high, with r = 0.958 at T1 and r = 0.866 at T3.
The highly correlated relationship throughout the learning process means that chart formation is accompanied by the growth of asymmetric connections. Thus, the asymmetry in the connections is preserved and provides the directionality in subsequent exploration.

Figure 7 shows the time evolution of the directionality maps, colored by the total strength of each presynaptic unit's efferent connection weights (the summation of its weights). As the trajectory expands, the strength develops over the entire environment. At time T1, the arrows are aligned with the experienced trajectory, indicating the encoding of a single behavioral episode. As time passes, the directionality distribution does not disappear but spreads out over the environment. The directionality of each unit changes with the evolution of exploration, while the heterogeneous distribution of directionalities in the environment is kept as a collective property. This is because the directionality of each unit derives from the interaction between the activities and synaptic plasticity. Since CA3 units receive input from the EC units, the recurrent excitation, and the global inhibition, the population activities are formed through a competitive process. The activity pattern in turn contributes to the synaptic plasticity, forming a positive feedback loop.
Figure 7: The time evolution of the directionality map during time T1 − T3. (The detailed definition is described in section 3.5.) Arrow lengths are scaled down to 0.3 times their original length. The color of each arrow represents the summation value of the efferent connection weights of that unit; units with a summation value less than 2.5 × 10−4 are omitted. The directionalities are heterogeneously distributed in both T2 and T3.
These processes modulate the directionality at each instant and lead to continuous changes in the heterogeneous pattern of directionality during running.

The directionality map is different from the vector fields in theoretical models of the goal-oriented map (Redish & Touretzky, 1998; Trullier & Meyer, 2000). In the model of Trullier and Meyer (2000), individual arrowheads in the vector field represent the direction of motion of the retrieval activities, so that the arrows are oriented toward the goal. Redish and Touretzky (1998) showed that vector fields of asymmetric connections are embedded in the chart, similar to our definition of the directionality map. In their vector field, each arrowhead directly represents the direction to the goal and determines the movement direction at each position. The vector fields of these two models represent the direction of motion uniquely at each point, consistent with vector fields in dynamical systems. In our model, each arrowhead reflects an integration of the experienced trajectories at each position and does not uniquely determine the direction of motion. Its function will be elucidated in the next section.

4.2.2 Retrieval Stage. To analyze the dynamics caused by recurrent connections without any external input, synaptic modification is stopped and the LFP theta of CA3 is fixed at zero. The recurrent connection values learned during spatial exploration until time T3 are used. As a trigger input, a random pattern of input units is supplied during one theta cycle. The excitation coefficient E is set larger than in the learning stage, which corresponds to the absence of ACh in the physiological state without theta, such as large-amplitude irregular activity (Vanderwolf, 1971). In our model, the two coefficients E and G control the level of competition in the population activity: an increase of E recruits more active units, while an increase of G restricts the size of the population.
In this study, we examine the temporal evolution of retrieval activities on the expected array under two conditions: condition 1, with E = 2.0 and G = 2.2, and condition 2, with E = 3.0 and G = 3.0. In condition 1 (see Figure 8A, left column), the population travels over the environment. In condition 2 (see Figure 8A, right column), by contrast, the initial random pattern quickly localizes and moves to a certain point. In both conditions, the populations tend to localize in a circular shape. We had observed a crescent shape in the retrieval activities of the short-run episode; after spatial exploration, the retrieval activity shows a temporal sequence of population activity with a circular shape. The difference in the shape of the instantaneous patterns, a crescent or a circle, indicates that memory contents in the CA3 recurrent network develop into representations of places through the integration of experiences in various running directions. Figure 8B shows the trace of the center of mass during eight theta cycles in both conditions (conditions 1 and 2 are shown at the top and bottom, respectively). In condition 1, the center of mass of the population travels over
Figure 8: Retrieval activities based on the learned recurrent connections after a random initial input. Connection weights among the CA3 units are those obtained after spatial exploration until t = T3. (A) Spontaneous activities after the trigger input, given in one theta cycle as the random pattern shown in A1. The same random pattern is given in both conditions. Arrows represent the directionalities of units. The gray scale of each arrow represents the firing rate of unit i, given as fiCA3(t). (Left) Condition 1, E = 2.0 and G = 2.2. (Right) Condition 2, E = 3.0 and G = 3.0. The time series of the population activity, A1–6, are obtained every 0.3 theta cycles in condition 1 and every 0.1 theta cycles in condition 2. (B) Trajectory of the center of mass of the CA3 population activity for eight theta cycles from the start. Top and bottom show conditions 1 and 2, respectively. The open circle denotes the starting point of the population, given as the center of mass in A1.
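The center-of-mass trace plotted in Figure 8B can be computed directly from the firing rates and the expected positions of the units. A minimal sketch (the function name and array layout are our assumptions for illustration):

```python
import numpy as np

def population_center_of_mass(rates, positions):
    """Center of mass of the population activity on the expected array:
    sum_i f_i * (x_i, y_i) / sum_i f_i."""
    rates = np.asarray(rates, dtype=float)
    positions = np.asarray(positions, dtype=float)
    return (rates[:, None] * positions).sum(axis=0) / rates.sum()
```

Sampling this quantity once per theta cycle yields a trajectory like those in Figure 8B.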
the entire environment but does not directly correspond to the experienced trajectories; the traces cross everywhere. This means that the direction of motion is not uniquely determined by the arrowhead at each point but by the collective property of the directionality in the active units. The heterogeneous distribution of the directionality produces an instantaneous direction of motion that depends on the spatiotemporal pattern of the active units. In condition 2, the center of mass of the population smoothly approaches a point and stays around it. These results indicate that the values of E and G control the dynamical behavior through competition in the population.

The result obtained in condition 2 appears similar to the drift of the place code in the vector field shown by Redish and Touretzky (1998). The place code in the vector field always approaches a single location and does not continue to move over the entire environment. Redish and Touretzky assumed the formation of symmetric connections for the chart before route learning, so that the symmetric connections yield population activity that is spatially localized on the chart. However, our results on learning with theta phase precession show that symmetric connections are not necessary for localization to occur and that the heterogeneity of asymmetric connections can produce it. The difference between these two learning types appears in the direction of motion in the retrieval stage: learning with theta phase precession produces asymmetric connections with the heterogeneous property, giving flexibility to the direction of motion.

4.2.3 Running Experience After Learning. During running, activities driven by behavior-dependent input interfere with the retrieval of previous experiences. Let us examine how input from recurrent connections affects the current activities of theta phase precession at different stages of learning.
To observe the experience-dependent effects of the phase precession, we compared the recurrent networks after a short run and after spatial exploration. Two conditions were prepared in this computer experiment: the T1 condition, where the rat experienced the learning stage until t = T1, and the T3 condition, where the rat experienced the learning stage until t = T3, under the same conditions as in section 4.2.1. In this analysis, rats in both the T1 and T3 conditions were forced to run along the same trajectory until t = T1. Parameters are the same as those in the learning stage, except for E = 0.2, which satisfies the condition of ACh release during running, as discussed in section 4.2.2.

In both conditions, CA3 unit activities follow those of the EC layer with theta phase precession (see Figure 9A). In the T1 condition, activity patterns in CA3 are consistent with those of the EC in the early phases (see Figure 9B, rows 1 and 2), indicating the inheritance of the phase precession. Comparing the patterns in the EC and CA3, the CA3 activities are larger than the EC activities at each instant in the late phases (see Figure 9B, rows 3–5). The enlargement of the activities in the CA3 is in
Figure 9: Temporal evolution of population activities when rats run on the same trajectory with different preceding experiences, until time T1 or T3. (A) The population activity of the EC in the T1 and T3 conditions. (B) The population activity of the CA3 in the T1 condition. (C) The population activity of the CA3 in the T3 condition. Arrows in B and C represent the directionality of the units. The gray scale of each arrow represents the firing rate of the unit, as in Figure 8A. The current position is marked by the rat symbol in each frame. The clock displayed in the top-right corner of each square represents the value of the LFP theta in each layer. Plots are given every half theta cycle (E = 0.2).
agreement with the arrowheads of the active units as well as with the running direction. This means that the activities propagate through the asymmetric connections formed by the previous experience. The phase precession inherited from the EC layer and the propagation through asymmetric connections in CA3 cooperate to give a sequence within a theta cycle. The latter factor was proposed as a mechanism of theta phase precession by Tsodyks, Skaggs, Sejnowski, and McNaughton (1996) and Wallenstein and Hasselmo (1997). We hypothesize that theta phase precession is generated in the EC without recurrent connections, while the two mechanisms can cooperate in the CA3 after sequence learning. The propagation among units implies an anterograde expansion of the firing fields of individual units, consistent with the experience-dependent expansion of place fields shown on a linear track (Mehta, Barnes, & McNaughton, 1997; Mehta, Quirk, & Wilson, 2000; Ekstrom, Meltzer, McNaughton, & Barnes, 2001).

In the T3 condition, activity patterns in CA3 are similar to those in the T1 condition in the early phases (see Figure 9C, rows 1 and 2), as are those of the EC. In the late phases, different patterns appear in the propagation component (see Figure 9C, rows 3–5). Activities expand in multiple directions according to the individual directionalities at the individual positions. The spherical enlargement of population activities is consistent with the representation of place in a circular shape, as shown in Figure 8A. In this process, the sequence of population patterns can be interpreted as follows:

1. Current perception: representation of the local view at the current position by the current EC input

2. Pattern completion of place: representation of the current place based on the resulting recurrent connections

3. Anticipation: coordination between the representations of the current perception and the place

This sequence of computation is repeated in every theta cycle during running.
This cycle suggests that the theta rhythm acts as an internal clock, coordinating perception and memory. Additionally, the temporal evolution of active units with multiple directions generates the spherical expansion in running directions, which can be regarded as a synfire chain with multidirectional properties. The expansion also generates noisy patterns of theta phase precession in individual units. Skaggs et al. (1996) reported a consistent observation: theta phase precession exists during spatial exploration of an open field, although the pattern is noisier than on a linear track. Furthermore, Samsonovich and McNaughton (1997) reconstructed the spatiotemporal pattern of population activities within the theta cycle from experimental data obtained in a foraging task in an open field. They demonstrated that a small population appears close to the rat's position in the early phase and
then extends and moves along the running direction in the late phase. This is consistent with our results shown in Figure 9C.

To clarify the functional role of asymmetric recurrent connections, we evaluated the running experience with partial visual cues, in an environment in which half of the landmarks were removed. The rat in the T3 condition is forced to run along a straight path that has not been experienced before. Figure 10 shows the activities of the CA3 units at the peak of the LFP theta during the traversal of the straight path. In Figure 10A, the activities in the input layer are seen to have random deficits in the semicircle. Through the projection from the input layer to the EC layer, the EC units generate phase precession patterns with deficits. We compared activities in the CA3 layer with and without recurrent connections. In the network without recurrent connections (wij = 0 for all i and j; see Figure 10B), CA3 activities have deficits directly reflecting the EC patterns. In contrast, Figure 10C shows that in the network with the recurrent connections learned during spatial exploration, activities evoked by deficit patterns are filled in over a circular region. In comparison with the pattern in the full cue condition, as seen in Figure 10C, row 3, the crescent shape in the anterior part is completed and the activities expand in various directions. This result indicates that the recurrent connections enable the retrieval of activities experienced at individual positions. A similar role of recurrent connections was observed in the experiments of Nakazawa et al. (2002), who used mutant mice lacking NMDA receptors in the CA3 region. They showed that CA3 recurrent connections are necessary to reach the goal in the Morris water maze under partial cue conditions.
In our study, we noticed that the pattern changes at every theta cycle and that pattern completion is observed not in the initial phase of activation but near the peak of every theta oscillation in the CA3 layer. It will be of interest to compare this theta phase-dependent property of the population activity with electrophysiological experiments in the future.

5 Discussion

5.1 Neural Mechanism of Theta Phase Precession. A number of theoretical models of theta phase precession have been proposed; however, none is conclusive. From the standpoint of sequential memory, these models can be divided into two groups. The first produces theta phase precession through connections formed by learning; the second generates phase precession in the absence of learning. Tsodyks et al. (1996) assumed asymmetric connections in CA3 and observed theta phase precession as an asymmetric spread of activation among the cells. Similar assumptions are found in the models proposed by Jensen and Lisman (1996), Wallenstein and Hasselmo (1997), and Samsonovich and McNaughton (1997). These models demonstrated phase
Figure 10: Population activities of the CA3 in the T3 condition with and without recurrent connections, under partial cues. (A) Activities in the input layer. The current position is marked by the rat symbol in each frame. (B) CA3 activities without recurrent connections (wij = 0 for all i and j). (C) CA3 activities with the recurrent connections learned in the T3 condition. The partial input is given by removing half of the landmarks. The rat runs on the diagonal line of the environmental space from the lower left to the upper right. The gray scale of each pixel represents the firing rate of each unit. The time series, 1–6, are obtained at the peak of the CA3 LFP theta over six cycles. CA3 activities without recurrent connections (B) are the same as in the EC (A), while those with recurrent connections (C) are filled in in accordance with the current position (E = 0.2).
precession as a type of retrieval process, but they did not produce sequence learning. O'Keefe and Recce (1993) and Bose and Recce (2001) proposed mechanisms of the second group, which generate phase shifts as transient dynamics. Mehta, Lee, and Wilson (2002) proposed a neural mechanism that translates firing rate into phase coding, based on the monotonic increase of a place cell's firing rate as the rat traverses the place field. As discussed above, what matters in phase coding is not the individual phases but the relative phase relations among neurons, which enable selective synaptic plasticity. In this view, it is not clear that the above three models can produce a stable phase relation among neurons in a continuously changing neural population; thus, they are not directly applicable to memory encoding.

Kamondi, Acsády, Wang, and Buzsáki (1998) and Harris et al. (2002) proposed models based on phase locking between neural oscillations. Although they did not apply their models to memory encoding, these neural mechanisms are in common with our assumptions about the neural dynamics. The difference is that they assume phase locking in the CA3 layer and not in the entorhinal cortex. Our assumption of phase locking at the entorhinal cortex comes primarily from the observation of theta phase precession in the dentate gyrus (Skaggs et al., 1996). Further quantitative analysis of theta phase precession in the DG and CA1 (Yamaguchi et al., 2002) revealed that, after separating two components according to a bimodality of spike distributions, the theta phase precession component in the CA1 has a phase delay of 0.2 theta cycles relative to the dentate gyrus. This strongly suggests transmission of phase precession along the feedforward pathway of the hippocampus, supporting our assumption. Further comparisons with biological experiments are beyond the scope of this article.
It should be noted that the generation site of theta phase precession is critical for elucidating the dynamics of the entire circuit of the hippocampal formation. In our current computer experiments, which focus on chart formation in the CA3, the results do not critically depend on the generation site of theta phase precession. Burgess et al. (1994) and Redish and Touretzky (1998) phenomenologically assumed phase precession during running in the maze, using a geometrical relation between the rat's position and the parameters of individual place fields. It is uncertain whether those parameters are available as neural mechanisms in the hippocampus before cognitive map formation. Thus, phase locking of neural oscillations is a plausible neural mechanism of theta phase precession that supports memory encoding.

5.2 The Retrieval Activity During Memory Encoding. Hasselmo, Bodelón, and Wyble (2002) proposed independent activities of memory encoding and retrieval in different phases of each theta cycle in order to avoid
interference. Consistent with this, in Yamaguchi's (2003) model of theta phase precession, the original version of the model presented here, the contribution of memory reactivation must be relatively smaller than that of the current input to preserve an ordered pattern of theta phase precession. On the other hand, the contribution of learning to place cell activities appears as an experience-dependent expansion of place fields (Mehta et al., 1997, 2000; Ekstrom et al., 2001). This suggests that place cell activity during running is modulated to some degree by experience-dependent properties. We successfully observed the localization of population activities in accordance with both current sensory inputs and experience-dependent properties in place cells. As shown in Figure 10A, after learning, the current sequential activation projected from the entorhinal cortex is accompanied by some memory reactivation based on the CA3 recurrent connections. The synchronous activation of current and recalled sequences generates the representation of the next position. This synchrony serves the anticipation of position by harmonizing memory retrieval with new perceptions, using the theta rhythm as an internal clock. An example is the pattern completion under the partial cue condition examined here.

5.3 A Possible Function of the Chart for Navigation. A navigation model based on theta phase precession was proposed by Burgess et al. (1994); it provides a shortcut to a fixed goal using a feedforward type of network. Blum and Abbott (1996) proposed a goal-oriented map represented in a recurrent network with asymmetric connections, which was extended by Redish and Touretzky (1998) by adding theta phase precession. Trullier and Meyer (2000) demonstrated that phase precession provides recurrent connections in the CA3 that encode the sequences between places toward a goal, where the phase precession was phenomenologically assumed (Burgess et al., 1994).
Trullier and Meyer (2000) assumed a priori a property of place cells in their map, where recurrent connections encode orientations specific to a goal. However, as demonstrated by Gothard, Skaggs, Moore, and McNaughton (1996), changing the goal in the same environment does not alter all place cell properties. This suggests that the chart is formed independently of the goal. On the other hand, as a chart model, our results suggest that the chart may also have a navigation function, because memory reactivation is sensitive to the asymmetric connections. In the retrieval stage (see Figure 8), the asymmetric connections in the CA3 network cause a spontaneous evolution of the recalled place, as well as the localization of the activity. The recalled place sometimes wanders recursively and irregularly around the chart and sometimes converges to one place. Spontaneous and irregular dynamics of neural systems were observed in the rabbit odor recognition experiments of Freeman (1987), and the importance of "chaotic" dynamics in learning was proposed by Skarda and Freeman (1987). The global parameters of excitation and inhibition were found to
be crucial for controlling spontaneous recall. Acetylcholine and GABA are candidates because of their molecular basis, as proposed for associative memory learning in the hippocampus (Hasselmo, 1999; Hasselmo, Hay, Ilyn, & Gorchetchnikov, 2002). Approach to a certain point was observed in the retrieval stage by controlling the E and G values. If a biased signal from extrahippocampal regions can be given as an intentional signal toward the approach point, spontaneous recall should converge at a desired goal. Thus, the dynamical properties of the chart with asymmetric connections could serve navigational functions in a flexible way.

Appendix: Mathematical Formulation of the Model

The temporal evolution of the network is formulated below.

A.1 Neural Unit. The basic equation of the oscillation phase φ in the phase model is given by

φ̇ = ω + {β − I(t)} sin φ
(A.1)
with the output function

P(cos φ) = { cos φ   if cos φ > 0
           { 0       otherwise.                        (A.2)
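Equations A.1 and A.2 can be checked with a few lines of Euler integration. A minimal sketch (the parameter values ω = 1.0 and β = 1.5 are those used elsewhere in this appendix; the function name, time step, and initial phase are our assumptions):

```python
import numpy as np

def simulate_phase_unit(I, omega=1.0, beta=1.5, dt=0.01, steps=5000, phi0=0.1):
    """Euler integration of eq. A.1 with output P = max(cos phi, 0) (eq. A.2).
    I is a constant external input."""
    phi = phi0
    P = np.empty(steps)
    for n in range(steps):
        phi += dt * (omega + (beta - I) * np.sin(phi))
        P[n] = max(np.cos(phi), 0.0)
    return P

# I = 0 with omega < beta: the phase settles at a stable fixed point with P = 0.
rest = simulate_phase_unit(0.0)
# I near beta: sustained rotation of the phase; P sweeps a pulse between 0 and 1.
osc = simulate_phase_unit(1.5)
```

This reproduces the two regimes described in the text: a resting state for zero input and a pulse density between 0 and 1 for input near β.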
Here, ω and cos φ represent the angular frequency and the membrane potential, respectively. Following the biophysical convention, the external input I(t), an electric current, contributes to the change in the membrane potential; the factor − sin φ therefore multiplies I(t) to obtain the vector component in the tangential direction of the unit circle, as illustrated in Figure 11. When I(t) is zero and 0 < ω < β, equation A.1 has a stable fixed point, with P = 0 representing the resting state. When I(t) takes a constant value near β, the equation becomes the well-known equation of sustained oscillations. A phasic increase of I(t) gives an excitatory motion of the phase, so that P gives a pulse density between 0 and 1.

A.2 Network Model

A.2.1 Input Layer. The input vector {Ii}, (i = 1, . . . , N), of the local view at each time t is given by

Ii(t) = { λI (1 − ri(t))   if Ci ∈ V(t)
        { 0                otherwise,

with V(t) = {Ck : rk(t) ≤ Vlength and |αk(t)| ≤ Vangle},       (A.3)
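Equation A.3 amounts to a visibility test followed by a distance-dependent gain. A minimal sketch using the parameter values quoted in the text (λI = 1.5, Vlength = 0.5, Vangle = 90°); the function name and the bearing-wrapping convention are our assumptions:

```python
import numpy as np

def local_view_input(rat_pos, heading, landmarks, lam_I=1.5,
                     v_length=0.5, v_angle=np.deg2rad(90)):
    """Input vector {I_i} of eq. A.3: each landmark C_i drives its unit
    only when it lies inside the visual field V(t)."""
    I = np.zeros(len(landmarks))
    for i, c in enumerate(landmarks):
        dx, dy = c[0] - rat_pos[0], c[1] - rat_pos[1]
        r = np.hypot(dx, dy)                        # distance r_i
        # egocentric bearing alpha_i relative to the running direction
        alpha = np.arctan2(dy, dx) - heading
        alpha = (alpha + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
        if r <= v_length and abs(alpha) <= v_angle:
            I[i] = lam_I * (1.0 - r)
    return I
```

A landmark behind the rat or beyond Vlength contributes zero input, as the text describes.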
Figure 11: The input to a unit of the phase model.
where λI, ri, and αi, respectively, represent the input intensity coefficient, the distance between the rat's current position and the position of landmark Ci, and the angle to Ci from the rat's current running direction. Here, V is defined as the set of landmarks within the visual field, whose area is determined by the field length Vlength and the egocentric bearing range Vangle, as shown in Figure 1. Hence, if a landmark is located outside the rat's visual field, for instance behind the rat, the corresponding input is zero. In all of the experiments, the parameters are chosen as Vlength = 0.5 in the short run, Vlength = 0.3 in spatial exploration, Vangle = 90°, λI = 1.5, and N = 400 (the number of units in every layer).

A.2.2 EC Layer. Individual neuronal units are described by the phase model with the oscillation phase φiEC. The membrane potential of unit i is given by cos φiEC, where a fixed value gives the resting state and the oscillation is given by the motion of the phase φiEC from 0 to 2π. Letting ωi(t) be the native frequency, the time evolution of the oscillation phase is given by

φ̇iEC(t) = ωi(t) + {β0 − Ii(t) − cos φ0EC(t)} sin φiEC(t),   (i = 1, . . . , N).   (A.4)
The positive/negative terms in the braces represent the hyperpolarizing/depolarizing currents. The first term, β0, is an intrinsic component with a positive value representing a stable resting state. The second term, Ii, is the input component from the input layer, and the unit goes into the oscillation state when Ii is positive. The third term, cos φ0EC, represents the oscillation of the local field potential theta (LFP theta). The LFP theta oscillation in the EC layer is assumed to have a constant frequency of ω0 and an oscillation
period of 2π/ω0. The time evolution is given by

φ̇0EC(t) = ω0.
(A.5)
Equations A.4 and A.5 describe a coupled oscillator system when Ii(t) > 0. The relative value of ωi(t) with respect to ω0 determines the phase difference between unit i and the LFP theta, φiEC − φ0EC, as is known from phase locking of nonlinear oscillations. Assuming that the native frequency of unit i increases gradually during the period when Ii(t) > 0, we have the following simple equation for the native frequency:

ωi(t) = { ωA (t − t0) + ωS   if Ii(t) > 0
        { ω0                 otherwise,         (i = 1, . . . , N),   (A.6)

where ωA and ωS are the slope coefficient and the base coefficient of the native frequency, given as 1.16 × 10−3 and 0.5, respectively. The time t0 denotes the onset of positive input, that is, the time when the landmark associated with the unit begins to appear. Other parameters are fixed at ω0 = 1.0 in equations A.5 and A.6 and β0 = 1.5 in equation A.4.

A.2.3 CA3 Layer. Neuronal units and the LFP theta in the CA3 layer are described by the phase model with the oscillation phases φiCA3 and φ0CA3, as follows:

φ̇iCA3(t) = ω1 + {β1 − fiEC(t) − γ1 cos φ0CA3(t) − E · Σj wij fjCA3(t) + G · Rh(t)} sin φiCA3(t),   (i = 1, . . . , N),   (A.7)

φ̇0CA3(t) = ω0 − γ2 cos φ0EC(t) sin φ0CA3(t),   (A.8)
where \omega_1, \beta_1, \gamma_1, \gamma_2, G, and E, respectively, denote the common native frequency of the CA3 units, the intrinsic component of the resting state, the intensity of the LFP theta modulation, the intensity of the EC LFP theta, the inhibition coefficient, and the excitation coefficient. Here, f_i^{EC} and f_j^{CA3} represent the firing rates of EC unit i and CA3 unit j, which are given by the output function of the membrane potentials \cos\phi_i^{EC} and \cos\phi_j^{CA3} in equation A.2. The amount of global inhibition, R_h(t), is given by the following exponential function:

R_h(t) = \exp\Big(\frac{\lambda_I}{N}\sum_j f_j^{CA3}(t)\Big) - 1, (A.9)
H. Wagatsuma and Y. Yamaguchi
where \lambda_I = 4.5. The LFP theta in the CA3 layer has a frequency of \omega_0 (the first term in equation A.8), and its interaction with the LFP theta in the EC layer (the second term in equation A.8) has a magnitude of \gamma_2, chosen so as to keep a phase delay of \pi/4. In our experiments, the parameters are fixed at \omega_1 = 1.5, \beta_1 = 1.2, \gamma_1 = 0.8, \gamma_2 = 0.1, and G = 2.2, with the exception of the LFP fixation in the retrieval stage, where \beta_1 = 0.7 and \gamma_1 = 0. The value of E is given in the text for each experiment. Finally, w_{ij} represents the connection weight from unit j to unit i. The modification rule for the connection weights is given as

\dot{w}_{ij}(t) = \begin{cases} A_{\mathrm{Hebb}}(t) & \text{if } w_{ij}(t) > 0 \\ \max\{A_{\mathrm{Hebb}}(t), 0\} & \text{if } w_{ij}(t) = 0 \end{cases}, \quad\text{with}

A_{\mathrm{Hebb}}(t) = \big(\lambda_{\mathrm{LTP}}\,\{f_j^{CA3}(t)\cdot f_i^{CA3}(t-\tau)\} - \lambda_{\mathrm{LTD}}\,\{f_j^{CA3}(t) + f_i^{CA3}(t-\tau)\}\big)\,(1 - w_{ij}^2(t))\,\lambda_w, (A.10)
where \lambda_{\mathrm{LTP}}, \lambda_{\mathrm{LTD}}, and \lambda_w denote the LTP coefficient, the LTD coefficient, and the learning rate, respectively. The parameters are fixed at \lambda_{\mathrm{LTP}} = 1, \lambda_{\mathrm{LTD}} = 0.07, \lambda_w = 0.03, and \tau = 0.19 theta cycles.

Acknowledgments

We are grateful to Bruce McNaughton and Bill Skaggs for their valuable comments and suggestions. We also thank Rainer Walter Paine and Bonnie Lee La Madeleine for their careful reading and refining of the manuscript.
Received June 18, 2003; accepted May 28, 2004.
LETTER
Communicated by Simon Haykin
A Complex-Valued RTRL Algorithm for Recurrent Neural Networks Su Lee Goh [email protected]
Danilo P. Mandic [email protected] Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K.
A complex-valued real-time recurrent learning (CRTRL) algorithm for the class of nonlinear adaptive filters realized as fully connected recurrent neural networks is introduced. The proposed CRTRL is derived for a general complex activation function of a neuron, which makes it suitable for nonlinear adaptive filtering of complex-valued nonlinear and nonstationary signals and of complex signals with strong component correlations. In addition, the algorithm is generic and represents a natural extension of the real-valued RTRL. Simulations on benchmark and real-world complex-valued signals support the approach.

1 Introduction

Recurrent neural networks (RNNs) are a widely used class of nonlinear adaptive filters with feedback, owing to their ability to represent highly nonlinear dynamical systems (Elman, 1990; Tsoi & Back, 1994), attractor dynamics, and associative memories (Medsker & Jain, 2000; Principe, Euliano, & Lefebvre, 2000).¹ In principle, a static feedforward network can be transformed into a dynamic recurrent network by adding recurrent connections (Haykin, 1994; Puskorius & Feldkamp, 1994; Feldkamp & Puskorius, 1998). A very general class of networks with both feedforward and feedback connections is the recurrent multilayer perceptron (RMLP) network (Puskorius & Feldkamp, 1994; Feldkamp & Puskorius, 1998), whose representation capabilities have been shown to be considerably greater than those of static feedforward networks. Feedback neural networks have proven their potential in the processing of nonlinear and nonstationary signals, with applications in signal modeling, system identification, time-series analysis, and prediction (Mandic & Chambers, 2001).

¹ Nonlinear autoregressive (NAR) processes can be modeled using feedforward networks, whereas nonlinear autoregressive moving average (NARMA) processes can be represented using RNNs.
Neural Computation 16, 2699–2713 (2004) © 2004 Massachusetts Institute of Technology
In 1989, Williams and Zipser proposed an online algorithm for the training of RNNs, called the real-time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989), which has since found a variety of signal processing applications (Haykin, 1994). This is a direct gradient algorithm, unlike the recurrent backpropagation used for the training of RMLPs. Despite its relative simplicity (as compared to recurrent backpropagation), one of the difficulties of using direct gradient descent algorithms for training RNNs is the problem of vanishing gradients. Bengio, Simard, and Frasconi (1994) showed that the vanishing-gradient problem is the essential reason that gradient descent methods are often slow to converge. Several approaches have been proposed to circumvent this problem, including both new algorithms (such as the extended Kalman filter, EKF) and new architectures, such as cascaded recurrent neural networks (Williams, 1992; Puskorius & Feldkamp, 1994; Haykin & Li, 1995). In the fields of engineering and biomedicine, signals are typically nonlinear, nonstationary, and often complex valued. Properties of such signals vary not only in terms of their statistical nature but also in terms of their bivariate or complex-valued nature (Gautama, Mandic, & Hulle, 2003). As for the learning algorithms, in the area of linear filters (the linear perceptron), the least mean square (LMS) algorithm was extended to the complex domain in 1975 (Widrow, McCool, & Ball, 1975). Subsequently, complex-valued backpropagation (CBP) was introduced by Leung and Haykin (1991) and Benvenuto and Piazza (1992). There, a complex nonlinear activation function (AF) that separately processes the in-phase (I) and quadrature (Q) components of the weighted sum of input signals (the net input) was employed.
This way, the output from a complex AF takes two independent paths, an approach that has since been referred to as the split-complex approach.² Although bounded, a split-complex AF cannot be analytic and thus cannot cater to signals with strong coupling between the magnitude and phase (Kim & Adali, 2000). In the 1990s, Georgiou and Koutsougeras (1992) and Hirose (1990) derived CBP algorithms that use nonlinear complex AFs that jointly process the I and Q components. However, these algorithms had difficulties when learning nonlinear phase changes, and the AF used in Georgiou and Koutsougeras (1992) was not analytic (Kim & Adali, 2000). In general, the split-complex approach has been shown to yield reasonable performance for some applications in channel equalization (Leung & Haykin, 1991; Benvenuto & Piazza, 1992; Kechriotis & Manolakos, 1994) and for applications where there is no strong coupling between the real and imaginary parts of the complex signal. However, for the common case where the in-phase (I) and quadrature (Q) components are strongly correlated, algorithms employing split-complex activation functions tend to yield poor performance (Gautama et al., 2003). Notice that split-complex algorithms cannot calculate the true gradient unless the real and imaginary weight updates are mutually independent. Research toward a fully complex RNN has focused on the issue of analytic activation functions. A comprehensive account of elementary complex transcendental functions (ETFs) used as AFs is given in Kim and Adali (2002). They employed ETFs to derive a fully complex CBP algorithm using the Cauchy-Riemann³ equations, which also helped to relax the requirements on the properties of a fully complex activation function.⁴ For the case of RNNs, Kechriotis and Manolakos (1994) and Coelho (2001) introduced complex-valued RTRL (CRTRL) algorithms. Both approaches, however, employed split-complex AFs, thus restricting the domain of application. In addition, these algorithms did not follow the generic form of their real RTRL counterpart. There are other problems with split-complex RTRL algorithms for nonlinear adaptive filtering: (1) the solutions are not general, since split-complex AFs are not universal approximators (Leung & Haykin, 1991); (2) split-complex AFs are not analytic, and hence the Cauchy-Riemann equations do not apply; (3) split-complex algorithms do not account for a "fully" complex nature of the signal (Kim & Adali, 2003), and such algorithms therefore underperform in applications where complex signals exhibit strong component correlations (Mandic & Chambers, 2001); (4) these algorithms do not have the generic form of their real-valued counterparts, and hence their signal flow graphs are fundamentally different (Nerrand, Roussel-Ragot, Personnaz, & Dreyfus, 1993a). Therefore, there is a need to address the problem of training a fully connected complex-valued recurrent neural network.

² In a split-complex AF, the real and imaginary components of the input signal x are separated and fed through a real-valued activation function f_R(x) = f_I(x), x ∈ R. A split-complex activation function is therefore given as f(x) = f_R(Re(x)) + j f_I(Im(x)); for example, f(x) = 1/(1 + e^{-\beta Re(x)}) + j/(1 + e^{-\beta Im(x)}).
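To make the distinction concrete, here is a minimal sketch (our own, not from the paper) of the two activation types for the logistic nonlinearity of footnote 2; the function names and test input are ours.

```python
import cmath
import math

def split_logistic(z, beta=1.0):
    """Split-complex AF: f(z) = f_R(Re z) + j f_R(Im z), per footnote 2."""
    f = lambda x: 1.0 / (1.0 + math.exp(-beta * x))
    return complex(f(z.real), f(z.imag))

def fully_complex_logistic(z, beta=1.0):
    """Fully complex AF: the same logistic applied to the complex argument."""
    return 1.0 / (1.0 + cmath.exp(-beta * z))

z = 0.3 + 0.4j
print(split_logistic(z))          # Re and Im squashed independently
print(fully_complex_logistic(z))  # magnitude and phase processed jointly
```

The split form is bounded but not analytic; the fully complex form is analytic (away from its poles), which is what allows the Cauchy-Riemann equations to be used in the derivation below.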
Although there have been attempts to devise fully complex algorithms for the online training of RNNs, a general fully complex CRTRL has been lacking to date. To this end, we derive a CRTRL for a single recurrent neural network with a general "fully" complex activation function. This makes complex RNNs suitable for adaptive filtering of complex-valued nonlinear and nonstationary signals. The fully connected recurrent neural network (FCRNN) used here is the canonical form of a feedback neural network (Nerrand, Roussel-Ragot, Personnaz, & Dreyfus, 1993b) that is general enough for the class of direct gradient-based CRTRL algorithms. The analysis is comprehensive and is supported by examples on benchmark complex-valued nonlinear and colored signals, together with simulations on real-world radar and environmental measurements.

³ The Cauchy-Riemann equations state that the partial derivatives of a function f(z) = u(x, y) + jv(x, y) along the real and imaginary axes should be equal: f'(z) = \partial u/\partial x + j\,\partial v/\partial x = \partial v/\partial y - j\,\partial u/\partial y. This way, \partial u/\partial x = \partial v/\partial y and \partial v/\partial x = -\partial u/\partial y.
⁴ A fully complex activation function is analytic and bounded almost everywhere in C.
Figure 1: A fully connected recurrent neural network.
2 The Complex-Valued RTRL Algorithm

Figure 1 shows an FCRNN, which consists of N neurons with p external inputs. The network has two distinct layers: the external input-feedback layer and a layer of processing elements. Let y_l(k) denote the complex-valued output of each neuron, l = 1, \ldots, N, at time index k, and let s(k) be the (1 × p) external complex-valued input vector. The overall input to the network, I(k), is the concatenation of the vectors y(k) and s(k) and the bias input (1 + j), given by

I(k) = [s(k-1), \ldots, s(k-p), 1+j, y_1(k-1), \ldots, y_N(k-1)]^T = I_n^r(k) + jI_n^i(k), \quad n = 1, \ldots, p+N+1, (2.1)

where j = \sqrt{-1}, (\cdot)^T denotes the vector transpose operator, and the superscripts (\cdot)^r and (\cdot)^i denote, respectively, the real and imaginary parts of a complex number. The complex-valued weight matrix of the network is denoted by W, where for the lth neuron the weights form a (p + F + 1) × 1 dimensional weight
vector w_l = [w_{l,1}, \ldots, w_{l,p+F+1}]^T, where F is the number of feedback connections. The feedback connections represent the delayed output signals of the FCRNN; in the case of Figure 1, we have F = N. The output of each neuron can be expressed as

y_l(k) = \Phi(net_l(k)), \quad l = 1, \ldots, N, (2.2)

where

net_l(k) = \sum_{n=1}^{p+N+1} w_{l,n}(k)\, I_n(k) (2.3)

is the net input to the lth node at time index k. For simplicity, we write

y_l(k) = \Phi^r(net_l(k)) + j\Phi^i(net_l(k)) = u_l(k) + jv_l(k), (2.4)
net_l(k) = \sigma_l(k) + j\tau_l(k), (2.5)

where \Phi is a complex nonlinear activation function.

3 The Derivation of the Complex-Valued RTRL

The output error consists of real and imaginary parts and is defined as

e_l(k) = d(k) - y_l(k) = e_l^r(k) + je_l^i(k), (3.1)
e_l^r(k) = d^r(k) - u_l(k), \quad e_l^i(k) = d^i(k) - v_l(k), (3.2)

where d(k) is the teaching signal. For real-time applications, the cost function of the recurrent network is given by (Widrow et al., 1975)

E(k) = \frac{1}{2}\sum_{l=1}^N |e_l(k)|^2 = \frac{1}{2}\sum_{l=1}^N e_l(k)e_l^*(k) = \frac{1}{2}\sum_{l=1}^N \big[(e_l^r)^2 + (e_l^i)^2\big], (3.3)

where (\cdot)^* denotes the complex conjugate. Notice that E(k) is a real-valued function; we are therefore required to compute the gradient of E(k) with respect to both the real and imaginary parts of the complex weights, as

\nabla_{w_{s,t}} E(k) = \frac{\partial E(k)}{\partial w_{s,t}^r} + j\frac{\partial E(k)}{\partial w_{s,t}^i}, \quad 1 \le l, s \le N, \; 1 \le t \le p+N+1. (3.4)

The CRTRL algorithm minimizes the cost function E(k) by recursively altering the weight coefficients based on gradient descent, given by

w_{s,t}(k+1) = w_{s,t}(k) + \Delta w_{s,t}(k) = w_{s,t}(k) - \eta\,\nabla_{w_{s,t}} E(k)\big|_{w_{s,t} = w_{s,t}(k)}, (3.5)
where \eta is the learning rate, a small positive constant. Calculating the gradient of the cost function with respect to the real part of a complex weight gives

\frac{\partial E(k)}{\partial w_{s,t}^r(k)} = \frac{\partial E}{\partial u_l}\,\frac{\partial u_l(k)}{\partial w_{s,t}^r(k)} + \frac{\partial E}{\partial v_l}\,\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)}, \quad 1 \le l, s \le N, \; 1 \le t \le p+N+1. (3.6)

Similarly, the partial derivative of the cost function with respect to the imaginary part of the complex weight yields

\frac{\partial E(k)}{\partial w_{s,t}^i(k)} = \frac{\partial E}{\partial u_l}\,\frac{\partial u_l(k)}{\partial w_{s,t}^i(k)} + \frac{\partial E}{\partial v_l}\,\frac{\partial v_l(k)}{\partial w_{s,t}^i(k)}, \quad 1 \le l, s \le N, \; 1 \le t \le p+N+1. (3.7)

The factors \partial y_l(k)/\partial w_{s,t}^r(k) = \partial u_l(k)/\partial w_{s,t}^r(k) + j\,\partial v_l(k)/\partial w_{s,t}^r(k) and \partial y_l(k)/\partial w_{s,t}^i(k) = \partial u_l(k)/\partial w_{s,t}^i(k) + j\,\partial v_l(k)/\partial w_{s,t}^i(k) are measures of the sensitivity of the output of the lth unit at time k to a small variation in the value of w_{s,t}(k). These sensitivities can be evaluated by the chain rule as

\frac{\partial u_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial u_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^r(k)} + \frac{\partial u_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^r(k)} (3.8)

\frac{\partial u_l(k)}{\partial w_{s,t}^i(k)} = \frac{\partial u_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^i(k)} + \frac{\partial u_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^i(k)} (3.9)

\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial v_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^r(k)} + \frac{\partial v_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^r(k)} (3.10)

\frac{\partial v_l(k)}{\partial w_{s,t}^i(k)} = \frac{\partial v_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^i(k)} + \frac{\partial v_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^i(k)} (3.11)
Following the derivation of the real-valued RTRL (Williams & Zipser, 1989), to compute these sensitivities we start by differentiating equation 2.3, which yields

\frac{\partial \sigma_l(k)}{\partial w_{s,t}^r(k)} = \sum_{q=1}^{N}\Big[\frac{\partial u_q(k-1)}{\partial w_{s,t}^r(k)}\,w_{l,p+1+q}^r(k) - \frac{\partial v_q(k-1)}{\partial w_{s,t}^r(k)}\,w_{l,p+1+q}^i(k)\Big] + \delta_{sl} I_n^r(k)

\frac{\partial \tau_l(k)}{\partial w_{s,t}^r(k)} = \sum_{q=1}^{N}\Big[\frac{\partial v_q(k-1)}{\partial w_{s,t}^r(k)}\,w_{l,p+1+q}^r(k) + \frac{\partial u_q(k-1)}{\partial w_{s,t}^r(k)}\,w_{l,p+1+q}^i(k)\Big] + \delta_{sl} I_n^i(k)

\frac{\partial \sigma_l(k)}{\partial w_{s,t}^i(k)} = \sum_{q=1}^{N}\Big[\frac{\partial u_q(k-1)}{\partial w_{s,t}^i(k)}\,w_{l,p+1+q}^r(k) - \frac{\partial v_q(k-1)}{\partial w_{s,t}^i(k)}\,w_{l,p+1+q}^i(k)\Big] - \delta_{sl} I_n^i(k)

\frac{\partial \tau_l(k)}{\partial w_{s,t}^i(k)} = \sum_{q=1}^{N}\Big[\frac{\partial v_q(k-1)}{\partial w_{s,t}^i(k)}\,w_{l,p+1+q}^r(k) + \frac{\partial u_q(k-1)}{\partial w_{s,t}^i(k)}\,w_{l,p+1+q}^i(k)\Big] + \delta_{sl} I_n^r(k),

where

\delta_{sl} = \begin{cases} 1, & l = s \\ 0, & l \ne s \end{cases} (3.12)

is the Kronecker delta. For a complex function to be analytic at a point in C, it needs to satisfy the Cauchy-Riemann equations; that is, the partial derivatives (sensitivities) along the real and imaginary axes should be equal:

\frac{\partial u_l(k)}{\partial w_{s,t}^r(k)} + j\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial v_l(k)}{\partial w_{s,t}^i(k)} - j\frac{\partial u_l(k)}{\partial w_{s,t}^i(k)}. (3.13)

Equating the real and imaginary parts in equation 3.13, we obtain

\frac{\partial u_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial v_l(k)}{\partial w_{s,t}^i(k)}, (3.14)

\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)} = -\frac{\partial u_l(k)}{\partial w_{s,t}^i(k)}. (3.15)
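The Cauchy-Riemann relations of equations 3.13 to 3.15 can be checked numerically for any analytic function; this quick sketch (ours, using tanh as the analytic function and an arbitrary test point) estimates the partial derivatives by central differences.

```python
import cmath

# Numerical check of the Cauchy-Riemann equations for f(z) = tanh(z):
# du/dx = dv/dy and dv/dx = -du/dy at an arbitrary point z.
f = cmath.tanh
z, h = 0.3 + 0.2j, 1e-6

du_dx = (f(z + h).real - f(z - h).real) / (2 * h)
dv_dx = (f(z + h).imag - f(z - h).imag) / (2 * h)
du_dy = (f(z + 1j * h).real - f(z - 1j * h).real) / (2 * h)
dv_dy = (f(z + 1j * h).imag - f(z - 1j * h).imag) / (2 * h)

print(du_dx - dv_dy, dv_dx + du_dy)  # both differences vanish for analytic f
```

Running the same check on a split-complex function such as the one in footnote 2 makes the second difference nonzero, which is exactly why the split-complex approach cannot use this machinery.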
For convenience, we denote the sensitivities by \pi_{s,t}^{l,(rr)}(k) = \partial u_l(k)/\partial w_{s,t}^r(k), \pi_{s,t}^{l,(ir)}(k) = \partial v_l(k)/\partial w_{s,t}^r(k), \pi_{s,t}^{l,(ri)}(k) = \partial u_l(k)/\partial w_{s,t}^i(k), and \pi_{s,t}^{l,(ii)}(k) = \partial v_l(k)/\partial w_{s,t}^i(k). Using the Cauchy-Riemann equations (so that \pi_{s,t}^{l,(ii)} = \pi_{s,t}^{l,(rr)} and \pi_{s,t}^{l,(ri)} = -\pi_{s,t}^{l,(ir)}), and noting from equation 3.3 that \partial E(k)/\partial u_l = -e_l^r(k) and \partial E(k)/\partial v_l = -e_l^i(k), a more compact representation of the gradient \nabla_{w_{s,t}} E(k) is

\nabla_{w_{s,t}} E(k) = \sum_{l=1}^N \Big[\pi_{s,t}^{l,(rr)}(k)\frac{\partial E(k)}{\partial u_l(k)} + \pi_{s,t}^{l,(ir)}(k)\frac{\partial E(k)}{\partial v_l(k)}\Big] + j\sum_{l=1}^N \Big[\pi_{s,t}^{l,(ri)}(k)\frac{\partial E(k)}{\partial u_l(k)} + \pi_{s,t}^{l,(ii)}(k)\frac{\partial E(k)}{\partial v_l(k)}\Big]
= -\sum_{l=1}^N e_l(k)\big(\pi_{s,t}^{l,(rr)}(k) - j\pi_{s,t}^{l,(ir)}(k)\big)
= -\sum_{l=1}^N e_l(k)\big(\pi_{s,t}^{l}\big)^*(k), (3.16)

where \pi_{s,t}^{l}(k) = \pi_{s,t}^{l,(rr)}(k) + j\pi_{s,t}^{l,(ir)}(k) = \partial y_l(k)/\partial w_{s,t}^r(k). The weight update is finally given by

\Delta w_{s,t}(k) = \eta \sum_{l=1}^N e_l(k)\big(\pi_{s,t}^{l}\big)^*(k), \quad 1 \le l, s \le N, \; 1 \le t \le p+N+1, (3.17)
with the initial condition

\big(\pi_{s,t}^{l}\big)^*(0) = 0. (3.18)

Under the assumption, also used in the real-valued RTRL algorithm (Williams & Zipser, 1989), that for a sufficiently small learning rate \eta we have

\frac{\partial u_l(k-1)}{\partial w_{s,t}(k)} \approx \frac{\partial u_l(k-1)}{\partial w_{s,t}(k-1)}, \quad 1 \le l, s \le N, \; 1 \le t \le p+N+1, (3.19)

\frac{\partial v_l(k-1)}{\partial w_{s,t}(k)} \approx \frac{\partial v_l(k-1)}{\partial w_{s,t}(k-1)}, \quad 1 \le l, s \le N, \; 1 \le t \le p+N+1, (3.20)
the update for the conjugate sensitivities \big(\pi_{s,t}^{l}\big)^*(k) becomes

\big(\pi_{s,t}^{l}\big)^*(k) = \frac{\partial u_l}{\partial \sigma_l}\Big[\delta_{sl} I_n^r(k) - j\delta_{sl} I_n^i(k) + \sum_{q=1}^N \big(w_{l,p+1+q}^r(k) - jw_{l,p+1+q}^i(k)\big)\pi_{s,t}^{q,(rr)}(k-1) + \big(w_{l,p+1+q}^i(k) + jw_{l,p+1+q}^r(k)\big)\pi_{s,t}^{q,(ri)}(k-1)\Big]
+ \frac{\partial u_l}{\partial \tau_l}\Big[\delta_{sl} I_n^i(k) + j\delta_{sl} I_n^r(k) + \sum_{q=1}^N \big(-w_{l,p+1+q}^r(k) + jw_{l,p+1+q}^i(k)\big)\pi_{s,t}^{q,(ri)}(k-1) + \big(w_{l,p+1+q}^i(k) + jw_{l,p+1+q}^r(k)\big)\pi_{s,t}^{q,(rr)}(k-1)\Big]

= \frac{\partial u_l}{\partial \sigma_l}\Big[\delta_{sl} I_n^*(k) + \sum_{q=1}^N w_{l,p+1+q}^*(k)\,\pi_{s,t}^{q,(rr)}(k-1) + jw_{l,p+1+q}^*(k)\,\pi_{s,t}^{q,(ri)}(k-1)\Big]
+ j\,\frac{\partial u_l}{\partial \tau_l}\Big[\delta_{sl} I_n^*(k) + \sum_{q=1}^N w_{l,p+1+q}^*(k)\,\pi_{s,t}^{q,(rr)}(k-1) + jw_{l,p+1+q}^*(k)\,\pi_{s,t}^{q,(ri)}(k-1)\Big]

= \Big(\frac{\partial u_l}{\partial \sigma_l} + j\,\frac{\partial u_l}{\partial \tau_l}\Big)\Big[\delta_{sl} I_n^*(k) + \sum_{q=1}^N w_{l,p+1+q}^*(k)\big(\pi_{s,t}^{q,(rr)}(k-1) + j\pi_{s,t}^{q,(ri)}(k-1)\big)\Big]

= \big(\Phi'(k)\big)^*\Big[\mathbf{w}_l^H(k)\,\boldsymbol{\pi}_{s,t}^*(k-1) + \delta_{sl} I_n^*(k)\Big], (3.21)

where \boldsymbol{\pi}_{s,t}^*(k-1) = [(\pi_{s,t}^{1})^*(k-1), \ldots, (\pi_{s,t}^{N})^*(k-1)]^T, \mathbf{w}_l denotes the vector of feedback weights of neuron l, and (\cdot)^H denotes the Hermitian transpose operator. In the last step, the Cauchy-Riemann equations give \pi_{s,t}^{q,(rr)} + j\pi_{s,t}^{q,(ri)} = \pi_{s,t}^{q,(rr)} - j\pi_{s,t}^{q,(ir)} = (\pi_{s,t}^{q})^* and \partial u_l/\partial \sigma_l + j\,\partial u_l/\partial \tau_l = (\Phi'(k))^*. This completes the derivation of the fully complex RTRL (CRTRL).
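To summarize, the forward pass (equations 2.1 to 2.3), the sensitivity recursion (equation 3.21), and the weight update (equation 3.17) fit in a few lines. The following is our own illustrative sketch, not the authors' code: the sizes, the random seed, the tanh activation (used as a stand-in fully complex AF), and the toy teaching signal are all our choices.

```python
import numpy as np

# One CRTRL iteration per equations 3.17 and 3.21 for a small FCRNN.
rng = np.random.default_rng(0)
N, p, eta = 2, 2, 0.01
M = p + 1 + N                              # external inputs + bias + feedback
W = 0.1 * (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M)))
y = np.zeros(N, dtype=complex)             # y(k-1)
pi = np.zeros((N, N, M), dtype=complex)    # pi[l, s, t] = (pi^l_{s,t})*

def crtrl_step(W, y, pi, s_in, d):
    I = np.concatenate([s_in, [1 + 1j], y])          # equation 2.1
    net = W @ I                                       # equation 2.3
    y_new = np.tanh(net)                              # fully complex AF
    e = d - y_new                                     # equation 3.1
    dphi_conj = np.conj(1.0 - y_new**2)               # (Phi'(net))* for tanh
    pi_new = np.empty_like(pi)
    for l in range(N):
        for s in range(N):
            for t in range(M):
                rec = np.conj(W[l, p + 1:]) @ pi[:, s, t]     # w_l^H pi*(k-1)
                drv = np.conj(I[t]) if s == l else 0.0        # delta_sl I_t*
                pi_new[l, s, t] = dphi_conj[l] * (rec + drv)  # equation 3.21
    W_new = W + eta * np.einsum('l,lst->st', e, pi_new)       # equation 3.17
    return W_new, y_new, pi_new

# Toy usage: one-step prediction of a complex exponential.
x = np.exp(1j * 0.1 * np.arange(300))
for k in range(300 - p - 1):
    s_in = x[k:k + p][::-1]        # s(k-1), ..., s(k-p)
    d = np.full(N, x[k + p])       # teach all neurons the next sample
    W, y, pi = crtrl_step(W, y, pi, s_in, d)
print(abs(d[0] - y[0]))            # prediction error of the first neuron
```

Note how the sensitivity array stores the conjugate sensitivities directly, so that the weight update of equation 3.17 is a plain contraction of the error against it.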
3.1 Computational Requirements. The memory requirements and computational complexity of complex-valued neural networks differ from those of real-valued neural networks. A complex addition is roughly equivalent to two real additions, whereas a complex multiplication requires either four real multiplications and two additions or three multiplications and five additions.⁵ The CRTRL computes the sensitivities \big(\pi_{s,t}^{l}\big)^*(k) = \pi_{s,t}^{l,(rr)}(k) - j\pi_{s,t}^{l,(ir)}(k) at every iteration with respect to all the elements of the complex weight matrix, where every column corresponds to a neuron in the network. For a real-valued FCRNN, there are N² weights, N³ gradients, and O(N⁴) operations required for the gradient computation per sample. When the network is fully connected and all the weights are adaptable, the algorithm has a space complexity of O(N³) (Williams & Zipser, 1989). Thus, the space complexity required by the complex-valued FCRNN is twice that of the real-valued case, whereas the computational complexity is approximately four times that of the real-valued RTRL.
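The two complex-multiplication schemes referred to above (detailed in footnote 5) can be checked directly; this quick sketch is ours, with hypothetical function names.

```python
# Two ways to compute (a+jb)(c+jd) with real arithmetic.
def mul4(a, b, c, d):
    """Usual form: four real multiplications, two additions."""
    return (a * c - b * d, b * c + a * d)

def mul3(a, b, c, d):
    """Rearranged form: three real multiplications, five additions."""
    t1 = a * (c + d)
    t2 = d * (a + b)
    t3 = c * (b - a)
    return (t1 - t2, t1 + t3)

print(mul4(1.0, 2.0, 3.0, 4.0), mul3(1.0, 2.0, 3.0, 4.0))  # identical results
```

The three-multiplication form trades multiplications for additions, which only pays off on hardware where multiplication is much more expensive, and it can be less numerically stable.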
4 Simulations

For the experiments, the nonlinearity at the neuron was chosen to be the logistic sigmoid function, given by

\Phi(x) = \frac{1}{1 + e^{-\beta x}}, (4.1)
where x is complex valued. The slope was chosen as \beta = 1 and the learning rate as \eta = 0.01. The architecture of the FCRNN (see Figure 1) consists of N = 2 neurons with a tap input length of p = 4. Simulations were undertaken by averaging 100 independent trials on the prediction of complex-valued signals. To support the analysis, we tested the CRTRL on a wide range of signals, including complex linear, complex nonlinear, and chaotic (Ikeda map) signals. To further illustrate the approach and verify the advantage of using the fully complex CRTRL (FCRTRL) over the split CRTRL (SCRTRL), single-trial experiments were performed on real-world complex wind⁶ and radar⁷ data. In the first experiment, the input signal was a complex linear AR(4) process given by

r(k) = 1.79\,r(k-1) - 1.85\,r(k-2) + 1.27\,r(k-3) - 0.41\,r(k-4) + n(k), (4.2)

⁵ There are two ways to perform a complex multiplication: (a + jb)(c + jd) = (ac - bd) + j(bc + ad), or (a + jb)(c + jd) = a(c + d) - d(a + b) + j(a(c + d) + c(b - a)) (publicly available online from http://mathworld.wolfram.com/). The latter form may be preferred when multiplication takes much more time than addition, but it can be less numerically stable.
⁶ Publicly available online from http://mesonet.agron.iastate.edu/.
⁷ Publicly available online from http://soma.ece.mcmaster.ca/ipix/.
Figure 2: Performance curves for FCRTRL and SCRTRL on prediction of the complex colored input, equation 4.2.
with complex white Gaussian noise (CWGN) n(k) \sim \mathcal{N}(0, 1) as the driving input. The CWGN can be expressed as n(k) = n^r(k) + jn^i(k); its real and imaginary components are mutually independent sequences having equal variances, so that \sigma_n^2 = \sigma_{n^r}^2 + \sigma_{n^i}^2. For the second experiment, the complex benchmark nonlinear input signal was (Narendra & Parthasarathy, 1990)

z(k) = \frac{z(k-1)}{1 + z^2(k-1)} + r^3(k). (4.3)
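The two benchmark inputs of equations 4.2 and 4.3 can be generated in a few lines. This sketch is ours; the seed and sequence length are arbitrary, and we take r(k) in equation 4.3 to be the colored signal of equation 4.2, as the shared symbol suggests.

```python
import numpy as np

# Benchmark signal generation: the colored AR(4) process of equation 4.2
# driven by unit-variance CWGN, and the nonlinear signal of equation 4.3.
rng = np.random.default_rng(1)
T = 5000
n = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)

r = np.zeros(T, dtype=complex)
for k in range(4, T):                                       # equation 4.2
    r[k] = 1.79*r[k-1] - 1.85*r[k-2] + 1.27*r[k-3] - 0.41*r[k-4] + n[k]

z = np.zeros(T, dtype=complex)
for k in range(1, T):                                       # equation 4.3
    z[k] = z[k-1] / (1.0 + z[k-1]**2) + r[k]**3

print(abs(r).max(), abs(z).max())
```

The division of the real and imaginary noise components by \sqrt{2} enforces \sigma_n^2 = \sigma_{n^r}^2 + \sigma_{n^i}^2 = 1, matching the text above.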
Figures 2 and 3 show, respectively, the performance curves for the FCRTRL and SCRTRL on the complex colored (see equation 4.2) and nonlinear (see equation 4.3) signals. The proposed approach showed improved performance for both the complex colored and the nonlinear input: the improvement was roughly 3 dB for the linear signal and 6 dB for the nonlinear signal. In addition, the proposed CRTRL exhibited faster convergence. The simulation results for the Ikeda map are shown in Figure 4. Observe that the FCRTRL algorithm was more stable and exhibited improved and more consistent performance over that of the SCRTRL algorithm. Figure 5 shows the prediction performance of the FCRNN applied to the complex-valued radar signal for both the split and the fully complex CRTRL. The FCRNN was able to track the complex radar signal, which was not the case
A Complex-Valued RTRL Algorithm for Recurrent Neural Networks
2709
Figure 3: Performance curves for FCRTRL and SCRTRL on prediction of complex nonlinear input, equation 4.3. (Plot of 10 log10 E(k) against the number of iterations k.)
for the split CRTRL. For the next experiment, we compared the performance of FCRTRL and SCRTRL on wind data. Figures 6 and 7 show the results obtained for FCRTRL and SCRTRL: the FCRTRL achieved much improved prediction performance over the SCRTRL.

5 Conclusions

A fully complex real-time recurrent learning (CRTRL) algorithm for online training of fully connected recurrent neural networks (FCRNNs) has been proposed. The FCRTRL algorithm provides an essential tool for nonlinear adaptive filtering using FCRNNs in the complex domain and is derived for a general complex nonlinear activation function of a neuron. We have made use of the Cauchy-Riemann equations to obtain a generic form of the proposed algorithm. The performance of the CRTRL algorithm has been evaluated on benchmark complex-valued nonlinear and colored input signals, on the Ikeda map, and on real-life complex-valued wind and radar signals. The experimental results demonstrate the potential of the CRTRL algorithm in nonlinear neural adaptive filtering applications. Expanding the fully complex RNN into modular RNNs and developing extended Kalman filter techniques to enhance the estimation capability remain among the most important further challenges in the area of fully complex RNNs.
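The split-complex versus fully complex distinction that runs through the SCRTRL/FCRTRL comparison can be sketched in a few lines. This is an illustrative fragment assuming the commonly used tanh nonlinearity, not the authors' code:

```python
import numpy as np

def split_tanh(z):
    """Split-complex activation: a real tanh applied independently to the
    real and imaginary parts. Not an analytic function of z, so the
    Cauchy-Riemann equations do not apply."""
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def fully_complex_tanh(z):
    """Fully complex activation: tanh of the complex argument itself.
    Analytic away from its poles, so a single complex derivative exists
    via the Cauchy-Riemann equations, as exploited by FCRTRL."""
    return np.tanh(z)

z = 0.3 + 0.4j
# The two activations generally disagree for complex arguments.
```

The practical consequence is that the fully complex form admits one generic gradient expression for the recurrent weight update, whereas the split form must treat the real and imaginary channels separately.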
Figure 4: Performance curves of FCRTRL ('fully' CRTRL) and SCRTRL ('split' CRTRL) for the Ikeda map. (Plot of 10 log10 E(k) against the number of iterations k.)
Figure 5: Prediction of the complex radar signal based on fully complex and split-complex activation functions. (A) Split-complex CRTRL. (B) Fully complex CRTRL. Solid curve: actual radar signal. Dashed curve: nonlinear prediction. (Signal amplitude against the number of iterations k.)
Figure 6: Prediction of the complex wind signal using FCRTRL (Rp = 5.7717 dB). Solid curve: actual wind signal. Dotted curve: predicted signal. (Absolute wind signal against the number of iterations k.)

Figure 7: Prediction of the complex wind signal using SCRTRL (Rp = 1.6298 dB). Solid curve: actual wind signal. Dotted curve: predicted signal. (Absolute wind signal against the number of iterations k.)
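The Rp values quoted in Figures 6 and 7 denote the prediction gain. Assuming the standard definition Rp = 10 log10(σ̂s²/σ̂e²), the ratio of estimated signal variance to prediction-error variance (this definition is an assumption; it is not restated in this excerpt), it can be computed as:

```python
import numpy as np

def prediction_gain_db(signal, error):
    """Prediction gain Rp = 10*log10(var(signal)/var(error)) in dB.
    Assumed standard definition; a higher Rp means better prediction."""
    return 10.0 * np.log10(np.var(signal) / np.var(error))

x = np.array([0.1, 0.5, -0.3, 0.7, 0.2])
# An error with 1/100 of the signal variance gives Rp = 20 dB.
```

On this reading, the FCRTRL figure of 5.77 dB versus 1.63 dB for SCRTRL corresponds to a roughly 2.6-fold better error-variance ratio on the wind data.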
References
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Benvenuto, N., & Piazza, F. (1992). On the complex backpropagation algorithm. IEEE Transactions on Signal Processing, 40(4), 967–969.
Coelho, P. H. G. (2001). A complex EKF-RTRL neural network. In Proceedings of the International Joint Conference on Neural Networks (Vol. 1, pp. 120–125).
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Feldkamp, L. A., & Puskorius, G. V. (1998). A signal processing framework based on dynamical neural networks with application to problems in adaptation, filtering and classification. Proceedings of the IEEE, 86(11), 2259–2277.
Gautama, T., Mandic, D. P., & Van Hulle, M. M. (2003). A non-parametric test for detecting the complex-valued nature of time series. In Knowledge-Based Intelligent Information and Engineering Systems: 7th International Conference (pp. 1364–1371). Berlin: Springer-Verlag.
Georgiou, G. M., & Koutsougeras, C. (1992). Complex domain backpropagation. IEEE Transactions on Circuits and Systems—II: Analog and Digital Signal Processing, 39(5), 330–334.
Haykin, S. (1994). Neural networks: A comprehensive foundation. Englewood Cliffs, NJ: Prentice Hall.
Haykin, S., & Li, L. (1995). Nonlinear adaptive prediction of nonstationary signals. IEEE Transactions on Signal Processing, 43(2), 526–535.
Hirose, A. (1992). Continuous complex-valued back-propagation learning. Electronics Letters, 28(20), 1854–1855.
Kechriotis, G., & Manolakos, E. S. (1994). Training fully recurrent neural networks with complex weights. IEEE Transactions on Circuits and Systems—II: Analog and Digital Signal Processing, 41(3), 235–238.
Kim, T., & Adali, T. (2000). Fully complex backpropagation for constant envelope signal processing. In Neural Networks for Signal Processing X: Proceedings of the 2000 IEEE Signal Processing Society Workshop (Vol. 1, pp. 231–240). Piscataway, NJ: IEEE.
Kim, T., & Adali, T. (2002). Universal approximation of fully complex feedforward neural networks. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP (Vol. 1, pp. 973–976). Piscataway, NJ: IEEE.
Kim, T., & Adali, T. (2003). Approximation by fully complex multilayer perceptrons. Neural Computation, 15(7), 1641–1666.
Leung, H., & Haykin, S. (1991). The complex backpropagation algorithm. IEEE Transactions on Signal Processing, 39(9), 2101–2104.
Mandic, D. P., & Chambers, J. A. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. New York: Wiley.
Medsker, L. R., & Jain, L. C. (2000). Recurrent neural networks: Design and applications. Boca Raton, FL: CRC Press.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4–27.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., & Dreyfus, G. (1993). Neural networks and non-linear adaptive filtering: Unifying concepts and new algorithms. Neural Computation, 5(2), 165–199.
Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (2000). Neural and adaptive systems: Fundamentals through simulations. New York: Wiley.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Tsoi, A. C., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5(2), 229–239.
Widrow, B., McCool, J., & Ball, M. (1975). The complex LMS algorithm. Proceedings of the IEEE, 63, 712–720.
Williams, R. J. (1992). Training recurrent networks using the extended Kalman filter. In Proceedings of the IJCNN'92 Baltimore (Vol. 4, pp. 241–246).
Williams, R. J., & Zipser, D. A. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
Received June 17, 2003; accepted May 5, 2004.
Index
Volume 16 By Author Abarbanel, H.D.I.—See Huerta, Ramon Aihara, Kazuyuki—See Masuda, Naoki Aires, F., Prigent, C., and Rossow, W.B. Neural Network Uncertainty Assessment Using Bayesian Statistics: A Remote Sensing Application (Letter)
16(11): 2415–2458
Akaho, Shotaro—See Tsuda, Koji Amari, Shun-ichi—See Ikeda, Shiro Amari, Shun-ichi—See Li, Yuanqing Amari, Shun-ichi—See Park, Hyeyoung Amigo, Jose M., Szczepanski, Janusz, Wajnryb, Elek, and Sanchez-Vives, Maria V. Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity (Letter)
16(4): 717–736
Amit, Daniel J.—See Curti, Emanuele Barbieri, Riccardo—See Eden, Uri T. Barbieri, Riccardo, Frank, Loren M., Nguyen, David P., Quirk, Michael C., Solo, Victor, Wilson, Matthew A., and Brown, Emery N. Dynamic Analyses of Information Encoding in Neural Ensembles (Letter)
16(2): 277–307
Basak, Jayanta Online Adaptive Decision Trees (Letter)
16(9): 1959–1981
Basak, Jayanta and Kothari, Ravi A Classification Paradigm for Distributed Vertically Partitioned Data (Letter)
16(7): 1525–1544
Bayerl, Pierre and Neumann, Heiko Disambiguating Visual Motion Through Contextual Feedback Modulation (Letter)
16(10): 2041–2066
Becker, Suzanna—See Byrne, Patrick Ben-Jacob, Eshel—See Persi, Erez Ben-Shahar, Ohad and Zucker, Steven Geometrical Computations Explain Projection Patterns of Long-Range Horizontal Connections in Visual Cortex (Article)
16(3): 445–476
Ben-Shaul, Y.—See Quiroga, R. Quian Bengio, Yoshua, Delalleau, Olivier, Le Roux, Nicolas, Paiement, Jean-Francois, Vincent, Pascal, and Ouimet, Marie Learning Eigenfunctions Links Spectral Embedding and Kernel PCA (Letter)
16(10): 2197–2219
Benucci, Andrea, Verschure, Paul F.M.J., and König, Peter Two-State Membrane Potential Fluctuations Driven by Weak Pairwise Correlations (Letter)
16(11): 2351–2378
Bertschinger, Nils and Natschlaeger, Thomas Real-Time Computation at the Edge of Chaos in Recurrent Neural Networks (Letter)
16(7): 1413–1436
Bialek, William—See Sharpee, Tatyana Bialek, William—See Still, Susanne Blanchard, Gilles Different Paradigms for Choosing Sequential Reweighting Algorithms (Letter) Bolanowski, Stanley J.—See Güçlü, Burak Bowyer, Kevin W.—See Liu, Xiaomei Braun, Mikio L.—See Lange, Tilman Brown, Emery N.—See Barbieri, Riccardo
16(4): 811–836
Brown, Emery N.—See Eden, Uri T. Brown, Eric, Moehlis, Jeff, and Holmes, Philip On the Phase Reduction and Response Dynamics of Neural Oscillator Populations (Letter)
16(4): 673–715
Buhmann, Joachim M.—See Lange, Tilman Burkitt, Anthony N., Meffin, Hamish, and Grayden, David B. Spike-Timing-Dependent Plasticity: The Relationship to Rate-Based Learning for Models with Weight Dynamics Determined by a Stable Fixed-Point (Letter)
16(5): 885–940
Byrne, Patrick and Becker, Suzanna Modeling Mental Navigation in Scenes with Multiple Objects (Letter)
16(9): 1851–1872
Caponnetto, Andrea—See Rosasco, Lorenzo Chahl, Javaan S.—See Franz, Matthias O. Chen, Yuzhi and Qian, Ning A Coarse-to-Fine Disparity Energy Model with Both Phase-Shift and Position-Shift Receptive Field Mechanisms (Letter)
16(8): 1545–1577
Chi, Zheru—See Huang, De-Shuang Chiu, Kai-Chun—See Liu, Zhi-Yong Chklovskii, Dmitri B. Exact Solution for the Optimal Neuronal Layout Problem (Letter) Chung, Kai-Min—See Kao, Wei-Chun Cichocki, Andrzej—See Li, Yuanqing Clark, James J.—See Li, Muhua Coop, Allan D.—See Reeke, George N. Cumming, Bruce G.—See Read, Jenny C.A.
16(10): 2067–2078
Curti, Emanuele, Mongillo, Gianluigi, La Camera, Giancarlo, and Amit, Daniel J. Mean Field and Capacity in Realistic Networks of Spiking Neurons Storing Sparsely Coded Random Memories (Letter)
16(12): 2597–2637
Delalleau, Olivier—See Bengio, Yoshua De Vito, Ernesto—See Rosasco, Lorenzo Domijan, Drazen Recurrent Network with Large Representational Capacity (Letter)
16(9): 1917–1942
Dorado, Julián—See Rabuñal, Juan R. Dreyfus, Gérard—See Oussar, Yacine Eden, Uri T., Frank, Loren M., Barbieri, Riccardo, Solo, Victor, and Brown, Emery N. Dynamic Analysis of Neural Encoding by Point Process Adaptive Filtering (Letter)
16(5): 971–998
Eguchi, Shinto—See Murata, Noboru Eguchi, Shinto—See Takenouchi, Takashi Erdogmus, Deniz, Hild II, Kenneth E., Rao, Yadunandana N., and Principe, Jose C. Minimax Mutual Information Approach for Independent Component Analysis (Letter)
16(6): 1235–1252
Frank, Loren M.—See Barbieri, Riccardo Frank, Loren M.—See Eden, Uri T. Franz, Matthias O., Chahl, Javaan S., and Krapp, Holger G. Insect-Inspired Estimation of Egomotion (Letter)
16(11): 2245–2260
Fusi, Stefano—See La Camera, Giancarlo Galán, Roberto Fdez., Sachse, Silke, Galizia, C. Giovanni, and Herz, Andreas V.M. Odor-Driven Attractor Dynamics in the Antennal Lobe Allow for Simple and Rapid Olfactory Pattern Classification (Letter)
16(5): 999–1012
Galicki, Miroslaw, Leistritz, Lutz, Zwick, Ernst Bernhard, and Witte, Herbert Improving Generalization Capabilities of Dynamic Neural Networks (Letter)
16(6): 1253–1282
Galizia, C. Giovanni—See Galán, Roberto Fdez. Garcia-Sanchez, Marta—See Huerta, Ramon George, John S.—See Kenyon, Garrett T. Goh, Su Lee and Mandic, Danilo P. A Complex-Valued RTRL Algorithm for Recurrent Neural Networks (Letter) Goldman, Mark S. Enhancement of Information Transmission Efficiency by Synaptic Failures (Letter)
16(12): 2699–2713
16(6): 1137–1162
Gómez-Ruiz, Jose Antonio—See López-Rubio, Ezequiel Goodhill, Geoffrey J., Gu, Ming, and Urbach, Jeffrey S. Predicting Axonal Response to Molecular Gradients with a Computational Model of Filopodial Dynamics (Letter)
16(11): 2221–2243
Grayden, David B.—See Burkitt, Anthony N. Gu, Ming—See Goodhill, Geoffrey J. Gu, Ming—See Mo, Chun-Hui Güçlü, Burak and Bolanowski, Stanley J. Probability of Stimulus Detection in a Model Population of Rapidly Adapting Fibers (Letter)
16(1): 39–58
Hall, Lawrence O.—See Liu, Xiaomei Hansen, Thorsten and Neumann, Heiko Neural Mechanisms for the Robust Representation of Junctions (Letter)
16(5): 1013–1037
Hardoon, David R., Szedmak, Sandor, and Shawe-Taylor, John Canonical Correlation Analysis: An Overview with Application to Learning Methods (Letter) Hayasaka, Taichi, Kitahara, Masashi, and Usui, Shiro On the Asymptotic Distribution of the Least-Squares Estimators in Unidentifiable Models (Letter)
16(12): 2639–2664
16(1): 99–114
Heikkola, Erkki—See Kärkkäinen, Tommi Herz, Andreas V. M.—See Galán, Roberto Fdez. Heskes, Tom On the Uniqueness of Loopy Belief Propagation Fixed Points (Letter)
16(11): 2379–2413
Hild II, Kenneth E.—See Erdogmus, Deniz Hilgetag, Claus C.—See Keinan, Alon Hof, Patrick R.—See Weaver, Christina M. Holmes, Philip—See Brown, Eric Hoshino, Osamu Neuronal Bases of Perceptual Learning Revealed by a Synaptic Balance Scheme (Letter)
16(3): 563–594
Horn, David—See Persi, Erez Huang, De-Shuang, Ip, Horace H.S., and Chi, Zheru A Neural Root Finder of Polynomials Based on Root Moments (Letter)
16(8): 1721–1762
Huerta, Ramon, Nowotny, Thomas, Garcia-Sanchez, Marta, Abarbanel, H.D.I., and Rabinovich, M.I. Learning Classification in the Olfactory System of Insects (Letter)
16(8): 1601–1640
Ikeda, Kazushi An Asymptotic Statistical Theory of Polynomial Kernel Methods (Letter)
16(8): 1705–1719
Ikeda, Shiro, Tanaka, Toshiyuki, and Amari, Shun-ichi Stochastic Reasoning, Free Energy, and Information Geometry (Letter)
16(9): 1779–1810
Imaoka, Hitoshi and Okajima, Kenji An Algorithm for the Detection of Faces on the Basis of Gabor Features and Information Maximization (Letter)
16(6): 1163–1191
Ip, Horace H.S.—See Huang, De-Shuang Jackson, B. Scott Including Long-Range Dependence in Integrate-and-Fire Models of the High Interspike-Interval Variability of Cortical Neurons (Letter)
16(10): 2125–2195
Jenison, Rick L. and Reale, Richard A. The Shape of Neural Dependence (Note)
16(4): 665–672
Jiang, Wenxin Boosting with Noisy Data: Some Views from Statistical Theory (Letter)
16(4): 789–810
Kanamori, Takafumi—See Murata, Noboru Kao, Wei-Chun, Chung, Kai-Min, Sun, Chia-Liang, and Lin, Chih-Jen Decomposition Methods for Linear Support Vector Machines (Letter) Kärkkäinen, Tommi and Heikkola, Erkki Robust Formulations for Training Multilayer Perceptrons (Letter)
16(8): 1689–1704
16(4): 837–862
Kawanabe, Motoaki—See Sugiyama, Masashi Kawanabe, Motoaki—See Tsuda, Koji Keinan, Alon, Sandbank, Ben, Hilgetag, Claus C., Meilijson, Isaac, and Ruppin, Eytan Fair Attribution of Functional Contribution in Artificial and Biological Networks (Letter)
16(9): 1887–1915
Kenyon, Garrett T., Theiler, James, George, John S., Travis, Bryan J., and Marshak, David W. Correlated Firing Improves Stimulus Discrimination in a Retinal Model (Letter)
16(11): 2261–2291
Kitahara, Masashi—See Hayasaka, Taichi Knüsel, Philipp, Wyss, Reto, König, Peter, and Verschure, Paul F.M.J. Decoding a Temporal Population Code (Letter)
16(10): 2079–2100
Koch, Christof—See Mo, Chun-Hui König, Peter—See Benucci, Andrea König, Peter—See Knüsel, Philipp Kothari, Ravi—See Basak, Jayanta Latham, Peter E. and Nirenberg, Sheila Computing and Stability in Cortical Networks (Letter)
16(7): 1385–1412
Krapp, Holger G.—See Franz, Matthias O. La Camera, Giancarlo—See Curti, Emanuele La Camera, Giancarlo, Rauch, Alexander, Lüscher, Hans-R., Senn, Walter, and Fusi, Stefano Minimal Models of Adapted Neuronal Response to In Vivo-Like Input Currents (Letter)
16(10): 2101–2124
Lange, Tilman, Roth, Volker, Braun, Mikio L., and Buhmann, Joachim M. Stability-Based Validation of Clustering Solutions (Letter)
16(6): 1299–1323
Lánsky, Petr, Rodriguez, Roger, and Sacerdote, Laura Mean Instantaneous Firing Frequency Is Always Higher Than the Firing Rate (Note) Lehky, Sidney R. Bayesian Estimation of Stimulus Responses in Poisson Spike Trains (Note) Leistritz, Lutz—See Galicki, Miroslaw
16(3): 477–489
16(7): 1325–1343
Le Roux, Nicolas—See Bengio, Yoshua Li, Muhua and Clark, James J. A Temporal Stability Approach to Position and Attention-Shift -Invariant Recognition (Letter) Li, Yuanqing, Cichocki, Andrzej, and Amari, Shun-ichi Analysis of Sparse Representation and Blind Source Separation (Letter)
16(11): 2293–2321
16(6): 1193–1234
Lin, Chih-Jen—See Kao, Wei-Chun Lin, Tzu-Chao and Yu, Pao-Ta Adaptive Two-Pass Median Filter Based on Support Vector Machines for Image Restoration (Letter)
16(2): 333–354
Lindquist, W. Brent—See Weaver, Christina M. Liu, Xiaomei, Hall, Lawrence O., and Bowyer, Kevin W. Comments on “A Parallel Mixture of SVMs for Very Large Scale Problems” (Note) Liu, Zhi-Yong, Chiu, Kai-Chun, and Xu, Lei One-Bit-Matching Conjecture for Independent Component Analysis (Letter) López-Rubio, Ezequiel, Ortiz-de-Lazcano-Lobato, Juan Miguel, Muñoz-Pérez, José, and Gómez-Ruiz, José Antonio Principal Components Analysis Competitive Learning (Letter)
16(7): 1345–1351
16(2): 383–399
16(11): 2459–2481
Lőrincz, András—See Szita, István Lücke, Jörg and von der Malsburg, Christoph Rapid Processing and Unsupervised Learning in a Model of the Cortical Macrocolumn (Letter) Lüscher, Hans-R.—See La Camera, Giancarlo Mandic, Danilo P.—See Goh, Su Lee
16(3): 501–533
Masuda, Naoki and Aihara, Kazuyuki Self-Organizing Dual Coding Based on Spike-Time-Dependent Plasticity (Letter)
16(3): 627–663
Marshak, David W.—See Kenyon, Garrett T. Meffin, Hamish—See Burkitt, Anthony N. Meilijson, Isaac—See Keinan, Alon Miyawaki, Yoichi and Okada, Masato A Network Model of Perceptual Suppression Induced by Transcranial Magnetic Stimulation (Letter) Mo, Chun-Hui, Gu, Ming, and Koch, Christof A Learning Rule for Local Synaptic Interactions Between Excitation and Shunting Inhibition (Letter)
16(2): 309–331
16(12): 2507–2532
Moehlis, Jeff—See Brown, Eric Monari, Gaétan—See Oussar, Yacine Mongillo, Gianluigi—See Curti, Emanuele Müller, Klaus-Robert—See Sugiyama, Masashi Müller, Klaus-Robert—See Tsuda, Koji Muñoz-Pérez—See López-Rubio, Ezequiel Murata, Noboru—See Park, Hyeyoung Murata, Noboru, Takenouchi, Takashi, Kanamori, Takafumi, and Eguchi, Shinto Information Geometry of U-Boost and Bregman Divergence (Letter)
16(7): 1437–1481
Nadasdy, Z.—See Quiroga, R. Quian Nara, Shigetoshi—See Suemitsu, Yoshikazu Natschlaeger, Thomas—See Bertschinger, Nils Navarro, Daniel J. A Note on the Applied Use of MDL Approximations (Note)
16(9): 1763–1768
Neumann, Heiko—See Bayerl, Pierre Neumann, Heiko—See Hansen, Thorsten Nguyen, David P.—See Barbieri, Riccardo Nirenberg, Sheila—See Latham, Peter E. Nitta, Tohru Orthogonality of Decision Boundaries in Complex-Valued Neural Networks (Letter)
16(1): 73–97
Nowotny, Thomas—See Huerta, Ramon Oja, Erkki and Plumbley, Mark Blind Separation of Positive Sources by Globally Convergent Gradient Search (Letter)
16(9): 1811–1825
Okada, Masato—See Miyawaki, Yoichi Okada, Masato—See Tatsuno, Masami Okajima, Kenji—See Imaoka, Hitoshi Omlin, C.W.—See Vahed, A. Ortiz-de-Lazcano-Lobato, Juan Miguel—See López-Rubio, Ezequiel Ouimet, Marie—See Bengio, Yoshua Oussar, Yacine, Monari, Gaétan, and Dreyfus, Gérard Reply to the Comments on “Local Overfitting Control via Leverages” in “Jacobian Conditioning Analysis for Model Validation” by I. Rivals and L. Personnaz (Letter)
16(2): 419–443
Paiement, Jean-Francois—See Bengio, Yoshua Paninski, Liam, Pillow, Jonathan W., and Simoncelli, Eero P. Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Encoding Model (Letter)
16(12): 2533–2561
Park, Hyeyoung, Murata, Noboru, and Amari, Shun-ichi Improving Generalization Performance of Natural Gradient Learning Using Optimized Regularization by NIC (Letter)
16(2): 355–382
Pazos, Alejandro—See Rabuñal, Juan R. Pereira, Javier—See Rabuñal, Juan R. Persi, Erez, Horn, David, Volman, Vladislav, Segev, Ronen, and Ben-Jacob, Eshel Modeling of Synchronized Bursting Events: The Importance of Inhomogeneity (Letter)
16(12): 2577–2595
Personnaz, Léon—See Rivals, Isabelle Piana, Michele—See Rosasco, Lorenzo Pillow, Jonathan W.—See Paninski, Liam Plumbley, Mark—See Oja, Erkki Porr, Bernd—See Saudargiene, Ausra Prigent, C.—See Aires, F. Principe, Jose C.—See Erdogmus, Deniz Qian, Ning—See Chen, Yuzhi Qian, Ning—See Tanaka, Hirokazu Queen, Nat—See Smithies, Rob Quirk, Michael C.—See Barbieri, Riccardo Quiroga, R. Quian, Nadasdy, Z., and Ben-Shaul, Y. Unsupervised Spike Detection and Sorting with Wavelets and Superparamagnetic Clustering (Letter)
16(8): 1661–1687
Rabinovich, M.I.—See Huerta, Ramon Rabuñal, Juan R., Dorado, Julián, Pazos, Alejandro, Pereira, Javier, and Rivero, Daniel A New Approach to the Extraction of ANN Rules and to Their Generalization Capacity Through GP (Letter)
16(7): 1483–1523
Rao, Rajesh P.N. Bayesian Computation in Recurrent Neural Circuits (Article)
16(1): 1–38
Rao, Yadunandana N.—See Erdogmus, Deniz Rauch, Alexander—See La Camera, Giancarlo Read, Jenny C.A. and Cumming, Bruce G. Understanding the Cortical Specialization for Horizontal Disparity (Letter)
16(10): 1983–2020
Reale, Richard A.—See Jenison, Rick L. Reeke, George N. and Coop, Allan D. Estimating the Temporal Interval Entropy of Neuronal Discharge (Letter)
16(5): 941–970
Reggia, James A.—See Schulz, Reiner Rivals, Isabelle and Personnaz, Léon Jacobian Conditioning Analysis for Model Validation (Letter)
16(2): 401–418
Rivero, Daniel—See Rabuñal, Juan R. Rodriguez, Roger—See Lánsky, Petr Rosasco, Lorenzo, De Vito, Ernesto, Caponnetto, Andrea, Piana, Michele, and Verri, Alessandro Are Loss Functions All the Same? (Letter) Rossow, W.B.—See Aires, F. Roth, Volker—See Lange, Tilman Ruppin, Eytan—See Keinan, Alon Rust, Nicole C.—See Sharpee, Tatyana Sacerdote, Laura—See Lánsky, Petr Salhi, Said—See Smithies, Rob Sanchez-Vives, Maria V.—See Amigo, José M. Sachse, Silke—See Galán, Roberto Fdez. Sandbank, Ben—See Keinan, Alon
16(5): 1063–1076
Sanger, Terence D. Failure of Motor Learning for Large Initial Errors (Letter)
16(9): 1873–1886
Saudargiene, Ausra, Porr, Bernd, and Wörgötter, Florentin How the Shape of Pre- and Postsynaptic Signals Can Influence STDP: A Biophysical Model (Letter)
16(3): 595–625
Schulz, Reiner and Reggia, James A. Temporally Asymmetric Learning Supports Sequence Processing in Multi-Winner Self-Organizing Maps (Letter)
16(3): 535–561
Segev, Ronen—See Persi, Erez Sejnowski, T.J.—See Tiesinga, P.H.E. Senn, Walter—See La Camera, Giancarlo Shamir, Maoz and Sompolinsky, Haim Nonlinear Population Codes (Letter)
16(6): 1105–1136
Shawe-Taylor, John—See Hardoon, David R. Sharpee, Tatyana, Rust, Nicole C., and Bialek, William Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions (Article)
16(2): 223–250
Shi, Bertram E.—See Tsang, Eric K.C. Simoncelli, Eero P.—See Paninski, Liam Smithies, Rob, Salhi, Said, and Queen, Nat Adaptive Hybrid Learning for Neural Networks (Letter) Solo, Victor—See Barbieri, Riccardo Solo, Victor—See Eden, Uri T. Sompolinsky, Haim—See Shamir, Maoz
16(1): 139–157
Still, Susanne and Bialek, William How Many Clusters? An Information-Theoretic Perspective (Letter)
16(12): 2483–2506
Suemitsu, Yoshikazu and Nara, Shigetoshi A Solution for Two-Dimensional Mazes with Use of Chaotic Dynamics in a Recurrent Neural Network Model (Letter)
16(9): 1943–1957
Sugiyama, Masashi, Kawanabe, Motoaki, and Müller, Klaus-Robert Trading Variance Reduction with Unbiasedness: The Regularized Subspace Information Criterion for Robust Model Selection in Kernel Regression (Letter)
16(5): 1077–1104
Sun, Chia-Liang—See Kao, Wei-Chun Szczepanski, Janusz—See Amigo, José M. Szedmak, Sandor—See Hardoon, David R. Szita, István and Lőrincz, András Kalman Filter Control Embedded into the Reinforcement Learning Framework (Note)
16(3): 491–499
Tai, Meihua—See Tanaka, Hirokazu Takenouchi, Takashi and Eguchi, Shinto Robustifying AdaBoost by Adding the Naive Error Rate (Letter)
16(4): 767–787
Takenouchi, Takashi—See Murata, Noboru Tanaka, Hirokazu, Tai, Meihua, and Qian, Ning Different Predictions by the Minimum Variance and Minimum Torque-Change Models on the Skewness of Movement Velocity Profiles (Letter)
16(10): 2021–2040
Tanaka, Toshiyuki—See Ikeda, Shiro Tatsuno, Masami and Okada, Masato Investigation on Possible Neural Architectures Underlying InformationGeometric Measures (Letter) Teh, Yee Whye—See Welling, Max
16(4): 737–765
Theiler, James—See Kenyon, Garrett T. Theis, Fabian J. A New Concept for Separability Problems in Blind Source Separation (Letter)
16(9): 1827–1850
Thies, Thorsten and Weber, Frank Optimal Reduced-Set Vectors for Support Vector Machines with a Quadratic Kernel (Note)
16(9): 1769–1777
Tiesinga, P.H.E. and Sejnowski, T.J. Rapid Temporal Modulation of Synchrony by Competition in Cortical Interneuron Networks (Letter)
16(2): 251–275
Titsias, Michalis K.—See Williams, Christopher K.I. Travis, Bryan J.—See Kenyon, Garrett T. Tsang, Eric K.C. and Shi, Bertram E. A Preference for Phase-Based Disparity in a Neuromorphic Implementation of the Binocular Energy Model (Letter) Tsuda, Koji, Akaho, Shotaro, Kawanabe, Motoaki, and Muller, ¨ Klaus-Robert Asymptotic Properties of the Fisher Kernel (Letter)
16(8): 1579–1600
16(1): 115–137
Urbach, Jeffrey S.—See Goodhill, Geoffrey J. Usui, Shiro—See Hayasaka, Taichi Vahed, A. and Omlin, C.W. A Machine Learning Method for Extracting Symbolic Knowledge from Recurrent Neural Networks (Letter) Ventura, Val´erie Testing for and Estimating Latency Effects for Poisson and Non-Poisson Spike Trains (Letter) Verri, Alessandro—See Rosasco, Lorenzo Verschure, Paul F.M.J.—See Benucci, Andrea
16(1): 59–71
16(11): 2323–2349
Verschure, Paul F.M.J.—See Knüsel, Philipp Vincent, Pascal—See Bengio, Yoshua Volman, Vladislav—See Persi, Erez von der Malsburg, Christoph—See Lücke, Jörg von der Malsburg, Christof—See Wundrich, Ingo J. Wagatsuma, Hiroaki and Yamaguchi, Yoko Cognitive Map Formation Through Sequence Encoding by Theta Phase Precession (Letter)
16(12): 2665–2697
Wajnryb, Elek—See Amigo, Jose M. Wearne, Susan L.—See Weaver, Christina M. Weaver, Christina M., Hof, Patrick R., Wearne, Susan L., and Lindquist, W. Brent Automated Algorithms for Multiscale Morphometry of Neuronal Dendrites (Letter) 16(7): 1353–1383 Weber, Frank—See Thies, Thorsten Welling, Max and Teh, Yee Whye Linear Response Algorithms for Approximate Inference in Graphical Models (Letter)
16(1): 197–221
Williams, Christopher K.I. and Titsias, Michalis K. Greedy Learning of Multiple Objects in Images using Robust Statistics and Factorial Learning (Letter)
16(5): 1039–1062
Wilson, Matthew A.—See Barbieri, Riccardo Witte, Herbert—See Galicki, Miroslaw Wörgötter, Florentin—See Saudargiene, Ausra Würtz, Rolf P.—See Wundrich, Ingo J. Wundrich, Ingo J., von der Malsburg, Christoph, and Würtz, Rolf P. Image Representation by Complex Cell Responses (Letter)
16(12): 2563–2575
Wyss, Reto—See Knüsel, Philipp Xia, Yousheng An Extended Projection Neural Network for Constrained Optimization (Letter)
16(4): 863–883
Xu, Lei—See Liu, Zhi-Yong Yamaguchi, Yoko—See Wagatsuma, Hiroaki Ye, Ji-Min, Zhu, Xiao-Long, and Zhang, Xian-Da Adaptive Blind Separation with Unknown Number of Sources (Letter)
16(8): 1641–1660
Yu, Pao-Ta—See Lin, Tzu-Chao Zhang, Jun Divergence Function, Duality, and Convex Analysis (Letter)
16(1): 159–195
Zhang, Xian-Da—See Ye, Ji-Min Zhao, Li—See Zheng, Wenming Zheng, Wenming, Zhao, Li, and Zou, Cairong A Modified Algorithm for Generalized Discriminant Analysis (Letter) Zhu, Xiao-Long—See Ye, Ji-Min Zou, Cairong—See Zheng, Wenming Zucker, Steven—See Ben-Shahar, Ohad Zwick, Ernst Bernhard—See Galicki, Miroslaw
16(6): 1283–1297