Neural coding: linear models 9.29 Lecture 1
1
What is computational neuroscience?
The term “computational neuroscience” has two different definitions:

1. using a computer to study the brain
2. studying the brain as a computer

In the first, the field is defined by a technique. In the second, it is defined by an idea. Let’s discuss these two definitions in more depth.

Why use a computer to study the brain? The most compelling reason is the torrential flow of data generated by neurophysiology experiments. Today it is common to simultaneously record the signals generated by tens of neurons in an awake behaving animal. Once the measurement is done, the neuroscientist must analyze the data to figure out what it means, and computers are necessary for this task. Computers are also used to simulate neural systems. This is important when the models are complex, so that their behaviors are not obvious from mere verbal reasoning.

On to the second definition. What does it mean to say that the brain is a computer? To grasp this idea we must think beyond our desktop computers with their glowing screens. The abacus is a computer, and so is a slide rule. What do these examples have in common? They are all dynamical systems, but they are of a special class. What’s special is that the state of a computer represents something else. The states of transistors in your computer’s display memory represent the words and pictures that are displayed on its screen. The locations of the beads on an abacus represent the money passing through a shopkeeper’s hands. And the activities of neurons in our brains represent the things that we sense and think about. In short,

computation = coding + dynamics

The two terms on the right hand side of this equation are the two great questions for computational neuroscience. How are computational variables encoded in neural activity? How do the dynamical behaviors of neural networks emerge from the properties of neurons?

The first half of this course will address the problem of encoding, or representation. The second half of the course will address the issue of brain dynamics, but only incompletely. The biophysics of single neurons will be discussed, but the collective behaviors of networks are left for 9.641 Introduction to Neural Networks.
2
Neural coding
As an introduction to the problem of neural coding, let me show you a video of a neurophysiology experiment. This video comes from the laboratory of David Hubel, who won the Nobel prize with his colleague Torsten Wiesel for their discoveries in the mammalian visual system.

In the video, you will see a visual stimulus, a flashed or moving bar of light projected onto a screen. This is the stimulus that is being presented to the cat. You will also hear the activity of a neuron recorded from the cat’s brain.

I should also describe what you will not see and hear. A cat has been anesthetized and placed in front of the screen, with its eyelids held open. The tip of a tungsten wire has been placed inside the skull, and lodged next to a neuron in a visual area of the brain. Although the cat is not conscious, neurons in this area are still responsive to visual stimuli. The tungsten wire is connected to an amplifier, so that the weak electrical signals from the neuron can be recorded. The amplified signal is also used to drive a loudspeaker, and that is the sound that you will hear.

As played on the loudspeaker, the response of the neuron consists of brief clicking sounds. These clicks are due to spikes in the waveform of the electrical signal from the neuron. The more technical term for spike is action potential. Almost without exception, such spikes are characteristic of neural activity in the vertebrate brain.

As you can see and hear, the frequency of spiking is dependent on the properties of the stimulus. The neuron is activated only when the bar is placed at a particular location in the visual field. Furthermore, it is most strongly activated when the bar is presented at a particular orientation.

Arriving at such a verbal model of neural coding is more difficult than it may seem from the video. David Hubel has recounted his feelings of frustration during his initial studies of the visual cortex. For a long time, he used spots of light as visual stimuli, because that had worked well in his previous studies of other visual areas of the brain. But spots of light evoked only feeble responses from cortical neurons. The spots of light were produced by a kind of slide projector. One day Hubel was wrapping up yet another unsuccessful experiment. As he pulled the slide out of the projector, he heard an eruption of spikes from the neuron. It was that observation that led to the discovery that cortical neurons were most sensitive to oriented stimuli like edges or bars.

The study of neural coding is not restricted to sensory processing. One can also investigate the neural coding of motor variables. In this video, you will see the movements of a goldfish eye, and hear the activity of a neuron involved in control of these movements. The oculomotor behavior consists of periods of static fixation, punctuated by rapid saccadic movements. The rate of action potential firing during the fixation periods is correlated with the horizontal position of the eye.

Finally, some neuroscientists study the encoding of computational variables that can’t be classified as either sensory or motor. This video shows a recording of a neuron in a rat as it moves about a circular arena. Neurons like this are sensitive to the direction of the rat’s head relative to the arena, and are thought to be important for the rat’s ability to navigate.

Verbal models are the first step towards understanding neural coding. But computational neuroscientists do not stop there. They strive for a deeper understanding by constructing mathematically precise, quantitative models of neural coding. In the next few lectures, you will learn how to construct such models. But first you have to become familiar with the format of data from neurophysiological experiments.
3
Neurophysiological data
For your first homework assignment, you will be given data from an experiment on the weakly electric fish Eigenmannia. The fish has a special organ that generates an oscillating electric field with a frequency of several hundred Hz. It also has an electrosensory organ, with which it is able to sense its electric field and the fields of other fish. The electric field is used for electrolocation and communication.

In the experiment, the fish was stimulated with an artificial electric field, and the activity of a neuron in the electrosensory organ was recorded. The artificial electric field was an amplitude-modulated sine wave, much like the natural electric field of the fish. The stimulus vector $s_i$ in the dataset contains the modulation signal sampled every 0.5 ms. The response vector $\rho_i$ contains the spike train of the neuron. Its components are either zero or one, indicating whether or not a spike occurred during each 0.5 ms time bin.

As you will see in the homework, the probability of spiking during a time bin depends linearly on the modulation signal. To visualize this dependence, one must first transform the binary vector $\rho_i$ into an analog firing probability $p_i$. This is done by some method of smoothing, as will be explained in a later lecture and in the assignment. If the pairs $(s_i, p_i)$ are plotted as points on a graph, a linear relationship can be seen. The slope and intercept of the line can be found by optimizing the approximation

$$p_i \approx a + b s_i$$

with respect to the parameters $a$ and $b$. So in this case, the neural coding problem can be addressed by simply fitting a straight line to data points. This is probably the most common way to fit experimental data in all of the sciences.

Before we describe the technique below, let’s pause to note that this is a very simple dataset. The stimulus is a scalar signal that varies with time. More generally, a vector might be required to describe the stimulus at a given time, as in the case of a dynamically varying image. The neural response might also be more complicated, if the experiment involved simultaneous recording of many neurons. But even in these more complex cases, it is sometimes possible to construct a linear model. When we do so later, we will see that some of the simple concepts introduced below can be generalized.
4
Fitting a straight line to data points
Suppose that we are given measurements $(x_i, y_i)$, where the index $i$ runs from 1 to $m$. In the context of the previous experiment, the measurements are $(s_i, p_i)$. We have simply switched notation to emphasize the generality of the problem. Our task is to find parameters $a$ and $b$ so that the approximation

$$y_i \approx a + b x_i \tag{1}$$
is as accurate as possible. Note that it is not generally possible to find $a$ and $b$ so that the error vanishes completely. There are two reasons for this. First, measurements are not exact, but suffer from experimental error. Second, while linear models are often used in computational neuroscience, the underlying behavior is not truly linear. The linear model is just an approximation. Note that this is unlike the case of physics, where the proportionality of force and acceleration ($F = ma$) is considered a true “law.”

While there are many ways of finding an optimal $a$ and $b$, the canonical one is the method of least squares. Its starting point is the squared error function

$$E = \frac{1}{2} \sum_{i=1}^{m} (a + b x_i - y_i)^2 \tag{2}$$
which quantifies the accuracy of the model in Eq. (1). If $E = 0$ the model is perfect. Minimizing $E$ with respect to $a$ and $b$ is a reasonable way of finding the best approximation. Since $E$ is quadratic in $a$ and $b$, its minimum can be found by setting the partial derivatives with respect to $a$ and $b$ equal to zero. Setting $\partial E/\partial a = 0$ yields

$$0 = ma + b \sum_i x_i - \sum_i y_i$$

while setting $\partial E/\partial b = 0$ produces

\begin{align}
0 &= \sum_i (a + b x_i - y_i)\, x_i \tag{3} \\
  &= a \sum_i x_i + b \sum_i x_i^2 - \sum_i y_i x_i \tag{4}
\end{align}
Rearranging slightly, we obtain two simultaneous linear equations in two unknowns,

\begin{align}
ma + b \sum_i x_i &= \sum_i y_i \tag{5} \\
a \sum_i x_i + b \sum_i x_i^2 &= \sum_i y_i x_i \tag{6}
\end{align}
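As a shorthand for the coefficients of these linear equations, it is helpful to define

\begin{align}
\langle x \rangle &= \frac{1}{m} \sum_{i=1}^{m} x_i, &
\langle y \rangle &= \frac{1}{m} \sum_{i=1}^{m} y_i \tag{7} \\
\langle x^2 \rangle &= \frac{1}{m} \sum_{i=1}^{m} x_i^2, &
\langle xy \rangle &= \frac{1}{m} \sum_{i=1}^{m} x_i y_i \tag{8}
\end{align}

The quantity $\langle x \rangle$ is known as the mean or first moment of $x$, while $\langle x^2 \rangle$ is known as the second moment. The quantity $\langle xy \rangle$ is called the correlation of $x$ and $y$. With this new notation, the equations for $a$ and $b$ take the compact form

\begin{align}
a + b \langle x \rangle &= \langle y \rangle \tag{9} \\
a \langle x \rangle + b \langle x^2 \rangle &= \langle xy \rangle \tag{10}
\end{align}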
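We can solve for $a$ in terms of $b$ via

$$a = \langle y \rangle - b \langle x \rangle \tag{11}$$

This can be used to eliminate $a$ completely, yielding

$$b = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\langle x^2 \rangle - \langle x \rangle^2} \tag{12}$$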
Backsubstituting this expression in Eq. (11) allows us to solve for $a$.

The numerator and denominator in Eq. (12) have special names. The denominator $\langle x^2 \rangle - \langle x \rangle^2$ is called the variance of $x$, because it measures how much $x$ fluctuates. Note that if all the $x_i$ are equal to a large constant $C$, the second moment $\langle x^2 \rangle = C^2$ is large also. In contrast, the variance vanishes completely. The meaning of the variance is also evident in the identity

$$\langle (\delta x)^2 \rangle = \langle x^2 \rangle - \langle x \rangle^2$$

which you should verify for yourself. This equation says that the variance is the second moment of $\delta x = x - \langle x \rangle$, which is the deviation of $x$ from its mean. The standard deviation is another term that you should learn. It is defined as the square root of the variance.

The numerator $\langle xy \rangle - \langle x \rangle \langle y \rangle$ in Eq. (12) is called the covariance of $x$ and $y$. It is equal to the correlation of the fluctuations $\delta x$ and $\delta y$,

$$\langle \delta x\, \delta y \rangle = \langle xy \rangle - \langle x \rangle \langle y \rangle$$

Again, I recommend that you verify this identity on your own.

In summary, we have a simple recipe for a linear fit. Compute the covariance $\mathrm{Cov}(x, y)$ of $x$ and $y$, and the variance $\mathrm{Var}(x)$ of $x$. The ratio of these two quantities gives the slope $b$ of the linear fit. Then compute $a$ by Eq. (11).

Substituting Eq. (11) in the linear approximation of Eq. (1) yields

$$y_i - \langle y \rangle \approx b\,(x_i - \langle x \rangle)$$

In other words, the constant $a$ is unnecessary if the linear fit is done to $\delta x$ and $\delta y$, rather than to $x$ and $y$. Given this fact, one approach is to compute the means $\langle x \rangle$ and $\langle y \rangle$ first, and subtract them from the data to get $\delta x$ and $\delta y$. Then apply the formula

$$b = \frac{\langle \delta x\, \delta y \rangle}{\langle (\delta x)^2 \rangle}$$

which is equivalent to Eq. (12). The trick of subtracting the mean comes up over and over again in linear modeling.

Some of you may already have encountered the correlation coefficient $r$, which is defined by

$$r = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\sqrt{\langle x^2 \rangle - \langle x \rangle^2}\, \sqrt{\langle y^2 \rangle - \langle y \rangle^2}}$$
You may have learned that $r$ close to $\pm 1$ means that the linear approximation is a good one. The correlation coefficient is similar to the covariance, except for the presence of the standard deviations of $x$ and $y$ in the denominator. The denominator normalizes the correlation coefficient, so that it must lie between $-1$ and $1$, unlike the covariance, which can take on any value in principle. If you know the Cauchy-Schwarz inequality, you can use it to prove that $-1 \le r \le 1$, but this is not so illuminating.

The correlation coefficient can be interpreted as measuring the reduction in variance that comes from taking a linear (first-order) model of the data, as opposed to a constant (zeroth-order) model. Recall that the squared error of Eq. (2) measures the variance of the deviation of the data points from the straight line. This variance vanishes only when the model is perfect.

For the best zeroth-order model, we constrain $b = 0$ in Eq. (2), so that $E$ is minimized when $a = \langle y \rangle$, taking a value proportional to the variance of $y$. For the best first-order model, $E$ is minimized with respect to both $a$ and $b$, so that its optimal value is further reduced. The ratio of the new $E$ to the old $E$ is $1 - r^2$. Another way of saying it is that $r^2$ is the fraction of the variance in $y$ that is explained by the linear term in the model.
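The recipe of this section can be sketched in a few lines of MATLAB. The data values below are made up for illustration (they are not the Eigenmannia measurements), and the variable names are our own. The sketch computes the slope and intercept from Eqs. (12) and (11), compares them with the built-in polyfit, and checks numerically that the ratio of squared errors equals $1 - r^2$.

```matlab
% Hypothetical data, chosen only for illustration
x = [0.1 0.4 0.5 0.9 1.3 1.6 2.0 2.2]';
y = [0.32 0.41 0.55 0.62 0.88 0.93 1.15 1.18]';

% Moments, as in Eqs. (7) and (8)
mx = mean(x);
my = mean(y);
covxy = mean(x .* y) - mx * my;   % covariance <xy> - <x><y>
varx  = mean(x .^ 2) - mx ^ 2;    % variance of x
vary  = mean(y .^ 2) - my ^ 2;    % variance of y

b = covxy / varx;                 % slope, Eq. (12)
a = my - b * mx;                  % intercept, Eq. (11)

% Correlation coefficient and the two squared errors
r  = covxy / sqrt(varx * vary);
E1 = 0.5 * sum((a + b * x - y) .^ 2);   % best first-order model
E0 = 0.5 * sum((my - y) .^ 2);          % best zeroth-order model (b = 0)
ratio = E1 / E0;                        % should agree with 1 - r^2

% Same fit with polyfit: c(1) is the slope, c(2) the intercept
c = polyfit(x, y, 1);
```

Subtracting the means first and then using b = mean(dx .* dy) / mean(dx .^ 2) should give the same slope.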
Convolution, correlation, and the Wiener-Hopf equations 9.29 Lecture 2
In this lecture, we’ll learn about two mathematical operations that are commonly used in signal processing, convolution and correlation. The convolution is used to linearly filter a signal, for example to smooth a spike train to estimate probability of firing. The correlation is used to characterize the statistical dependencies between two signals.

When analyzing neural data, the firing rate of a neuron is sometimes modeled as a linear filtering of the stimulus. Alternatively, the stimulus is modeled as a linear filtering of the spike train. To construct such a model, the optimal filter must be determined from the data. This problem was studied by the famous mathematician Norbert Wiener in the 1940s. It requires the solution of the famous Wiener-Hopf equations.
1
Convolution
Let’s consider two time series, $g_i$ and $h_i$, where the index $i$ runs from $-\infty$ to $\infty$. The convolution of these two time series is defined as

$$(g * h)_i = \sum_{j=-\infty}^{\infty} g_{i-j} h_j \tag{1}$$
This definition is applicable to time series of infinite length. If $g$ and $h$ are finite, they can be extended to infinite length by adding zeros at both ends. After this trick, called zero padding, the definition in Eq. (1) becomes applicable. For example, the sum in Eq. (1) becomes

$$(g * h)_i = \sum_{j=0}^{n-1} g_{i-j} h_j \tag{2}$$
for the finite time series $h_0, \ldots, h_{n-1}$. Another trick for turning a finite time series into an infinite one is to repeat it over and over. This is sometimes called periodic boundary conditions, and will be encountered later in our study of Fourier analysis.

The convolution operation, like ordinary scalar multiplication, is both commutative, $g * h = h * g$, and associative, $f * (g * h) = (f * g) * h$. Although $g$ and $h$ are treated symmetrically by the convolution, they generally have very different natures. Typically, one is a signal that goes on indefinitely in time. The other is concentrated near time zero, and is called a filter or convolution kernel. The output of the convolution is also a signal, a filtered version of the input signal.

In Eq. (2), we chose $h_i$ to be zero for all negative $i$. This is called a causal filter, because $(g * h)_i$ is affected by $g$ in the present and past, but not in the future. In some contexts, the causality constraint is not important, and one can take $h_{-M}, \ldots, h_M$ to be nonzero, for example.

Formulas are nice and compact, but now let’s draw some diagrams to see how this works. Let $m$ and $n$ be the dimensions of $g$ and $h$ respectively. For simplicity, assume zero-offset indexing, so that the first components of $g$ and $h$ are $g_0$ and $h_0$ (not $g_1$ and $h_1$ as in MATLAB). In each diagram, the entries in the same column are multiplied, and the products are summed. Then $(g * h)_0$ is given by summing $g_{-j} h_j$ over $j$, which can be visualized as

$$\begin{array}{cccccccccc}
\cdots & g_{m-1} & \cdots & g_1 & g_0 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & \cdots & 0 & h_0 & h_1 & \cdots & h_{n-1} & 0 & \cdots
\end{array}$$

Next, $(g * h)_1$ is found by summing $g_{1-j} h_j$ over $j$, which can be visualized as

$$\begin{array}{ccccccccc}
\cdots & g_{m-1} & \cdots & g_1 & g_0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & h_0 & h_1 & \cdots & h_{n-1} & 0 & \cdots
\end{array}$$

The rest of the components of $g * h$ are generated by sliding the $g$ vector to the right. The last nonzero component $(g * h)_{m+n-2}$ can be visualized as

$$\begin{array}{cccccccccc}
\cdots & 0 & 0 & \cdots & g_{m-1} & \cdots & g_1 & g_0 & 0 & \cdots \\
\cdots & h_0 & h_1 & \cdots & h_{n-1} & \cdots & 0 & 0 & 0 & \cdots
\end{array}$$
Therefore g ∗ h has m + n − 1 nonvanishing components, which is why the MATLAB function conv returns an m + n − 1 dimensional vector.
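As a quick check of the sliding picture, the following sketch compares MATLAB's conv with a direct evaluation of the zero-padded sum in Eq. (2). The two short vectors are arbitrary examples, and the loop variables follow the zero-offset indexing of the text, shifted by one when indexing MATLAB arrays.

```matlab
g = [1 2 3 4];        % example signal, m = 4
h = [1 -1 2];         % example kernel, n = 3
m = numel(g);
n = numel(h);

c = conv(g, h);       % built-in convolution, length m + n - 1

% Direct evaluation of (g*h)_i = sum_j g_{i-j} h_j with zero padding
c_direct = zeros(1, m + n - 1);
for i = 0:m+n-2
    for j = 0:n-1
        if i - j >= 0 && i - j <= m - 1       % g is zero outside 0..m-1
            c_direct(i + 1) = c_direct(i + 1) + g(i - j + 1) * h(j + 1);
        end
    end
end

c_swapped = conv(h, g);   % commutativity: same result with arguments swapped
```

For these inputs the result has 4 + 3 - 1 = 6 components, and all three vectors agree.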
2
Probability of firing
The spike train $\rho_i$ is a binary-valued time series. Since linear models are best suited for analog variables, it is helpful to replace $\rho_i$ with a probability $p_i$ of firing per time bin. Many methods for doing this can be expressed in the convolutional form

$$p_i = \sum_j \rho_{i-j} w_j$$

where $w$ satisfies the constraint $\sum_j w_j = 1$. According to this formula, $p_i$ is the weighted average of $\rho_i$ and its neighbors, so that $0 \le p_i \le 1$ provided the weights $w_j$ are nonnegative.

There are many different ways to choose $w$, depending on the particulars of the application. For example, $w$ could be chosen to be of length $n$, with nonzero values equal to $1/n$. This is sometimes called a “boxcar” filter. MATLAB comes with a lot of other filter shapes. Try typing help bartlett, and you’ll find more information about the Bartlett and other types of windows that are good for smoothing. Depending on the context, you might want a causal or a noncausal filter for estimating probability of firing.

Another option is to choose the kernel to be a decaying exponential,

$$w_j = \begin{cases} 0, & j < 0 \\ \gamma (1 - \gamma)^j, & j \ge 0 \end{cases}$$

This is causal, but has infinite duration. As an exercise, you could try proving that this is equivalent to

$$p_i = (1 - \gamma)\, p_{i-1} + \gamma \rho_i$$

The probability $p$ of firing in a time bin is closely related to the frequency $\nu$ of firing by $p = \nu \Delta t$, where $\Delta t$ is the sampling interval. Probabilistic models of neural activity will be treated more formally in a later lecture.
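The two smoothing schemes of this section can be sketched as follows. The spike train is a made-up deterministic pattern, and the window length and the value of $\gamma$ are arbitrary choices. The sketch uses the base MATLAB functions conv and filter, and checks that the exponential kernel and the recursion give the same result when the signal is assumed to be zero before its first sample.

```matlab
N = 200;
% Made-up binary spike train: a spike in every 7th and every 11th bin
rho = double(mod(1:N, 7) == 0 | mod(1:N, 11) == 0);

% Boxcar filter of length n (noncausal, centered via the 'same' option)
n = 9;
w_box = ones(1, n) / n;
p_box = conv(rho, w_box, 'same');

% Exponential filter, recursive form p_i = (1-gam) p_{i-1} + gam rho_i
gam = 0.1;                                   % plays the role of gamma in the text
p_rec = filter(gam, [1, -(1 - gam)], rho);

% Same filter as an explicit causal convolution with w_j = gam (1-gam)^j.
% Truncating the kernel at N terms is exact for a signal of length N.
w_exp = gam * (1 - gam) .^ (0:N-1);
p_conv = conv(rho, w_exp);
p_conv = p_conv(1:N);                        % keep the first N outputs
```

The recursive form is cheaper, since it needs one multiply-add per time bin regardless of how slowly the kernel decays.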
3
Correlation
The correlation of two time series is

$$\mathrm{Corr}[g, h]_j = \sum_{i=-\infty}^{\infty} g_i h_{i+j}$$

The case $j = 0$ corresponds to the correlation that was defined in the first lecture, apart from the normalization factor $1/m$. The difference here is that $g$ and $h$ are correlated at times separated by the lag $j$. [1] As with the convolution, this definition can be applied to finite time series by using zero padding. Note that $\mathrm{Corr}[g, h]_j = \mathrm{Corr}[h, g]_{-j}$, so that the correlation operation is not commutative. Typically, the correlation is applied to two signals, while its output is concentrated near lag zero.

If $g$ and $h$ are $n$-dimensional vectors, then the MATLAB command xcorr(g,h) returns a $2n - 1$ dimensional vector, corresponding to the lags $j = -(n-1)$ to $n - 1$. Lags beyond this range are not included, as the correlation vanishes. The zero lag case looks like

$$\begin{array}{cccccccc}
\cdots & 0 & g_1 & g_2 & \cdots & g_n & 0 & \cdots \\
\cdots & 0 & h_1 & h_2 & \cdots & h_n & 0 & \cdots
\end{array}$$

and the other lags correspond to sliding $h$ right or left. A maximum lag can also be given, xcorr(g,h,maxlag), which restricts the range of lags computed to -maxlag to maxlag. The default is the unnormalized correlation given above, but there are other options too.

The autocorrelation is a special case of the correlation, with $g = h$. If $g \ne h$, the correlation is sometimes called the crosscorrelation to distinguish it from the autocorrelation.

In the first lecture, we distinguished between correlation and covariance. The covariance was defined as the correlation with the means subtracted out. Similarly, the cross-covariance can be defined as the correlation left between two time series after subtracting out the means. The auto-covariance is a special case. The command xcov can be used for this purpose.

[1] Warning: This is the convention followed by Dayan and Abbott, and by MATLAB. Some other books, like Numerical Recipes, call the above sum $\mathrm{Corr}[h, g]_j$.
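The definition above can be evaluated with base MATLAB by convolving one series with the time-reversed other. This is a minimal sketch with arbitrary example vectors; it deliberately uses conv rather than xcorr, so it does not need the Signal Processing Toolbox. If you use xcorr instead, check its documentation for how the lags are ordered in your version.

```matlab
g = [1 2 3];
h = [4 5 6];
n = numel(g);
lags = -(n-1):(n-1);              % the 2n - 1 lags with nonzero overlap

% Corr[g,h]_j = sum_i g_i h_{i+j}, computed as h convolved with reversed g
Cgh = conv(h, fliplr(g));
% Corr[h,g]_j, with the roles of g and h exchanged
Chg = conv(g, fliplr(h));

% Direct evaluation of the sum at each lag, as a cross-check
Cgh_direct = zeros(size(lags));
for t = 1:numel(lags)
    j = lags(t);
    for i = 1:n
        if i + j >= 1 && i + j <= n     % h is zero outside 1..n
            Cgh_direct(t) = Cgh_direct(t) + g(i) * h(i + j);
        end
    end
end
```

Here Chg is Cgh read backwards, which is the identity $\mathrm{Corr}[g, h]_j = \mathrm{Corr}[h, g]_{-j}$, and the zero-lag entry equals sum(g .* h).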
4
Spike-triggered average
Demonstration of these ideas:
• Convolve spike train ρ with filter to find firing rate
• Autocorrelation of stimulus
• Autocorrelation of spike train
• Cross-correlation of spike train and stimulus
5
The Wiener-Hopf equations
Suppose that we’d like to model the time series $y_i$ as a filtered version of $x_i$, i.e. find the $h$ that optimizes the approximation

$$y_i \approx \sum_j h_j x_{i-j}$$

We assume that both $x$ and $y$ have had their means subtracted out, so that no additive constant is needed in the model. Also, $h_j$ is assumed to be zero for $j < M_1$ or $j > M_2$. This constrains how far forward or backward in time the kernel extends. For example, $M_1 = 0$ corresponds to the case of a causal filter. The best approximation in the least squares sense is obtained by minimizing the squared error

$$E = \frac{1}{2} \sum_i \left( y_i - \sum_{j=M_1}^{M_2} h_j x_{i-j} \right)^2$$
relative to $h_j$ for $j = M_1$ to $M_2$. This is analogous to the squared error function for linear regression, which we saw in the first lecture. The minimum is given by the equations $\partial E/\partial h_k = 0$, for $k = M_1$ to $M_2$. These are the famous Wiener-Hopf equations,

$$C^{xy}_k = \sum_{j=M_1}^{M_2} h_j C^{xx}_{k-j}, \qquad k = M_1, \ldots, M_2 \tag{3}$$

where the shorthand notation

$$C^{xy}_k = \sum_i x_i y_{i+k}, \qquad C^{xx}_l = \sum_i x_i x_{i+l}$$

has been used for the cross-covariance and auto-covariance. You’ll be asked to prove this in the homework. This is a set of $M_2 - M_1 + 1$ linear equations in $M_2 - M_1 + 1$ unknowns, so it typically has a unique solution. For our purposes, it will be sufficient to solve them using the backslash (\) and toeplitz commands in MATLAB. If you’re worried about minimizing computation time, there are more efficient methods, like Levinson-Durbin recursion.

Recall that in simple linear regression, the slope of the optimal line times the variance of $x$ is equal to the covariance of $x$ and $y$. This is a special case of the Wiener-Hopf equations. In particular, linear regression corresponds to the case $M_1 = M_2 = 0$, for which

$$h_0 = C^{xy}_0 / C^{xx}_0$$
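Here is a minimal sketch of solving Eq. (3) with toeplitz and backslash for a causal filter ($M_1 = 0$). Everything in it is an illustrative assumption: the input is synthetic noise, the "true" kernel is made up, and the output is generated by filtering the input with that kernel, so the solution should recover it. The correlations are computed with zero padding, and the helper corr_at is our own name, not a MATLAB function.

```matlab
N = 500;                          % number of input samples (arbitrary)
x = randn(N, 1);                  % synthetic input, mean approximately zero
h_true = [0.5; 0.3; -0.2; 0.1];   % hypothetical causal kernel, M1 = 0, M2 = 3
n = numel(h_true);
y = conv(x, h_true);              % zero-padded output, length N + n - 1

% Pad x with zeros so that x and y have the same length
xp = [x; zeros(n - 1, 1)];

% corr_at(u, v, k) = sum_i u_i v_{i+k} for k >= 0, with zero padding
corr_at = @(u, v, k) sum(u(1:end-k) .* v(1+k:end));

Cxy = zeros(n, 1);
Cxx = zeros(n, 1);
for k = 0:n-1
    Cxy(k + 1) = corr_at(xp, y, k);    % cross-correlation at lag k
    Cxx(k + 1) = corr_at(xp, xp, k);   % autocorrelation at lag k
end

% The matrix with entries Cxx_{k-j} is symmetric Toeplitz, since Cxx_l = Cxx_{-l}
h_est = toeplitz(Cxx) \ Cxy;
```

Because the synthetic output contains no noise, h_est should match h_true up to roundoff. With real data the means would be subtracted first, as assumed in the text.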
6
White noise analysis
If the input $x$ is Gaussian white noise, then the solution of the Wiener-Hopf equation is trivial, because $C^{xx}_{k-j} = C^{xx}_0 \delta_{kj}$. Therefore

$$h_k = \frac{C^{xy}_k}{C^{xx}_0} \tag{4}$$

So a simple way to model a linear system is to stimulate it with white noise, and correlate the input with the output. This method is called reverse correlation or white noise analysis.

If the input $x$ is not white noise, then you must actually do some work to solve the Wiener-Hopf equations. But if the input $x$ is close to being white noise, you might get away with being lazy. Just choose the filter to be proportional to the $xy$ crosscorrelation, $h_k = C^{xy}_k / \gamma$, as in the formula (4). The optimal choice of the normalization factor $\gamma$ is

$$\gamma = \frac{\sum_{jl} C^{xy}_j C^{xx}_{j-l} C^{xy}_l}{\sum_m C^{xy}_m C^{xy}_m}$$

where the summations run from $M_1$ to $M_2$. Note this reduces to $\gamma = C^{xx}_0$ in the case of white noise, as in Eq. (4).
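The shortcut of Eq. (4) can be tried on synthetic data. In this sketch the input is a finite sample of Gaussian white noise, so its autocorrelation is only approximately a delta function and the recovered kernel is approximate. The kernel values, the sample size, and the helper name corr_at are assumptions made for illustration.

```matlab
N = 20000;                        % a long record keeps sampling error small
x = randn(N, 1);                  % finite sample of Gaussian white noise
h_true = [0.5; 0.3; -0.2; 0.1];   % hypothetical causal kernel
n = numel(h_true);
y = conv(x, h_true);              % noiseless output of the linear system
xp = [x; zeros(n - 1, 1)];        % zero-pad x to the length of y

% corr_at(u, v, k) = sum_i u_i v_{i+k} for k >= 0, with zero padding
corr_at = @(u, v, k) sum(u(1:end-k) .* v(1+k:end));
Cxy = zeros(n, 1);
Cxx = zeros(n, 1);
for k = 0:n-1
    Cxy(k + 1) = corr_at(xp, y, k);
    Cxx(k + 1) = corr_at(xp, xp, k);
end

% Reverse correlation, Eq. (4): divide the cross-correlation by Cxx at lag 0
h_lazy = Cxy / Cxx(1);

% Optimal normalization factor for a filter proportional to Cxy
gam = (Cxy' * toeplitz(Cxx) * Cxy) / (Cxy' * Cxy);

% Full Wiener-Hopf solution, for comparison
h_exact = toeplitz(Cxx) \ Cxy;
```

For white noise, gam should be close to Cxx(1), and h_lazy should be close to h_exact, with differences that shrink as N grows.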
Basic Linear Algebra in MATLAB 9.29 Optional Lecture 2
In the last optional lecture we learned that the basic type in MATLAB is a matrix of double precision floating point numbers. You learned a number of different tools for initializing matrices and some basic functions that used them. This time, we’ll make sure that we understand the basic algebraic operations that can be performed on matrices, and how we can use them to solve a set of linear equations.

A Note on Notation
The convention used in this lecture and in most linear algebra books is that an italic lower case letter (k) denotes a scalar, a bold lower case letter (x) denotes a vector, and a capital letter (A) denotes a matrix. Typically we name our MATLAB variables with a capital letter if they will be used as matrices, and lower case for scalars and vectors.
1 Vector Algebra
Remember that in MATLAB, a vector is simply a matrix with the size of one dimension equal to 1. We should distinguish between a row vector (a 1xn matrix) and a column vector (an nx1 matrix). Recall that we change a row vector x into a column vector using the transpose operator (x' in MATLAB). The same trick works for changing a column vector into a row vector.

We can add two vectors, x and y, together if they have the same dimensions. The resulting vector z = x + y is simply an element by element addition of the components of x and y: $z_i = x_i + y_i$. From this it follows that vector addition is both commutative and associative, just like regular addition. MATLAB also allows you to add a scalar k (a 1x1 matrix) to a vector. The result of z = x + k is the element by element addition $z_i = k + x_i$.

Vector multiplication can take a few different forms. First of all, if we multiply a scalar k times a vector x, the result is a vector with the same dimension as x: z = kx implies $z_i = k x_i$. There are two standard ways to multiply two vectors together: the inner product and the outer product.

The inner product, sometimes called the dot product, is the result of multiplying a row vector times a column vector. The result is a scalar $z = xy = \sum_i x_i y_i$. To take the inner product of two column vectors, use z = x'y. As we’ll see, the orientation of the vectors matters because MATLAB treats vectors as matrices.
Unlike the inner product, the result of the outer product of two vectors is a matrix. In MATLAB, you get the outer product by multiplying a column vector times a row vector: Z = xy. The components of Z are $Z_{ij} = x_i y_j$. To take the outer product of two column vectors, use Z = xy'.

Occasionally, what we really want to do is to multiply two vectors together element by element: $z_i = x_i y_i$. MATLAB provides the .* command for this operation: z = x.*y.

To test our understanding, let’s try some basic MATLAB commands:
x = 1:5        % row vector [1 2 3 4 5]
y = 6:10       % row vector [6 7 8 9 10]
x+y            % element by element sum
x+5            % scalar added to every element
5*x            % scalar multiplication
x*y'           % inner product of two row vectors (a scalar)
x'*y           % outer product of two row vectors (a 5x5 matrix)
x.*y           % element by element product
How would you initialize the following matrix in MATLAB using outer products?

$$\begin{pmatrix}
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$
2 Matrix Algebra
The matrix operations are simply generalizations of the vector operations when the matrix has multiple rows and columns. You can think of a matrix as a set of row vectors or as a set of column vectors.

Matrix addition works element by element, just like vector addition. It is defined for any two matrices of the same size. C = A + B implies that $C_{ij} = A_{ij} + B_{ij}$. Once again, it is both commutative and associative. Scalar multiplication of matrices is also defined as it was with vectors. The result is a matrix: C = kA implies $C_{ij} = k A_{ij}$.

If you multiply a matrix times a column vector, then the result is another column vector, the column of inner products: b = Ax implies $b_i = \sum_j A_{ij} x_j$. Similarly, you can multiply a row vector times a matrix to get a row of inner products: b = xA implies $b_i = \sum_j A_{ji} x_j$. Notice that in both cases, the definitions require that the first variable must have the same number of columns as the second variable has rows.

This idea generalizes to multiplying two matrices together. For the multiplication C = AB, the matrix C is simply a collection of inner products: $C_{ik} = \sum_j A_{ij} B_{jk}$. In this case, A must have the same number of columns as B has rows. Like ordinary multiplication, matrix multiplication is associative and distributive, but unlike ordinary multiplication, it is not commutative. In general, $AB \ne BA$.

Now we are in a position to better understand the matrix transpose. If B = A', then $B_{ij} = A_{ji}$. Think of this as flipping the matrix along the diagonal. This explains why the transpose operator changes a row vector into a column vector and vice versa. The following identity holds for the definitions of multiplication and transpose: (AB)' = B'A'. This helps us to understand the difference between x'A and Ax. Notice that for column vector x, (Ax)' = x'A'.

There are a few more matrix terms we should know. A square matrix is an nxn matrix (it has the same number of rows and columns). A diagonal matrix A has non-zero elements only along the diagonal ($A_{ii}$), and zeros everywhere else. You can initialize a diagonal matrix in MATLAB by passing a vector to the diag command. The identity matrix is a special diagonal matrix with all diagonal elements set to 1. You can initialize an identity matrix using the eye command. Try the following MATLAB commands:
diag(1:5)
3 Solving Linear Equations
Let’s take a step back for a moment, and try to solve the following set of linear equations:

$$\begin{aligned}
x_1 + 3x_2 &= 4 \\
2x_1 + 2x_2 &= 9
\end{aligned}$$

With a little manipulation, we find that $x_1 = 4.75$ and $x_2 = -0.25$. We could solve this set of equations because we had 2 equations and 2 unknowns. How should we solve a set of equations with 50 equations and 50 unknowns? Let’s rewrite the previous expression in matrix form:

$$\begin{pmatrix} 1 & 3 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} 4 \\ 9 \end{pmatrix}$$

Notice that we could use the same form, Ax = b, for our set of 50 equations with 50 unknowns. As expected, MATLAB provides all of the tools that we need to solve this matrix formula, and it uses the idea of a matrix inverse. The inverse of a square matrix A, which is $A^{-1}$ in the textbooks and inv(A) in MATLAB, has the property that $A^{-1}A = AA^{-1} = I$. Using this, let’s manipulate our previous equation:

$$\begin{aligned}
Ax &= b \\
A^{-1}Ax &= A^{-1}b \\
x &= A^{-1}b
\end{aligned}$$
Now solve the original equations in MATLAB using inv(A)*b. You should get the vector containing 4.75 and -0.25.

There are a few things to remember about matrix inverses. First of all, they are only defined for square matrices. The inverse combines with the transpose and multiplication operations through the following identities: $(A')^{-1} = (A^{-1})'$ and $(AB)^{-1} = B^{-1}A^{-1}$. You should be able to verify these properties on your own using the ideas we’ve developed. But the most important thing to know about matrix inverses is that they don’t always exist, even for square matrices. Using MATLAB, try taking the inverse of the following matrix:

$$A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$$

Now try inserting this A into the system of equations at the beginning of this section, and solving it using good old fashioned algebra. Why doesn’t the inverse exist?
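The two small systems of this section can be entered directly. This sketch solves the first with both inv and the backslash operator, and then inspects the second matrix with det and rank instead of inverting it. The variable names are our own.

```matlab
% The 2x2 system from the start of this section
A = [1 3; 2 2];
b = [4; 9];
x_inv = inv(A) * b;     % solution via the explicit inverse
x_bs  = A \ b;          % solution via backslash (usually preferred numerically)

% The matrix whose inverse we were asked to try
S = [1 2; 2 4];
d = det(S);             % determinant of S
k = rank(S);            % number of linearly independent rows (or columns)
```

Both solutions should come out as 4.75 and -0.25. For S, the determinant is zero and the rank is 1, which is a hint for the question above.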
4 Quadratic Optimization
We would like to solve the equation Ax = b even if A is not square. Let’s separate the problem into a few cases, where the matrix A is an mxn matrix:

If m < n, then we have more unknowns than equations. In general, this system will have infinitely many solutions.

If m > n, then we have more equations than unknowns. In general, this system doesn’t have any solution.

What if we don’t want MATLAB to always return "no solution", but we actually want the closest solution in the least squares sense? This is equivalent to minimizing the following quantity:

$$E = \frac{1}{2} \sum_i \left( \sum_j A_{ij} x_j - b_i \right)^2$$
MATLAB provides the backslash operator to accomplish the least squares fit for a matrix equation: x = A \ b. Type help slash to appreciate the power of this command. You’ll see that we could have used this command to solve the square matrix equations, too.
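As a small illustration of the overdetermined case, the sketch below applies backslash to a made-up system with 4 equations and 2 unknowns. It then checks two standard properties of a least squares solution: the residual is orthogonal to the columns of A, and the solution satisfies the normal equations A'Ax = A'b. The numbers are arbitrary.

```matlab
% 4 equations, 2 unknowns: no exact solution in general
A = [1 0; 1 1; 1 2; 1 3];
b = [1; 2; 2; 4];

x = A \ b;                      % least squares solution
res = A * x - b;                % residual vector
E = 0.5 * sum(res .^ 2);        % the quantity being minimized

% Cross-check against the normal equations A'*A*x = A'*b
x_normal = (A' * A) \ (A' * b);
```

For well-conditioned problems the two solutions agree. Backslash is generally the safer choice, since it does not form A'*A explicitly.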
5
Eigenmannia
How does this relate to the fish data that we used in problem set 1? Recall that we were given a set of points $(x_i, y_i)$, and we were asked to find the coefficients a and b to fit the following linear model:

$$y_i \approx a + b x_i$$

You can think of each point as an equation, and write the entire data set in matrix form:

$$\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$

If you call the left matrix A and the right-hand side b (the vector of y values, not the slope coefficient), then calling A \ b will ask MATLAB to solve for the values of a and b that minimize the squared error of the model. It will return the same values for a and b that the polyfit command returns.

For this simple example, the polyfit command and the backslash command accomplished the same task. But what if you were given a set of points $(x_i, y_i, z_i)$ and you were asked to fit the following linear model:

$$z_i \approx a + b x_i + c y_i$$

The matrix notation easily scales to this problem, but the polyfit function does not.
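To make the three-coefficient case concrete, here is a minimal plain-Python sketch (not MATLAB) that builds the design matrix with rows (1, x_i, y_i) and minimizes the squared error by solving the normal equations $A^T A x = A^T b$. The function names solve_linear and least_squares and the data values are made up for this illustration, and solving the normal equations directly is an assumption of the sketch, not a description of what the backslash operator does internally.

```python
def solve_linear(M, v):
    """Solve the square system M x = v by Gaussian elimination with
    partial pivoting. Assumes M is nonsingular."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]   # augmented matrix [M | v]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def least_squares(A, b):
    """Minimize sum_i (sum_j A[i][j]*x[j] - b[i])**2 via A^T A x = A^T b."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[i][p] * A[i][q] for i in range(m)) for q in range(n)]
           for p in range(n)]
    Atb = [sum(A[i][p] * b[i] for i in range(m)) for p in range(n)]
    return solve_linear(AtA, Atb)

# Made-up data generated exactly from z = 1 + 2x - 3y (no noise).
xs = [0.0, 1.0, 2.0, 0.0, 1.0, 3.0]
ys = [0.0, 0.0, 1.0, 2.0, 3.0, 1.0]
zs = [1.0 + 2.0 * x - 3.0 * y for x, y in zip(xs, ys)]

A = [[1.0, x, y] for x, y in zip(xs, ys)]   # one row (1, x_i, y_i) per point
a, b, c = least_squares(A, zs)              # recovers roughly (1, 2, -3)
```

Adding another regressor only means adding another column to A, which is the sense in which the matrix notation scales.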
More about convolution and correlation 9.29 Lecture 3
1
Some odds and ends
Consider a spike train $\rho_1, \ldots, \rho_N$. One estimate of the probability of firing is

$$p = \frac{1}{N} \sum_i \rho_i \qquad (1)$$
This estimate is satisfactory, as long as it makes sense to describe the whole spike train by a single probability that does not vary with time. This is an assumption of statistical stationarity. More commonly, it’s a better model to assume that the probability varies slowly with time (is nonstationary). Then it’s better to apply something like Eq. (1) to small segments of the spike train, rather than to the whole spike train. For example, the formula

$$p_i = (\rho_{i+1} + \rho_i + \rho_{i-1})/3 \qquad (2)$$

estimates the probability at time i by counting the number of spikes in three time bins, and then dividing by three. In the first problem set, you were instructed to smooth the spike train like this, but to use a much wider window.

In general, choosing the size of the window involves a tradeoff. A larger window reduces the effects of statistical sampling error (like flipping a coin many times to more accurately determine its probability of coming up heads). But a larger window also reduces the ability to follow more rapid changes in the probability as a function of time.

Note that formula (2) isn’t to be trusted near the edges of the signal, as the filter operates on the zeros that surround the signal.

In the last lecture, we defined the unnormalized correlation. There is also a normalized version that looks like

$$Q^{xy}_j = \frac{1}{m} \sum_{i=1}^{m} x_i y_{i+j}$$

To compensate for boundary effects, the form

$$Q^{xy}_j = \frac{1}{m - |j|} \sum_{i=1}^{m} x_i y_{i+j}$$

is sometimes preferred. Both forms can be obtained through the appropriate options to the xcorr command. A signal is called white noise if its autocorrelation vanishes, except at lag zero.
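The smoothing formula (2) and the two normalizations of the correlation can be sketched in a few lines of plain Python (not MATLAB). The function names boxcar_smooth and xcorr_lag, the half_width parameter, and the spike train are made up for this illustration; samples outside the recording are treated as zeros, and both signals are assumed to have the same length m.

```python
def boxcar_smooth(rho, half_width):
    """Estimate the firing probability in each bin by averaging rho over a
    window of 2*half_width + 1 bins. Bins outside the recording count as
    zeros, so values near the edges are biased low."""
    n = len(rho)
    width = 2 * half_width + 1
    p = []
    for i in range(n):
        total = sum(rho[k] for k in range(i - half_width, i + half_width + 1)
                    if 0 <= k < n)
        p.append(total / width)
    return p

def xcorr_lag(x, y, j, unbiased=False):
    """Correlation of x and y at lag j: sum_i x[i] * y[i + j], divided by m
    (default) or by m - |j| (unbiased=True). Out-of-range terms count as zero."""
    m = len(x)
    total = sum(x[i] * y[i + j] for i in range(m) if 0 <= i + j < m)
    return total / (m - abs(j) if unbiased else m)

rho = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]      # made-up spike train, 10 bins
p = boxcar_smooth(rho, half_width=1)      # three-bin window, as in Eq. (2)

q0 = xcorr_lag(rho, rho, 0)               # 4 spikes / 10 bins
q1_biased = xcorr_lag(rho, rho, 1)        # divides by m
q1_unbiased = xcorr_lag(rho, rho, 1, unbiased=True)   # divides by m - 1
```

Increasing half_width gives a smoother but more sluggish estimate, which is the tradeoff described above.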
2
Using the conv function
We learned last time that if $g_0, g_1, \ldots, g_{M-1}$ and $h_0, h_1, \ldots, h_{N-1}$ are given as arguments to the conv function, then the output is $f_0, f_1, \ldots, f_{M+N-2}$, where we denote f = g ∗ h. Let’s generalize this: if $g_{M_1}, \ldots, g_{M_2}$ and $h_{N_1}, \ldots, h_{N_2}$ are given as arguments to the conv function, then the output is $f_{M_1+N_1}, \ldots, f_{M_2+N_2}$.

For example, suppose that g is a signal, and h represents an acausal filter, with $N_1 < 0$ and $N_2 > 0$. Throwing out the first $|N_1|$ and last $N_2$ elements of f leaves us with $f_{M_1}, \ldots, f_{M_2}$, which are at the same times as the signal g.

Note that this prescription for discarding the elements is intended for time-aligning the result of the convolution with the input signal, and for producing a result that is the same length. A different motivation for discarding elements at the beginning and end is that they may be corrupted by edge effects. If you are really worried about this, you may have to discard more than was prescribed above.
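The trimming prescription can be illustrated with a minimal plain-Python sketch (not MATLAB). The function conv below is a hand-written full convolution meant to mimic the output length of MATLAB’s conv; conv_aligned, its parameters n1 and n2, and the example signal are made up for this illustration.

```python
def conv(g, h):
    """Full discrete convolution f[i] = sum_k g[i - k] * h[k].
    The output has len(g) + len(h) - 1 elements."""
    f = [0.0] * (len(g) + len(h) - 1)
    for i, gi in enumerate(g):
        for k, hk in enumerate(h):
            f[i + k] += gi * hk
    return f

def conv_aligned(g, h, n1, n2):
    """Convolve signal g with an acausal filter h whose taps are indexed
    n1..n2 (n1 < 0 < n2), then drop the first |n1| and the last n2 outputs
    so that the result lines up sample for sample with g."""
    if len(h) != n2 - n1 + 1:
        raise ValueError("filter length must equal n2 - n1 + 1")
    f = conv(g, h)
    return f[-n1:len(f) - n2]

g = [0.0, 0.0, 3.0, 0.0, 0.0, 6.0, 0.0]   # made-up signal
h = [1 / 3, 1 / 3, 1 / 3]                 # taps h_{-1}, h_0, h_1 (centered average)
y = conv_aligned(g, h, n1=-1, n2=1)       # same length as g, centered on each sample
```

Here y[i] is the average of g[i-1], g[i] and g[i+1], with zeros assumed beyond the ends, so the first and last entries are the ones affected by edge effects.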
3
Impulse response
Consider the signal consisting of a single impulse at time zero,

$$\delta_j = \begin{cases} 1, & j = 0 \\ 0, & j \neq 0 \end{cases}$$

The convolution of this signal with a filter h is

$$(\delta * h)_j = \sum_k \delta_{j-k} h_k = h_j$$

which is just the filter h again. In other words, h is the response of the filter to an impulse, or the impulse response function. If the impulse is displaced from time 0 to time i, then the result of the convolution is the filter h, displaced by i time steps.

A spike train is just a superposition of impulses at different times. Therefore, convolving a spike train with a filter gives a superposition of filters at different times.

The “Kronecker delta” notation $\delta_{ij}$ is equivalent to $\delta_{i-j}$.
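A small plain-Python sketch (not MATLAB) makes the superposition statement concrete. The function conv is a hand-written full convolution, and the filter and spike train values are made up for this illustration.

```python
def conv(g, h):
    """Full discrete convolution f[i] = sum_k g[i - k] * h[k]."""
    f = [0.0] * (len(g) + len(h) - 1)
    for i, gi in enumerate(g):
        for k, hk in enumerate(h):
            f[i + k] += gi * hk
    return f

h = [1.0, 0.5, 0.25]          # made-up causal filter
delta = [1, 0, 0, 0, 0]       # impulse at time 0
shifted = [0, 0, 1, 0, 0]     # impulse at time 2
spikes = [1, 0, 1, 0, 0]      # spike train: impulses at times 0 and 2

r_delta = conv(delta, h)      # h itself, followed by zeros
r_shift = conv(shifted, h)    # h displaced by 2 time steps
r_spikes = conv(spikes, h)    # sum of the two responses above

superposition = [a + b for a, b in zip(r_delta, r_shift)]
```

Because convolution is linear, r_spikes equals the element-wise sum of r_delta and r_shift.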
4
Matrix form of convolution
The convolution of $g_0, g_1, g_2$ and $h_0, h_1, h_2$ can be written as

$$g * h = Gh$$

where the matrix G is defined by

$$G = \begin{pmatrix} g_0 & 0 & 0 \\ g_1 & g_0 & 0 \\ g_2 & g_1 & g_0 \\ 0 & g_2 & g_1 \\ 0 & 0 & g_2 \end{pmatrix} \qquad (3)$$

and g ∗ h and h are treated as column vectors. Each column of G is the same time series, but shifted by a different amount.

You can use the MATLAB function convmtx to create matrices like G from time series like g. This function is found in the Signal Processing Toolbox. If you don’t have this toolbox installed, you can make use of the fact that Eq. (3) is a Toeplitz matrix, and can be constructed by giving its first column and first row to the toeplitz command in MATLAB.
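The structure of G can be checked with a short plain-Python sketch (not MATLAB, and not a reimplementation of convmtx or toeplitz). The function names conv_matrix, matvec and conv and the numerical values are made up for this illustration.

```python
def conv_matrix(g, n):
    """Return the (len(g) + n - 1) x n matrix G with G[i][j] = g[i - j],
    and zero where i - j falls outside g, so that G h equals g * h
    for any filter h of length n."""
    rows = len(g) + n - 1
    return [[g[i - j] if 0 <= i - j < len(g) else 0.0 for j in range(n)]
            for i in range(rows)]

def matvec(G, h):
    """Matrix-vector product G h."""
    return [sum(Gij * hj for Gij, hj in zip(row, h)) for row in G]

def conv(g, h):
    """Full discrete convolution f[i] = sum_k g[i - k] * h[k]."""
    f = [0.0] * (len(g) + len(h) - 1)
    for i, gi in enumerate(g):
        for k, hk in enumerate(h):
            f[i + k] += gi * hk
    return f

g = [1.0, 2.0, 3.0]           # made-up time series
h = [4.0, 5.0, 6.0]           # made-up filter
G = conv_matrix(g, len(h))    # 5 x 3, same layout as Eq. (3)
Gh = matvec(G, h)             # should match conv(g, h)
```

Every entry of G depends only on the difference i - j between its row and column index, which is what makes it a Toeplitz matrix.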
5
Convolution as multiplication of polynomials
If the second-degree polynomials $g_0 + g_1 z + g_2 z^2$ and $h_0 + h_1 z + h_2 z^2$ are multiplied together, the result is a fourth-degree polynomial. Let’s call this polynomial $f_0 + f_1 z + f_2 z^2 + f_3 z^3 + f_4 z^4$. Its coefficients are given by f = g ∗ h.
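The equivalence can be checked numerically with a small plain-Python sketch (not MATLAB). The function names poly_mul and poly_eval and the coefficient values are made up for this illustration; coefficients are stored lowest degree first.

```python
def poly_mul(g, h):
    """Coefficients of the product polynomial, lowest degree first.
    The coefficient of z^n collects every g[i] * h[k] with i + k == n,
    which is exactly the convolution sum."""
    f = [0] * (len(g) + len(h) - 1)
    for i, gi in enumerate(g):
        for k, hk in enumerate(h):
            f[i + k] += gi * hk
    return f

def poly_eval(c, z):
    """Evaluate c[0] + c[1]*z + c[2]*z**2 + ... at the point z."""
    return sum(ck * z ** k for k, ck in enumerate(c))

g = [1, 2, 3]        # 1 + 2z + 3z^2
h = [4, 5, 6]        # 4 + 5z + 6z^2
f = poly_mul(g, h)   # 4 + 13z + 28z^2 + 27z^3 + 18z^4

# Evaluating at any point gives f(z) == g(z) * h(z); here z = 2.
lhs = poly_eval(f, 2)
rhs = poly_eval(g, 2) * poly_eval(h, 2)
```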
6
Discrete versus continuous time
In the previous lecture, the convolution, correlation, and the Wiener-Hopf equations were defined for data sampled at discrete time points. In the remainder of this lecture, the parallel definitions will be given for continuous time.

Before the advent of the digital computer, the continuous time formulation was more important, because of its convenience for symbolic calculations. But for numerical analysis of experimental data, it is the discrete time formulation that is essential.
7
Convolution
Consider two functions g and h defined on the real line. Their convolution g ∗ h is defined as

$$(g * h)(t) = \int_{-\infty}^{\infty} dt'\, g(t - t') h(t')$$

The continuous variables $t$ and $t'$ have taken the place of the discrete indices i and j. Again, you should verify commutativity and associativity.

If g and h are only defined on finite intervals, they can be extended to the entire real line using the zero padding trick. For example, if h vanishes outside the interval [0, T], then

$$(g * h)(t) = \int_{0}^{T} dt'\, g(t - t') h(t')$$
8
Firing rate
To define the continuous-time representation of a spike train, we need to make use of a mathematical construct called the Dirac delta function. The delta function is zero everywhere, except at the origin, where it is infinite. You can imagine it as a box of width Δt and height 1/Δt centered around the origin, with the limit Δt → 0. The delta function is defined by the identity

$$h(t) = \int_{-\infty}^{\infty} dt'\, \delta(t - t') h(t')$$

In other words, when the delta function is convolved with a function, the result is the same function, or h = δ ∗ h. A special case of this formula is the normalization condition

$$1 = \int_{-\infty}^{\infty} dt'\, \delta(t - t')$$

Note that the delta function has dimensions of inverse time.

The delta function represents a single spike at the origin. A spike train with spikes at times $t_a$ can be written as a sum of delta functions,

$$\rho(t) = \sum_a \delta(t - t_a)$$
A standard way of estimating firing rate from a spike train is to convolve it with a response function w:

$$\nu(t) = \int dt'\, w(t - t') \rho(t') \qquad (4)$$

$$= \int dt'\, w(t - t') \sum_a \delta(t' - t_a) \qquad (5)$$

$$= \sum_a \int dt'\, w(t - t') \delta(t' - t_a) \qquad (6)$$

$$= \sum_a w(t - t_a) \qquad (7)$$

So the convolution simply adds up copies of the response function centered around the spike times. Note that it’s important to choose a kernel satisfying

$$\int dt\, w(t) = 1$$

so that

$$\int dt\, \nu(t) = \int dt\, \rho(t)$$
Since the Dirac delta function has dimensions of inverse time, smoothing ρ(t) results in an estimate of firing rate. In contrast, the discrete spike train ρi is dimensionless, so smoothing it results in an estimate of probability of firing. You can think of ρ(t) as the Δt → 0 limit of ρi /Δt.
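Eq. (7) translates directly into code: the rate estimate is a sum of kernels, one per spike. The plain-Python sketch below (not MATLAB) uses a Gaussian for w, which is an assumption of this example and not a choice made in the lecture. The function names, the spike times, the kernel width sigma and the grid spacing dt are all made up for illustration.

```python
import math

def gaussian_kernel(t, sigma):
    """Gaussian with unit integral over t; it has dimensions of 1/time."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def firing_rate(t, spike_times, sigma):
    """nu(t) = sum_a w(t - t_a), as in Eq. (7), with a Gaussian w."""
    return sum(gaussian_kernel(t - ta, sigma) for ta in spike_times)

spike_times = [0.10, 0.15, 0.40, 0.42, 0.45, 0.80]   # seconds, made up
sigma = 0.02                                         # kernel width in seconds
dt = 0.001                                           # grid spacing in seconds
grid = [i * dt for i in range(1001)]                 # 0 s to 1 s
nu = [firing_rate(t, spike_times, sigma) for t in grid]   # spikes per second

# Riemann-sum check of the normalization: the integral of nu over the
# recording should be close to the number of spikes.
total = sum(nu) * dt
```

The integral comes out close to six because every spike lies several kernel widths away from the ends of the grid; a spike near an edge would lose part of its kernel and lower the total.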
9
Low-pass filter
To see the convolution in action, consider the differential equation

$$\tau \frac{dx}{dt} + x = h$$

This is an equation for a low-pass filter with time constant τ. Given a signal h, the output of the filter is a signal x that is smoothed over the time scale τ. The solution can be written as the convolution x = g ∗ h, where the “impulse response function” g is defined as

$$g(t) = \frac{1}{\tau} e^{-t/\tau} \theta(t)$$

and we have defined the Heaviside step function θ(t), which is zero for all negative time and one for all positive time. The response function g is zero for all negative time, jumps to a nonzero value at time zero, and then decays exponentially for positive time.

To construct the function x, the convolution places a copy of the response function $g(t - t')$ at every time $t'$. Each copy gets weighted by $h(t')$, and they are all summed to obtain x(t). The response function is sometimes called the kernel of the convolution.

To see another application of the delta function, note that the impulse response function for the low-pass filter satisfies the differential equation

$$\tau \frac{dg}{dt} + g = \delta(t)$$

In other words, g is the response to driving the low-pass filter with an “impulse” δ(t), which is why it’s called the impulse response.
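The claim that x = g ∗ h solves the differential equation can be checked numerically. The plain-Python sketch below (not MATLAB) drives the filter with a step input, once by integrating the differential equation with the forward Euler method and once by approximating the convolution integral with a Riemann sum. Both discretizations, the step input, and the values of tau, dt and n are assumptions chosen for this illustration. For a step input switched on at t = 0 with x(0) = 0, the exact solution is $x(t) = 1 - e^{-t/\tau}$.

```python
import math

tau = 0.05      # time constant in seconds (made up)
dt = 0.0005     # step size, chosen much smaller than tau
n = 600         # number of samples, covering 0.3 s
h = [1.0] * n   # step input switched on at t = 0

# (a) Integrate tau * dx/dt + x = h with forward Euler, starting from x = 0.
x_euler = [0.0] * n
for i in range(n - 1):
    x_euler[i + 1] = x_euler[i] + dt * (h[i] - x_euler[i]) / tau

# (b) Riemann-sum approximation of x = g * h with
#     g(t) = exp(-t / tau) / tau for t >= 0 and g(t) = 0 for t < 0.
g = [math.exp(-k * dt / tau) / tau for k in range(n)]
x_conv = [dt * sum(g[i - k] * h[k] for k in range(i + 1)) for i in range(n)]

# Exact step response, for comparison.
x_exact = [1.0 - math.exp(-i * dt / tau) for i in range(n)]
err_euler = max(abs(a - b) for a, b in zip(x_euler, x_exact))
err_conv = max(abs(a - b) for a, b in zip(x_conv, x_exact))
```

Both errors are of order dt/τ and shrink as dt is reduced, so the two routes converge to the same answer.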
10
Correlation
The correlation is defined as

$$\mathrm{Corr}[g, h](t) = \int_{-\infty}^{\infty} dt'\, g(t') h(t + t')$$

This compares g and h at times separated by the lag t.¹ Note that Corr[g, h](t) = Corr[h, g](−t), so that the correlation operation is not commutative.

As before, if g and h are only defined on the interval [0, T], they can be extended by defining them to be zero outside the interval. Then the above definition is equivalent to

$$\mathrm{Corr}[g, h](t) = \int_{0}^{T} dt'\, g(t') h(t + t')$$

This is the unnormalized version of the correlation. In the Dayan and Abbott textbook, $Q_{gh}(t) = (1/T)\, \mathrm{Corr}[g, h](t)$, which is the normalized correlation.

¹ The expression above is the definition used in the Dayan and Abbott book, but take note that the opposite convention is used in other books like Numerical Recipes, which call the above integral Corr[h, g](t).
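The symmetry Corr[g, h](t) = Corr[h, g](−t) carries over to discrete time, where it is easy to check. The plain-Python sketch below (not MATLAB) uses the discrete analogue of the definition above, with zero padding outside the sampled range; the function name corr and the two short signals are made up for this illustration.

```python
def corr(g, h, lag):
    """Unnormalized correlation sum_i g[i] * h[i + lag], with both signals
    treated as zero outside their sampled range."""
    return sum(g[i] * h[i + lag] for i in range(len(g))
               if 0 <= i + lag < len(h))

g = [1, 2, 0, -1]     # made-up signals
h = [0, 3, 1, 2]
lags = range(-3, 4)

forward = [corr(g, h, t) for t in lags]     # Corr[g, h](t)
swapped = [corr(h, g, -t) for t in lags]    # Corr[h, g](-t): same values
naive = [corr(h, g, t) for t in lags]       # Corr[h, g](t): differs in general
```

Swapping the arguments without flipping the sign of the lag gives a different sequence, which is the sense in which correlation is not commutative.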
11
The spike-triggered average
Dayan and Abbott define the spike-triggered average of the stimulus as the average value of the stimulus at time τ before a spike,

$$C(\tau) = \frac{1}{n} \sum_a s(t_a - \tau)$$

where n is the number of spikes. Then in Figure 1.9 they plot C(τ) with the positive τ axis pointing left. This sign convention may be standard, but it is certainly confusing. Exactly the same graph would be produced by the alternative convention of taking C(τ) to be the average value of the stimulus at time τ after a spike, and plotting it with the positive τ axis pointing right. Note that in this convention, C(τ) would have the same shape as the cross-correlation of ρ and s,

$$\mathrm{Corr}[\rho, s](\tau) = \int dt\, \rho(t) s(t + \tau) \qquad (8)$$

$$= \int dt \sum_a \delta(t - t_a) s(t + \tau) \qquad (9)$$

$$= \sum_a s(t_a + \tau) \qquad (10)$$
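In discrete time, the Dayan and Abbott definition (stimulus at lag τ before each spike) can be sketched in plain Python (not MATLAB) as follows. The function name spike_triggered_average, the max_lag parameter and the data are made up for this illustration. As an assumption of this sketch, spikes that occur fewer than max_lag bins after the start of the recording are skipped, so that every spike that is used contributes a full window.

```python
def spike_triggered_average(stimulus, spikes, max_lag):
    """Return C[tau] = (1/n) * sum over spike bins a of stimulus[a - tau]
    for tau = 0..max_lag, where n is the number of spikes used."""
    # Keep only spikes with a full window of stimulus preceding them.
    spike_bins = [i for i, r in enumerate(spikes) if r and i >= max_lag]
    n = len(spike_bins)
    if n == 0:
        raise ValueError("no usable spikes")
    return [sum(stimulus[a - tau] for a in spike_bins) / n
            for tau in range(max_lag + 1)]

# Made-up stimulus and spike train (one value per time bin).
stimulus = [0.0, 1.0, 2.0, 0.0, 0.0, 1.0, 2.0, 0.0, 3.0, 0.0]
spikes = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

C = spike_triggered_average(stimulus, spikes, max_lag=2)
# C[0] averages the stimulus in the spike bins themselves,
# C[1] one bin before each spike, C[2] two bins before each spike.
```

Replacing stimulus[a - tau] with stimulus[a + tau] would give the alternative convention, the average stimulus at lag τ after a spike, which matches the cross-correlation form in Eq. (10) up to the factor 1/n.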
12
Visual images
So far we’ve discussed situations where the neural response encodes a single time-varying scalar variable. In the case of visual images, the stimulus is a function of space as well as time. This means that a more complex linear model is necessary for modeling the relationship between stimulus and response.

Let the stimulus be denoted by $s_i^{ab}$, where the indices a and b specify pixel location in the two-dimensional image. Construct $x_i^{ab} = s_i^{ab} - \langle s^{ab} \rangle$ by subtracting out the pixel means. Similarly, let $y_i$ denote the neural response with the mean subtracted out. Then consider the linear model

$$y_i \approx \sum_{jab} h_j^{ab} x_{i-j}^{ab}$$

We won’t derive the Wiener-Hopf equations for this case, as the indices get messy. But for white noise the optimal filter is given by the cross-correlation

$$h_j^{ab} \propto \sum_i x_i^{ab} y_{i+j}$$
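Here white noise means that the correlation of the stimulus vanishes except at lag zero, as defined earlier, and also between different pixels.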